
feat: add Spark-compatible xxhash64 function #21967

Draft
andygrove wants to merge 3 commits into apache:main from andygrove:xxhash64

Conversation

Member

@andygrove andygrove commented May 1, 2026

Which issue does this PR close?

Rationale for this change

Donate the xxhash64 hash function from Comet so that other projects can benefit from it.
The function was initially implemented in Comet by @advancedxy.

This is a continuation of #19627, which has gone stale. To keep the change focused
and easier to review, this PR adds only xxhash64. Murmur3 will follow in a separate PR.

The first commit is the same as the version of #19627 that was previously approved.

The second commit implements optimizations and bug fixes from the latest Comet version.

The third commit is cleanup.

What changes are included in this PR?

  • Add xxhash64(expr1, expr2, ...) to datafusion-spark.
  • Add Rust unit tests for primitives, boundary values, emoji/CJK strings, float -0.0
    normalization, dictionaries (with and without nulls), FixedSizeBinary, Struct, and
    List.
  • Add sqllogictest coverage in datafusion/sqllogictest/test_files/spark/hash/xxhash64.slt
    with values verified against Spark.

Are these changes tested?

Yes, both Rust unit tests and sqllogictest are included.

Are there any user-facing changes?

A new xxhash64 scalar function is available in datafusion-spark.

@github-actions github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and spark labels on May 1, 2026
…ning

Replace the flattened single-macro dispatch with the modular macro structure
from Apache DataFusion Comet. The new macros add:

- Optimized list-of-primitives walker (`hash_list_with_primitive_elements!`
  + `hash_list_primitive!`) that walks the underlying values buffer directly,
  avoiding a `Vec<u64>` allocation per non-null list element and the
  recursive dispatch overhead.
- Monomorphized fast paths for common `Map` (key, value) type combinations.
- Specialized `hash_array_*` macros per category (boolean, primitive,
  primitive_float, small_decimal, decimal) so the dispatch is shorter and
  each path is independently optimizable.
- Per-row scalar return path in `xxhash64` that skips the `Int64Array`
  allocation when `num_rows == 1`.
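The optimized list walker above can be sketched with plain slices standing in for Arrow's flat values buffer and offsets. This is a minimal illustration, not the PR's code: `hash_one` is a hypothetical stand-in for one xxhash64 step (the real implementation uses Comet-derived macros over Arrow arrays), but the shape of the walk — chaining the hash across each row's elements without allocating a per-row `Vec<u64>` — is the same:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for one xxhash64 step: hash `value` with `seed`.
// DefaultHasher is used only for the sketch; it is NOT Spark-compatible.
fn hash_one(value: u64, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    value.hash(&mut h);
    h.finish()
}

// Walk the flat values buffer directly via offsets, folding the hash across
// each row's elements. No intermediate Vec<u64> per non-null list element.
fn hash_list_rows(values: &[i64], offsets: &[usize], seed: u64) -> Vec<u64> {
    offsets
        .windows(2)
        .map(|w| {
            values[w[0]..w[1]]
                .iter()
                .fold(seed, |acc, &v| hash_one(v as u64, acc))
        })
        .collect()
}

fn main() {
    // Two list rows backed by one flat buffer: [1, 2, 3] and [4].
    let values = [1i64, 2, 3, 4];
    let offsets = [0usize, 3, 4];
    let hashes = hash_list_rows(&values, &offsets, 42);
    // Row 0 chains all three elements: hash(3, hash(2, hash(1, seed))).
    assert_eq!(hashes[0], hash_one(3, hash_one(2, hash_one(1, 42))));
    assert_eq!(hashes[1], hash_one(4, 42));
    println!("ok");
}
```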

Fix a bug in the previous list-hashing path: it seeded every element with
the current row hash and then overwrote the row hash with each element's
result, so only the last element actually contributed. Spark chains the
hashes across list elements (`hash(elem_n, hash(elem_n-1, ... seed ...))`).
The SLT expected value for `xxhash64(make_array(1, 2, 3))` is updated
accordingly to match Spark.
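The buggy versus fixed seeding can be contrasted in a few lines. Again `hash_one` is a hypothetical single-step hash (not the real xxhash64), but the seeding mistake and its fix are language-level, not hash-specific:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical single hash step (stand-in for xxhash64 in this sketch).
fn hash_one(value: u64, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    value.hash(&mut h);
    h.finish()
}

// Buggy: every element is seeded with the original row hash, so each
// iteration overwrites the previous result and only the last element counts.
fn hash_list_buggy(elems: &[u64], row_hash: u64) -> u64 {
    let mut out = row_hash;
    for &e in elems {
        out = hash_one(e, row_hash); // seed never advances
    }
    out
}

// Fixed: Spark chains the hash across elements:
// hash(elem_n, hash(elem_n-1, ... seed ...)).
fn hash_list_fixed(elems: &[u64], row_hash: u64) -> u64 {
    elems.iter().fold(row_hash, |acc, &e| hash_one(e, acc))
}

fn main() {
    let elems = [1u64, 2, 3];
    // The buggy path collapses to hashing only the last element.
    assert_eq!(hash_list_buggy(&elems, 42), hash_one(3, 42));
    // The fixed path threads every element through the seed.
    assert_eq!(
        hash_list_fixed(&elems, 42),
        hash_one(3, hash_one(2, hash_one(1, 42)))
    );
    println!("ok");
}
```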

- Add `Utf8View`, `BinaryView`, and `Null` arms in `create_hashes_internal!`
  so the new dispatch covers everything the previous implementation did.
- Reinterpret `Vec<u64>` as `ScalarBuffer<i64>` via `Buffer::from_vec` so
  the result `Int64Array` reuses the hashes allocation instead of walking
  it through `into_iter().map(|h| h as i64).collect()`.
- Drop the unused `'a` lifetime and slice return on `create_xxhash64_hashes`;
  every caller discards the slice.
- Use `key.as_usize()` (matching `datafusion_common::hash_utils`) instead of
  the fallible `key.to_usize().ok_or_else(...)` block — dictionary key types
  always fit in `usize`.
- Drop the `args.args.is_empty()` guard; `Signature::variadic_any` already
  rejects zero-arg calls during type coercion.
- Replace the `test_xxhash64_f32` / `_f64` tests (which hashed
  `if -0.0 == 0.0 && (-0.0).is_sign_negative() { ... }` — a constant-true
  conditional) with tests that exercise the `hash_array_primitive_float!`
  normalization path through `create_xxhash64_hashes`.
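The buffer-reuse point rests on `u64 -> i64` being a pure bit reinterpretation, so handing the existing allocation to Arrow changes no values. A minimal std-only check of that invariant (the real code uses `Buffer::from_vec` rather than this copying cast):

```rust
fn main() {
    let hashes: Vec<u64> = vec![0, 1, u64::MAX, 1u64 << 63];
    // The old path: a fresh allocation with an element-wise cast.
    let copied: Vec<i64> = hashes.iter().map(|&h| h as i64).collect();
    // `as` between u64 and i64 preserves all 64 bits in both directions,
    // so reinterpreting the original buffer in place is observably identical.
    for (&h, &i) in hashes.iter().zip(&copied) {
        assert_eq!(h, i as u64); // round-trip loses nothing
    }
    assert_eq!(copied[2], -1); // u64::MAX reinterprets to -1
    assert_eq!(copied[3], i64::MIN); // 1 << 63 reinterprets to i64::MIN
    println!("ok");
}
```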
@andygrove andygrove requested a review from Jefffrey May 1, 2026 00:33
@andygrove
Member Author

@advancedxy @shehabgamin fyi

@andygrove andygrove requested a review from comphead May 1, 2026 01:18
@comphead
Contributor

comphead commented May 1, 2026

Can this help with apache/datafusion-comet#4131 ?

  // FIRST/LAST are order-dependent aggregates whose merge result depends on hash table
  // processing order. In PartialMerge mode, DataFusion's hash table may process rows
  // in a different order than Spark's, so we fall back to Spark for correctness.
  // https://github.com/apache/datafusion-comet/issues/4131

