
feat: add Spark-compatible xxhash64 function #21967

Draft
andygrove wants to merge 3 commits into apache:main from andygrove:xxhash64

Conversation

Member

@andygrove andygrove commented May 1, 2026

Which issue does this PR close?

Rationale for this change

Donate the xxhash64 hash function from Comet so that other projects can benefit from it.
The function was initially implemented in Comet by @advancedxy.

This is a continuation of #19627, which has gone stale. To keep the change focused
and easier to review, this PR adds only xxhash64. Murmur3 will follow in a separate PR.

The first commit is the same as the version of #19627 that was previously approved.

The second commit implements optimizations and bug fixes from the latest Comet version.

The third commit is cleanup.

What changes are included in this PR?

  • Add xxhash64(expr1, expr2, ...) to datafusion-spark.
  • Add Rust unit tests for primitives, boundary values, emoji/CJK strings, float -0.0
    normalization, dictionaries (with and without nulls), FixedSizeBinary, Struct, and
    List.
  • Add sqllogictest coverage in datafusion/sqllogictest/test_files/spark/hash/xxhash64.slt
    with values verified against Spark.

Are these changes tested?

Yes, both Rust unit tests and sqllogictest are included.

Are there any user-facing changes?

A new xxhash64 scalar function is available in datafusion-spark.

@github-actions github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and spark labels on May 1, 2026
…ning

Replace the flattened single-macro dispatch with the modular macro structure
from Apache DataFusion Comet. The new macros add:

- Optimized list-of-primitives walker (`hash_list_with_primitive_elements!`
  + `hash_list_primitive!`) that walks the underlying values buffer directly,
  avoiding a `Vec<u64>` allocation per non-null list element and the
  recursive dispatch overhead.
- Monomorphized fast paths for common `Map` (key, value) type combinations.
- Specialized `hash_array_*` macros per category (boolean, primitive,
  primitive_float, small_decimal, decimal) so the dispatch is shorter and
  each path is independently optimizable.
- Per-row scalar return path in `xxhash64` that skips the `Int64Array`
  allocation when `num_rows == 1`.
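The optimized list walker above can be sketched with plain slices standing in for Arrow's flat values buffer and offsets. This is a minimal illustration, not the PR's code: `hash_one` is a hypothetical stand-in for one xxhash64 step (the real implementation uses Comet-derived macros over Arrow arrays), but the shape of the walk — chaining the hash across each row's elements without allocating a per-row `Vec<u64>` — is the same:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for one xxhash64 step: hash `value` with `seed`.
// DefaultHasher is used only for the sketch; it is NOT Spark-compatible.
fn hash_one(value: u64, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    value.hash(&mut h);
    h.finish()
}

// Walk the flat values buffer directly via offsets, folding the hash across
// each row's elements. No intermediate Vec<u64> per non-null list element.
fn hash_list_rows(values: &[i64], offsets: &[usize], seed: u64) -> Vec<u64> {
    offsets
        .windows(2)
        .map(|w| {
            values[w[0]..w[1]]
                .iter()
                .fold(seed, |acc, &v| hash_one(v as u64, acc))
        })
        .collect()
}

fn main() {
    // Two list rows backed by one flat buffer: [1, 2, 3] and [4].
    let values = [1i64, 2, 3, 4];
    let offsets = [0usize, 3, 4];
    let hashes = hash_list_rows(&values, &offsets, 42);
    // Row 0 chains all three elements: hash(3, hash(2, hash(1, seed))).
    assert_eq!(hashes[0], hash_one(3, hash_one(2, hash_one(1, 42))));
    assert_eq!(hashes[1], hash_one(4, 42));
    println!("ok");
}
```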

Fix a bug in the previous list-hashing path: it seeded every element with
the current row hash and then overwrote the row hash with each element's
result, so only the last element actually contributed. Spark chains the
hashes across list elements (`hash(elem_n, hash(elem_n-1, ... seed ...))`).
The SLT expected value for `xxhash64(make_array(1, 2, 3))` is updated
accordingly to match Spark.
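The buggy versus fixed seeding can be contrasted in a few lines. Again `hash_one` is a hypothetical single-step hash (not the real xxhash64), but the seeding mistake and its fix are language-level, not hash-specific:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical single hash step (stand-in for xxhash64 in this sketch).
fn hash_one(value: u64, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    value.hash(&mut h);
    h.finish()
}

// Buggy: every element is seeded with the original row hash, so each
// iteration overwrites the previous result and only the last element counts.
fn hash_list_buggy(elems: &[u64], row_hash: u64) -> u64 {
    let mut out = row_hash;
    for &e in elems {
        out = hash_one(e, row_hash); // seed never advances
    }
    out
}

// Fixed: Spark chains the hash across elements:
// hash(elem_n, hash(elem_n-1, ... seed ...)).
fn hash_list_fixed(elems: &[u64], row_hash: u64) -> u64 {
    elems.iter().fold(row_hash, |acc, &e| hash_one(e, acc))
}

fn main() {
    let elems = [1u64, 2, 3];
    // The buggy path collapses to hashing only the last element.
    assert_eq!(hash_list_buggy(&elems, 42), hash_one(3, 42));
    // The fixed path threads every element through the seed.
    assert_eq!(
        hash_list_fixed(&elems, 42),
        hash_one(3, hash_one(2, hash_one(1, 42)))
    );
    println!("ok");
}
```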

- Add `Utf8View`, `BinaryView`, and `Null` arms in `create_hashes_internal!`
  so the new dispatch covers everything the previous implementation did.
- Reinterpret `Vec<u64>` as `ScalarBuffer<i64>` via `Buffer::from_vec` so
  the result `Int64Array` reuses the hashes allocation instead of walking
  it through `into_iter().map(|h| h as i64).collect()`.
- Drop the unused `'a` lifetime and slice return on `create_xxhash64_hashes`;
  every caller discards the slice.
- Use `key.as_usize()` (matching `datafusion_common::hash_utils`) instead of
  the fallible `key.to_usize().ok_or_else(...)` block — dictionary key types
  always fit in `usize`.
- Drop the `args.args.is_empty()` guard; `Signature::variadic_any` already
  rejects zero-arg calls during type coercion.
- Replace the `test_xxhash64_f32` / `_f64` tests (which hashed
  `if -0.0 == 0.0 && (-0.0).is_sign_negative() { ... }` — a constant-true
  conditional) with tests that exercise the `hash_array_primitive_float!`
  normalization path through `create_xxhash64_hashes`.
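The buffer-reuse point rests on `u64 -> i64` being a pure bit reinterpretation, so handing the existing allocation to Arrow changes no values. A minimal std-only check of that invariant (the real code uses `Buffer::from_vec` rather than this copying cast):

```rust
fn main() {
    let hashes: Vec<u64> = vec![0, 1, u64::MAX, 1u64 << 63];
    // The old path: a fresh allocation with an element-wise cast.
    let copied: Vec<i64> = hashes.iter().map(|&h| h as i64).collect();
    // `as` between u64 and i64 preserves all 64 bits in both directions,
    // so reinterpreting the original buffer in place is observably identical.
    for (&h, &i) in hashes.iter().zip(&copied) {
        assert_eq!(h, i as u64); // round-trip loses nothing
    }
    assert_eq!(copied[2], -1); // u64::MAX reinterprets to -1
    assert_eq!(copied[3], i64::MIN); // 1 << 63 reinterprets to i64::MIN
    println!("ok");
}
```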
@andygrove andygrove requested a review from Jefffrey May 1, 2026 00:33
@andygrove
Member Author

@advancedxy @shehabgamin fyi

@andygrove andygrove requested a review from comphead May 1, 2026 01:18
@comphead
Contributor

comphead commented May 1, 2026

Can this help with apache/datafusion-comet#4131 ?

  // FIRST/LAST are order-dependent aggregates whose merge result depends on hash table
  // processing order. In PartialMerge mode, DataFusion's hash table may process rows
  // in a different order than Spark's, so we fall back to Spark for correctness.
  // https://github.com/apache/datafusion-comet/issues/4131

