
feat: Add Spark-compatible xxhash64 and murmur3 hash functions#19627

Closed
andygrove wants to merge 12 commits into apache:main from andygrove:spark-hash

Conversation

@andygrove
Member

@andygrove andygrove commented Jan 3, 2026

Which issue does this PR close?

Rationale for this change

Donate some hash functions from Comet so that other projects can benefit from them.

The functions were initially implemented in Comet by @advancedxy

What changes are included in this PR?

I used Claude Code to copy the code from Comet and add slt tests. I manually verified that the expected values match Spark for a few cases just to be sure that the code is correct.

Are these changes tested?

Yes, tests are part of the PR.

Are there any user-facing changes?

No

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) spark labels Jan 3, 2026
Comment on lines +39 to +41
SELECT xxhash64('hello');
----
-4367754540140381902
Member Author

@andygrove andygrove Jan 3, 2026


scala> spark.sql("SELECT xxhash64('hello')").show()
+--------------------+
|     xxhash64(hello)|
+--------------------+
|-4367754540140381902|
+--------------------+

Comment on lines +23 to +25
SELECT xxhash64(1);
----
-7001672635703045582
Member Author


scala> spark.sql("SELECT xxhash64(cast(1 as long))").show()
+---------------------------+
|xxhash64(CAST(1 AS BIGINT))|
+---------------------------+
|       -7001672635703045582|
+---------------------------+

Comment on lines +44 to +46
SELECT hash('hello');
----
-1008564952
Member Author


scala> spark.sql("SELECT hash('hello')").show()
+-----------+
|hash(hello)|
+-----------+
|-1008564952|
+-----------+

@andygrove andygrove marked this pull request as ready for review January 3, 2026 21:19
@andygrove andygrove requested a review from comphead January 3, 2026 21:34
@andygrove
Member Author

@shehabgamin fyi

}
}

fn hash_column_murmur3(col: &ArrayRef, hashes: &mut [u32]) -> Result<()> {
Contributor


It looks like support for DataType::Dictionary may be missing. In the Sail codebase, we copied the logic from Comet, where the Dictionary type is handled. However, I’m not sure whether Comet’s implementation has changed since we copied it.

In Sail, the relevant logic can be found here:

Based on the attribution comments in those files, the corresponding Comet sources appear to come from the following commit:

Contributor


Besides the Dictionary type, it seems that FixedSizeBinary is not handled either.

Member Author


Thanks, I'll address this in the next few days. Moving to draft for now.

@advancedxy
Contributor

@andygrove Thanks for pinging and mentioning me. The PR is in good shape and I'm glad it's being contributed back upstream.

Comment thread datafusion/spark/src/function/hash/xxhash64.rs Outdated
Comment on lines +124 to +127
#[inline]
fn spark_compatible_xxhash64<T: AsRef<[u8]>>(data: T, seed: u64) -> u64 {
XxHash64::oneshot(seed, data.as_ref())
}
Contributor


Suggested change
#[inline]
fn spark_compatible_xxhash64<T: AsRef<[u8]>>(data: T, seed: u64) -> u64 {
XxHash64::oneshot(seed, data.as_ref())
}
#[inline]
fn spark_compatible_xxhash64<T: AsRef<[u8]>>(data: T, seed: i64) -> i64 {
XxHash64::oneshot(seed as u64, data.as_ref()) as i64
}

I wonder if it's worth doing this to make it easier to compute the resulting i64 array, without needing to convert the u64 hash vec to an i64 vec (unless the compiler optimizes this away anyway 🤔).
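For what it's worth, `u64 as i64` in Rust is a bit-level reinterpretation, not a value conversion, so the two signatures produce identical bit patterns either way. A minimal standalone illustration (not the PR's code; the literal below is the xxhash64('hello') value from the slt test):

```rust
fn main() {
    // Spark's xxhash64('hello') result, as quoted in the slt test.
    let as_i64: i64 = -4367754540140381902;

    // Casting to u64 reinterprets the same 64 bits.
    let as_u64: u64 = as_i64 as u64;

    // Round-tripping recovers the identical value, so returning i64
    // directly vs. casting a u64 hash afterwards is equivalent.
    assert_eq!(as_u64 as i64, as_i64);
}
```

So the suggestion is purely about ergonomics of building the output vector, not about correctness.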

Member Author


I took a look at this, and it may make the code a little cleaner, but also increases the size of the diff on this PR. Perhaps we could consider this as a follow-on cleanup?

Comment thread datafusion/spark/src/function/hash/murmur3_hash.rs Outdated
Comment on lines +132 to +138
#[inline]
fn mix_k1(mut k1: i32) -> i32 {
k1 = k1.mul_wrapping(0xcc9e2d51u32 as i32);
k1 = k1.rotate_left(15);
k1.mul_wrapping(0x1b873593u32 as i32)
}

Contributor


Do we need to provide a link to where this source code was extracted from? I'm assuming it was ported from some other implementation?

Member Author


@advancedxy do you remember where this came from?

Contributor


Instead of the specific line, we could just leave a comment at the top referencing that file.

Contributor


@advancedxy do you remember where this came from?

The murmur3 hash was added to the repo in the initial PR/commit; maybe @sunchao knows/remembers where it originally came from.

However, I believe Spark's murmur3 hash is based on Guava's Murmur3_32HashFunction, which this murmur3 code might also reference.
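To make that lineage concrete, here is a minimal std-only sketch of the Spark-style murmur3 string hash (not the PR's code). It assumes the well-known Spark/Guava variant: 4-byte little-endian blocks, each remaining tail byte sign-extended and mixed individually, and seed 42 for Spark's hash(); it reproduces the hash('hello') value quoted earlier in this conversation.

```rust
// Std-only sketch of Spark's murmur3 string hash (Guava
// Murmur3_32HashFunction lineage). Constants match the mix_k1 code
// discussed in this thread; wrapping_mul stands in for mul_wrapping.
fn mix_k1(mut k1: i32) -> i32 {
    k1 = k1.wrapping_mul(0xcc9e2d51u32 as i32);
    k1 = k1.rotate_left(15);
    k1.wrapping_mul(0x1b873593u32 as i32)
}

fn mix_h1(mut h1: i32, k1: i32) -> i32 {
    h1 ^= k1;
    h1 = h1.rotate_left(13);
    h1.wrapping_mul(5).wrapping_add(0xe6546b64u32 as i32)
}

fn fmix(mut h1: i32, len: i32) -> i32 {
    h1 ^= len;
    h1 ^= ((h1 as u32) >> 16) as i32;
    h1 = h1.wrapping_mul(0x85ebca6bu32 as i32);
    h1 ^= ((h1 as u32) >> 13) as i32;
    h1 = h1.wrapping_mul(0xc2b2ae35u32 as i32);
    h1 ^ (((h1 as u32) >> 16) as i32)
}

fn spark_murmur3(data: &[u8], seed: i32) -> i32 {
    let mut h1 = seed;
    let mut chunks = data.chunks_exact(4);
    for block in &mut chunks {
        // 4-byte blocks are read as little-endian i32s.
        let k1 = i32::from_le_bytes([block[0], block[1], block[2], block[3]]);
        h1 = mix_h1(h1, mix_k1(k1));
    }
    // Spark deviates from standard murmur3 here: each leftover byte is
    // sign-extended and run through the full mix on its own.
    for &b in chunks.remainder() {
        h1 = mix_h1(h1, mix_k1(b as i8 as i32));
    }
    fmix(h1, data.len() as i32)
}

fn main() {
    // Matches the Spark shell output quoted above: hash('hello') = -1008564952.
    assert_eq!(spark_murmur3(b"hello", 42), -1008564952);
}
```

The per-byte tail handling is exactly where Spark-compatible implementations diverge from textbook murmur3, so that part cannot be swapped for a generic murmur3 crate.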

Member


Sorry! I also don't quite remember where this came from :( I searched a few possible places I thought it could be from, but none of them match.

@andygrove andygrove marked this pull request as draft January 4, 2026 18:37
@danielgafni

Hey, may I ask if there are plans to support xxhash32 as well?

andygrove and others added 3 commits February 17, 2026 06:48
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use args.number_rows instead of manual row count detection
- Use ColumnarValue::values_to_arrays in murmur3_hash
- Remove big-endian cfg conditional (arrow-rs doesn't target big endian)
- Add source attribution comment for MurmurHash3 algorithm

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tests

Extract shared type-dispatch logic into a `create_hashes_internal!` macro
in utils.rs (ported from Comet's hash_funcs/utils.rs), adding null-count
fast paths and support for List, LargeList, FixedSizeList, Struct, and
Map types. Refactor xxhash64 and murmur3_hash to use the shared macro,
reducing ~200 lines of duplicated type dispatch per file. Add unit tests
for boundary values, emoji/CJK strings, float edge cases, struct and list
hashing, and corresponding SLT coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
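The shared-dispatch idea in the commit message above can be pictured in miniature. Everything here is hypothetical (the real `create_hashes_internal!` dispatches over Arrow DataTypes and handles nulls); the sketch only shows how one macro body can be reused by several hash functions that differ in their per-value hasher:

```rust
// Hypothetical miniature of a shared type-dispatch macro: one body,
// parameterized over the per-value hash function, so murmur3 and
// xxhash64 need not duplicate the row loop.
macro_rules! hash_rows {
    ($rows:expr, $seeds:expr, $hash_one:expr) => {
        for (row, seed) in $rows.iter().zip($seeds.iter_mut()) {
            *seed = $hash_one(row, *seed);
        }
    };
}

fn main() {
    let rows = ["a", "bb", "ccc"];
    // Like the real code, each row's hash is seeded (Spark uses 42).
    let mut seeds: Vec<u64> = vec![42; rows.len()];

    // A toy per-value hasher standing in for murmur3/xxhash64:
    // fold the byte length into the running seed.
    hash_rows!(rows, seeds, |s: &&str, seed: u64| {
        seed.wrapping_mul(31).wrapping_add(s.len() as u64)
    });

    assert_eq!(seeds, vec![1303, 1304, 1305]);
}
```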
@andygrove
Member Author

Hey, may I ask if there are plans to support xxhash32 as well?

I'm not planning on working on this since Spark does not support it.

andygrove and others added 2 commits February 17, 2026 08:00
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove
Member Author

@shehabgamin @advancedxy @Jefffrey @mbutrovich I have addressed most of the feedback and updated this PR to reflect changes in Comet's implementation since it was created: complex types are now fully supported, there are performance improvements, and some macros were added to reduce duplicated code. I will mark as ready for review once CI is green.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove marked this pull request as ready for review February 17, 2026 15:46
Comment on lines +105 to +113
let result_array = Int32Array::from(result);

if num_rows == 1 {
Ok(ColumnarValue::Scalar(ScalarValue::Int32(Some(
result_array.value(0),
))))
} else {
Ok(ColumnarValue::Array(Arc::new(result_array)))
}
Contributor


nit: I think we could move the result array initialization into the else branch to avoid the Int32Array allocation for single-row calls.
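The suggested refactor, sketched with hypothetical std-only stand-ins for DataFusion's ColumnarValue/ScalarValue (the real code returns `ColumnarValue::Scalar(ScalarValue::Int32(..))` and `ColumnarValue::Array(Arc::new(Int32Array::from(..)))`):

```rust
// Hypothetical stand-in for ColumnarValue, illustrating the pattern:
// only build the array in the multi-row branch.
#[derive(Debug, PartialEq)]
enum Columnar {
    Scalar(i32),
    Array(Vec<i32>),
}

fn finish(hashes: Vec<i32>, num_rows: usize) -> Columnar {
    if num_rows == 1 {
        // Take the single value directly; no array is constructed.
        Columnar::Scalar(hashes[0])
    } else {
        // Array construction happens only on this branch.
        Columnar::Array(hashes)
    }
}

fn main() {
    assert_eq!(finish(vec![7], 1), Columnar::Scalar(7));
    assert_eq!(finish(vec![1, 2], 2), Columnar::Array(vec![1, 2]));
}
```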


// Convert to Int64
let result: Vec<i64> = hashes.into_iter().map(|h| h as i64).collect();
let result_array = Int64Array::from(result);
Contributor


same here.

Contributor

@advancedxy advancedxy left a comment


Two minor comments, which could be addressed in a follow-up PR.
LGTM.

Comment on lines +189 to +192
if !first_col {
let unpacked = take(dict_array.values().as_ref(), dict_array.keys(), None)?;
hash_column_murmur3(&unpacked, hashes, false)?;
} else {
Contributor


What's the reason for this difference in behaviour for first_col?
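For context, the `take`-based unpacking in the snippet above materializes one value per row so the flat hashing path can run on the result. A std-only sketch of what that kernel does for a string dictionary (names here are illustrative, not arrow-rs APIs):

```rust
// Sketch of dictionary unpacking: arrow's take(values, keys) produces
// values[keys[i]] for each row, turning the dictionary column into a
// flat column that the ordinary hash code path can consume.
fn unpack_dictionary<'a>(values: &[&'a str], keys: &[usize]) -> Vec<&'a str> {
    keys.iter().map(|&k| values[k]).collect()
}

fn main() {
    let values = ["low", "medium", "high"]; // dictionary values
    let keys = [2, 0, 0, 1];                // per-row dictionary keys
    assert_eq!(
        unpack_dictionary(&values, &keys),
        vec!["high", "low", "low", "medium"]
    );
}
```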


/// Spark-compatible murmur3 hash algorithm
#[inline]
pub fn spark_compatible_murmur3_hash<T: AsRef<[u8]>>(data: T, seed: u32) -> u32 {
Contributor


Suggested change
pub fn spark_compatible_murmur3_hash<T: AsRef<[u8]>>(data: T, seed: u32) -> u32 {
fn spark_compatible_murmur3_hash<T: AsRef<[u8]>>(data: T, seed: u32) -> u32 {

first_col: bool,
) -> Result<()> {
// Handle Dictionary types separately (turbofish syntax not supported in macros)
if let DataType::Dictionary(key_type, _) = col.data_type() {
Contributor


Could we use downcast_dictionary_array here to cut down some boilerplate?

@github-actions

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions Bot added the Stale PR has not had any activity for some time label Apr 28, 2026
@andygrove
Member Author

I'm going to open fresh PRs for the two hash functions


Labels

spark sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time

Projects

None yet

Development


7 participants