Skip to content

perf: Optimize comparison on nested types#20716

Merged
alamb merged 5 commits intoapache:mainfrom
neilconway:neilc/optimize-struct-cmp-op
Mar 16, 2026
Merged

perf: Optimize comparison on nested types#20716
alamb merged 5 commits intoapache:mainfrom
neilconway:neilc/optimize-struct-cmp-op

Conversation

@neilconway
Copy link
Copy Markdown
Contributor

@neilconway neilconway commented Mar 5, 2026

Which issue does this PR close?

N/A

Rationale for this change

BooleanBuffer::collect_bool() is faster than map().collect(). Per discussion on #20694; originally suggested by @Dandandan.

What changes are included in this PR?

  • Implement optimization
  • Add benchmark for nested struct comparison

Are these changes tested?

Yes, covered by existing tests.

Are there any user-facing changes?

No.

AI usage

Multiple AI tools were used to iterate on this PR. I have reviewed and understand the resulting code.

@github-actions github-actions Bot added the physical-expr Changes to the physical-expr crates label Mar 5, 2026
@neilconway
Copy link
Copy Markdown
Contributor Author

Benchmarks:

  group                          base                                   target
  -----                          ----                                   ------
  compare_nested array_array     1.17     52.5±0.78µs        ? ?/sec    1.00     44.8±0.27µs        ? ?/sec
  compare_nested array_scalar    1.12     50.1±0.38µs        ? ?/sec    1.00     44.6±0.30µs        ? ?/sec

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- thanks @neilconway

I do wonder why we can't update collect to call collect_bool internally 🤔

@alamb alamb added the performance Make DataFusion faster label Mar 5, 2026
@neilconway
Copy link
Copy Markdown
Contributor Author

@alamb Is this good to land? Let me know if you have any suggestions!

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Mar 16, 2026

@alamb Is this good to land? Let me know if you have any suggestions!

Sorry @neilconway -- it looks good to me -- it got lost in the shuffle. I updated once more to rerun the tests and will merge once those pass

@alamb alamb added this pull request to the merge queue Mar 16, 2026
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Mar 16, 2026

Thanks again @neilconway

Merged via the queue into apache:main with commit 4166a6d Mar 16, 2026
37 checks passed
@neilconway neilconway deleted the neilc/optimize-struct-cmp-op branch March 18, 2026 16:00
de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026
## Which issue does this PR close?

N/A

## Rationale for this change

`BooleanBuffer::collect_bool()` is faster than `map().collect()`. Per
discussion on apache#20694; originally suggested by @Dandandan.

## What changes are included in this PR?

- Implement optimization
- Add benchmark for nested struct comparison

## Are these changes tested?

Yes, covered by existing tests.

## Are there any user-facing changes?

No.

## AI usage

Multiple AI tools were used to iterate on this PR. I have reviewed and
understand the resulting code.

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Make DataFusion faster physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants