perf: Use Arrow vectorized eq kernel for IN list with column references#20528
perf: Use Arrow vectorized eq kernel for IN list with column references#20528adriangb merged 4 commits intoapache:mainfrom
Conversation
Benchmark resultrun with Int32: 10-20x speedup across all scenarios. Improvement is greater with nulls (up to 20x) since the original row-by-row path has higher per-row null checking overhead. Utf8: 2.3-11x speedup, with larger gains at lower match rates where vectorized comparison dominates over |
neilconway
left a comment
There was a problem hiding this comment.
This is really awesome! Nice improvement. Overall LGTM, other than one obscure issue with REE types.
| // falling back to row-by-row comparator for nested types (Struct, List, etc.) | ||
| // where eq semantics are ambiguous. | ||
| let value = value.into_array(num_rows)?; | ||
| let use_arrow_eq = !value.data_type().is_nested(); |
There was a problem hiding this comment.
Do we know for sure that Arrow's eq kernel will work for all non-nested types? Perhaps that would be a bit fragile if we add new types in the future. I wonder if we should explicitly whitelist the types we know that work?
Digging around a bit, it seems we panic if we try to pass RunEndEncoded types to eq, but REE types aren't considered nested.
There was a problem hiding this comment.
Thanks for the great catch! I've replaced the !is_nested() check with an explicit whitelist (supports_arrow_eq) that only enables Arrow's eq kernel for known-supported types. Please take another look when you have a chance.
| match dt { | ||
| Boolean | Binary | LargeBinary | BinaryView | FixedSizeBinary(_) => true, | ||
| dt if dt.is_primitive() || dt.is_null() || dt.is_string() => true, | ||
| Dictionary(_, v) => supports_arrow_eq(v.as_ref()), | ||
| _ => false, | ||
| } |
There was a problem hiding this comment.
Optional but maybe a bit nicer?
| match dt { | |
| Boolean | Binary | LargeBinary | BinaryView | FixedSizeBinary(_) => true, | |
| dt if dt.is_primitive() || dt.is_null() || dt.is_string() => true, | |
| Dictionary(_, v) => supports_arrow_eq(v.as_ref()), | |
| _ => false, | |
| } | |
| match dt { | |
| Boolean | Binary | LargeBinary | BinaryView | FixedSizeBinary(_) => true, | |
| Dictionary(_, v) => supports_arrow_eq(v.as_ref()), | |
| _ => dt.is_primitive() || dt.is_null() || dt.is_string(), | |
| } |
There was a problem hiding this comment.
Optional but maybe a bit nicer?
Good suggestion, thanks! Much more concise this way. Applied in the latest commit.
|
Hi! @adriangb Thanks! |
| /// binary (Binary/LargeBinary/BinaryView/FixedSizeBinary), Null, and | ||
| /// Dictionary-encoded variants of the above. | ||
| /// Unsupported: nested types (Struct, List, Map, Union) and RunEndEncoded. | ||
| fn supports_arrow_eq(dt: &DataType) -> bool { |
There was a problem hiding this comment.
It's unfortunate arrow itself doesn't expose this check
|
run benchmark in_list |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark in_list |
|
🤖 |
|
🤖: Benchmark completed Details
|
Those are some pretty crazy improvements for Int32/Utf8 -- nice work I wonder why the others don't show a similar improvement |
adriangb
left a comment
There was a problem hiding this comment.
I'm also noticing there's quite a bit of variability in some of these benchmarks:
in_list/Float32/list=28/nulls=20% 1.49 77.0±0.92µs ? ?/sec 1.00 51.6±0.20µs ? ?/sec
I think this is noise but it's quite a bit of noise. I think these would be good candidates for Codspeed.
That said I feel like this change makes a lot of sense and the numbers generally look good - I propose we go ahead and merge it.
This patch only optimizes the IN LIST path without static filters, which was benchmarked using For IN LIST with a static filter, it uses a hash set for matching and is unchanged in this patch. Therefore, the |
|
Thanks again @zhangxffff and @adriangb -- very nice |
…es (#20694) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes #20428 . ## Rationale for this change Third PR in the IN list optimization series (split from #20428): - PR1: benchmarks (#20444, merged) - PR2: Arrow vectorized eq kernel (#20528, merged) - **PR3 (this): short-circuit, collect_bool, and first-expr initialization** <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? - **Short-circuit break**: convert `try_fold` to `for` loop; when all non-null rows are already `true`, skip remaining list items (up to 27x faster for match=100%/nulls=0%) - **`BooleanBuffer::collect_bool`**: use in `make_comparator` fallback path for nested types instead `(0..n).map().collect()` (suggested by @Dandandan in #20428 ) - **First-expr initialization**: evaluate the first list expression directly as the accumulator, avoiding a redundant `or_kleene(all_false, rhs)` (suggested by @Dandandan in #20428 ) - **Tests**: added 3 new tests covering short-circuit, short-circuit with nulls, and struct column references (make_comparator fallback path) <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes, and add test to cover short-circuit, short-circuit with nulls, and struct column references (make_comparator fallback path) Benchmark result: ``` (zhangxffff) zhangxffff/datafusion@95d3d60664da ~/W/datafusion ((bcc52cd))> critcmp after before group after before ----- ----- ------ in_list_cols/Int32/list=28/match=0%/nulls=0% 1.02 93.8±1.80µs ? ?/sec 1.00 91.8±1.52µs ? ?/sec in_list_cols/Int32/list=28/match=0%/nulls=20% 1.03 105.3±1.95µs ? ?/sec 1.00 102.2±1.59µs ? ?/sec in_list_cols/Int32/list=28/match=100%/nulls=0% 1.00 3.4±0.07µs ? ?/sec 27.14 91.7±1.52µs ? ?/sec in_list_cols/Int32/list=28/match=100%/nulls=20% 1.07 107.7±1.91µs ? ?/sec 1.00 100.4±1.33µs ? ?/sec in_list_cols/Int32/list=28/match=50%/nulls=0% 1.00 50.1±1.15µs ? ?/sec 1.84 92.4±1.36µs ? ?/sec in_list_cols/Int32/list=28/match=50%/nulls=20% 1.05 105.1±1.49µs ? ?/sec 1.00 100.0±0.84µs ? ?/sec in_list_cols/Int32/list=3/match=0%/nulls=0% 1.00 9.9±0.17µs ? ?/sec 1.01 10.1±0.19µs ? ?/sec in_list_cols/Int32/list=3/match=0%/nulls=20% 1.02 11.0±0.18µs ? ?/sec 1.00 10.8±0.16µs ? ?/sec in_list_cols/Int32/list=3/match=100%/nulls=0% 1.00 3.3±0.06µs ? ?/sec 2.95 9.9±0.16µs ? ?/sec in_list_cols/Int32/list=3/match=100%/nulls=20% 1.01 10.9±0.19µs ? ?/sec 1.00 10.8±0.09µs ? ?/sec in_list_cols/Int32/list=3/match=50%/nulls=0% 1.00 10.0±0.17µs ? ?/sec 1.00 9.9±0.18µs ? ?/sec in_list_cols/Int32/list=3/match=50%/nulls=20% 1.05 11.3±0.24µs ? ?/sec 1.00 10.8±0.11µs ? ?/sec in_list_cols/Int32/list=8/match=0%/nulls=0% 1.02 26.7±0.58µs ? ?/sec 1.00 26.2±0.50µs ? ?/sec in_list_cols/Int32/list=8/match=0%/nulls=20% 1.04 29.6±0.57µs ? ?/sec 1.00 28.5±0.45µs ? ?/sec in_list_cols/Int32/list=8/match=100%/nulls=0% 1.00 3.4±0.05µs ? ?/sec 7.78 26.2±0.36µs ? ?/sec in_list_cols/Int32/list=8/match=100%/nulls=20% 1.05 30.0±0.65µs ? ?/sec 1.00 28.7±0.55µs ? ?/sec in_list_cols/Int32/list=8/match=50%/nulls=0% 1.03 26.7±0.59µs ? ?/sec 1.00 26.0±0.37µs ? ?/sec in_list_cols/Int32/list=8/match=50%/nulls=20% 1.04 29.9±0.57µs ? ?/sec 1.00 28.7±0.46µs ? ?/sec in_list_cols/Utf8/list=28/match=0% 1.17 155.0±2.44µs ? ?/sec 1.00 132.8±2.97µs ? ?/sec in_list_cols/Utf8/list=28/match=100% 1.02 726.6±14.54µs ? ?/sec 1.00 712.4±9.09µs ? ?/sec in_list_cols/Utf8/list=28/match=50% 1.02 1070.1±13.06µs ? ?/sec 1.00 1051.8±8.17µs ? ?/sec in_list_cols/Utf8/list=3/match=0% 1.14 16.4±0.37µs ? ?/sec 1.00 14.4±0.22µs ? ?/sec in_list_cols/Utf8/list=3/match=100% 1.02 68.0±1.29µs ? ?/sec 1.00 66.5±0.99µs ? ?/sec in_list_cols/Utf8/list=3/match=50% 1.15 107.6±2.05µs ? ?/sec 1.00 93.6±1.88µs ? ?/sec in_list_cols/Utf8/list=8/match=0% 1.16 44.0±0.61µs ? ?/sec 1.00 37.9±0.95µs ? ?/sec in_list_cols/Utf8/list=8/match=100% 1.00 190.4±2.71µs ? ?/sec 1.03 195.7±2.01µs ? ?/sec in_list_cols/Utf8/list=8/match=50% 1.03 295.9±4.45µs ? ?/sec 1.00 287.3±3.26µs ? ?/sec ``` <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
…es (apache#20528) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Relates to apache#20427 . ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> When the IN list contains column references (e.g. `SELECT * FROM t WHERE a IN (b, c, d, e)`), DataFusion falls back to a row-by-row `make_comparator` path which is significantly slower than it needs to be. Arrow provides SIMD-optimized `eq` kernels that can compare entire arrays in one call. ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> - Use Arrow's vectorized `eq` kernel instead of row-by-row `make_comparator` for non-nested types (primitive, string, binary) in the column-reference IN list evaluation path - For nested types (Struct, List, etc.), fall back to `make_comparator` since Arrow's `eq` kernel does not support them - Add 6 unit tests covering the column-reference evaluation path (Int32, Utf8, NOT IN, NULL handling, NaN semantics) ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> Yes. 6 new unit tests added: - `test_in_list_with_columns_int32_scalars` - `test_in_list_with_columns_int32_column_refs` - `test_in_list_with_columns_utf8_column_refs` - `test_in_list_with_columns_negated` - `test_in_list_with_columns_null_in_list` - `test_in_list_with_columns_float_nan` ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> No API changes. Queries with column-reference IN lists will run faster. <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
…es (apache#20694) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes apache#123` indicates that this PR will close issue apache#123. --> - Closes apache#20428 . ## Rationale for this change Third PR in the IN list optimization series (split from apache#20428): - PR1: benchmarks (apache#20444, merged) - PR2: Arrow vectorized eq kernel (apache#20528, merged) - **PR3 (this): short-circuit, collect_bool, and first-expr initialization** <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? - **Short-circuit break**: convert `try_fold` to `for` loop; when all non-null rows are already `true`, skip remaining list items (up to 27x faster for match=100%/nulls=0%) - **`BooleanBuffer::collect_bool`**: use in `make_comparator` fallback path for nested types instead `(0..n).map().collect()` (suggested by @Dandandan in apache#20428 ) - **First-expr initialization**: evaluate the first list expression directly as the accumulator, avoiding a redundant `or_kleene(all_false, rhs)` (suggested by @Dandandan in apache#20428 ) - **Tests**: added 3 new tests covering short-circuit, short-circuit with nulls, and struct column references (make_comparator fallback path) <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? Yes, and add test to cover short-circuit, short-circuit with nulls, and struct column references (make_comparator fallback path) Benchmark result: ``` (zhangxffff) zhangxffff/datafusion@95d3d60664da ~/W/datafusion ((bcc52cd))> critcmp after before group after before ----- ----- ------ in_list_cols/Int32/list=28/match=0%/nulls=0% 1.02 93.8±1.80µs ? ?/sec 1.00 91.8±1.52µs ? ?/sec in_list_cols/Int32/list=28/match=0%/nulls=20% 1.03 105.3±1.95µs ? ?/sec 1.00 102.2±1.59µs ? ?/sec in_list_cols/Int32/list=28/match=100%/nulls=0% 1.00 3.4±0.07µs ? ?/sec 27.14 91.7±1.52µs ? ?/sec in_list_cols/Int32/list=28/match=100%/nulls=20% 1.07 107.7±1.91µs ? ?/sec 1.00 100.4±1.33µs ? ?/sec in_list_cols/Int32/list=28/match=50%/nulls=0% 1.00 50.1±1.15µs ? ?/sec 1.84 92.4±1.36µs ? ?/sec in_list_cols/Int32/list=28/match=50%/nulls=20% 1.05 105.1±1.49µs ? ?/sec 1.00 100.0±0.84µs ? ?/sec in_list_cols/Int32/list=3/match=0%/nulls=0% 1.00 9.9±0.17µs ? ?/sec 1.01 10.1±0.19µs ? ?/sec in_list_cols/Int32/list=3/match=0%/nulls=20% 1.02 11.0±0.18µs ? ?/sec 1.00 10.8±0.16µs ? ?/sec in_list_cols/Int32/list=3/match=100%/nulls=0% 1.00 3.3±0.06µs ? ?/sec 2.95 9.9±0.16µs ? ?/sec in_list_cols/Int32/list=3/match=100%/nulls=20% 1.01 10.9±0.19µs ? ?/sec 1.00 10.8±0.09µs ? ?/sec in_list_cols/Int32/list=3/match=50%/nulls=0% 1.00 10.0±0.17µs ? ?/sec 1.00 9.9±0.18µs ? ?/sec in_list_cols/Int32/list=3/match=50%/nulls=20% 1.05 11.3±0.24µs ? ?/sec 1.00 10.8±0.11µs ? ?/sec in_list_cols/Int32/list=8/match=0%/nulls=0% 1.02 26.7±0.58µs ? ?/sec 1.00 26.2±0.50µs ? ?/sec in_list_cols/Int32/list=8/match=0%/nulls=20% 1.04 29.6±0.57µs ? ?/sec 1.00 28.5±0.45µs ? ?/sec in_list_cols/Int32/list=8/match=100%/nulls=0% 1.00 3.4±0.05µs ? ?/sec 7.78 26.2±0.36µs ? ?/sec in_list_cols/Int32/list=8/match=100%/nulls=20% 1.05 30.0±0.65µs ? ?/sec 1.00 28.7±0.55µs ? ?/sec in_list_cols/Int32/list=8/match=50%/nulls=0% 1.03 26.7±0.59µs ? ?/sec 1.00 26.0±0.37µs ? ?/sec in_list_cols/Int32/list=8/match=50%/nulls=20% 1.04 29.9±0.57µs ? ?/sec 1.00 28.7±0.46µs ? ?/sec in_list_cols/Utf8/list=28/match=0% 1.17 155.0±2.44µs ? ?/sec 1.00 132.8±2.97µs ? ?/sec in_list_cols/Utf8/list=28/match=100% 1.02 726.6±14.54µs ? ?/sec 1.00 712.4±9.09µs ? ?/sec in_list_cols/Utf8/list=28/match=50% 1.02 1070.1±13.06µs ? ?/sec 1.00 1051.8±8.17µs ? ?/sec in_list_cols/Utf8/list=3/match=0% 1.14 16.4±0.37µs ? ?/sec 1.00 14.4±0.22µs ? ?/sec in_list_cols/Utf8/list=3/match=100% 1.02 68.0±1.29µs ? ?/sec 1.00 66.5±0.99µs ? ?/sec in_list_cols/Utf8/list=3/match=50% 1.15 107.6±2.05µs ? ?/sec 1.00 93.6±1.88µs ? ?/sec in_list_cols/Utf8/list=8/match=0% 1.16 44.0±0.61µs ? ?/sec 1.00 37.9±0.95µs ? ?/sec in_list_cols/Utf8/list=8/match=100% 1.00 190.4±2.71µs ? ?/sec 1.03 195.7±2.01µs ? ?/sec in_list_cols/Utf8/list=8/match=50% 1.03 295.9±4.45µs ? ?/sec 1.00 287.3±3.26µs ? ?/sec ``` <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> --------- Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Which issue does this PR close?
Rationale for this change
When the IN list contains column references (e.g.
SELECT * FROM t WHERE a IN (b, c, d, e)), DataFusion falls back to a row-by-rowmake_comparatorpath which is significantly slower than it needs to be. Arrow provides SIMD-optimizedeqkernels that can compare entire arrays in one call.What changes are included in this PR?
eqkernel instead of row-by-rowmake_comparatorfor non-nested types (primitive, string, binary) in the column-reference IN list evaluation pathmake_comparatorsince Arrow'seqkernel does not support themAre these changes tested?
Yes. 6 new unit tests added:
test_in_list_with_columns_int32_scalarstest_in_list_with_columns_int32_column_refstest_in_list_with_columns_utf8_column_refstest_in_list_with_columns_negatedtest_in_list_with_columns_null_in_listtest_in_list_with_columns_float_nanAre there any user-facing changes?
No API changes. Queries with column-reference IN lists will run faster.