Adding Use of arrow's has_true() / has_false()#21806
Conversation
|
run benchmarks |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (ca7e0a8) to 067ba4b (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (ca7e0a8) to 067ba4b (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
Hi @adriangb , requesting you to review. Thanks |
|
run benchmarks |
|
Hi @adriangb, thanks for the request (#21806 (comment)). You already have 15 pending benchmark job(s), and this request would add 3 more — exceeding the per-user queue limit of 15. Please wait for some of your current runs to finish before submitting more. File an issue against this benchmark runner |
|
run benchmarks |
adriangb
left a comment
There was a problem hiding this comment.
Some more nits: please restore the comments and double check the usage in replace.rs.
Thanks for working on this!
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (8777c09) to 42cd2fa (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (8777c09) to 42cd2fa (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (8777c09) to 42cd2fa (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
Benchmark for this request failed. Last 20 lines of output: Click to expandFile an issue against this benchmark runner |
It looks like it can't load the lineitem table |
|
run benchmark tpch |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing feat/21784-has-true-has-false (8777c09) to 42cd2fa (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
Benches look good. @raushanprabhakar1 if you can address my feedback we can move to merge this |
maybe we should add a sanity check to the benchmark runner / better error message (so it fails with a message that says the files aren't there rather than failing on the expected output check) 🤔 |
Yes, I’ve run into this issue before. Even with just DataFusion CLI if you pointed to an empty directory, it will give you a table with no columns instead of some sort of error that says there’s no data here. I can’t think of any use case where you’d want to infer a schema with no columns instead of en error saying there’s no data. |
|
Hi @adriangb , Thanks for your feedback, i have addressed all your feedbacks, please do review, Thanks. |
adriangb
left a comment
There was a problem hiding this comment.
Small nit. Otherwise LGTM once CI passes!
Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
|
Thanks @raushanprabhakar1 ! |
Which issue does this PR close?
Closes #21784
Rationale for this change
Apache Arrow added
BooleanArray::has_true()andhas_false()so callers can answer “any true/false?” without a full bit count. That can short-circuit and avoid unnecessary work compared to patterns liketrue_count() == 0ortrue_count() > 0.This PR applies those APIs across DataFusion where the logic is purely existential (or equivalent via null-safe “all true” / “no true” checks), matching the audit suggested in the issue.
What changes are included in this PR?
has_true()/has_false()(andnull_count()where needed), including:array_has, list replace), Sparkarray_containsnull-semantics fast pathevaluate_selection, binary AND/OR short-circuit, CASE/IN list loopsscatterfast pathsmetadata.rs,has_any_exact_match)true_count()/false_count()where an actual count is required (row counts, metrics, selectivity,to_array(n), etc.)arrow::array::Arraywherenull_count()is used onBooleanArrayin trait-heavy pathsAre these changes tested?
Existing tests cover this behavior; the edits are semantics-preserving refactors (same conditions, cheaper primitives). No new tests were added.
Are there any user-facing changes?
No. Behavior should be unchanged; this is an internal performance/clarity improvement.