Add wide-schema benchmark suite for measuring per-file metadata overhead #21970
Open
adriangb wants to merge 1 commit into apache:main from
Adds a new sql_benchmarks suite that isolates the wide-schema scan
overhead in selective Parquet queries: the regime where most of the
work is loading footers / column-chunk metadata rather than reading
row data, and where that cost scales with the number of column chunks
in the dataset rather than with the number of columns the query touches.
The wide_schema suite has two subgroups (selected via BENCH_SUBGROUP):
- wide: 1024 cols x 256 files x 50k rows (~225 MB) — the workload
- narrow: 8 cols x 256 files x 50k rows (~110 MB) — internal
baseline, only meaningful as a comparison point
Both share row count, file count, and per-file row-group structure;
only schema width differs. All 4 queries run on both subgroups so
every wide number has a directly comparable narrow baseline.
A new gen_wide_data binary synthesizes both datasets in ~60 s with no
external data source. The 8-column base schema (id, value, count, ts,
category, flag, status, text) carries deterministic data; the
suffix-renamed copies 2..N are zero-filled (zero rather than null,
since reader-side null-array shortcuts mute the slowdown by ~35 %).
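The suffix-renaming widening described above can be sketched as follows; `widen` is a hypothetical helper written for illustration, not the actual `gen_wide_data` code:

```rust
// Widen an 8-column base schema to N columns by appending
// suffix-renamed copies (id_2, value_2, ..., text_128).
fn widen(base: &[&str], copies: usize) -> Vec<String> {
    let mut cols: Vec<String> = base.iter().map(|s| s.to_string()).collect();
    for i in 2..=copies {
        for name in base {
            cols.push(format!("{name}_{i}"));
        }
    }
    cols
}

fn main() {
    let base = ["id", "value", "count", "ts", "category", "flag", "status", "text"];
    let cols = widen(&base, 128);
    assert_eq!(cols.len(), 1024); // 8 base columns x 128 copies
    println!("{} columns, last = {}", cols.len(), cols.last().unwrap());
}
```

With 128 copies of the 8-column base this yields exactly the 1024-column wide schema; copy 1 is the base itself, so copies run from 2 to N.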
Query coverage:
- Q01: filter + project + ORDER BY + LIMIT (TopK shortcut)
- Q02: project 1 column with tight filter + LIMIT 1
- Q03: tight filter + small projection, no sort
- Q04: two low-cardinality string filters + a non-stat-prunable
modulo predicate for tight selectivity (~0.005 % match rate),
project two columns, no LIMIT or ORDER BY
For Q04 specifically, cold-start datafusion-cli shows ~15x slowdown
wide vs narrow; EXPLAIN ANALYZE shows metadata_load_time scaling 141x
while bloom_filter_eval_time and statistics_eval_time stay flat.
bench.sh adds:
- data wide_schema: synthesizes both wide and narrow datasets
- run wide_schema: runs the wide subgroup, then the narrow
baseline subgroup, for query-by-query comparison
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
#21968
Rationale for this change
Adds a benchmark suite that isolates the wide-schema scan overhead in selective Parquet queries: the regime where most of the work is loading footers / column-chunk metadata rather than reading row data, and where that cost scales with the number of column chunks in the dataset rather than with the number of columns the query references. Existing benchmarks don't exercise this shape — most TPC-H/ClickBench queries either touch many columns or filter heavily enough that scan work dominates. We need a focused benchmark so this kind of regression is measurable in CI and so optimizations to the wide-schema scan path can be validated.
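The scaling argument above can be made concrete with a back-of-envelope model: Parquet footer metadata carries one column-chunk entry per (file × row group × column), so footer decode work grows with schema width even when a query projects a single column. The dataset sizes below come from this PR; the per-file row-group count is an assumption (1) for illustration:

```rust
// One column-chunk metadata entry exists per (file x row group x column).
fn column_chunks(files: u64, row_groups_per_file: u64, columns: u64) -> u64 {
    files * row_groups_per_file * columns
}

fn main() {
    let rg = 1; // per-file row-group structure is shared; exact count assumed
    let wide = column_chunks(256, rg, 1024);
    let narrow = column_chunks(256, rg, 8);
    // The 128x chunk-count ratio tracks the observed metadata_load_time
    // scaling far better than the (unchanged) number of queried columns.
    println!("wide={wide} narrow={narrow} ratio={}x", wide / narrow);
}
```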
What changes are included in this PR?
A new sql_benchmarks suite, `wide_schema/` (under `benchmarks/sql_benchmarks/`), with two subgroups selected via `BENCH_SUBGROUP`:
- `wide` — runs against a synthesized wide dataset (1024 cols × 256 files × 50 k rows, ~225 MB). This is the actual workload.
- `narrow` — runs the same SQL against an 8-col version of the same dataset (same row count, file count, and per-file row-group shape, ~110 MB). This subgroup exists only as a baseline for the wide subgroup; reading its numbers in isolation tells you very little. The per-query wide-vs-narrow ratio is what isolates the schema-width cost.

All 4 queries run on both subgroups so every wide number has a directly comparable narrow baseline.
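A `BENCH_SUBGROUP`-style switch can be sketched as below; the directory paths and the `subgroup_dir` helper are assumptions for illustration, not the suite's actual layout:

```rust
use std::env;

// Map the subgroup selector to a dataset directory; anything other
// than "narrow" falls back to the wide (actual-workload) dataset.
fn subgroup_dir(subgroup: Option<&str>) -> &'static str {
    match subgroup {
        Some("narrow") => "data/wide_schema/narrow",
        _ => "data/wide_schema/wide",
    }
}

fn main() {
    let choice = env::var("BENCH_SUBGROUP").ok();
    println!("selected: {}", subgroup_dir(choice.as_deref()));
}
```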
A new binary, `gen_wide_data` (in `benchmarks/src/bin/`), synthesizes both datasets in ~60 s with no external data dependency. The 8-column base schema is generic (`id`, `value`, `count`, `ts`, `category`, `flag`, `status`, `text`) and carries deterministic synthetic data; the suffix-renamed copies (`id_2`, `id_3`, …, `id_128`, etc.) are zero-filled rather than null-filled, since reader-side null-array shortcuts mute the measured slowdown by ~35 %.

Query coverage:
- `Q01` — filter + project + `ORDER BY` + `LIMIT` (TopK shortcut)
- `Q02` — project 1 column with a tight filter and `LIMIT 1`
- `Q03` — tight filter + small projection, no sort
- `Q04` — two low-cardinality string filters + a non-stat-prunable modulo predicate for tight selectivity, project two columns, no `LIMIT` or `ORDER BY`

`bench.sh` additions:
- `data wide_schema` — synthesizes both the wide and narrow datasets
- `run wide_schema` — runs the wide subgroup, then the narrow baseline subgroup, for query-by-query comparison

Are these changes tested?
Yes — measurements on an M-series Mac (12-way parallel scan, hot OS cache).

Criterion (3 s warmup, 10 samples, median):

Cold-start `datafusion-cli` (Q04 shape, median of 3):

`EXPLAIN ANALYZE` phase deltas (Q04, cumulative across 12 scan tasks) show the same qualitative shape: `metadata_load_time` and per-file setup scale with column-chunk count, while predicate-evaluation phases stay flat regardless of schema width.

`cargo fmt --all` and `cargo clippy --bin gen_wide_data --all-features -- -D warnings` are clean.

Are there any user-facing changes?
No public API changes. New benchmark suite + new utility binary under `benchmarks/`.