Add SQL based benchmarking harness, port tpch to use framework #21707
alamb merged 40 commits into apache:main
Conversation
…k to use this new framework. The README.md includes notes about other benchmarks which will be pushed after the initial work is accepted.
Moving back to draft as there are a number of improvements I want to make
…r refactoring and writing the tests and I have reviewed all changes.
This should now be ready for review.
The sql benchmarks are organized in sub-directories that correspond to the benchmark suites that are commonly used
in the community:

| Benchmark Suite | Description |
This readme covers all the test suites that I've converted; I'll have additional PRs for each if this PR is merged.
I will review this coming week 😄
…ven if not in validate mode.
…ution in benchmark files - ${VARIABLE:-default|true value|false value}
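The base form of this substitution syntax behaves like standard shell parameter expansion; the `|true value|false value` part is the harness's own extension. A minimal sketch of the default case, using the `TPCH_FILE_TYPE` variable from the tpch benchmark files:

```shell
# Standard shell default-substitution: ${VAR:-default} expands to "default"
# when VAR is unset or empty, otherwise to VAR's value.
unset TPCH_FILE_TYPE
line_default="load sql_benchmarks/tpch/init/load_${TPCH_FILE_TYPE:-parquet}.sql"
echo "$line_default"   # load sql_benchmarks/tpch/init/load_parquet.sql

TPCH_FILE_TYPE=csv
line_csv="load sql_benchmarks/tpch/init/load_${TPCH_FILE_TYPE:-parquet}.sql"
echo "$line_csv"       # load sql_benchmarks/tpch/init/load_csv.sql
```

The harness applies the same default when the variable is not set in the environment, so a single benchmark file can target multiple file formats.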
adriangb left a comment
Generally looks great and is certainly a move in the right direction if we want to be able to run under Codspeed!
Since this doesn't break the current benchmarking setup the only cost to merging this and changing later is code churn, which is low.
My one blocking concern with this is the env vars. I think we should be able to / only support setting them via DATAFUSION_EXECUTION_TARGET_PARTITIONS=1 cargo bench ... and such.
| PARTITIONS | Number of partitions to process in parallel. Defaults to the number of available cores. |
| BATCH_SIZE | Batch size when reading CSV or Parquet files. |
| MEM_POOL_TYPE | The memory pool type to use; should be one of "fair" or "greedy". |
| MEMORY_LIMIT | Memory limit (e.g. '100M', '1.5G'). If not specified, run all pre-defined memory limits for the given query if there are any, otherwise run with no memory limit. |
| DATAFUSION_RUNTIME_MEMORY_LIMIT | Used if MEMORY_LIMIT is not set. |
| SORT_SPILL_RESERVATION_BYTES | The amount of memory to reserve for sort spill operations. DataFusion's default value will be used if not specified. |
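The table documents MEMORY_LIMIT taking precedence over DATAFUSION_RUNTIME_MEMORY_LIMIT. A shell sketch of that precedence rule (the actual resolution happens in the Rust runner, not in a script):

```shell
# Fallback sketch: MEMORY_LIMIT wins; DATAFUSION_RUNTIME_MEMORY_LIMIT is
# consulted only when MEMORY_LIMIT is unset.
unset MEMORY_LIMIT
DATAFUSION_RUNTIME_MEMORY_LIMIT="1.5G"
effective="${MEMORY_LIMIT:-$DATAFUSION_RUNTIME_MEMORY_LIMIT}"
echo "effective memory limit: $effective"   # 1.5G

MEMORY_LIMIT="100M"
effective="${MEMORY_LIMIT:-$DATAFUSION_RUNTIME_MEMORY_LIMIT}"
echo "effective memory limit: $effective"   # 100M
```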
Why are these special? Don't we already have a pattern for setting this via env vars e.g. with DataFusion CLI that we can honor?
Those are basically what bench.sh already supported iirc ... just documented. Maybe not 'special' but not directly related to the sql benchmarks.
Do these env vars actually work? Maybe I'm missing something but I don't see how they get picked up.
It uses the existing code in options.rs via `make_ctx()` -> `args.options.config()?` -> `self.update_config(config)`
It seems datafusion configurations can already get passed into benchmark runs:
(from ./bench.sh --help)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Supported Configuration (Environment Variables)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
DATA_DIR directory to store datasets
...
DATAFUSION_* Set the given datafusion configuration
I think the existing benchmark runner configurations that duplicate DataFusion core configurations are redundant. We could instead document the relevant config names and link to the configuration reference page. This might be a good cleanup item during the benchmark suite migration.
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
I ran it, and it seems to work well!
run benchmark tpch
🤖 Benchmark running (GKE). Comparing new_sql_benchmark (238d0a6) to a311d14 (merge-base), diff using: tpch
Thank you! This looks great.
For future work, I suggest first cleaning up the TPC-H benchmark suite before migrating any other benchmarks.
It would be great to have a single entry point for SQL benchmarks that’s easy to discover and use. A couple of ideas toward that goal:
- Move user-facing documentation to `benchmarks/README.md`, and keep `benchmarks/sql_benchmarks/README.md` focused on internal runner details
- Support a local config file (e.g., `toml`) with env vars as overrides. I feel managing many env vars locally is cumbersome.
load sql_benchmarks/tpch/init/load_${TPCH_FILE_TYPE:-parquet}.sql
assert I
The assertion logic is implemented inside sql_benchmark.rs; perhaps we could leverage the sqllogictest runner instead.
Interesting idea. Followup PR possibly. The assertion code here doesn't handle types at all, only strings, whereas the sqllogictests are obviously a bit more advanced. It's mostly meant to validate that a table was loaded, since `create external table ...` doesn't fail if the file being loaded doesn't exist. That was kind of a WTF moment for me.
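Since the comparison is string-level, a plain text equality check captures the idea. A sketch, with hypothetical stand-in values for a query result and its expectation:

```shell
# String-level assert sketch: both sides are compared as text, no typing.
# "25" stands in for the output of a row-count query against a loaded table;
# a mismatch (e.g. 0 rows) would reveal that the external table found no files.
actual="25"
expected="25"
if [ "$actual" = "$expected" ]; then
  result="assert ok"
else
  result="assert failed"
fi
echo "$result"   # assert ok
```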
… an issue with parsing multiple query statements when the queries were inline in the benchmark file (vs external in another file).
I've gone ahead and inlined the template file and the query into each of the tpch benchmark files and updated the code to allow for multiple statements (previously that was only allowed for external files). Please let me know if this new approach is what is desired.
# Conflicts: # Cargo.lock
I plan to review this later today
I ran out of time today (it is hard to find time for 5K lines!) -- but it is on my list for tomorrow
init sql_benchmarks/tpch/init/set_config.sql
load sql_benchmarks/tpch/init/load_${TPCH_FILE_TYPE:-parquet}.sql
Wohoo!!
I am also trying to capture additional follow-on tasks as a list. Please feel free to add your own suggestions.
I haven't investigated yet whether this change (possibly by revealing pre-existing bugs) or the hacky harness is to blame, but I'm seeing some benches fail now: #21806 (comment)
Upstream apache/datafusion#21707 ported `bench.sh run tpch` to a new Criterion-based SQL harness. The new harness reads parquet from a path relative to ${DATAFUSION_DIR}/benchmarks (where data isn't generated in our layout) and writes timings to target/criterion/, which our `bench.sh compare_detail` step doesn't understand. Recent benchmarks were failing because lineitem resolved to an empty external table.

The dfbench tpch subcommand still exists upstream, so for the four tpch variants in our allowlist (tpch, tpch10, tpch_mem, tpch_mem10) we invoke the prebuilt dfbench binary directly with the same arguments the old run_tpch used, and write JSON to where compare_detail expects it. Other benchmarks still go through bench.sh.
Which issue does this PR close?
Rationale for this change
Add a SQL based benchmark framework with tpch as the initial benchmark to use this new framework. The README.md includes notes about other benchmarks which will have individual PRs after the initial work is accepted.
What changes are included in this PR?
Benchmarking code only.
Are these changes tested?
Yes
Are there any user-facing changes?
benchmarks/bench.sh now uses the new framework for benchmarking tpch
Additional info
AI assisted with refactoring and writing tests. I have reviewed all AI produced code.