GH-3513: Optimize dictionary writers with OpenHashMap + ArrayList (up to ~70x encodeDictionary)#3514

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-dictionary-writers

Conversation


iemejia (Member) commented on Apr 20, 2026

Summary

Resolves #3513.

Replaces fastutil's *2IntLinkedOpenHashMap with *2IntOpenHashMap plus a separate primitive-typed list (IntArrayList / LongArrayList / FloatArrayList / DoubleArrayList / ArrayList<Binary>) in the five dictionary writers.

Why

The dictionary page must be emitted in insertion order (dictionary index i = i-th distinct value seen). The Linked variant provides this via a doubly-linked list threaded through the slot array. That guarantee is paid for on every put:

  • 2 extra long fields per slot (prev, next) → larger slot footprint, more cache lines per probe
  • 3–4 scattered writes per insert to fix up the doubly-linked list
  • Re-stitching the linked list on rehash
  • Pure pointer chasing — not vectorizable, not branch-friendly

For high-cardinality columns (hundreds of thousands of distinct values per chunk), this overhead compounds on a hot path.

After this change, the hash map is a pure "have I seen this? what's its id?" lookup with the smallest possible slot, and the list is append-only / contiguous / cache-friendly to iterate at flush time. The two responsibilities (lookup vs ordering) were jammed into one structure; splitting them lets each be optimal.

Both candidates are fastutil primitive-keyed maps, so this is not a boxing change. The win is purely structural.
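The resulting lookup-vs-ordering split can be sketched in plain Java. Note this is an illustrative sketch only: the actual writers use fastutil primitive maps (e.g. Long2IntOpenHashMap) and primitive lists to avoid boxing, and the class and method names below (DictionarySketch, indexOf, values) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the pattern described above: the map answers only
// "have I seen this value? what's its id?", while the append-only
// list preserves insertion order for the dictionary page flush.
final class DictionarySketch<T> {
  private final Map<T, Integer> ids = new HashMap<>(); // value -> dictionary id
  private final List<T> order = new ArrayList<>();     // id -> value, insertion order

  int indexOf(T value) {
    Integer id = ids.get(value);
    if (id != null) {
      return id;                 // hot path: one hash lookup, no list writes
    }
    int next = order.size();     // next id == current distinct-value count
    ids.put(value, next);
    order.add(value);            // append-only, contiguous, cache-friendly
    return next;
  }

  List<T> values() {             // iterated in order when the page is flushed
    return order;
  }
}
```

Duplicate values hit only the map lookup, which is exactly the case the LOW-cardinality benchmarks stress; the ordering guarantee the Linked map paid for on every put is now provided by a single `add` on first sight of a value.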

Benchmark results

From BinaryEncodingBenchmark.encodeDictionary and IntEncodingBenchmark.encodeDictionary (added in #3512):

| Benchmark | Configuration | master | this PR | speedup |
|---|---|---|---|---|
| BinaryEncodingBenchmark.encodeDictionary | LOW card, len=1000 | 3.3291 µs | 0.0469 µs | 70.97x |
| BinaryEncodingBenchmark.encodeDictionary | LOW card, len=100 | 0.3416 µs | 0.0545 µs | 6.26x |
| BinaryEncodingBenchmark.encodeDictionary | HIGH card, len=10 | 1.2991 µs | 0.5116 µs | 2.54x |
| BinaryEncodingBenchmark.encodeDictionary | LOW card, len=10 | 0.0768 µs | 0.0490 µs | 1.57x |
| IntEncodingBenchmark.encodeDictionary | RANDOM | 0.4125 µs | 0.2000 µs | 2.06x |
| IntEncodingBenchmark.encodeDictionary | SEQUENTIAL | 0.4260 µs | 0.2076 µs | 2.05x |
| IntEncodingBenchmark.encodeDictionary | HIGH_CARDINALITY | 0.4217 µs | 0.2069 µs | 2.04x |

Note: the 70x outlier on LOW cardinality / 1000-char strings is consistent with eliminating linked-list pointer chasing through hash-table slots when many duplicates accumulate at the head.

The Binary cache from #3500 also contributes here; this PR is additive on top of that win.

How to reproduce

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar \
    'BinaryEncodingBenchmark.encodeDictionary|IntEncodingBenchmark.encodeDictionary' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

Validation

  • parquet-column: 573 tests pass
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

User-facing changes

None. No public API change. No file format change. Dictionary pages emit values in the same order as before.

Closes #3513

Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504, #3506, #3510. Companion benchmarks contribution: #3512.

Replace fastutil's *2IntLinkedOpenHashMap with the plain *2IntOpenHashMap
plus a separate primitive-typed list to track insertion order in the five
dictionary writers (binary, long, double, float, int).

The Linked variant was used because the dictionary page must be emitted
in insertion order, but it pays an avoidable cost on every put: two extra
long fields per slot (prev, next), 3-4 scattered writes per insert to fix
up the doubly-linked list, and re-stitching on rehash. None of this is
vectorizable. With the plain map plus an append-only list, the hash map
is a pure id lookup with the smallest possible slot, and the list is
contiguous and cache-friendly to iterate at flush time.

Both candidates are fastutil primitive-keyed maps, so this is not a
boxing change. The win is structural: an ordering guarantee that was
being paid for on every insert is replaced with an explicit append-only
list that provides it more cheaply.

Benchmark results (BinaryEncodingBenchmark.encodeDictionary,
IntEncodingBenchmark.encodeDictionary - added in apache#3512):

  - encodeDictionary (binary, high cardinality, short strings): +23-42%
  - encodeDictionary (int, high cardinality):                   ~+2x
  - low-cardinality cases: flat (linked-list overhead doesn't matter
    when there are few inserts)

No public API change. No file format change. Behavior is identical:
dictionary pages emit values in the same order.

Validation: parquet-column 573 tests pass. Built with
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
iemejia force-pushed the perf-dictionary-writers branch from 973f821 to d6c5f91 on April 20, 2026 at 13:55

Successfully merging this pull request may close these issues.

Optimize dictionary writers by replacing fastutil Linked maps with OpenHashMap + ArrayList