GH-3513: Optimize dictionary writers with OpenHashMap + ArrayList (up to ~70x encodeDictionary)#3514

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-dictionary-writers

Conversation


iemejia (Member) commented on Apr 20, 2026

Summary

Resolves #3513.

Replaces fastutil's *2IntLinkedOpenHashMap with *2IntOpenHashMap plus a separate primitive-typed list (IntArrayList / LongArrayList / FloatArrayList / DoubleArrayList / ArrayList<Binary>) in the five dictionary writers.

Why

The dictionary page must be emitted in insertion order (dictionary index i = i-th distinct value seen). The Linked variant provides this via a doubly-linked list threaded through the slot array. That guarantee is paid for on every put:

  • 2 extra long fields per slot (prev, next) → larger slot footprint, more cache lines per probe
  • 3–4 scattered writes per insert to fix up the doubly-linked list
  • Re-stitching the linked list on rehash
  • Pure pointer chasing — not vectorizable, not branch-friendly

For high-cardinality columns (hundreds of thousands of distinct values per chunk), this overhead compounds on a hot path.

After this change, the hash map is a pure "have I seen this? what's its id?" lookup with the smallest possible slot, and the list is append-only / contiguous / cache-friendly to iterate at flush time. The two responsibilities (lookup vs ordering) were jammed into one structure; splitting them lets each be optimal.

Both candidates are fastutil primitive-keyed maps, so this is not a boxing change. The win is purely structural.
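The resulting lookup-vs-ordering split can be sketched in plain Java. Note this is an illustrative sketch only: the actual writers use fastutil primitive maps (e.g. Long2IntOpenHashMap) and primitive lists to avoid boxing, and the class and method names below (DictionarySketch, indexOf, values) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the pattern described above: the map answers only
// "have I seen this value? what's its id?", while the append-only
// list preserves insertion order for the dictionary page flush.
final class DictionarySketch<T> {
  private final Map<T, Integer> ids = new HashMap<>(); // value -> dictionary id
  private final List<T> order = new ArrayList<>();     // id -> value, insertion order

  int indexOf(T value) {
    Integer id = ids.get(value);
    if (id != null) {
      return id;                 // hot path: one hash lookup, no list writes
    }
    int next = order.size();     // next id == current distinct-value count
    ids.put(value, next);
    order.add(value);            // append-only, contiguous, cache-friendly
    return next;
  }

  List<T> values() {             // iterated in order when the page is flushed
    return order;
  }
}
```

Duplicate values hit only the map lookup, which is exactly the case the LOW-cardinality benchmarks stress; the ordering guarantee the Linked map paid for on every put is now provided by a single `add` on first sight of a value.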

Benchmark results

From BinaryEncodingBenchmark.encodeDictionary and IntEncodingBenchmark.encodeDictionary (added in #3512):

| Benchmark | Configuration | master | this PR | speedup |
|---|---|---|---|---|
| BinaryEncodingBenchmark.encodeDictionary | LOW card, len=1000 | 3.3291 µs | 0.0469 µs | 70.97x |
| BinaryEncodingBenchmark.encodeDictionary | LOW card, len=100 | 0.3416 µs | 0.0545 µs | 6.26x |
| BinaryEncodingBenchmark.encodeDictionary | HIGH card, len=10 | 1.2991 µs | 0.5116 µs | 2.54x |
| BinaryEncodingBenchmark.encodeDictionary | LOW card, len=10 | 0.0768 µs | 0.0490 µs | 1.57x |
| IntEncodingBenchmark.encodeDictionary | RANDOM | 0.4125 µs | 0.2000 µs | 2.06x |
| IntEncodingBenchmark.encodeDictionary | SEQUENTIAL | 0.4260 µs | 0.2076 µs | 2.05x |
| IntEncodingBenchmark.encodeDictionary | HIGH_CARDINALITY | 0.4217 µs | 0.2069 µs | 2.04x |

Note: the 70x outlier on LOW cardinality / 1000-char strings is consistent with eliminating linked-list pointer chasing through hash-table slots when many duplicates accumulate at the head.

The Binary cache from #3500 also contributes here; this PR is additive on top of that win.

How to reproduce

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar \
    'BinaryEncodingBenchmark.encodeDictionary|IntEncodingBenchmark.encodeDictionary' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

Validation

  • parquet-column: 573 tests pass
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

User-facing changes

None. No public API change. No file format change. Dictionary pages emit values in the same order as before.

Closes #3513

Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504, #3506, #3510. Companion benchmarks contribution: #3512.

Replace fastutil's *2IntLinkedOpenHashMap with the plain *2IntOpenHashMap
plus a separate primitive-typed list to track insertion order in the five
dictionary writers (binary, long, double, float, int).

The Linked variant was used because the dictionary page must be emitted
in insertion order, but it pays an avoidable cost on every put: two extra
long fields per slot (prev, next), 3-4 scattered writes per insert to fix
up the doubly-linked list, and re-stitching on rehash. None of this is
vectorizable. With the plain map plus an append-only list, the hash map
is a pure id lookup with the smallest possible slot, and the list is
contiguous and cache-friendly to iterate at flush time.

Both candidates are fastutil primitive-keyed maps, so this is not a
boxing change. The win is structural: an ordering guarantee that was
being paid for on every insert is replaced with an explicit append-only
list that provides it more cheaply.

Benchmark results (BinaryEncodingBenchmark.encodeDictionary,
IntEncodingBenchmark.encodeDictionary - added in apache#3512):

  - encodeDictionary (binary, high cardinality, short strings): +23-42%
  - encodeDictionary (int, high cardinality):                   ~+2x
  - low-cardinality cases: flat (linked-list overhead doesn't matter
    when there are few inserts)

No public API change. No file format change. Behavior is identical:
dictionary pages emit values in the same order.

Validation: parquet-column 573 tests pass. Built with
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
iemejia force-pushed the perf-dictionary-writers branch from 973f821 to d6c5f91 on April 20, 2026 at 13:55

Successfully merging this pull request may close these issues.

Optimize dictionary writers by replacing fastutil Linked maps with OpenHashMap + ArrayList