
fsst: regression test for i32 offset overflow in fsst_compress #7832

Closed
mprammer wants to merge 2 commits into develop from mp/fsst-i32-overflow-regression-test

Conversation


@mprammer mprammer commented May 7, 2026

Regression test for #7833 — FSST i32 offset overflow. Wrapped in #[should_panic(expected = "to offset of type i32")] so it merges green; drop the attribute alongside the fix and the trailing assert_eq!(compressed.len(), len) becomes the live assertion.

Gated #[test_with::env(CI)] + #[test_with::no_env(VORTEX_SKIP_SLOW_TESTS)] to match vortex-btrblocks/src/schemes/integer.rs:1113.

🤖 Generated with Claude Code

`fsst_compress_iter` (encodings/fsst/src/compress.rs:72) hardcodes
`VarBinBuilder::<i32>` for the compressed output, so any input whose
cumulative compressed bytes exceed `i32::MAX` panics in
`vortex-array/src/arrays/varbin/builder.rs:62` with

    Other error: Failed to convert sum of N and M to offset of type i32

Hit in practice on a real >4 GiB string column going through `vxio.write`.
The bug isn't in the input-conversion path — that's zero-copy and respects
the input offset width — so widening the input to `large_string` (i64
offsets) at the pyarrow side does NOT help; FSST's output builder runs
either way.
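
The failure mode above can be sketched in a few self-contained lines (illustrative code, not Vortex internals): the output builder tracks cumulative compressed bytes and converts the running total to an `i32` offset, so the conversion fails once the total passes `i32::MAX`, no matter how wide the input offsets were.

```rust
// Minimal sketch of an i32 offset builder hitting its ceiling. The function
// name is hypothetical; only the overflow arithmetic mirrors the bug.
fn push_offset_i32(offsets: &mut Vec<i32>, total_bytes: u64) -> Result<(), String> {
    let off = i32::try_from(total_bytes)
        .map_err(|_| format!("Failed to convert {total_bytes} to offset of type i32"))?;
    offsets.push(off);
    Ok(())
}

fn main() {
    let mut offsets = vec![0i32];
    // Under the ~2 GiB ceiling the conversion succeeds...
    assert!(push_offset_i32(&mut offsets, i32::MAX as u64).is_ok());
    // ...one byte past it, it fails: this is the panic the test provokes.
    assert!(push_offset_i32(&mut offsets, i32::MAX as u64 + 1).is_err());
}
```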

Add a stress regression test that constructs a `VarBinArray<i64>` with
~2.5 GiB of high-entropy ASCII (FSST cannot compress it below the i32
ceiling) and runs `fsst_compress` end-to-end. The test currently panics
with the documented message; it's wrapped in `#[should_panic]` so the
test passes today and trips when the underlying bug is fixed — at which
point the maintainer drops `#[should_panic]` and the trailing
`assert_eq!(compressed.len(), len)` becomes the live assertion.
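
The `#[should_panic(expected = ...)]` contract hinges on the panic message containing the expected substring. A standalone sketch of that contract, using `catch_unwind` so it runs outside a test harness (the panic message here is a stand-in, not captured output):

```rust
use std::panic;

fn main() {
    // Stand-in for fsst_compress overflowing its i32 offset builder.
    let result = panic::catch_unwind(|| {
        panic!("Other error: Failed to convert sum of 2147480000 and 8000 to offset of type i32");
    });
    // #[should_panic(expected = "to offset of type i32")] performs the same
    // substring check on the panic payload.
    let err = result.unwrap_err();
    let msg = err.downcast_ref::<&str>().copied().unwrap_or_default();
    assert!(msg.contains("to offset of type i32"));
}
```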

Gated with `#[test_with::env(CI)]` + `#[test_with::no_env(VORTEX_SKIP_SLOW_TESTS)]`
(matching the precedent in vortex-btrblocks/src/schemes/integer.rs:1113)
because the test allocates ~5 GiB peak and runs in ~6 s under release.

Verified locally:
- `cargo test -p vortex-fsst fsst_compress_offsets`
    → ignored, because variable CI not found
- `CI=1 cargo test --release -p vortex-fsst fsst_compress_offsets`
    → 1 passed (panics as expected, captured by should_panic)
- `CI=1 VORTEX_SKIP_SLOW_TESTS=1 cargo test --release -p vortex-fsst fsst_compress_offsets`
    → ignored, because variable VORTEX_SKIP_SLOW_TESTS was found
- `cargo +nightly fmt --all` clean
- `cargo clippy -p vortex-fsst --all-targets --all-features` clean

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Comment on lines +53 to +55

    let pool: Vec<u8> = (0..POOL_LEN)
        .map(|_| *ALPHABET.choose(&mut rng).unwrap())
        .collect();
Contributor

do you need this? vs using a single char (so the test runs faster)

Author

Tried first: FSST collapses repetitive bytes, so the output never crosses 2 GiB.

Comment on lines +57 to +67

    let mut builder = VarBinBuilder::<i64>::with_capacity(N);
    for i in 0..N {
        let off = (i.wrapping_mul(31337)) % (POOL_LEN - STRING_LEN);
        builder.append_value(&pool[off..off + STRING_LEN]);
    }
    let array = builder.finish(DType::Utf8(Nullability::NonNullable));

    let compressor = fsst_train_compressor(&array);
    let len = array.len();
    let dtype = array.dtype().clone();
    let mut ctx = LEGACY_SESSION.create_execution_ctx();
Contributor

could we directly build the fsst array to save some time?

Author

Bug is in fsst_compress_iter (compress.rs:72); constructing the array directly skips the panicking path.

Contributor

gatesn commented May 7, 2026

Can we just.... also fix it??

@mprammer mprammer added the changelog/fix A bug fix label May 7, 2026
Contributor

@connortsui20 connortsui20 left a comment


we should just fix this

@connortsui20
Contributor

We should just write the offsets as i64 (as an aside, why do we use signed integers here and not unsigned?), since the compressor is going to narrow the codes regardless:

let utf8 = data.array_as_utf8().into_owned();
let compressor_fsst = fsst_train_compressor(&utf8);
let fsst = fsst_compress(&utf8, utf8.len(), utf8.dtype(), &compressor_fsst, exec_ctx);

let uncompressed_lengths_primitive = fsst
    .uncompressed_lengths()
    .clone()
    .execute::<PrimitiveArray>(exec_ctx)?
    .narrow()?;
let compressed_original_lengths = compressor.compress_child(
    &uncompressed_lengths_primitive.into_array(),
    &compress_ctx,
    self.id(),
    0,
    exec_ctx,
)?;

let codes_offsets_primitive = fsst
    .codes()
    .offsets()
    .clone()
    .execute::<PrimitiveArray>(exec_ctx)?
    .narrow()?;
let compressed_codes_offsets = compressor.compress_child(
    &codes_offsets_primitive.into_array(),
    &compress_ctx,
    self.id(),
    1,
    exec_ctx,
)?;
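
The widen-then-narrow idea above can be sketched without the Vortex API (the `Offsets` enum and `narrow` function here are illustrative, not the real `.narrow()?`): build offsets as `i64` unconditionally, then narrow to `i32` only when the final cumulative offset fits.

```rust
// Hedged sketch of the proposed fix: i64 offsets during the build, narrowed
// afterwards when the last (largest) offset fits in i32.
#[derive(Debug, PartialEq)]
enum Offsets {
    I32(Vec<i32>),
    I64(Vec<i64>),
}

fn narrow(offsets: Vec<i64>) -> Offsets {
    match offsets.last() {
        // Cumulative bytes exceed i32::MAX: keep the wide representation.
        Some(&last) if last > i64::from(i32::MAX) => Offsets::I64(offsets),
        // Otherwise every offset fits, so narrowing is lossless.
        _ => Offsets::I32(offsets.into_iter().map(|o| o as i32).collect()),
    }
}

fn main() {
    // Small arrays narrow down to i32 offsets...
    assert_eq!(narrow(vec![0, 10, 20]), Offsets::I32(vec![0, 10, 20]));
    // ...while >2 GiB of cumulative bytes keeps i64 offsets instead of panicking.
    let big = vec![0, i64::from(i32::MAX) + 1];
    assert!(matches!(narrow(big), Offsets::I64(_)));
}
```

Since the compressor narrows the offsets child afterwards anyway, the extra width only exists transiently during the build.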

connortsui20 added a commit that referenced this pull request May 7, 2026
Move the regression test from PR #7832's tests_large.rs into the
existing tests.rs module. Use #[ignore] instead of test_with env gates
since the test allocates ~5 GiB and shouldn't run by default even in
CI.

Tracks #7833.

Signed-off-by: Claude <noreply@anthropic.com>
@mprammer mprammer closed this May 7, 2026
connortsui20 pushed a commit that referenced this pull request May 8, 2026
Adds an `#[ignore]`d regression test for #7833 to the existing
`encodings/fsst/src/tests.rs`. The test allocates ~5 GiB total, so it
is opt-in via `--ignored`:

    cargo test --release -p vortex-fsst -- --ignored fsst_compress_offsets

This is an alternative to #7832 that keeps the test alongside the
other FSST tests instead of introducing a new module, and avoids the
`test-with` dev-dependency.

Signed-off-by: Claude <noreply@anthropic.com>