
fsst: regression test for i32 offset overflow in fsst_compress #7832

Closed
mprammer wants to merge 2 commits into develop from mp/fsst-i32-overflow-regression-test

Conversation


@mprammer mprammer commented May 7, 2026

Regression test for #7833 — FSST i32 offset overflow. Wrapped in #[should_panic(expected = "to offset of type i32")] so it merges green; drop the attribute alongside the fix and the trailing assert_eq!(compressed.len(), len) becomes the live assertion.

Gated #[test_with::env(CI)] + #[test_with::no_env(VORTEX_SKIP_SLOW_TESTS)] to match vortex-btrblocks/src/schemes/integer.rs:1113.

🤖 Generated with Claude Code

`fsst_compress_iter` (encodings/fsst/src/compress.rs:72) hardcodes
`VarBinBuilder::<i32>` for the compressed output, so any input whose
cumulative compressed bytes exceed `i32::MAX` panics in
`vortex-array/src/arrays/varbin/builder.rs:62` with

    Other error: Failed to convert sum of N and M to offset of type i32

Hit in practice on a real >4 GiB string column going through `vxio.write`.
The bug isn't in the input-conversion path — that's zero-copy and respects
the input offset width — so widening the input to `large_string` (i64
offsets) at the pyarrow side does NOT help; FSST's output builder runs
either way.
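
The failure mode above can be sketched in a few self-contained lines (illustrative code, not Vortex internals): the output builder tracks cumulative compressed bytes and converts the running total to an `i32` offset, so the conversion fails once the total passes `i32::MAX`, no matter how wide the input offsets were.

```rust
// Minimal sketch of an i32 offset builder hitting its ceiling. The function
// name is hypothetical; only the overflow arithmetic mirrors the bug.
fn push_offset_i32(offsets: &mut Vec<i32>, total_bytes: u64) -> Result<(), String> {
    let off = i32::try_from(total_bytes)
        .map_err(|_| format!("Failed to convert {total_bytes} to offset of type i32"))?;
    offsets.push(off);
    Ok(())
}

fn main() {
    let mut offsets = vec![0i32];
    // Under the ~2 GiB ceiling the conversion succeeds...
    assert!(push_offset_i32(&mut offsets, i32::MAX as u64).is_ok());
    // ...one byte past it, it fails: this is the panic the test provokes.
    assert!(push_offset_i32(&mut offsets, i32::MAX as u64 + 1).is_err());
}
```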

Add a stress regression test that constructs a `VarBinArray<i64>` with
~2.5 GiB of high-entropy ASCII (FSST cannot compress it below the i32
ceiling) and runs `fsst_compress` end-to-end. The test currently panics
with the documented message; it's wrapped in `#[should_panic]` so the
test passes today and trips when the underlying bug is fixed — at which
point the maintainer drops `#[should_panic]` and the trailing
`assert_eq!(compressed.len(), len)` becomes the live assertion.
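
The `#[should_panic(expected = ...)]` contract hinges on the panic message containing the expected substring. A standalone sketch of that contract, using `catch_unwind` so it runs outside a test harness (the panic message here is a stand-in, not captured output):

```rust
use std::panic;

fn main() {
    // Stand-in for fsst_compress overflowing its i32 offset builder.
    let result = panic::catch_unwind(|| {
        panic!("Other error: Failed to convert sum of 2147480000 and 8000 to offset of type i32");
    });
    // #[should_panic(expected = "to offset of type i32")] performs the same
    // substring check on the panic payload.
    let err = result.unwrap_err();
    let msg = err.downcast_ref::<&str>().copied().unwrap_or_default();
    assert!(msg.contains("to offset of type i32"));
}
```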

Gated with `#[test_with::env(CI)]` + `#[test_with::no_env(VORTEX_SKIP_SLOW_TESTS)]`
(matching the precedent in vortex-btrblocks/src/schemes/integer.rs:1113)
because the test allocates ~5 GiB peak and runs in ~6 s under release.

Verified locally:
- `cargo test -p vortex-fsst fsst_compress_offsets`
    → ignored, because variable CI not found
- `CI=1 cargo test --release -p vortex-fsst fsst_compress_offsets`
    → 1 passed (panics as expected, captured by should_panic)
- `CI=1 VORTEX_SKIP_SLOW_TESTS=1 cargo test --release -p vortex-fsst fsst_compress_offsets`
    → ignored, because variable VORTEX_SKIP_SLOW_TESTS was found
- `cargo +nightly fmt --all` clean
- `cargo clippy -p vortex-fsst --all-targets --all-features` clean

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
Comment on lines +53 to +55

    let pool: Vec<u8> = (0..POOL_LEN)
        .map(|_| *ALPHABET.choose(&mut rng).unwrap())
        .collect();
Contributor

do you need this? vs using a single char (so the test runs faster)

Author

Tried first: FSST collapses repetitive bytes, so the output never crosses 2 GiB.

Comment on lines +57 to +67

    let mut builder = VarBinBuilder::<i64>::with_capacity(N);
    for i in 0..N {
        let off = (i.wrapping_mul(31337)) % (POOL_LEN - STRING_LEN);
        builder.append_value(&pool[off..off + STRING_LEN]);
    }
    let array = builder.finish(DType::Utf8(Nullability::NonNullable));

    let compressor = fsst_train_compressor(&array);
    let len = array.len();
    let dtype = array.dtype().clone();
    let mut ctx = LEGACY_SESSION.create_execution_ctx();
Contributor

could we directly build the fsst array to save some time?

Author

Bug is in fsst_compress_iter (compress.rs:72); constructing the array directly skips the panicking path.

Contributor

gatesn commented May 7, 2026

Can we just.... also fix it??

@mprammer mprammer added the changelog/fix A bug fix label May 7, 2026
Contributor

@connortsui20 connortsui20 left a comment


we should just fix this

@connortsui20
Contributor

We should just write the offsets as i64 (as an aside, why do we use signed integers here and not unsigned?), since the compressor is going to narrow the codes regardless:

let utf8 = data.array_as_utf8().into_owned();
let compressor_fsst = fsst_train_compressor(&utf8);
let fsst = fsst_compress(&utf8, utf8.len(), utf8.dtype(), &compressor_fsst, exec_ctx);

let uncompressed_lengths_primitive = fsst
    .uncompressed_lengths()
    .clone()
    .execute::<PrimitiveArray>(exec_ctx)?
    .narrow()?;
let compressed_original_lengths = compressor.compress_child(
    &uncompressed_lengths_primitive.into_array(),
    &compress_ctx,
    self.id(),
    0,
    exec_ctx,
)?;

let codes_offsets_primitive = fsst
    .codes()
    .offsets()
    .clone()
    .execute::<PrimitiveArray>(exec_ctx)?
    .narrow()?;
let compressed_codes_offsets = compressor.compress_child(
    &codes_offsets_primitive.into_array(),
    &compress_ctx,
    self.id(),
    1,
    exec_ctx,
)?;
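
The widen-then-narrow idea above can be sketched without the Vortex API (the `Offsets` enum and `narrow` function here are illustrative, not the real `.narrow()?`): build offsets as `i64` unconditionally, then narrow to `i32` only when the final cumulative offset fits.

```rust
// Hedged sketch of the proposed fix: i64 offsets during the build, narrowed
// afterwards when the last (largest) offset fits in i32.
#[derive(Debug, PartialEq)]
enum Offsets {
    I32(Vec<i32>),
    I64(Vec<i64>),
}

fn narrow(offsets: Vec<i64>) -> Offsets {
    match offsets.last() {
        // Cumulative bytes exceed i32::MAX: keep the wide representation.
        Some(&last) if last > i64::from(i32::MAX) => Offsets::I64(offsets),
        // Otherwise every offset fits, so narrowing is lossless.
        _ => Offsets::I32(offsets.into_iter().map(|o| o as i32).collect()),
    }
}

fn main() {
    // Small arrays narrow down to i32 offsets...
    assert_eq!(narrow(vec![0, 10, 20]), Offsets::I32(vec![0, 10, 20]));
    // ...while >2 GiB of cumulative bytes keeps i64 offsets instead of panicking.
    let big = vec![0, i64::from(i32::MAX) + 1];
    assert!(matches!(narrow(big), Offsets::I64(_)));
}
```

Since the compressor narrows the offsets child afterwards anyway, the extra width only exists transiently during the build.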

connortsui20 added a commit that referenced this pull request May 7, 2026
Move the regression test from PR #7832's tests_large.rs into the
existing tests.rs module. Use #[ignore] instead of test_with env gates
since the test allocates ~5 GiB and shouldn't run by default even in
CI.

Tracks #7833.

Signed-off-by: Claude <noreply@anthropic.com>
@mprammer mprammer closed this May 7, 2026
connortsui20 pushed a commit that referenced this pull request May 8, 2026
Adds an `#[ignore]`d regression test for #7833 to the existing
`encodings/fsst/src/tests.rs`. The test allocates ~5 GiB total, so it
is opt-in via `--ignored`:

    cargo test --release -p vortex-fsst -- --ignored fsst_compress_offsets

This is an alternative to #7832 that keeps the test alongside the
other FSST tests instead of introducing a new module, and avoids the
`test-with` dev-dependency.

Signed-off-by: Claude <noreply@anthropic.com>