Skip to content

Add guards around expression depth for Filters and Projectors.#26

Open
lriggs wants to merge 2 commits into
dremio:dremio_27.0_23_19from
lriggs:overflow_fix
Open

Add guards around expression depth for Filters and Projectors.#26
lriggs wants to merge 2 commits into
dremio:dremio_27.0_23_19from
lriggs:overflow_fix

Conversation

@lriggs
Copy link
Copy Markdown
Collaborator

@lriggs lriggs commented May 27, 2026

Expression Guard — change summary

Problem

A Dremio query plan containing an UNPIVOT produced an IS NOT NULL(CASE WHEN key='lit_0' THEN v0 WHEN key='lit_1' THEN v1 … WHEN key='lit_392' THEN v392 ELSE NULL END) expression with 393 arms. When the plan was passed to Gandiva, the executor thread died with a native SIGSEGV (exit 139). The customer hs_err showed:

  • Filter.evaluateJniWrapper.evaluateFilter (i.e. runtime, not compile-time)
  • RSP ~1 MB below the executor thread's 1028 KB stack base
  • SEGV_ACCERR on the guard page exactly at the stack base
  • Instructions at the faulting PC: a function prologue spilling registers into [rsp+0x6e??]/[rsp+0x6f??]/… slots

i.e. one Gandiva-emitted function tried to reserve a stack frame larger than the entire executor thread's stack, and the prologue walked off the bottom on first call.

Root cause

Dremio's planner has a rewrite rule that turns the compact IS NOT NULL(CASE …) form into:

OR(
  AND(c0, isnotnull(v0)),
  AND(NOT(c0), c1, isnotnull(v1)),
  AND(NOT(c0), NOT(c1), c2, isnotnull(v2)),
  …
  AND(NOT(c0)..NOT(c_{N-2}), c_{N-1}, isnotnull(v_{N-1}))
)

For N=393 that materializes 77,421 equal comparisons in one ~4.5 MB expression. Gandiva compiles the whole thing into a single LLVM function; the optimizer can't deduplicate the redundant not(equal) subterms across AND arms, so thousands of SSA values get spill slots in the function's stack frame. The frame size exceeds available thread stack and the prologue faults.

The crash class has two reachable failure modes:

  1. Compile-time native recursion — a deeply nested AST (e.g. nested IfNode chain) causes ExprDecomposer::Visit(IfNode&) and LLVMGenerator::Visitor::Visit(IfDex&) to recurse once per level. Each frame holds several shared_ptrs and lambda captures (~3 KB). Around depth ~360 on a 1 MB stack the recursion blows the stack during Filter.make / Projector.make.
  2. Runtime giant JIT frame — a wide flat tree (the customer's post-rewrite OR-of-ANDs) compiles successfully but produces a JIT'd function whose single stack frame is larger than the calling thread's stack budget. Crashes on first call to Filter.evaluate / Projector.evaluate.

Either mode kills the JVM in native code with no chance for Java to catch it.

Solution

Reject pathological expression trees on the Java side, before the JNI call. The check is iterative (uses an explicit ArrayDeque work-stack) so the guard itself cannot stack-overflow on the same input it's trying to protect against.

Limits

Default System property override
Max AST depth 100 org.apache.arrow.gandiva.expr.max_depth
Max total node count 10 000 org.apache.arrow.gandiva.expr.max_nodes

Why these numbers:

  • Depth 100 is well below the compile-time-recursion cliff and far above any legitimate hand-written CASE/IF depth (single digits to low tens).
  • Node count 10 000 covers the wide-flat shape. The customer's failing input (393-arm OR-of-ANDs → ~77 421 nodes) is rejected with ~8× margin while ordinary expressions are unaffected.

The defaults can be raised at deploy time without rebuilding.

Changes

New files

  • gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/ExpressionGuard.java — package-private iterative walker over GandivaTypes.TreeNode protobuf. Handles IfNode, AndNode, OrNode, FunctionNode, InNode; leaves (fields, literals) terminate naturally. Each step increments a counter and checks against the limits.
  • gandiva/src/test/java/org/apache/arrow/gandiva/evaluator/ExpressionGuardTest.java — 5 focused tests:
    • filterRejectsExpressionAboveDepthLimit, projectorRejectsExpressionAboveDepthLimit
    • filterRejectsExpressionAboveNodeCountLimit, projectorRejectsExpressionAboveNodeCountLimit
    • smallExpressionPassesGuard — sanity that ordinary expressions are unaffected
  • gandiva/src/test/java/org/apache/arrow/gandiva/evaluator/UnpivotCaseStackOverflowTest.java — regression tests for the original customer plan shape (393 nested IfNodes) as both a Projector and Filter:
    • testUnpivotCaseStackOverflow / testUnpivotCaseStackOverflowFilter — assert GandivaException with a depth-limit message instead of a JVM crash.
    • testUnpivotCaseShallow / testUnpivotCaseShallowFilter — shallow (depth-8) versions confirming legitimate CASE expressions still compile and evaluate.

Modified files

  • gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/Filter.java — calls ExpressionGuard.check(conditionBuf) in the synchronized make(Schema, Condition, long) overload, just after condition.toProtobuf() and before the JNI call. Every public make() overload funnels through this one, so the guard applies uniformly.
  • gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/Projector.java — same pattern, with ExpressionGuard.check(exprList) over the built ExpressionList.

Tests

  • Full gandiva suite: 101 tests, 0 failures, 0 errors (verified with mvn -Parrow-jni -pl gandiva test).
  • 5 new guard tests, all green.
  • 4 UNPIVOT regression tests, all green — the deep variants assert the guard exception, the shallow variants assert normal compilation.

What this change does NOT do

  • It does not eliminate the underlying native vulnerabilities. The AST visitors in Gandiva (ExprDecomposer::Visit, LLVMGenerator::Visitor::Visit) are still recursive; LLVM's own passes still walk def-use chains recursively; the JIT'd function can still produce a huge frame given the right input shape. The guard only ensures that Java-submitted expressions can't reach those vulnerabilities through the public API.
  • It does not make the customer's UNPIVOT query succeed. With the guard in place, the same plan now produces a clear GandivaException instead of a JVM crash. The real fix needs to happen upstream — either skip the planner rewrite for large N or replace the rewrite with one that doesn't materialize O(N²) redundant comparisons.

Suggested follow-ups (outside this change)

  • Planner-side: identify the Calcite/Dremio rule that turns IS NOT NULL(CASE …) into OR-of-ANDs and guard it on arm count, or replace it with a lowering that shares prefix tests rather than duplicating them.
  • Gandiva C++: rewrite ExprDecomposer::Visit(IfNode&) and LLVMGenerator::Visitor::Visit(IfDex&) iteratively, and special-case long if-chains in LLVMGenerator to a single basic-block compare-cascade-with-φ instead of cascading BuildIfElse calls.

@lriggs lriggs requested review from jbonofre and laurentgo as code owners May 27, 2026 21:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant