Add guards around expression depth for Filters and Projectors.#26
Open
lriggs wants to merge 2 commits into
Open
Add guards around expression depth for Filters and Projectors.#26lriggs wants to merge 2 commits into
lriggs wants to merge 2 commits into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Expression Guard — change summary
Problem
A Dremio query plan containing an
UNPIVOTproduced anIS NOT NULL(CASE WHEN key='lit_0' THEN v0 WHEN key='lit_1' THEN v1 … WHEN key='lit_392' THEN v392 ELSE NULL END)expression with 393 arms. When the plan was passed to Gandiva, the executor thread died with a nativeSIGSEGV(exit 139). The customer hs_err showed:Filter.evaluate→JniWrapper.evaluateFilter(i.e. runtime, not compile-time)RSP~1 MB below the executor thread's 1028 KB stack baseSEGV_ACCERRon the guard page exactly at the stack base[rsp+0x6e??]/[rsp+0x6f??]/… slotsi.e. one Gandiva-emitted function tried to reserve a stack frame larger than the entire executor thread's stack, and the prologue walked off the bottom on first call.
Root cause
Dremio's planner has a rewrite rule that turns the compact
IS NOT NULL(CASE …)form into:For N=393 that materializes 77,421
equalcomparisons in one ~4.5 MB expression. Gandiva compiles the whole thing into a single LLVM function; the optimizer can't deduplicate the redundantnot(equal)subterms across AND arms, so thousands of SSA values get spill slots in the function's stack frame. The frame size exceeds available thread stack and the prologue faults.The crash class has two reachable failure modes:
IfNodechain) causesExprDecomposer::Visit(IfNode&)andLLVMGenerator::Visitor::Visit(IfDex&)to recurse once per level. Each frame holds severalshared_ptrs and lambda captures (~3 KB). Around depth ~360 on a 1 MB stack the recursion blows the stack duringFilter.make/Projector.make.Filter.evaluate/Projector.evaluate.Either mode kills the JVM in native code with no chance for Java to catch it.
Solution
Reject pathological expression trees on the Java side, before the JNI call. The check is iterative (uses an explicit
ArrayDequework-stack) so the guard itself cannot stack-overflow on the same input it's trying to protect against.Limits
org.apache.arrow.gandiva.expr.max_depthorg.apache.arrow.gandiva.expr.max_nodesWhy these numbers:
The defaults can be raised at deploy time without rebuilding.
Changes
New files
gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/ExpressionGuard.java— package-private iterative walker overGandivaTypes.TreeNodeprotobuf. HandlesIfNode,AndNode,OrNode,FunctionNode,InNode; leaves (fields, literals) terminate naturally. Each step increments a counter and checks against the limits.gandiva/src/test/java/org/apache/arrow/gandiva/evaluator/ExpressionGuardTest.java— 5 focused tests:filterRejectsExpressionAboveDepthLimit,projectorRejectsExpressionAboveDepthLimitfilterRejectsExpressionAboveNodeCountLimit,projectorRejectsExpressionAboveNodeCountLimitsmallExpressionPassesGuard— sanity that ordinary expressions are unaffectedgandiva/src/test/java/org/apache/arrow/gandiva/evaluator/UnpivotCaseStackOverflowTest.java— regression tests for the original customer plan shape (393 nestedIfNodes) as both a Projector and Filter:testUnpivotCaseStackOverflow/testUnpivotCaseStackOverflowFilter— assertGandivaExceptionwith a depth-limit message instead of a JVM crash.testUnpivotCaseShallow/testUnpivotCaseShallowFilter— shallow (depth-8) versions confirming legitimate CASE expressions still compile and evaluate.Modified files
gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/Filter.java— callsExpressionGuard.check(conditionBuf)in the synchronizedmake(Schema, Condition, long)overload, just aftercondition.toProtobuf()and before the JNI call. Every publicmake()overload funnels through this one, so the guard applies uniformly.gandiva/src/main/java/org/apache/arrow/gandiva/evaluator/Projector.java— same pattern, withExpressionGuard.check(exprList)over the builtExpressionList.Tests
mvn -Parrow-jni -pl gandiva test).What this change does NOT do
ExprDecomposer::Visit,LLVMGenerator::Visitor::Visit) are still recursive; LLVM's own passes still walk def-use chains recursively; the JIT'd function can still produce a huge frame given the right input shape. The guard only ensures that Java-submitted expressions can't reach those vulnerabilities through the public API.GandivaExceptioninstead of a JVM crash. The real fix needs to happen upstream — either skip the planner rewrite for large N or replace the rewrite with one that doesn't materialize O(N²) redundant comparisons.Suggested follow-ups (outside this change)
IS NOT NULL(CASE …)into OR-of-ANDs and guard it on arm count, or replace it with a lowering that shares prefix tests rather than duplicating them.ExprDecomposer::Visit(IfNode&)andLLVMGenerator::Visitor::Visit(IfDex&)iteratively, and special-case long if-chains inLLVMGeneratorto a single basic-block compare-cascade-with-φ instead of cascadingBuildIfElsecalls.