feat: add cosine_distance scalar function#21542
Conversation
Add cosine_distance (and list_cosine_distance alias) to compute cosine distance between two numeric arrays. Includes shared vector math primitives in vector_math.rs for reuse by follow-on functions. Part of apache#21536. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses review comments on apache#21542: - Iterate list offsets/values directly instead of per-row ArrayRef downcast - Remove nested-list unwrap loop (function does not support nested lists) - Drop convert_to_f64_array wrapper (coerce_types already guarantees Float64) - Remove duplicate Rust unit tests now covered by SLT - More descriptive error message for mismatched list lengths - Delete now-unused vector_math module; inline math into sole caller Adds SLT coverage for NULL-element-in-list behavior previously tested only in Rust unit tests. Part of apache#21536. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the detailed review, @Jefffrey. Rework pushed in fc3ee90. Walking through each comment: 1. Iterate via offsets/values, not per-row ArrayRef downcast ( 2. Nested-list unwrap loop ( 3. Redundant null/Float64 check ( 4. Ambiguous length-mismatch error wording ( 5. Duplicate Rust unit tests ( 6. 7. Inline the math instead of a separate module ( Full validation matrix ( |
Addresses round-2 review comments on apache#21542: - Widen container variant in coerce_types when inputs mix List and LargeList (or FixedSizeList), so mixed-type calls like `cosine_distance([1.0, 0.0], arrow_cast([0.0, 1.0], 'LargeList(Float64)'))` succeed. Follows the pattern from PR apache#21704 (ArrayConcat). - Coerce bare NULL inputs to a matching list variant so `cosine_distance(NULL, [1.0, 2.0])` returns NULL instead of erroring. - Drop the `list_cosine_distance` alias — the base name is not `array_cosine_distance`, so the `array_X` -> `list_X` convention does not apply. - Expand SLT coverage: mixed-type variants, FixedSizeList inputs, Float32 and Int64 inner types, bare NULL in each position, NULL row in a multi-row VALUES, and an unsupported-type plan error case. Dispatch fallthrough in cosine_distance_inner is now unreachable after the coerce_types widening, changed from exec_err! to internal_err!. Part of apache#21536. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks @Jefffrey. Round-3 pushed in ce312cc: 1. Mixed-type inputs ( 2. 3. Bare NULL input ( 4. Multi-row NULL coverage ( Additional SLT coverage added proactively:
Full validation run clean (fmt, clippy, full + sqlite-extended SLT, CLI, doctests, feature-flag checks, |
|
Thanks @crm26 |
## Which issue does this PR close? Part of apache#21536 — split of apache#21371 into one-function-per-PR. ## Rationale for this change Adds `inner_product(array1, array2)` — the dot product of two equal-length numeric arrays, returning `Float64`. Computed as `sum(array1[i] * array2[i])`. ## What changes are included in this PR? Mirrors the structural pattern of merged apache#21542 (`cosine_distance`): - Same `coerce_types` for `List`/`LargeList`/`FixedSizeList` of any numeric inner type, with widening to `LargeList` when any input is `LargeList` (per the apache#21704 pattern) - Same NULL semantics: bare `NULL` → `NULL`, NULL row → NULL, NULL element in list → NULL - Same Arrow-idiomatic implementation: single `as_float64_array(list_array.values())` downcast, slice by `value_offsets()`, iterate via `ScalarBuffer<f64>` - No alias, no shared module — standalone, inline math The arithmetic is the only semantic divergence from `cosine_distance`: - `dot += a*b` (no magnitude or normalization) - Empty arrays return `0.0` (sum of empty set), not `NULL` - No zero-magnitude special case (`inner_product([0,0], [1,2])` returns `0`, which is well-defined for inner product) ## Are these changes tested? Yes. SLT covers: - Orthogonal, identical, opposite, general non-trivial vectors - Single zero vector, both zero vectors - Bare `NULL` in either or both positions - NULL element inside a list (returns NULL for that row) - Mismatched lengths (error) - `LargeList` inputs - Mixed `(List, LargeList)` in both orders - `(FixedSizeList, FixedSizeList)` and `(FixedSizeList, LargeList)` - `Float32` and `Int64` inner type coercion - Multi-row query with NULL row propagation - Empty arrays (returns `0`) - No-args error - Return-type assertion (`Float64`) ## Are there any user-facing changes? New scalar function `inner_product`, documented in `docs/source/user-guide/sql/scalar_functions.md`. --------- Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Summary
cosine_distance(array1, array2)/list_cosine_distance— computes cosine distance (1 - cosine similarity) between two numeric arraysvector_math.rsprimitives (dot_product_f64,magnitude_f64,convert_to_f64_array) for reuse by follow-on vector functionsPart of #21536 — first in a series of split PRs (replacing #21371).
Test plan
cosine_distance.sltcovering all edge cases including empty arrays, LargeList, integer coercion, alias, return typecargo clippy,cargo fmt,taplo,prettier,cargo machete— all clean🤖 Generated with Claude Code