feat: add vector distance and array math functions #21371

Draft
crm26 wants to merge 4 commits into apache:main from crm26:feat/vector-distance-functions

Conversation

@crm26
Contributor

crm26 commented Apr 4, 2026

Summary

Adds vector distance and array math functions to datafusion-functions-nested, enabling vector search and array algebra in standard SQL.

-- Vector search: find nearest neighbors by cosine distance
SELECT id, cosine_distance(embedding, ARRAY[0.1, 0.2, ...]) as dist
FROM documents ORDER BY dist LIMIT 10

-- Array math
SELECT array_normalize(embedding) FROM documents
SELECT array_add(vec_a, vec_b) FROM t
SELECT array_scale(embedding, 2.0) FROM documents

Functions

| Function | Returns | Description |
| --- | --- | --- |
| cosine_distance(a, b) | float64 | 1 - cosine similarity |
| inner_product(a, b) | float64 | Dot product |
| array_normalize(a) | list(float64) | Unit vector |
| array_add(a, b) | list(float64) | Element-wise addition |
| array_subtract(a, b) | list(float64) | Element-wise subtraction |
| array_scale(a, f) | list(float64) | Scalar multiplication |

All have list_* aliases. inner_product also aliased as dot_product.
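
To make the four array math functions in the table concrete, here is a minimal sketch over plain `f64` slices. It is not the PR's implementation, which works on Arrow list arrays; the signatures are hypothetical, and returning `None` for mismatched lengths or a zero-magnitude input is an assumption made only for this illustration.

```rust
/// Element-wise addition. Returns `None` on a length mismatch
/// (assumption for this sketch; the PR may report an error instead).
fn array_add(a: &[f64], b: &[f64]) -> Option<Vec<f64>> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| x + y).collect())
}

/// Element-wise subtraction, with the same length rule as `array_add`.
fn array_subtract(a: &[f64], b: &[f64]) -> Option<Vec<f64>> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| x - y).collect())
}

/// Multiplies every element by the scalar `f`.
fn array_scale(a: &[f64], f: f64) -> Vec<f64> {
    a.iter().map(|x| x * f).collect()
}

/// L2 normalization: divides each element by the vector's magnitude.
/// Returns `None` for a zero-magnitude vector (assumption for this sketch).
fn array_normalize(a: &[f64]) -> Option<Vec<f64>> {
    let magnitude = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    if magnitude == 0.0 {
        return None;
    }
    Some(a.iter().map(|x| x / magnitude).collect())
}
```

For example, `array_normalize(&[3.0, 4.0])` divides both elements by the magnitude 5, giving a vector of length 1.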

Design

Shared primitives in vector_math.rs:

  • dot_product_f64(a, b) — used by inner_product and cosine_distance
  • magnitude_f64(a) — used by cosine_distance and array_normalize
  • sum_of_squares_f64(a) — used by magnitude_f64
  • convert_to_f64_array(a) — shared with existing array_distance

The duplicate convert_to_f64_array in the existing distance.rs is consolidated into the shared module.
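
The layering of the primitives listed above can be sketched as follows. The function names come from the PR description, but the slice-based signatures are an assumption for illustration; the real helpers in vector_math.rs presumably operate on Arrow arrays. The `cosine_distance_sketch` function is a hypothetical caller showing how two primitives combine, and it skips the NULL and length checks.

```rust
/// Sum of squared elements; the building block for the magnitude.
fn sum_of_squares_f64(a: &[f64]) -> f64 {
    a.iter().map(|x| x * x).sum()
}

/// Euclidean (L2) magnitude, built on `sum_of_squares_f64`.
fn magnitude_f64(a: &[f64]) -> f64 {
    sum_of_squares_f64(a).sqrt()
}

/// Dot product of two slices. This sketch assumes equal lengths;
/// `zip` silently stops at the shorter input.
fn dot_product_f64(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Hypothetical caller: 1 - cosine similarity, composed from the primitives.
/// No zero-magnitude or length handling here.
fn cosine_distance_sketch(a: &[f64], b: &[f64]) -> f64 {
    1.0 - dot_product_f64(a, b) / (magnitude_f64(a) * magnitude_f64(b))
}
```

Keeping `sum_of_squares_f64` separate from `magnitude_f64` lets a caller that only compares magnitudes skip the square root.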

The new functions follow the pattern of the existing array_distance function: same signature style, coerce_types, null handling, and type support (Float32, Float64, Int32, Int64, FixedSizeList, LargeList, List).

Tests

79 tests including: normal inputs, null handling, zero vectors, orthogonal vectors, empty arrays, Float32/Float64, mismatched lengths, vector search ranking pattern. Sqllogictest coverage in vector_functions.slt. Clippy clean.

crm26 and others added 2 commits April 4, 2026 16:24
Add 6 new scalar functions to datafusion-functions-nested:
- cosine_distance(array, array) — cosine distance (1 - cosine similarity)
- inner_product(array, array) — dot product
- array_normalize(array) — L2 unit normalization
- array_add(array, array) — element-wise addition
- array_subtract(array, array) — element-wise subtraction
- array_scale(array, float) — scalar multiplication

Shared math primitives (dot_product, magnitude, sum_of_squares) extracted
into vector_math.rs to avoid duplication across functions.

Includes aliases (list_*, dot_product), 29 unit tests, and a sqllogictest
file with vector search pattern coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds cosine_distance, inner_product, array_normalize, array_add,
array_subtract, and array_scale to datafusion-functions-nested.

Shared primitives in vector_math.rs (dot_product_f64, magnitude_f64,
sum_of_squares_f64, convert_to_f64_array) are reused across all
functions and the existing array_distance. Consolidates the duplicate
convert_to_f64_array from distance.rs into the shared module.

Functions:
  cosine_distance(a, b) → float64    (aliases: list_cosine_distance)
  inner_product(a, b) → float64      (aliases: list_inner_product, dot_product)
  array_normalize(a) → list(float64) (aliases: list_normalize)
  array_add(a, b) → list(float64)    (aliases: list_add)
  array_subtract(a, b) → list(float64) (aliases: list_subtract)
  array_scale(a, f) → list(float64)  (aliases: list_scale)

Enables vector search in standard SQL:
  SELECT id, cosine_distance(embedding, ARRAY[0.1, 0.2, ...]) as dist
  FROM documents ORDER BY dist LIMIT 10

79 tests, sqllogictest coverage, clippy clean.
github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels Apr 4, 2026
@alamb
Contributor

alamb commented Apr 9, 2026

Hi -- thank you for this PR. I think it will be challenging to review a PR of this size.

To help review can you please:

  1. file a ticket to track adding these new functions, with a reference to the other implementation's documentation
  2. break the PR into smaller individual PRs, one per new function

Thank you so much 🙏

@crm26
Contributor Author

crm26 commented Apr 10, 2026

Done — filed tracking issue #21536 and splitting into one PR per function. First up: cosine_distance with shared primitives in vector_math.rs. Will close this PR once the splits are submitted.

Jefffrey marked this pull request as draft April 17, 2026 09:09
zzcclp pushed a commit to zzcclp/arrow-datafusion that referenced this pull request Apr 23, 2026
## Summary

- Adds `cosine_distance(array1, array2)` / `list_cosine_distance` —
computes cosine distance (1 - cosine similarity) between two numeric
arrays
- Introduces shared `vector_math.rs` primitives (`dot_product_f64`,
`magnitude_f64`, `convert_to_f64_array`) for reuse by follow-on vector
functions
- Returns NULL for zero-magnitude vectors; errors on mismatched lengths
- Supports List, LargeList, and FixedSizeList with any numeric element
type
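
The edge cases in the summary above (NULL for zero-magnitude vectors, an error for mismatched lengths) can be sketched on plain slices. This is an illustration under assumptions, not the merged code: the `Result<Option<f64>, String>` return type is hypothetical, with `Ok(None)` standing in for SQL NULL.

```rust
/// Cosine distance (1 - cosine similarity) with the edge-case behavior
/// described in the summary. `Ok(None)` models a SQL NULL result.
fn cosine_distance(a: &[f64], b: &[f64]) -> Result<Option<f64>, String> {
    // Mismatched lengths are an error, not a NULL.
    if a.len() != b.len() {
        return Err(format!(
            "array lengths differ: {} vs {}",
            a.len(),
            b.len()
        ));
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let mag_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let mag_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    // The cosine is undefined when either vector has zero magnitude.
    if mag_a == 0.0 || mag_b == 0.0 {
        return Ok(None);
    }
    Ok(Some(1.0 - dot / (mag_a * mag_b)))
}
```

With this definition, identical directions give 0, orthogonal vectors give 1, and opposite directions give 2.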

Part of apache#21536 — first in a series of split PRs (replacing apache#21371).

## Test plan

- [x] Unit tests: identical, orthogonal, opposite, 45-degree,
zero-magnitude, mismatched-length, NULL, multi-row
- [x] sqllogictest: `cosine_distance.slt` covering all edge cases
including empty arrays, LargeList, integer coercion, alias, return type
- [x] Full slt suite (426/426 pass)
- [x] `cargo clippy`, `cargo fmt`, `taplo`, `prettier`, `cargo machete`
— all clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
pull Bot pushed a commit to buraksenn/datafusion that referenced this pull request May 3, 2026
## Which issue does this PR close?

Part of apache#21536 — split of apache#21371 into one-function-per-PR.

## Rationale for this change

Adds `inner_product(array1, array2)` — the dot product of two
equal-length numeric arrays, returning `Float64`. Computed as
`sum(array1[i] * array2[i])`.

## What changes are included in this PR?

Mirrors the structural pattern of merged apache#21542 (`cosine_distance`):

- Same `coerce_types` for `List`/`LargeList`/`FixedSizeList` of any
numeric inner type, with widening to `LargeList` when any input is
`LargeList` (per the apache#21704 pattern)
- Same NULL semantics: bare `NULL` → `NULL`, NULL row → NULL, NULL
element in list → NULL
- Same Arrow-idiomatic implementation: single
`as_float64_array(list_array.values())` downcast, slice by
`value_offsets()`, iterate via `ScalarBuffer<f64>`
- No alias, no shared module — standalone, inline math
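
The "slice by `value_offsets()`" approach relies on the Arrow list layout: all rows share one flat values buffer, and row `i` spans `offsets[i]..offsets[i + 1]`. The sketch below mimics that layout with plain slices rather than Arrow types, so `row_slices` and `dot_per_row` are hypothetical helper names and not functions from the PR. It uses `i32` offsets, as a `List` does; a `LargeList` uses `i64`.

```rust
/// Splits a flat values buffer into per-row slices using list offsets.
/// Row `i` covers `values[offsets[i]..offsets[i + 1]]`.
fn row_slices<'a>(values: &'a [f64], offsets: &[i32]) -> Vec<&'a [f64]> {
    offsets
        .windows(2)
        .map(|w| &values[w[0] as usize..w[1] as usize])
        .collect()
}

/// Computes one dot product per row for two list columns.
/// Assumes both columns have the same number of rows and no NULLs.
fn dot_per_row(
    values_a: &[f64],
    offsets_a: &[i32],
    values_b: &[f64],
    offsets_b: &[i32],
) -> Result<Vec<f64>, String> {
    let rows_a = row_slices(values_a, offsets_a);
    let rows_b = row_slices(values_b, offsets_b);
    rows_a
        .iter()
        .zip(&rows_b)
        .map(|(a, b)| {
            if a.len() != b.len() {
                return Err(format!("row lengths differ: {} vs {}", a.len(), b.len()));
            }
            Ok(a.iter().zip(b.iter()).map(|(x, y)| x * y).sum::<f64>())
        })
        .collect()
}
```

Because every row is a view into the same buffer, the values are downcast once for the whole column instead of once per row.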

The arithmetic is the only semantic divergence from `cosine_distance`:
- `dot += a*b` (no magnitude or normalization)
- Empty arrays return `0.0` (sum of empty set), not `NULL`
- No zero-magnitude special case (`inner_product([0,0], [1,2])` returns
`0`, which is well-defined for inner product)
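
The semantics described above can be summarized in a small sketch. It models a possibly-NULL row as `Option<&[Option<f64>]>` and SQL NULL as `None`; that encoding, the signature, and the order of the length and NULL-element checks are assumptions for illustration, not the merged implementation.

```rust
/// Inner product with the NULL and empty-array behavior described above.
/// `None` for a row or an element models SQL NULL.
fn inner_product(
    a: Option<&[Option<f64>]>,
    b: Option<&[Option<f64>]>,
) -> Result<Option<f64>, String> {
    // A NULL on either side yields NULL.
    let (a, b) = match (a, b) {
        (Some(a), Some(b)) => (a, b),
        _ => return Ok(None),
    };
    // Mismatched lengths are an error.
    if a.len() != b.len() {
        return Err(format!("array lengths differ: {} vs {}", a.len(), b.len()));
    }
    // Starting from 0.0 means two empty arrays return 0.0, not NULL.
    let mut dot = 0.0;
    for (x, y) in a.iter().zip(b) {
        match (x, y) {
            (Some(x), Some(y)) => dot += x * y,
            // A NULL element anywhere makes the whole result NULL.
            _ => return Ok(None),
        }
    }
    Ok(Some(dot))
}
```

Unlike cosine distance, a zero vector needs no special case here: the loop simply accumulates zeros.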

## Are these changes tested?

Yes. SLT covers:
- Orthogonal, identical, opposite, general non-trivial vectors
- Single zero vector, both zero vectors
- Bare `NULL` in either or both positions
- NULL element inside a list (returns NULL for that row)
- Mismatched lengths (error)
- `LargeList` inputs
- Mixed `(List, LargeList)` in both orders
- `(FixedSizeList, FixedSizeList)` and `(FixedSizeList, LargeList)`
- `Float32` and `Int64` inner type coercion
- Multi-row query with NULL row propagation
- Empty arrays (returns `0`)
- No-args error
- Return-type assertion (`Float64`)

## Are there any user-facing changes?

New scalar function `inner_product`, documented in
`docs/source/user-guide/sql/scalar_functions.md`.

---------

Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>