Skip to content

Branch 53 cherry pick 10#21925

Open
fred1268 wants to merge 62 commits intoapache:mainfrom
DataDog:branch-53-cherry-pick-10
Open

Branch 53 cherry pick 10#21925
fred1268 wants to merge 62 commits intoapache:mainfrom
DataDog:branch-53-cherry-pick-10

Conversation

@fred1268
Copy link
Copy Markdown
Contributor

comphead and others added 30 commits March 1, 2026 17:43
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->
Preparation for DataFusion release 53.0.0
- apache#19692


## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
…apache#20680)

## Which issue does this PR close?

N/A

## Rationale for this change

Backport for apache#20625

When enabling the `recursive_protection` feature for the `datafusion`
crate, the `sql` feature is enabled. This is undesirable if the
downstream project would like the `sql` feature to be off.

## What changes are included in this PR?

Use the `?` syntax for features of dependencies for
`recursive_protection`. This was already correctly done for other
features such as `unicode_expressions`.

<https://doc.rust-lang.org/cargo/reference/features.html>

## Are these changes tested?

N/A

## Are there any user-facing changes?

This makes dependency management better for downstream projects and is
not a breaking change.

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

Co-authored-by: Heran Lin <heran@lakesail.com>
…) queries (#… (apache#20726)

…20710)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123. -->

- Closes apache#20669 .

## Rationale for this change
Return probe_side.len() for count(*) queries
<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes. -->

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?
slt tests
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
)

## Which issue does this PR close?

- Closes apache#20704

## Rationale for this change

FFI_TableOptions fails with a warning that is getting swallowed in the
unit tests.

## What changes are included in this PR?

Correctly check format for table options.

## Are these changes tested?

Unit tests updated.

## Are there any user-facing changes?

None, internal only.

## Context

Related to apache#20705 but
targetting `branch-53`.
…tor::Colon` (apache#20717)

## Which issue does this PR close?

- part of apache#19692  
<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Needed in
datafusion-contrib/datafusion-variant#26

- Backport of apache#20628 on v53
branch.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

- `sqlparser-rs` currently exposes the colon operator (`:`) as a special
`JsonAccess` expression. So it fails in datafusion's parsing before an
`ExprPlanner` is even invoked.
- Add `Operator::Colon`. Currently it's not used/implemented in
datafusion.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

- Fixes the above problem by converting `JsonAccess` to a normal binary
expr, on which the `ExprPlanner` is invoked and custom parsing can be
done.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Added tests.

Also did a prototype of a custom `ExprPlanner` in datafusion-variant
using this to convert colon operator to `variant_get` function -
datafusion-contrib/datafusion-variant#31

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

Add `Operator::Colon`
…size() to reduce memory pool interactions (apache#20733)

Backport apache#20729 to `branch-53`.
…20672) (apache#20792)

- Part of apache#19692
- Closes apache#20683 on branch-53

**This PR:**
- Backports apache#20672 from
@xanderbailey to the [branch-53]

Co-authored-by: Xander <zander181@googlemail.com>
…flatten keys) (apache#20505) (apache#20791)

- Part of apache#19692
- Closes apache#20696 on branch-53

**This PR:**
- Backports apache#20505 from @alamb
to the [branch-53]

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…pache#20887)

- Part of apache#19692

This PR:
- Backports apache#20609 from
@comphead to the branch-53 line

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
…ing (apache#20635) (apache#20885)

- Part of apache#19692
- Closes apache#20634 on branch-53

This PR:
- Backports apache#20635 from
@neilconway to the branch-53 line

Co-authored-by: Neil Conway <neil.conway@gmail.com>
…t inner filter proves zero selectivity (apache#20743) (apache#20882)

- Part of apache#19692
- Closes apache#20742 on branch-53

This PR:
- Backports apache#20743 from
@haohuaijin to the branch-53 line

Co-authored-by: Huaijin <haohuaijin@gmail.com>
…truct cols (apache#20698) (apache#20884)

- Part of apache#19692
- Closes apache#20695 on branch-53

This PR:
- Backports apache#20698 from
@friendlymatthew to the branch-53 line

Co-authored-by: Matthew Kim <38759997+friendlymatthew@users.noreply.github.com>
- Part of apache#19692
- Closes apache#20737 on branch-53

This PR:
- Backports apache#20738 from
@haohuaijin to the branch-53 line

Co-authored-by: Huaijin <haohuaijin@gmail.com>
…he#20424) (apache#20890)

- Part of apache#19692

This PR:
- Backports apache#20424 from
@ariel-miculas to the branch-53 line

Co-authored-by: Ariel Miculas-Trif <ariel.miculas@gmail.com>
…tafusion-proto (apache#20574) (apache#20891)

- Part of apache#19692
- Closes apache#20575 on branch-53

This PR:
- Backports apache#20574 from
@nathanb9 to the branch-53 line

Co-authored-by: nathan <56370526+nathanb9@users.noreply.github.com>
…quired (apache#20900) (apache#20903)

- Part of apache#19692
- Backports apache#20900 from @askalt
to the branch-53 line

This PR:
- Backports apache#20900 to branch-53

Co-authored-by: Albert Skalt <133099191+askalt@users.noreply.github.com>
…ith ANSI mode support (apache#20461) (apache#20896)

- Part of apache#19692

This PR:
- Backports apache#20461 from
@davidlghellin to the branch-53 line

Co-authored-by: David López <hola@devel0pez.com>
…ing (apache#20372) (apache#20893)

- Part of apache#19692
- Closes apache#20371 on branch-53

This PR:
- Backports apache#20372 from
@erenavsarogullari to the branch-53 line

Co-authored-by: Eren Avsarogullari <eren@apache.org>
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
…pache#19739) (apache#20897)

- Part of apache#19692
- Closes apache#16281 on branch-53

This PR:
- Backports apache#19739 from
@kumarUjjawal to the branch-53 line

Co-authored-by: Kumar Ujjawal <ujjawalpathak6@gmail.com>
…ushed down into TableScan (apache#19884) (apache#20898)

- Part of apache#19692
- Closes apache#19840 on branch-53

This PR:
- Backports apache#19884 from @kosiew
to the branch-53 line

Co-authored-by: kosiew <kosiew@gmail.com>
…ry in ForeignTableProvider::scan (apache#20393) (apache#20895)

- Part of apache#19692

This PR:
- Backports apache#20393 from
@Kontinuation to the branch-53 line

Co-authored-by: Kristin Cowalcijk <bo@wherobots.com>
…LL) (apache#20391) (apache#20892)

- Part of apache#19692
- Closes apache#20388 on branch-53

This PR:
- Backports apache#20391 from
@fwojciec to the branch-53 line

Co-authored-by: Filip Wojciechowski <fwojciec@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…apache#20685) (apache#20914)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123. -->

- Closes apache#20611 .

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes. -->

## What changes are included in this PR?

The Spark function is actual wrapper on top of `array_has` function.
After result is being produced the nulls mask is set respectively for
the output indices which correspond to input rows having nulls

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
…ache#20724) (apache#20858) (apache#20918)

- Part of apache#20724
- Closes apache#20724 on branch-53

This PR:
- Backports apache#20858 from
@gboucher90 to the branch-53 line

Co-authored-by: gboucher90 <gboucher90@users.noreply.github.com>
…filter (apache#20231) (apache#20932)

- Part of apache#19692
- Closes apache#20194 on branch-53

This PR:
- Backports apache#20231 from
@EeshanBembi to the branch-53 line

Co-authored-by: EeshanBembi <33062610+EeshanBembi@users.noreply.github.com>
## Which issue does this PR close?

- Part of apache#19692

## Rationale for this change

We have backported a bunch of bug fixes to branch-53, so let's make sure
the release notes reflect that

## What changes are included in this PR?

I ran
```shell
uv run ./dev/release/generate-changelog.py 52.3.0 branch-53 53.0.0 > dev/changelog/53.0.0.md
```

And then had codex review via

```
› Please review dev/changelog/53.0.0.md to ensure it reflects all commits between where `apache/branch-53` and `main` diverged
```

Then I updated the change log to reflect the original authors not the
backport authors


## Are these changes tested?

By CI
dd-david-levin and others added 13 commits April 16, 2026 17:43
…mpatibility (apache#21104)

## Problem

`string_to_array` was returning incorrect results for empty string input
— both when the delimiter is non-empty and when the delimiter is itself
an empty string. This diverges from PostgreSQL behavior.

| Query | DataFusion (before) | PostgreSQL (expected) |
|---|---|---|
| `string_to_array('', ',')` | `['']` | `{}` |
| `string_to_array('', '')` | `['']` | `{}` |
| `string_to_array('', ',', 'x')` | `['']` | `{}` |
| `string_to_array('', '', 'x')` | `['']` | `{}` |

Results from datafusion-cli
<img width="1435" height="104" alt="Screenshot 2026-03-23 at 9 14 08 AM"
src="https://github.com/user-attachments/assets/2eaae366-7f8a-4220-87d2-f0b311c26712"
/>

**Root cause:** Rust's `str::split()` on an empty string always yields
one empty-string element, so `"".split(",")` produces `[""]`.
Additionally, the empty-delimiter branch unconditionally appended the
(empty) string value. This is subtle because Arrow's text display format
appears not to quote strings, so `[""]` renders as `[]` —
indistinguishable from an actual empty array. Using `cardinality()`
reveals the current length is 1, not 0.

**PostgreSQL reference:**
[db-fiddle](https://www.db-fiddle.com/f/oCF8EPaZFkDNKSg28rVVWy/3)

## Fix

In `datafusion/functions-nested/src/string.rs`:

- **Non-empty delimiter** `(Some(string), Some(delimiter))`: added `if
!string.is_empty()` guard to skip splitting when input is empty.
- **Empty delimiter** `(Some(string), Some(""))`: added `if
!string.is_empty()` guard so the string value is only appended when
non-empty.

Both the plain variant and the `null_value` variant are fixed (4 checks
total).

## Tests

Added sqllogictest cases in
`datafusion/sqllogictest/test_files/array.slt` using `cardinality()` to
unambiguously verify the arrays are truly empty (not just displaying as
empty):

```sql
SELECT cardinality(string_to_array('', ','))    -- 0
SELECT cardinality(string_to_array('', ''))     -- 0
SELECT cardinality(string_to_array('', ',', 'x'))  -- 0
SELECT cardinality(string_to_array('', '', 'x'))   -- 0
```

Each test covers one of the four `is_empty` guard checks. All four fail
without the fix (returning 1 instead of 0).

(cherry picked from commit cdaecf0)
…ray_any_value`. (apache#21672)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes #. apache#21671

## Rationale for this change

Align the implementation of `array_element`, `array_any_value` with
Arrow's spec concerning Null rows. Specifically allow the below from
[Arrow's
spec](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout)

> It should be noted that a null value may have a positive slot length.
That is, a null value may occupy a non-empty memory space in the data
buffer. When this is true, the content of the corresponding memory space
is undefined.

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## Are these changes tested?

Unit tests included.
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

(cherry picked from commit 4c7bb08)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes #.

## Rationale for this change

Create `array_compact` function which removes NULLs from input array.
There is no direct counterparty in DuckDB however the function used in
Spark, SnowFlake

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

(cherry picked from commit 26c6121)
…#21762)

## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- None

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Sometimes, when an `ExecutionPlan` implementation is complex, different
metrics are collected from different structs that compose the whole
execution plan.

These metrics need to eventually be served from the single entrypoint
`ExecutionPlan::metrics()` or `DataSource::metrics()`, and the current
api does not have good methods for merging several
`ExecutionPlanMetricsSet` coming from different sources into a single
one.

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Add some basic conversion and iteration methods for `MetricsSet` and
`ExecutionPlanMetricsSet`, in order to improve ergonomics around these
structs.

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

This is purely just basic std trait implementations and method exposure,
so as long as the code compiles, I don't think it needs further tests.

## Are there any user-facing changes?

People will see some more available methods in the `MetricsSet` and
`ExecutionPlanMetricsSet` structs for ergonomics.

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

(cherry picked from commit ff844be)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes part of
datafusion-contrib/datafusion-distributed#361.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

The custom metrics display is unnecessarily prepending the name of the
metric during the display:

```
network_latency_min_2=name:network_latency_min 274.88µs
```

When it should just be:
```
network_latency_min_2=274.88µs
```

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Fix to the display

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

yes, by a new test

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

(cherry picked from commit 657887d)
- Closes apache#21763

Add support for nested types to the `nullif` UDF.

Unit tests included.

No changes to the function's signature.

---------

Co-authored-by: Gabriel <45515538+gabotechs@users.noreply.github.com>
(cherry picked from commit 22bb4e6)
## Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax. For example
`Closes apache#123` indicates that this PR will close issue apache#123.
-->

- Closes #.

## Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Before the PR, explain analyze for copyinto doesn't have metric

## What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Support metrics for ParquetSink

## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->
Yes, ut + sqllogictest

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
(cherry picked from commit 95de1bf)
- Adds `MetricCategory` enum (`Rows`, `Bytes`, `Timing`) classifying
metrics by what they measure and, critically, their **determinism**:
rows/bytes are deterministic given the same plan+data; timing varies
across runs.
- Each `Metric` can now declare its category via
`MetricBuilder::with_category()`. Well-known builder methods
(`output_rows`, `elapsed_compute`, `output_bytes`, etc.) set the
category automatically. Custom counters/gauges default to "always
included".
- New session config `datafusion.explain.analyze_categories` accepts
`all` (default), `none`, or comma-separated `rows`, `bytes`, `timing`.
- This is orthogonal to the existing `analyze_level` (summary/dev) which
controls verbosity.

Running `EXPLAIN ANALYZE` in `.slt` tests currently requires liberal use
of `<slt:ignore>` for every non-deterministic timing metric. With this
change, a test can simply:

```sql
SET datafusion.explain.analyze_categories = 'rows';
EXPLAIN ANALYZE SELECT ...;
-- output contains only row-count metrics — fully deterministic, no <slt:ignore> needed
```

In particular, for dynamic filters we have relatively complex
integration tests that exist mostly to assert the plan shapes and state
of the dynamic filters after the plan has been executed. For example
tests. I've also wanted to e.g. make assertions about pruning
effectiveness without having timing information included.

- [x] New Rust integration test `explain_analyze_categories` covering
all combos (rows, none, all, rows+bytes)
- [x] New `.slt` tests in `explain_analyze.slt` for `rows`, `none`,
`rows,bytes`, and `rows` with dev level
- [x] Existing `explain_analyze` integration tests pass (24/24)
- [x] Proto roundtrip test updated and passing
- [x] `information_schema` slt updated for new config entry
- [x] Full `core_integration` suite passes (918 tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
(cherry picked from commit 2c03881)
…e only (apache#21825)

- Closes apache#21797.

`ParquetSink` registered an `elapsed_compute` metric using a single
wall-clock timer that spanned the entire write operation — upstream
batch wait, CPU Arrow→Parquet encoding, and object-store I/O all rolled
into one number. This
made the metric misleading: it inflated `elapsed_compute` with I/O
latency, which is inconsistent with how every other operator in
DataFusion reports this metric (CPU time only).

Two write paths are fixed independently:

**Sequential path** (`allow_single_file_parallelism = false` or CDC
enabled):
- A new `TimingWriter<W>` wrapper implements `AsyncFileWriter` and
records wall-clock time spent in I/O calls (`write` / `complete`).
- The total time inside `writer.write()` and `writer.close()` is
accumulated in `total_write_time`. After all tasks join,
`elapsed_compute` is set to `total_write_time − io_time`, isolating pure
Arrow→Parquet encoding time.

**Parallel path** (`allow_single_file_parallelism = true`, default):
- `encoding_time: Time` (a clone of the registered `elapsed_compute`
metric) is threaded through the five-function call chain down to the two
leaf sites: `writer.write()` in `column_serializer_task` and
`writer.close()` in `spawn_rg_join_and_finalize_task`. Since `Time` is
`Arc<AtomicUsize>`, all concurrent column tasks accumulate directly into
the registered metric.
- Note: on the parallel path, `append_to_row_group()` in
`concatenate_parallel_row_groups` is interleaved with I/O and cannot be
cleanly isolated. It is excluded from `elapsed_compute`. This is
acceptable since it operates on already-encoded data and represents a
small fraction of total encoding CPU time.

Yes. Two tests are added/extended in
`datafusion/core/src/dataframe/parquet.rs`:
- `test_parquet_sink_metrics_sequential` (new): verifies
`elapsed_compute > 0` with `allow_single_file_parallelism = false`.
- `test_parquet_sink_metrics_parallel`: extended with an
`elapsed_compute > 0` assertion (was previously missing for the parallel
path).

The existing `test_parquet_sink_metrics` test (parallel path, default
config) already asserted `elapsed_compute > 0` and continues to pass.

No API changes. The `elapsed_compute` metric was already surfaced — this
PR makes its value accurate rather than introducing a new metric.

(cherry picked from commit 3aefba7)
@github-actions github-actions Bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate common Related to common crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation datasource Changes to the datasource crate ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate spark labels Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate development-process Related to development process of DataFusion documentation Improvements or additions to documentation execution Related to the execution crate ffi Changes to the ffi crate functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate proto Related to proto crate spark sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.