Skip to content

Remove Empty Chunks in Cloudfetch concatenation#814

Merged
jprakash-db merged 2 commits into
mainfrom
jprakash-db/rm-mt-chunk
May 30, 2026
Merged

Remove Empty Chunks in Cloudfetch concatenation#814
jprakash-db merged 2 commits into
mainfrom
jprakash-db/rm-mt-chunk

Conversation

@jprakash-db
Copy link
Copy Markdown
Contributor

@jprakash-db jprakash-db commented May 27, 2026

What type of PR is this?

Description

ThriftResultSet.fetchmany_arrow and fetchall_arrow previously appended every chunk
returned by the underlying queue — including the 0-row placeholder that
CloudFetchQueue._create_empty_table() emits when self.table is None — into
partial_result_chunks, and then handed the list to concat_table_chunks.

When the placeholder's schema differs from the real downloaded chunks (which it can,
because it is built from schema_bytes that may be stale, or schemaless when
schema_bytes is None), pyarrow.concat_tables(..., promote_options="default")
silently introduces phantom columns filled with NULLs, or — for type mismatches —
raises.

Fix: hold any 0-row chunk aside in a local zero_row_table instead of appending it.
The concat list now only ever contains real chunks, which share a consistent schema.
If every chunk turned out to be 0-row (genuinely empty result set), fall back to
appending the held-aside placeholder so the method still returns a valid pyarrow.Table.

CloudFetchQueue itself is unchanged. The columnar fetch paths are unchanged
(they only run with ColumnQueue, whose empty slices always carry the right schema).
The SEA arrow methods are unchanged (single queue call per invocation, no concat).

How is this tested?

Added 4 regression tests in tests/unit/test_fetches.py:

  • test_fetchall_arrow_drops_mismatched_empty_placeholder — first queue returns a
    0-row placeholder with column stale_col; second returns real data with col0.
    Asserts the result has only col0. Verified to fail against the pre-fix code with
    ['stale_col', 'col0'] != ['col0'].
  • test_fetchmany_arrow_drops_mismatched_empty_placeholder — same for fetchmany_arrow.
  • test_fetchall_arrow_all_empty_returns_zero_row_table — every chunk is 0-row;
    asserts a pyarrow.Table with num_rows == 0 is returned (the held-aside fallback fires).
  • test_fetchmany_arrow_all_empty_returns_zero_row_table — same for fetchmany_arrow.

Full unit suite (pytest tests/unit -x): 765 passed, 4 skipped.

Integration / e2e to be run separately.

@gopalldb
Copy link
Copy Markdown

P2 — Minor

  1. Inconsistent type annotation
    result_set.py:372

fetchall_arrow uses partial_result_chunks: List = [] while fetchmany_arrow:313 uses List["pyarrow.Table"]. Cosmetic; align both.

  1. Pre-existing infinite-loop semantics preserved
    result_set.py:325-336, 382-388

If a pathological server returned 0-row chunks while leaving has_more_rows=True forever, both methods spin. The old code (append-without-decrement) and the new code (continue) have identical semantics here.
Not introduced by this PR, but the new shape makes it easier to add a max-consecutive-empties safety bound as a follow-up.

  1. No mid-stream-placeholder test
    tests/unit/test_fetches.py

All four new tests place the placeholder first. The continue path is exercised, but only with the initial queue being the placeholder — there's no [real, placeholder, real] test that locks in the mid-stream
continue behaviour in isolation from the initial-call branch.

  1. Stub queue minor mismatch with production next_n_rows
    tests/unit/test_fetches.py:99-108

_StubArrowQueue returns the full table on first call and slice(0, 0) thereafter. Production next_n_rows(N) returns up to N rows. The tests happen to call fetchmany_arrow(3) against a 3-row table so the
difference doesn't matter. Pure fixture observation.

Copy link
Copy Markdown

@gopalldb gopalldb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some minor comments

@jprakash-db
Copy link
Copy Markdown
Contributor Author

Pre-existing infinite-loop semantics preserved
Why would the server return has more rows and then return 0 rows ? This is a server side issue

Comments regarding the tests are fine, but our main point is to check whether concatenation is averted or not with the empty chunks and this works, so its fine

@jprakash-db jprakash-db merged commit 13987de into main May 30, 2026
45 of 46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants