Remove Empty Chunks in Cloudfetch concatenation#814
Conversation
|
P2 — Minor
fetchall_arrow uses partial_result_chunks: List = [] while fetchmany_arrow:313 uses List["pyarrow.Table"]. Cosmetic; align both.
If a pathological server returned 0-row chunks while leaving has_more_rows=True forever, both methods spin. The old code (append-without-decrement) and the new code (continue) have identical semantics here.
All four new tests place the placeholder first. The continue path is exercised, but only with the initial queue being the placeholder — there's no [real, placeholder, real] test that locks in the mid-stream
_StubArrowQueue returns the full table on first call and slice(0, 0) thereafter. Production next_n_rows(N) returns up to N rows. The tests happen to call fetchmany_arrow(3) against a 3-row table so the |
Comments regarding the tests are fine, but our main point is to check whether concatenation is averted or not with the empty chunks and this works, so its fine |
What type of PR is this?
Description
ThriftResultSet.fetchmany_arrowandfetchall_arrowpreviously appended every chunkreturned by the underlying queue — including the 0-row placeholder that
CloudFetchQueue._create_empty_table()emits whenself.table is None— intopartial_result_chunks, and then handed the list toconcat_table_chunks.When the placeholder's schema differs from the real downloaded chunks (which it can,
because it is built from
schema_bytesthat may be stale, or schemaless whenschema_bytes is None),pyarrow.concat_tables(..., promote_options="default")silently introduces phantom columns filled with NULLs, or — for type mismatches —
raises.
Fix: hold any 0-row chunk aside in a local
zero_row_tableinstead of appending it.The concat list now only ever contains real chunks, which share a consistent schema.
If every chunk turned out to be 0-row (genuinely empty result set), fall back to
appending the held-aside placeholder so the method still returns a valid
pyarrow.Table.CloudFetchQueueitself is unchanged. The columnar fetch paths are unchanged(they only run with
ColumnQueue, whose empty slices always carry the right schema).The SEA arrow methods are unchanged (single queue call per invocation, no concat).
How is this tested?
Added 4 regression tests in
tests/unit/test_fetches.py:test_fetchall_arrow_drops_mismatched_empty_placeholder— first queue returns a0-row placeholder with column
stale_col; second returns real data withcol0.Asserts the result has only
col0. Verified to fail against the pre-fix code with['stale_col', 'col0'] != ['col0'].test_fetchmany_arrow_drops_mismatched_empty_placeholder— same forfetchmany_arrow.test_fetchall_arrow_all_empty_returns_zero_row_table— every chunk is 0-row;asserts a
pyarrow.Tablewithnum_rows == 0is returned (the held-aside fallback fires).test_fetchmany_arrow_all_empty_returns_zero_row_table— same forfetchmany_arrow.Full unit suite (
pytest tests/unit -x): 765 passed, 4 skipped.Integration / e2e to be run separately.