
feat: add dictionary_columns to scan API for memory-efficient string reads#3234

Open
tanmayrauth wants to merge 1 commit into apache:main from tanmayrauth:feat/dictionary-columns-scan

Conversation

@tanmayrauth

Exposes dictionary_columns: tuple[str, ...] | None = None on Table.scan() and DataScan, threading it through to PyArrow's ParquetFileFormat so that named columns are read as DictionaryArray instead of plain large_utf8. This dramatically reduces memory usage for high-cardinality repeated JSON/string columns (issue #3168) and addresses the general scan parameter extensibility request (issue #3170).

Key implementation details:

  • ORC files are guarded: dictionary_columns is only passed through for Parquet
  • ArrowScan.to_table() rebuilds the Arrow schema with dict types before the empty-table fast-path so schema is consistent regardless of row count
  • DataScan.to_arrow_batch_reader() rebuilds target_schema with dict types to prevent .cast() from silently decoding DictionaryArray back to plain string
  • DataScan.__init__ declares and stores the param so TableScan.update() (which uses inspect.signature) preserves it across scan copies

Fixes #3168, closes #3170

Rationale for this change

Are these changes tested? Yes

Are there any user-facing changes? No

@tanmayrauth tanmayrauth force-pushed the feat/dictionary-columns-scan branch from 52b2070 to 9fc3b0c on April 13, 2026 at 21:48
@tanmayrauth
Author

@kevinjqliu @Fokko can you please review and approve this?

@tanmayrauth
Author

tanmayrauth commented Apr 16, 2026

@geruh @kevinjqliu @Fokko can you please review this implementation?

