
feat: add dictionary_columns to scan API for memory-efficient string reads#3234

Open
tanmayrauth wants to merge 1 commit into apache:main from tanmayrauth:feat/dictionary-columns-scan

Conversation

@tanmayrauth

Exposes dictionary_columns: tuple[str, ...] | None = None on Table.scan() and DataScan, threading it through to PyArrow's ParquetFileFormat so that named columns are read as DictionaryArray instead of plain large_utf8. This dramatically reduces memory usage for high-cardinality repeated JSON/string columns (issue #3168) and addresses the general scan parameter extensibility request (issue #3170).

Key implementation details:

  • ORC files are guarded: dictionary_columns is only passed through for Parquet
  • ArrowScan.to_table() rebuilds the Arrow schema with dict types before the empty-table fast-path so schema is consistent regardless of row count
  • DataScan.to_arrow_batch_reader() rebuilds target_schema with dict types to prevent .cast() from silently decoding DictionaryArray back to plain string
  • DataScan.__init__ declares and stores the param so TableScan.update() (which uses inspect.signature) preserves it across scan copies

Fixes #3168, closes #3170

Rationale for this change

Are these changes tested? Yes

Are there any user-facing changes? No

@tanmayrauth tanmayrauth force-pushed the feat/dictionary-columns-scan branch from 52b2070 to 9fc3b0c on April 13, 2026 at 21:48
@tanmayrauth
Author

@kevinjqliu @Fokko can you please review and approve this?

@tanmayrauth
Author

tanmayrauth commented Apr 16, 2026

@geruh @kevinjqliu @Fokko can you please review this implementation?

