[python] Fix HDFS HA and ViewFS URI handling in PyArrowFileIO #7731
Open
TheR1sing3un wants to merge 1 commit into apache:master
ViewFS and HDFS HA URIs carry no port, so the previous
splitport(netloc) + int(port_str) path raised TypeError on
int(None) before the call ever reached pafs.HadoopFileSystem.
Resolve (host, port) up-front:
* viewfs:// (with or without netloc) -> host='default', port=0
so libhdfs reads fs.defaultFS and the ViewFS mount table.
* hdfs:// without an explicit port (HA nameservice or no netloc)
-> host='default', port=0 to delegate to fs.defaultFS.
* hdfs://host:port -> connect directly with the given host/port.
Add HdfsFileIOTest covering all three branches plus the existing
HADOOP_HOME / HADOOP_CONF_DIR guard checks.
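The three branches above can be sketched as a small standalone resolver. This is only an illustration under stated assumptions, not the code in this PR: the name resolve_hdfs_host_port is hypothetical, and the sketch parses the URI with urllib.parse.urlparse instead of the splitport call mentioned in the commit message.

```python
from typing import Tuple
from urllib.parse import urlparse


def resolve_hdfs_host_port(uri: str) -> Tuple[str, int]:
    """Hypothetical helper: map a URI to the (host, port) handed to libhdfs."""
    parsed = urlparse(uri)

    if parsed.scheme == "viewfs":
        # ViewFS: delegate to fs.defaultFS and the mount table in the Hadoop config.
        return "default", 0

    if parsed.scheme == "hdfs":
        # parsed.port is None when the netloc carries no ":port" suffix,
        # which covers both an HA nameservice and an empty netloc.
        if parsed.hostname and parsed.port is not None:
            # Note: urlparse lower-cases the hostname.
            return parsed.hostname, parsed.port
        return "default", 0

    raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")


if __name__ == "__main__":
    for example in (
        "viewfs://clusterName/data",
        "viewfs:///path",
        "hdfs://nameservice1/warehouse",
        "hdfs:///path",
        "hdfs://namenode:8020/warehouse",
    ):
        print(example, "->", resolve_hdfs_host_port(example))
```

Only a URI with an explicit port produces a direct connection in this sketch; every other shape falls back to ('default', 0), so the Hadoop configuration decides which NameNode is used.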
The failed CI run may be caused by a flaky test; a fix is in #7735.
Purpose
PyArrowFileIO._initialize_hdfs_fs calls splitport(netloc) followed by int(port_str). ViewFS and HDFS HA URIs have no port, so port_str is None and we hit TypeError: int() argument must be a string ... not 'NoneType' before reaching pafs.HadoopFileSystem.
This PR resolves (host, port) up-front so all three URI shapes work without surprising the user:
* viewfs://... (with or without netloc) → host='default', port=0 so libhdfs reads fs.defaultFS and resolves the ViewFS mount table from core-site.xml.
* hdfs://nameservice/... (HA, no port) or hdfs:///... (no netloc) → also host='default', port=0 to delegate to fs.defaultFS.
* hdfs://host:port/... → connect directly with the parsed host/port.
The host/port variables are reused by the existing Kerberos branch unchanged.
Linked issue
N/A. Surfaced when running pypaimon against a ViewFS-backed cluster and an HDFS HA nameservice without an explicit port.
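The failure described under Purpose can be reproduced in isolation. The sketch below is a stand-in for the old code path, not a copy of it: legacy_parse_port is a hypothetical name, and str.partition is used to mimic a port string that is None whenever the netloc has no ":port" suffix.

```python
from urllib.parse import urlparse


def legacy_parse_port(uri: str) -> int:
    """Stand-in for the old behaviour: convert the port string unconditionally."""
    netloc = urlparse(uri).netloc
    _host, sep, tail = netloc.partition(":")
    # Assumption: a missing ":port" yields None, as described for port_str above.
    port_str = tail if sep else None
    return int(port_str)  # int(None) raises TypeError


if __name__ == "__main__":
    for example in (
        "hdfs://namenode:8020/warehouse",
        "viewfs://clusterName/data",
        "hdfs://nameservice1/warehouse",
    ):
        try:
            print(example, "->", legacy_parse_port(example))
        except TypeError as exc:
            print(example, "-> TypeError:", exc)
```

Only the URI with an explicit port survives the conversion; the ViewFS and HA nameservice URIs fail before any filesystem object could be constructed.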
Tests
New HdfsFileIOTest in pypaimon/tests/file_io_test.py covering:
* test_viewfs_uses_default_host: viewfs://clusterName
* test_viewfs_without_netloc_uses_default_host: viewfs:///path
* test_hdfs_with_port_uses_explicit_host: hdfs://namenode:8020
* test_hdfs_ha_nameservice_without_port_uses_default_host: hdfs://nameservice1
* test_hdfs_without_netloc_uses_default_host: hdfs:///path
* test_hdfs_missing_hadoop_home_raises / test_hdfs_missing_hadoop_conf_dir_raises: guard checks
Local: pytest pypaimon/tests/file_io_test.py → 22 passed; flake8 --config=dev/cfg.ini clean.
API and format
No public API change. No file format change. Behaviour change is restricted to URI shapes that previously raised TypeError and are now usable.
Documentation
No documentation change required.
Generative AI disclosure
Drafted with assistance from an AI coding tool; all logic reviewed by the author and validated by the tests above.
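For readers unfamiliar with the guard-check tests listed under Tests, the following sketch shows one way such checks can be exercised without a Hadoop installation. Everything here is illustrative: check_hadoop_env is a hypothetical guard, the RuntimeError type and message are assumptions (the PR does not state which exception the real guard raises), and the paths are placeholders.

```python
import os
import unittest
from unittest import mock


def check_hadoop_env() -> None:
    """Hypothetical guard: require both variables before touching libhdfs."""
    for name in ("HADOOP_HOME", "HADOOP_CONF_DIR"):
        if not os.environ.get(name):
            # Exception type is an assumption made for this sketch.
            raise RuntimeError(f"{name} environment variable is not set")


class HadoopEnvGuardSketch(unittest.TestCase):
    def test_missing_hadoop_home_raises(self):
        env = {"HADOOP_CONF_DIR": "/etc/hadoop/conf"}  # placeholder path
        # clear=True replaces the whole environment for the duration of the block.
        with mock.patch.dict(os.environ, env, clear=True):
            with self.assertRaises(RuntimeError) as ctx:
                check_hadoop_env()
        self.assertIn("HADOOP_HOME", str(ctx.exception))

    def test_missing_hadoop_conf_dir_raises(self):
        env = {"HADOOP_HOME": "/opt/hadoop"}  # placeholder path
        with mock.patch.dict(os.environ, env, clear=True):
            with self.assertRaises(RuntimeError) as ctx:
                check_hadoop_env()
        self.assertIn("HADOOP_CONF_DIR", str(ctx.exception))

    def test_both_set_passes(self):
        env = {"HADOOP_HOME": "/opt/hadoop", "HADOOP_CONF_DIR": "/etc/hadoop/conf"}
        with mock.patch.dict(os.environ, env, clear=True):
            check_hadoop_env()  # should not raise
```

Saved to a file, the sketch can be run with python -m unittest. Patching os.environ with clear=True keeps the test independent of whatever Hadoop variables happen to be set on the machine running it.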