[python] Fix HDFS HA and ViewFS URI handling in PyArrowFileIO #7731

Open
TheR1sing3un wants to merge 1 commit into apache:master from TheR1sing3un:py-fix-hdfs-ha-viewfs-uri

Conversation

@TheR1sing3un
Member

Purpose

PyArrowFileIO._initialize_hdfs_fs calls splitport(netloc) followed by int(port_str). ViewFS and HDFS HA URIs carry no port, so port_str is None and int(port_str) raises TypeError: int() argument must be a string ... not 'NoneType' before the call ever reaches pafs.HadoopFileSystem.
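
The failure is easy to reproduce with the standard library alone. The sketch below approximates the old code path; `old_resolution` is a hypothetical name for illustration, not the actual pypaimon code:

```python
from urllib.parse import urlparse

def old_resolution(location):
    """Approximates the previous behavior: int() the parsed port
    unconditionally. Hypothetical name, for illustration only."""
    parsed = urlparse(location)
    # parsed.port is None for viewfs:// URIs and for HA hdfs://
    # nameservice URIs, so int(None) raises TypeError.
    return parsed.hostname, int(parsed.port)

try:
    old_resolution("hdfs://nameservice1/warehouse/tbl")
except TypeError as exc:
    print(exc)  # int() argument must be a string ... not 'NoneType'
```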

This PR resolves (host, port) up-front so all three URI shapes work without surprising the user:

  • viewfs://... (with or without netloc) → host='default', port=0 so libhdfs reads fs.defaultFS and resolves the ViewFS mount table from core-site.xml.
  • hdfs://nameservice/... (HA, no port) or hdfs:///... (no netloc) → also host='default', port=0 to delegate to fs.defaultFS.
  • hdfs://host:port/... → connect directly with the parsed host/port.

The host/port variables are reused by the existing Kerberos branch unchanged.

Linked issue

N/A — surfaced when running pypaimon against a ViewFS-backed cluster and an HDFS HA nameservice without an explicit port.

Tests

New HdfsFileIOTest in pypaimon/tests/file_io_test.py covering:

  • test_viewfs_uses_default_host — viewfs://clusterName
  • test_viewfs_without_netloc_uses_default_host — viewfs:///path
  • test_hdfs_with_port_uses_explicit_host — hdfs://namenode:8020
  • test_hdfs_ha_nameservice_without_port_uses_default_host — hdfs://nameservice1
  • test_hdfs_without_netloc_uses_default_host — hdfs:///path
  • test_hdfs_missing_hadoop_home_raises / test_hdfs_missing_hadoop_conf_dir_raises — guard checks
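
One of those cases could be structured roughly like this. This is a self-contained sketch, not the actual test code: FakeHadoopFileSystem stands in for the patched pafs.HadoopFileSystem, and resolve mirrors the fix:

```python
from urllib.parse import urlparse

def resolve(uri):
    # Mirrors the fix: 'default'/0 unless an explicit host:port is given.
    parsed = urlparse(uri)
    if parsed.scheme == "viewfs" or not parsed.netloc or parsed.port is None:
        return "default", 0
    return parsed.hostname, parsed.port

class FakeHadoopFileSystem:
    """Stand-in for pafs.HadoopFileSystem that records its connect
    arguments instead of touching a real cluster."""
    def __init__(self, host, port):
        self.host, self.port = host, port

def test_hdfs_ha_nameservice_without_port_uses_default_host():
    fs = FakeHadoopFileSystem(*resolve("hdfs://nameservice1/warehouse"))
    assert (fs.host, fs.port) == ("default", 0)

test_hdfs_ha_nameservice_without_port_uses_default_host()
```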

Local: pytest pypaimon/tests/file_io_test.py → 22 passed; flake8 --config=dev/cfg.ini clean.

API and format

No public API change. No file format change. Behaviour change is restricted to URI shapes that previously raised TypeError and are now usable.

Documentation

No documentation change required.

Generative AI disclosure

Drafted with assistance from an AI coding tool; all logic reviewed by the author and validated by the tests above.

ViewFS and HDFS HA URIs carry no port, so the previous
splitport(netloc) + int(port_str) path raised TypeError on
int(None) before the call ever reached pafs.HadoopFileSystem.

Resolve (host, port) up-front:
  * viewfs:// (with or without netloc) -> host='default', port=0
    so libhdfs reads fs.defaultFS and the ViewFS mount table.
  * hdfs:// without an explicit port (HA nameservice or no netloc)
    -> host='default', port=0 to delegate to fs.defaultFS.
  * hdfs://host:port -> connect directly with the given host/port.

Add HdfsFileIOTest covering all three branches plus the existing
HADOOP_HOME / HADOOP_CONF_DIR guard checks.
@TheR1sing3un
Member Author

The failed CI run looks like a flaky test; fixed in #7735.

