Allow pandas 3.x in dependency constraints #768
Conversation
Relax the pandas version upper bound from `<2.4.0` to `<4.0.0` to allow pandas 3.x. The pandas APIs used in this project (nullable extension dtypes, `DataFrame.to_numpy`, PyArrow-to-pandas conversion via `types_mapper`) are all compatible with pandas 3.0. Add unit tests for `_convert_arrow_table` covering all mapped data types (int8-64, uint8-64, float32/64, bool, string), null handling, mixed types, duplicate column names, and the `disable_pandas` code path. Closes databricks#732
Friendly ping for review — this unblocks pandas 3.x users (see #732, which has multiple +1s including Ibis). cc @jprakash-db @vikrantpuppala @tejassp-db — would any of you be able to take a look or assign a reviewer?

Summary of validation

All existing and new tests pass on both pandas 2.x and pandas 3.x:
The pandas 3 behavioral change (`StringDtype()` defaulting to PyArrow-backed storage) does not affect the converted results, since they are immediately flattened to object-dtype numpy. Happy to address any feedback. Thanks!
Could a maintainer please take a look?
vikrantpuppala
left a comment
Thanks for the contribution! Mostly LGTM, with a few minor comments
```python
mock_connection = Mock()
mock_connection.disable_pandas = False

rs = object.__new__(_ConcreteResultSet)
```
Bypassing `__init__` via `object.__new__` works because `_convert_arrow_table` only reads `self.description` and `self.connection.disable_pandas`, but it might be fragile.
Can we construct it via the normal constructor with mocked args instead, as done here: https://github.com/databricks/databricks-sql-python/blob/main/tests/unit/test_client.py#L187-L196
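To illustrate the suggestion, here is a minimal sketch of constructing through the real `__init__` with mocked collaborators rather than `object.__new__`. `FakeResultSet` is a stand-in class invented for this sketch; the actual constructor signature of `_ConcreteResultSet` will differ.

```python
from unittest.mock import Mock

class FakeResultSet:
    """Hypothetical stand-in for _ConcreteResultSet; the real class lives in the connector."""
    def __init__(self, connection, description):
        self.connection = connection
        self.description = description

mock_connection = Mock()
mock_connection.disable_pandas = False
description = [("id", "int", None, None, None, None, None)]

# Constructed through __init__, so any future initialization logic still runs,
# unlike the object.__new__(...) bypass.
rs = FakeResultSet(mock_connection, description)
```

Building through the constructor keeps the test valid if `__init__` later starts setting state that `_convert_arrow_table` depends on.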
Can we perhaps add tests for these datatypes as well:
- `pa.decimal128`
- `pa.date32` / `pa.date64`
- `pa.timestamp`
- `pa.binary` / `pa.large_string`
- `pa.list_` / `pa.struct` / `pa.map_`
Summary
Relax the pandas version upper bound from `<2.4.0` to `<4.0.0` to allow pandas 3.x alongside `databricks-sql-connector`.

The pandas APIs used in this project (nullable extension dtypes like `Int64Dtype`, `StringDtype`, `DataFrame.to_numpy`, PyArrow-to-pandas conversion via `types_mapper`) are all compatible with pandas 3.0. The key behavioral change in pandas 3 — `StringDtype()` defaulting to PyArrow-backed storage — does not affect this code because results are immediately converted to numpy/Python objects via `to_numpy(na_value=None, dtype="object")`.

Closes #732
Changes
- `pyproject.toml`: Raise pandas upper bound from `<2.4.0` to `<4.0.0` for both Python version groups
- `tests/unit/test_pandas_compatibility.py`: New test suite for `_convert_arrow_table` covering all mapped data types

Test Results
All existing and new tests pass on both pandas 2.x and pandas 3.x:
New test cases (`test_pandas_compatibility.py`)

Tests exercise `ResultSet._convert_arrow_table()` with the following scenarios:

- `test_integer_types` — int8, int16, int32, int64 columns with nulls
- `test_unsigned_integer_types` — uint8, uint16, uint32, uint64 columns with nulls
- `test_float_types` — float32, float64 columns with nulls
- `test_boolean_type` — boolean column with nulls
- `test_string_type` — string column with nulls (validates `StringDtype` behavior change in pandas 3)
- `test_mixed_types` — table with int64, string, float64, bool columns simulating real query results
- `test_duplicate_column_names` — verifies the rename-before-`to_pandas` workaround
- `test_empty_table` — empty Arrow table
- `test_all_nulls` — all-null columns for int and string types
- `test_disable_pandas_path` — non-pandas code path (`_disable_pandas=True`)