Skip to content

[python] Align Identifier with Java by encoding branch into the object field#7738

Merged
JingsongLi merged 3 commits intoapache:masterfrom
TheR1sing3un:py-identifier-align-java
May 3, 2026
Merged

[python] Align Identifier with Java by encoding branch into the object field#7738
JingsongLi merged 3 commits intoapache:masterfrom
TheR1sing3un:py-identifier-align-java

Conversation

@TheR1sing3un
Copy link
Copy Markdown
Member

@TheR1sing3un TheR1sing3un commented Apr 29, 2026

Purpose

Java's org.apache.paimon.catalog.Identifier encodes the branch (and any system-table suffix) into the object fieldtbl$branch_dev$snapshots — so the wire shape is exactly two strings, database and object. Python's Identifier was out of step on three real points:

  1. SYSTEM_BRANCH_PREFIX = 'branch-' in Python vs branch_ (with underscore) in Java. Anything written by the Java REST server in the form tbl$branch_<name> falls into Python's two-segment branch-of-is_system_table(), where the prefix check startswith('branch-') returns False → the method returns True. Today a branched table is misclassified as a system table by Python.
  2. branch was a separate JSON field. Java's @JsonIgnoreProperties(ignoreUnknown = true) silently drops that field on the wire — Python clients calling Identifier(database=..., object=..., branch=...) thought they were targeting a branch but the server never saw it.
  3. get_table_name() returned the raw object name including any encoded suffix, and get_branch_name() / get_system_table_name() couldn't decode tbl$branch_dev$snapshots-style names.

This PR ports the Java design 1:1.

Identifier changes

  • Identifier(database, object) is now the JSON-create constructor; the object may already carry an encoded branch / system_table.
  • Identifier.create(database, table, branch=None, system_table=None) encodes the components into the object. branch == "main" (case-insensitive) is the default and is not encoded, matching Java.
  • _split_object_name() lazily decodes the object into table / branch / system_table, with the same 1- / 2- / 3-segment rules as Java's splitObjectName().
  • SYSTEM_BRANCH_PREFIX is now branch_.
  • get_full_name() drops the database prefix when database is UNKNOWN_DATABASE (matches Java).
  • __hash__ / __eq__ use only (database, object), so a JSON round-trip produces an equal instance.

Backward-compat shim (added in commit 2 after review feedback)

Three previously-public surfaces keep working for one more release with a DeprecationWarning, scheduled for removal in the next minor release:

  • Identifier(database, object, branch=...) — accepted again; the branch is encoded into object internally so the wire output matches Identifier.create(db, table, branch=...).
  • identifier.branch — exposed as a @property that delegates to get_branch_name().
  • Identifier.create(db, "tbl$snapshots") — when the second positional arg already contains $ and no branch= / system_table= kwargs are supplied, the call is treated as the legacy raw-object form: the string is stored verbatim in object, preserving the pre-PR behaviour where create("db", "orders$snapshots") yielded a system-table identifier.

The bug-fix changes (SYSTEM_BRANCH_PREFIX corrected to branch_, dropping the un-round-trippable branch JSON field) are kept as-is; they were already incorrect versus Java, and there is no compat path that preserves the buggy behaviour while letting branched system tables work end-to-end.

Knock-on cleanup (single source of truth)

Now that the identifier carries the branch:

  • SnapshotCommit.commit() (abstract + CatalogSnapshotCommit + RenamingSnapshotCommit) drops its branch parameter; FileStoreCommit stops passing table.current_branch().
  • FileStoreTable.current_branch() reads from the identifier (was self.options.branch()).
  • FileStoreTable.copy() re-encodes the branch into a new identifier when CoreOptions.BRANCH is supplied, so CatalogEnvironment sees the branched object name.
  • RESTCatalog.to_table_metadata() drops the now-redundant block that copied the branch back into options.
  • RESTTokenFileIO.refresh_token() uses Identifier.create(db, table, branch=...) to strip the system-table suffix while preserving the branch.
  • The mock REST server's URL parser stops scanning for a table.branch suffix and reads identifier.get_branch_name() directly, matching the Java URL convention.

Migration

Old New
Identifier(db, obj, branch=b) Identifier.create(db, table, branch=b)
identifier.branch identifier.get_branch_name() (or get_branch_name_or_default())
Identifier.create(db, "tbl$snapshots") Identifier(db, "tbl$snapshots") (raw) or Identifier.create(db, "tbl", system_table="snapshots")

The deprecated forms keep working for this release; deprecation warnings indicate the migration path. Removal targeted at the next minor release.

Linked issue

N/A — surfaced when running pypaimon against a Java REST server that emits tbl$branch_<name> object names.

Tests

  • pypaimon/tests/identifier_test.py::IdentifierTest — 25 cases covering simple / backtick / Java-compatible parsing, branch encoding, main-branch normalization, UNKNOWN_DATABASE, one- / two- / three-segment object names with and without branch prefix, create() with branch + system_table, equality through the JSON shape, and the malformed tbl$x$y-without-branch-prefix case.
  • pypaimon/tests/identifier_test.py::IdentifierBackwardCompatibilityShimTest — 8 cases locking in the shim: constructor branch= warns + encodes correctly; branch="main" not encoded; branch=None still warns; .branch property delegates + warns; create(db, "$"-containing) falls back to raw-object form + warns; create(db, "no-dollar") does NOT warn; create(db, "$"-table, branch=...) skips the fallback so kwargs win.
  • catalog_branch_manager_test, file_store_table_test, rest_catalog_commit_snapshot_test, rest_token_file_io_test — updated to construct branched identifiers via Identifier.create(...) or the db.tbl$branch_<x> string form.

Local: pytest pypaimon/tests/{identifier_test,branch,table/file_store_table_test,file_store_commit_test,rest/{client_test,rest_token_file_io_test,rest_catalog_commit_snapshot_test}.py} → 33 identifier tests + surrounding suites green; the only failure in those suites is the pre-existing lance environment issue (unrelated to this PR). flake8 --config=dev/cfg.ini clean.

API and format

  • Wire compatibility: this is a wire-correctness fix. The previous branch JSON field was already silently dropped by Java; nothing that successfully round-tripped through the Java REST server is affected. After the fix, branch-aware Python clients now correctly produce object="tbl$branch_<name>" that Java decodes the same way.
  • Public Python API: Identifier(database, object, branch=...) and identifier.branch and the legacy Identifier.create(db, object-with-dollar) form keep working for this release with a DeprecationWarning. Migration table above.
  • No file format change.

Documentation

The new create() signature, the encoding rules, and the deprecation paths are documented inline on Identifier, __init__, create(), and branch.

Generative AI disclosure

Drafted with assistance from an AI coding tool; the design is a direct 1:1 port of org.apache.paimon.catalog.Identifier (paimon-api), and every behavioural difference between Java and Python that this PR closes is exercised by a test in identifier_test.py. The deprecation shim was added in response to review feedback (#7738 review thread on identifier.py:29).

…t field

Java's org.apache.paimon.catalog.Identifier represents a table on a branch
(or a system table on a branch) by encoding the branch / system_table into
the "object" field, e.g. "tbl$branch_dev$snapshots". The wire shape is
exactly two fields, "database" and "object". The Python Identifier was
out of step:

  * SYSTEM_BRANCH_PREFIX was "branch-", whereas Java uses "branch_". Any
    "$branch_<name>" object name produced by the REST server made
    is_system_table() return True (treating a branched table as a system
    table) — a real correctness bug for catalog snapshot commits that
    routed through the branch-encoded identifier.
  * "branch" was a third JSON field. Java has @JsonIgnoreProperties so it
    silently dropped that field on the wire — Python clients setting
    `branch=...` thought they were branching but the server never saw it.
  * get_table_name() returned the raw object name including any encoded
    suffix; get_branch_name() / get_system_table_name() were missing.

Port the Java design 1:1:
  * Identifier(database, object) is the JSON-create constructor; "object"
    may already carry an encoded branch / system_table.
  * Identifier.create(database, table, branch=None, system_table=None)
    encodes the components into the object field. branch="main"
    (case-insensitive) is the default and is not encoded, matching Java.
  * _split_object_name() lazily decodes the object into table / branch /
    system_table, with the same one- / two- / three-segment rules as
    Java's splitObjectName().
  * SYSTEM_BRANCH_PREFIX is now "branch_".
  * get_full_name() drops the database prefix when the database is
    UNKNOWN_DATABASE, matching Java.
  * Equality and hash use only (database, object), so a JSON round-trip
    produces an equal instance.

Eliminate redundant branch parameter passing now that the identifier is
the single source of truth:
  * SnapshotCommit.commit() no longer takes a separate `branch` argument
    — both CatalogSnapshotCommit and RenamingSnapshotCommit (and the
    abstract base) drop it; FileStoreCommit's call site stops passing
    table.current_branch().
  * FileStoreTable.current_branch() reads from the identifier instead of
    CoreOptions; FileStoreTable.copy() re-encodes the branch into the
    new identifier when CoreOptions.BRANCH is supplied, so the catalog
    environment sees the branched object name.
  * RESTCatalog.to_table_metadata() drops the now-redundant block that
    re-derives BRANCH from the identifier into options.
  * RESTTokenFileIO.refresh_token() uses Identifier.create(db, table,
    branch=...) to strip the system-table suffix while preserving the
    branch.
  * The mock RESTCatalogServer's URL parser stops looking for a
    "table.branch" suffix and instead reads identifier.get_branch_name()
    directly, matching the Java URL convention.

Tests:
  * identifier_test: 25 cases covering simple parse, backtick parsing,
    branch encoding, main-branch normalization, UNKNOWN_DATABASE,
    one/two/three-segment object names with and without branch prefix,
    create() with branch + system_table, equality through JSON shape.
  * catalog_branch_manager_test, file_store_table_test,
    rest_catalog_commit_snapshot_test, rest_token_file_io_test: updated
    to construct branched identifiers via Identifier.create(...) or the
    "db.tbl$branch_<x>" string form.


@dataclass
@dataclass(init=False)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a break change?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that @JingsongLi — yes, you're right. Let me enumerate what changes shape so I can address each:

  1. Identifier(database, object, branch=...) constructor kwarg removed.
  2. identifier.branch attribute removed.
  3. Identifier.create(db, object) second positional arg renamed from object to table; passing a pre-encoded object containing $ now caches it as the table name rather than splitting it.
  4. SYSTEM_BRANCH_PREFIX changed from 'branch-' to 'branch_' (this one is a bug-fix — Java has always used branch_, so anything Python wrote with branch- was already not round-trippable through the Java REST server).
  5. The branch JSON field is no longer emitted (Java @JsonIgnoreProperties(ignoreUnknown = true) was already dropping it on the wire, so this never carried information end-to-end, but the Python wire shape does change).

(4) and (5) I'd defend as correctness fixes — the pre-fix behaviour was inconsistent with Java in ways that silently corrupted cross-language scenarios, and I don't see a backward-compat path that preserves the buggy behaviour while letting branched system tables work against a Java REST server.

(1) (2) (3) are the real API-shape break. I agree those need a soft-deprecation path. I'll push a follow-up commit on this PR that:

  • Re-accepts Identifier(..., branch=...) and routes it through the new encoding internally, emitting DeprecationWarning.
  • Re-exposes identifier.branch as a property that delegates to get_branch_name(), also with DeprecationWarning.
  • Re-accepts the old Identifier.create(db, object) two-arg form (detected by $ in the second arg) and routes through the new constructor, with DeprecationWarning.
  • Adds tests that lock in both the warning and the equivalent-result behaviour, so we don't accidentally drop the shim before users have a chance to migrate.
  • Adds a migration note to the PR description.

The plan would be to remove the shim in the next minor version. Does that sound reasonable?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why you need to introduce _BRANCH_NOT_SET, Java also has many constructors.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why you need to introduce _BRANCH_NOT_SET, Java also has many constructors.

You're right — the sentinel was overkill, and the deprecation path it was
gating is unnecessary if we just match Java's constructor shape directly.

Pushed b17c946, which:

  • Drops _BRANCH_NOT_SET and all DeprecationWarning.
  • __init__(database, object=None, branch=None, system_table=None)
    mirrors Java's three public constructors in one signature: when
    branch / system_table is non-None, encode into object; otherwise
    treat object as the final string (the @JsonCreator path).
  • Identifier.create keeps its existing kwargs and now just delegates
    to the constructor — Java has only the 2-arg create, but keeping
    the multi-arg form here costs nothing and avoids touching every
    caller introduced by the first commit.

Backward compatibility against the pre-PR public surface (the worry
from your first review) is preserved without warnings:

  • Identifier(db, obj, branch=...) / branch=None — accepted, encodes
    into object so the Java REST server now actually sees the branch.
  • identifier.branch — read+write property. The setter re-encodes
    object so id.branch = "x" keeps working for any caller that used
    to assign to the dataclass field.
  • Identifier.create(db, "tbl$snapshots") two-arg form — unchanged.

A new IdentifierBackwardCompatibilityTest locks these in by asserting
"the old call sites run without raising" (no warning assertions, no
shim semantics — these are first-class supported entry points now).

PTAL.

…nch=/branch/object-with-dollar surfaces

Address review feedback on PR apache#7738: the new wire-aligned shape is
correct, but the change shouldn't break callers that today use
``Identifier(database=..., object=..., branch=...)`` (kwarg form),
``identifier.branch`` (read), or ``Identifier.create(db, "tbl$snapshots")``
(legacy raw-object two-arg form). These surfaces now keep working for
one more release with a ``DeprecationWarning``:

  * ``Identifier.__init__`` accepts ``branch=`` again. The branch is
    encoded into ``object`` internally so the wire output matches what
    ``Identifier.create(db, table, branch=...)`` would produce.
  * ``Identifier.branch`` is exposed as a ``@property`` that delegates
    to ``get_branch_name()``.
  * ``Identifier.create(db, table)`` detects the legacy raw-object
    shape (``$`` in the second arg, no branch / system_table kwargs)
    and stores the string verbatim in ``object``, preserving the
    pre-PR behaviour where ``create("db", "orders$snapshots")``
    yielded a system-table identifier.

All three paths emit a single ``DeprecationWarning`` with a clear
migration suggestion. Scheduled for removal in the next minor release.

The bug-fix changes — ``SYSTEM_BRANCH_PREFIX = 'branch_'`` (was
``'branch-'``, never matched Java) and dropping the un-roundtrippable
``branch`` JSON field — are kept as-is; they were already incorrect
versus Java and there is no compat path that preserves the buggy
behaviour while letting branched system tables work end-to-end.

New tests in ``IdentifierBackwardCompatibilityShimTest`` lock in
both the behavioural compatibility AND the warning, so the shim
doesn't get silently dropped before users have a chance to migrate:

  * constructor branch= encodes into object, equal to create()
  * constructor branch="main" is not encoded
  * constructor branch=None still warns (deprecated kwarg)
  * .branch property delegates and warns
  * create(db, "$"-containing) falls back to raw-object form, warns
  * create(db, "no-dollar") does NOT warn
  * create(db, "$"-table, branch=...) skips the fallback (kwargs win)
@TheR1sing3un
Copy link
Copy Markdown
Member Author

Pushed a913e1461 addressing the review:

  • Identifier(..., branch=...) constructor kwarg restored, internally encodes into object, emits DeprecationWarning.
  • identifier.branch re-exposed as a @property delegating to get_branch_name(), also DeprecationWarning.
  • Identifier.create(db, "tbl$snapshots") legacy two-arg form (raw object containing $, no branch= / system_table= kwargs) routes back to the raw-object behaviour, with DeprecationWarning.
  • 8 new cases in IdentifierBackwardCompatibilityShimTest lock in both the warnings and the equivalent-result behaviour, so the shim can't get silently dropped.
  • PR description now has a migration table; deprecation removal targeted at the next minor release.

Kept the two correctness fixes — SYSTEM_BRANCH_PREFIX = 'branch_' and dropping the un-round-trippable branch JSON field — as-is. PTAL.

@TheR1sing3un TheR1sing3un requested a review from JingsongLi April 29, 2026 15:33
…entinel

Address review on a913e14: instead of a `_BRANCH_NOT_SET` sentinel
gating a DeprecationWarning, mirror Java's three public constructors
(`Identifier(db, object)`, `Identifier(db, table, branch)`,
`Identifier(db, table, branch, systemTable)`) directly in `__init__` --
`branch` / `system_table` not None routes through the encoding path,
otherwise `object` is taken as the final string.

Backward compatibility (pre-PR public surface keeps working without
errors):
  * `Identifier(db, obj, branch=...)` / `branch=None` -- accepted.
  * `identifier.branch` -- read+write property; the setter re-encodes
    `object` so the wire stays consistent.
  * `Identifier.create(db, obj)` two-arg form -- unchanged.

Tests: `IdentifierBackwardCompatibilityTest` now asserts the old call
sites run without raising (no warning assertions, no shim semantics).
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit 487bed2 into apache:master May 3, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants