
pack-objects: integrate --path-walk and some --filter options #2101

Draft
derrickstolee wants to merge 6 commits into gitgitgadget:master from
derrickstolee:path-walk-filters

Conversation

@derrickstolee

tbd

When 'git pack-objects' has the --path-walk option enabled, it uses a
different set of revision walk parameters than normal. For one,
--objects was previously assumed by the path-walk API and did not need
to be added. We also needed --boundary to allow discovering
UNINTERESTING objects to use as delta bases.

We will be updating the path-walk API soon to work with some filter
options. However, the revision machinery will trigger a fatal error:

  fatal: object filtering requires --objects

The fix is easy: add the --objects option as an argument. This has no
effect on the path-walk API but does simplify the revision option
parsing for the objects filter.

We can remove the comment about "removing" the options: they were never
removed, only left out in the first place. We still need to disable the
use of bitmaps.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The 'git pack-objects' command can opt in to using the path-walk API for
scanning the objects. Currently, this option is dynamically disabled
when combined with '--filter=<X>', even for a simple filter such as
'blob:none' that signals a blobless packfile. This is a common scenario
for repos at scale, so it is worth integrating.

Also, users can opt in to the '--path-walk' option by default through
the pack.usePathWalk=true config option. When using that in a blobless
partial clone, the following warning can appear even though the user did
not specify either option directly:

  warning: cannot use --filter with --path-walk

Teach the path-walk API to handle the 'blob:none' object filter
natively. When revs->filter.choice is LOFC_BLOB_NONE, the path-walk
sets info->blobs to 0 (skipping all blob objects) and clears the
filter from revs so that prepare_revision_walk() does not reject the
configuration.

This check is implemented in the static prepare_filters() method, which
will simultaneously check if the input filters are compatible and will
make the appropriate mutations to the path_walk_info and filters if the
path_walk_info is non-NULL. This allows us to use this logic both in the
API method path_walk_filter_compatible() for use in
builtin/pack-objects.c and as a prep step in walk_objects_by_path().

Update the test helper (test-path-walk) to accept --filter=<spec>
as a test-tool option (before '--'), applying it to revs after
setup_revisions() to avoid the --objects requirement check.

Also switch test-path-walk from REV_INFO_INIT with manual repo
assignment to repo_init_revisions(), which properly initializes
the filter_spec strbuf needed for filter parsing.

Add tests for blob:none with --all and with a single branch.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The 'git backfill' command uses the path-walk API in a critical way: it
uses the objects output from the command to find the batches of missing
objects that should be requested from the server. Unlike 'git
pack-objects', we cannot fall back to another mechanism.

The previous change added the path_walk_filter_compatible() method that
we can reuse here. Use it during argument validation in cmd_backfill().

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Extend the path-walk API to handle the 'blob:limit=<size>' object
filter natively. This filter omits blobs whose size is equal to or
greater than the given limit, matching the semantics used by the
list-objects-filter machinery.

When revs->filter.choice is LOFC_BLOB_LIMIT, the prepare_filters()
method stores the limit value in info->blob_limit and clears the filter
from revs. If the limit is zero, this degenerates to blob:none (all
blobs excluded), so info->blobs is set to 0 instead.

During walk_path(), blob batches are filtered before being delivered to
the callback: each blob's size is checked via odb_read_object_info(),
and only blobs strictly smaller than the limit are included. Blobs whose
size cannot be determined (e.g. missing in a partial clone) are
conservatively included, matching the existing filter behavior. Empty
batches after filtering are skipped entirely.

The check for inclusion in the path batch looks a little strange at
first glance. We use odb_read_object_info() to read the object's size.
Based on all of the assumptions to this point, this _should_ return
OBJ_BLOB. Since we are focused on the size filter, we use a
short-circuited OR (||) to skip the size check if that method returns a
different object type.

Notice that this inspection of object sizes requires the content to be
present in the repository. The odb_read_object_info() call will download
a missing blob on-demand. This means that the use of the path-walk API
within 'git backfill' would not operate nicely with this filter type.
The intention of that command is to download missing blobs in batches;
downloading objects one-by-one would defeat the purpose. Update the
validation in 'git backfill' to add its own compatibility check on top
of path_walk_filter_compatible().

Add tests for blob:limit=0 (equivalent to blob:none) and blob:limit=3
(which exercises partial filtering within a batch where some blobs are
kept and others are excluded).

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Add p5315-pack-objects-filter.sh to measure the performance of
'git pack-objects --revs --all' under different filter and traversal
combinations:

  * no filter (baseline)
  * --filter=blob:none (blobless)
  * --filter=sparse:oid=<oid> (cone-mode sparse)

Each filter scenario is tested both with and without --path-walk,
producing paired measurements that show the impact of the path-walk
traversal for each filter type.

The sparse filter definition is built automatically by sampling
depth-2 directories from the test repository, making the test work
on any repo passed via GIT_PERF_LARGE_REPO. For repos that lack
depth-2 directories, a single top-level directory is used; for flat
repos, the sparse tests are skipped via prerequisite.

Currently the sparse:oid filter is not yet supported by the path-walk
API, so the --path-walk variant silently falls back to the standard
traversal. This establishes the baseline for measuring improvement
once sparse filter support is added.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
The --filter=sparse:oid=<oid> option to 'git pack-objects' allows
focusing an object set on a sparse-checkout definition. This reduces the
set of matching blobs, but also the set of matching trees. No server currently
supports fetching with this filter because it is expensive to compute
and reachability bitmaps do not help without a significant effort to
extend the bitmap feature to store bitmaps for each supported sparse-
checkout definition.

Without focusing on serving fetches and clones with these filters, there
are still benefits that could be realized by making this faster. With
the sparse index, it's more realistic now than ever to be able to
operate a local clone that was bootstrapped by a packfile created with
a sparse filter, because the missing trees are not needed to move a
sparse-checkout from one commit to another or to view the history of any
path in scope. Such clones could perhaps be bootstrapped by partial
bundles.

Previously, constructing these sparse packs was computationally very
inefficient. The revision walk that explores which objects are in scope
spends a lot of time checking each object to see if it matches the
sparse-checkout patterns, causing multiplicative behavior (number of
objects times number of sparse-checkout patterns). This
improves somewhat when using cone-mode sparse-checkout patterns that can
use hashtables and prefix matches to determine containment. However, the
check per object is still too expensive for most cases.

This is where the path-walk feature comes in. We can proceed as normal
by placing objects in bins by path and _then_ check a group of objects
all at once. Further, we can use cone mode patterns to prune the tree
walk to avoid expanding groups of tree objects that don't match the
prefix-based patterns.

The implementation here is focused around loading the sparse-checkout
patterns from the provided object ID and checking that the patterns are
indeed cone-mode patterns. We can then load the correct pattern list
into the path walk context and use the logic that already exists from
bff4555 (backfill: add --sparse option, 2025-02-03), though that
feature loads sparse-checkout patterns from the worktree's local
settings. We use a combination of errors and warnings to signal problems
during this load. The difference is that errors are likely fatal for the
non-path-walk version while the warnings are probably just
implementation details for the path-walk version and the 'git
pack-objects' command can fall back to the revision walk version.

However, while implementing this I noticed one detail that does impact
'git backfill --sparse', though we haven't seen it matter in real
scenarios: we could under-count objects within a sparse scope if they
appear at multiple paths, some inside and some outside the
sparse-checkout.

Move the SEEN flag assignment in add_tree_entries() to after the sparse
pattern and pathspec checks. Previously, SEEN was set immediately upon
discovering an object, before checking whether its path matched the
sparse patterns. When the same object ID appeared at multiple paths
(e.g. sibling directories with identical contents), the first path to be
visited would mark the object as SEEN. If that path was outside the
sparse cone, the object would be skipped there but also never discovered
at its in-cone path.

By deferring the SEEN flag until after the pattern check passes, objects
that are skipped due to sparse filtering remain discoverable at other
paths where they may be in scope. The new tests in t5317 demonstrate
this behavior and would fail without this change.

The recently-added tests in p5315 demonstrate this improvement, even on
the Git repository:

Test                                            HEAD~1    HEAD
--------------------------------------------------------------------
5315.10: repack (sparse:oid)                     61.68   62.59 +1.5%
5315.11: repack size (sparse:oid)               285.8M  285.7M -0.0%
5315.12: repack (sparse:oid, --path-walk)        61.75    9.49 -84.6%
5315.13: repack size (sparse:oid, --path-walk)  286.0M  265.1M -7.3%

This shows that the performance is relatively stable for the case
without --path-walk, but now that the --path-walk option is engaged with
the change we can see an 84% reduction in the end-to-end time and a 7%
reduction in the packfile size. The size decrease is due to the existing
benefits of the path-walk algorithm placing relevant objects next to
each other in the first compression pass, followed by a second pass
ordered by name hash and size.

Signed-off-by: Derrick Stolee <stolee@gmail.com>