Skip to content

revision: priority queue for streaming walks + walk_ops vtable#2118

Draft
spkrka wants to merge 4 commits into
gitgitgadget:masterfrom
spkrka:revision-walk-ops
Draft

revision: priority queue for streaming walks + walk_ops vtable#2118
spkrka wants to merge 4 commits into
gitgitgadget:masterfrom
spkrka:revision-walk-ops

Conversation

@spkrka
Copy link
Copy Markdown

@spkrka spkrka commented May 19, 2026

Summary

Building on the limit_list() priority queue change from PR #2114, this adds a missing release_revisions() call needed for the new allocation pattern, extends the O(log N) priority queue approach to the non-limited streaming walk in get_revision_1(), and refactors the walk dispatch into a vtable pattern.

Commit 1 (same as PR #2114)

revision: use priority queue in limit_list() — replaces O(N) sorted linked-list insertion with O(log N) priority queue in limit_list().

Commit 2

pack-objects: call release_revisions() after cruft traversalenumerate_and_traverse_cruft_objects() never called release_revisions(). This was harmless on master (the linked-list queue is consumed during traversal), but becomes a real leak once commit 3 switches to a prio_queue whose backing array must be freed. Fixes the linux-leaks CI failure in t7704.

Commit 3

revision: use priority queue for non-limited streaming walks — adds a commit_queue field to rev_info alongside the existing commits linked list. The two representations are mutually exclusive: setup uses the list, then the walk drains it into the priority queue. Removes the list parameter from process_parents() and eliminates commit_list_insert_by_date() entirely.

Performance: git rev-list HEAD on a 2.3M-commit repository: ~21s → ~6.4s (3.3x speedup).

Commit 4

revision: introduce revision_walk_ops for get_revision_1() — replaces the nested if/else dispatch in get_revision_1() with a revision_walk_ops vtable (init/next/expand function pointers). Five walk modes (reflog, topo, streaming, no_walk, limited) are each self-contained. The ops are selected once and cached on rev_info.

Design notes

This takes a two-phase approach: linked list during setup, priority queue during the walk. External callers that read revs->commits between prepare_revision_walk() and the walk loop are unchanged. This contrasts with the alternative approach of replacing commits entirely with a prio_queue (see Jeff King's jk/revs-commits-prio-queue branch at https://github.com/peff/git).

Tradeoffs are discussed in the mailing list thread for PR #2114.

Test plan

  • All existing tests pass (t6001, t6012, t6019, t4013, t4202, t1411 specifically verified)
  • linux-leaks CI (t7704) passes with the release_revisions() fix
  • Correctness verified on 2.3M-commit monorepo across multiple range sizes

@spkrka spkrka force-pushed the revision-walk-ops branch 2 times, most recently from 789b94a to 029889e Compare May 19, 2026 13:23
spkrka added 4 commits May 19, 2026 15:27
limit_list() maintains a date-sorted work queue of commits using a
linked list with commit_list_insert_by_date() for insertion.  Each
insertion walks the list to find the right position — O(n) per insert.
In repositories with merge-heavy histories, the symmetric difference
can contain thousands of commits, making this O(n) insertion the
dominant cost.

Replace the sorted linked list with a prio_queue (binary heap).  This
gives O(log n) insertion and O(log n) extraction instead of O(n)
insertion and O(1) extraction, which is a net win when the queue is
large.

The still_interesting() and everybody_uninteresting() helpers are
updated to scan the prio_queue's contiguous array instead of walking a
linked list.  process_parents() already accepts both a commit_list and
a prio_queue parameter, so the change in limit_list() simply switches
which one is passed.

Benchmark: git rev-list --left-right --count HEAD~N...HEAD
Repository: 2.3M commits, merge-heavy DAG (monorepo)
Best of 5 runs, times in seconds:

  commits in
  symmetric diff   baseline   patched    speedup
  --------------   --------   -------    -------
            10       0.01      0.01       1.0x
            50       0.01      0.01       1.0x
          3751      21.23      8.49       2.5x
          4524      21.70      8.29       2.6x
         10130      20.10      6.65       3.0x

No change for small traversals; 2.5-3.0x faster when the queue grows
to thousands of commits.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
enumerate_and_traverse_cruft_objects() initializes a rev_info on the
stack and uses it for a traversal, but never calls release_revisions()
afterwards. This currently does not cause a leak on master because the
commit queue uses a linked list that is consumed during traversal.

However, a subsequent commit will switch the revision walking machinery
to use a prio_queue, which dynamically allocates its backing array.
Without this cleanup, that allocation is reported as a leak by
LeakSanitizer (384 bytes in the linux-leaks CI job via t7704).

Add the missing release_revisions() call.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
Replace the O(N) sorted linked-list insertion used by get_revision_1()
for non-limited walks with an O(log N) priority queue.

Add a commit_queue field to rev_info alongside the existing commits
linked list. The two representations are mutually exclusive: setup and
external callers that need list access (mark_edges_uninteresting,
bisect, etc.) use the linked list, then get_revision_1() lazily drains
it into the priority queue on the first call for non-limited,
non-no_walk, non-topo, non-reflog walks.

The conversion function rev_info_commit_list_to_queue() is public so
callers that know they will iterate can convert early.

The parent-rewriting path (rewrite_one) always merges newly discovered
commits into the priority queue, which is correct because rewrite_one_1
only calls process_parents for non-limited walks where the queue is
active.

Combined with the limit_list() priority queue change in the parent
commit, this eliminates all O(N) sorted linked-list insertion from the
revision walk machinery.

Performance: git rev-list HEAD on a 2.3M-commit repository improves
from ~21s to ~6.4s (3.3x speedup).

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
Extract the per-mode dispatch logic in get_revision_1() into a
revision_walk_ops struct with function pointers for next() and
expand(). Each walk mode (reflog, topo, streaming, no_walk, limited)
gets its own small helper functions, and get_walk_ops() selects the
right ops table once per walk.

This replaces the scattered if/else chains that checked the same mode
conditions at multiple points in the loop body. The loop is now
mode-agnostic: it calls ops->next() to pop the next commit and
ops->expand() to process parents, with NULL expand for limited walks
that skip parent processing entirely.

No functional change.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
@spkrka spkrka force-pushed the revision-walk-ops branch from 029889e to 7d4da97 Compare May 19, 2026 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant