revision: priority queue for streaming walks + walk_ops vtable#2118
Draft
spkrka wants to merge 4 commits into
Draft
revision: priority queue for streaming walks + walk_ops vtable#2118spkrka wants to merge 4 commits into
spkrka wants to merge 4 commits into
Conversation
789b94a to
029889e
Compare
limit_list() maintains a date-sorted work queue of commits using a
linked list with commit_list_insert_by_date() for insertion. Each
insertion walks the list to find the right position — O(n) per insert.
In repositories with merge-heavy histories, the symmetric difference
can contain thousands of commits, making this O(n) insertion the
dominant cost.
Replace the sorted linked list with a prio_queue (binary heap). This
gives O(log n) insertion and O(log n) extraction instead of O(n)
insertion and O(1) extraction, which is a net win when the queue is
large.
The still_interesting() and everybody_uninteresting() helpers are
updated to scan the prio_queue's contiguous array instead of walking a
linked list. process_parents() already accepts both a commit_list and
a prio_queue parameter, so the change in limit_list() simply switches
which one is passed.
Benchmark: git rev-list --left-right --count HEAD~N...HEAD
Repository: 2.3M commits, merge-heavy DAG (monorepo)
Best of 5 runs, times in seconds:
commits in
symmetric diff baseline patched speedup
-------------- -------- ------- -------
10 0.01 0.01 1.0x
50 0.01 0.01 1.0x
3751 21.23 8.49 2.5x
4524 21.70 8.29 2.6x
10130 20.10 6.65 3.0x
No change for small traversals; 2.5-3.0x faster when the queue grows
to thousands of commits.
Signed-off-by: Kristofer Karlsson <krka@spotify.com>
enumerate_and_traverse_cruft_objects() initializes a rev_info on the stack and uses it for a traversal, but never calls release_revisions() afterwards. This currently does not cause a leak on master because the commit queue uses a linked list that is consumed during traversal. However, a subsequent commit will switch the revision walking machinery to use a prio_queue, which dynamically allocates its backing array. Without this cleanup, that allocation is reported as a leak by LeakSanitizer (384 bytes in the linux-leaks CI job via t7704). Add the missing release_revisions() call. Signed-off-by: Kristofer Karlsson <krka@spotify.com>
Replace the O(N) sorted linked-list insertion used by get_revision_1() for non-limited walks with an O(log N) priority queue. Add a commit_queue field to rev_info alongside the existing commits linked list. The two representations are mutually exclusive: setup and external callers that need list access (mark_edges_uninteresting, bisect, etc.) use the linked list, then get_revision_1() lazily drains it into the priority queue on the first call for non-limited, non-no_walk, non-topo, non-reflog walks. The conversion function rev_info_commit_list_to_queue() is public so callers that know they will iterate can convert early. The parent-rewriting path (rewrite_one) always merges newly discovered commits into the priority queue, which is correct because rewrite_one_1 only calls process_parents for non-limited walks where the queue is active. Combined with the limit_list() priority queue change in the parent commit, this eliminates all O(N) sorted linked-list insertion from the revision walk machinery. Performance: git rev-list HEAD on a 2.3M-commit repository improves from ~21s to ~6.4s (3.3x speedup). Signed-off-by: Kristofer Karlsson <krka@spotify.com>
Extract the per-mode dispatch logic in get_revision_1() into a revision_walk_ops struct with function pointers for next() and expand(). Each walk mode (reflog, topo, streaming, no_walk, limited) gets its own small helper functions, and get_walk_ops() selects the right ops table once per walk. This replaces the scattered if/else chains that checked the same mode conditions at multiple points in the loop body. The loop is now mode-agnostic: it calls ops->next() to pop the next commit and ops->expand() to process parents, with NULL expand for limited walks that skip parent processing entirely. No functional change. Signed-off-by: Kristofer Karlsson <krka@spotify.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Building on the
limit_list()priority queue change from PR #2114, this adds a missingrelease_revisions()call needed for the new allocation pattern, extends the O(log N) priority queue approach to the non-limited streaming walk inget_revision_1(), and refactors the walk dispatch into a vtable pattern.Commit 1 (same as PR #2114)
revision: use priority queue in limit_list()— replaces O(N) sorted linked-list insertion with O(log N) priority queue inlimit_list().Commit 2
pack-objects: call release_revisions() after cruft traversal—enumerate_and_traverse_cruft_objects()never calledrelease_revisions(). This was harmless on master (the linked-list queue is consumed during traversal), but becomes a real leak once commit 3 switches to aprio_queuewhose backing array must be freed. Fixes the linux-leaks CI failure in t7704.Commit 3
revision: use priority queue for non-limited streaming walks— adds acommit_queuefield torev_infoalongside the existingcommitslinked list. The two representations are mutually exclusive: setup uses the list, then the walk drains it into the priority queue. Removes thelistparameter fromprocess_parents()and eliminatescommit_list_insert_by_date()entirely.Performance:
git rev-list HEADon a 2.3M-commit repository: ~21s → ~6.4s (3.3x speedup).Commit 4
revision: introduce revision_walk_ops for get_revision_1()— replaces the nested if/else dispatch inget_revision_1()with arevision_walk_opsvtable (init/next/expand function pointers). Five walk modes (reflog, topo, streaming, no_walk, limited) are each self-contained. The ops are selected once and cached onrev_info.Design notes
This takes a two-phase approach: linked list during setup, priority queue during the walk. External callers that read
revs->commitsbetweenprepare_revision_walk()and the walk loop are unchanged. This contrasts with the alternative approach of replacingcommitsentirely with aprio_queue(see Jeff King'sjk/revs-commits-prio-queuebranch at https://github.com/peff/git).Tradeoffs are discussed in the mailing list thread for PR #2114.
Test plan
release_revisions()fix