feat(dag/walker): opt-in BloomTracker to avoid duplicated walks by lidel · Pull Request #1124 · ipfs/boxo

lidel · 2026-03-20T21:46:48Z

Warning

not ready for review, this is a sandbox for running CI

kubo PR: feat(provide): +unique and +entities strategy modifiers kubo#11245

Summary

New dag/walker package for memory-efficient DAG traversal with bloom filter deduplication, plus new pinned-provider strategies that use it to avoid re-announcing duplicate blocks across pins.

`dag/walker` (new package)

VisitedTracker interface with two implementations:
- BloomTracker -- auto-scaling bloom filter chain (~4 bytes/CID vs ~75 for a map), with uncorrelated false positives across nodes (unique random SipHash keys per instance)
- MapTracker -- exact dedup for tests and small datasets
WalkDAG -- iterative DFS traversal with integrated dedup, codec-agnostic link extraction (dag-pb, dag-cbor, raw, and any registered codec). ~2x faster than the legacy go-ipld-prime selector-based walker.
WalkEntityRoots -- entity-aware traversal that emits only file/directory/HAMT shard roots instead of every block, skipping internal file chunks.

`pinner`

NewUniquePinnedProvider -- emits all blocks reachable from pins with cross-pin bloom dedup (recursive DAGs first, then direct pins).
NewPinnedEntityRootsProvider -- same but emits only entity roots via WalkEntityRoots.
Both log and skip corrupted pin entries instead of aborting the entire provide cycle.

`provider`

NewPrioritizedProvider now continues to the next stream when one fails instead of stopping all streams.
NewConcatProvider added for pre-deduplicated streams that don't need the cidutil.StreamingSet overhead.

Other

Identity CIDs are filtered out during walks (they are inlined data, not real blocks).
Siblings are visited in left-to-right link order for deterministic traversal.

VisitedTracker interface for memory-efficient DAG traversal dedup. BloomTracker uses a scalable bloom filter chain (~4 bytes/CID vs ~75 for a map), enabling dedup on repos with tens of millions of CIDs.
- BloomTracker: auto-scaling chain, configurable FP rate via BloomParams, unique random SipHash keys per instance (uncorrelated FPs across nodes)
- MapTracker: exact dedup for tests and small datasets
- *cid.Set satisfies the interface for drop-in compatibility
- go.mod: update ipfs/bbloom to master (for NewWithKeys)

iterative DFS walker that integrates VisitedTracker dedup directly into the traversal loop, skipping entire subtrees in O(1).
- LinksFetcherFromBlockstore: extracts links from any codec registered in the global multicodec registry (dag-pb, dag-cbor, raw, etc.)
- ~2x faster than legacy go-ipld-prime selector traversal (no selector machinery, simpler decoding, fewer allocations)
- WithLocality option for MFS providers to skip non-local blocks
- best-effort error handling: fetch failures log and skip, do not mark the CID as visited (allows retry via another pin or next cycle)
- benchmarks comparing BlockAll vs WalkDAG across dag-pb, dag-cbor, and mixed-codec DAGs

codecov · 2026-03-21T03:49:17Z

Codecov Report

❌ Patch coverage is 78.46154% with 70 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.77%. Comparing base (43c30ce) to head (c8920a2).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pinning/pinner/dspinner/uniquepinprovider.go | 55.17% | 17 Missing and 9 partials ⚠️ |
| dag/walker/walker.go | 80.95% | 10 Missing and 6 partials ⚠️ |
| dag/walker/entity.go | 77.94% | 11 Missing and 4 partials ⚠️ |
| dag/walker/visited.go | 87.77% | 7 Missing and 4 partials ⚠️ |
| provider/provider.go | 92.00% | 2 Missing ⚠️ |

@@            Coverage Diff             @@
##             main    #1124      +/-   ##
==========================================
+ Coverage   62.56%   62.77%   +0.21%     
==========================================
  Files         261      265       +4     
  Lines       26216    26539     +323     
==========================================
+ Hits        16402    16660     +258     
- Misses       8125     8170      +45     
- Partials     1689     1709      +20

| Files with missing lines | Coverage Δ | |
|---|---|---|
| provider/provider.go | `90.16% <92.00%> (+19.11%)` | ⬆️ |
| dag/walker/visited.go | `87.77% <87.77%> (ø)` | |
| dag/walker/entity.go | `77.94% <77.94%> (ø)` | |
| dag/walker/walker.go | `80.95% <80.95%> (ø)` | |
| pinning/pinner/dspinner/uniquepinprovider.go | `55.17% <55.17%> (ø)` | |

... and 7 files with indirect coverage changes


emits entity roots (files, directories, HAMT shards) skipping internal file chunks. core of the +entities provide strategy.
- NodeFetcherFromBlockstore: detects UnixFS entity type from the ipld-prime decoded node's Data field
- directories and HAMT shards: emit and recurse into children
- non-UnixFS codecs (dag-cbor, dag-json): emit and follow links
- same options as WalkDAG: WithVisitedTracker, WithLocality
- tests: dag-pb, raw, dag-cbor, mixed codecs, HAMT, dedup, error handling, stop conditions

catch unexpected regressions in ipfs/bbloom behavior or BloomParams derivation that would silently degrade the false positive rate.
- measurable rate (1/1000): 100K probes produce observable FPs, asserts rate is within 5x of target
- default rate (1/4.75M): 100K probes must produce exactly 0 FPs
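
For a sense of scale, the two rates above can be run through the textbook bloom filter sizing formulas, m/n = -ln(p) / (ln 2)^2 bits per element and k = (m/n) * ln 2 hash functions. This is a back-of-the-envelope sketch: whether BloomParams or ipfs/bbloom derives its parameters exactly this way is an assumption, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// bitsPerElement returns the optimal bloom filter size, in bits per
// stored element, for a target false positive rate p.
func bitsPerElement(p float64) float64 {
	return -math.Log(p) / (math.Ln2 * math.Ln2)
}

// hashCount returns the matching optimal number of hash functions,
// rounded to the nearest integer.
func hashCount(p float64) int {
	return int(math.Round(bitsPerElement(p) * math.Ln2))
}

func main() {
	for _, p := range []float64{1.0 / 1000, 1.0 / 4_750_000} {
		bits := bitsPerElement(p)
		fmt.Printf("p=%.3g bits/elem=%.1f bytes/elem=%.2f k=%d\n",
			p, bits, bits/8, hashCount(p))
	}
}
```

Under these formulas a 1/4.75M target works out to roughly 32 bits, or about 4 bytes, per element, which lines up with the ~4 bytes/CID figure quoted in the summary.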

- NewPrioritizedProvider: stream init error no longer stops remaining streams (e.g. MFS flush error does not prevent pinned content from being provided)
- NewConcatProvider: concatenates pre-deduplicated streams without its own visited set, for use with shared VisitedTracker
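
A minimal sketch of the two behaviors described above, combined into one toy function for brevity: streams are drained in order with no visited set, and a stream that fails to start is skipped rather than aborting the rest. The `streamFunc` type is a hypothetical simplification (no context, string keys instead of CIDs), and the sketch does not claim that either real constructor is implemented this way.

```go
package main

import (
	"errors"
	"fmt"
)

// streamFunc is a hypothetical stand-in for a provider key stream:
// it either fails to start or returns a channel of keys.
type streamFunc func() (<-chan string, error)

// drainInOrder reads each stream to completion, in order. A stream that
// fails to start is reported and skipped so later streams still run.
// No visited set is kept: inputs are assumed to be deduplicated already.
func drainInOrder(streams ...streamFunc) []string {
	var out []string
	for i, s := range streams {
		ch, err := s()
		if err != nil {
			fmt.Printf("stream %d failed: %v (continuing)\n", i, err)
			continue
		}
		for k := range ch {
			out = append(out, k)
		}
	}
	return out
}

// fromSlice builds a stream that yields the given keys and then closes.
func fromSlice(keys ...string) streamFunc {
	return func() (<-chan string, error) {
		ch := make(chan string, len(keys))
		for _, k := range keys {
			ch <- k
		}
		close(ch)
		return ch, nil
	}
}

// failing simulates a stream whose initialization fails.
func failing() (<-chan string, error) {
	return nil, errors.New("mfs flush failed")
}

func main() {
	fmt.Println(drainInOrder(failing, fromSlice("pin1", "pin2"), fromSlice("pin3")))
	// prints the failure notice, then: [pin1 pin2 pin3]
}
```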

…vider

NewUniquePinnedProvider: emits all pinned blocks with cross-pin dedup via shared VisitedTracker (bloom or map). walks recursive pin DAGs first, then direct pins.

NewPinnedEntityRootsProvider: same structure but uses WalkEntityRoots, emitting only entity roots and skipping internal file chunks. existing NewPinnedProvider is unchanged.


- deduplicate LinkSystem construction used by both LinksFetcherFromBlockstore and NodeFetcherFromBlockstore
- wrap blockstore with NewIdStore so identity CIDs (multihash 0x00, data inline in the CID) are decoded without a datastore lookup

identity CIDs (multihash 0x00) embed data inline, so providing them to the DHT is wasteful. the walker now traverses through identity CIDs (following their links) but never emits them.
- add isIdentityCID check to WalkDAG and WalkEntityRoots
- simplify WalkEntityRoots emit/descend logic
- tests for identity raw leaf, identity dag-pb directory with normal children, normal directory with identity child

- inline identity CID check (c.Prefix().MhType == mh.IDENTITY) in all emit paths: WalkDAG, WalkEntityRoots, and direct pin loops in both NewUniquePinnedProvider and NewPinnedEntityRootsProvider
- move all identity CID tests to dag/walker/identity_test.go
- add provider-level identity tests for direct pins and recursive DAGs

the stack-based DFS was pushing children in link order, causing the last child to be popped first (right-to-left). reverse children before pushing so the first link is on top and gets visited first. this matches the legacy fetcherhelpers.BlockAll selector traversal (ipld-prime iterates list/map entries in insertion order) and the conventional DFS order described in IPIP-0412.
- walker.go, entity.go: slices.Reverse(children) before stack push
- walker.go: document traversal order in WalkDAG godoc
- entity.go: document order parity in WalkEntityRoots godoc
- walker_test.go, entity_test.go: add sibling order regression tests

a corrupted pin entry was stopping the entire provide cycle because the goroutine returned on RecursiveKeys/DirectKeys error. change to continue so remaining pins are still provided (best-effort). the error from the pinner iterator already contains context (bad CID bytes, datastore key, etc.) -- sc.Pin.Key is zero-value on error so including it in the log would be noise. matches the best-effort pattern used in WalkDAG/WalkEntityRoots where fetch errors are logged and skipped.

- collectLinks: note that map keys are not recursed (no known codec uses link-typed map keys)
- detectEntityType: extract c.Prefix() once for readability
- grow: document MinBloomCapacity invariant that prevents small-bitset FP rate issues in grown blooms

gammazero

Made a few suggestions but nothing blocking.

dag/walker/entity.go

dag/walker/walker.go

pinning/pinner/dspinner/uniquepinprovider.go

dag/walker/visited.go


gammazero

LGTM

guillaumemichel

Neat implementation! A few comments inline, but nothing major

guillaumemichel · 2026-04-01T08:06:30Z

dag/walker/visited.go

+
+func (bt *BloomTracker) Has(c cid.Cid) bool {
+	key := []byte(c.Hash())
+	for _, b := range bt.chain {


Maybe document the tradeoff of the order in which filters are checked.

Old -> new (current):
Works well when some CIDs are repeated very often. These CIDs are likely to land in the first filter, so that is the one we want to check first. Nodes landing in newer filters are expected to be less frequent.

New -> old:
Duplicates are close to each other in the DAG traversal (e.g. after a CID is visited, its duplicates are expected to be within the next 10K visited nodes). Hence it is better to check the newest filters first, since they have the highest probability of containing the duplicate.
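
To make the tradeoff concrete, the sketch below counts how many filters get probed under each order. Exact sets stand in for the bloom filters so the result is deterministic. The `layer` type and `probe` function are hypothetical and do not mirror BloomTracker's internals.

```go
package main

import "fmt"

// layer stands in for one bloom filter in the chain. An exact set is
// used so the probe counts are deterministic.
type layer map[string]bool

// probe reports whether key is in any layer and how many layers were
// checked before the answer was known. chain[0] is the oldest layer.
func probe(chain []layer, key string, newestFirst bool) (found bool, checked int) {
	for i := range chain {
		idx := i
		if newestFirst {
			idx = len(chain) - 1 - i
		}
		checked++
		if chain[idx][key] {
			return true, checked
		}
	}
	return false, checked
}

func main() {
	chain := []layer{
		{"popular": true}, // oldest
		{"mid": true},
		{"recent": true}, // newest
	}
	for _, key := range []string{"popular", "recent", "absent"} {
		_, oldFirst := probe(chain, key, false)
		_, newFirst := probe(chain, key, true)
		fmt.Printf("%s: old->new=%d new->old=%d\n", key, oldFirst, newFirst)
	}
	// prints:
	// popular: old->new=1 new->old=3
	// recent: old->new=3 new->old=1
	// absent: old->new=3 new->old=3
}
```

Either order costs the same for keys that are absent from every layer, so the choice only matters for hits, and which hits dominate depends on the workload.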

guillaumemichel · 2026-04-01T08:07:24Z

dag/walker/visited.go

+	// Check earlier blooms for the CID. If any reports it as
+	// present (true positive from a prior growth epoch, or rare
+	// cross-bloom false positive), skip it.
+	earlier := bt.chain[:len(bt.chain)-1]


Same remark as https://github.com/ipfs/boxo/pull/1124/changes#r3020485430 concerning bloom filter iteration order

guillaumemichel · 2026-04-01T08:16:56Z

dag/walker/walker.go

+//  1. [VisitedTracker].Visit -- if already visited, skip entire subtree.
+//     The CID is marked visited immediately (before fetch). If fetch
+//     later fails, the CID stays in the tracker and won't be retried
+//     this cycle -- caught in the next cycle (22h). This avoids a


22h is a Kubo DHT-specific parameter; not sure we want to hardcode it in the comment here.

With reprovide sweep it corresponds to the keystore GC interval: calling KeyChanFunc to replace the set of keys being periodically reprovided.

With legacy reprovide it corresponds to the reprovide interval (also calling KeyChanFunc)

guillaumemichel · 2026-04-01T08:22:18Z

dag/walker/walker.go

+		slices.Reverse(children)
+		stack = append(stack, children...)
+
+		// skip identity CIDs: content is inline, no need to provide


If the DAG walker has purposes other than providing, I would suggest leaving this filtering to the provide system.

Or instead of blocking IDENTITY allow caller to pass a blocklist as option, so that provide systems can block IDENTITY?
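
One way the suggestion above could look is a functional option that carries a caller-supplied predicate. Everything here is hypothetical: `withSkipEmit` is not an option in the PR, string keys with an `identity:` prefix stand in for the `c.Prefix().MhType == mh.IDENTITY` check, and `emitAll` replaces the real traversal loop.

```go
package main

import (
	"fmt"
	"strings"
)

// walkConfig and option follow the functional-options pattern.
type walkConfig struct {
	skipEmit func(key string) bool
}

type option func(*walkConfig)

// withSkipEmit is a hypothetical option: keys for which skip returns
// true would still be traversed, but are never passed to the caller.
func withSkipEmit(skip func(string) bool) option {
	return func(c *walkConfig) { c.skipEmit = skip }
}

// emitAll stands in for the walker's emit step over an ordered key list.
func emitAll(keys []string, opts ...option) []string {
	cfg := &walkConfig{}
	for _, o := range opts {
		o(cfg)
	}
	var out []string
	for _, k := range keys {
		if cfg.skipEmit != nil && cfg.skipEmit(k) {
			continue
		}
		out = append(out, k)
	}
	return out
}

// isIdentity is a toy stand-in for the identity multihash check.
func isIdentity(k string) bool { return strings.HasPrefix(k, "identity:") }

func main() {
	keys := []string{"identity:dir", "sha256:file", "identity:leaf", "sha256:chunk"}
	fmt.Println(emitAll(keys))                           // all four keys
	fmt.Println(emitAll(keys, withSkipEmit(isIdentity))) // [sha256:file sha256:chunk]
}
```

With this shape the walker itself stays policy-free, and a provide system opts in to identity filtering by passing the predicate.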

guillaumemichel · 2026-04-01T08:29:02Z

dag/walker/entity.go

+func WalkEntityRoots(
+	ctx context.Context,
+	root cid.Cid,
+	fetch NodeFetcher,
+	emit func(cid.Cid) bool,
+	opts ...Option,
+) error {
+	cfg := &walkConfig{}
+	for _, o := range opts {
+		o(cfg)
+	}
+
+	stack := []cid.Cid{root}
+
+	for len(stack) > 0 {
+		if ctx.Err() != nil {
+			return ctx.Err()
+		}
+
+		// pop
+		c := stack[len(stack)-1]
+		stack = stack[:len(stack)-1]
+
+		// dedup via tracker
+		if cfg.tracker != nil && !cfg.tracker.Visit(c) {
+			continue
+		}
+
+		// locality check
+		if cfg.locality != nil {
+			local, err := cfg.locality(ctx, c)
+			if err != nil {
+				log.Errorf("entity walk: locality check %s: %s", c, err)
+				continue
+			}
+			if !local {
+				continue
+			}
+		}
+
+		// fetch block and detect entity type
+		children, entityType, err := fetch(ctx, c)
+		if err != nil {
+			log.Errorf("entity walk: fetch %s: %s", c, err)
+			continue
+		}
+
+		// decide whether to descend into children
+		descend := entityType != EntityFile && entityType != EntitySymlink
+		if descend {
+			// reverse so first link is popped next (left-to-right
+			// sibling order, matching WalkDAG and legacy BlockAll)
+			slices.Reverse(children)
+			stack = append(stack, children...)
+		}
+
+		// skip identity CIDs: content is inline, no need to provide.
+		// we still descend (above) so an inlined dag-pb directory's
+		// normal children get provided.
+		if c.Prefix().MhType == mh.IDENTITY {
+			continue
+		}
+
+		if !emit(c) {
+			return nil
+		}
+	}
+
+	return nil
+}


There is a lot of code in common with WalkDAG(). Would it make sense to consolidate them as much as possible?

guillaumemichel · 2026-04-01T08:34:23Z

dag/walker/walker.go

+	stack := []cid.Cid{root}
+
+	for len(stack) > 0 {
+		if ctx.Err() != nil {


nit: since checking ctx.Err() implies acquiring a mutex, we could check it only every 1k operations or similar?

guillaumemichel · 2026-04-01T08:37:51Z

pinning/pinner/dspinner/uniquepinprovider.go

+					log.Errorf("entity provide recursive pins: %s", sc.Err)
+					continue
+				}
+				if err := walker.WalkEntityRoots(ctx, sc.Pin.Key, fetch, emit, walker.WithVisitedTracker(tracker)); err != nil {


This seems to be the only line where NewPinnedEntityRootsProvider() differs from NewUniquePinnedProvider(). Consolidating the function bodies would increase clarity.
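
One possible shape for that consolidation is a shared body that takes the walk step as a parameter. The sketch below is purely illustrative: `walkFunc` and `providePins` are hypothetical names, the toy walkers only mimic "all blocks" versus "roots only", and in the real code each closure would also need to bind its own fetcher type.

```go
package main

import "fmt"

// walkFunc is a hypothetical abstraction over the two walkers: it
// visits the DAG under root and calls emit for each key to provide.
type walkFunc func(root string, emit func(string) bool)

// providePins holds the shared body. Only walk differs between the
// two strategies.
func providePins(pins []string, walk walkFunc) []string {
	var out []string
	emit := func(k string) bool {
		out = append(out, k)
		return true
	}
	for _, p := range pins {
		walk(p, emit)
	}
	return out
}

// walkAll is a toy stand-in for a full block walk.
func walkAll(root string, emit func(string) bool) {
	if emit(root) {
		emit(root + "/chunk")
	}
}

// walkRoots is a toy stand-in for an entity-roots walk.
func walkRoots(root string, emit func(string) bool) {
	emit(root)
}

func main() {
	pins := []string{"file1", "file2"}
	fmt.Println(providePins(pins, walkAll))   // [file1 file1/chunk file2 file2/chunk]
	fmt.Println(providePins(pins, walkRoots)) // [file1 file2]
}
```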

guillaumemichel · 2026-04-01T08:43:07Z

CHANGELOG.md


 ### Fixed

+- `pinner`: `NewUniquePinnedProvider` and `NewPinnedEntityRootsProvider` now log and skip corrupted pin entries instead of aborting the entire provide cycle, allowing remaining pins to still be provided. [#1124](https://github.com/ipfs/boxo/pull/1124)


Should this be in the Added section?

lidel mentioned this pull request Mar 20, 2026

feat(provide): +unique and +entities strategy modifiers ipfs/kubo#11245

Draft

lidel force-pushed the feat/provide-entity-roots-with-dedup branch from c16a5b7 to 4dad6c1 Compare March 20, 2026 21:57

lidel force-pushed the feat/provide-entity-roots-with-dedup branch from 4dad6c1 to c8962fc Compare March 21, 2026 03:15

lidel force-pushed the feat/provide-entity-roots-with-dedup branch from 19bf557 to 224c2ae Compare March 21, 2026 03:46

lidel added 14 commits March 21, 2026 05:23

test: add PrioritizedProvider error-continue regression test

7b8f853

- remove unused daggen variable in uniquepinprovider_test.go

refactor(provider): use labeled break in NewConcatProvider for consis…

685c82e

…tency match the defensive read-side ctx.Done select pattern already used by NewPrioritizedProvider in the same file

test(dag/walker): add symlink entity detection tests

609ff3d

Merge branch 'main' into feat/provide-entity-roots-with-dedup

dcfda13

lidel changed the title ~~feat(dag/walker): BloomTracker experiment~~ feat(dag/walker): opt-in BloomTracker to avoid duplicated walks Mar 23, 2026

gammazero approved these changes Mar 24, 2026


lidel added 4 commits March 25, 2026 23:59

Merge remote-tracking branch 'origin/main' into feat/provide-entity-r…

14c5f91

…oots-with-dedup

fix: address review feedback from gammazero

0fc1a0b

uniquepinprovider: use skip-early style for tracker.Visit in direct pin loops (clearer control flow) visited.go: document that VisitedTracker implementations may be probabilistic, and must keep FP rate negligible or allow callers to adjust it

feat(walker): log bloom tracker creation and autoscaling

577fa3f

log capacity, FP rate, and hash parameters on creation. log previous/new capacity and chain length on autoscale. helps operators understand bloom sizing and detect unexpected growth during reprovide cycles.

feat(walker): add Deduplicated() to BloomTracker and MapTracker

c8920a2

counts Visit() calls that returned false (CID already seen). callers can log this after a reprovide cycle to show how much dedup the bloom filter achieved.

lidel requested a review from gammazero March 29, 2026 22:55

lidel marked this pull request as ready for review March 29, 2026 22:55

lidel requested a review from a team as a code owner March 29, 2026 22:55

guillaumemichel self-requested a review March 31, 2026 13:58

gammazero approved these changes Mar 31, 2026


guillaumemichel approved these changes Apr 1, 2026


3 participants
