
Initial sync performance optimizations#195

Open
philippem wants to merge 20 commits into Blockstream:new-index from philippem:initial-sync-perf-11

Conversation

@philippem
Collaborator

Initial sync performance optimizations

Summary

A series of RocksDB tuning and indexing pipeline improvements targeting initial sync performance. The changes fall into three categories:

RocksDB configuration

  • Bounded bulk-load mode: Replace fully-disabled compaction with a bounded L0 trigger (64 files). Previously L0 files accumulated without limit (1,000+ observed), making multi_get() increasingly expensive. Capping L0 count bounds lookup cost while keeping
    write amplification low.
  • Bloom filters (10-bit, ~1% FPR) for point lookups, eliminating the need to search every L0 file per key
  • Prefix extractor (33-byte: code || hash) enabling prefix Bloom filters on range scans, with automatic total_order_seek fallback for short prefixes to preserve correctness
  • Cache index/filter blocks in the shared block cache to bound memory usage via LRU eviction
  • LZ4 compression (replacing Snappy) for ~2-3x faster decompression; Zstd for bottommost level for ~30-40% smaller final DB size
  • Parallel subcompactions and larger compaction readahead for faster end-of-sync compaction
  • Parallel sort (par_sort_unstable) in write_rows/delete_rows to utilize all cores
  • Incremental bytes_per_sync and forced bottommost compaction

Indexing pipeline

  • Single-pass add+index: Merge the two separate full-batch sweeps into one per-block pass, reducing peak memory and enabling fetch/index overlap
  • Configurable batch size (--initial-sync-batch-size, default 250) to tune the trade-off between write buffer utilization and lookup locality
  • Pre-computed txids: Compute transaction IDs once at fetch time instead of hashing twice (once in add, once in index)
  • Fetch pipeline improvements: Reuse rayon thread pool across blk*.dat files (avoiding ~47k thread spawns), increased pipeline lookahead for better I/O overlap

Observability

  • Per-level SST file count gauges (rocksdb_num_files_at_level) for monitoring compaction health
  • batch_total timer in index_duration histogram for at-a-glance per-batch throughput
  • initial_sync_height and initial_sync_progress_pct Prometheus gauges
  • Progress logging shows block height instead of opaque batch numbers

Previously, initial sync disabled auto-compaction entirely, causing L0
files to accumulate without bound — observed 1,000+ files in practice.
With unbounded L0 accumulation, multi_get() must binary-search the index
block of every L0 file per key, making lookup_txos() increasingly slow
as sync progresses (~700 s/batch with 1,000+ L0 files).

Replace the "all or nothing" approach with a bounded bulk-load mode:
auto-compaction is always enabled, but the L0 compaction trigger is
raised to 64 files (vs the default 4). This keeps write amplification
low — compaction fires infrequently in large batches — while capping
L0 file count and therefore lookup cost.

Slowdown/stop triggers are set at 4×/8× the compaction trigger so
writes are never stalled while background compaction catches up.
The pending-compaction-bytes stall is disabled to prevent blocking
writes against the large initial backlog.

When initial sync is complete, reset L0 triggers and pending-compaction
stall thresholds to RocksDB defaults for stable steady-state performance.

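In rust-rocksdb terms (the crate electrs builds on), the bounded bulk-load settings above look roughly like the sketch below; the exact option wiring in electrs's DB wrapper may differ:

```rust
use rocksdb::Options;

// Illustrative sketch of the bounded bulk-load mode, not electrs's
// actual code. Compaction stays enabled throughout.
fn bulk_load_options() -> Options {
    let mut opts = Options::default();
    // Let L0 grow to 64 files before compaction fires (default is 4):
    // infrequent, large compactions keep write amplification low while
    // capping the number of files each lookup must consider.
    opts.set_level_zero_file_num_compaction_trigger(64);
    // Slowdown/stop at 4x/8x the trigger so writes are never stalled
    // while background compaction catches up.
    opts.set_level_zero_slowdown_writes_trigger(64 * 4);
    opts.set_level_zero_stop_writes_trigger(64 * 8);
    // Disable the pending-compaction-bytes stall (0 = unlimited) so the
    // large initial backlog cannot block writes.
    opts.set_soft_pending_compaction_bytes_limit(0);
    opts.set_hard_pending_compaction_bytes_limit(0);
    opts
}
```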
With up to 64 L0 files accumulating before compaction fires, each open
SST file has RocksDB allocating its index and filter blocks on the heap
outside the block cache. This memory is not subject to LRU eviction and
grows proportionally to the number of open SST files.

Setting cache_index_and_filter_blocks(true) routes these blocks through
the shared block cache, bounding total memory to --db-block-cache-mb
per database. Increase to 4096+ for initial sync to ensure the working
set of filter/index blocks stays in cache without thrashing.

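A sketch of the cache routing in raw rust-rocksdb terms (the `Cache` constructor and option names are the crate's; electrs wraps this in its own DB layer):

```rust
use rocksdb::{BlockBasedOptions, Cache, Options};

// Illustrative: route index/filter blocks through one shared LRU cache.
fn table_options(cache_mb: usize) -> Options {
    let cache = Cache::new_lru_cache(cache_mb * 1024 * 1024);
    let mut block_opts = BlockBasedOptions::default();
    block_opts.set_block_cache(&cache);
    // Index/filter blocks now live in the shared cache, so total memory
    // is bounded by --db-block-cache-mb and subject to LRU eviction.
    block_opts.set_cache_index_and_filter_blocks(true);
    // Pin L0 index/filter blocks so hot lookups are not evicted.
    block_opts.set_pin_l0_filter_and_index_blocks_in_cache(true);
    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    opts
}
```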
set_bytes_per_sync(1 MiB): background-syncs SST files incrementally,
avoiding large fsync stalls at file close.

compact_range() (used by full_compaction at end of initial sync) is
single-threaded by default regardless of increase_parallelism().
Setting max_subcompactions to db_parallelism splits the key range
across all background threads, reducing the final compaction
wall-clock time proportionally to CPU count.

Use compact_range_opt with BottommostLevelCompaction::Force so the
final full_compaction pushes all data to the lowest level and writes
optimally-merged SST files. Also log elapsed time so it is easy to
track how long the end-of-sync compaction takes.

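The forced, timed final compaction can be sketched with rust-rocksdb names (again illustrative, not electrs's actual `full_compaction`):

```rust
use rocksdb::{BottommostLevelCompaction, CompactOptions, DB};
use std::time::Instant;

// Note: for the range compaction below to be split across background
// threads, max_subcompactions must have been set on the Options used
// to open the DB, e.g. opts.set_max_subcompactions(db_parallelism).
fn full_compaction(db: &DB) {
    let start = Instant::now();
    let mut copts = CompactOptions::default();
    // Force rewriting the bottommost level so all data lands in
    // optimally-merged SST files at the lowest level.
    copts.set_bottommost_level_compaction(BottommostLevelCompaction::Force);
    db.compact_range_opt(None::<&[u8]>, None::<&[u8]>, &copts);
    println!("full compaction finished in {:?}", start.elapsed());
}
```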
LZ4 decompresses roughly 2-3x faster than Snappy with comparable
compression ratio. Profiling shows ~13% of cycles spent in Snappy
decompression during lookup_txos(), making this a meaningful
improvement for read-heavy workloads.

Use Zstd for the bottommost level (L6), where data is fully compacted
and rarely rewritten. Zstd compresses ~30-40% smaller than LZ4 with
comparable decompression speed, reducing final DB size significantly.

Existing SST files compressed with Snappy remain readable — RocksDB
decompresses them transparently. New files will be written as LZ4/Zstd.

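In rust-rocksdb terms, the mixed compression setup is roughly:

```rust
use rocksdb::{DBCompressionType, Options};

// Illustrative sketch of the per-level compression choice.
fn compression_options() -> Options {
    let mut opts = Options::default();
    // LZ4 on the upper levels: ~2-3x faster decompression than Snappy
    // at a comparable ratio, which matters for read-heavy lookup_txos().
    opts.set_compression_type(DBCompressionType::Lz4);
    // Zstd only on the bottommost level, which is written once at the
    // end of compaction and rarely rewritten: ~30-40% smaller output.
    opts.set_bottommost_compression_type(DBCompressionType::Zstd);
    opts
}
```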
Increase set_compaction_readahead_size from 1 MiB to 4 MiB. Better
amortises syscall overhead when sequentially reading the many
accumulated L0 files during the final compaction.

The sort before WriteBatch construction was single-threaded, pegging one
core while all others sat idle during the index pass (~2-4M rows per
batch). Switch to rayon par_sort_unstable_by so all available cores are
used for the sort.

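The change is essentially a one-line swap; rayon's prelude provides the parallel sort, and the row type below is a simplified stand-in for electrs's actual rows:

```rust
use rayon::prelude::*;

// Hypothetical row shape: (key, value) byte pairs.
type Row = (Vec<u8>, Vec<u8>);

fn sort_rows(rows: &mut [Row]) {
    // Was: rows.sort_unstable_by(...), single-threaded.
    // par_sort_unstable_by splits the slice across rayon's thread pool,
    // so the 2-4M-row per-batch sort uses all available cores.
    rows.par_sort_unstable_by(|a, b| a.0.cmp(&b.0));
}
```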
With up to 64 L0 files accumulating before compaction fires, each
multi_get() must binary-search the index block of every L0 file whose
key range overlaps the query — all of them for random txids. Before the
bounded L0 trigger was introduced this was catastrophic (~700 s/batch
with 1,000+ files); with 64 files it is still the dominant lookup cost.

Add full-key Bloom filters at 10 bits/key (~1% false-positive rate).
With filters, only ~0.64 files on average need actual index or data I/O
per key, reducing per-lookup cost by ~100×. The filter blocks are cached
and pinned in the block cache alongside index blocks via the
set_cache_index_and_filter_blocks and
set_pin_l0_filter_and_index_blocks_in_cache settings.

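A rust-rocksdb sketch of the filter configuration (names are the crate's; electrs's wrapper may structure this differently):

```rust
use rocksdb::{BlockBasedOptions, Options};

// Illustrative: full-key Bloom filters plus cached/pinned filter blocks.
fn options_with_bloom() -> Options {
    let mut block_opts = BlockBasedOptions::default();
    // 10 bits/key gives a ~1% false-positive rate, so a multi_get()
    // probe skips L0 files whose filter says the key is absent instead
    // of binary-searching every index block.
    block_opts.set_bloom_filter(10.0, false); // false = full (not block-based) filter
    block_opts.set_cache_index_and_filter_blocks(true);
    block_opts.set_pin_l0_filter_and_index_blocks_in_cache(true);
    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    opts
}
```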
Tests iter_scan, iter_scan_from, iter_scan_reverse, and raw_iterator
with both short (1-byte) and full (33-byte) prefixes. Validates correct
row counts, prefix boundary enforcement, and no cross-prefix leakage.

These tests exercise the exact scan patterns used by electrs: 1-byte
prefixes for block headers (b"B") and done markers (b"D"), and 33-byte
prefixes for history rows (code + scripthash).

Add a 33-byte fixed prefix extractor (code || hash) so RocksDB builds
prefix Bloom filters instead of full-key filters. Range scans like
iter_scan("H" + scripthash) can now skip L0 SST files whose Bloom
filter doesn't match, rather than checking every file.

With a prefix extractor, iterator scans whose seek key is shorter than
33 bytes would silently miss results. Use a conditional in iter_scan,
iter_scan_from, and iter_scan_reverse: short prefixes (< 33 bytes) use
total_order_seek for correctness, while full-length prefixes use the
prefix bloom for performance. raw_iterator always uses total_order_seek
since callers may seek to arbitrary positions.

Requires full re-compaction (delete F marker) to rebuild SST files with
prefix bloom metadata.

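The extractor plus the short-prefix fallback can be sketched as follows (rust-rocksdb names; the `iter_scan` body is a simplified stand-in for electrs's scan helpers):

```rust
use rocksdb::{Direction, IteratorMode, Options, ReadOptions, SliceTransform, DB};

const PREFIX_LEN: usize = 33; // code byte || 32-byte hash

fn options_with_prefix_extractor() -> Options {
    let mut opts = Options::default();
    // Bloom filters are built over the first 33 bytes of each key, so
    // range scans with a full-length prefix can skip non-matching SSTs.
    opts.set_prefix_extractor(SliceTransform::create_fixed_prefix(PREFIX_LEN));
    opts
}

// Short prefixes (e.g. b"B") must fall back to total_order_seek, or a
// prefix-mode iterator would silently miss rows.
fn iter_scan<'a>(db: &'a DB, prefix: &[u8]) -> rocksdb::DBIterator<'a> {
    let mut ropts = ReadOptions::default();
    if prefix.len() < PREFIX_LEN {
        ropts.set_total_order_seek(true);
    }
    db.iterator_opt(IteratorMode::From(prefix, Direction::Forward), ropts)
}
```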
Previously, initial sync processed the full batch in two sequential
full sweeps: add() wrote T/O rows for every block, then index() read
them back to build history rows. This meant peak memory scaled with
the entire batch, and the fetch pipeline could not overlap with index
work because both passes had to complete before the next fetch began.

Merging into a single pass — add then index for each block in turn —
reduces peak working-set size and allows the prefetch pipeline to
deliver the next batch's blocks while the current batch is still
being indexed.

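The restructuring can be illustrated with a toy sketch, where `Block`, `add`, and `index` are stand-ins for the real electrs types rather than its actual API:

```rust
// Toy model of the two-pass -> single-pass change.
struct Block(u32);

fn add(block: &Block, store: &mut Vec<u32>) {
    store.push(block.0); // stands in for writing T/O rows
}

fn index(block: &Block, store: &[u32], history: &mut Vec<u32>) {
    // stands in for reading back the rows add() just wrote
    assert!(store.contains(&block.0));
    history.push(block.0);
}

// Before: add() swept the whole batch, then index() swept it again, so
// peak memory scaled with the batch and fetch could not overlap.
// After: one pass, add then index per block, so only the current
// block's rows need to be hot while the next batch is prefetched.
fn process_batch(batch: &[Block]) -> Vec<u32> {
    let mut store = Vec::new();
    let mut history = Vec::new();
    for block in batch {
        add(block, &mut store);
        index(block, &store, &mut history);
    }
    history
}
```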
Previously the bitcoind fetcher hard-coded a 100-block batch size. The
single-pass add+index approach relies on O rows from add() still being
in the write buffer when index() calls lookup_txos(), so larger batches
improve cache hit rate for outputs spent within the same batch window.

Add --initial-sync-batch-size (default 250) to Config and thread it
through IndexerConfig → start_fetcher → bitcoind_fetcher. The blkfiles
path is naturally batched by blk file so is unaffected.

Previously parse_blocks() created a new rayon::ThreadPool on every
invocation — once per blk*.dat file (~2,900 for a full chain). Each
construction spawns N OS threads. Lift the pool out of the per-blob
closure so it is created once and reused for the entire blkfiles_parser
run, avoiding ~47k thread spawns at 16 cores.

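The shape of the fix, sketched with rayon's builder (`parse_all` and `parse_blocks` are hypothetical names standing in for the blkfiles parser):

```rust
use rayon::ThreadPoolBuilder;

fn parse_blocks(_blob: Vec<u8>) { /* deserialize one blk*.dat blob */ }

// Build the pool once, outside the per-blob loop, and reuse it.
fn parse_all(blobs: Vec<Vec<u8>>) {
    let pool = ThreadPoolBuilder::new().build().unwrap();
    for blob in blobs {
        // Previously a fresh ThreadPool (spawning N OS threads) was
        // constructed here, once per blk*.dat file (~2,900 files).
        pool.install(|| parse_blocks(blob));
    }
}
```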
Both add_blocks() and index_blocks() previously called
tx.compute_txid() independently, hashing every transaction twice per
batch. Compute txids once at BlockEntry construction time (in the
fetcher) and store them alongside the block. add_blocks and
index_blocks consume the pre-computed slice, and index_transaction
receives the txid as a parameter rather than recomputing it.

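The compute-once pattern can be sketched with std types only; here `DefaultHasher` stands in for the real SHA256d txid hash, and `Tx`/`BlockEntry` are simplified stand-ins:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy transaction; the real type is bitcoin::Transaction.
#[derive(Hash)]
struct Tx(Vec<u8>);

// Stand-in for tx.compute_txid(): hash computed exactly once per tx.
fn compute_txid(tx: &Tx) -> u64 {
    let mut h = DefaultHasher::new();
    tx.hash(&mut h);
    h.finish()
}

struct BlockEntry {
    txs: Vec<Tx>,
    txids: Vec<u64>, // computed once at fetch time, reused by add + index
}

impl BlockEntry {
    fn new(txs: Vec<Tx>) -> Self {
        let txids = txs.iter().map(compute_txid).collect();
        BlockEntry { txs, txids }
    }
}
```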
Increase SyncChannel capacity from 1 to 2 in all three blkfiles
stages (reader, parser, fetcher). Each stage can now stay one batch
ahead of the next, overlapping sequential disk I/O (reader) with CPU
deserialization (parser) and block-entry construction (fetcher).
The bitcoind fetcher stays at 1 since its batches can be hundreds of MB.

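std's `sync_channel` shows the effect of the capacity bump (electrs's SyncChannel is its own wrapper, assumed here to behave like a bounded channel):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// With capacity 2, the producer stays one batch ahead of the consumer
// before blocking, overlapping its work with the consumer's; capacity 1
// forces strict lock-step.
fn pipeline() -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(2);
    let producer = thread::spawn(move || {
        for batch in 0..5 {
            tx.send(batch).unwrap(); // blocks only when 2 batches are queued
        }
        // tx dropped here, ending the consumer's iterator
    });
    let out: Vec<u32> = rx.iter().collect();
    producer.join().unwrap();
    out
}
```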
Add a batch_total label to the existing index_duration histogram that
covers the full wall-clock time of each add+index cycle. This makes it
easy to see per-batch throughput at a glance without summing the
individual step histograms (add_process, add_write, index_lookup,
index_process).

Only recorded for batches where actual work is done (blocks not yet
added or indexed), so no-op iterations do not skew the distribution.

Add rocksdb_num_files_at_level gauge with 'db' and 'level' labels,
covering all 7 RocksDB compaction levels (L0-L6).

Change fetcher and indexer progress logging from "blocks fetched / total
blocks" to "current height / chain tip height". This is more meaningful
during initial sync — users can immediately see how far along the chain
the sync has progressed rather than counting opaque batch numbers.

Add initial_sync_height and initial_sync_progress_pct gauges, updated
after every batch so sync progress is visible in Prometheus without
waiting for the full update_blocks() call to complete.

@philippem philippem requested review from RCasatta and shesek March 18, 2026 16:40
@philippem
Collaborator Author

Sync can be monitored with https://github.com/philippem/electrs-grafana-dashboard
