
Initial sync performance optimizations#195

Open
philippem wants to merge 20 commits into Blockstream:new-index from philippem:initial-sync-perf-11

Conversation

@philippem
Collaborator

Initial sync performance optimizations

Summary

A series of RocksDB tuning and indexing pipeline improvements targeting initial sync performance. The changes fall into three categories:

RocksDB configuration

  • Bounded bulk-load mode: Replace fully-disabled compaction with a bounded L0 trigger (64 files). Previously L0 files accumulated without limit (1,000+ observed), making multi_get() increasingly expensive. Capping L0 count bounds lookup cost while keeping
    write amplification low.
  • Bloom filters (10-bit, ~1% FPR) for point lookups, eliminating the need to search every L0 file per key
  • Prefix extractor (33-byte: code || hash) enabling prefix Bloom filters on range scans, with automatic total_order_seek fallback for short prefixes to preserve correctness
  • Cache index/filter blocks in the shared block cache to bound memory usage via LRU eviction
  • LZ4 compression (replacing Snappy) for ~2-3x faster decompression; Zstd for bottommost level for ~30-40% smaller final DB size
  • Parallel subcompactions and larger compaction readahead for faster end-of-sync compaction
  • Parallel sort (par_sort_unstable) in write_rows/delete_rows to utilize all cores
  • Incremental bytes_per_sync and forced bottommost compaction

Indexing pipeline

  • Single-pass add+index: Merge the two separate full-batch sweeps into one per-block pass, reducing peak memory and enabling fetch/index overlap
  • Configurable batch size (--initial-sync-batch-size, default 250) to tune the trade-off between write buffer utilization and lookup locality
  • Pre-computed txids: Compute transaction IDs once at fetch time instead of hashing twice (once in add, once in index)
  • Fetch pipeline improvements: Reuse rayon thread pool across blk*.dat files (avoiding ~47k thread spawns), increased pipeline lookahead for better I/O overlap

Observability

  • Per-level SST file count gauges (rocksdb_num_files_at_level) for monitoring compaction health
  • batch_total timer in index_duration histogram for at-a-glance per-batch throughput
  • initial_sync_height and initial_sync_progress_pct Prometheus gauges
  • Progress logging shows block height instead of opaque batch numbers

Previously, initial sync disabled auto-compaction entirely, causing L0
files to accumulate without bound — observed 1,000+ files in practice.
With unbounded L0 accumulation, multi_get() must binary-search the index
block of every L0 file per key, making lookup_txos() increasingly slow
as sync progresses (~700 s/batch with 1,000+ L0 files).

Replace the "all or nothing" approach with a bounded bulk-load mode:
auto-compaction is always enabled, but the L0 compaction trigger is
raised to 64 files (vs the default 4). This keeps write amplification
low — compaction fires infrequently in large batches — while capping
L0 file count and therefore lookup cost.

Slowdown/stop triggers are set at 4×/8× the compaction trigger so
writes are never stalled while background compaction catches up.
The pending-compaction-bytes stall is disabled to prevent blocking
writes against the large initial backlog.

When initial sync is complete, reset L0 triggers and pending-compaction
stall thresholds to RocksDB defaults for stable steady-state performance.

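In rust-rocksdb terms (the crate electrs builds on), the bounded bulk-load settings above look roughly like the sketch below; the exact option wiring in electrs's DB wrapper may differ:

```rust
use rocksdb::Options;

// Illustrative sketch of the bounded bulk-load mode, not electrs's
// actual code. Compaction stays enabled throughout.
fn bulk_load_options() -> Options {
    let mut opts = Options::default();
    // Let L0 grow to 64 files before compaction fires (default is 4):
    // infrequent, large compactions keep write amplification low while
    // capping the number of files each lookup must consider.
    opts.set_level_zero_file_num_compaction_trigger(64);
    // Slowdown/stop at 4x/8x the trigger so writes are never stalled
    // while background compaction catches up.
    opts.set_level_zero_slowdown_writes_trigger(64 * 4);
    opts.set_level_zero_stop_writes_trigger(64 * 8);
    // Disable the pending-compaction-bytes stall (0 = unlimited) so the
    // large initial backlog cannot block writes.
    opts.set_soft_pending_compaction_bytes_limit(0);
    opts.set_hard_pending_compaction_bytes_limit(0);
    opts
}
```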
With up to 64 L0 files accumulating before compaction fires, each open
SST file has RocksDB allocating its index and filter blocks on the heap
outside the block cache. This memory is not subject to LRU eviction and
grows proportionally to the number of open SST files.

Setting cache_index_and_filter_blocks(true) routes these blocks through
the shared block cache, bounding total memory to --db-block-cache-mb
per database. Increase to 4096+ for initial sync to ensure the working
set of filter/index blocks stays in cache without thrashing.

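A sketch of the cache routing in raw rust-rocksdb terms (the `Cache` constructor and option names are the crate's; electrs wraps this in its own DB layer):

```rust
use rocksdb::{BlockBasedOptions, Cache, Options};

// Illustrative: route index/filter blocks through one shared LRU cache.
fn table_options(cache_mb: usize) -> Options {
    let cache = Cache::new_lru_cache(cache_mb * 1024 * 1024);
    let mut block_opts = BlockBasedOptions::default();
    block_opts.set_block_cache(&cache);
    // Index/filter blocks now live in the shared cache, so total memory
    // is bounded by --db-block-cache-mb and subject to LRU eviction.
    block_opts.set_cache_index_and_filter_blocks(true);
    // Pin L0 index/filter blocks so hot lookups are not evicted.
    block_opts.set_pin_l0_filter_and_index_blocks_in_cache(true);
    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    opts
}
```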
set_bytes_per_sync(1 MiB): background-syncs SST files incrementally,
avoiding large fsync stalls at file close.

compact_range() (used by full_compaction at end of initial sync) is
single-threaded by default regardless of increase_parallelism().
Setting max_subcompactions to db_parallelism splits the key range
across all background threads, reducing the final compaction
wall-clock time proportionally to CPU count.

Use compact_range_opt with BottommostLevelCompaction::Force so the
final full_compaction pushes all data to the lowest level and writes
optimally-merged SST files. Also log elapsed time so it is easy to
track how long the end-of-sync compaction takes.

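The forced, timed final compaction can be sketched with rust-rocksdb names (again illustrative, not electrs's actual `full_compaction`):

```rust
use rocksdb::{BottommostLevelCompaction, CompactOptions, DB};
use std::time::Instant;

// Note: for the range compaction below to be split across background
// threads, max_subcompactions must have been set on the Options used
// to open the DB, e.g. opts.set_max_subcompactions(db_parallelism).
fn full_compaction(db: &DB) {
    let start = Instant::now();
    let mut copts = CompactOptions::default();
    // Force rewriting the bottommost level so all data lands in
    // optimally-merged SST files at the lowest level.
    copts.set_bottommost_level_compaction(BottommostLevelCompaction::Force);
    db.compact_range_opt(None::<&[u8]>, None::<&[u8]>, &copts);
    println!("full compaction finished in {:?}", start.elapsed());
}
```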
LZ4 decompresses roughly 2-3x faster than Snappy with comparable
compression ratio. Profiling shows ~13% of cycles spent in Snappy
decompression during lookup_txos(), making this a meaningful
improvement for read-heavy workloads.

Use Zstd for the bottommost level (L6), where data is fully compacted
and rarely rewritten. Zstd compresses ~30-40% smaller than LZ4 with
comparable decompression speed, reducing final DB size significantly.

Existing SST files compressed with Snappy remain readable — RocksDB
decompresses them transparently. New files will be written as LZ4/Zstd.

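In rust-rocksdb terms, the mixed compression setup is roughly:

```rust
use rocksdb::{DBCompressionType, Options};

// Illustrative sketch of the per-level compression choice.
fn compression_options() -> Options {
    let mut opts = Options::default();
    // LZ4 on the upper levels: ~2-3x faster decompression than Snappy
    // at a comparable ratio, which matters for read-heavy lookup_txos().
    opts.set_compression_type(DBCompressionType::Lz4);
    // Zstd only on the bottommost level, which is written once at the
    // end of compaction and rarely rewritten: ~30-40% smaller output.
    opts.set_bottommost_compression_type(DBCompressionType::Zstd);
    opts
}
```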
Increase set_compaction_readahead_size from 1 MiB to 4 MiB. Better
amortises syscall overhead when sequentially reading the many
accumulated L0 files during the final compaction.

The sort before WriteBatch construction was single-threaded, pegging one
core while all others sat idle during the index pass (~2-4M rows per
batch). Switch to rayon par_sort_unstable_by so all available cores are
used for the sort.

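The change is essentially a one-line swap; rayon's prelude provides the parallel sort, and the row type below is a simplified stand-in for electrs's actual rows:

```rust
use rayon::prelude::*;

// Hypothetical row shape: (key, value) byte pairs.
type Row = (Vec<u8>, Vec<u8>);

fn sort_rows(rows: &mut [Row]) {
    // Was: rows.sort_unstable_by(...), single-threaded.
    // par_sort_unstable_by splits the slice across rayon's thread pool,
    // so the 2-4M-row per-batch sort uses all available cores.
    rows.par_sort_unstable_by(|a, b| a.0.cmp(&b.0));
}
```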
With up to 64 L0 files accumulating before compaction fires, each
multi_get() must binary-search the index block of every L0 file whose
key range overlaps the query — all of them for random txids. Before the
bounded L0 trigger was introduced this was catastrophic (~700 s/batch
with 1,000+ files); with 64 files it is still the dominant lookup cost.

Add full-key Bloom filters at 10 bits/key (~1% false-positive rate).
With filters, only ~0.64 files on average need actual index or data I/O
per key, reducing per-lookup cost by ~100×. The filter blocks are cached
and pinned in the block cache alongside index blocks via the
set_cache_index_and_filter_blocks and
set_pin_l0_filter_and_index_blocks_in_cache settings.

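A rust-rocksdb sketch of the filter configuration (names are the crate's; electrs's wrapper may structure this differently):

```rust
use rocksdb::{BlockBasedOptions, Options};

// Illustrative: full-key Bloom filters plus cached/pinned filter blocks.
fn options_with_bloom() -> Options {
    let mut block_opts = BlockBasedOptions::default();
    // 10 bits/key gives a ~1% false-positive rate, so a multi_get()
    // probe skips L0 files whose filter says the key is absent instead
    // of binary-searching every index block.
    block_opts.set_bloom_filter(10.0, false); // false = full (not block-based) filter
    block_opts.set_cache_index_and_filter_blocks(true);
    block_opts.set_pin_l0_filter_and_index_blocks_in_cache(true);
    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    opts
}
```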
Tests iter_scan, iter_scan_from, iter_scan_reverse, and raw_iterator
with both short (1-byte) and full (33-byte) prefixes. Validates correct
row counts, prefix boundary enforcement, and no cross-prefix leakage.

These tests exercise the exact scan patterns used by electrs: 1-byte
prefixes for block headers (b"B") and done markers (b"D"), and 33-byte
prefixes for history rows (code + scripthash).

Add a 33-byte fixed prefix extractor (code || hash) so RocksDB builds
prefix Bloom filters instead of full-key filters. Range scans like
iter_scan("H" + scripthash) can now skip L0 SST files whose Bloom
filter doesn't match, rather than checking every file.

With a prefix extractor, iterator scans whose seek key is shorter than
33 bytes would silently miss results. Use a conditional in iter_scan,
iter_scan_from, and iter_scan_reverse: short prefixes (< 33 bytes) use
total_order_seek for correctness, while full-length prefixes use the
prefix bloom for performance. raw_iterator always uses total_order_seek
since callers may seek to arbitrary positions.

Requires full re-compaction (delete F marker) to rebuild SST files with
prefix bloom metadata.

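The extractor plus the short-prefix fallback can be sketched as follows (rust-rocksdb names; the `iter_scan` body is a simplified stand-in for electrs's scan helpers):

```rust
use rocksdb::{Direction, IteratorMode, Options, ReadOptions, SliceTransform, DB};

const PREFIX_LEN: usize = 33; // code byte || 32-byte hash

fn options_with_prefix_extractor() -> Options {
    let mut opts = Options::default();
    // Bloom filters are built over the first 33 bytes of each key, so
    // range scans with a full-length prefix can skip non-matching SSTs.
    opts.set_prefix_extractor(SliceTransform::create_fixed_prefix(PREFIX_LEN));
    opts
}

// Short prefixes (e.g. b"B") must fall back to total_order_seek, or a
// prefix-mode iterator would silently miss rows.
fn iter_scan<'a>(db: &'a DB, prefix: &[u8]) -> rocksdb::DBIterator<'a> {
    let mut ropts = ReadOptions::default();
    if prefix.len() < PREFIX_LEN {
        ropts.set_total_order_seek(true);
    }
    db.iterator_opt(IteratorMode::From(prefix, Direction::Forward), ropts)
}
```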
Previously, initial sync processed the full batch in two sequential
full sweeps: add() wrote T/O rows for every block, then index() read
them back to build history rows. This meant peak memory scaled with
the entire batch, and the fetch pipeline could not overlap with index
work because both passes had to complete before the next fetch began.

Merging into a single pass — add then index for each block in turn —
reduces peak working-set size and allows the prefetch pipeline to
deliver the next batch's blocks while the current batch is still
being indexed.

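The restructuring can be illustrated with a toy sketch, where `Block`, `add`, and `index` are stand-ins for the real electrs types rather than its actual API:

```rust
// Toy model of the two-pass -> single-pass change.
struct Block(u32);

fn add(block: &Block, store: &mut Vec<u32>) {
    store.push(block.0); // stands in for writing T/O rows
}

fn index(block: &Block, store: &[u32], history: &mut Vec<u32>) {
    // stands in for reading back the rows add() just wrote
    assert!(store.contains(&block.0));
    history.push(block.0);
}

// Before: add() swept the whole batch, then index() swept it again, so
// peak memory scaled with the batch and fetch could not overlap.
// After: one pass, add then index per block, so only the current
// block's rows need to be hot while the next batch is prefetched.
fn process_batch(batch: &[Block]) -> Vec<u32> {
    let mut store = Vec::new();
    let mut history = Vec::new();
    for block in batch {
        add(block, &mut store);
        index(block, &store, &mut history);
    }
    history
}
```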
Previously the bitcoind fetcher hard-coded a 100-block batch size. The
single-pass add+index approach relies on O rows from add() still being
in the write buffer when index() calls lookup_txos(), so larger batches
improve cache hit rate for outputs spent within the same batch window.

Add --initial-sync-batch-size (default 250) to Config and thread it
through IndexerConfig → start_fetcher → bitcoind_fetcher. The blkfiles
path is naturally batched by blk file so is unaffected.

Previously parse_blocks() created a new rayon::ThreadPool on every
invocation — once per blk*.dat file (~2,900 for a full chain). Each
construction spawns N OS threads. Lift the pool out of the per-blob
closure so it is created once and reused for the entire blkfiles_parser
run, avoiding ~47k thread spawns at 16 cores.

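The shape of the fix, sketched with rayon's builder (`parse_all` and `parse_blocks` are hypothetical names standing in for the blkfiles parser):

```rust
use rayon::ThreadPoolBuilder;

fn parse_blocks(_blob: Vec<u8>) { /* deserialize one blk*.dat blob */ }

// Build the pool once, outside the per-blob loop, and reuse it.
fn parse_all(blobs: Vec<Vec<u8>>) {
    let pool = ThreadPoolBuilder::new().build().unwrap();
    for blob in blobs {
        // Previously a fresh ThreadPool (spawning N OS threads) was
        // constructed here, once per blk*.dat file (~2,900 files).
        pool.install(|| parse_blocks(blob));
    }
}
```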
Both add_blocks() and index_blocks() previously called
tx.compute_txid() independently, hashing every transaction twice per
batch. Compute txids once at BlockEntry construction time (in the
fetcher) and store them alongside the block. add_blocks and
index_blocks consume the pre-computed slice, and index_transaction
receives the txid as a parameter rather than recomputing it.

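The compute-once pattern can be sketched with std types only; here `DefaultHasher` stands in for the real SHA256d txid hash, and `Tx`/`BlockEntry` are simplified stand-ins:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy transaction; the real type is bitcoin::Transaction.
#[derive(Hash)]
struct Tx(Vec<u8>);

// Stand-in for tx.compute_txid(): hash computed exactly once per tx.
fn compute_txid(tx: &Tx) -> u64 {
    let mut h = DefaultHasher::new();
    tx.hash(&mut h);
    h.finish()
}

struct BlockEntry {
    txs: Vec<Tx>,
    txids: Vec<u64>, // computed once at fetch time, reused by add + index
}

impl BlockEntry {
    fn new(txs: Vec<Tx>) -> Self {
        let txids = txs.iter().map(compute_txid).collect();
        BlockEntry { txs, txids }
    }
}
```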
Increase SyncChannel capacity from 1 to 2 in all three blkfiles
stages (reader, parser, fetcher). Each stage can now stay one batch
ahead of the next, overlapping sequential disk I/O (reader) with CPU
deserialization (parser) and block-entry construction (fetcher).
The bitcoind fetcher stays at 1 since its batches can be hundreds of MB.

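std's `sync_channel` shows the effect of the capacity bump (electrs's SyncChannel is its own wrapper, assumed here to behave like a bounded channel):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// With capacity 2, the producer stays one batch ahead of the consumer
// before blocking, overlapping its work with the consumer's; capacity 1
// forces strict lock-step.
fn pipeline() -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(2);
    let producer = thread::spawn(move || {
        for batch in 0..5 {
            tx.send(batch).unwrap(); // blocks only when 2 batches are queued
        }
        // tx dropped here, ending the consumer's iterator
    });
    let out: Vec<u32> = rx.iter().collect();
    producer.join().unwrap();
    out
}
```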
Add a batch_total label to the existing index_duration histogram that
covers the full wall-clock time of each add+index cycle. This makes it
easy to see per-batch throughput at a glance without summing the
individual step histograms (add_process, add_write, index_lookup,
index_process).

Only recorded for batches where actual work is done (blocks not yet
added or indexed), so no-op iterations do not skew the distribution.

Add rocksdb_num_files_at_level gauge with 'db' and 'level' labels,
covering all 7 RocksDB compaction levels (L0-L6).

Change fetcher and indexer progress logging from "blocks fetched / total
blocks" to "current height / chain tip height". This is more meaningful
during initial sync — users can immediately see how far along the chain
the sync has progressed rather than counting opaque batch numbers.

Add initial_sync_height and initial_sync_progress_pct gauges, updated
after every batch so sync progress is visible in Prometheus without
waiting for the full update_blocks() call to complete.

@philippem philippem requested review from RCasatta and shesek March 18, 2026 16:40
@philippem
Collaborator Author

Sync can be monitored with https://github.com/philippem/electrs-grafana-dashboard
