Initial sync performance optimizations #195
Open
philippem wants to merge 20 commits into Blockstream:new-index from
Conversation
Previously, initial sync disabled auto-compaction entirely, causing L0 files to accumulate without bound — observed 1 000+ files in practice. With unbounded L0 accumulation, multi_get() must binary-search the index block of every L0 file per key, making lookup_txos() increasingly slow as sync progresses (~700 s/batch with 1 000+ L0 files). Replace the "all or nothing" approach with a bounded bulk-load mode: auto-compaction is always enabled, but the L0 compaction trigger is raised to 64 files (vs the default 4). This keeps write amplification low — compaction fires infrequently in large batches — while capping L0 file count and therefore lookup cost. Slowdown/stop triggers are set at 4×/8× the compaction trigger so writes are never stalled while background compaction catches up. The pending-compaction-bytes stall is disabled to prevent blocking writes against the large initial backlog. When initial sync is complete, reset L0 triggers and pending-compaction stall thresholds to RocksDB defaults for stable steady-state performance.
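The bulk-load settings above can be sketched with the rust-rocksdb `Options` API. This is a minimal illustration, not the PR's exact code; the values follow the ratios stated in the text (trigger 64, slowdown at 4×, stop at 8×, pending-compaction stall disabled).

```rust
use rocksdb::Options;

// Sketch of the bounded bulk-load mode: auto-compaction stays enabled,
// but L0 triggers are raised so compaction fires in large, infrequent batches.
fn bulk_load_options() -> Options {
    let mut opts = Options::default();
    opts.set_level_zero_file_num_compaction_trigger(64); // default is 4
    opts.set_level_zero_slowdown_writes_trigger(64 * 4); // 256 files
    opts.set_level_zero_stop_writes_trigger(64 * 8); // 512 files
    // Disable the pending-compaction-bytes stall (0 = unlimited) so writes
    // are never blocked against the large initial compaction backlog.
    opts.set_soft_pending_compaction_bytes_limit(0);
    opts.set_hard_pending_compaction_bytes_limit(0);
    opts
}
```

At the end of initial sync these triggers would be reset to RocksDB defaults, as described above.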
With up to 64 L0 files accumulating before compaction fires, each open SST file has RocksDB allocating its index and filter blocks on the heap outside the block cache. This memory is not subject to LRU eviction and grows proportionally to the number of open SST files. Setting cache_index_and_filter_blocks(true) routes these blocks through the shared block cache, bounding total memory to --db-block-cache-mb per database. Increase to 4096+ for initial sync to ensure the working set of filter/index blocks stays in cache without thrashing.
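A minimal sketch of this cache routing with rust-rocksdb, assuming the `cache_mb` parameter corresponds to --db-block-cache-mb; the function name is illustrative.

```rust
use rocksdb::{BlockBasedOptions, Cache, Options};

// Route index and filter blocks through the shared LRU block cache so their
// memory is bounded by the cache size, not by the number of open SST files.
fn cache_options(cache_mb: usize) -> Options {
    let mut block_opts = BlockBasedOptions::default();
    let cache = Cache::new_lru_cache(cache_mb * 1024 * 1024);
    block_opts.set_block_cache(&cache);
    block_opts.set_cache_index_and_filter_blocks(true);
    // Pin L0 filter/index blocks so hot lookups cannot evict them.
    block_opts.set_pin_l0_filter_and_index_blocks_in_cache(true);
    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    opts
}
```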
set_bytes_per_sync(1 MiB): background-syncs SST files incrementally, avoiding large fsync stalls at file close.
compact_range() (used by full_compaction at end of initial sync) is single-threaded by default regardless of increase_parallelism(). Setting max_subcompactions to db_parallelism splits the key range across all background threads, reducing the final compaction wall-clock time proportionally to CPU count.
Use compact_range_opt with BottommostLevelCompaction::Force so the final full_compaction pushes all data to the lowest level and writes optimally-merged SST files. Also log elapsed time so it is easy to track how long the end-of-sync compaction takes.
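A sketch of the end-of-sync compaction using rust-rocksdb's `compact_range_opt` (the function name and use of the `log` crate are illustrative; `max_subcompactions` would be set on the `Options` at open time, e.g. `opts.set_max_subcompactions(db_parallelism)`).

```rust
use rocksdb::{BottommostLevelCompaction, CompactOptions, DB};
use std::time::Instant;

// Force all data to the bottommost level and log the elapsed wall-clock time.
fn full_compaction(db: &DB) {
    let start = Instant::now();
    let mut copts = CompactOptions::default();
    copts.set_bottommost_level_compaction(BottommostLevelCompaction::Force);
    // None..None compacts the entire key range; with max_subcompactions > 1
    // the range is split across background threads.
    db.compact_range_opt(None::<&[u8]>, None::<&[u8]>, &copts);
    log::info!("full compaction took {:?}", start.elapsed());
}
```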
LZ4 decompresses roughly 2-3x faster than Snappy with comparable compression ratio. Profiling shows ~13% of cycles spent in Snappy decompression during lookup_txos(), making this a meaningful improvement for read-heavy workloads. Use Zstd for the bottommost level (L6), where data is fully compacted and rarely rewritten. Zstd compresses ~30-40% smaller than LZ4 with comparable decompression speed, reducing final DB size significantly. Existing SST files compressed with Snappy remain readable — RocksDB decompresses them transparently. New files will be written as LZ4/Zstd.
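The per-level compression split can be expressed with `set_compression_per_level`; a sketch assuming the default 7-level LSM:

```rust
use rocksdb::{DBCompressionType, Options};

// LZ4 for L0-L5 (fast decompression on the read path),
// Zstd for the bottommost level L6 (smaller, rarely rewritten).
fn compression_options() -> Options {
    let mut opts = Options::default();
    opts.set_compression_per_level(&[
        DBCompressionType::Lz4,  // L0
        DBCompressionType::Lz4,  // L1
        DBCompressionType::Lz4,  // L2
        DBCompressionType::Lz4,  // L3
        DBCompressionType::Lz4,  // L4
        DBCompressionType::Lz4,  // L5
        DBCompressionType::Zstd, // L6 (bottommost)
    ]);
    opts
}
```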
Increase set_compaction_readahead_size from 1 MiB to 4 MiB. Better amortises syscall overhead when sequentially reading the many accumulated L0 files during the final compaction.
The sort before WriteBatch construction was single-threaded, pegging one core while all others sat idle during the index pass (~2-4M rows per batch). Switch to rayon par_sort_unstable_by so all available cores are used for the sort.
With up to 64 L0 files accumulating before compaction fires, each multi_get() must binary-search the index block of every L0 file whose key range overlaps the query — all of them for random txids. Before the bounded L0 trigger was introduced this was catastrophic (~700 s/batch with 1 000+ files); with 64 files it is still the dominant lookup cost. Add full-key Bloom filters at 10 bits/key (~1 % false-positive rate). With filters, only ~0.64 files on average need actual index or data I/O per key, reducing per-lookup cost by ~100×. The filter blocks are cached and pinned in the block cache alongside index blocks via the set_cache_index_and_filter_blocks and set_pin_l0_filter_and_index_blocks_in_cache settings.
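In rust-rocksdb the full-key Bloom filter is a one-line block-based-table setting; a sketch:

```rust
use rocksdb::BlockBasedOptions;

let mut block_opts = BlockBasedOptions::default();
// 10 bits/key gives a ~1% false-positive rate; `false` selects full
// (not block-based) Bloom filters.
block_opts.set_bloom_filter(10.0, false);
```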
Tests iter_scan, iter_scan_from, iter_scan_reverse, and raw_iterator with both short (1-byte) and full (33-byte) prefixes. Validates correct row counts, prefix boundary enforcement, and no cross-prefix leakage. These tests exercise the exact scan patterns used by electrs: 1-byte prefixes for block headers (b"B") and done markers (b"D"), and 33-byte prefixes for history rows (code + scripthash).
Add a 33-byte fixed prefix extractor (code || hash) so RocksDB builds
prefix Bloom filters instead of full-key filters. Range scans like
iter_scan("H" + scripthash) can now skip L0 SST files whose Bloom
filter doesn't match, rather than checking every file.
With a prefix extractor, iterator scans whose seek key is shorter than
33 bytes would silently miss results. Use a conditional in iter_scan,
iter_scan_from, and iter_scan_reverse: short prefixes (< 33 bytes) use
total_order_seek for correctness, while full-length prefixes use the
prefix bloom for performance. raw_iterator always uses total_order_seek
since callers may seek to arbitrary positions.
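The conditional described above can be sketched as a per-scan `ReadOptions` choice; this is a hedged illustration of the technique, not the PR's exact code, and the function name is hypothetical.

```rust
use rocksdb::{DBIteratorWithThreadMode, Direction, IteratorMode, ReadOptions, DB};

const PREFIX_LEN: usize = 33; // code byte + 32-byte hash

// Full-length prefixes use the prefix Bloom filter to skip SST files;
// shorter seek keys fall back to total-order seek for correctness.
fn iter_scan<'a>(db: &'a DB, prefix: &[u8]) -> DBIteratorWithThreadMode<'a, DB> {
    let mut ropts = ReadOptions::default();
    if prefix.len() < PREFIX_LEN {
        // The fixed 33-byte prefix extractor cannot apply; a prefix seek
        // here would silently miss rows.
        ropts.set_total_order_seek(true);
    } else {
        ropts.set_prefix_same_as_start(true);
    }
    db.iterator_opt(IteratorMode::From(prefix, Direction::Forward), ropts)
}
```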
Requires full re-compaction (delete F marker) to rebuild SST files with
prefix bloom metadata.
Previously, initial sync processed the full batch in two sequential full sweeps: add() wrote T/O rows for every block, then index() read them back to build history rows. This meant peak memory scaled with the entire batch, and the fetch pipeline could not overlap with index work because both passes had to complete before the next fetch began. Merging into a single pass — add then index for each block in turn — reduces peak working-set size and allows the prefetch pipeline to deliver the next batch's blocks while the current batch is still being indexed.
Previously the bitcoind fetcher hard-coded a 100-block batch size. The single-pass add+index approach relies on O rows from add() still being in the write buffer when index() calls lookup_txos(), so larger batches improve cache hit rate for outputs spent within the same batch window. Add --initial-sync-batch-size (default 250) to Config and thread it through IndexerConfig → start_fetcher → bitcoind_fetcher. The blkfiles path is naturally batched by blk file so is unaffected.
Previously parse_blocks() created a new rayon::ThreadPool on every invocation — once per blk*.dat file (~2,900 for a full chain). Each construction spawns N OS threads. Lift the pool out of the per-blob closure so it is created once and reused for the entire blkfiles_parser run, avoiding ~47k thread spawns at 16 cores.
Both add_blocks() and index_blocks() previously called tx.compute_txid() independently, hashing every transaction twice per batch. Compute txids once at BlockEntry construction time (in the fetcher) and store them alongside the block. add_blocks and index_blocks consume the pre-computed slice, and index_transaction receives the txid as a parameter rather than recomputing it.
Increase SyncChannel capacity from 1 to 2 in all three blkfiles stages (reader, parser, fetcher). Each stage can now stay one batch ahead of the next, overlapping sequential disk I/O (reader) with CPU deserialization (parser) and block-entry construction (fetcher). The bitcoind fetcher stays at 1 since its batches can be hundreds of MB.
Add a batch_total label to the existing index_duration histogram that covers the full wall-clock time of each add+index cycle. This makes it easy to see per-batch throughput at a glance without summing the individual step histograms (add_process, add_write, index_lookup, index_process). Only recorded for batches where actual work is done (blocks not yet added or indexed), so no-op iterations do not skew the distribution.
Add rocksdb_num_files_at_level gauge with 'db' and 'level' labels, covering all 7 RocksDB compaction levels (L0-L6).
Change fetcher and indexer progress logging from "blocks fetched / total blocks" to "current height / chain tip height". This is more meaningful during initial sync — users can immediately see how far along the chain the sync has progressed rather than counting opaque batch numbers.
Add initial_sync_height and initial_sync_progress_pct gauges, updated after every batch so sync progress is visible in Prometheus without waiting for the full update_blocks() call to complete.
Sync can be monitored with https://github.com/philippem/electrs-grafana-dashboard
Summary
A series of RocksDB tuning and indexing pipeline improvements targeting initial sync performance. The changes fall into three categories:
RocksDB configuration
- Bounded bulk-load mode: raise the L0 compaction trigger to 64 files instead of disabling auto-compaction, since unbounded L0 accumulation made multi_get() increasingly expensive. Capping L0 count bounds lookup cost while keeping write amplification low.
- Full-key Bloom filters and a 33-byte prefix extractor, with a total_order_seek fallback for short prefixes to preserve correctness
- Parallel sort (rayon par_sort_unstable) in write_rows/delete_rows to utilize all cores
- Incremental bytes_per_sync and forced bottommost compaction

Indexing pipeline
- Configurable batch size (--initial-sync-batch-size, default 250) to tune the trade-off between write buffer utilization and lookup locality

Observability
- Per-level file count gauge (rocksdb_num_files_at_level) for monitoring compaction health
- batch_total timer in the index_duration histogram for at-a-glance per-batch throughput
- initial_sync_height and initial_sync_progress_pct Prometheus gauges