Update#36
Merged
Merged
Conversation
Background: the async path streams candidates from a child process into
a two-level cands_top table and scores them on a persistent background
scoring thread using a pool of parallel worker pthreads. When 50+ million
candidates are in the pool, two allocations inside scoring_thread_fn
previously scaled linearly with pool size:
1. char **snap = malloc(scount * sizeof *snap) — a flat snapshot of all
candidate pointers. At 55.9 M candidates this is 447 MB allocated in
a single malloc call, causing VM pressure stalls lasting 17+ seconds.
2. struct AsyncScoringBatch *batches — each batch embedded xs[BATCH_SIZE]
(2048 × 16 B = 32 KB) inline. The array grew to 32768 entries via
doubling, reaching 32768 × 32 KB = 1 GB before any worker ran.
Both allocations existed since the original batch-parallel design and
survived the chunked cands_top fix (commit 9883b51, which only addressed
the reader's realloc stall). A 17-second gap in the C-side debug log
between the last reader allocation and the first post-reader scoring
event confirmed the scoring thread was the culprit.
Fix: replace the batch/snapshot design with a range-based worker design.
Workers no longer receive pre-filled string arrays. Instead:
- AsyncScoringRange { size_t from; size_t to; } (16 bytes each) replaces
AsyncScoringBatch (32 KB each). The range array for 55 M candidates at
BATCH_SIZE=2048 is ~430 KB rather than 1 GB.
- Workers resolve candidate pointers on the fly from cands_top via the
shift+mask accessor (gi >> CANDS_BLOCK_SHIFT, gi & CANDS_BLOCK_MASK).
This is safe because entries at index i < pool_count are immutably
written by the reader and never moved; the scoring thread captures
pool_count under s->mu before spawning workers.
- Per-worker result buffers grow only as matches are found, bounded by
ceil(limit / num_workers + 1) × 4. For 10000-candidate limit across
9 workers, each worker buffer is ≤ 71 KB; total flat memory after merge
is ≤ 640 KB regardless of pool size.
- The char **snap intermediate copy is eliminated entirely; workers read
directly from cands_top.
Memory footprint for 55 M candidates (limit=10000, 9 workers):
Allocation Old New
snap 447 MB 0 (removed)
batches array ~1 GB ~430 KB
per-worker results N/A ≤71 KB each
flat after merge same ≤640 KB
Also adds a 20-entry LRU result cache (SharedIdx, ref-counted) that maps
(filter, pool_gen) to a scored snapshot. Exact cache hits return without
scheduling a new scoring run. Prefix cache hits restrict the next scoring
run to the cached entry's matched indices plus newly-arrived candidates
(delta refinement), avoiding a full re-score of the entire pool on each
keystroke during incremental narrowing (e.g. "f" → "fo" → "foo").
Background: fzf-async previously delegated candidate highlighting to
Elisp via completion-pcm--hilit-commonality. For a 32-character query,
the Elisp helper fzf-async--make-fzf-highlight-pattern built a 63-element
PCM pattern of interleaved `any' and character tokens. Emacs compiled this
to a regex with 32 nested .* groups, causing exponential backtracking on
non-matching candidates — the common case for most results. With 1038
candidates and a 32-char query this produced an 18-second hang on the
Emacs main thread, confirmed in the C-side debug log (18-second gap at
the fzf_native_async_candidates call site, all other threads idle).
Fix: highlight from C after building the result list, using the
already-compiled fzf_pattern_t and fzf_get_positions.
Implementation in fzf_native_async_candidates:
Two defcustoms are read via symbol-value at the top of each call:
fzf-async-highlight — nil disables; non-nil enables
fzf-async-highlight-max-candidates — cap on candidates highlighted
(later merged into fzf-async-highlight
as the tri-value nil/t/N form)
A fresh fzf_pattern_t is parsed from filter_for_hilit (a strdup of the
filter string taken before filter ownership transfers to the scoring
thread), and one fzf_slab_t is allocated for the highlight pass.
For each of the top min(rcount, highlight_cap) candidates:
- fzf_get_positions(str, pattern, slab) returns pos->data[] in
descending byte-offset order (highest position first).
- The loop iterates ascending (index from size-1 to 0), merging
adjacent offsets into contiguous runs.
- One put-text-property call per run applies face=completions-common-part
to the Emacs string object before it is consed into the result list.
Merging runs minimises the number of text-property operations, which
matters most for dense matches (e.g. exact-substring queries).
Two global emacs_value handles — Qface and Qcompletions_common_part —
are interned once at module init and reused on every call.
Pattern and slab are freed before returning. filter_for_hilit is always
freed regardless of whether highlighting is enabled.
Complexity: O(query_len × highlight_cap) vs O(2^query_len × N) for the
PCM regex path. For 200 candidates and a 32-char query: ~6400 fzf position
lookups vs. the exponential blowup that caused the 18-second hang.
Note: fzf positions are byte offsets; put-text-property uses character
positions. For ASCII candidates (the overwhelming majority of find/rg/git
output) these are identical. Multi-byte UTF-8 candidates may have slightly
misaligned face ranges — the same limitation shared by fussy's fzf-based
highlighting.
…ader
Long lines — minified JavaScript, base64 payloads, binary-adjacent grep
output — slow fzf_get_score (more characters to align), inflate the arena,
and produce unreadable candidates in the completion UI. This commit adds
a configurable line-length gate applied in the reader thread, before any
candidate enters the pool.
The gate is controlled by fzf-async-max-line-length, a defcustom read by
fzf_native_async_start on the main thread:
nil no limit (default, preserves current behavior)
t built-in default of 512 characters
+N exclude lines longer than N characters
-N include but truncate to N characters
The value is converted to a signed ptrdiff_t stored as
AsyncSession.max_line_length (0 = off, >0 = exclude, <0 = truncate).
The write completes before pthread_create, so the reader thread sees the
value with no additional synchronization.
In async_reader, after ANSI stripping:
ptrdiff_t mll = s->max_line_length;
if (mll != 0) {
ptrdiff_t cap = mll > 0 ? mll : -mll;
if ((ptrdiff_t)len > cap) {
if (mll > 0) continue; // exclude
len = (size_t)cap; // truncate
line[len] = '\0';
}
}
Truncation writes into the on-stack line[] buffer before arena_strdup, so
the arena never stores more than cap characters per candidate. Exclusion
skips the arena_strdup and cands_top append entirely.
Character vs. byte: strlen() is used for the length check. For the common
case (ASCII output from find, rg, git ls-files) bytes equal characters.
For mixed-script content the check is conservative — it may admit lines
slightly longer in character count than the configured cap. This matches
user intuition for the primary use case.
We're not returning indices anymore.
Before it was /bin/sh and used a minimal $PATH so we had to fully qualify binaries e.g. /opt/homebrew/bin/fd instead of just fd
This reverts commit cba91ff.
Per-session LRU result cache that maps query strings to scored snapshots, sitting on top of the existing batch-parallel scoring path. Reintroduces just the cache half of the reverted 76b7318 — none of the range-based-worker / per_worker_limit machinery that caused the under-counting bugs. Cache anatomy: - SharedIdx: refcounted flexible-array struct of uint32_t holding the full set of matched candidate indices for a scoring run. Allocated by the scoring thread on publish, retained in O(1) by lookup consumers under the cache mutex (no memcpy), freed by the last consumer. - CacheEntry { query, pool_gen, top[K], m_idx } in a doubly-linked LRU list, mutex-protected, MRU at head. Successful lookups bump the entry to MRU. Inserts pre-allocate everything outside the mutex; the critical section is just pointer swaps + LRU manipulation. - subsumes(Q', Q): byte-prefix match + reject any '|'. OR queries can never serve as refinement sources because adding an OR alternate widens the result set unpredictably. - cache_lookup_exact: exact query match; returns cached top-K + bumped SharedIdx + pool_gen. cache_lookup_prefix: longest subsuming Q' with non-NULL m_idx (entries from OR queries are skipped). - cache_insert: drops m_idx for OR queries (still inserts the entry so it can serve future exact lookups, but it can't be a refinement source). Scoring-thread integration: - ScoredStr gained a uint32_t idx field. The batch-construction loop populates it from a parallel snap_idx[] array; the worker preserves it via existing struct copy; counting_sort_scored preserves it the same way. - Refinement mode: when score_req_refine_idx is non-NULL, the snap is the union of those indices and s->cands[refine_delta_from..count] instead of the full pool. For typing past the first 2-3 chars this is typically <1% of pool size. - On publish, builds the full match-set index array (all pos matches, not just the top-K emitted) and calls cache_insert with it, so a later subsuming query can refine over the complete prior match set. Dispatch (fzf_native_async_candidates): Lookup result | Display | Scoring scheduled ---------------------+----------------------+---------------------------- Exact, pool_gen==now | Cached top-K | None (no work) Exact, pool_gen<now | Cached top-K | Refine on m_idx + delta Prefix hit | Prefix's top-K | Refine on prefix m_idx + d Miss | Current score_results | Full pool scoring Same-filter abort suppression preserved (timer re-triggers don't interrupt in-flight work on the matching query). Defcustom: fzf-async-cache-size (default 20) re-added to fzf-async.el; read in fzf-native-async-start via symbol-value at session start. Tests: 8 new ctests covering the exact-match cache surface (lookup miss/hit, in-place update, LRU eviction at capacity, MRU touch on hit, zero-count entries, pool_gen reporting). Phase-2 tests for subsumes() and cache_lookup_prefix() will follow when v2 (term-set subsumption, in-term backspace) lands. 31 ctests total, all passing. 31 ERT tests also pass. Measured behavior on 63.4M-candidate fzf-async-rg session: typing past ~8 chars drives refine scans down 100x+ vs full-pool scans (e.g. ~460K scanned vs ~47M pool); typing past ~18 chars drives them down 200,000x+ (e.g. 283 scanned vs 63.4M pool). Backspace and timer re-ticks on a stable query trigger zero scoring work.
term-set subsumption + larger default
Layered on top of phase 1 (commit ed25fc7). Three changes:
1. Term-set subsumption (replaces byte-prefix-only rule with byte-
prefix OR term-set).
Phase 1's subsumes(Q', Q) used byte-prefix matching: Q' subsumes Q
iff Q' is a literal prefix of Q (and neither contains '|'). This
caught extending a term, adding AND terms at the end, etc., but
missed cases where Q' subsumes Q semantically:
- Adding an AND term in non-prefix position: fo -> x fo
- Term reordering: foo bar -> bar foo (same term set, different
textual order)
- Non-prefix negation/anchor: fo -> !x fo
Phase 2 adds subsumes_pattern(P', P) operating on parsed
fzf_pattern_t structures. Rule: every term-set in P' must have an
equivalent term-set in P. Equivalent = same algorithm (fn function
pointer), same inv flag, same case_sensitive flag, same strcmp(ptr).
fzf semantics gotcha: within a term-set = OR (terms are
alternatives); across term-sets = AND. So "foo bar" parses as 2
sets x 1 term (AND), and "foo | bar" parses as 1 set x 2 terms
(OR). subsumes_pattern rejects any term-set with >1 term — these
can never serve as refinement sources.
cache_lookup_prefix now uses byte-prefix OR term-set — both rules
are valid superset relations and we want the union. Best entry =
most term-sets (most constraints = smallest match set = fastest
refinement scan), with byte-length as tiebreaker.
To avoid re-parsing on every iteration of the lookup scan,
CacheEntry gained a fzf_pattern_t *parsed field populated on
insert (and freed in cache_entry_free). fzf_parse_pattern mutates
its input, so parse_query_for_cache strdups before parsing — the
returned pattern is self-contained.
2. Default cache size 20 -> 40.
Doubles the typing-trail kept in LRU. Helps backspace coverage:
backspacing past N keystrokes still hits the LRU as long as those
intermediate queries weren't evicted by unrelated lookups. C
fallback in cache_init and async_start both bumped; the matching
defcustom default change is in fzf-async.
3. Tests.
12 new ctests covering subsumes_pattern (extending via byte-prefix,
adding at end/start, reorder, negation, OR rejection, distinct
terms) and cache_lookup_prefix v2 paths (term-subset, reordered,
picks-most-terms, skips-OR-in-query, skips-exact-match). Plus 2
new ERT tests:
- fzf-native-async-cache-prefix-refinement-test: typing
progression fo -> foo -> backspace to fo, verify backspace
returns same set as initial (covers exact-cache-hit-on-
backspace via larger LRU).
- fzf-native-async-cache-term-reorder-test: foo bar and bar foo
return identical sets (exercises subsumes_pattern term
reordering end-to-end).
Totals: 43 ctests pass (was 31, +12 phase 2); 33 ERT tests pass
(was 31, +2 cache E2E).
Measured behavior on 63.4M-candidate fzf-async-rg session typing a
6-AND-term query: each new term refines from the previous match set,
final scan = 10,669 candidates (vs 63.4M for full scan, ~5,940x
speedup). v2's most-terms preference correctly picks the most-
restrictive prior entry as refinement source on every step.
— `Makefile:81` now passes `-D_POSIX_C_SOURCE=200809L`, which exposes `strdup`'s declaration in `<string.h>` so the return value is correctly typed as `char *`.
The flat s->cands pointer array (and its doubling realloc) was the last remaining O(N) allocation in the async path. At 33M candidates, growing to 67M slots requires a single 537 MB malloc plus a 264 MB transient memcpy. Under macOS memory-compressor pressure the kernel cannot find that much contiguous memory quickly, and every thread in the process stalls — Hang dangduc#2 in the project history. Replace the flat array with a two-level table: cands_top[] : CANDS_TOP_CAP slots (4096) × 8 B = 32 KB inline, zero-initialized at session start, never grown. cands_top[i] : CANDS_BLOCK_SIZE (256K) pointers × 8 B = 2 MB block, allocated lazily by the reader on first write. Index split: i = (hi << SHIFT) + lo hi = i >> 18 → which block lo = i & 0x3FFFF → which slot in that block Both single-cycle CPU instructions because BLOCK_SIZE is a power of 2. Largest single allocation the reader ever does: 2 MB regardless of pool size. macOS's compressor satisfies a 2 MB allocation in microseconds even under heavy pressure; a 537 MB allocation can stall for seconds. 1 G candidate ceiling (4096 × 256K = 2^32) is well past the practical limit of any realistic shell command. Reader changes (async_reader): - Compute (hi, lo) from s->count BEFORE arena_strdup. The reader is the sole writer to s->count, so reading without s->mu is safe. - Cap check first: if hi >= CANDS_TOP_CAP, drop the line entirely (don't even arena-allocate) and log verbosely with line preview — hitting 1 G candidates is so far outside expected behavior that it almost certainly indicates a broken upstream command (infinite loop, runaway find on a cyclic FS, etc.), and we want the cause obvious in the log. - Pre-allocate the new 2 MB block OUTSIDE s->mu. Doing the malloc under the lock would let a slow allocation stall the scoring thread's snapshot path — which is exactly the original Hang dangduc#2 problem. - Take s->mu briefly to publish the block pointer (if newly allocated) and write the slot + increment count. Scoring thread changes (scoring_thread_fn): - Full-pool snapshot walks block-by-block, doing a flat memcpy within each block. Boundary-crossing cost paid once per block (~250 times for a 60M pool — basically free) while inner loops match flat-array speed. - Refine path's matched_idx and delta loops resolve via the shift+mask accessor: snap[w] = s->cands_top[gi >> SHIFT][gi & MASK]. Random access pays one extra L1 cache line load on first access to each block, negligible vs string-comparison cost. Teardown (async_session_destroy): - Walk cands_top[0..CAP-1] and free each non-NULL block. Strings still freed in O(chunks) by arena_free. Init (fzf_native_async_start): - No initial allocation needed. cands_top is zero-initialized by the calloc that allocates the AsyncSession itself. ASYNC_INIT_CAP removed.
Two related freshness fixes for the prompt overlay during cache hits. Without these, the displayed counts could lag behind the visible state by several seconds — visible to the user as the prompt showing e.g. [27815204](46291615) for ~3 seconds while they type past the prefix into a longer query against a streaming pool that has grown to 55M. A. last_total now tracks the *current* pool, not the pool size at scoring-publish time. Without this, the TOTAL displayed in the prompt freezes at the last scored value, lagging behind the streaming counter visible elsewhere. Cheap fix: the dispatch path already reads s->count under s->mu for the pool-size check; piggy- back a write of s->last_total under score_res_mu in the same call. B. last_filtered on cache hits is now set from the cached entry's full match-set count (m_idx->count) — which describes the candidate set the user is actually looking at right now (the cached top-K we just returned). Previously last_filtered held whatever the most recent scoring run published, which during prefix hits could be a completely different query's count. For OR-query entries (m_idx == NULL, can't be refinement sources) we fall back to top_count. On cache miss we leave last_filtered alone — scoring will publish a fresh value shortly, and meanwhile the existing value is at least consistent with the score_results we're falling back to display. The race with the scoring thread is benign: scoring publishes its authoritative values before bumping gen, so any time-ordering of our write vs scoring's write that ends with scoring last (the common case) results in scoring's fresh values being shown. The other ordering shows our cache-derived values briefly until the next gen bump triggers a re-display.
Currently the C module reads four knobs via symbol-value from three different package namespaces: fussy-fzf-native-highlight (fussy) fzf-async-highlight (fzf-async) fzf-async-max-line-length (fzf-async) fzf-async-cache-size (fzf-async) This leaks the layering the wrong direction — the lowest-level package shouldn't have to know symbol names from two higher-level packages. Move all four to the fzf-native namespace: fzf-native-batch-highlight (sync path, default 25) fzf-native-async-highlight (async path, default 200) fzf-native-max-line-length (async reader, default t) fzf-native-async-cache-size (async start, default 40) C reads now hit only fzf-native-* names. Higher-level packages keep their existing user-facing defcustoms and bridge the values onto these canonical names — fussy via setq-local (synchronous, same-buffer call pattern) and fzf-async via :around advice on the C entry points (timer-driven, cross-buffer). Those bridges land in the respective package commits. Naming convention: - "batch-" prefix marks the synchronous score/score-all path. - "async-" prefix marks the streaming async path. - max-line-length has no prefix because it conceptually belongs to the line-stream itself; kept short. The companion fzf-async / fussy bridge commits make this fully backward-compatible — users continue to set fussy-fzf-native-highlight or fzf-async-highlight as before, and the canonical name picks up the package-specific value at call time.
The defcustom previously used `t' as a sentinel meaning "use the
built-in default of 512", with the actual integer hardcoded in C.
Two related cleanups:
1. Make the defcustom value the actual integer. No `t' sentinel,
no hardcoded fallback in C. Type is `(choice nil integer)':
nil -> no limit
positive N -> exclude lines longer than N characters
negative -N -> include but truncate lines to N characters
2. Drop the `else if (env->eq(env, val, Qt)) s->max_line_length = 256'
branch in `fzf_native_async_start'. The C side now reads the
integer directly via `extract_integer'; the default lives where
it should — in fzf-native.el as the defcustom's :type/value, not
as a magic fallback in the dynamic module.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
with these changes, we can run fzf-async-rg in ~/ which is 60 million candidates for me