Update by jojojames · Pull Request #36 · dangduc/fzf-native

jojojames · 2026-05-09T15:10:27Z

With these changes, we can run fzf-async-rg in ~/, which is 60 million candidates for me.

Background: the async path streams candidates from a child process into a two-level cands_top table and scores them on a persistent background scoring thread using a pool of parallel worker pthreads. When 50+ million candidates are in the pool, two allocations inside scoring_thread_fn previously scaled linearly with pool size:

1. char **snap = malloc(scount * sizeof *snap): a flat snapshot of all candidate pointers. At 55.9 M candidates this is 447 MB allocated in a single malloc call, causing VM pressure stalls lasting 17+ seconds.
2. struct AsyncScoringBatch *batches: each batch embedded xs[BATCH_SIZE] (2048 × 16 B = 32 KB) inline. The array grew to 32768 entries via doubling, reaching 32768 × 32 KB = 1 GB before any worker ran.

Both allocations existed since the original batch-parallel design and survived the chunked cands_top fix (commit 9883b51, which only addressed the reader's realloc stall). A 17-second gap in the C-side debug log between the last reader allocation and the first post-reader scoring event confirmed the scoring thread was the culprit.

Fix: replace the batch/snapshot design with a range-based worker design. Workers no longer receive pre-filled string arrays. Instead:

- AsyncScoringRange { size_t from; size_t to; } (16 bytes each) replaces AsyncScoringBatch (32 KB each). The range array for 55 M candidates at BATCH_SIZE=2048 is ~430 KB rather than 1 GB.
- Workers resolve candidate pointers on the fly from cands_top via the shift+mask accessor (gi >> CANDS_BLOCK_SHIFT, gi & CANDS_BLOCK_MASK). This is safe because entries at index i < pool_count are immutably written by the reader and never moved; the scoring thread captures pool_count under s->mu before spawning workers.
- Per-worker result buffers grow only as matches are found, bounded by ceil(limit / num_workers + 1) × 4. For a 10000-candidate limit across 9 workers, each worker buffer is ≤ 71 KB; total flat memory after merge is ≤ 640 KB regardless of pool size.
- The char **snap intermediate copy is eliminated entirely; workers read directly from cands_top.

Memory footprint for 55 M candidates (limit=10000, 9 workers):

  Allocation           Old       New
  snap                 447 MB    0 (removed)
  batches array        ~1 GB     ~430 KB
  per-worker results   N/A       ≤71 KB each
  flat after merge     same      ≤640 KB

Also adds a 20-entry LRU result cache (SharedIdx, ref-counted) that maps (filter, pool_gen) to a scored snapshot. Exact cache hits return without scheduling a new scoring run. Prefix cache hits restrict the next scoring run to the cached entry's matched indices plus newly-arrived candidates (delta refinement), avoiding a full re-score of the entire pool on each keystroke during incremental narrowing (e.g. "f" → "fo" → "foo").
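The range split described above can be sketched as follows. This is a minimal illustration, not the module's code: the helper names range_count and build_ranges are hypothetical, and treating `to` as an exclusive bound is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_SIZE 2048u   /* candidates per range, as quoted in the commit message */

/* Two size_t fields: 16 bytes on a typical 64-bit target. */
typedef struct {
    size_t from;   /* first global candidate index in the range */
    size_t to;     /* one past the last index (assumed exclusive) */
} AsyncScoringRange;

/* Hypothetical helper: number of ranges needed to cover pool_count candidates. */
static size_t range_count(size_t pool_count) {
    return (pool_count + BATCH_SIZE - 1) / BATCH_SIZE;
}

/* Hypothetical helper: fill out[] with ranges covering [0, pool_count).
   Returns the number of ranges written. */
static size_t build_ranges(size_t pool_count, AsyncScoringRange *out, size_t cap) {
    size_t n = range_count(pool_count);
    assert(n <= cap);
    for (size_t r = 0; r < n; r++) {
        out[r].from = r * BATCH_SIZE;
        out[r].to = out[r].from + BATCH_SIZE;
        if (out[r].to > pool_count)
            out[r].to = pool_count;   /* the last range may be short */
    }
    return n;
}
```

Each worker would then walk its range and resolve every index through cands_top, so the only per-range state is the 16-byte descriptor.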

Background: fzf-async previously delegated candidate highlighting to Elisp via completion-pcm--hilit-commonality. For a 32-character query, the Elisp helper fzf-async--make-fzf-highlight-pattern built a 63-element PCM pattern of interleaved `any' and character tokens. Emacs compiled this to a regex with 32 nested .* groups, causing exponential backtracking on non-matching candidates, which is the common case for most results. With 1038 candidates and a 32-char query this produced an 18-second hang on the Emacs main thread, confirmed in the C-side debug log (18-second gap at the fzf_native_async_candidates call site, all other threads idle).

Fix: highlight from C after building the result list, using the already-compiled fzf_pattern_t and fzf_get_positions.

Implementation in fzf_native_async_candidates:

Two defcustoms are read via symbol-value at the top of each call:

  fzf-async-highlight                  nil disables; non-nil enables
  fzf-async-highlight-max-candidates   cap on candidates highlighted (later merged into fzf-async-highlight as the tri-value nil/t/N form)

A fresh fzf_pattern_t is parsed from filter_for_hilit (a strdup of the filter string taken before filter ownership transfers to the scoring thread), and one fzf_slab_t is allocated for the highlight pass.

For each of the top min(rcount, highlight_cap) candidates:

- fzf_get_positions(str, pattern, slab) returns pos->data[] in descending byte-offset order (highest position first).
- The loop iterates ascending (index from size-1 to 0), merging adjacent offsets into contiguous runs.
- One put-text-property call per run applies face=completions-common-part to the Emacs string object before it is consed into the result list.

Merging runs minimises the number of text-property operations, which matters most for dense matches (e.g. exact-substring queries).

Two global emacs_value handles, Qface and Qcompletions_common_part, are interned once at module init and reused on every call. Pattern and slab are freed before returning. filter_for_hilit is always freed regardless of whether highlighting is enabled.

Complexity: O(query_len × highlight_cap) vs O(2^query_len × N) for the PCM regex path. For 200 candidates and a 32-char query: ~6400 fzf position lookups vs. the exponential blowup that caused the 18-second hang.

Note: fzf positions are byte offsets; put-text-property uses character positions. For ASCII candidates (the overwhelming majority of find/rg/git output) these are identical. Multi-byte UTF-8 candidates may have slightly misaligned face ranges, the same limitation shared by fussy's fzf-based highlighting.
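The run-merging step can be illustrated in isolation. The sketch below assumes only what the message states (positions arrive in descending order and are walked from the last element to the first); the Run type and merge_runs name are hypothetical, and the real code would call put-text-property once per run instead of storing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: a half-open run [start, end) of matched offsets. */
typedef struct {
    uint32_t start;
    uint32_t end;
} Run;

/* pos[] holds matched offsets in descending order (highest first).
   Walking index size-1 down to 0 visits offsets in ascending order, so a
   new offset either extends the current run or starts a new one.
   Returns the number of runs written to out[]. */
static size_t merge_runs(const uint32_t *pos, size_t size, Run *out, size_t cap) {
    size_t n = 0;
    for (size_t k = size; k-- > 0; ) {
        uint32_t p = pos[k];
        if (n > 0 && out[n - 1].end == p) {
            out[n - 1].end = p + 1;       /* adjacent: extend current run */
        } else {
            assert(n < cap);
            out[n].start = p;             /* gap: open a new run */
            out[n].end = p + 1;
            n++;
        }
    }
    return n;
}
```

For a dense match such as an exact substring, all offsets collapse into a single run, so one property call replaces one call per matched character.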

…ader

Long lines (minified JavaScript, base64 payloads, binary-adjacent grep output) slow fzf_get_score (more characters to align), inflate the arena, and produce unreadable candidates in the completion UI. This commit adds a configurable line-length gate applied in the reader thread, before any candidate enters the pool.

The gate is controlled by fzf-async-max-line-length, a defcustom read by fzf_native_async_start on the main thread:

  nil   no limit (default, preserves current behavior)
  t     built-in default of 512 characters
  +N    exclude lines longer than N characters
  -N    include but truncate to N characters

The value is converted to a signed ptrdiff_t stored as AsyncSession.max_line_length (0 = off, >0 = exclude, <0 = truncate). The write completes before pthread_create, so the reader thread sees the value with no additional synchronization.

In async_reader, after ANSI stripping:

    ptrdiff_t mll = s->max_line_length;
    if (mll != 0) {
        ptrdiff_t cap = mll > 0 ? mll : -mll;
        if ((ptrdiff_t)len > cap) {
            if (mll > 0) continue;   // exclude
            len = (size_t)cap;       // truncate
            line[len] = '\0';
        }
    }

Truncation writes into the on-stack line[] buffer before arena_strdup, so the arena never stores more than cap bytes per candidate. Exclusion skips the arena_strdup and cands_top append entirely.

Character vs. byte: strlen() is used for the length check, so the cap is measured in bytes. For the common case (ASCII output from find, rg, git ls-files) bytes equal characters. For multi-byte UTF-8 content the check is conservative: a line's byte length is at least its character count, so the gate may exclude or truncate lines whose character count is still under the configured cap. This matches user intuition for the primary use case.
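To make the three modes easy to try outside the reader loop, the gate can be factored into a standalone function. This is a sketch under assumptions: apply_line_gate is a hypothetical name, and the return value stands in for the reader's `continue`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical standalone form of the reader's gate.
   mll == 0: off; mll > 0: exclude lines longer than mll bytes;
   mll < 0: truncate lines to -mll bytes.
   Returns 1 if the line should be kept, 0 if it should be dropped.
   May shorten line in place and update *len. */
static int apply_line_gate(char *line, size_t *len, ptrdiff_t mll) {
    if (mll == 0)
        return 1;                          /* gate disabled */
    ptrdiff_t cap = mll > 0 ? mll : -mll;
    if ((ptrdiff_t)*len <= cap)
        return 1;                          /* within the cap: keep as-is */
    if (mll > 0)
        return 0;                          /* exclude mode: drop the line */
    *len = (size_t)cap;                    /* truncate mode: cut at cap bytes */
    line[*len] = '\0';
    return 1;
}
```

Note that truncation cuts at a byte offset, so in this sketch a multi-byte UTF-8 sequence that straddles the cap would be split.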

Per-session LRU result cache that maps query strings to scored snapshots, sitting on top of the existing batch-parallel scoring path. Reintroduces just the cache half of the reverted 76b7318, with none of the range-based-worker / per_worker_limit machinery that caused the under-counting bugs.

Cache anatomy:

- SharedIdx: refcounted flexible-array struct of uint32_t holding the full set of matched candidate indices for a scoring run. Allocated by the scoring thread on publish, retained in O(1) by lookup consumers under the cache mutex (no memcpy), freed by the last consumer.
- CacheEntry { query, pool_gen, top[K], m_idx } in a doubly-linked LRU list, mutex-protected, MRU at head. Successful lookups bump the entry to MRU. Inserts pre-allocate everything outside the mutex; the critical section is just pointer swaps + LRU manipulation.
- subsumes(Q', Q): byte-prefix match + reject any '|'. OR queries can never serve as refinement sources because adding an OR alternate widens the result set unpredictably.
- cache_lookup_exact: exact query match; returns cached top-K + bumped SharedIdx + pool_gen.
- cache_lookup_prefix: longest subsuming Q' with non-NULL m_idx (entries from OR queries are skipped).
- cache_insert: drops m_idx for OR queries (still inserts the entry so it can serve future exact lookups, but it can't be a refinement source).

Scoring-thread integration:

- ScoredStr gained a uint32_t idx field. The batch-construction loop populates it from a parallel snap_idx[] array; the worker preserves it via existing struct copy; counting_sort_scored preserves it the same way.
- Refinement mode: when score_req_refine_idx is non-NULL, the snap is the union of those indices and s->cands[refine_delta_from..count] instead of the full pool. For typing past the first 2-3 chars this is typically <1% of pool size.
- On publish, builds the full match-set index array (all pos matches, not just the top-K emitted) and calls cache_insert with it, so a later subsuming query can refine over the complete prior match set.

Dispatch (fzf_native_async_candidates):

  Lookup result        | Display               | Scoring scheduled
  ---------------------+-----------------------+----------------------------
  Exact, pool_gen==now | Cached top-K          | None (no work)
  Exact, pool_gen<now  | Cached top-K          | Refine on m_idx + delta
  Prefix hit           | Prefix's top-K        | Refine on prefix m_idx + d
  Miss                 | Current score_results | Full pool scoring

Same-filter abort suppression preserved (timer re-triggers don't interrupt in-flight work on the matching query).

Defcustom: fzf-async-cache-size (default 20) re-added to fzf-async.el; read in fzf-native-async-start via symbol-value at session start.

Tests: 8 new ctests covering the exact-match cache surface (lookup miss/hit, in-place update, LRU eviction at capacity, MRU touch on hit, zero-count entries, pool_gen reporting). Phase-2 tests for subsumes() and cache_lookup_prefix() will follow when v2 (term-set subsumption, in-term backspace) lands. 31 ctests total, all passing. 31 ERT tests also pass.

Measured behavior on 63.4M-candidate fzf-async-rg session: typing past ~8 chars drives refine scans down 100x+ vs full-pool scans (e.g. ~460K scanned vs ~47M pool); typing past ~18 chars drives them down 200,000x+ (e.g. 283 scanned vs 63.4M pool). Backspace and timer re-ticks on a stable query trigger zero scoring work.
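The phase-1 byte-prefix rule is small enough to sketch directly. This is an illustration of the rule as described (byte prefix, reject any '|'), not the module's implementation; in particular, treating an identical query as subsumed is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Sketch of the byte-prefix rule: the cached query qprime can seed
   refinement for q when qprime is a byte prefix of q and neither string
   contains '|' (OR queries are never refinement sources). */
static bool subsumes(const char *qprime, const char *q) {
    if (strchr(qprime, '|') != NULL || strchr(q, '|') != NULL)
        return false;
    size_t n = strlen(qprime);
    /* strncmp stops at q's terminator, so a shorter q never matches. */
    return strncmp(qprime, q, n) == 0;
}
```

With this rule, typing "fo" then "foo" then "foo bar" lets each query refine over the previous match set, while "fo | bar" always falls back to a full scan.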

term-set subsumption + larger default

Layered on top of phase 1 (commit ed25fc7). Three changes:

1. Term-set subsumption (replaces byte-prefix-only rule with byte-prefix OR term-set).

   Phase 1's subsumes(Q', Q) used byte-prefix matching: Q' subsumes Q iff Q' is a literal prefix of Q (and neither contains '|'). This caught extending a term, adding AND terms at the end, etc., but missed cases where Q' subsumes Q semantically:

   - Adding an AND term in non-prefix position: fo -> x fo
   - Term reordering: foo bar -> bar foo (same term set, different textual order)
   - Non-prefix negation/anchor: fo -> !x fo

   Phase 2 adds subsumes_pattern(P', P) operating on parsed fzf_pattern_t structures. Rule: every term-set in P' must have an equivalent term-set in P. Equivalent = same algorithm (fn function pointer), same inv flag, same case_sensitive flag, same strcmp(ptr).

   fzf semantics gotcha: within a term-set = OR (terms are alternatives); across term-sets = AND. So "foo bar" parses as 2 sets x 1 term (AND), and "foo | bar" parses as 1 set x 2 terms (OR). subsumes_pattern rejects any term-set with >1 term; these can never serve as refinement sources.

   cache_lookup_prefix now uses byte-prefix OR term-set: both rules are valid superset relations and we want the union. Best entry = most term-sets (most constraints = smallest match set = fastest refinement scan), with byte-length as tiebreaker.

   To avoid re-parsing on every iteration of the lookup scan, CacheEntry gained a fzf_pattern_t *parsed field populated on insert (and freed in cache_entry_free). fzf_parse_pattern mutates its input, so parse_query_for_cache strdups before parsing; the returned pattern is self-contained.

2. Default cache size 20 -> 40.

   Doubles the typing-trail kept in LRU. Helps backspace coverage: backspacing past N keystrokes still hits the LRU as long as those intermediate queries weren't evicted by unrelated lookups. C fallback in cache_init and async_start both bumped; the matching defcustom default change is in fzf-async.

3. Tests.

   12 new ctests covering subsumes_pattern (extending via byte-prefix, adding at end/start, reorder, negation, OR rejection, distinct terms) and cache_lookup_prefix v2 paths (term-subset, reordered, picks-most-terms, skips-OR-in-query, skips-exact-match). Plus 2 new ERT tests:

   - fzf-native-async-cache-prefix-refinement-test: typing progression fo -> foo -> backspace to fo, verify backspace returns same set as initial (covers exact-cache-hit-on-backspace via larger LRU).
   - fzf-native-async-cache-term-reorder-test: foo bar and bar foo return identical sets (exercises subsumes_pattern term reordering end-to-end).

Totals: 43 ctests pass (was 31, +12 phase 2); 33 ERT tests pass (was 31, +2 cache E2E).

Measured behavior on 63.4M-candidate fzf-async-rg session typing a 6-AND-term query: each new term refines from the previous match set, final scan = 10,669 candidates (vs 63.4M for full scan, ~5,940x speedup). v2's most-terms preference correctly picks the most-restrictive prior entry as refinement source on every step.
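The term-set rule can be sketched on simplified stand-in types. Term, TermSet and Pattern below are hypothetical reductions of the parsed fzf structures (an int `algo` stands in for the fn function pointer), and rejecting multi-term sets on both sides is this sketch's reading of "rejects any term-set with >1 term".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the parsed pattern structures. */
typedef struct { int algo; bool inv; bool case_sensitive; const char *text; } Term;
typedef struct { const Term *terms; size_t size; } TermSet;
typedef struct { const TermSet *sets; size_t size; } Pattern;

/* Equivalent = same algorithm, same flags, same term text. */
static bool term_equal(const Term *a, const Term *b) {
    return a->algo == b->algo && a->inv == b->inv &&
           a->case_sensitive == b->case_sensitive &&
           strcmp(a->text, b->text) == 0;
}

/* True when every term-set of pprime has an equivalent term-set in p.
   Any set with more than one term (an OR group) disqualifies the pair. */
static bool subsumes_pattern(const Pattern *pprime, const Pattern *p) {
    for (size_t j = 0; j < p->size; j++)
        if (p->sets[j].size != 1)
            return false;
    for (size_t i = 0; i < pprime->size; i++) {
        if (pprime->sets[i].size != 1)
            return false;
        bool found = false;
        for (size_t j = 0; j < p->size && !found; j++)
            found = term_equal(&pprime->sets[i].terms[0], &p->sets[j].terms[0]);
        if (!found)
            return false;
    }
    return true;
}
```

Because the comparison ignores set order, "foo bar" and "bar foo" subsume each other, which is the reordering case the byte-prefix rule missed.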

The flat s->cands pointer array (and its doubling realloc) was the last remaining O(N) allocation in the async path. At 33M candidates, growing to 67M slots requires a single 537 MB malloc plus a 264 MB transient memcpy. Under macOS memory-compressor pressure the kernel cannot find that much contiguous memory quickly, and every thread in the process stalls (Hang #2 in the project history).

Replace the flat array with a two-level table:

  cands_top[]   : CANDS_TOP_CAP slots (4096) × 8 B = 32 KB inline, zero-initialized at session start, never grown.
  cands_top[i]  : CANDS_BLOCK_SIZE (256K) pointers × 8 B = 2 MB block, allocated lazily by the reader on first write.

Index split:

  i  = (hi << SHIFT) + lo
  hi = i >> 18        → which block
  lo = i & 0x3FFFF    → which slot in that block

Both are single-cycle CPU instructions because BLOCK_SIZE is a power of 2.

Largest single allocation the reader ever does: 2 MB regardless of pool size. macOS's compressor satisfies a 2 MB allocation in microseconds even under heavy pressure; a 537 MB allocation can stall for seconds.

The 1 G candidate ceiling (4096 × 256K = 2^30) is well past the practical limit of any realistic shell command.

Reader changes (async_reader):

- Compute (hi, lo) from s->count BEFORE arena_strdup. The reader is the sole writer to s->count, so reading without s->mu is safe.
- Cap check first: if hi >= CANDS_TOP_CAP, drop the line entirely (don't even arena-allocate) and log verbosely with line preview. Hitting 1 G candidates is so far outside expected behavior that it almost certainly indicates a broken upstream command (infinite loop, runaway find on a cyclic FS, etc.), and we want the cause obvious in the log.
- Pre-allocate the new 2 MB block OUTSIDE s->mu. Doing the malloc under the lock would let a slow allocation stall the scoring thread's snapshot path, which is exactly the original Hang #2 problem.
- Take s->mu briefly to publish the block pointer (if newly allocated) and write the slot + increment count.

Scoring thread changes (scoring_thread_fn):

- Full-pool snapshot walks block-by-block, doing a flat memcpy within each block. Boundary-crossing cost paid once per block (~250 times for a 60M pool, basically free) while inner loops match flat-array speed.
- Refine path's matched_idx and delta loops resolve via the shift+mask accessor: snap[w] = s->cands_top[gi >> SHIFT][gi & MASK]. Random access pays one extra L1 cache line load on first access to each block, negligible vs string-comparison cost.

Teardown (async_session_destroy):

- Walk cands_top[0..CAP-1] and free each non-NULL block. Strings still freed in O(chunks) by arena_free.

Init (fzf_native_async_start):

- No initial allocation needed. cands_top is zero-initialized by the calloc that allocates the AsyncSession itself. ASYNC_INIT_CAP removed.
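The two-level table and its shift+mask accessor can be sketched in a single-threaded form. The constants follow the message (shift 18, 4096 top slots); the CandTable type and the cand_* helper names are hypothetical, and the s->mu locking and arena allocation described above are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CANDS_BLOCK_SHIFT 18
#define CANDS_BLOCK_SIZE  ((size_t)1 << CANDS_BLOCK_SHIFT)   /* 256K slots */
#define CANDS_BLOCK_MASK  (CANDS_BLOCK_SIZE - 1)             /* 0x3FFFF */
#define CANDS_TOP_CAP     4096

/* Hypothetical container; must start zeroed (e.g. via calloc). */
typedef struct {
    char **cands_top[CANDS_TOP_CAP];   /* lazily allocated blocks */
    size_t count;                      /* number of stored candidates */
} CandTable;

/* Append one pointer. Returns 0 on success, -1 when the ceiling is
   reached or a block allocation fails. No locking in this sketch. */
static int cand_append(CandTable *t, char *str) {
    size_t hi = t->count >> CANDS_BLOCK_SHIFT;   /* which block */
    size_t lo = t->count & CANDS_BLOCK_MASK;     /* slot within the block */
    if (hi >= CANDS_TOP_CAP)
        return -1;
    if (t->cands_top[hi] == NULL) {
        t->cands_top[hi] = malloc(CANDS_BLOCK_SIZE * sizeof(char *));
        if (t->cands_top[hi] == NULL)
            return -1;
    }
    t->cands_top[hi][lo] = str;
    t->count++;
    return 0;
}

/* Resolve a global index with the shift+mask accessor. */
static char *cand_get(const CandTable *t, size_t gi) {
    assert(gi < t->count);
    return t->cands_top[gi >> CANDS_BLOCK_SHIFT][gi & CANDS_BLOCK_MASK];
}

/* Free every block; free(NULL) is a no-op for slots never allocated. */
static void cand_free(CandTable *t) {
    for (size_t hi = 0; hi < CANDS_TOP_CAP; hi++) {
        free(t->cands_top[hi]);
        t->cands_top[hi] = NULL;
    }
    t->count = 0;
}
```

Existing blocks are never moved or resized, which is what lets readers of already-published indices skip re-validation when the table grows.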

Two related freshness fixes for the prompt overlay during cache hits. Without these, the displayed counts could lag behind the visible state by several seconds, visible to the user as the prompt showing e.g. [27815204](46291615) for ~3 seconds while they type past the prefix into a longer query against a streaming pool that has grown to 55M.

A. last_total now tracks the *current* pool, not the pool size at scoring-publish time. Without this, the TOTAL displayed in the prompt freezes at the last scored value, lagging behind the streaming counter visible elsewhere. Cheap fix: the dispatch path already reads s->count under s->mu for the pool-size check; piggy-back a write of s->last_total under score_res_mu in the same call.

B. last_filtered on cache hits is now set from the cached entry's full match-set count (m_idx->count), which describes the candidate set the user is actually looking at right now (the cached top-K we just returned). Previously last_filtered held whatever the most recent scoring run published, which during prefix hits could be a completely different query's count. For OR-query entries (m_idx == NULL, can't be refinement sources) we fall back to top_count.

On cache miss we leave last_filtered alone: scoring will publish a fresh value shortly, and meanwhile the existing value is at least consistent with the score_results we're falling back to display.

The race with the scoring thread is benign: scoring publishes its authoritative values before bumping gen, so any time-ordering of our write vs scoring's write that ends with scoring last (the common case) results in scoring's fresh values being shown. The other ordering shows our cache-derived values briefly until the next gen bump triggers a re-display.

Currently the C module reads four knobs via symbol-value from two different package namespaces:

  fussy-fzf-native-highlight   (fussy)
  fzf-async-highlight          (fzf-async)
  fzf-async-max-line-length    (fzf-async)
  fzf-async-cache-size         (fzf-async)

This leaks the layering in the wrong direction: the lowest-level package shouldn't have to know symbol names from two higher-level packages.

Move all four to the fzf-native namespace:

  fzf-native-batch-highlight    (sync path, default 25)
  fzf-native-async-highlight    (async path, default 200)
  fzf-native-max-line-length    (async reader, default t)
  fzf-native-async-cache-size   (async start, default 40)

C reads now hit only fzf-native-* names. Higher-level packages keep their existing user-facing defcustoms and bridge the values onto these canonical names: fussy via setq-local (synchronous, same-buffer call pattern) and fzf-async via :around advice on the C entry points (timer-driven, cross-buffer). Those bridges land in the respective package commits.

Naming convention:

- "batch-" prefix marks the synchronous score/score-all path.
- "async-" prefix marks the streaming async path.
- max-line-length has no prefix because it conceptually belongs to the line-stream itself; kept short.

The companion fzf-async / fussy bridge commits make this fully backward-compatible: users continue to set fussy-fzf-native-highlight or fzf-async-highlight as before, and the canonical name picks up the package-specific value at call time.

The defcustom previously used `t' as a sentinel meaning "use the built-in default of 512", with the actual integer hardcoded in C. Two related cleanups:

1. Make the defcustom value the actual integer. No `t' sentinel, no hardcoded fallback in C. Type is `(choice nil integer)':

     nil           -> no limit
     positive N    -> exclude lines longer than N characters
     negative -N   -> include but truncate lines to N characters

2. Drop the `else if (env->eq(env, val, Qt)) s->max_line_length = 256' branch in `fzf_native_async_start'. The C side now reads the integer directly via `extract_integer'; the default lives where it should, in fzf-native.el as the defcustom's :type/value, not as a magic fallback in the dynamic module.

jojojames and others added 30 commits May 9, 2026 11:07

Implement highlighting for 1-candidate and batch case

f9a255d

Expose cache variable

adb91f4

Update architecture

e3cadaf

Update binary ubuntu-latest

347829e

Update binary windows-latest

304f046

Update binary macos-latest

4940da9

Fix c tests

250341a

Remove tests related to indices being returned

bc8521e

Remove advice

9373268

We're not returning indices anymore.

Remove mention of advice

9ca2b35

Update readme

13765c8

Require 29

cdab0ac

Extract $PATH from emacs and use shell set by emacs

b72933c

Before it was /bin/sh and used a minimal $PATH so we had to fully qualify binaries e.g. /opt/homebrew/bin/fd instead of just fd

Update binary ubuntu-latest

77e4bc6

Update binary windows-latest

fded535

Update binary macos-latest

1220006

Remove

e897fc3

Update binary ubuntu-latest

53fe7d9

Update binary macos-latest

9327cc9

Update binary windows-latest

4b5c999

Try action again

e781893

Update binaries for all platforms

eaeffe2

Add test message

cba91ff

Update binaries for all platforms

9e37074

Add BSD

a77b52e

Update binaries for all platforms

ffd36b7

Revert "Add test message"

3722ec9

This reverts commit cba91ff.

github-actions and others added 7 commits May 10, 2026 00:21

Update binaries for all platforms

6080d49

Update binaries for all platforms

02368af

Build before test

e8c48f9

Set -D_POSIX_C_SOURCE=200809L for ctest

cb8bb57

`Makefile:81` now passes `-D_POSIX_C_SOURCE=200809L`, which exposes `strdup`'s declaration in `<string.h>` so the return value is correctly typed as `char *`.

Use gnu11 instead

fb43f7b

jojojames force-pushed the main branch from ac7c819 to fb43f7b Compare May 10, 2026 03:07

Handle case mode and expose custom variable

4d564f1

jojojames force-pushed the main branch from 699929a to 4d564f1 Compare May 10, 2026 03:33

jojojames and others added 9 commits May 10, 2026 10:17

Wipe text properties before adding them

0601dd7

Update binaries for all platforms

98913f5

Update binaries for all platforms

32d3a35

Update readme & architecture

32e9288

Update gitignore

6e409e8

dangduc merged commit d40834c into dangduc:main May 11, 2026
10 checks passed

dangduc merged 47 commits into dangduc:main from jojojames:main
