Skip to content

Update#36

Merged
dangduc merged 47 commits into
dangduc:mainfrom
jojojames:main
May 11, 2026
Merged

Update#36
dangduc merged 47 commits into
dangduc:mainfrom
jojojames:main

Conversation

@jojojames
Copy link
Copy Markdown
Collaborator

with these changes, we can run fzf-async-rg in ~/ which is 60 million candidates for me

jojojames and others added 30 commits May 9, 2026 11:07
Background: the async path streams candidates from a child process into
a two-level cands_top table and scores them on a persistent background
scoring thread using a pool of parallel worker pthreads. When 50+ million
candidates are in the pool, two allocations inside scoring_thread_fn
previously scaled linearly with pool size:

  1. char **snap = malloc(scount * sizeof *snap) — a flat snapshot of all
     candidate pointers. At 55.9 M candidates this is 447 MB allocated in
     a single malloc call, causing VM pressure stalls lasting 17+ seconds.

  2. struct AsyncScoringBatch *batches — each batch embedded xs[BATCH_SIZE]
     (2048 × 16 B = 32 KB) inline. The array grew to 32768 entries via
     doubling, reaching 32768 × 32 KB = 1 GB before any worker ran.

Both allocations existed since the original batch-parallel design and
survived the chunked cands_top fix (commit 9883b51, which only addressed
the reader's realloc stall). A 17-second gap in the C-side debug log
between the last reader allocation and the first post-reader scoring
event confirmed the scoring thread was the culprit.

Fix: replace the batch/snapshot design with a range-based worker design.

Workers no longer receive pre-filled string arrays. Instead:

  - AsyncScoringRange { size_t from; size_t to; } (16 bytes each) replaces
    AsyncScoringBatch (32 KB each). The range array for 55 M candidates at
    BATCH_SIZE=2048 is ~430 KB rather than 1 GB.

  - Workers resolve candidate pointers on the fly from cands_top via the
    shift+mask accessor (gi >> CANDS_BLOCK_SHIFT, gi & CANDS_BLOCK_MASK).
    This is safe because entries at index i < pool_count are immutably
    written by the reader and never moved; the scoring thread captures
    pool_count under s->mu before spawning workers.

  - Per-worker result buffers grow only as matches are found, bounded by
    ceil(limit / num_workers + 1) × 4. For 10000-candidate limit across
    9 workers, each worker buffer is ≤ 71 KB; total flat memory after merge
    is ≤ 640 KB regardless of pool size.

  - The char **snap intermediate copy is eliminated entirely; workers read
    directly from cands_top.

Memory footprint for 55 M candidates (limit=10000, 9 workers):

  Allocation          Old       New
  snap                447 MB    0 (removed)
  batches array       ~1 GB     ~430 KB
  per-worker results  N/A       ≤71 KB each
  flat after merge    same      ≤640 KB

Also adds a 20-entry LRU result cache (SharedIdx, ref-counted) that maps
(filter, pool_gen) to a scored snapshot. Exact cache hits return without
scheduling a new scoring run. Prefix cache hits restrict the next scoring
run to the cached entry's matched indices plus newly-arrived candidates
(delta refinement), avoiding a full re-score of the entire pool on each
keystroke during incremental narrowing (e.g. "f" → "fo" → "foo").
Background: fzf-async previously delegated candidate highlighting to
Elisp via completion-pcm--hilit-commonality. For a 32-character query,
the Elisp helper fzf-async--make-fzf-highlight-pattern built a 63-element
PCM pattern of interleaved `any' and character tokens. Emacs compiled this
to a regex with 32 nested .* groups, causing exponential backtracking on
non-matching candidates — the common case for most results. With 1038
candidates and a 32-char query this produced an 18-second hang on the
Emacs main thread, confirmed in the C-side debug log (18-second gap at
the fzf_native_async_candidates call site, all other threads idle).

Fix: highlight from C after building the result list, using the
already-compiled fzf_pattern_t and fzf_get_positions.

Implementation in fzf_native_async_candidates:

  Two defcustoms are read via symbol-value at the top of each call:
    fzf-async-highlight             — nil disables; non-nil enables
    fzf-async-highlight-max-candidates — cap on candidates highlighted
                                        (later merged into fzf-async-highlight
                                         as the tri-value nil/t/N form)

  A fresh fzf_pattern_t is parsed from filter_for_hilit (a strdup of the
  filter string taken before filter ownership transfers to the scoring
  thread), and one fzf_slab_t is allocated for the highlight pass.

  For each of the top min(rcount, highlight_cap) candidates:
    - fzf_get_positions(str, pattern, slab) returns pos->data[] in
      descending byte-offset order (highest position first).
    - The loop iterates ascending (index from size-1 to 0), merging
      adjacent offsets into contiguous runs.
    - One put-text-property call per run applies face=completions-common-part
      to the Emacs string object before it is consed into the result list.
      Merging runs minimises the number of text-property operations, which
      matters most for dense matches (e.g. exact-substring queries).

  Two global emacs_value handles — Qface and Qcompletions_common_part —
  are interned once at module init and reused on every call.

  Pattern and slab are freed before returning. filter_for_hilit is always
  freed regardless of whether highlighting is enabled.

Complexity: O(query_len × highlight_cap) vs O(2^query_len × N) for the
PCM regex path. For 200 candidates and a 32-char query: ~6400 fzf position
lookups vs. the exponential blowup that caused the 18-second hang.

Note: fzf positions are byte offsets; put-text-property uses character
positions. For ASCII candidates (the overwhelming majority of find/rg/git
output) these are identical. Multi-byte UTF-8 candidates may have slightly
misaligned face ranges — the same limitation shared by fussy's fzf-based
highlighting.
…ader

Long lines — minified JavaScript, base64 payloads, binary-adjacent grep
output — slow fzf_get_score (more characters to align), inflate the arena,
and produce unreadable candidates in the completion UI. This commit adds
a configurable line-length gate applied in the reader thread, before any
candidate enters the pool.

The gate is controlled by fzf-async-max-line-length, a defcustom read by
fzf_native_async_start on the main thread:

  nil   no limit (default, preserves current behavior)
  t     built-in default of 512 characters
  +N    exclude lines longer than N characters
  -N    include but truncate to N characters

The value is converted to a signed ptrdiff_t stored as
AsyncSession.max_line_length (0 = off, >0 = exclude, <0 = truncate).
The write completes before pthread_create, so the reader thread sees the
value with no additional synchronization.

In async_reader, after ANSI stripping:

  ptrdiff_t mll = s->max_line_length;
  if (mll != 0) {
    ptrdiff_t cap = mll > 0 ? mll : -mll;
    if ((ptrdiff_t)len > cap) {
      if (mll > 0) continue;   // exclude
      len = (size_t)cap;       // truncate
      line[len] = '\0';
    }
  }

Truncation writes into the on-stack line[] buffer before arena_strdup, so
the arena never stores more than cap characters per candidate. Exclusion
skips the arena_strdup and cands_top append entirely.

Character vs. byte: strlen() is used for the length check. For the common
case (ASCII output from find, rg, git ls-files) bytes equal characters.
For mixed-script content the check is conservative — it may admit lines
slightly longer in character count than the configured cap. This matches
user intuition for the primary use case.
We're not returning indices anymore.
Before it was /bin/sh and used a minimal $PATH so we had to fully qualify
binaries

e.g. /opt/homebrew/bin/fd instead of just fd
This reverts commit cba91ff.
github-actions and others added 7 commits May 10, 2026 00:21
Per-session LRU result cache that maps query strings to scored
snapshots, sitting on top of the existing batch-parallel scoring path.
Reintroduces just the cache half of the reverted 76b7318 — none of the
range-based-worker / per_worker_limit machinery that caused the
under-counting bugs.

Cache anatomy:

  - SharedIdx: refcounted flexible-array struct of uint32_t holding the
    full set of matched candidate indices for a scoring run. Allocated
    by the scoring thread on publish, retained in O(1) by lookup
    consumers under the cache mutex (no memcpy), freed by the last
    consumer.

  - CacheEntry { query, pool_gen, top[K], m_idx } in a doubly-linked
    LRU list, mutex-protected, MRU at head. Successful lookups bump
    the entry to MRU. Inserts pre-allocate everything outside the
    mutex; the critical section is just pointer swaps + LRU
    manipulation.

  - subsumes(Q', Q): byte-prefix match + reject any '|'. OR queries can
    never serve as refinement sources because adding an OR alternate
    widens the result set unpredictably.

  - cache_lookup_exact: exact query match; returns cached top-K + bumped
    SharedIdx + pool_gen. cache_lookup_prefix: longest subsuming Q' with
    non-NULL m_idx (entries from OR queries are skipped).

  - cache_insert: drops m_idx for OR queries (still inserts the entry
    so it can serve future exact lookups, but it can't be a refinement
    source).

Scoring-thread integration:

  - ScoredStr gained a uint32_t idx field. The batch-construction loop
    populates it from a parallel snap_idx[] array; the worker preserves
    it via existing struct copy; counting_sort_scored preserves it the
    same way.

  - Refinement mode: when score_req_refine_idx is non-NULL, the snap is
    the union of those indices and s->cands[refine_delta_from..count]
    instead of the full pool. For typing past the first 2-3 chars this
    is typically <1% of pool size.

  - On publish, builds the full match-set index array (all pos matches,
    not just the top-K emitted) and calls cache_insert with it, so a
    later subsuming query can refine over the complete prior match set.

Dispatch (fzf_native_async_candidates):

  Lookup result        | Display              | Scoring scheduled
  ---------------------+----------------------+----------------------------
  Exact, pool_gen==now | Cached top-K         | None (no work)
  Exact, pool_gen<now  | Cached top-K         | Refine on m_idx + delta
  Prefix hit           | Prefix's top-K       | Refine on prefix m_idx + d
  Miss                 | Current score_results | Full pool scoring

  Same-filter abort suppression preserved (timer re-triggers don't
  interrupt in-flight work on the matching query).

Defcustom: fzf-async-cache-size (default 20) re-added to fzf-async.el;
read in fzf-native-async-start via symbol-value at session start.

Tests: 8 new ctests covering the exact-match cache surface (lookup
miss/hit, in-place update, LRU eviction at capacity, MRU touch on hit,
zero-count entries, pool_gen reporting). Phase-2 tests for subsumes()
and cache_lookup_prefix() will follow when v2 (term-set subsumption,
in-term backspace) lands. 31 ctests total, all passing. 31 ERT tests
also pass.

Measured behavior on 63.4M-candidate fzf-async-rg session: typing past
~8 chars drives refine scans down 100x+ vs full-pool scans (e.g.
~460K scanned vs ~47M pool); typing past ~18 chars drives them down
200,000x+ (e.g. 283 scanned vs 63.4M pool). Backspace and timer
re-ticks on a stable query trigger zero scoring work.
term-set subsumption + larger default

Layered on top of phase 1 (commit ed25fc7).  Three changes:

1. Term-set subsumption (replaces byte-prefix-only rule with byte-
   prefix OR term-set).

   Phase 1's subsumes(Q', Q) used byte-prefix matching: Q' subsumes Q
   iff Q' is a literal prefix of Q (and neither contains '|').  This
   caught extending a term, adding AND terms at the end, etc., but
   missed cases where Q' subsumes Q semantically:
     - Adding an AND term in non-prefix position: fo -> x fo
     - Term reordering: foo bar -> bar foo (same term set, different
       textual order)
     - Non-prefix negation/anchor: fo -> !x fo

   Phase 2 adds subsumes_pattern(P', P) operating on parsed
   fzf_pattern_t structures.  Rule: every term-set in P' must have an
   equivalent term-set in P.  Equivalent = same algorithm (fn function
   pointer), same inv flag, same case_sensitive flag, same strcmp(ptr).

   fzf semantics gotcha: within a term-set = OR (terms are
   alternatives); across term-sets = AND.  So "foo bar" parses as 2
   sets x 1 term (AND), and "foo | bar" parses as 1 set x 2 terms
   (OR).  subsumes_pattern rejects any term-set with >1 term — these
   can never serve as refinement sources.

   cache_lookup_prefix now uses byte-prefix OR term-set — both rules
   are valid superset relations and we want the union.  Best entry =
   most term-sets (most constraints = smallest match set = fastest
   refinement scan), with byte-length as tiebreaker.

   To avoid re-parsing on every iteration of the lookup scan,
   CacheEntry gained a fzf_pattern_t *parsed field populated on
   insert (and freed in cache_entry_free).  fzf_parse_pattern mutates
   its input, so parse_query_for_cache strdups before parsing — the
   returned pattern is self-contained.

2. Default cache size 20 -> 40.

   Doubles the typing-trail kept in LRU.  Helps backspace coverage:
   backspacing past N keystrokes still hits the LRU as long as those
   intermediate queries weren't evicted by unrelated lookups.  C
   fallback in cache_init and async_start both bumped; the matching
   defcustom default change is in fzf-async.

3. Tests.

   12 new ctests covering subsumes_pattern (extending via byte-prefix,
   adding at end/start, reorder, negation, OR rejection, distinct
   terms) and cache_lookup_prefix v2 paths (term-subset, reordered,
   picks-most-terms, skips-OR-in-query, skips-exact-match).  Plus 2
   new ERT tests:
     - fzf-native-async-cache-prefix-refinement-test: typing
       progression fo -> foo -> backspace to fo, verify backspace
       returns same set as initial (covers exact-cache-hit-on-
       backspace via larger LRU).
     - fzf-native-async-cache-term-reorder-test: foo bar and bar foo
       return identical sets (exercises subsumes_pattern term
       reordering end-to-end).

   Totals: 43 ctests pass (was 31, +12 phase 2); 33 ERT tests pass
   (was 31, +2 cache E2E).

Measured behavior on 63.4M-candidate fzf-async-rg session typing a
6-AND-term query: each new term refines from the previous match set,
final scan = 10,669 candidates (vs 63.4M for full scan, ~5,940x
speedup).  v2's most-terms preference correctly picks the most-
restrictive prior entry as refinement source on every step.
— `Makefile:81` now passes `-D_POSIX_C_SOURCE=200809L`, which exposes `strdup`'s declaration in `<string.h>` so the return value is correctly typed as `char *`.
jojojames and others added 9 commits May 10, 2026 10:17
The flat s->cands pointer array (and its doubling realloc) was the last
remaining O(N) allocation in the async path.  At 33M candidates, growing
to 67M slots requires a single 537 MB malloc plus a 264 MB transient
memcpy.  Under macOS memory-compressor pressure the kernel cannot find
that much contiguous memory quickly, and every thread in the process
stalls — Hang dangduc#2 in the project history.

Replace the flat array with a two-level table:

  cands_top[]  : CANDS_TOP_CAP slots (4096) × 8 B = 32 KB inline,
                 zero-initialized at session start, never grown.
  cands_top[i] : CANDS_BLOCK_SIZE (256K) pointers × 8 B = 2 MB block,
                 allocated lazily by the reader on first write.

Index split: i = (hi << SHIFT) + lo
             hi = i >> 18     → which block
             lo = i & 0x3FFFF → which slot in that block
Both single-cycle CPU instructions because BLOCK_SIZE is a power of 2.

Largest single allocation the reader ever does: 2 MB regardless of pool
size.  macOS's compressor satisfies a 2 MB allocation in microseconds
even under heavy pressure; a 537 MB allocation can stall for seconds.
1 G candidate ceiling (4096 × 256K = 2^32) is well past the
practical limit of any realistic shell command.

Reader changes (async_reader):
- Compute (hi, lo) from s->count BEFORE arena_strdup.  The reader is the
  sole writer to s->count, so reading without s->mu is safe.
- Cap check first: if hi >= CANDS_TOP_CAP, drop the line entirely (don't
  even arena-allocate) and log verbosely with line preview — hitting 1 G
  candidates is so far outside expected behavior that it almost
  certainly indicates a broken upstream command (infinite loop, runaway
  find on a cyclic FS, etc.), and we want the cause obvious in the log.
- Pre-allocate the new 2 MB block OUTSIDE s->mu.  Doing the malloc under
  the lock would let a slow allocation stall the scoring thread's
  snapshot path — which is exactly the original Hang dangduc#2 problem.
- Take s->mu briefly to publish the block pointer (if newly allocated)
  and write the slot + increment count.

Scoring thread changes (scoring_thread_fn):
- Full-pool snapshot walks block-by-block, doing a flat memcpy within
  each block.  Boundary-crossing cost paid once per block (~250 times
  for a 60M pool — basically free) while inner loops match flat-array
  speed.
- Refine path's matched_idx and delta loops resolve via the shift+mask
  accessor: snap[w] = s->cands_top[gi >> SHIFT][gi & MASK].  Random
  access pays one extra L1 cache line load on first access to each
  block, negligible vs string-comparison cost.

Teardown (async_session_destroy):
- Walk cands_top[0..CAP-1] and free each non-NULL block.  Strings still
  freed in O(chunks) by arena_free.

Init (fzf_native_async_start):
- No initial allocation needed.  cands_top is zero-initialized by the
  calloc that allocates the AsyncSession itself.  ASYNC_INIT_CAP removed.
Two related freshness fixes for the prompt overlay during cache hits.
Without these, the displayed counts could lag behind the visible state
by several seconds — visible to the user as the prompt showing e.g.
[27815204](46291615) for ~3 seconds while they type past the prefix
into a longer query against a streaming pool that has grown to 55M.

A. last_total now tracks the *current* pool, not the pool size at
   scoring-publish time.  Without this, the TOTAL displayed in the
   prompt freezes at the last scored value, lagging behind the
   streaming counter visible elsewhere.  Cheap fix: the dispatch path
   already reads s->count under s->mu for the pool-size check; piggy-
   back a write of s->last_total under score_res_mu in the same call.

B. last_filtered on cache hits is now set from the cached entry's
   full match-set count (m_idx->count) — which describes the candidate
   set the user is actually looking at right now (the cached top-K we
   just returned).  Previously last_filtered held whatever the most
   recent scoring run published, which during prefix hits could be a
   completely different query's count.

   For OR-query entries (m_idx == NULL, can't be refinement sources)
   we fall back to top_count.  On cache miss we leave last_filtered
   alone — scoring will publish a fresh value shortly, and meanwhile
   the existing value is at least consistent with the score_results
   we're falling back to display.

The race with the scoring thread is benign: scoring publishes its
authoritative values before bumping gen, so any time-ordering of our
write vs scoring's write that ends with scoring last (the common
case) results in scoring's fresh values being shown.  The other
ordering shows our cache-derived values briefly until the next gen
bump triggers a re-display.
Currently the C module reads four knobs via symbol-value from three
different package namespaces:

  fussy-fzf-native-highlight  (fussy)
  fzf-async-highlight         (fzf-async)
  fzf-async-max-line-length   (fzf-async)
  fzf-async-cache-size        (fzf-async)

This leaks the layering the wrong direction — the lowest-level package
shouldn't have to know symbol names from two higher-level packages.

Move all four to the fzf-native namespace:

  fzf-native-batch-highlight       (sync path,  default 25)
  fzf-native-async-highlight       (async path, default 200)
  fzf-native-max-line-length       (async reader, default t)
  fzf-native-async-cache-size      (async start, default 40)

C reads now hit only fzf-native-* names.  Higher-level packages keep
their existing user-facing defcustoms and bridge the values onto these
canonical names — fussy via setq-local (synchronous, same-buffer call
pattern) and fzf-async via :around advice on the C entry points
(timer-driven, cross-buffer).  Those bridges land in the respective
package commits.

Naming convention:
- "batch-" prefix marks the synchronous score/score-all path.
- "async-" prefix marks the streaming async path.
- max-line-length has no prefix because it conceptually belongs to
  the line-stream itself; kept short.

The companion fzf-async / fussy bridge commits make this fully
backward-compatible — users continue to set fussy-fzf-native-highlight
or fzf-async-highlight as before, and the canonical name picks up the
package-specific value at call time.
The defcustom previously used `t' as a sentinel meaning "use the
built-in default of 512", with the actual integer hardcoded in C.
Two related cleanups:

1. Make the defcustom value the actual integer.  No `t' sentinel,
   no hardcoded fallback in C.  Type is `(choice nil integer)':
     nil           -> no limit
     positive N    -> exclude lines longer than N characters
     negative -N   -> include but truncate lines to N characters

2. Drop the `else if (env->eq(env, val, Qt)) s->max_line_length = 256'
   branch in `fzf_native_async_start'.  The C side now reads the
   integer directly via `extract_integer'; the default lives where
   it should — in fzf-native.el as the defcustom's :type/value, not
   as a magic fallback in the dynamic module.
@dangduc dangduc merged commit d40834c into dangduc:main May 11, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants