
Benchmark: CPU default, trace logging, OPFS caching #315

Merged

sauravpanda merged 1 commit into main from bench-cpu-default-plus-traces on Apr 23, 2026

Conversation

@sauravpanda sauravpanda (Owner) commented Apr 23, 2026

Summary

Three benchmark-only changes so the Flare engine actually runs and is diagnosable. There are no functional changes to the library itself; this PR is paired with flarellm #500, which lands the Rust-side primitives that the follow-up work builds on.

1. CPU-only default (GPU opt-in via ?gpu=1)

The sync GPU readback path in wgpu (map_async → device.poll(Wait) → receiver.recv()) deadlocks on Chrome main-thread wasm:

  • map_async's callback is serviced by JS microtasks
  • microtasks don't run during a sync WASM call
  • device.poll(Wait) is a no-op on the wasm32 target

Every forward_prefill GPU dispatch hits this and never returns. Running CPU-only keeps the benchmark functional (67 tok/s decode, 148 ms TTFT on SmolLM2-135M Q8_0) while the async-readback refactor lands upstream. ?gpu=1 keeps the GPU path available as an opt-in for diagnostics.
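
For reference, a minimal sketch of the toggle's shape (identifier names such as flareEngine.init_gpu are illustrative, not the exact wiring in examples/benchmark/index.html):

// Sketch only, not the benchmark's exact code: parse ?gpu=1 and only then
// attempt GPU init; the default path never touches the deadlocking readback.
const useGpu = new URLSearchParams(window.location.search).get('gpu') === '1';

if (useGpu) {
  console.log('[flare-trace] init_gpu: start');
  await flareEngine.init_gpu(); // opt-in diagnostics path; still deadlocks today
  console.log('[flare-trace] init_gpu: done');
} else {
  console.log('[flare-trace] GPU disabled by default (append ?gpu=1 to opt in)');
}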

2. [flare-trace] console checkpoints

Fine-grained console.log checkpoints through loadFlareEngine + runFlareInference: fetch, OPFS probe, cache hit/miss, download progress, FlareEngine.load entry/return, begin_stream_with_params, and per-N decode progress. These paid for themselves immediately: the deadlock above took hours to localise with DOM polling, because Runtime.evaluate blocks on the same frozen main thread. CDP's Runtime.consoleAPICalled stream, by contrast, delivers events even while the UI is frozen, so the trace checkpoints pinpoint the stall site in real time.
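
The harness side is plain CDP; below is a minimal sketch of a listener (assuming Node + puppeteer, which is not part of this PR) that tails the [flare-trace] lines even while the page is wedged:

// Sketch only: subscribe to Runtime.consoleAPICalled over a raw CDP session.
// These events are pushed by the browser process, so they keep arriving while
// the page's main thread is stuck inside a sync WASM call.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const cdp = await page.createCDPSession();
  await cdp.send('Runtime.enable');
  cdp.on('Runtime.consoleAPICalled', ({ type, args }) => {
    const text = args.map((a) => a.value ?? a.description ?? '').join(' ');
    if (text.includes('[flare-trace]')) {
      console.log(`${new Date().toISOString()} ${type}: ${text}`);
    }
  });

  await page.goto('http://localhost:8080/examples/benchmark/index.html'); // URL is illustrative
})();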

3. OPFS caching + streaming download

The 138 MB Q8_0 GGUF streams straight to an Origin Private File System file as it downloads (keyed on the source URL). Subsequent runs skip the download entirely: load drops from ~4 s cold to ~0.4 s on cache hit. Falls back to a regular in-memory Uint8Array on browsers without OPFS.
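
The cache-probe half looks roughly like this (a sketch with illustrative names, not the exact code; the streaming-download half is quoted in the review below):

// Sketch: probe OPFS for a previously cached GGUF, otherwise fall back to a
// plain fetch into memory. Names like cacheName/opfsDir are illustrative.
async function loadModelBytes(url) {
  const cacheName = url.split('/').pop(); // the review below suggests keying on the full URL instead
  let opfsDir = null;
  try {
    opfsDir = await navigator.storage.getDirectory(); // OPFS root
  } catch {
    opfsDir = null; // OPFS unavailable: use the in-memory fallback
  }

  if (opfsDir) {
    try {
      const fileHandle = await opfsDir.getFileHandle(cacheName);
      const file = await fileHandle.getFile();
      if (file.size > 0) {
        console.log('[flare-trace] OPFS cache hit', file.size);
        return new Uint8Array(await file.arrayBuffer());
      }
    } catch {
      // NotFoundError: cache miss, fall through to download
    }
  }

  console.log('[flare-trace] OPFS cache miss, downloading');
  const resp = await fetch(url);
  if (!resp.ok) throw new Error(`HTTP ${resp.status} fetching model`);
  return new Uint8Array(await resp.arrayBuffer()); // the PR streams this into OPFS instead
}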

Test plan

  • npx jest — 62 passing
  • npx prettier --check "src/**/*.ts" — clean
  • Manual: fresh tab → 67 tok/s decode, 148 ms TTFT, OPFS cache hit shortens load to 0.4 s on reruns
  • Manual: ?gpu=1 reproduces the dispatch_and_readback: recv() deadlock (expected until the async refactor ships)

Follow-up

Async-readback propagation in flare-web so that ?gpu=1 actually accelerates inference; tracked against the primitive landed in flarellm #500.

Summary by CodeRabbit

  • New Features

    • Added OPFS-backed model caching for improved performance on supported browsers
    • Introduced runtime GPU toggle via URL parameter
  • Improvements

    • Graceful fallback to in-memory streaming when caching unavailable
    • Enhanced diagnostic tracing for load and inference execution

Three changes needed to make the Flare benchmark runnable on
memory-pressured Chrome sessions and diagnosable when it isn't:

1. CPU-only default (opt-in GPU via ?gpu=1)

   wgpu's sync GPU readback path (map_async + device.poll(Wait) + recv)
   deadlocks on Chrome main-thread wasm — the map_async callback is
   serviced by JS microtasks that can't run during a sync WASM call and
   device.poll(Wait) is a no-op on wasm32.  Every GPU compute pass in
   forward_prefill and forward_single_token_gpu deadlocks as a result.

   Skip init_gpu by default so the benchmark actually runs (67 tok/s
   CPU SIMD on SmolLM2-135M).  Opt into the deadlocking GPU path with
   ?gpu=1 for diagnostics.  Real fix is the async-readback refactor
   tracked upstream.

2. Granular [flare-trace] console.log checkpoints

   Sprinkled through loadFlareEngine + runFlareInference: fetch, OPFS
   probe, cache hit/miss, download progress, FlareEngine.load entry,
   begin_stream_with_params, decode tokens.  Works alongside CDP's
   Runtime.consoleAPICalled event stream, which delivers events even
   when the main thread is frozen (Runtime.evaluate isn't).  Without
   this it took multiple hours to localise the prefill deadlock — the
   sync polling was hitting the same freeze as the UI.

3. OPFS caching + streaming download

   The 138 MB Q8_0 GGUF streams directly to an Origin Private File
   System file as it downloads (cache-name keyed on the source URL).
   Subsequent runs skip the download entirely — the example drops from
   ~4 s first load to ~0.4 s on cache hit.  Falls back to an in-memory
   Uint8Array on browsers without OPFS.

Jest: 62 passing.  Prettier: clean.

coderabbitai Bot commented Apr 23, 2026

📝 Walkthrough

Walkthrough

Adds comprehensive console tracing instrumentation for Flare engine initialization and inference execution. Implements OPFS-backed model caching with remote fallback for GGUF downloads, keyed by filename. Introduces a runtime GPU toggle parameter to conditionally enable GPU initialization.

Changes

Cohort / File(s) Summary
Flare Benchmark Instrumentation & Caching
examples/benchmark/index.html
Adds detailed console trace milestones for WASM/module fetching, import success/failure, OPFS probing, parsing, and FlareEngine lifecycle. Replaces direct GGUF download with OPFS-backed caching: checks for cached model file, loads from cache if present, streams remote GGUF into OPFS on miss, or falls back to in-memory streaming if OPFS unavailable. Introduces ?gpu=1 runtime toggle to conditionally skip GPU initialization.

Sequence Diagram

sequenceDiagram
    participant User
    participant Browser
    participant OPFS as Browser OPFS
    participant Remote as Remote Server
    participant FlareEngine
    participant GPU

    User->>Browser: Load benchmark (?gpu=1)
    Browser->>Browser: Parse GPU toggle
    
    Browser->>OPFS: Check cached model
    alt Cache Hit
        OPFS-->>Browser: Return cached GGUF
    else Cache Miss
        Browser->>Remote: Stream GGUF
        Remote-->>Browser: GGUF bytes
        Browser->>OPFS: Store in cache
        Browser-->>Browser: Cache stored
    end
    
    Browser->>FlareEngine: Load model with GGUF
    FlareEngine->>FlareEngine: Parse & initialize
    
    alt gpu=1 param set
        FlareEngine->>GPU: Initialize GPU
        GPU-->>FlareEngine: GPU ready
    else GPU disabled
        FlareEngine->>FlareEngine: Use CPU fallback
    end
    
    FlareEngine-->>Browser: Engine ready
    Browser->>Browser: Encode/Prefill/Decode

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • #310: Modifies Flare CDN version and WebGPU worker configuration, directly related to the GPU initialization changes and WASM module loading.
  • #309: Modifies the same benchmark file's Flare model loading and GPU initialization flow, including GGUF handling and GPU init conditionals.
  • #307: Foundational changes to the same examples/benchmark/index.html file that this PR extends with instrumentation and OPFS caching.

Suggested labels

size/M


🐰 A cache here, a GPU toggle there,
OPFS whispers files with care,
Console traces paint the way,
Flare engines bloom and play!
Brrrr 🚀✨

🚥 Pre-merge checks: ✅ 5 passed
  • Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title directly and specifically addresses the three main changes: CPU default behavior, trace logging instrumentation, and OPFS caching for improved reliability and diagnostics.
  • Docstring coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; check skipped.
  • Linked issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of scope changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/benchmark/index.html (2)

638-664: Cache key and "valid size" heuristic are both fragile.

Two concerns in the cache lookup:

  1. cacheName is keyed on the filename only (config.url.split('/').pop()). Two different sources publishing GGUFs with the same basename (e.g. requantized variants of model.gguf) would silently collide. The PR description says "keyed by source URL" — consider matching that intent, e.g. hashing the full URL.
  2. The > 100 * 1024 * 1024 heuristic is a magic constant that:
    • excludes any future small/quantized models you may add (e.g. tiny <100 MB GGUFs would always re-download), and
    • happily accepts a 100+ MB partial/corrupt file as "cached" and feeds it straight to FlareEngine.load, which would surface as a confusing parse error rather than a cache miss.

A more robust check is to persist the expected size (e.g. from Content-Length) alongside the file (separate ${cacheName}.meta or a single JSON manifest) and compare exactly, falling back to re-download on mismatch.
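
For illustration, a minimal sketch of that exact-size check (identifiers mirror the snippet under review; treat it as a shape, not a drop-in patch):

// Persist the expected byte count next to the cached file and require an
// exact match before treating the OPFS entry as a hit.
async function readExpectedSize(opfsDir, cacheName) {
  try {
    const metaHandle = await opfsDir.getFileHandle(`${cacheName}.meta`);
    const meta = JSON.parse(await (await metaHandle.getFile()).text());
    return typeof meta.size === 'number' ? meta.size : null;
  } catch {
    return null; // no metadata yet: treat as a miss
  }
}

async function writeExpectedSize(opfsDir, cacheName, size) {
  const metaHandle = await opfsDir.getFileHandle(`${cacheName}.meta`, { create: true });
  const writable = await metaHandle.createWritable();
  await writable.write(JSON.stringify({ size }));
  await writable.close();
}

// Cache hit only when sizes match exactly:
//   const expected = await readExpectedSize(opfsDir, cacheName);
//   const hit = expected !== null && info.size === expected;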

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmark/index.html` around lines 638 - 664, Replace the fragile
filename-only key and size heuristic: compute cacheName from the full source URL
(e.g., hash config.url instead of using config.url.split('/').pop()) so
different sources with the same basename don't collide; when downloading, save a
small metadata record (e.g., `${cacheName}.meta`) containing the expected size
(Content-Length) and any checksum, and when probing OPFS
(navigator.storage.getDirectory(), opfsDir, fileHandle, info.size) read that
metadata and compare the expected size (and/or checksum) exactly to info.size
before treating the file as a cache hit; remove the magic `> 100 * 1024 * 1024`
check, and if metadata is missing or mismatched, treat as cache miss and
re-download, then write both the file and its metadata and only set bytes = new
Uint8Array(await info.arrayBuffer()) after a successful metadata match.

694-697: Also remove the OPFS entry on download failure.

writable.abort() should discard pending writes per spec, but the directory entry created at line 668 ({ create: true }) remains. If the spec/implementation ever leaves a partial blob behind, or once you tighten the cache validation per the comment above, that orphan can poison the next run. A defensive opfsDir.removeEntry(cacheName).catch(() => {}) after abort makes the failure path self-cleaning.

♻️ Proposed fix
             } catch (err) {
               try { await writable.abort(); } catch {}
+              try { await opfsDir.removeEntry(cacheName); } catch {}
               throw err;
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmark/index.html` around lines 694 - 697, After catching the
download error and calling writable.abort(), also attempt to remove the OPFS
entry created for the cache to avoid leaving a partial/orphaned file;
specifically, after the existing writable.abort() call in the catch block, call
opfsDir.removeEntry(cacheName).catch(() => {}) so the failure path cleans up the
directory entry for cacheName without throwing. Ensure this runs in the inner
catch block that handles abort failures and uses the same opfsDir and cacheName
identifiers used when creating the entry.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 39ead71d-a82f-44fc-866c-60bfc78b2a53

📥 Commits

Reviewing files that changed from the base of the PR and between 85c9e7c and f4dc516.

📒 Files selected for processing (1)
  • examples/benchmark/index.html

Comment on lines +671 to +692
const resp = await fetch(config.url);
if (!resp.ok) throw new Error(`HTTP ${resp.status} fetching model`);
const contentLength = parseInt(resp.headers.get('content-length') || '0', 10);
const reader = resp.body.getReader();
const chunks = [];
let received = 0;
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  await writable.write(value);
  chunks.push(value);
  received += value.length;
  if (contentLength > 0) {
    updateProgress(Math.round((received / contentLength) * 100));
  }
}
await writable.close();
T(`downloaded ${(received / 1024 / 1024).toFixed(1)} MB to OPFS`);
bytes = new Uint8Array(received);
let off = 0;
for (const c of chunks) { bytes.set(c, off); off += c.length; }
chunks.length = 0;

⚠️ Potential issue | 🟠 Major

Streaming to OPFS while also buffering every chunk defeats the memory benefit.

The download loop both await writable.write(value) and chunks.push(value), then allocates a fresh new Uint8Array(received) and copies the chunks into it. Peak memory is roughly 3× the model size during the copy (chunks array + new contiguous buffer + OPFS internal buffering), instead of the ~1× the OPFS streaming was meant to enable.

For SmolLM2-135M Q8_0 (~138 MB) this is mostly a wart, but the same code path is used for llama-3.2-1b Q8_0 (~1.3 GB), where this is likely to OOM tabs that would otherwise succeed.

Mirror the cache-hit path: after writable.close(), just read the bytes back from OPFS.

♻️ Proposed fix
             const reader = resp.body.getReader();
-            const chunks = [];
             let received = 0;
             while (true) {
               const { done, value } = await reader.read();
               if (done) break;
               await writable.write(value);
-              chunks.push(value);
               received += value.length;
               if (contentLength > 0) {
                 updateProgress(Math.round((received / contentLength) * 100));
               }
             }
             await writable.close();
             T(`downloaded ${(received / 1024 / 1024).toFixed(1)} MB to OPFS`);
-            bytes = new Uint8Array(received);
-            let off = 0;
-            for (const c of chunks) { bytes.set(c, off); off += c.length; }
-            chunks.length = 0;
+            const cachedFile = await fileHandle.getFile();
+            bytes = new Uint8Array(await cachedFile.arrayBuffer());
             log(`Downloaded ${(received / 1024 / 1024).toFixed(1)} MB`, 'info');
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/benchmark/index.html` around lines 671 - 692, The download loop
currently both writes each streamed chunk to OPFS (using writable.write) and
also accumulates them in the chunks array, then builds a new Uint8Array(bytes)
and copies chunks into it, causing ~3× memory peak; remove the in-memory
buffering (do not push into chunks or build the contiguous bytes from them),
keep streaming to writable and progress updates (resp, reader, writable,
contentLength, updateProgress), then after await writable.close() read the file
back from OPFS the same way the cache-hit path does to obtain a single
contiguous Uint8Array (replacing the bytes/chunks copy logic) and preserve the
T(...) size log by using the read result.

@sauravpanda sauravpanda merged commit 4f73bfa into main Apr 23, 2026
10 checks passed
@sauravpanda sauravpanda deleted the bench-cpu-default-plus-traces branch April 23, 2026 19:55
