Benchmark: CPU default, trace logging, OPFS caching #315
Conversation
Three changes needed to make the Flare benchmark runnable on memory-pressured Chrome sessions and diagnosable when it isn't:

1. **CPU-only default (opt-in GPU via `?gpu=1`).** wgpu's sync GPU readback path (`map_async` + `device.poll(Wait)` + `recv`) deadlocks on Chrome main-thread wasm: the `map_async` callback is serviced by JS microtasks that can't run during a sync WASM call, and `device.poll(Wait)` is a no-op on wasm32. Every GPU compute pass in `forward_prefill` and `forward_single_token_gpu` deadlocks as a result. Skip `init_gpu` by default so the benchmark actually runs (67 tok/s CPU SIMD on SmolLM2-135M). Opt into the deadlocking GPU path with `?gpu=1` for diagnostics. The real fix is the async-readback refactor tracked upstream.
2. **Granular `[flare-trace]` console.log checkpoints.** Sprinkled through `loadFlareEngine` + `runFlareInference`: fetch, OPFS probe, cache hit/miss, download progress, `FlareEngine.load` entry, `begin_stream_with_params`, decode tokens. Works alongside CDP's `Runtime.consoleAPICalled` event stream, which delivers events even when the main thread is frozen (something `Runtime.evaluate` cannot do). Without this it took multiple hours to localise the prefill deadlock: the sync polling was hitting the same freeze as the UI.
3. **OPFS caching + streaming download.** The 138 MB Q8_0 GGUF streams directly to an Origin Private File System file as it downloads (cache name keyed on the source URL). Subsequent runs skip the download entirely; the example drops from ~4 s first load to ~0.4 s on cache hit. Falls back to an in-memory `Uint8Array` on browsers without OPFS.

Jest: 62 passing. Prettier: clean.
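As an aside, here is a minimal sketch (not part of this PR, assuming the `chrome-remote-interface` npm package and Chrome started with `--remote-debugging-port=9222`) of tailing these checkpoints over CDP while the page's main thread is blocked:

```js
// Sketch only: tail [flare-trace] console events over CDP.
// Runtime.consoleAPICalled keeps delivering even when the page's main
// thread is frozen, unlike polling the DOM via Runtime.evaluate.
const CDP = require('chrome-remote-interface');

(async () => {
  const client = await CDP(); // connects to localhost:9222 by default
  await client.Runtime.enable();
  client.Runtime.consoleAPICalled(({ type, args, timestamp }) => {
    const text = args.map(a => a.value ?? a.description ?? '').join(' ');
    if (text.includes('[flare-trace]')) {
      console.log(new Date(timestamp).toISOString(), type, text);
    }
  });
})();
```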
## 📝 Walkthrough

Adds comprehensive console tracing instrumentation for Flare engine initialization and inference execution. Implements OPFS-backed model caching with remote fallback for GGUF downloads, keyed by filename. Introduces a runtime GPU toggle parameter to conditionally enable GPU initialization.
### Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Browser
    participant OPFS as Browser OPFS
    participant Remote as Remote Server
    participant FlareEngine
    participant GPU

    User->>Browser: Load benchmark (?gpu=1)
    Browser->>Browser: Parse GPU toggle
    Browser->>OPFS: Check cached model
    alt Cache Hit
        OPFS-->>Browser: Return cached GGUF
    else Cache Miss
        Browser->>Remote: Stream GGUF
        Remote-->>Browser: GGUF bytes
        Browser->>OPFS: Store in cache
        Browser-->>Browser: Cache stored
    end
    Browser->>FlareEngine: Load model with GGUF
    FlareEngine->>FlareEngine: Parse & initialize
    alt gpu=1 param set
        FlareEngine->>GPU: Initialize GPU
        GPU-->>FlareEngine: GPU ready
    else GPU disabled
        FlareEngine->>FlareEngine: Use CPU fallback
    end
    FlareEngine-->>Browser: Engine ready
    Browser->>Browser: Encode/Prefill/Decode
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🧹 Nitpick comments (2)
examples/benchmark/index.html (2)
**638-664**: Cache key and "valid size" heuristic are both fragile.

Two concerns in the cache lookup:

- `cacheName` is keyed on the filename only (`config.url.split('/').pop()`). Two different sources publishing GGUFs with the same basename (e.g. requantized variants of `model.gguf`) would silently collide. The PR description says "keyed by source URL"; consider matching that intent, e.g. hashing the full URL.
- The `> 100 * 1024 * 1024` heuristic is a magic constant that:
  - excludes any future small/quantized models you may add (e.g. tiny <100 MB GGUFs would always re-download), and
  - happily accepts a 100+ MB partial/corrupt file as "cached" and feeds it straight to `FlareEngine.load`, which would surface as a confusing parse error rather than a cache miss.

A more robust check is to persist the expected size (e.g. from `Content-Length`) alongside the file (a separate `${cacheName}.meta` or a single JSON manifest) and compare exactly, falling back to re-download on mismatch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmark/index.html` around lines 638 - 664, Replace the fragile filename-only key and size heuristic: compute cacheName from the full source URL (e.g., hash config.url instead of using config.url.split('/').pop()) so different sources with the same basename don't collide; when downloading, save a small metadata record (e.g., `${cacheName}.meta`) containing the expected size (Content-Length) and any checksum, and when probing OPFS (navigator.storage.getDirectory(), opfsDir, fileHandle, info.size) read that metadata and compare the expected size (and/or checksum) exactly to info.size before treating the file as a cache hit; remove the magic `> 100 * 1024 * 1024` check, and if metadata is missing or mismatched, treat as cache miss and re-download, then write both the file and its metadata and only set bytes = new Uint8Array(await info.arrayBuffer()) after a successful metadata match.
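A minimal sketch of the suggested direction, using hypothetical helper names (`cacheKeyFor`, `readCachedModel`) that are not in the PR; it assumes the Web Crypto API and the same OPFS handles the example already uses:

```js
// Hypothetical helpers sketching the review suggestion; not the PR's code.
async function cacheKeyFor(url) {
  // Key on the full source URL (hashed), not just the basename.
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(url));
  return Array.from(new Uint8Array(digest), b => b.toString(16).padStart(2, '0')).join('');
}

async function readCachedModel(opfsDir, cacheName) {
  try {
    // Expected size persisted at download time (from Content-Length).
    const metaFile = await (await opfsDir.getFileHandle(`${cacheName}.meta`)).getFile();
    const meta = JSON.parse(await metaFile.text());
    const file = await (await opfsDir.getFileHandle(cacheName)).getFile();
    if (file.size === meta.size) {
      return new Uint8Array(await file.arrayBuffer()); // exact-size match: cache hit
    }
  } catch {
    // Missing file or metadata: treat as a cache miss and re-download.
  }
  return null;
}
```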
**694-697**: Also remove the OPFS entry on download failure.

`writable.abort()` should discard pending writes per spec, but the directory entry created at line 668 (`{ create: true }`) remains. If the spec/implementation ever leaves a partial blob behind, or once you tighten the cache validation per the comment above, that orphan can poison the next run. A defensive `opfsDir.removeEntry(cacheName).catch(() => {})` after abort makes the failure path self-cleaning.

♻️ Proposed fix

```diff
   } catch (err) {
     try { await writable.abort(); } catch {}
+    try { await opfsDir.removeEntry(cacheName); } catch {}
     throw err;
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/benchmark/index.html` around lines 694 - 697, After catching the download error and calling writable.abort(), also attempt to remove the OPFS entry created for the cache to avoid leaving a partial/orphaned file; specifically, after the existing writable.abort() call in the catch block, call opfsDir.removeEntry(cacheName).catch(() => {}) so the failure path cleans up the directory entry for cacheName without throwing. Ensure this runs in the inner catch block that handles abort failures and uses the same opfsDir and cacheName identifiers used when creating the entry.
📒 Files selected for processing (1)
examples/benchmark/index.html
```js
const resp = await fetch(config.url);
if (!resp.ok) throw new Error(`HTTP ${resp.status} fetching model`);
const contentLength = parseInt(resp.headers.get('content-length') || '0', 10);
const reader = resp.body.getReader();
const chunks = [];
let received = 0;
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  await writable.write(value);
  chunks.push(value);
  received += value.length;
  if (contentLength > 0) {
    updateProgress(Math.round((received / contentLength) * 100));
  }
}
await writable.close();
T(`downloaded ${(received / 1024 / 1024).toFixed(1)} MB to OPFS`);
bytes = new Uint8Array(received);
let off = 0;
for (const c of chunks) { bytes.set(c, off); off += c.length; }
chunks.length = 0;
```
Streaming to OPFS while also buffering every chunk defeats the memory benefit.
The download loop both awaits `writable.write(value)` and pushes each chunk into `chunks`, then allocates a fresh `new Uint8Array(received)` and copies the chunks into it. Peak memory is roughly 3× the model size during the copy (chunks array + new contiguous buffer + OPFS internal buffering), instead of the ~1× the OPFS streaming was meant to enable.
For SmolLM2-135M Q8_0 (~138 MB) this is mostly a wart, but the same code path is used for llama-3.2-1b Q8_0 (~1.3 GB), where this is likely to OOM tabs that would otherwise succeed.
Mirror the cache-hit path: after writable.close(), just read the bytes back from OPFS.
♻️ Proposed fix

```diff
   const reader = resp.body.getReader();
-  const chunks = [];
   let received = 0;
   while (true) {
     const { done, value } = await reader.read();
     if (done) break;
     await writable.write(value);
-    chunks.push(value);
     received += value.length;
     if (contentLength > 0) {
       updateProgress(Math.round((received / contentLength) * 100));
     }
   }
   await writable.close();
   T(`downloaded ${(received / 1024 / 1024).toFixed(1)} MB to OPFS`);
-  bytes = new Uint8Array(received);
-  let off = 0;
-  for (const c of chunks) { bytes.set(c, off); off += c.length; }
-  chunks.length = 0;
+  const cachedFile = await fileHandle.getFile();
+  bytes = new Uint8Array(await cachedFile.arrayBuffer());
   log(`Downloaded ${(received / 1024 / 1024).toFixed(1)} MB`, 'info');
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/benchmark/index.html` around lines 671 - 692, The download loop
currently both writes each streamed chunk to OPFS (using writable.write) and
also accumulates them in the chunks array, then builds a new Uint8Array(bytes)
and copies chunks into it, causing ~3× memory peak; remove the in-memory
buffering (do not push into chunks or build the contiguous bytes from them),
keep streaming to writable and progress updates (resp, reader, writable,
contentLength, updateProgress), then after await writable.close() read the file
back from OPFS the same way the cache-hit path does to obtain a single
contiguous Uint8Array (replacing the bytes/chunks copy logic) and preserve the
T(...) size log by using the read result.
## Summary

Three benchmark-only changes so the Flare engine actually runs and is diagnosable. No functional changes to the library itself; paired with flarellm #500, which lands the Rust-side primitives that the follow-up work builds on.
### 1. CPU-only default (GPU opt-in via `?gpu=1`)

The sync GPU readback path in wgpu (`map_async` → `device.poll(Wait)` → `receiver.recv()`) deadlocks on Chrome main-thread wasm:

- `map_async`'s callback is serviced by JS microtasks, which can't run during a sync WASM call
- `device.poll(Wait)` is a no-op on the `wasm32` target

Every `forward_prefill` GPU dispatch hits this and never returns. Running CPU-only keeps the benchmark functional (67 tok/s decode, 148 ms TTFT on SmolLM2-135M Q8_0) while the async-readback refactor lands upstream. `?gpu=1` keeps the opt-in for future diagnostics.
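A minimal sketch of the toggle shape, with illustrative names for the query-string check and engine setup (the actual wiring in `examples/benchmark/index.html` may differ):

```js
// Sketch only: GPU stays off unless ?gpu=1 is present.
// FlareEngine.load / init_gpu calls illustrate the opt-in shape, not the
// exact signatures used in the example; modelBytes is the GGUF Uint8Array
// produced by the loader described in item 3.
const params = new URLSearchParams(window.location.search);
const gpuRequested = params.get('gpu') === '1';

const engine = await FlareEngine.load(modelBytes); // CPU SIMD path by default
if (gpuRequested) {
  // Opt-in diagnostics only: on Chrome main-thread wasm this path currently
  // deadlocks in the sync readback (map_async + device.poll(Wait)).
  await engine.init_gpu();
}
```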
### 2. `[flare-trace]` console checkpoints

Fine-grained `console.log` lines through `loadFlareEngine` + `runFlareInference`: fetch, OPFS probe, cache hit/miss, download progress, `FlareEngine.load` entry/return, `begin_stream_with_params`, per-N decode progress. These paid for themselves immediately: the deadlock above took hours to localise with DOM-polling (`Runtime.evaluate` blocks on the same frozen main thread); CDP's `Runtime.consoleAPICalled` stream delivers events even while the UI is frozen, so trace events pinpoint the stall site in real time.
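A sketch of the checkpoint style; `T` matches the helper visible in the reviewed diff, but the exact call sites and messages here are illustrative:

```js
// Illustrative [flare-trace] checkpoints; the real messages in
// loadFlareEngine / runFlareInference differ.
const t0 = performance.now();
function T(msg) {
  console.log(`[flare-trace] +${(performance.now() - t0).toFixed(0)}ms ${msg}`);
}

T('loadFlareEngine: start');
T('OPFS probe');
T('cache miss: streaming download');
T('FlareEngine.load: enter');
T('begin_stream_with_params');
T('decode: 32 tokens');
```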
### 3. OPFS caching + streaming download

The 138 MB Q8_0 GGUF streams straight to an Origin Private File System file as it downloads (keyed on the source URL). Subsequent runs skip the download entirely: load drops from ~4 s cold to ~0.4 s on cache hit. Falls back to a regular in-memory `Uint8Array` on browsers without OPFS.
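A condensed sketch of the probe-and-fallback shape (the streaming-download branch is elided; see the reviewed snippet in the comments above for the real loop):

```js
// Sketch: prefer an OPFS-cached copy, fall back to an in-memory buffer
// when OPFS is unavailable. Error handling and streaming are elided.
let bytes;
if (navigator.storage && navigator.storage.getDirectory) {
  const opfsDir = await navigator.storage.getDirectory();
  const fileHandle = await opfsDir.getFileHandle(cacheName, { create: true });
  const cached = await fileHandle.getFile();
  if (cached.size > 0) {
    bytes = new Uint8Array(await cached.arrayBuffer()); // cache hit: no download
  } else {
    // cache miss: stream fetch(config.url) into fileHandle.createWritable(),
    // then read the finished file back into `bytes`
  }
} else {
  // No OPFS support: buffer the whole download in memory.
  const resp = await fetch(config.url);
  bytes = new Uint8Array(await resp.arrayBuffer());
}
```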
### Test plan

- `npx jest`: 62 passing
- `npx prettier --check "src/**/*.ts"`: clean
- `?gpu=1` reproduces the `dispatch_and_readback: recv()` deadlock (expected until the async refactor ships)

### Follow-up

Async-readback propagation in flare-web so `?gpu=1` actually accelerates inference. Tracked with the primitive landed in flarellm #500.