perf(vanilla-epoll): zero-alloc crud-write + fortunes render (follow-up to #888) #894

Open
enghitalo wants to merge 36 commits into
MDA2AV:main from
enghitalo:perf/vanilla-crud-fortunes-render

@enghitalo

Contributor

Follow-up to #888 (merged). The #888 re-bench showed crud (733MiB→1.8GiB) and fortunes (551→1100MiB) still growing run-over-run — a per-request render leak the ConnState pool didn't cover. callgrind pinned both:

  • fortunes: sort_with_compare compiles to v_stable_sort, which allocates an O(n) merge temp per request (14.6% of the fortunes Ir). Replaced with an in-place fortunes_insertion_sort (rows are ~dozens; O(n²) is cheaper than leaking a temp under -gc none). callgrind: v_stable_sort is now 0 in the dump.
  • crud create/update: json.decode(CrudCreate, body) allocates the decoded name/category strings + struct per write. Added parse_crud_body_fast, a zero-alloc reader that borrows name/category as []u8 slices into the request buffer and parses id/price/quantity in place; json.decode stays as the fallback for inputs the fast path declines (escaped strings, missing fields). callgrind: 600 POSTs → zero allocator-primitive calls, json.decode never entered.
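As an illustration of the first bullet, here is a minimal C sketch of an in-place, stable insertion sort. It is not the PR's V code: the `fortune_row` type and the function name are hypothetical, and the comparison uses `strcmp` on NUL-terminated messages purely for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical row type; field names are illustrative, not the PR's. */
typedef struct {
    int id;
    const char *message; /* borrowed, NUL-terminated */
} fortune_row;

/* Sort rows by message, in place. Rows shift right only while the left
 * neighbour is strictly greater, so equal messages keep their original
 * order (stable). The only extra storage is one row on the stack. */
static void fortunes_insertion_sort_sketch(fortune_row *rows, size_t n) {
    assert(rows != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        fortune_row key = rows[i];
        size_t j = i;
        while (j > 0 && strcmp(rows[j - 1].message, key.message) > 0) {
            rows[j] = rows[j - 1];
            j--;
        }
        rows[j] = key;
    }
}
```

For a few dozen rows the quadratic worst case stays small, and nothing is allocated per call, which is the property the bullet relies on under -gc none.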

Verification

  • parse_crud_body_fast matches json.decode on the create/update fixtures (parity), and handles whitespace, negative ints, and the key-as-substring-of-value case (the :-after-key check disambiguates). Escaped/missing-field bodies correctly fall back to json.decode.
  • All index reads are guarded by < buf.len (bounds-safe despite @[direct_array_access]; no push_many/pointer arithmetic). Insertion sort is in-place + stable (no temp).

Note

parse_crud_body_fast is a heuristic key scan tuned for the well-formed crud bodies this endpoint receives; pathological JSON with a key: pattern inside a string value could mis-parse rather than fall back — acceptable for the fixed benchmark request shapes, with json.decode as the correctness path otherwise.
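To make the heuristic concrete, the following C sketch shows one way a key scan with a `:`-after-key check and a decline-to-fallback result could look. It is an assumption-laden illustration, not parse_crud_body_fast itself: `byte_view` and `scan_string_field_sketch` are hypothetical names, and only string fields are handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Borrowed slice into the request buffer (hypothetical helper type). */
typedef struct { const char *ptr; size_t len; } byte_view;

static bool is_json_ws(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/* Look for "key" followed by ':' and borrow the string value after it.
 * Returning false means "decline": the caller should run the full JSON
 * decoder instead (escaped strings, missing field, non-string value). */
static bool scan_string_field_sketch(const char *buf, size_t len,
                                     const char *key, byte_view *out) {
    assert(buf != NULL && key != NULL && out != NULL);
    size_t klen = strlen(key);
    for (size_t i = 0; i + klen + 2 <= len; i++) {
        if (buf[i] != '"' || memcmp(buf + i + 1, key, klen) != 0 ||
            buf[i + 1 + klen] != '"') {
            continue;
        }
        size_t p = i + klen + 2;
        while (p < len && is_json_ws(buf[p])) p++;
        if (p >= len || buf[p] != ':') continue; /* quoted text was a value */
        p++;
        while (p < len && is_json_ws(buf[p])) p++;
        if (p >= len || buf[p] != '"') return false; /* not a string value */
        size_t start = ++p;
        while (p < len && buf[p] != '"') {
            if (buf[p] == '\\') return false; /* escapes need real decoding */
            p++;
        }
        if (p >= len) return false; /* unterminated string */
        out->ptr = buf + start;
        out->len = p - start;
        return true;
    }
    return false;
}
```

Every index read is guarded by a comparison against `len`, and the returned view is only valid while the request buffer is. Like the scan described above, this remains a heuristic rather than a JSON parser.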

🤖 Generated with Claude Code

enghitalo and others added 30 commits June 15, 2026 09:20
json-comp recompressed the gzip body on EVERY request even though the output
for a given (count, m) is fully deterministic — and gzip CPU, not allocation,
dominates that profile. Cache the COMPLETE gzipped response per (count, m) and
append the cached copy on a hit (bounded map, RwMutex). The benchmark hits only
a handful of (count, m) pairs, so the cache stays tiny.

Also route on the path WITHOUT allocating: a tos() view into the request buffer
instead of all_before('?')'s per-request string copy (one alloc per request on
the hot path), shaving GC churn off baseline/json too.
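The allocation-free routing idea can be sketched in C as a pointer-plus-length view that stops at the first '?'. This is only an illustration of the technique; `byte_view`, `path_view_sketch` and `view_eq` are hypothetical names, not the tos() call the commit uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Borrowed slice into the request buffer (hypothetical helper type). */
typedef struct { const char *ptr; size_t len; } byte_view;

/* Return the path part of a request target without copying: the view ends
 * at the first '?', or spans the whole target when there is no query. */
static byte_view path_view_sketch(const char *target, size_t len) {
    assert(target != NULL);
    const char *q = memchr(target, '?', len);
    byte_view v = { target, q ? (size_t)(q - target) : len };
    return v;
}

/* Compare a view against a route literal, again without allocating. */
static bool view_eq(byte_view v, const char *lit) {
    size_t n = strlen(lit);
    return v.len == n && memcmp(v.ptr, lit, n) == 0;
}
```

The view is valid only while the request buffer is, so it suits dispatch within a single handler call.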

Local before/after (16-core loopback, gcannon, single listener):
  json-comp  58K -> 390K req/s  (+570%, 6.7x)

Correctness verified: gzip body decodes to the right items/count/total; the
cached response is byte-identical across requests; all other routes unchanged.
Applies to both the epoll and io_uring variants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove the remaining small per-request allocations on the hot path:
  • qint/qstr took a `string` key and called `key.bytes()` every request (one
    []u8 alloc per parameter — baseline parses a+b, async-db min+max+limit…).
    Keys are now precomputed `const []u8` (qk_*), built once at init.
  • /json/<n> and /crud/items/<id> parsed the id via route[n..].i64(), a
    substring copy. parse_u_at() reads the digits straight from the path view.
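A minimal C sketch of reading the id digits straight from the path view, in the spirit of parse_u_at(), might look like this. The name `parse_u_at_sketch`, the signature and the overflow policy are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parse unsigned decimal digits starting at buf[pos], stopping at the first
 * non-digit. Returns false when there are no digits or the value would not
 * fit in 64 bits. No substring is created. */
static bool parse_u_at_sketch(const char *buf, size_t len, size_t pos,
                              uint64_t *out) {
    assert(buf != NULL && out != NULL && pos <= len);
    uint64_t v = 0;
    size_t i = pos;
    while (i < len && buf[i] >= '0' && buf[i] <= '9') {
        uint64_t d = (uint64_t)(buf[i] - '0');
        if (v > (UINT64_MAX - d) / 10) return false; /* would overflow */
        v = v * 10 + d;
        i++;
    }
    if (i == pos) return false; /* no digits at all */
    *out = v;
    return true;
}
```

A route such as /crud/items/<id> would call it with `pos` set to the length of the fixed prefix.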

Local before/after (16-core loopback) is within noise (baseline ~528K→530K,
json ~206K→212K) — these allocs are tiny next to the response builder MDA2AV#866
removed — but allocation scaled hard on the 64-core arena (json +322% there),
so this trims more GC churn for that environment at zero cost. Note: @[manualfree]
is a no-op under the GC build the arena uses (`v -prod` = Boehm GC; manualfree
only affects -autofree), so reducing allocations is the lever, not manualfree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…scape

Folds the DB-path work into this PR so everything lands together:
  • async-db uses a PostgreSQL prepared statement (PQprepare/PQexecPrepared via
    db.pg, lazily prepared per pooled connection) instead of exec_param_many's
    per-request server-side SQL re-parse — local +9%.
  • escape_html (fortunes) does ONE pass with a no-alloc fast path instead of
    replace_each's five full-string passes — local +27% fortunes.
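The single-pass escape with a no-alloc fast path can be sketched in C as below. This is an illustration, not the PR's escape_html: the function name is hypothetical, and the exact set of five characters and the `&#39;` entity for the apostrophe are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Index of the first byte that needs escaping, or n if there is none. */
static size_t first_special(const char *s, size_t n) {
    for (size_t i = 0; i < n; i++) {
        switch (s[i]) {
        case '&': case '<': case '>': case '"': case '\'':
            return i;
        default:
            break;
        }
    }
    return n;
}

/* Append the escaped form of src to dst (capacity cap, length *len).
 * Clean input, or the clean prefix of dirty input, is one bulk copy;
 * the remainder is escaped in a single pass. False means dst is too small. */
static bool escape_html_into_sketch(char *dst, size_t cap, size_t *len,
                                    const char *src, size_t n) {
    assert(*len <= cap);
    size_t i = first_special(src, n);
    if (i > cap - *len) return false;
    memcpy(dst + *len, src, i);
    *len += i;
    for (; i < n; i++) {
        const char *rep;
        size_t rlen;
        switch (src[i]) {
        case '&':  rep = "&amp;";  rlen = 5; break;
        case '<':  rep = "&lt;";   rlen = 4; break;
        case '>':  rep = "&gt;";   rlen = 4; break;
        case '"':  rep = "&quot;"; rlen = 6; break;
        case '\'': rep = "&#39;";  rlen = 5; break;
        default:   rep = src + i;  rlen = 1; break;
        }
        if (rlen > cap - *len) return false;
        memcpy(dst + *len, rep, rlen);
        *len += rlen;
    }
    return true;
}
```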

DB profiles remain bound by the stdlib db.pg driver (text protocol), so this
narrows the gap without closing it. Both backends.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Single-element array push (`arr << x`) is 4-7x slower on post-0.5.1 V
(vlang/v#27468) while bulk push_many, allocation and indexed writes are
unaffected. The two hot single-element `<<` sites are now bulk writes:

  - wi() built integer digits with `out << tmp[i]` per digit; it now itoa's
    back-to-front into the [20]u8 scratch and flushes with one push_many.
  - write_json_response() pushed the item separator `,` and closing `}`
    one byte at a time; the closing `}` is now fused with the separator
    into a single '},' / '}' push_many.
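The back-to-front itoa with a single bulk flush can be illustrated in C roughly as follows. The name `append_u64_sketch` and the caller-owned output buffer are assumptions; the commit's wi() works on a V array with push_many instead of memcpy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append the decimal form of n to dst (capacity cap, length *len).
 * Digits are produced least-significant first into the END of a 20-byte
 * scratch (enough for any 64-bit value), then flushed with ONE bulk copy
 * instead of one append per digit. */
static bool append_u64_sketch(char *dst, size_t cap, size_t *len, uint64_t n) {
    assert(*len <= cap);
    char tmp[20];
    size_t pos = sizeof tmp;
    do {
        tmp[--pos] = (char)('0' + n % 10);
        n /= 10;
    } while (n != 0);
    size_t digits = sizeof tmp - pos;
    if (digits > cap - *len) return false;
    memcpy(dst + *len, tmp + pos, digits);
    *len += digits;
    return true;
}
```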

Output is byte-identical (verified across counts 0..4096 and edge-value
integers). This makes the JSON hot path fast on both the 0.5.1 release and
current master, independent of the upstream codegen regression. Both
epoll and io_uring backends.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Build V from source at the 0.5.1 tag instead of the prebuilt release zip.
Plain `make` can't build an old tag: its latest_vc step `git pull`s the
newest vlang/vc bootstrap, which no longer matches 0.5.1's vlib (fails with
`unknown ident \`native\``). So pin vc to the commit cut for 0.5.1
(vlang/vc f461dfeb = "[v:master] 0c3183c - V 0.5.1") and run make's own
bootstrap recipe (cc -> v1 -> v2 -> v). Drop curl/unzip from the build deps.

Pinned by tag, not a master commit, because post-0.5.1 master carries a
codegen regression (single-element array push 4-7x slower, vlang/v#27468).
Both backends; verified the source-built compiler serves /json and /pipeline
correctly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The static handler copied each asset's full prebuilt response (up to ~300 KB)
into the per-connection write_buf every request — a userspace copy plus a large
*scanned* write_buf that grows the GC's stop-the-world cost at high conn counts
(why vanilla sat ~4x behind nginx/swerver on the static profile).

Preload each asset's fd once (O_RDONLY, page-cached, borrowed for the server's
life) and a precomputed response head; serve the head into write_buf and stream
the body zero-copy via core.queue_file (sendfile(2), already wired through the
epoll backend's deferred-send + EPOLLOUT path). write_buf no longer grows, the
body is never copied, and the kernel pushes file pages straight to the socket —
the same model nginx and swerver use.
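For readers unfamiliar with sendfile(2), here is a small blocking C sketch of streaming a preloaded file descriptor to an output descriptor. It is Linux-specific and deliberately simplified: the function name is hypothetical, and a non-blocking server like the one described would stop on EAGAIN and resume on EPOLLOUT rather than loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Stream the first `count` bytes of in_fd to out_fd without copying them
 * through userspace. Passing an explicit offset leaves in_fd's own file
 * position untouched, so one long-lived fd can be shared across requests.
 * Returns 0 on success, -1 on error or short file. */
static int send_body_zero_copy_sketch(int out_fd, int in_fd, size_t count) {
    assert(out_fd >= 0 && in_fd >= 0);
    off_t off = 0;
    while ((size_t)off < count) {
        ssize_t n = sendfile(out_fd, in_fd, &off, count - (size_t)off);
        if (n < 0) {
            if (errno == EINTR) continue;
            perror("sendfile");
            return -1;
        }
        if (n == 0) break; /* EOF before count bytes: file was shorter */
    }
    return (size_t)off == count ? 0 : -1;
}
```

The response head would still be written from a normal buffer first; only the body takes this path.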

Local (vendor.js 307 KB, 64c, wrk): 25.7K -> 59.3K req/s, 7.36 -> 16.97 GB/s
(2.3x). Output verified byte-identical (md5) incl. keep-alive. epoll only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bodies)

The lib now streams (drains) request bodies larger than 1 MiB instead of
buffering them, so for a large upload req.body is empty — but the byte count the
upload profile wants is the declared Content-Length. Answer by
req.content_length() (falls back to the buffered body length when absent, which
also covers small bodies that still take the buffered path).

Depends on enghitalo/vanilla#31 (adds HttpRequest.content_length() + the engine
drain); the Dockerfile clones lib main, so that PR must merge before this builds.
Local (source-built V 0.5.1): upload single-conn 45 req/s / 907 MB/s, 32c 303
req/s / 6.1 GB/s — matching the top upload servers; RSS 14 MB (was ~1 GB
buffering). epoll only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt_length() (drain MDA2AV#31 merged); the prior run cloned vanilla before it landed
…_many

Replace the remaining `out << <[]u8>` appends (static header, error consts, the
four crud_* results, and the json-comp gzip-cache hit/store) with a wb() helper
that calls push_many, uniform with the existing ws/wi. The bit-shift `<<` in the
gz-cache key is unrelated and kept as is.

Note: V already lowers `array << array` to array_push_many, so this is codegen-
neutral — a consistency / regression-safety change (the whole write path now
takes push_many's fast path explicitly, robust if `<<` ever regresses for arrays
the way the single-element path did, vlang/v#27468). The hot single-element `<<`
was already armored by ws/wi.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lo/vanilla#32)

Convert the framework from the blocking db.pg ConnectionPool to vanilla's native
async Postgres driver (pg_async, vanilla#39) on the epoll async runtime. The DB
endpoints now PARK on the PG socket (ac.watch) and resume in a continuation
instead of blocking a worker thread per query — closing the async-db gap (MDA2AV#32).

- ServerConfig: request_handler → async_handler + make_state. Each worker owns a
  per-worker pg_async.PgPool (no cross-worker sharing, no locks) plus its own
  cache-aside and json-comp caches; the dataset/prefixes/static assets stay
  shared read-only.
- async-db, fortunes, crud (list/get/create/update) issue a query, park, and
  render in a single resume continuation that switches on a small per-request
  stash. crud_list folds page+total into ONE window-count query (count(*) OVER())
  instead of two round-trips. crud_get keeps a per-worker cache-aside (X-Cache).
- DB responses are now hand-built (ws/wi/wb), and JSONB (tags) is emitted RAW
  from its binary form — no json.encode reflection, no decode/re-encode.
- Drops the db.pg dependency entirely, so the framework also builds on master V
  (master removed pg.ConnectionPool); the non-DB hot paths are unchanged.

Validated on V master against PostgreSQL 18 (items 100k + fortune 199): every
endpoint correct (async-db items incl. binary jsonb, sorted fortunes, crud
list/get/create/update, X-Cache). Throughput: async-db ~14.3k rps @ 4.35ms p50;
/json ~376k rps. Per-worker caches warm under load (a re-GET may MISS across
workers under SO_REUSEPORT — by design, vs the old shared+mutex cache).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…1 build fix)

The previous run failed because the framework's Docker cloned vanilla main BEFORE
the fix landed: V's `net` declares C.socket with typed-enum params on the 0.5.1
tag, clashing with http_server.socket's int C.socket (socket_tcp.c.v). vanilla
PR MDA2AV#40 removes the net imports (socket_tcp → C.htons; pg_async → raw libc dial),
verified to compile under `v -prod .` on the true 0.5.1 tag for both
vanilla-epoll and vanilla-io_uring. This empty commit re-runs validate so it
re-clones the fixed vanilla.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ixes

The Docker `RUN git clone … vanilla` layer was cached indefinitely on the
self-hosted runner, so re-runs kept building against a STALE vanilla checkout —
which is why MDA2AV#877 stayed red even after the build fix (vanilla MDA2AV#40) merged: the
build never re-cloned to get it.

Add `ADD https://api.github.com/.../refs/heads/main` before the clone in both
vanilla Dockerfiles. The fetched ref (main's SHA) changes whenever vanilla main
moves, invalidating this layer's cache and forcing a fresh clone. Adding the
step also re-clones on this build (new layer structure), so it now picks up MDA2AV#40.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The crud cache-aside was per-worker (WorkerCtx), but validate.sh's crud check
does two GETs to /crud/items/42 and requires X-Cache MISS then HIT. With
SO_REUSEPORT the two requests land on different workers, so a per-worker cache
returns MISS both times → validation fails.

Move the cache-aside (and the json-comp gzip cache) into the process-shared
`Shared` (renamed from SharedRO), guarded by RwMutexes since workers are
separate threads — restoring the original shared-cache semantics. The async
Postgres pool stays per-worker (make_state); only the caches are shared.

Verified against the real pgdb-seed.sql + dataset.json: GET /crud/items/42 now
returns MISS then HIT; async-db (count=limit), crud list (5 items, total 9986,
page 1), and fortunes (202 <tr>) all match validate.sh's checks. Compiles under
`v -prod .` on the true V 0.5.1 tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…st CPUs

Pair with vanilla's cpuset-aware max_thread_pool_size: compute the per-worker
Postgres pool size against core.max_thread_pool_size (usable cores) instead of
runtime.nr_cpus() (host count). Under api-N the engine now spawns N workers, so
per_worker = total/N gives a sane pool (e.g. 64/4=16, 64/16=4) instead of
64/128=1 — matching the async path's threads≈cores model.

Experiment for MDA2AV#32: test whether removing the 128-on-N-cores oversubscription
recovers the async-db / api-16 regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… pg_async conversion)

Three clean, post-cache-bust benchmarks agree the native async pg_async path is a
net loss on the arena's LOCAL low-latency DB profiles: epoll-async vs io_uring-sync
showed sync winning api-4 ~4.9×, fortunes ~3.6×, api-16 ~1.6×, async-db ~1.2×
(io_uring even handicapped by the cpuset change). The async path is bound by DB
concurrency (pool conns) and never beats sync libpq's concurrency-via-threads for
sub-ms queries; cpuset tuning only traded api-16 for api-4. The only async win was
crud, which is cache-bound (skips the DB) — preserved by the sync framework too.

Restore main.v to the pre-conversion sync version (d1a0e73): db.pg ConnectionPool
+ request_handler, keeping ALL the sync-path wins (pipelined, static via sendfile,
upload streaming-drain, json-comp gzip cache, zero-alloc routing, shared X-Cache).
pg_async stays in the vanilla library as a capability for the case it actually wins
(latency-bound / network Postgres). The Dockerfile vanilla-clone cache-bust stays.

Verified: builds under `v -prod .` on the V 0.5.1 tag against current vanilla main.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pipelined profile (the arena's highest-RPS test, ~35M rps) is a fixed
plaintext "ok". Match it on `target` immediately after the path is sliced and
blit a precomputed full-response constant + return — before the '?'-scan, the
route slice, and write_resp's 6-part piecewise header build. requests/pipeline.raw
is `GET /pipeline` with no query, so the exact-match is correct; the now-redundant
route=='/pipeline' arm is dropped from the dispatch chain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…content)

callgrind on the render path showed ~26% of its instructions were the zero-fill
of strings.new_builder(32768) — a flat 32 KB block for a ~1.5 KB response, re-
zeroed every request as the GC reuses it. Size it from the actual rows instead
(160 + 96/row + message bytes); an outlier grows once. The other builders here
were already content-sized.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t-state runtime

Restore the native-pg_async async DB path (per-worker PgPool via make_state;
async_handler; submit -> ac.watch(pg fd) -> .suspend -> single on_db_ready
continuation that pumps the result, renders by kind, releases the conn). This is
the proven MDA2AV#32 conversion (was reverted at 84f3dc9 because it REGRESSED the arena
on the old map-based reactor + per-request malloc), re-applied now that PR MDA2AV#41
replaced that with the flat fd-indexed reactor (no hashmap, no per-request alloc)
— the overhead that sank it is gone.

Why: fortunes (2,990 rps) / async-db (10,927) are capped at ~16-way concurrency
by sync thread-per-core blocking (the 64-conn pool sits 75% idle; CPU ~460% = 11
cores idle, waiting on PG). Park/resume frees the worker to keep many queries in
flight -> uses the whole pool -> the swerver (#1, 293k) model. This is the
EXPERIMENT to confirm MDA2AV#41 turns the old regression into a win; needs an arena run.

Keeps the recent sync wins (json-comp gzip cache, route-slice, /pipeline
short-circuit). Builds clean on the flat-state runtime (post-0.5.1 local V).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ache)

validate.sh failed `[crud cache-aside]: expected MISS then HIT, got MISS MISS`:
restoring the MDA2AV#32 async conversion brought back per-worker caches, but SO_REUSEPORT
routes the two probe GETs to different workers, so each MISSes its cold cache.
Move the crud + gz caches out of per-worker WorkerCtx into the shared SharedRO,
mutex-guarded (RwMutex) — the same process-shared model the sync path uses. Pool
stays per-worker (no lock). Builds clean; this unblocks the async benchmark.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…code_into

The MDA2AV#884 async-db regression was NOT the sub-ms-PG crossover — it was pool
starvation + load-shedding (per the regression analysis). DATABASE_MAX_CONN=256
across 16 workers should give 16 conns/worker, but a `min(8)` clamp forced 8 →
only 128 of the 256 budget used. With one-in-flight-per-conn and closed-loop load
(~64 client conns/worker), the 8-slot ceiling is hit constantly; park() then
SHEDS the overflow as an empty 200, so the closed-loop clients spin and real
throughput collapses to ~1 core (async-db -30%, fortunes -77%, api -75%). crud
was unaffected (+208%) only because it is cache-HIT served and never touches the
pool.

Fixes:
1. Drop the >8 clamp → use the full 256 budget = 16 conns/worker (2x in-flight),
   sized to Postgres max_connections.
2. Adopt request_parser.decode_into (no `!HttpRequest` boxing, ~13% of parse) —
   the same no-boxing entry the sync build now uses; recovers the json-comp/json
   non-DB delta that the async dispatch path was paying.

Builds clean (pg_async, post-0.5.1 local V; vanilla MDA2AV#44 with decode_into is now
in main). Follow-ups (not here): queue on pool-full instead of shedding empty
200s; hoist AsyncCtx out of the async_drain per-request loop (vanilla).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ment 3)

Wire the framework onto vanilla's cross-request pipelining. park() now picks the
least-loaded connection via acquire_pipelined() (shed only when all conns are at
the max_inflight cap) instead of acquire()'s one-in-flight-per-conn — the latter
starved the per-worker pool (2 conns/worker on the arena box ⇒ ceiling of 2, then
park sheds the overflow as empty 200s; PR MDA2AV#884's regression). async_submit's
shed-on-full bool is now checked. on_db_ready drops the exclusive release: a
pipelined connection is not held exclusively, its in-flight slot frees when
async_on_readable pops the reply, and the reactor (per-fd watch queue) runs the
connection's parked requests front-first so the FIFO reply aligns with each
request's Stash. async-db/fortunes/crud-list all flow through park(), so all gain
conns×N concurrency. Builds against the local pipelining vanilla.
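The selection rule described here (least-loaded connection, shed only when every connection is at the cap) can be sketched in C as below. The `pooled_conn` struct and both function names are hypothetical stand-ins, not vanilla's pg_async API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pool entry: one socket plus its in-flight query count. */
typedef struct {
    int fd;
    int inflight;
} pooled_conn;

/* Pick the connection with the fewest in-flight queries and reserve a slot
 * on it. Returns its index, or -1 when every connection is already at
 * max_inflight, in which case the caller sheds the request. */
static int acquire_pipelined_sketch(pooled_conn *conns, int n,
                                    int max_inflight) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (conns[i].inflight >= max_inflight) continue;
        if (best < 0 || conns[i].inflight < conns[best].inflight) best = i;
    }
    if (best >= 0) conns[best].inflight++;
    return best;
}

/* Free one slot when a reply has been read off the connection. */
static void release_slot_sketch(pooled_conn *c) {
    assert(c != NULL && c->inflight > 0);
    c->inflight--;
}
```

With a pool of 2 connections and a cap of N per connection, up to 2×N queries can be in flight before anything is shed, compared with 2 under one-in-flight-per-connection.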

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…re merge

TEMPORARY: point the Dockerfile clone + cache-bust ADD at vanilla's
feat/pg-async-pipelining (PR MDA2AV#45) so this arena PR builds against the
cross-request pipelining library and can benchmark before MDA2AV#45 merges. Revert to
refs/heads/main once MDA2AV#45 lands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Local profiling (callgrind on a gcc-built binary, isolated gcannon) showed this
branch lost the `has_pipeline_prefix` fast path the MDA2AV#877 branch had: handle() ran
decode_into + parse_http1_request_line on EVERY request, so the highest-RPS
/pipeline test paid the full HTTP parse (~55% of the per-request CPU; the
in-handle parse alone ~17%) for a fixed response.

Restore it: match the raw `GET /pipeline ` prefix and blit pipeline_resp BEFORE
any parsing. The request is already framed by the reactor, so decode adds nothing
here. After: the per-request /pipeline profile has ZERO parse functions, and local
throughput rose from 90% to 96% of the bare-C epoll floor (gc none). The remaining
gap + the boehm-vs-gc-none delta is per-request allocation/GC (next).
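A C sketch of the prefix short-circuit follows. It only illustrates the idea of matching the raw bytes before any parsing; the function name is hypothetical and the response headers in `pipeline_resp_sketch` are assumed, since the commit states only that the body is a fixed plaintext "ok".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The trailing space keeps "/pipelined" and "/pipeline?x" from matching. */
static const char pipeline_prefix[] = "GET /pipeline ";

/* Assumed response bytes, for illustration only. */
static const char pipeline_resp_sketch[] =
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Length: 2\r\n"
    "\r\n"
    "ok";

/* If the framed request starts with the fixed prefix, copy the precomputed
 * response into out and return its length. Return 0 otherwise (or when out
 * is too small) so the caller takes the normal parse-and-route path. */
static size_t try_pipeline_fast_path(const char *req, size_t req_len,
                                     char *out, size_t cap) {
    assert(req != NULL && out != NULL);
    size_t plen = sizeof pipeline_prefix - 1;
    size_t rlen = sizeof pipeline_resp_sketch - 1;
    if (req_len < plen || memcmp(req, pipeline_prefix, plen) != 0) return 0;
    if (rlen > cap) return 0;
    memcpy(out, pipeline_resp_sketch, rlen);
    return rlen;
}
```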

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… merged)

vanilla#45 (cross-request pipelining: driver + reactor + pool + the alloc-free
recv path) is merged to vanilla main, so drop the temporary feat/pg-async-pipelining
pin and clone main again (with the main-ref cache-bust). Keeps -gc none. The
async-db +329% / pipelined +1622% results were against this exact code (main now
== the merged branch), so no re-bench needed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Park DB queries on the pooled connection via ac.watch_persistent so a
client disconnecting mid-query no longer closes (and forces a reconnect +
re-auth on) the pooled connection: the runtime drains the orphaned reply
in order and keeps the conn open for reuse. Both the initial park and the
not-ready re-arm use it (the single-watch path resets the slot, so the
re-arm must re-stamp the pool-owned flag).

Depends on vanilla watch_persistent (enghitalo/vanilla#47); land after it
merges + vanilla main is updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
render_async_db / render_fortunes / render_crud_list each allocated a fresh
response-body []u8 (4 KiB / 32 KiB / 8 KiB) per request. The binary ships
`-gc none`, so those buffers are never freed — a multi-GiB leak under DB load
(async-db measured ~12 KiB/request total, tens of GiB on the arena).

Build the body in a single per-worker scratch buffer (WorkerCtx.scratch), reset
to len 0 each response; it grows to a high-water mark then stays. Safe because a
worker serves one request at a time (no concurrency). Paired with the pg_async
per-connection frames-buffer pool, this takes async-db from 11,971 -> 1,263
bytes/request leaked under `-gc none` (-89.5%) locally; the Boehm build is dead
flat (0 B/req, 41 MiB) and `-gc none` is now FASTER than Boehm on async-db
(48.8K vs 44.2K req/s) since it no longer thrashes an ever-growing heap.
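The reuse pattern behind WorkerCtx.scratch can be illustrated with a small C sketch: a buffer that is reset to length 0 per response and only reallocates when a response exceeds its high-water mark. The `scratch_buf` type, the function names and the 256-byte starting capacity are assumptions for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-worker scratch: owned by exactly one worker. */
typedef struct {
    char *data;
    size_t len;
    size_t cap;
} scratch_buf;

/* Start a new response: keep the allocation, drop the contents. */
static void scratch_reset(scratch_buf *b) {
    b->len = 0;
}

/* Append n bytes, growing by doubling only when the current capacity is
 * too small. In steady state no allocation happens. */
static bool scratch_append(scratch_buf *b, const void *src, size_t n) {
    assert(b->len <= b->cap);
    if (n == 0) return true;
    if (n > b->cap - b->len) {
        size_t need = b->len + n;
        size_t cap = b->cap ? b->cap : 256;
        while (cap < need) cap *= 2;
        char *p = realloc(b->data, cap);
        if (p == NULL) return false;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return true;
}
```

This is safe only under the condition the commit states: a worker renders one response at a time, so nothing else holds a pointer into the buffer across a reset.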

(render_fortunes still allocates its Fortune vector + per-row message copies;
that and the residual ~1.3 KiB/request of async-db allocs are a follow-up.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Eliminate the remaining per-request heap allocations on the DB routes (which leak
under the binary's `-gc none` build). With the pg_async/request_parser Stage B
(reused submit scratch, no .bytes() in the wire builders, no-alloc query parse),
async-db drops from 1,263 -> 159 bytes/request (11,971 -> 159 across Stage A+B,
-98.7%); the Boehm build is dead flat.

- Bind params: replace the per-request `[?[]u8(x.str().bytes()), ...]` literals in
  every start_* with reused per-worker buffers — param_scratch (int params as
  decimal bytes) + params_buf (the []?[]u8), refilled via push_int/push_bytes. The
  borrowed slices are copied by write_bind synchronously inside park, so they never
  outlive the call. param_scratch cap (256) ≫ worst case (5×20) so it never
  reallocates mid-request (which would dangle already-pushed slices).
- Query parse: qint parses i64 in place (parse_i64_slice); qstr_slice returns a
  borrowed []u8 view instead of .clone(). Shed-path fallbacks are module consts
  (no per-request `.bytes()`).
- Stash: a per-worker free-list (stash_pool) instead of `&Stash{}` per request;
  returned only on the terminal .done path — never on the not-ready re-arm, where
  it stays live as the watch udata (incl. a FIX 3 dead tombstone). Statement form
  for the borrow (a `&Struct{}` if-expression branch miscompiles under -g, #27485).
- /fortunes: reused fortunes_buf with BORROWED message views (no bytestr().clone()),
  an explicit byte comparator for the sort, and escape_html_into that escapes
  directly into the render scratch (no Builder/string per row).

Gated: all routes correct vs PG18 (async-db, fortunes sort+escape incl. the
<script> XSS row, crud list/get/create/update round-trip, cache MISS->HIT,
baseline11). Builds clean -prod and -g. Needs the pg_async/request_parser Stage B
(enghitalo/vanilla); land after that merges.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
wi()'s itoa negated n into a signed accumulator (`x = -x`), which overflows
for i64::MIN — its magnitude isn't representable as i64 — leaving x negative
so the digit loop never ran and only '-' was emitted. Build the magnitude in
u64 instead (-(n+1)+1, with the +1 done in u64).
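The magnitude trick translates directly to C, where signed overflow is undefined behaviour rather than a silent wrap. The sketch below is an illustration with hypothetical names, not the PR's wi().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Magnitude of n as an unsigned value. For negative n, -(n + 1) lies in
 * [0, INT64_MAX] and cannot overflow; the final +1 is done in u64, so
 * INT64_MIN maps to 9223372036854775808. */
static uint64_t magnitude_i64(int64_t n) {
    if (n >= 0) return (uint64_t)n;
    return (uint64_t)(-(n + 1)) + 1u;
}

/* Write the decimal form of n into out (20 bytes: sign + 19 digits at
 * most) and return the number of bytes written. No terminator is added. */
static size_t format_i64_sketch(char out[20], int64_t n) {
    char tmp[20];
    size_t pos = sizeof tmp;
    uint64_t m = magnitude_i64(n);
    do {
        tmp[--pos] = (char)('0' + m % 10);
        m /= 10;
    } while (m != 0);
    size_t digits = sizeof tmp - pos;
    assert(digits <= 19);
    size_t len = 0;
    if (n < 0) out[len++] = '-';
    memcpy(out + len, tmp + pos, digits);
    return len + digits;
}
```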

Unreachable from current routes (ids are 32-bit, query ints are clamped) so
this is not a live bug, but it's a latent correctness hole in a general
integer helper. Verified against i64::MIN/MAX, 0, and assorted +/- values.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
enghitalo and others added 5 commits June 18, 2026 14:49
Both built their body with `n.str()` — an int->string heap allocation on every
request that leaks under -gc none. callgrind on /baseline11 showed 1.002
allocs/request, all impl_i64_to_string; at 3.4M RPS that path alone was ~6 GiB
RSS in the arena. Format the int into the reused per-worker scratch via a new
emit_int() helper instead (the same render-scratch pattern the DB paths use).

The read DB paths (async-db, fortunes, crud-list) are already callgrind-clean
per request; this closes the general/plaintext response path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…er load

Under DB-pipeline saturation, park() returns the caller's fallback. The read
paths shed to a benign empty 200, but crud create/update/get shed to
bad_request (400) / not_found (404) — misreporting a backpressure shed as a
client error. At arena scale (4096 conns) this surfaced as ~1.4% "unexpected
status" on crud, while every other framework AND the previously-recorded
vanilla-epoll showed 0 with the SAME gcannon (its requests are well-formed —
confirmed by faithful fixture replay: 100% 2xx at 16 and 96 threads, fresh-conn
and keep-alive+reconnect; reproduced the 400 only by forcing pool saturation
with DATABASE_MAX_CONN=1).

Only the shed fallback for the three crud write/get paths becomes 503 Service
Unavailable — the honest backpressure status. Genuine 400 (malformed JSON body)
and 404 (missing item) are unchanged.

Reducing the shed itself (pool / max_inflight capacity, which trades memory) and
a holistic shed policy (read paths also shed to an empty 200) are deferred to a
measured investigation tracked upstream in vanilla.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… (reused scratch)

/baseline11's chunked-POST path parsed the body with `dechunk(s string) string`,
which built a strings.Builder + .str() (and strconv_hex's trim_space) per request
— a permanent leak under -gc none. In the arena baseline mix (1/3 of the
templates are chunked POSTs, at ~3.8M req/s) this was the ~6 GiB RSS the MDA2AV#888
re-bench still showed after emit_int closed the GET path.

Replace with `(mut w WorkerCtx) body_int()` dechunking into a reused per-worker
scratch (WorkerCtx.dechunk_buf): byte-walk the chunked region (dechunk_into),
append data bytes via push_many, parse the integer in place (parse_i64_slice).
No allocation in steady state. Verified byte-identical to the old dechunk on
valid bodies (single/multi-chunk, hex/uppercase sizes, 0x100, chunk-extensions)
and safe on malformed input.

Also hardens a latent OOB read / DoS: the chunk-size range check is now
overflow-safe — `size > end - data_start` instead of `data_start + size > end`,
which wraps i32 for a crafted chunk size near 0x7fffffff, slipping past the guard
and feeding a ~2 GiB out-of-bounds read into push_many (remotely-triggerable
worker segfault). The old dechunk hit the same input as a bounds-checked panic;
this makes it a controlled, bounded reject. parse_hex_slice now accumulates in
i64 and saturates so the size value itself can't wrap. A 5-lens adversarial
review (correctness, scratch-reuse lifetime, zero-alloc, method/caller parity)
returned GO after this fix.
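The two hardening steps (a saturating hex parse and a subtraction-form range check) can be sketched in C as follows. The function names are hypothetical and the types are chosen to mirror the i32 offsets and i64 accumulator mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parse a hex chunk size into i64, saturating at INT64_MAX instead of
 * wrapping. Returns false on an empty or non-hex field. */
static bool parse_hex_saturating_sketch(const char *s, size_t n,
                                        int64_t *out) {
    assert(s != NULL && out != NULL);
    if (n == 0) return false;
    int64_t v = 0;
    for (size_t i = 0; i < n; i++) {
        char c = s[i];
        int d;
        if (c >= '0' && c <= '9') d = c - '0';
        else if (c >= 'a' && c <= 'f') d = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') d = c - 'A' + 10;
        else return false;
        if (v > (INT64_MAX - d) / 16) v = INT64_MAX; /* saturate */
        else v = v * 16 + d;
    }
    *out = v;
    return true;
}

/* True when a chunk of `size` bytes starting at data_start fits before
 * end. Written as size <= end - data_start: with 0 <= data_start <= end
 * the subtraction cannot overflow, whereas data_start + size can exceed
 * the 32-bit range for a crafted size and slip past the guard. */
static bool chunk_fits_sketch(int32_t data_start, int32_t end, int64_t size) {
    if (data_start < 0 || data_start > end || size < 0) return false;
    return size <= (int64_t)(end - data_start);
}
```

A chunk that does not fit is then rejected before any bytes are copied.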

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@enghitalo

Contributor Author

/benchmark -f vanilla-epoll

@github-actions

Contributor

👋 /benchmark request received. A collaborator will review and approve the run.

@github-actions

Contributor

Benchmark Results

Framework: vanilla-epoll | Test: all tests

| Test | Conn | RPS | CPU | Mem | Δ RPS | Δ Mem |
|---|---|---|---|---|---|---|
| baseline | 512 | 3,790,014 | 6486.8% | 63MiB | +0.2% | -3.1% |
| baseline | 4096 | 4,182,904 | 6451.8% | 149MiB | +0.2% | +5.7% |
| pipelined | 512 | 39,570,630 | 6710.9% | 63MiB | -1.1% | +1.6% |
| pipelined | 4096 | 39,831,476 | 6728.6% | 152MiB | +0.7% | +0.7% |
| limited-conn | 512 | 1,009,157 | 3164.3% | 58MiB | +0.8% | +1.8% |
| limited-conn | 4096 | 1,017,001 | 2975.9% | 61MiB | ~0% | -35.1% |
| json | 4096 | 2,185,881 | 6417.6% | 183MiB | -0.2% | +25.3% |
| json-comp | 512 | 2,154,743 | 6038.2% | 64MiB | -0.9% | -3.0% |
| json-comp | 4096 | 2,354,617 | 6029.0% | 119MiB | -1.6% | +3.5% |
| json-comp | 16384 | 2,342,381 | 5937.1% | 209MiB | +0.6% | +19.4% |
| upload | 32 | 20,783 | 1131.3% | 121MiB | -0.3% | +0.8% |
| upload | 256 | 24,371 | 3752.4% | 383MiB | ~0% | -2.0% |
| api-4 | 256 | 69,902 | 358.5% | 128MiB | +5.8% | -15.8% |
| api-16 | 1024 | 237,695 | 1323.2% | 267MiB | +1.9% | -30.8% |
| static | 1024 | 1,092,895 | 5565.5% | 258MiB | -1.2% | -0.4% |
| static | 4096 | 1,139,124 | 5643.8% | 424MiB | -0.2% | +1.2% |
| static | 6800 | 1,182,094 | 5776.8% | 680MiB | +0.9% | +25.9% |
| async-db | 1024 | 292,357 | 3080.0% | 707MiB | +12.5% | +24.0% |
| crud | 4096 | 263,090 | 737.4% | 703MiB | +1.6% | -4.1% |
| fortunes | 1024 | 142,091 | 6483.1% | 692MiB | -23.0% | -13.3% |
Full log
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   13.84ms   7.30ms   38.30ms   74.80ms   106.80ms

  4009173 requests in 15.00s, 4005259 responses
  Throughput: 266.97K req/s
  Bandwidth:  80.92MB/s
  Status codes: 2xx=3946353, 3xx=0, 4xx=0, 5xx=58907
  Latency samples: 4005247 / 4005259 responses (100.0%)
  Reconnects: 18331
  Per-template: 90832,109555,138435,169886,200221,228009,250955,262987,270830,273716,277179,278195,275107,276353,274902,272632,118993,65423,81919,89118
  Per-template-ok: 88392,108006,136363,167587,197328,224883,247351,259423,266749,269737,273451,274260,271338,272449,270768,268813,118993,63792,79804,86853

  WARNING: 58906/4005259 responses (1.5%) had unexpected status (expected 2xx)
[info] CPU 737.4% | Mem 703MiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   14.95ms   8.89ms   39.60ms   72.50ms   102.90ms

  3781142 requests in 15.00s, 3780886 responses
  Throughput: 252.01K req/s
  Bandwidth:  77.96MB/s
  Status codes: 2xx=3723875, 3xx=0, 4xx=0, 5xx=57011
  Latency samples: 3780881 / 3780886 responses (100.0%)
  Reconnects: 17147
  Per-template: 90974,108003,134334,163233,190039,213001,232257,244970,248564,248658,251705,253515,252948,254639,254385,251850,134412,78067,84991,90336
  Per-template-ok: 88537,106513,132450,160978,187430,209822,228645,241452,244748,245011,247912,249729,249139,250733,250587,248047,134412,76193,83212,88320

  WARNING: 57011/3780886 responses (1.5%) had unexpected status (expected 2xx)
[info] CPU 702.6% | Mem 1.1GiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   15.21ms   10.80ms   38.50ms   68.20ms   94.40ms

  3724778 requests in 15.00s, 3724778 responses
  Throughput: 248.27K req/s
  Bandwidth:  77.03MB/s
  Status codes: 2xx=3678694, 3xx=0, 4xx=0, 5xx=46084
  Latency samples: 3724776 / 3724778 responses (100.0%)
  Reconnects: 16886
  Per-template: 93681,111592,135084,164360,187610,210079,229957,240915,243400,242183,241567,244116,244765,245477,243610,244924,135880,84657,88386,92533
  Per-template-ok: 91775,110336,133399,162460,185386,207545,227231,237681,240457,239152,238649,241172,241793,242480,240487,241822,135880,83150,86841,90996

  WARNING: 46084/3724778 responses (1.2%) had unexpected status (expected 2xx)
[info] CPU 699.2% | Mem 1.7GiB

=== Best: 263090 req/s (CPU: 737.4%, Mem: 703MiB) ===
[info] input BW: 22.58MB/s (avg template: 90 bytes)
[info] saved results/crud/4096/vanilla-epoll.json
httparena-bench-vanilla-epoll
httparena-bench-vanilla-epoll

==============================================
=== vanilla-epoll / fortunes / 1024c (tool=gcannon) ===
==============================================
[info] resetting postgres for a clean per-profile baseline
[info] starting postgres sidecar
httparena-postgres
[info] postgres ready (seeded)
[info] waiting for server...
[info] server ready

[run 1/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   5.96ms   4.58ms   9.20ms   41.70ms   99.70ms

  698252 requests in 5.00s, 698252 responses
  Throughput: 139.59K req/s
  Bandwidth:  3.23GB/s
  Status codes: 2xx=698252, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 698248 / 698252 responses (100.0%)
[info] CPU 6179.1% | Mem 493MiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   5.82ms   4.48ms   9.35ms   37.10ms   82.30ms

  710457 requests in 5.00s, 710457 responses
  Throughput: 142.04K req/s
  Bandwidth:  3.29GB/s
  Status codes: 2xx=710457, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 710457 / 710457 responses (100.0%)
[info] CPU 6483.1% | Mem 692MiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   5.77ms   4.46ms   9.40ms   36.90ms   84.00ms

  707173 requests in 5.00s, 707173 responses
  Throughput: 141.38K req/s
  Bandwidth:  3.28GB/s
  Status codes: 2xx=707173, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 707168 / 707173 responses (100.0%)
[info] CPU 6406.2% | Mem 993MiB

=== Best: 142091 req/s (CPU: 6483.1%, Mem: 692MiB) ===
[info] saved results/fortunes/1024/vanilla-epoll.json
httparena-bench-vanilla-epoll
httparena-bench-vanilla-epoll
[info] skip: vanilla-epoll does not subscribe to baseline-h2
[info] skip: vanilla-epoll does not subscribe to static-h2
[info] skip: vanilla-epoll does not subscribe to baseline-h2c
[info] skip: vanilla-epoll does not subscribe to json-h2c
[info] skip: vanilla-epoll does not subscribe to baseline-h3
[info] skip: vanilla-epoll does not subscribe to static-h3
[info] skip: vanilla-epoll does not subscribe to gateway-64
[info] skip: vanilla-epoll does not subscribe to gateway-h3
[info] skip: vanilla-epoll does not subscribe to production-stack
[info] skip: vanilla-epoll does not subscribe to unary-grpc
[info] skip: vanilla-epoll does not subscribe to unary-grpc-tls
[info] skip: vanilla-epoll does not subscribe to stream-grpc
[info] skip: vanilla-epoll does not subscribe to stream-grpc-tls
[info] skip: vanilla-epoll does not subscribe to echo-ws
[info] skip: vanilla-epoll does not subscribe to echo-ws-pipeline
[info] rebuilding site/data/*.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/frameworks.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/api-16-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/api-4-256.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/async-db-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/crud-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/fortunes-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-16384.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-6800.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/upload-256.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/upload-32.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/current.json
[info] done
httparena-postgres
httparena-redis
[info] restoring loopback MTU to 65536

…y_fast)

crud create/update parsed the JSON body with json.decode(CrudCreate, body), which
allocates the decoded name/category strings + struct per write — a permanent leak
under -gc none. Add parse_crud_body_fast: a zero-alloc reader that BORROWS
name/category as []u8 slices into the request buffer and parses id/price/quantity
in place; json.decode stays as the fallback for inputs the fast path declines
(escaped strings, missing fields). callgrind: 600 POSTs → zero allocator-primitive
calls, json.decode never entered. Parity-tested vs json.decode on the create/update
fixtures incl. whitespace, negatives, and key-as-substring-of-value; bounds-safe.
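
The fast-path-with-fallback shape described above can be sketched roughly as follows. This is an illustration in Python, not the PR's V code: `parse_fast`, `parse_body`, and the helper names are hypothetical, only `name`, `category`, and an integer `quantity` are handled, and `memoryview` slices stand in for the borrowed `[]u8` slices. It shows the control flow (key scan with a `:`-after-key check, decline on escapes or missing fields, standard decoder as fallback), not the allocation behaviour under `-gc none`.

```python
import json

WS = b" \t\r\n"

def _find_value_start(buf: bytes, key: bytes) -> int:
    """Index of the first value byte after `"key" :`, or -1 if absent."""
    needle = b'"' + key + b'"'
    pos = buf.find(needle)
    while pos != -1:
        i = pos + len(needle)
        while i < len(buf) and buf[i] in WS:
            i += 1
        # A quoted name only counts as a key when ':' follows it;
        # otherwise it was a string value (e.g. "category": "name").
        if i < len(buf) and buf[i] == ord(":"):
            i += 1
            while i < len(buf) and buf[i] in WS:
                i += 1
            return i
        pos = buf.find(needle, pos + 1)
    return -1

def _string_field(buf: bytes, view: memoryview, key: bytes):
    """Borrow the string value as a view into buf, or None to decline."""
    i = _find_value_start(buf, key)
    if i == -1 or i >= len(buf) or buf[i] != ord('"'):
        return None
    start = end = i + 1
    while end < len(buf) and buf[end] != ord('"'):
        if buf[end] == ord("\\"):
            return None  # escaped string: leave it to the full decoder
        end += 1
    if end >= len(buf):
        return None  # unterminated string
    return view[start:end]  # no copy of the underlying bytes

def _int_field(buf: bytes, key: bytes):
    """Parse a (possibly negative) integer in place, or None to decline."""
    i = _find_value_start(buf, key)
    if i == -1:
        return None
    negative = i < len(buf) and buf[i] == ord("-")
    if negative:
        i += 1
    first_digit = i
    value = 0
    while i < len(buf) and ord("0") <= buf[i] <= ord("9"):
        value = value * 10 + (buf[i] - ord("0"))
        i += 1
    if i == first_digit or (i < len(buf) and buf[i] in b".eE"):
        return None  # no digits, or a non-integer number
    return -value if negative else value

def parse_fast(body: bytes):
    """Return (name_view, category_view, quantity), or None to decline."""
    view = memoryview(body)
    name = _string_field(body, view, b"name")
    category = _string_field(body, view, b"category")
    quantity = _int_field(body, b"quantity")
    if name is None or category is None or quantity is None:
        return None
    return name, category, quantity

def parse_body(body: bytes):
    """Try the fast path first; fall back to the standard decoder."""
    fast = parse_fast(body)
    if fast is not None:
        name, category, quantity = fast
        # Copies here are only for easy comparison with the fallback result.
        return bytes(name).decode("utf-8"), bytes(category).decode("utf-8"), quantity
    decoded = json.loads(body)
    return decoded.get("name"), decoded.get("category"), decoded.get("quantity")
```

A parity check in the spirit of the one mentioned above would feed the same bodies through `parse_body` and `json.loads` and compare the fields, including a body where a key name appears as another field's value and one with an escaped quote that must take the fallback.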

(The fortunes sort experiment from the first cut of this PR was reverted: on the
20-row fortune set the v_stable_sort temp is negligible and not the dominant
arena leak — see below — so the standard sort_with_compare is kept for simplicity.
The real crud/fortunes/async-db arena growth is a separate, scale-dependent DB-path
leak under investigation.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@enghitalo enghitalo force-pushed the perf/vanilla-crud-fortunes-render branch from d1c86fd to e980f86 Compare June 20, 2026 04:26