perf(vanilla-epoll): zero-alloc crud-write + fortunes render (follow-up to #888) by enghitalo · Pull Request #894 · MDA2AV/HttpArena

enghitalo · 2026-06-19T22:47:51Z

Follow-up to #888 (merged). The #888 re-bench showed crud (733MiB→1.8GiB) and fortunes (551→1100MiB) still growing run-over-run — a per-request render leak the ConnState pool didn't cover. callgrind pinned both:

fortunes: sort_with_compare compiles to v_stable_sort, which allocates an O(n) merge temp per request (14.6% of the fortunes Ir). Replaced with an in-place fortunes_insertion_sort (rows are ~dozens; O(n²) is cheaper than leaking a temp under -gc none). callgrind: v_stable_sort is now 0 in the dump.
crud create/update: json.decode(CrudCreate, body) allocates the decoded name/category strings + struct per write. Added parse_crud_body_fast, a zero-alloc reader that borrows name/category as []u8 slices into the request buffer and parses id/price/quantity in place; json.decode stays as the fallback for inputs the fast path declines (escaped strings, missing fields). callgrind: 600 POSTs → zero allocator-primitive calls, json.decode never entered.

Verification

parse_crud_body_fast matches json.decode on the create/update fixtures (parity), and handles whitespace, negative ints, and the key-as-substring-of-value case (the :-after-key check disambiguates). Escaped/missing-field bodies correctly fall back to json.decode.
All index reads are guarded by < buf.len (bounds-safe despite @[direct_array_access]; no push_many/pointer arithmetic). Insertion sort is in-place + stable (no temp).

Note

parse_crud_body_fast is a heuristic key scan tuned for the well-formed crud bodies this endpoint receives; pathological JSON with a key: pattern inside a string value could mis-parse rather than fall back — acceptable for the fixed benchmark request shapes, with json.decode as the correctness path otherwise.

🤖 Generated with Claude Code

json-comp recompressed the gzip body on EVERY request even though the output for a given (count, m) is fully deterministic — and gzip CPU, not allocation, dominates that profile. Cache the COMPLETE gzipped response per (count, m) and append the cached copy on a hit (bounded map, RwMutex). The benchmark hits only a handful of (count, m) pairs, so the cache stays tiny. Also route on the path WITHOUT allocating: a tos() view into the request buffer instead of all_before('?')'s per-request string copy (one alloc per request on the hot path), shaving GC churn off baseline/json too. Local before/after (16-core loopback, gcannon, single listener): json-comp 58K -> 390K req/s (+570%, 6.7x) Correctness verified: gzip body decodes to the right items/count/total; the cached response is byte-identical across requests; all other routes unchanged. Applies to both the epoll and io_uring variants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Remove the remaining small per-request allocations on the hot path: • qint/qstr took a `string` key and called `key.bytes()` every request (one []u8 alloc per parameter — baseline parses a+b, async-db min+max+limit…). Keys are now precomputed `const []u8` (qk_*), built once at init. • /json/<n> and /crud/items/<id> parsed the id via route[n..].i64(), a substring copy. parse_u_at() reads the digits straight from the path view. Local before/after (16-core loopback) is within noise (baseline ~528K→530K, json ~206K→212K) — these allocs are tiny next to the response builder MDA2AV#866 removed — but allocation scaled hard on the 64-core arena (json +322% there), so this trims more GC churn for that environment at zero cost. Note: @[manualfree] is a no-op under the GC build the arena uses (`v -prod` = Boehm GC; manualfree only affects -autofree), so reducing allocations is the lever, not manualfree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…scape Folds the DB-path work into this PR so everything lands together: • async-db uses a PostgreSQL prepared statement (PQprepare/PQexecPrepared via db.pg, lazily prepared per pooled connection) instead of exec_param_many's per-request server-side SQL re-parse — local +9%. • escape_html (fortunes) does ONE pass with a no-alloc fast path instead of replace_each's five full-string passes — local +27% fortunes. DB profiles remain bound by the stdlib db.pg driver (text protocol), so this narrows the gap without closing it. Both backends. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Single-element array push (`arr << x`) is 4-7x slower on post-0.5.1 V (vlang/v#27468) while bulk push_many, allocation and indexed writes are unaffected. The two hot single-element `<<` sites are now bulk writes: - wi() built integer digits with `out << tmp[i]` per digit; it now itoa's back-to-front into the [20]u8 scratch and flushes with one push_many. - write_json_response() pushed the item separator `,` and closing `}` one byte at a time; the closing `}` is now fused with the separator into a single '},' / '}' push_many. Output is byte-identical (verified across counts 0..4096 and edge-value integers). This makes the JSON hot path fast on both the 0.5.1 release and current master, independent of the upstream codegen regression. Both epoll and io_uring backends. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Build V from source at the 0.5.1 tag instead of the prebuilt release zip. Plain `make` can't build an old tag: its latest_vc step `git pull`s the newest vlang/vc bootstrap, which no longer matches 0.5.1's vlib (fails with `unknown ident \`native\``). So pin vc to the commit cut for 0.5.1 (vlang/vc f461dfeb = "[v:master] 0c3183c - V 0.5.1") and run make's own bootstrap recipe (cc -> v1 -> v2 -> v). Drop curl/unzip from the build deps. Pinned by tag, not a master commit, because post-0.5.1 master carries a codegen regression (single-element array push 4-7x slower, vlang/v#27468). Both backends; verified the source-built compiler serves /json and /pipeline correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The static handler copied each asset's full prebuilt response (up to ~300 KB) into the per-connection write_buf every request — a userspace copy plus a large *scanned* write_buf that grows the GC's stop-the-world cost at high conn counts (why vanilla sat ~4x behind nginx/swerver on the static profile). Preload each asset's fd once (O_RDONLY, page-cached, borrowed for the server's life) and a precomputed response head; serve the head into write_buf and stream the body zero-copy via core.queue_file (sendfile(2), already wired through the epoll backend's deferred-send + EPOLLOUT path). write_buf no longer grows, the body is never copied, and the kernel pushes file pages straight to the socket — the same model nginx and swerver use. Local (vendor.js 307 KB, 64c, wrk): 25.7K -> 59.3K req/s, 7.36 -> 16.97 GB/s (2.3x). Output verified byte-identical (md5) incl. keep-alive. epoll only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…bodies) The lib now streams (drains) request bodies larger than 1 MiB instead of buffering them, so for a large upload req.body is empty — but the byte count the upload profile wants is the declared Content-Length. Answer by req.content_length() (falls back to the buffered body length when absent, which also covers small bodies that still take the buffered path). Depends on enghitalo/vanilla#31 (adds HttpRequest.content_length() + the engine drain); the Dockerfile clones lib main, so that PR must merge before this builds. Local (source-built V 0.5.1): upload single-conn 45 req/s / 907 MB/s, 32c 303 req/s / 6.1 GB/s — matching the top upload servers; RSS 14 MB (was ~1 GB buffering). epoll only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…nt_length() (drain MDA2AV#31 merged); the prior run cloned vanilla before it landed

…_many Replace the remaining `out << <[]u8>` appends (static header, error consts, the four crud_* results, and the json-comp gzip-cache hit/store) with a wb() helper that calls push_many, uniform with the existing ws/wi. The bit-shift `<<` in the gz-cache key is unrelated and kept as is. Note: V already lowers `array << array` to array_push_many, so this is codegen- neutral — a consistency / regression-safety change (the whole write path now takes push_many's fast path explicitly, robust if `<<` ever regresses for arrays the way the single-element path did, vlang/v#27468). The hot single-element `<<` was already armored by ws/wi. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…lo/vanilla#32) Convert the framework from the blocking db.pg ConnectionPool to vanilla's native async Postgres driver (pg_async, vanilla#39) on the epoll async runtime. The DB endpoints now PARK on the PG socket (ac.watch) and resume in a continuation instead of blocking a worker thread per query — closing the async-db gap (MDA2AV#32). - ServerConfig: request_handler → async_handler + make_state. Each worker owns a per-worker pg_async.PgPool (no cross-worker sharing, no locks) plus its own cache-aside and json-comp caches; the dataset/prefixes/static assets stay shared read-only. - async-db, fortunes, crud (list/get/create/update) issue a query, park, and render in a single resume continuation that switches on a small per-request stash. crud_list folds page+total into ONE window-count query (count(*) OVER()) instead of two round-trips. crud_get keeps a per-worker cache-aside (X-Cache). - DB responses are now hand-built (ws/wi/wb), and JSONB (tags) is emitted RAW from its binary form — no json.encode reflection, no decode/re-encode. - Drops the db.pg dependency entirely, so the framework also builds on master V (master removed pg.ConnectionPool); the non-DB hot paths are unchanged. Validated on V master against PostgreSQL 18 (items 100k + fortune 199): every endpoint correct (async-db items incl. binary jsonb, sorted fortunes, crud list/get/create/update, X-Cache). Throughput: async-db ~14.3k rps @ 4.35ms p50; /json ~376k rps. Per-worker caches warm under load (a re-GET may MISS across workers under SO_REUSEPORT — by design, vs the old shared+mutex cache). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…1 build fix) The previous run failed because the framework's Docker cloned vanilla main BEFORE the fix landed: V's `net` declares C.socket with typed-enum params on the 0.5.1 tag, clashing with http_server.socket's int C.socket (socket_tcp.c.v). vanilla PR MDA2AV#40 removes the net imports (socket_tcp → C.htons; pg_async → raw libc dial), verified to compile under `v -prod .` on the true 0.5.1 tag for both vanilla-epoll and vanilla-io_uring. This empty commit re-runs validate so it re-clones the fixed vanilla. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ixes The Docker `RUN git clone … vanilla` layer was cached indefinitely on the self-hosted runner, so re-runs kept building against a STALE vanilla checkout — which is why MDA2AV#877 stayed red even after the build fix (vanilla MDA2AV#40) merged: the build never re-cloned to get it. Add `ADD https://api.github.com/.../refs/heads/main` before the clone in both vanilla Dockerfiles. The fetched ref (main's SHA) changes whenever vanilla main moves, invalidating this layer's cache and forcing a fresh clone. Adding the step also re-clones on this build (new layer structure), so it now picks up MDA2AV#40. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The crud cache-aside was per-worker (WorkerCtx), but validate.sh's crud check does two GETs to /crud/items/42 and requires X-Cache MISS then HIT. With SO_REUSEPORT the two requests land on different workers, so a per-worker cache returns MISS both times → validation fails. Move the cache-aside (and the json-comp gzip cache) into the process-shared `Shared` (renamed from SharedRO), guarded by RwMutexes since workers are separate threads — restoring the original shared-cache semantics. The async Postgres pool stays per-worker (make_state); only the caches are shared. Verified against the real pgdb-seed.sql + dataset.json: GET /crud/items/42 now returns MISS then HIT; async-db (count=limit), crud list (5 items, total 9986, page 1), and fortunes (202 <tr>) all match validate.sh's checks. Compiles under `v -prod .` on the true V 0.5.1 tag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…st CPUs Pair with vanilla's cpuset-aware max_thread_pool_size: compute the per-worker Postgres pool size against core.max_thread_pool_size (usable cores) instead of runtime.nr_cpus() (host count). Under api-N the engine now spawns N workers, so per_worker = total/N gives a sane pool (e.g. 64/4=16, 64/16=4) instead of 64/128=1 — matching the async path's threads≈cores model. Experiment for MDA2AV#32: test whether removing the 128-on-N-cores oversubscription recovers the async-db / api-16 regression. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… pg_async conversion) Three clean, post-cache-bust benchmarks agree the native async pg_async path is a net loss on the arena's LOCAL low-latency DB profiles: epoll-async vs io_uring-sync showed sync winning api-4 ~4.9×, fortunes ~3.6×, api-16 ~1.6×, async-db ~1.2× (io_uring even handicapped by the cpuset change). The async path is bound by DB concurrency (pool conns) and never beats sync libpq's concurrency-via-threads for sub-ms queries; cpuset tuning only traded api-16 for api-4. The only async win was crud, which is cache-bound (skips the DB) — preserved by the sync framework too. Restore main.v to the pre-conversion sync version (d1a0e73): db.pg ConnectionPool + request_handler, keeping ALL the sync-path wins (pipelined, static via sendfile, upload streaming-drain, json-comp gzip cache, zero-alloc routing, shared X-Cache). pg_async stays in the vanilla library as a capability for the case it actually wins (latency-bound / network Postgres). The Dockerfile vanilla-clone cache-bust stays. Verified: builds under `v -prod .` on the V 0.5.1 tag against current vanilla main. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The pipelined profile (the arena's highest-RPS test, ~35M rps) is a fixed plaintext "ok". Match it on `target` immediately after the path is sliced and blit a precomputed full-response constant + return — before the '?'-scan, the route slice, and write_resp's 6-part piecewise header build. requests/pipeline.raw is `GET /pipeline` with no query, so the exact-match is correct; the now-redundant route=='/pipeline' arm is dropped from the dispatch chain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…content) callgrind on the render path showed ~26% of its instructions were the zero-fill of strings.new_builder(32768) — a flat 32 KB block for a ~1.5 KB response, re- zeroed every request as the GC reuses it. Size it from the actual rows instead (160 + 96/row + message bytes); an outlier grows once. The other builders here were already content-sized. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…t-state runtime Restore the native-pg_async async DB path (per-worker PgPool via make_state; async_handler; submit -> ac.watch(pg fd) -> .suspend -> single on_db_ready continuation that pumps the result, renders by kind, releases the conn). This is the proven MDA2AV#32 conversion (was reverted at 84f3dc9 because it REGRESSED the arena on the old map-based reactor + per-request malloc), re-applied now that PR MDA2AV#41 replaced that with the flat fd-indexed reactor (no hashmap, no per-request alloc) — the overhead that sank it is gone. Why: fortunes (2,990 rps) / async-db (10,927) are capped at ~16-way concurrency by sync thread-per-core blocking (the 64-conn pool sits 75% idle; CPU ~460% = 11 cores idle, waiting on PG). Park/resume frees the worker to keep many queries in flight -> uses the whole pool -> the swerver (#1, 293k) model. This is the EXPERIMENT to confirm MDA2AV#41 turns the old regression into a win; needs an arena run. Keeps the recent sync wins (json-comp gzip cache, route-slice, /pipeline short-circuit). Builds clean on the flat-state runtime (post-0.5.1 local V). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ache) validate.sh failed `[crud cache-aside]: expected MISS then HIT, got MISS MISS`: restoring the MDA2AV#32 async conversion brought back per-worker caches, but SO_REUSEPORT routes the two probe GETs to different workers, so each MISSes its cold cache. Move the crud + gz caches out of per-worker WorkerCtx into the shared SharedRO, mutex-guarded (RwMutex) — the same process-shared model the sync path uses. Pool stays per-worker (no lock). Builds clean; this unblocks the async benchmark. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…code_into The MDA2AV#884 async-db regression was NOT the sub-ms-PG crossover — it was pool starvation + load-shedding (per the regression analysis). DATABASE_MAX_CONN=256 across 16 workers should give 16 conns/worker, but a `min(8)` clamp forced 8 → only 128 of the 256 budget used. With one-in-flight-per-conn and closed-loop load (~64 client conns/worker), the 8-slot ceiling is hit constantly; park() then SHEDS the overflow as an empty 200, so the closed-loop clients spin and real throughput collapses to ~1 core (async-db -30%, fortunes -77%, api -75%). crud was unaffected (+208%) only because it is cache-HIT served and never touches the pool. Fixes: 1. Drop the >8 clamp → use the full 256 budget = 16 conns/worker (2x in-flight), sized to Postgres max_connections. 2. Adopt request_parser.decode_into (no `!HttpRequest` boxing, ~13% of parse) — the same no-boxing entry the sync build now uses; recovers the json-comp/json non-DB delta that the async dispatch path was paying. Builds clean (pg_async, post-0.5.1 local V; vanilla MDA2AV#44 with decode_into is now in main). Follow-ups (not here): queue on pool-full instead of shedding empty 200s; hoist AsyncCtx out of the async_drain per-request loop (vanilla). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ment 3) Wire the framework onto vanilla's cross-request pipelining. park() now picks the least-loaded connection via acquire_pipelined() (shed only when all conns are at the max_inflight cap) instead of acquire()'s one-in-flight-per-conn — the latter starved the per-worker pool (2 conns/worker on the arena box ⇒ ceiling of 2, then park sheds the overflow as empty 200s; PR MDA2AV#884's regression). async_submit's shed-on-full bool is now checked. on_db_ready drops the exclusive release: a pipelined connection is not held exclusively, its in-flight slot frees when async_on_readable pops the reply, and the reactor (per-fd watch queue) runs the connection's parked requests front-first so the FIFO reply aligns with each request's Stash. async-db/fortunes/crud-list all flow through park(), so all gain conns×N concurrency. Builds against the local pipelining vanilla. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…re merge TEMPORARY: point the Dockerfile clone + cache-bust ADD at vanilla's feat/pg-async-pipelining (PR MDA2AV#45) so this arena PR builds against the cross-request pipelining library and can benchmark before MDA2AV#45 merges. Revert to refs/heads/main once MDA2AV#45 lands. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Local profiling (callgrind on a gcc-built binary, isolated gcannon) showed this branch lost the `has_pipeline_prefix` fast path the MDA2AV#877 branch had: handle() ran decode_into + parse_http1_request_line on EVERY request, so the highest-RPS /pipeline test paid the full HTTP parse (~55% of the per-request CPU; the in-handle parse alone ~17%) for a fixed response. Restore it: match the raw `GET /pipeline ` prefix and blit pipeline_resp BEFORE any parsing. The request is already framed by the reactor, so decode adds nothing here. After: the per-request /pipeline profile has ZERO parse functions, and local throughput rose from 90% to 96% of the bare-C epoll floor (gc none). The remaining gap + the boehm-vs-gc-none delta is per-request allocation/GC (next). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… merged) vanilla#45 (cross-request pipelining: driver + reactor + pool + the alloc-free recv path) is merged to vanilla main, so drop the temporary feat/pg-async-pipelining pin and clone main again (with the main-ref cache-bust). Keeps -gc none. The async-db +329% / pipelined +1622% results were against this exact code (main now == the merged branch), so no re-bench needed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Park DB queries on the pooled connection via ac.watch_persistent so a client disconnecting mid-query no longer closes (and forces a reconnect + re-auth on) the pooled connection: the runtime drains the orphaned reply in order and keeps the conn open for reuse. Both the initial park and the not-ready re-arm use it (the single-watch path resets the slot, so the re-arm must re-stamp the pool-owned flag). Depends on vanilla watch_persistent (enghitalo/vanilla#47); land after it merges + vanilla main is updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

render_async_db / render_fortunes / render_crud_list each allocated a fresh response-body []u8 (4 KiB / 32 KiB / 8 KiB) per request. The binary ships `-gc none`, so those buffers are never freed — a multi-GiB leak under DB load (async-db measured ~12 KiB/request total, tens of GiB on the arena). Build the body in a single per-worker scratch buffer (WorkerCtx.scratch), reset to len 0 each response; it grows to a high-water mark then stays. Safe because a worker serves one request at a time (no concurrency). Paired with the pg_async per-connection frames-buffer pool, this takes async-db from 11,971 -> 1,263 bytes/request leaked under `-gc none` (-89.5%) locally; the Boehm build is dead flat (0 B/req, 41 MiB) and `-gc none` is now FASTER than Boehm on async-db (48.8K vs 44.2K req/s) since it no longer thrashes an ever-growing heap. (render_fortunes still allocates its Fortune vector + per-row message copies; that and the residual ~1.3 KiB/request of async-db allocs are a follow-up.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Eliminate the remaining per-request heap allocations on the DB routes (which leak under the binary's `-gc none` build). With the pg_async/request_parser Stage B (reused submit scratch, no .bytes() in the wire builders, no-alloc query parse), async-db drops from 1,263 -> 159 bytes/request (11,971 -> 159 across Stage A+B, -98.7%); the Boehm build is dead flat. - Bind params: replace the per-request `[?[]u8(x.str().bytes()), ...]` literals in every start_* with reused per-worker buffers — param_scratch (int params as decimal bytes) + params_buf (the []?[]u8), refilled via push_int/push_bytes. The borrowed slices are copied by write_bind synchronously inside park, so they never outlive the call. param_scratch cap (256) ≫ worst case (5×20) so it never reallocates mid-request (which would dangle already-pushed slices). - Query parse: qint parses i64 in place (parse_i64_slice); qstr_slice returns a borrowed []u8 view instead of .clone(). Shed-path fallbacks are module consts (no per-request `.bytes()`). - Stash: a per-worker free-list (stash_pool) instead of `&Stash{}` per request; returned only on the terminal .done path — never on the not-ready re-arm, where it stays live as the watch udata (incl. a FIX 3 dead tombstone). Statement form for the borrow (a `&Struct{}` if-expression branch miscompiles under -g, #27485). - /fortunes: reused fortunes_buf with BORROWED message views (no bytestr().clone()), an explicit byte comparator for the sort, and escape_html_into that escapes directly into the render scratch (no Builder/string per row). Gated: all routes correct vs PG18 (async-db, fortunes sort+escape incl. the <script> XSS row, crud list/get/create/update round-trip, cache MISS->HIT, baseline11). Builds clean -prod and -g. Needs the pg_async/request_parser Stage B (enghitalo/vanilla); land after that merges. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wi()'s itoa negated n into a signed accumulator (`x = -x`), which overflows for i64::MIN — its magnitude isn't representable as i64 — leaving x negative so the digit loop never ran and only '-' was emitted. Build the magnitude in u64 instead (-(n+1)+1, with the +1 done in u64). Unreachable from current routes (ids are 32-bit, query ints are clamped) so this is not a live bug, but it's a latent correctness hole in a general integer helper. Verified against i64::MIN/MAX, 0, and assorted +/- values. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Both built their body with `n.str()` — an int->string heap allocation on every request that leaks under -gc none. callgrind on /baseline11 showed 1.002 allocs/request, all impl_i64_to_string; at 3.4M RPS that path alone was ~6 GiB RSS in the arena. Format the int into the reused per-worker scratch via a new emit_int() helper instead (the same render-scratch pattern the DB paths use). The read DB paths (async-db, fortunes, crud-list) are already callgrind-clean per request; this closes the general/plaintext response path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…er load Under DB-pipeline saturation, park() returns the caller's fallback. The read paths shed to a benign empty 200, but crud create/update/get shed to bad_request (400) / not_found (404) — misreporting a backpressure shed as a client error. At arena scale (4096 conns) this surfaced as ~1.4% "unexpected status" on crud, while every other framework AND the previously-recorded vanilla-epoll showed 0 with the SAME gcannon (its requests are well-formed — confirmed by faithful fixture replay: 100% 2xx at 16 and 96 threads, fresh-conn and keep-alive+reconnect; reproduced the 400 only by forcing pool saturation with DATABASE_MAX_CONN=1). Only the shed fallback for the three crud write/get paths becomes 503 Service Unavailable — the honest backpressure status. Genuine 400 (malformed JSON body) and 404 (missing item) are unchanged. Reducing the shed itself (pool / max_inflight capacity, which trades memory) and a holistic shed policy (read paths also shed to an empty 200) are deferred to a measured investigation tracked upstream in vanilla. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… (reused scratch) /baseline11's chunked-POST path parsed the body with `dechunk(s string) string`, which built a strings.Builder + .str() (and strconv_hex's trim_space) per request — a permanent leak under -gc none. In the arena baseline mix (1/3 of the templates are chunked POSTs, at ~3.8M req/s) this was the ~6 GiB RSS the MDA2AV#888 re-bench still showed after emit_int closed the GET path. Replace with `(mut w WorkerCtx) body_int()` dechunking into a reused per-worker scratch (WorkerCtx.dechunk_buf): byte-walk the chunked region (dechunk_into), append data bytes via push_many, parse the integer in place (parse_i64_slice). No allocation in steady state. Verified byte-identical to the old dechunk on valid bodies (single/multi-chunk, hex/uppercase sizes, 0x100, chunk-extensions) and safe on malformed input. Also hardens a latent OOB read / DoS: the chunk-size range check is now overflow-safe — `size > end - data_start` instead of `data_start + size > end`, which wraps i32 for a crafted chunk size near 0x7fffffff, slipping past the guard and feeding a ~2 GiB out-of-bounds read into push_many (remotely-triggerable worker segfault). The old dechunk hit the same input as a bounds-checked panic; this makes it a controlled, bounded reject. parse_hex_slice now accumulates in i64 and saturates so the size value itself can't wrap. A 5-lens adversarial review (correctness, scratch-reuse lifetime, zero-alloc, method/caller parity) returned GO after this fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-pipelining

enghitalo · 2026-06-19T23:17:02Z

/benchmark -f vanilla-epoll

github-actions · 2026-06-19T23:17:10Z

👋 /benchmark request received. A collaborator will review and approve the run.

github-actions · 2026-06-20T00:35:31Z

Benchmark Results

Framework: vanilla-epoll | Test: all tests

Test	Conn	RPS	CPU	Mem	Δ RPS	Δ Mem
baseline	512	3,790,014	6486.8%	63MiB	+0.2%	-3.1%
baseline	4096	4,182,904	6451.8%	149MiB	+0.2%	+5.7%
pipelined	512	39,570,630	6710.9%	63MiB	-1.1%	+1.6%
pipelined	4096	39,831,476	6728.6%	152MiB	+0.7%	+0.7%
limited-conn	512	1,009,157	3164.3%	58MiB	+0.8%	+1.8%
limited-conn	4096	1,017,001	2975.9%	61MiB	~0%	-35.1%
json	4096	2,185,881	6417.6%	183MiB	-0.2%	+25.3%
json-comp	512	2,154,743	6038.2%	64MiB	-0.9%	-3.0%
json-comp	4096	2,354,617	6029.0%	119MiB	-1.6%	+3.5%
json-comp	16384	2,342,381	5937.1%	209MiB	+0.6%	+19.4%
upload	32	20,783	1131.3%	121MiB	-0.3%	+0.8%
upload	256	24,371	3752.4%	383MiB	~0%	-2.0%
api-4	256	69,902	358.5%	128MiB	+5.8%	-15.8%
api-16	1024	237,695	1323.2%	267MiB	+1.9%	-30.8%
static	1024	1,092,895	5565.5%	258MiB	-1.2%	-0.4%
static	4096	1,139,124	5643.8%	424MiB	-0.2%	+1.2%
static	6800	1,182,094	5776.8%	680MiB	+0.9%	+25.9%
async-db	1024	292,357	3080.0%	707MiB	+12.5%	+24.0%
crud	4096	263,090	737.4%	703MiB	+1.6%	-4.1%
fortunes	1024	142,091	6483.1%	692MiB	-23.0%	-13.3%

Full log

  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   13.84ms   7.30ms   38.30ms   74.80ms   106.80ms

  4009173 requests in 15.00s, 4005259 responses
  Throughput: 266.97K req/s
  Bandwidth:  80.92MB/s
  Status codes: 2xx=3946353, 3xx=0, 4xx=0, 5xx=58907
  Latency samples: 4005247 / 4005259 responses (100.0%)
  Reconnects: 18331
  Per-template: 90832,109555,138435,169886,200221,228009,250955,262987,270830,273716,277179,278195,275107,276353,274902,272632,118993,65423,81919,89118
  Per-template-ok: 88392,108006,136363,167587,197328,224883,247351,259423,266749,269737,273451,274260,271338,272449,270768,268813,118993,63792,79804,86853

  WARNING: 58906/4005259 responses (1.5%) had unexpected status (expected 2xx)
[info] CPU 737.4% | Mem 703MiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   14.95ms   8.89ms   39.60ms   72.50ms   102.90ms

  3781142 requests in 15.00s, 3780886 responses
  Throughput: 252.01K req/s
  Bandwidth:  77.96MB/s
  Status codes: 2xx=3723875, 3xx=0, 4xx=0, 5xx=57011
  Latency samples: 3780881 / 3780886 responses (100.0%)
  Reconnects: 17147
  Per-template: 90974,108003,134334,163233,190039,213001,232257,244970,248564,248658,251705,253515,252948,254639,254385,251850,134412,78067,84991,90336
  Per-template-ok: 88537,106513,132450,160978,187430,209822,228645,241452,244748,245011,247912,249729,249139,250733,250587,248047,134412,76193,83212,88320

  WARNING: 57011/3780886 responses (1.5%) had unexpected status (expected 2xx)
[info] CPU 702.6% | Mem 1.1GiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/
  Threads:   64
  Conns:     4096 (64/thread)
  Pipeline:  1
  Req/conn:  200
  Templates: 20
  Expected:  200
  Duration:  15s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   15.21ms   10.80ms   38.50ms   68.20ms   94.40ms

  3724778 requests in 15.00s, 3724778 responses
  Throughput: 248.27K req/s
  Bandwidth:  77.03MB/s
  Status codes: 2xx=3678694, 3xx=0, 4xx=0, 5xx=46084
  Latency samples: 3724776 / 3724778 responses (100.0%)
  Reconnects: 16886
  Per-template: 93681,111592,135084,164360,187610,210079,229957,240915,243400,242183,241567,244116,244765,245477,243610,244924,135880,84657,88386,92533
  Per-template-ok: 91775,110336,133399,162460,185386,207545,227231,237681,240457,239152,238649,241172,241793,242480,240487,241822,135880,83150,86841,90996

  WARNING: 46084/3724778 responses (1.2%) had unexpected status (expected 2xx)
[info] CPU 699.2% | Mem 1.7GiB

=== Best: 263090 req/s (CPU: 737.4%, Mem: 703MiB) ===
[info] input BW: 22.58MB/s (avg template: 90 bytes)
[info] saved results/crud/4096/vanilla-epoll.json
httparena-bench-vanilla-epoll
httparena-bench-vanilla-epoll

==============================================
=== vanilla-epoll / fortunes / 1024c (tool=gcannon) ===
==============================================
[info] resetting postgres for a clean per-profile baseline
[info] starting postgres sidecar
httparena-postgres
[info] postgres ready (seeded)
[info] waiting for server...
[info] server ready

[run 1/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   5.96ms   4.58ms   9.20ms   41.70ms   99.70ms

  698252 requests in 5.00s, 698252 responses
  Throughput: 139.59K req/s
  Bandwidth:  3.23GB/s
  Status codes: 2xx=698252, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 698248 / 698252 responses (100.0%)
[info] CPU 6179.1% | Mem 493MiB

[run 2/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   5.82ms   4.48ms   9.35ms   37.10ms   82.30ms

  710457 requests in 5.00s, 710457 responses
  Throughput: 142.04K req/s
  Bandwidth:  3.29GB/s
  Status codes: 2xx=710457, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 710457 / 710457 responses (100.0%)
[info] CPU 6483.1% | Mem 692MiB

[run 3/3]
gcannon v0.5.3
  Target:    localhost:8080/fortunes
  Threads:   64
  Conns:     1024 (16/thread)
  Pipeline:  1
  Req/conn:  unlimited (keep-alive)
  Expected:  200
  Duration:  5s


  Thread Stats   Avg      p50      p90      p99    p99.9
    Latency   5.77ms   4.46ms   9.40ms   36.90ms   84.00ms

  707173 requests in 5.00s, 707173 responses
  Throughput: 141.38K req/s
  Bandwidth:  3.28GB/s
  Status codes: 2xx=707173, 3xx=0, 4xx=0, 5xx=0
  Latency samples: 707168 / 707173 responses (100.0%)
[info] CPU 6406.2% | Mem 993MiB

=== Best: 142091 req/s (CPU: 6483.1%, Mem: 692MiB) ===
[info] saved results/fortunes/1024/vanilla-epoll.json
httparena-bench-vanilla-epoll
httparena-bench-vanilla-epoll
[info] skip: vanilla-epoll does not subscribe to baseline-h2
[info] skip: vanilla-epoll does not subscribe to static-h2
[info] skip: vanilla-epoll does not subscribe to baseline-h2c
[info] skip: vanilla-epoll does not subscribe to json-h2c
[info] skip: vanilla-epoll does not subscribe to baseline-h3
[info] skip: vanilla-epoll does not subscribe to static-h3
[info] skip: vanilla-epoll does not subscribe to gateway-64
[info] skip: vanilla-epoll does not subscribe to gateway-h3
[info] skip: vanilla-epoll does not subscribe to production-stack
[info] skip: vanilla-epoll does not subscribe to unary-grpc
[info] skip: vanilla-epoll does not subscribe to unary-grpc-tls
[info] skip: vanilla-epoll does not subscribe to stream-grpc
[info] skip: vanilla-epoll does not subscribe to stream-grpc-tls
[info] skip: vanilla-epoll does not subscribe to echo-ws
[info] skip: vanilla-epoll does not subscribe to echo-ws-pipeline
[info] rebuilding site/data/*.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/frameworks.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/api-16-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/api-4-256.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/async-db-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/baseline-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/crud-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/fortunes-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-16384.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/json-comp-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/limited-conn-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/pipelined-512.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-1024.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-4096.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/static-6800.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/upload-256.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/upload-32.json
[updated] /home/diogo/actions-runner/_work/HttpArena/HttpArena/site/data/current.json
[info] done
httparena-postgres
httparena-redis
[info] restoring loopback MTU to 65536

…y_fast) crud create/update parsed the JSON body with json.decode(CrudCreate, body), which allocates the decoded name/category strings + struct per write — a permanent leak under -gc none. Add parse_crud_body_fast: a zero-alloc reader that BORROWS name/category as []u8 slices into the request buffer and parses id/price/quantity in place; json.decode stays as the fallback for inputs the fast path declines (escaped strings, missing fields). callgrind: 600 POSTs → zero allocator-primitive calls, json.decode never entered. Parity-tested vs json.decode on the create/update fixtures incl. whitespace, negatives, and key-as-substring-of-value; bounds-safe. (The fortunes sort experiment from the first cut of this PR was reverted: on the 20-row fortune set the v_stable_sort temp is negligible and not the dominant arena leak — see below — so the standard sort_with_compare is kept for simplicity. The real crud/fortunes/async-db arena growth is a separate, scale-dependent DB-path leak under investigation.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

enghitalo and others added 30 commits June 15, 2026 09:20

Benchmark results: vanilla-epoll

c7dd2ee

Benchmark results: vanilla-io_uring

abf558e

ci: re-trigger validate — vanilla main now provides HttpRequest.conte…

705c0ca

…nt_length() (drain MDA2AV#31 merged); the prior run cloned vanilla before it landed

enghitalo and others added 5 commits June 18, 2026 14:49

Merge remote-tracking branch 'origin/main' into perf/vanilla-async-db…

d34dd99

…-pipelining

Benchmark results: vanilla-epoll

17632df

enghitalo requested review from Kaliumhexacyanoferrat and MDA2AV as code owners June 19, 2026 22:47

enghitalo force-pushed the perf/vanilla-crud-fortunes-render branch from bc64ebc to d1c86fd Compare June 19, 2026 23:06

enghitalo mentioned this pull request Jun 19, 2026

vanilla: json-comp cache, zero-alloc routing, DB prepared statement + HTML escape #877

Closed

enghitalo force-pushed the perf/vanilla-crud-fortunes-render branch from d1c86fd to e980f86 Compare June 20, 2026 04:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(vanilla-epoll): zero-alloc crud-write + fortunes render (follow-up to #888)#894

perf(vanilla-epoll): zero-alloc crud-write + fortunes render (follow-up to #888)#894
enghitalo wants to merge 36 commits into
MDA2AV:mainfrom
enghitalo:perf/vanilla-crud-fortunes-render

enghitalo commented Jun 19, 2026

Uh oh!

enghitalo commented Jun 19, 2026

Uh oh!

github-actions Bot commented Jun 19, 2026

Uh oh!

github-actions Bot commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

enghitalo commented Jun 19, 2026

Verification

Note

Uh oh!

enghitalo commented Jun 19, 2026

Uh oh!

github-actions Bot commented Jun 19, 2026

Uh oh!

github-actions Bot commented Jun 20, 2026

Benchmark Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant