perf(vanilla-epoll): cross-request pipelining for async-db by enghitalo · Pull Request #1 · enghitalo/HttpArena

enghitalo · 2026-06-17T20:15:28Z

Wires the framework onto vanilla's cross-request pipelining (vanilla#45).

Why

On the 128-core box, per-worker pools are tiny: DATABASE_MAX_CONN=256 / 128 = 2 conns/worker. The old one-in-flight-per-conn model capped per-worker DB concurrency at 2, and under closed-loop load park() shed the overflow as empty 200s. That was the async-db/fortunes regression in MDA2AV#884 (pool starvation, not a crossover).
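The arithmetic behind that ceiling can be checked with a small sketch. The function name `per_worker_ceiling` is hypothetical, and the worker count of 128 assumes one worker per core, which the description implies but does not state outright. The in-flight cap of 8 is the N quoted in the Change section.

```python
def per_worker_ceiling(max_conn: int, workers: int, max_inflight: int) -> int:
    """Upper bound on concurrent DB queries per worker.

    Hypothetical helper: each worker gets max_conn // workers connections,
    and each connection may carry up to max_inflight queries.
    """
    conns_per_worker = max_conn // workers
    return conns_per_worker * max_inflight


# Values quoted in the PR description (workers=128 is an assumption).
old = per_worker_ceiling(max_conn=256, workers=128, max_inflight=1)
new = per_worker_ceiling(max_conn=256, workers=128, max_inflight=8)
print(old, new)  # prints "2 16"
```

With one query in flight per connection the worker stalls at 2 concurrent queries. Raising the per-connection cap lifts the bound without opening more connections.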

Change

park() picks the least-loaded connection (acquire_pipelined) and sheds only when every conn is at the max_inflight cap, so a connection now carries up to N=8 in-flight queries (per-worker ceiling: conns × N).
async_submit's shed-on-full bool is now checked. on_db_ready drops the exclusive release: a pipelined conn isn't held exclusively, and its slot frees when the reply is popped. The reactor's per-fd watch queue fans each reply out to its request in submission order.
async-db / fortunes / crud-list all flow through park(), so all three gain the concurrency.
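The selection and shed rule described above can be sketched as follows. This is an illustration in Python, not the V code from the PR: the names `acquire_pipelined`, `park`, and `max_inflight` come from the description, while `Conn`, `PipelinedPool`, `on_reply`, and the boolean return of `park` are assumptions made for the example (the real park() returns the caller's fallback response when it sheds).

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class Conn:
    """Hypothetical pooled connection: tracks request ids in submission order."""
    fd: int
    inflight: Deque[int] = field(default_factory=deque)


class PipelinedPool:
    """Hypothetical per-worker pool with a per-connection in-flight cap."""

    def __init__(self, n_conns: int, max_inflight: int = 8) -> None:
        self.conns: List[Conn] = [Conn(fd=i) for i in range(n_conns)]
        self.max_inflight = max_inflight

    def acquire_pipelined(self) -> Optional[Conn]:
        # Pick the connection with the fewest queries in flight.
        conn = min(self.conns, key=lambda c: len(c.inflight))
        # If even the least-loaded one is at the cap, every conn is full.
        if len(conn.inflight) >= self.max_inflight:
            return None
        return conn

    def park(self, request_id: int) -> bool:
        """Queue a request on a connection; False means the request was shed."""
        conn = self.acquire_pipelined()
        if conn is None:
            return False
        conn.inflight.append(request_id)
        return True

    def on_reply(self, conn: Conn) -> int:
        # Popping the reply frees the slot; no exclusive release is needed.
        return conn.inflight.popleft()


pool = PipelinedPool(n_conns=2, max_inflight=8)
accepted = sum(pool.park(i) for i in range(20))
print(accepted)  # prints "16": 2 conns x 8 in flight, the rest are shed
```

Because `min` scans every connection, the shed check only has to look at the winner: if the least-loaded connection is full, all of them are.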

CI note

The Dockerfile is temporarily pinned to vanilla feat/pg-async-pipelining (vanilla#45) so this can be benchmarked before vanilla#45 merges. Revert to refs/heads/main once it lands.

Validated

Driver tested against real PG18 (FIFO order + mid-pipeline error isolation); reactor unit-tested; framework compiles against the pipelining lib. The async-db benchmark here is the end-to-end gate: it is what drives the concurrent parks and, through them, the reactor's multi-client drain under load.
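The two driver properties named here, FIFO order and mid-pipeline error isolation, can be illustrated with a toy model. This is a sketch under assumptions, not the driver's code: `drain`, the `("ok", ...)` / `("error", ...)` reply tuples, and the request ids are all hypothetical, and it assumes the connection yields exactly one reply per submitted query, in submission order.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Reply = Tuple[str, str]  # hypothetical shape: ("ok", payload) or ("error", message)


def drain(watch_queue: Deque[str], replies: List[Reply]) -> Dict[str, Reply]:
    """Hand each reply to the request at the front of the per-connection queue."""
    delivered: Dict[str, Reply] = {}
    for reply in replies:
        if not watch_queue:
            break  # nothing parked; a real driver would treat this as a protocol error
        request_id = watch_queue.popleft()
        delivered[request_id] = reply  # an error reply reaches only this request
    return delivered


parked = deque(["req-a", "req-b", "req-c"])
out = drain(parked, [("ok", "rows-a"), ("error", "bad query"), ("ok", "rows-c")])
print(out["req-b"][0], out["req-c"][0])  # prints "error ok"
```

If fewer replies than parked requests have arrived, the remaining requests simply stay in the queue until the next readable event.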

🤖 Generated with Claude Code

…ment 3) Wire the framework onto vanilla's cross-request pipelining. park() now picks the least-loaded connection via acquire_pipelined() (shed only when all conns are at the max_inflight cap) instead of acquire()'s one-in-flight-per-conn — the latter starved the per-worker pool (2 conns/worker on the arena box ⇒ ceiling of 2, then park sheds the overflow as empty 200s; PR MDA2AV#884's regression). async_submit's shed-on-full bool is now checked. on_db_ready drops the exclusive release: a pipelined connection is not held exclusively, its in-flight slot frees when async_on_readable pops the reply, and the reactor (per-fd watch queue) runs the connection's parked requests front-first so the FIFO reply aligns with each request's Stash. async-db/fortunes/crud-list all flow through park(), so all gain conns×N concurrency. Builds against the local pipelining vanilla. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…re merge TEMPORARY: point the Dockerfile clone + cache-bust ADD at vanilla's feat/pg-async-pipelining (PR MDA2AV#45) so this arena PR builds against the cross-request pipelining library and can benchmark before MDA2AV#45 merges. Revert to refs/heads/main once MDA2AV#45 lands. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

enghitalo · 2026-06-17T20:15:45Z

/benchmark -f vanilla-epoll -t async-db

github-actions · 2026-06-17T20:15:54Z

👋 /benchmark request received. A collaborator will review and approve the run.

enghitalo · 2026-06-17T21:14:13Z

Superseded by MDA2AV#888 — benchmarking on upstream (fork CI looped).

* zix 0.4.x-rc1
* zix drop WebSocket (split to zix-ws instead)
* zix 0.4.x-rc1 x86_64 musl alpine
* zix head comment info
* zix: move to 0.4.x-rc2
* trigger action
* clearing space
* attempt to resolve with retry
* ci: retrigger #1
* re-strategize using two source and retry
* Attempt rc2 test 2
* make retry 6
* url wrap arround double quote
* using git clone over https
* finalizing 0.4.x-rc2
* accident junk
* preparing 0.4.x
* bump: zix 0.4.x
* updating meta
* correcting/seperation concern
* switching dispatch model
* ci: retrigger number 1
* ci: retrigger number 1
* ci: retrigger number 1
* ci: retrigger number 1
* ci: retrigger number 2
* Benchmark results: zix

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

…AV#884) (MDA2AV#888) * vanilla: cache json-comp gzip responses + zero-alloc routing json-comp recompressed the gzip body on EVERY request even though the output for a given (count, m) is fully deterministic — and gzip CPU, not allocation, dominates that profile. Cache the COMPLETE gzipped response per (count, m) and append the cached copy on a hit (bounded map, RwMutex). The benchmark hits only a handful of (count, m) pairs, so the cache stays tiny. Also route on the path WITHOUT allocating: a tos() view into the request buffer instead of all_before('?')'s per-request string copy (one alloc per request on the hot path), shaving GC churn off baseline/json too. Local before/after (16-core loopback, gcannon, single listener): json-comp 58K -> 390K req/s (+570%, 6.7x) Correctness verified: gzip body decodes to the right items/count/total; the cached response is byte-identical across requests; all other routes unchanged. Applies to both the epoll and io_uring variants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * vanilla: precompute query-key bytes + parse path ints in place Remove the remaining small per-request allocations on the hot path: • qint/qstr took a `string` key and called `key.bytes()` every request (one []u8 alloc per parameter — baseline parses a+b, async-db min+max+limit…). Keys are now precomputed `const []u8` (qk_*), built once at init. • /json/<n> and /crud/items/<id> parsed the id via route[n..].i64(), a substring copy. parse_u_at() reads the digits straight from the path view. Local before/after (16-core loopback) is within noise (baseline ~528K→530K, json ~206K→212K) — these allocs are tiny next to the response builder MDA2AV#866 removed — but allocation scaled hard on the 64-core arena (json +322% there), so this trims more GC churn for that environment at zero cost. 
Note: @[manualfree] is a no-op under the GC build the arena uses (`v -prod` = Boehm GC; manualfree only affects -autofree), so reducing allocations is the lever, not manualfree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Benchmark results: vanilla-epoll * vanilla: DB path — prepared statement (async-db) + single-pass HTML escape Folds the DB-path work into this PR so everything lands together: • async-db uses a PostgreSQL prepared statement (PQprepare/PQexecPrepared via db.pg, lazily prepared per pooled connection) instead of exec_param_many's per-request server-side SQL re-parse — local +9%. • escape_html (fortunes) does ONE pass with a no-alloc fast path instead of replace_each's five full-string passes — local +27% fortunes. DB profiles remain bound by the stdlib db.pg driver (text protocol), so this narrows the gap without closing it. Both backends. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Benchmark results: vanilla-io_uring * vanilla: armor hot-path byte writes against the V `<<` regression Single-element array push (`arr << x`) is 4-7x slower on post-0.5.1 V (vlang/v#27468) while bulk push_many, allocation and indexed writes are unaffected. The two hot single-element `<<` sites are now bulk writes: - wi() built integer digits with `out << tmp[i]` per digit; it now itoa's back-to-front into the [20]u8 scratch and flushes with one push_many. - write_json_response() pushed the item separator `,` and closing `}` one byte at a time; the closing `}` is now fused with the separator into a single '},' / '}' push_many. Output is byte-identical (verified across counts 0..4096 and edge-value integers). This makes the JSON hot path fast on both the 0.5.1 release and current master, independent of the upstream codegen regression. Both epoll and io_uring backends. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * vanilla: build pinned V 0.5.1 from source (pinned vc bootstrap) Build V from source at the 0.5.1 tag instead of the prebuilt release zip. Plain `make` can't build an old tag: its latest_vc step `git pull`s the newest vlang/vc bootstrap, which no longer matches 0.5.1's vlib (fails with `unknown ident \`native\``). So pin vc to the commit cut for 0.5.1 (vlang/vc f461dfeb = "[v:master] 0c3183c - V 0.5.1") and run make's own bootstrap recipe (cc -> v1 -> v2 -> v). Drop curl/unzip from the build deps. Pinned by tag, not a master commit, because post-0.5.1 master carries a codegen regression (single-element array push 4-7x slower, vlang/v#27468). Both backends; verified the source-built compiler serves /json and /pipeline correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * vanilla-epoll: serve static assets with sendfile(2) (zero-copy) The static handler copied each asset's full prebuilt response (up to ~300 KB) into the per-connection write_buf every request — a userspace copy plus a large *scanned* write_buf that grows the GC's stop-the-world cost at high conn counts (why vanilla sat ~4x behind nginx/swerver on the static profile). Preload each asset's fd once (O_RDONLY, page-cached, borrowed for the server's life) and a precomputed response head; serve the head into write_buf and stream the body zero-copy via core.queue_file (sendfile(2), already wired through the epoll backend's deferred-send + EPOLLOUT path). write_buf no longer grows, the body is never copied, and the kernel pushes file pages straight to the socket — the same model nginx and swerver use. Local (vendor.js 307 KB, 64c, wrk): 25.7K -> 59.3K req/s, 7.36 -> 16.97 GB/s (2.3x). Output verified byte-identical (md5) incl. keep-alive. epoll only. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * vanilla-epoll: answer /upload by Content-Length (engine drains large bodies) The lib now streams (drains) request bodies larger than 1 MiB instead of buffering them, so for a large upload req.body is empty — but the byte count the upload profile wants is the declared Content-Length. Answer by req.content_length() (falls back to the buffered body length when absent, which also covers small bodies that still take the buffered path). Depends on enghitalo/vanilla#31 (adds HttpRequest.content_length() + the engine drain); the Dockerfile clones lib main, so that PR must merge before this builds. Local (source-built V 0.5.1): upload single-conn 45 req/s / 907 MB/s, 32c 303 req/s / 6.1 GB/s — matching the top upload servers; RSS 14 MB (was ~1 GB buffering). epoll only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci: re-trigger validate — vanilla main now provides HttpRequest.content_length() (drain MDA2AV#31 merged); the prior run cloned vanilla before it landed * refactor(vanilla-epoll): route write-buffer appends through wb()/push_many Replace the remaining `out << <[]u8>` appends (static header, error consts, the four crud_* results, and the json-comp gzip-cache hit/store) with a wb() helper that calls push_many, uniform with the existing ws/wi. The bit-shift `<<` in the gz-cache key is unrelated and kept as is. Note: V already lowers `array << array` to array_push_many, so this is codegen- neutral — a consistency / regression-safety change (the whole write path now takes push_many's fast path explicitly, robust if `<<` ever regresses for arrays the way the single-element path did, vlang/v#27468). The hot single-element `<<` was already armored by ws/wi. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(vanilla-epoll): async-db via the native pg_async driver (enghitalo/vanilla#32) Convert the framework from the blocking db.pg ConnectionPool to vanilla's native async Postgres driver (pg_async, vanilla#39) on the epoll async runtime. The DB endpoints now PARK on the PG socket (ac.watch) and resume in a continuation instead of blocking a worker thread per query — closing the async-db gap (MDA2AV#32). - ServerConfig: request_handler → async_handler + make_state. Each worker owns a per-worker pg_async.PgPool (no cross-worker sharing, no locks) plus its own cache-aside and json-comp caches; the dataset/prefixes/static assets stay shared read-only. - async-db, fortunes, crud (list/get/create/update) issue a query, park, and render in a single resume continuation that switches on a small per-request stash. crud_list folds page+total into ONE window-count query (count(*) OVER()) instead of two round-trips. crud_get keeps a per-worker cache-aside (X-Cache). - DB responses are now hand-built (ws/wi/wb), and JSONB (tags) is emitted RAW from its binary form — no json.encode reflection, no decode/re-encode. - Drops the db.pg dependency entirely, so the framework also builds on master V (master removed pg.ConnectionPool); the non-DB hot paths are unchanged. Validated on V master against PostgreSQL 18 (items 100k + fortune 199): every endpoint correct (async-db items incl. binary jsonb, sorted fortunes, crud list/get/create/update, X-Cache). Throughput: async-db ~14.3k rps @ 4.35ms p50; /json ~376k rps. Per-worker caches warm under load (a re-GET may MISS across workers under SO_REUSEPORT — by design, vs the old shared+mutex cache). 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci: re-trigger MDA2AV#877 — vanilla MDA2AV#40 merged (net-import 0.5.1 build fix) The previous run failed because the framework's Docker cloned vanilla main BEFORE the fix landed: V's `net` declares C.socket with typed-enum params on the 0.5.1 tag, clashing with http_server.socket's int C.socket (socket_tcp.c.v). vanilla PR MDA2AV#40 removes the net imports (socket_tcp → C.htons; pg_async → raw libc dial), verified to compile under `v -prod .` on the true 0.5.1 tag for both vanilla-epoll and vanilla-io_uring. This empty commit re-runs validate so it re-clones the fixed vanilla. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(ci): cache-bust the vanilla clone so the build picks up library fixes The Docker `RUN git clone … vanilla` layer was cached indefinitely on the self-hosted runner, so re-runs kept building against a STALE vanilla checkout — which is why MDA2AV#877 stayed red even after the build fix (vanilla MDA2AV#40) merged: the build never re-cloned to get it. Add `ADD https://api.github.com/.../refs/heads/main` before the clone in both vanilla Dockerfiles. The fetched ref (main's SHA) changes whenever vanilla main moves, invalidating this layer's cache and forcing a fresh clone. Adding the step also re-clones on this build (new layer structure), so it now picks up MDA2AV#40. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(vanilla-epoll): share the crud + json-comp caches across workers The crud cache-aside was per-worker (WorkerCtx), but validate.sh's crud check does two GETs to /crud/items/42 and requires X-Cache MISS then HIT. With SO_REUSEPORT the two requests land on different workers, so a per-worker cache returns MISS both times → validation fails. Move the cache-aside (and the json-comp gzip cache) into the process-shared `Shared` (renamed from SharedRO), guarded by RwMutexes since workers are separate threads — restoring the original shared-cache semantics. 
The async Postgres pool stays per-worker (make_state); only the caches are shared. Verified against the real pgdb-seed.sql + dataset.json: GET /crud/items/42 now returns MISS then HIT; async-db (count=limit), crud list (5 items, total 9986, page 1), and fortunes (202 <tr>) all match validate.sh's checks. Compiles under `v -prod .` on the true V 0.5.1 tag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): size the per-worker pool by usable cores, not host CPUs Pair with vanilla's cpuset-aware max_thread_pool_size: compute the per-worker Postgres pool size against core.max_thread_pool_size (usable cores) instead of runtime.nr_cpus() (host count). Under api-N the engine now spawns N workers, so per_worker = total/N gives a sane pool (e.g. 64/4=16, 64/16=4) instead of 64/128=1 — matching the async path's threads≈cores model. Experiment for MDA2AV#32: test whether removing the 128-on-N-cores oversubscription recovers the async-db / api-16 regression. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * revert(vanilla-epoll): restore the sync db.pg DB path (drop the async pg_async conversion) Three clean, post-cache-bust benchmarks agree the native async pg_async path is a net loss on the arena's LOCAL low-latency DB profiles: epoll-async vs io_uring-sync showed sync winning api-4 ~4.9×, fortunes ~3.6×, api-16 ~1.6×, async-db ~1.2× (io_uring even handicapped by the cpuset change). The async path is bound by DB concurrency (pool conns) and never beats sync libpq's concurrency-via-threads for sub-ms queries; cpuset tuning only traded api-16 for api-4. The only async win was crud, which is cache-bound (skips the DB) — preserved by the sync framework too. Restore main.v to the pre-conversion sync version (d1a0e73): db.pg ConnectionPool + request_handler, keeping ALL the sync-path wins (pipelined, static via sendfile, upload streaming-drain, json-comp gzip cache, zero-alloc routing, shared X-Cache). 
pg_async stays in the vanilla library as a capability for the case it actually wins (latency-bound / network Postgres). The Dockerfile vanilla-clone cache-bust stays. Verified: builds under `v -prod .` on the V 0.5.1 tag against current vanilla main. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): short-circuit /pipeline with a precomputed constant The pipelined profile (the arena's highest-RPS test, ~35M rps) is a fixed plaintext "ok". Match it on `target` immediately after the path is sliced and blit a precomputed full-response constant + return — before the '?'-scan, the route slice, and write_resp's 6-part piecewise header build. requests/pipeline.raw is `GET /pipeline` with no query, so the exact-match is correct; the now-redundant route=='/pipeline' arm is dropped from the dispatch chain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): right-size the fortunes render builder (32KB -> content) callgrind on the render path showed ~26% of its instructions were the zero-fill of strings.new_builder(32768) — a flat 32 KB block for a ~1.5 KB response, re- zeroed every request as the GC reuses it. Size it from the actual rows instead (160 + 96/row + message bytes); an outlier grows once. The other builders here were already content-sized. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): re-enable async-db (pg_async park/resume) on flat-state runtime Restore the native-pg_async async DB path (per-worker PgPool via make_state; async_handler; submit -> ac.watch(pg fd) -> .suspend -> single on_db_ready continuation that pumps the result, renders by kind, releases the conn). This is the proven MDA2AV#32 conversion (was reverted at 84f3dc9 because it REGRESSED the arena on the old map-based reactor + per-request malloc), re-applied now that PR MDA2AV#41 replaced that with the flat fd-indexed reactor (no hashmap, no per-request alloc) — the overhead that sank it is gone. 
Why: fortunes (2,990 rps) / async-db (10,927) are capped at ~16-way concurrency by sync thread-per-core blocking (the 64-conn pool sits 75% idle; CPU ~460% = 11 cores idle, waiting on PG). Park/resume frees the worker to keep many queries in flight -> uses the whole pool -> the swerver (#1, 293k) model. This is the EXPERIMENT to confirm MDA2AV#41 turns the old regression into a win; needs an arena run. Keeps the recent sync wins (json-comp gzip cache, route-slice, /pipeline short-circuit). Builds clean on the flat-state runtime (post-0.5.1 local V). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(vanilla-epoll): process-shared crud + json-comp caches (async X-Cache) validate.sh failed `[crud cache-aside]: expected MISS then HIT, got MISS MISS`: restoring the MDA2AV#32 async conversion brought back per-worker caches, but SO_REUSEPORT routes the two probe GETs to different workers, so each MISSes its cold cache. Move the crud + gz caches out of per-worker WorkerCtx into the shared SharedRO, mutex-guarded (RwMutex) — the same process-shared model the sync path uses. Pool stays per-worker (no lock). Builds clean; this unblocks the async benchmark. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(vanilla-epoll): un-starve the async pool (drop the >8 clamp) + decode_into The MDA2AV#884 async-db regression was NOT the sub-ms-PG crossover — it was pool starvation + load-shedding (per the regression analysis). DATABASE_MAX_CONN=256 across 16 workers should give 16 conns/worker, but a `min(8)` clamp forced 8 → only 128 of the 256 budget used. With one-in-flight-per-conn and closed-loop load (~64 client conns/worker), the 8-slot ceiling is hit constantly; park() then SHEDS the overflow as an empty 200, so the closed-loop clients spin and real throughput collapses to ~1 core (async-db -30%, fortunes -77%, api -75%). crud was unaffected (+208%) only because it is cache-HIT served and never touches the pool. Fixes: 1. 
Drop the >8 clamp → use the full 256 budget = 16 conns/worker (2x in-flight), sized to Postgres max_connections. 2. Adopt request_parser.decode_into (no `!HttpRequest` boxing, ~13% of parse) — the same no-boxing entry the sync build now uses; recovers the json-comp/json non-DB delta that the async dispatch path was paying. Builds clean (pg_async, post-0.5.1 local V; vanilla MDA2AV#44 with decode_into is now in main). Follow-ups (not here): queue on pool-full instead of shedding empty 200s; hoist AsyncCtx out of the async_drain per-request loop (vanilla). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(vanilla-epoll): pipeline async-db queries across requests (increment 3) Wire the framework onto vanilla's cross-request pipelining. park() now picks the least-loaded connection via acquire_pipelined() (shed only when all conns are at the max_inflight cap) instead of acquire()'s one-in-flight-per-conn — the latter starved the per-worker pool (2 conns/worker on the arena box ⇒ ceiling of 2, then park sheds the overflow as empty 200s; PR MDA2AV#884's regression). async_submit's shed-on-full bool is now checked. on_db_ready drops the exclusive release: a pipelined connection is not held exclusively, its in-flight slot frees when async_on_readable pops the reply, and the reactor (per-fd watch queue) runs the connection's parked requests front-first so the FIFO reply aligns with each request's Stash. async-db/fortunes/crud-list all flow through park(), so all gain conns×N concurrency. Builds against the local pipelining vanilla. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(vanilla-epoll): pin vanilla to the pipelining branch to bench before merge TEMPORARY: point the Dockerfile clone + cache-bust ADD at vanilla's feat/pg-async-pipelining (PR MDA2AV#45) so this arena PR builds against the cross-request pipelining library and can benchmark before MDA2AV#45 merges. Revert to refs/heads/main once MDA2AV#45 lands. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): restore /pipeline skip-decode fast path Local profiling (callgrind on a gcc-built binary, isolated gcannon) showed this branch lost the `has_pipeline_prefix` fast path the MDA2AV#877 branch had: handle() ran decode_into + parse_http1_request_line on EVERY request, so the highest-RPS /pipeline test paid the full HTTP parse (~55% of the per-request CPU; the in-handle parse alone ~17%) for a fixed response. Restore it: match the raw `GET /pipeline ` prefix and blit pipeline_resp BEFORE any parsing. The request is already framed by the reactor, so decode adds nothing here. After: the per-request /pipeline profile has ZERO parse functions, and local throughput rose from 90% to 96% of the bare-C epoll floor (gc none). The remaining gap + the boehm-vs-gc-none delta is per-request allocation/GC (next). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(vanilla-epoll): build against vanilla main (pipelining + recv-path merged) vanilla#45 (cross-request pipelining: driver + reactor + pool + the alloc-free recv path) is merged to vanilla main, so drop the temporary feat/pg-async-pipelining pin and clone main again (with the main-ref cache-bust). Keeps -gc none. The async-db +329% / pipelined +1622% results were against this exact code (main now == the merged branch), so no re-bench needed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): arm pooled PG conns with watch_persistent Park DB queries on the pooled connection via ac.watch_persistent so a client disconnecting mid-query no longer closes (and forces a reconnect + re-auth on) the pooled connection: the runtime drains the orphaned reply in order and keeps the conn open for reuse. Both the initial park and the not-ready re-arm use it (the single-watch path resets the slot, so the re-arm must re-stamp the pool-owned flag). 
Depends on vanilla watch_persistent (enghitalo/vanilla#47); land after it merges + vanilla main is updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): render DB responses into a reused per-worker buffer render_async_db / render_fortunes / render_crud_list each allocated a fresh response-body []u8 (4 KiB / 32 KiB / 8 KiB) per request. The binary ships `-gc none`, so those buffers are never freed — a multi-GiB leak under DB load (async-db measured ~12 KiB/request total, tens of GiB on the arena). Build the body in a single per-worker scratch buffer (WorkerCtx.scratch), reset to len 0 each response; it grows to a high-water mark then stays. Safe because a worker serves one request at a time (no concurrency). Paired with the pg_async per-connection frames-buffer pool, this takes async-db from 11,971 -> 1,263 bytes/request leaked under `-gc none` (-89.5%) locally; the Boehm build is dead flat (0 B/req, 41 MiB) and `-gc none` is now FASTER than Boehm on async-db (48.8K vs 44.2K req/s) since it no longer thrashes an ever-growing heap. (render_fortunes still allocates its Fortune vector + per-row message copies; that and the residual ~1.3 KiB/request of async-db allocs are a follow-up.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): zero-alloc DB request path (Stage B, -gc none) Eliminate the remaining per-request heap allocations on the DB routes (which leak under the binary's `-gc none` build). With the pg_async/request_parser Stage B (reused submit scratch, no .bytes() in the wire builders, no-alloc query parse), async-db drops from 1,263 -> 159 bytes/request (11,971 -> 159 across Stage A+B, -98.7%); the Boehm build is dead flat. - Bind params: replace the per-request `[?[]u8(x.str().bytes()), ...]` literals in every start_* with reused per-worker buffers — param_scratch (int params as decimal bytes) + params_buf (the []?[]u8), refilled via push_int/push_bytes. 
The borrowed slices are copied by write_bind synchronously inside park, so they never outlive the call. param_scratch cap (256) ≫ worst case (5×20) so it never reallocates mid-request (which would dangle already-pushed slices). - Query parse: qint parses i64 in place (parse_i64_slice); qstr_slice returns a borrowed []u8 view instead of .clone(). Shed-path fallbacks are module consts (no per-request `.bytes()`). - Stash: a per-worker free-list (stash_pool) instead of `&Stash{}` per request; returned only on the terminal .done path — never on the not-ready re-arm, where it stays live as the watch udata (incl. a FIX 3 dead tombstone). Statement form for the borrow (a `&Struct{}` if-expression branch miscompiles under -g, #27485). - /fortunes: reused fortunes_buf with BORROWED message views (no bytestr().clone()), an explicit byte comparator for the sort, and escape_html_into that escapes directly into the render scratch (no Builder/string per row). Gated: all routes correct vs PG18 (async-db, fortunes sort+escape incl. the <script> XSS row, crud list/get/create/update round-trip, cache MISS->HIT, baseline11). Builds clean -prod and -g. Needs the pg_async/request_parser Stage B (enghitalo/vanilla); land after that merges. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(vanilla-epoll): handle i64::MIN in wi() integer formatter wi()'s itoa negated n into a signed accumulator (`x = -x`), which overflows for i64::MIN — its magnitude isn't representable as i64 — leaving x negative so the digit loop never ran and only '-' was emitted. Build the magnitude in u64 instead (-(n+1)+1, with the +1 done in u64). Unreachable from current routes (ids are 32-bit, query ints are clamped) so this is not a live bug, but it's a latent correctness hole in a general integer helper. Verified against i64::MIN/MAX, 0, and assorted +/- values. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): zero-alloc int responses on /baseline11 and /upload Both built their body with `n.str()` — an int->string heap allocation on every request that leaks under -gc none. callgrind on /baseline11 showed 1.002 allocs/request, all impl_i64_to_string; at 3.4M RPS that path alone was ~6 GiB RSS in the arena. Format the int into the reused per-worker scratch via a new emit_int() helper instead (the same render-scratch pattern the DB paths use). The read DB paths (async-db, fortunes, crud-list) are already callgrind-clean per request; this closes the general/plaintext response path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(vanilla-epoll): 503 (not 400/404) when the crud DB pool sheds under load Under DB-pipeline saturation, park() returns the caller's fallback. The read paths shed to a benign empty 200, but crud create/update/get shed to bad_request (400) / not_found (404) — misreporting a backpressure shed as a client error. At arena scale (4096 conns) this surfaced as ~1.4% "unexpected status" on crud, while every other framework AND the previously-recorded vanilla-epoll showed 0 with the SAME gcannon (its requests are well-formed — confirmed by faithful fixture replay: 100% 2xx at 16 and 96 threads, fresh-conn and keep-alive+reconnect; reproduced the 400 only by forcing pool saturation with DATABASE_MAX_CONN=1). Only the shed fallback for the three crud write/get paths becomes 503 Service Unavailable — the honest backpressure status. Genuine 400 (malformed JSON body) and 404 (missing item) are unchanged. Reducing the shed itself (pool / max_inflight capacity, which trades memory) and a holistic shed policy (read paths also shed to an empty 200) are deferred to a measured investigation tracked upstream in vanilla. 
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf+fix(vanilla-epoll): zero-alloc chunked-body parse on /baseline11 (reused scratch) /baseline11's chunked-POST path parsed the body with `dechunk(s string) string`, which built a strings.Builder + .str() (and strconv_hex's trim_space) per request — a permanent leak under -gc none. In the arena baseline mix (1/3 of the templates are chunked POSTs, at ~3.8M req/s) this was the ~6 GiB RSS the MDA2AV#888 re-bench still showed after emit_int closed the GET path. Replace with `(mut w WorkerCtx) body_int()` dechunking into a reused per-worker scratch (WorkerCtx.dechunk_buf): byte-walk the chunked region (dechunk_into), append data bytes via push_many, parse the integer in place (parse_i64_slice). No allocation in steady state. Verified byte-identical to the old dechunk on valid bodies (single/multi-chunk, hex/uppercase sizes, 0x100, chunk-extensions) and safe on malformed input. Also hardens a latent OOB read / DoS: the chunk-size range check is now overflow-safe — `size > end - data_start` instead of `data_start + size > end`, which wraps i32 for a crafted chunk size near 0x7fffffff, slipping past the guard and feeding a ~2 GiB out-of-bounds read into push_many (remotely-triggerable worker segfault). The old dechunk hit the same input as a bounds-checked panic; this makes it a controlled, bounded reject. parse_hex_slice now accumulates in i64 and saturates so the size value itself can't wrap. A 5-lens adversarial review (correctness, scratch-reuse lifetime, zero-alloc, method/caller parity) returned GO after this fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Benchmark results: vanilla-epoll --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
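The chunk-size guard described in the last commit above can be reproduced in a small model. Python integers do not overflow, so the sketch emulates 32-bit signed wraparound with a hypothetical `wrap_i32` helper. The function names `old_guard_accepts` and `new_guard_accepts` and the sample offsets are assumptions for illustration; only the two comparison forms come from the commit message.

```python
def wrap_i32(x: int) -> int:
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def old_guard_accepts(data_start: int, size: int, end: int) -> bool:
    # Old form: reject when data_start + size > end.
    # The addition is done in i32, so a huge size wraps negative and passes.
    return not (wrap_i32(data_start + size) > end)


def new_guard_accepts(data_start: int, size: int, end: int) -> bool:
    # New form: reject when size > end - data_start.
    # end - data_start cannot overflow when 0 <= data_start <= end.
    return not (size > end - data_start)


# A crafted chunk size near 0x7fffffff against a 64-byte buffer.
print(old_guard_accepts(16, 0x7FFFFFFF, 64))  # prints "True": the guard is bypassed
print(new_guard_accepts(16, 0x7FFFFFFF, 64))  # prints "False": rejected
```

Both forms agree on well-formed input, for example a 10-byte chunk at offset 16 in the same buffer. They differ only when the sum would exceed the i32 range.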

enghitalo and others added 2 commits June 17, 2026 17:05

enghitalo closed this Jun 17, 2026

enghitalo wants to merge 2 commits into perf/vanilla-async-db-flatstate from perf/vanilla-async-db-pipelining
