
perf(vanilla-epoll): cross-request pipelining for async-db#1

Closed
enghitalo wants to merge 2 commits into
perf/vanilla-async-db-flatstate from
perf/vanilla-async-db-pipelining


@enghitalo

Owner

Wires the framework onto vanilla's cross-request pipelining (vanilla#45).

Why

On the 128-core box, per-worker pools are tiny (DATABASE_MAX_CONN=256 / 128 = 2 conns/worker). The old one-in-flight-per-conn rule capped per-worker DB concurrency at 2; under closed-loop load, park() shed the overflow as empty 200s. That was MDA2AV#884's async-db/fortunes regression (pool starvation, not a crossover).

Change

  • park() picks the least-loaded connection (acquire_pipelined) and sheds only when every conn is at the max_inflight cap — so a connection now carries up to N=8 in-flight queries (per-worker ceiling conns × N).
  • async_submit's shed-on-full bool is now checked. on_db_ready drops the exclusive release: a pipelined conn isn't held exclusively, and its slot frees when the reply is popped. The reactor's per-fd watch queue fans each reply to its request in submission order.
  • async-db / fortunes / crud-list all flow through park(), so all gain the concurrency.
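
The selection rule in the list above can be sketched as follows. This is a minimal Python model, not the vanilla library's code: `Conn`, `Pool`, `reply_popped`, and `per_worker_ceiling` are hypothetical; only the names `acquire_pipelined` and `max_inflight`, the cap N=8, and the 256 / 128 figure come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Conn:
    """Hypothetical stand-in for one pooled Postgres connection."""
    fd: int
    inflight: int = 0  # queries sent but not yet answered


@dataclass
class Pool:
    """Hypothetical per-worker pool; max_inflight=8 mirrors N=8 above."""
    conns: List[Conn]
    max_inflight: int = 8

    def acquire_pipelined(self) -> Optional[Conn]:
        # Pick the connection with the fewest queries in flight.
        best = min(self.conns, key=lambda c: c.inflight)
        if best.inflight >= self.max_inflight:
            return None  # every conn is at the cap: the caller sheds
        best.inflight += 1
        return best

    def reply_popped(self, conn: Conn) -> None:
        # A slot frees when a reply is popped, not on an exclusive release.
        conn.inflight -= 1


def per_worker_ceiling(max_conn: int, workers: int, max_inflight: int) -> int:
    """Per-worker in-flight ceiling: conns per worker times N."""
    return (max_conn // workers) * max_inflight
```

With these assumptions, `per_worker_ceiling(256, 128, 1)` gives the old ceiling of 2 and `per_worker_ceiling(256, 128, 8)` gives 16.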

CI note

The Dockerfile is temporarily pinned to vanilla feat/pg-async-pipelining (vanilla#45) so this can benchmark before MDA2AV#45 merges — revert to refs/heads/main once it lands.

Validated

Driver tested against real PG18 (FIFO order + mid-pipeline error isolation); reactor unit-tested; framework compiles against the pipelining lib. The async-db benchmark here is the end-to-end gate: it drives the concurrent parks and, through them, the reactor's multi-client drain under load.

🤖 Generated with Claude Code

enghitalo and others added 2 commits June 17, 2026 17:05
…ment 3)

Wire the framework onto vanilla's cross-request pipelining. park() now picks the
least-loaded connection via acquire_pipelined() (shed only when all conns are at
the max_inflight cap) instead of acquire()'s one-in-flight-per-conn — the latter
starved the per-worker pool (2 conns/worker on the arena box ⇒ ceiling of 2, then
park sheds the overflow as empty 200s; PR MDA2AV#884's regression). async_submit's
shed-on-full bool is now checked. on_db_ready drops the exclusive release: a
pipelined connection is not held exclusively, its in-flight slot frees when
async_on_readable pops the reply, and the reactor (per-fd watch queue) runs the
connection's parked requests front-first so the FIFO reply aligns with each
request's Stash. async-db/fortunes/crud-list all flow through park(), so all gain
conns×N concurrency. Builds against the local pipelining vanilla.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
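
The front-first delivery described in the commit message above relies on Postgres answering pipelined queries on one connection in the order they were sent. A rough Python model of a per-fd watch queue, under that assumption, might look like this. The `Reactor` class and its method names are hypothetical and do not reflect vanilla's actual reactor API.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


class Reactor:
    """Hypothetical model of a per-fd watch queue (not vanilla's reactor)."""

    def __init__(self) -> None:
        # fd -> parked request ids, oldest first
        self.watch: Dict[int, Deque[str]] = defaultdict(deque)

    def park(self, fd: int, request_id: str) -> None:
        # Called after the query has been written to the connection.
        self.watch[fd].append(request_id)

    def on_readable(self, fd: int, replies: List[str]) -> List[Tuple[str, str]]:
        # Replies arrive in submission order, so the front of the queue
        # always owns the next reply on this fd.
        delivered = []
        for reply in replies:
            delivered.append((self.watch[fd].popleft(), reply))
        return delivered
```

Because each fd has its own queue, replies on one connection never disturb the ordering of requests parked on another.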
…re merge

TEMPORARY: point the Dockerfile clone + cache-bust ADD at vanilla's
feat/pg-async-pipelining (PR MDA2AV#45) so this arena PR builds against the
cross-request pipelining library and can benchmark before MDA2AV#45 merges. Revert to
refs/heads/main once MDA2AV#45 lands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@enghitalo

Owner Author

/benchmark -f vanilla-epoll -t async-db

@github-actions


👋 /benchmark request received. A collaborator will review and approve the run.

@enghitalo

Owner Author

Superseded by MDA2AV#888 — benchmarking on upstream (fork CI looped).

@enghitalo enghitalo closed this Jun 17, 2026
enghitalo pushed a commit that referenced this pull request Jun 19, 2026
* zix 0.4.x-rc1

* zix drop WebSocket (split to zix-ws instead)

* zix 0.4.x-rc1 x86_64 musl alpine

* zix head comment info

* zix: move to 0.4.x-rc2

* trigger action

* clearing space

* attempt to resolve with retry

* ci: retrigger #1

* re-strategize using two source and retry

* Attempt rc2 test 2

* make retry 6

* url wrap around double quote

* using git clone over https

* finalizing 0.4.x-rc2

* accident junk

* preparing 0.4.x

* bump: zix 0.4.x

* updating meta

* correcting/separation concern

* switching dispatch model

* ci: retrigger number 1

* ci: retrigger number 1

* ci: retrigger number 1

* ci: retrigger number 1

* ci: retrigger number 2

* Benchmark results: zix

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
enghitalo added a commit that referenced this pull request Jun 19, 2026
…AV#884) (MDA2AV#888)

* vanilla: cache json-comp gzip responses + zero-alloc routing

json-comp recompressed the gzip body on EVERY request even though the output
for a given (count, m) is fully deterministic — and gzip CPU, not allocation,
dominates that profile. Cache the COMPLETE gzipped response per (count, m) and
append the cached copy on a hit (bounded map, RwMutex). The benchmark hits only
a handful of (count, m) pairs, so the cache stays tiny.

Also route on the path WITHOUT allocating: a tos() view into the request buffer
instead of all_before('?')'s per-request string copy (one alloc per request on
the hot path), shaving GC churn off baseline/json too.

Local before/after (16-core loopback, gcannon, single listener):
  json-comp  58K -> 390K req/s  (+570%, 6.7x)

Correctness verified: gzip body decodes to the right items/count/total; the
cached response is byte-identical across requests; all other routes unchanged.
Applies to both the epoll and io_uring variants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* vanilla: precompute query-key bytes + parse path ints in place

Remove the remaining small per-request allocations on the hot path:
  • qint/qstr took a `string` key and called `key.bytes()` every request (one
    []u8 alloc per parameter — baseline parses a+b, async-db min+max+limit…).
    Keys are now precomputed `const []u8` (qk_*), built once at init.
  • /json/<n> and /crud/items/<id> parsed the id via route[n..].i64(), a
    substring copy. parse_u_at() reads the digits straight from the path view.

Local before/after (16-core loopback) is within noise (baseline ~528K→530K,
json ~206K→212K) — these allocs are tiny next to the response builder MDA2AV#866
removed — but allocation scaled hard on the 64-core arena (json +322% there),
so this trims more GC churn for that environment at zero cost. Note: @[manualfree]
is a no-op under the GC build the arena uses (`v -prod` = Boehm GC; manualfree
only affects -autofree), so reducing allocations is the lever, not manualfree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
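
Reading the digits straight from the path view, as the commit above describes for parse_u_at(), amounts to the loop below. The name comes from the commit; the signature and the -1 "no digits" convention are assumptions made for this sketch.

```python
def parse_u_at(buf: bytes, start: int) -> int:
    """Parse unsigned decimal digits beginning at buf[start], without slicing.

    Returns -1 when no digit is present at `start` (assumed convention).
    """
    n = 0
    i = start
    while i < len(buf) and 48 <= buf[i] <= 57:  # ASCII '0'..'9'
        n = n * 10 + (buf[i] - 48)
        i += 1
    return n if i > start else -1
```

For example, `parse_u_at(b"/crud/items/42", 12)` walks the two digits after the 12-byte prefix without creating a substring.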

* Benchmark results: vanilla-epoll

* vanilla: DB path — prepared statement (async-db) + single-pass HTML escape

Folds the DB-path work into this PR so everything lands together:
  • async-db uses a PostgreSQL prepared statement (PQprepare/PQexecPrepared via
    db.pg, lazily prepared per pooled connection) instead of exec_param_many's
    per-request server-side SQL re-parse — local +9%.
  • escape_html (fortunes) does ONE pass with a no-alloc fast path instead of
    replace_each's five full-string passes — local +27% fortunes.

DB profiles remain bound by the stdlib db.pg driver (text protocol), so this
narrows the gap without closing it. Both backends.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
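
A single-pass escape with a no-allocation fast path, in the spirit of the escape_html change above, could look like this in Python. The exact set of five characters and their entity spellings are assumptions; the commit only states that five replace passes were collapsed into one.

```python
_ESCAPES = {
    ord("&"): b"&amp;",
    ord("<"): b"&lt;",
    ord(">"): b"&gt;",
    ord('"'): b"&quot;",
    ord("'"): b"&#39;",
}


def escape_html(src: bytes) -> bytes:
    out = None  # allocated lazily, only if something needs escaping
    run_start = 0
    for i, b in enumerate(src):
        rep = _ESCAPES.get(b)
        if rep is None:
            continue
        if out is None:
            out = bytearray()
        out += src[run_start:i]  # copy the clean run before this byte
        out += rep
        run_start = i + 1
    if out is None:
        return src  # fast path: input returned untouched
    out += src[run_start:]
    return bytes(out)
```

Most fortune rows contain nothing to escape, so the fast path is the common case.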

* Benchmark results: vanilla-io_uring

* vanilla: armor hot-path byte writes against the V `<<` regression

Single-element array push (`arr << x`) is 4-7x slower on post-0.5.1 V
(vlang/v#27468) while bulk push_many, allocation and indexed writes are
unaffected. The two hot single-element `<<` sites are now bulk writes:

  - wi() built integer digits with `out << tmp[i]` per digit; it now itoa's
    back-to-front into the [20]u8 scratch and flushes with one push_many.
  - write_json_response() pushed the item separator `,` and closing `}`
    one byte at a time; the closing `}` is now fused with the separator
    into a single '},' / '}' push_many.

Output is byte-identical (verified across counts 0..4096 and edge-value
integers). This makes the JSON hot path fast on both the 0.5.1 release and
current master, independent of the upstream codegen regression. Both
epoll and io_uring backends.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
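
The back-to-front digit fill with one bulk flush, as described for wi() above, is sketched here in Python for non-negative values below 2**64 only. The 20-byte scratch matches the commit; everything else, including leaving out sign handling, is a simplification for illustration.

```python
def wi(out: bytearray, n: int) -> None:
    """Append the decimal form of n (0 <= n < 2**64) with one bulk write."""
    tmp = bytearray(20)  # scratch; 20 digits cover any unsigned 64-bit value
    pos = len(tmp)
    while True:
        pos -= 1
        tmp[pos] = 48 + n % 10  # fill from the back, least significant first
        n //= 10
        if n == 0:
            break
    out += tmp[pos:]  # one bulk append instead of one push per digit
```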

* vanilla: build pinned V 0.5.1 from source (pinned vc bootstrap)

Build V from source at the 0.5.1 tag instead of the prebuilt release zip.
Plain `make` can't build an old tag: its latest_vc step `git pull`s the
newest vlang/vc bootstrap, which no longer matches 0.5.1's vlib (fails with
`unknown ident \`native\``). So pin vc to the commit cut for 0.5.1
(vlang/vc f461dfeb = "[v:master] 0c3183c - V 0.5.1") and run make's own
bootstrap recipe (cc -> v1 -> v2 -> v). Drop curl/unzip from the build deps.

Pinned by tag, not a master commit, because post-0.5.1 master carries a
codegen regression (single-element array push 4-7x slower, vlang/v#27468).
Both backends; verified the source-built compiler serves /json and /pipeline
correctly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* vanilla-epoll: serve static assets with sendfile(2) (zero-copy)

The static handler copied each asset's full prebuilt response (up to ~300 KB)
into the per-connection write_buf every request — a userspace copy plus a large
*scanned* write_buf that grows the GC's stop-the-world cost at high conn counts
(why vanilla sat ~4x behind nginx/swerver on the static profile).

Preload each asset's fd once (O_RDONLY, page-cached, borrowed for the server's
life) and a precomputed response head; serve the head into write_buf and stream
the body zero-copy via core.queue_file (sendfile(2), already wired through the
epoll backend's deferred-send + EPOLLOUT path). write_buf no longer grows, the
body is never copied, and the kernel pushes file pages straight to the socket —
the same model nginx and swerver use.

Local (vendor.js 307 KB, 64c, wrk): 25.7K -> 59.3K req/s, 7.36 -> 16.97 GB/s
(2.3x). Output verified byte-identical (md5) incl. keep-alive. epoll only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* vanilla-epoll: answer /upload by Content-Length (engine drains large bodies)

The lib now streams (drains) request bodies larger than 1 MiB instead of
buffering them, so for a large upload req.body is empty — but the byte count the
upload profile wants is the declared Content-Length. Answer by
req.content_length() (falls back to the buffered body length when absent, which
also covers small bodies that still take the buffered path).

Depends on enghitalo/vanilla#31 (adds HttpRequest.content_length() + the engine
drain); the Dockerfile clones lib main, so that PR must merge before this builds.
Local (source-built V 0.5.1): upload single-conn 45 req/s / 907 MB/s, 32c 303
req/s / 6.1 GB/s — matching the top upload servers; RSS 14 MB (was ~1 GB
buffering). epoll only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci: re-trigger validate — vanilla main now provides HttpRequest.content_length() (drain MDA2AV#31 merged); the prior run cloned vanilla before it landed

* refactor(vanilla-epoll): route write-buffer appends through wb()/push_many

Replace the remaining `out << <[]u8>` appends (static header, error consts, the
four crud_* results, and the json-comp gzip-cache hit/store) with a wb() helper
that calls push_many, uniform with the existing ws/wi. The bit-shift `<<` in the
gz-cache key is unrelated and kept as is.

Note: V already lowers `array << array` to array_push_many, so this is codegen-
neutral — a consistency / regression-safety change (the whole write path now
takes push_many's fast path explicitly, robust if `<<` ever regresses for arrays
the way the single-element path did, vlang/v#27468). The hot single-element `<<`
was already armored by ws/wi.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(vanilla-epoll): async-db via the native pg_async driver (enghitalo/vanilla#32)

Convert the framework from the blocking db.pg ConnectionPool to vanilla's native
async Postgres driver (pg_async, vanilla#39) on the epoll async runtime. The DB
endpoints now PARK on the PG socket (ac.watch) and resume in a continuation
instead of blocking a worker thread per query — closing the async-db gap (MDA2AV#32).

- ServerConfig: request_handler → async_handler + make_state. Each worker owns a
  per-worker pg_async.PgPool (no cross-worker sharing, no locks) plus its own
  cache-aside and json-comp caches; the dataset/prefixes/static assets stay
  shared read-only.
- async-db, fortunes, crud (list/get/create/update) issue a query, park, and
  render in a single resume continuation that switches on a small per-request
  stash. crud_list folds page+total into ONE window-count query (count(*) OVER())
  instead of two round-trips. crud_get keeps a per-worker cache-aside (X-Cache).
- DB responses are now hand-built (ws/wi/wb), and JSONB (tags) is emitted RAW
  from its binary form — no json.encode reflection, no decode/re-encode.
- Drops the db.pg dependency entirely, so the framework also builds on master V
  (master removed pg.ConnectionPool); the non-DB hot paths are unchanged.

Validated on V master against PostgreSQL 18 (items 100k + fortune 199): every
endpoint correct (async-db items incl. binary jsonb, sorted fortunes, crud
list/get/create/update, X-Cache). Throughput: async-db ~14.3k rps @ 4.35ms p50;
/json ~376k rps. Per-worker caches warm under load (a re-GET may MISS across
workers under SO_REUSEPORT — by design, vs the old shared+mutex cache).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci: re-trigger MDA2AV#877 — vanilla MDA2AV#40 merged (net-import 0.5.1 build fix)

The previous run failed because the framework's Docker cloned vanilla main BEFORE
the fix landed: V's `net` declares C.socket with typed-enum params on the 0.5.1
tag, clashing with http_server.socket's int C.socket (socket_tcp.c.v). vanilla
PR MDA2AV#40 removes the net imports (socket_tcp → C.htons; pg_async → raw libc dial),
verified to compile under `v -prod .` on the true 0.5.1 tag for both
vanilla-epoll and vanilla-io_uring. This empty commit re-runs validate so it
re-clones the fixed vanilla.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(ci): cache-bust the vanilla clone so the build picks up library fixes

The Docker `RUN git clone … vanilla` layer was cached indefinitely on the
self-hosted runner, so re-runs kept building against a STALE vanilla checkout —
which is why MDA2AV#877 stayed red even after the build fix (vanilla MDA2AV#40) merged: the
build never re-cloned to get it.

Add `ADD https://api.github.com/.../refs/heads/main` before the clone in both
vanilla Dockerfiles. The fetched ref (main's SHA) changes whenever vanilla main
moves, invalidating this layer's cache and forcing a fresh clone. Adding the
step also re-clones on this build (new layer structure), so it now picks up MDA2AV#40.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(vanilla-epoll): share the crud + json-comp caches across workers

The crud cache-aside was per-worker (WorkerCtx), but validate.sh's crud check
does two GETs to /crud/items/42 and requires X-Cache MISS then HIT. With
SO_REUSEPORT the two requests land on different workers, so a per-worker cache
returns MISS both times → validation fails.

Move the cache-aside (and the json-comp gzip cache) into the process-shared
`Shared` (renamed from SharedRO), guarded by RwMutexes since workers are
separate threads — restoring the original shared-cache semantics. The async
Postgres pool stays per-worker (make_state); only the caches are shared.

Verified against the real pgdb-seed.sql + dataset.json: GET /crud/items/42 now
returns MISS then HIT; async-db (count=limit), crud list (5 items, total 9986,
page 1), and fortunes (202 <tr>) all match validate.sh's checks. Compiles under
`v -prod .` on the true V 0.5.1 tag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
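
The MISS/MISS failure and its fix in the commit above can be reproduced with a toy cache-aside model. `CacheAside` and `probe` are hypothetical names, and a plain lock stands in for the RwMutex; the point is only the difference between one cache per worker and one cache shared by all workers.

```python
import threading
from typing import Callable, Dict, List, Tuple


class CacheAside:
    """Hypothetical cache-aside store; a plain lock stands in for RwMutex."""

    def __init__(self) -> None:
        self._items: Dict[int, bytes] = {}
        self._lock = threading.Lock()

    def get(self, item_id: int, load: Callable[[int], bytes]) -> Tuple[bytes, str]:
        with self._lock:
            hit = self._items.get(item_id)
        if hit is not None:
            return hit, "HIT"
        value = load(item_id)  # database read on a miss
        with self._lock:
            self._items[item_id] = value
        return value, "MISS"


def probe(caches: List[CacheAside], item_id: int,
          load: Callable[[int], bytes]) -> Tuple[str, str]:
    """Two GETs that land on different workers, as SO_REUSEPORT may route them."""
    first = caches[0].get(item_id, load)[1]
    second = caches[1].get(item_id, load)[1]
    return first, second
```

Passing two distinct caches models the per-worker layout; passing the same cache twice models the process-shared one.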

* perf(vanilla-epoll): size the per-worker pool by usable cores, not host CPUs

Pair with vanilla's cpuset-aware max_thread_pool_size: compute the per-worker
Postgres pool size against core.max_thread_pool_size (usable cores) instead of
runtime.nr_cpus() (host count). Under api-N the engine now spawns N workers, so
per_worker = total/N gives a sane pool (e.g. 64/4=16, 64/16=4) instead of
64/128=1 — matching the async path's threads≈cores model.

Experiment for MDA2AV#32: test whether removing the 128-on-N-cores oversubscription
recovers the async-db / api-16 regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
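
The sizing arithmetic in the commit above, as a short Python helper. The function name is hypothetical, and the floor of 1 is an assumption chosen to match the "64/128=1" example.

```python
def per_worker_pool(total_conns: int, workers: int) -> int:
    """Split a connection budget evenly across workers.

    The floor of 1 is assumed so that an oversubscribed worker count
    still leaves each worker one connection.
    """
    return max(1, total_conns // workers)
```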

* revert(vanilla-epoll): restore the sync db.pg DB path (drop the async pg_async conversion)

Three clean, post-cache-bust benchmarks agree the native async pg_async path is a
net loss on the arena's LOCAL low-latency DB profiles: epoll-async vs io_uring-sync
showed sync winning api-4 ~4.9×, fortunes ~3.6×, api-16 ~1.6×, async-db ~1.2×
(io_uring even handicapped by the cpuset change). The async path is bound by DB
concurrency (pool conns) and never beats sync libpq's concurrency-via-threads for
sub-ms queries; cpuset tuning only traded api-16 for api-4. The only async win was
crud, which is cache-bound (skips the DB) — preserved by the sync framework too.

Restore main.v to the pre-conversion sync version (d1a0e73): db.pg ConnectionPool
+ request_handler, keeping ALL the sync-path wins (pipelined, static via sendfile,
upload streaming-drain, json-comp gzip cache, zero-alloc routing, shared X-Cache).
pg_async stays in the vanilla library as a capability for the case it actually wins
(latency-bound / network Postgres). The Dockerfile vanilla-clone cache-bust stays.

Verified: builds under `v -prod .` on the V 0.5.1 tag against current vanilla main.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(vanilla-epoll): short-circuit /pipeline with a precomputed constant

The pipelined profile (the arena's highest-RPS test, ~35M rps) is a fixed
plaintext "ok". Match it on `target` immediately after the path is sliced and
blit a precomputed full-response constant + return — before the '?'-scan, the
route slice, and write_resp's 6-part piecewise header build. requests/pipeline.raw
is `GET /pipeline` with no query, so the exact-match is correct; the now-redundant
route=='/pipeline' arm is dropped from the dispatch chain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(vanilla-epoll): right-size the fortunes render builder (32KB -> content)

callgrind on the render path showed ~26% of its instructions were the zero-fill
of strings.new_builder(32768) — a flat 32 KB block for a ~1.5 KB response, re-
zeroed every request as the GC reuses it. Size it from the actual rows instead
(160 + 96/row + message bytes); an outlier grows once. The other builders here
were already content-sized.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
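
The content-based sizing formula from the commit above (160 + 96/row + message bytes) written out as a Python function. The function name is hypothetical; the constants are the ones quoted in the commit.

```python
from typing import Sequence


def fortunes_capacity(messages: Sequence[bytes]) -> int:
    """Initial builder capacity: fixed page overhead, per-row markup,
    plus the raw message bytes."""
    return 160 + 96 * len(messages) + sum(len(m) for m in messages)
```

Twelve 10-byte messages, for instance, give a capacity of 1,432 bytes rather than a flat 32,768.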

* perf(vanilla-epoll): re-enable async-db (pg_async park/resume) on flat-state runtime

Restore the native-pg_async async DB path (per-worker PgPool via make_state;
async_handler; submit -> ac.watch(pg fd) -> .suspend -> single on_db_ready
continuation that pumps the result, renders by kind, releases the conn). This is
the proven MDA2AV#32 conversion (was reverted at 84f3dc9 because it REGRESSED the arena
on the old map-based reactor + per-request malloc), re-applied now that PR MDA2AV#41
replaced that with the flat fd-indexed reactor (no hashmap, no per-request alloc)
— the overhead that sank it is gone.

Why: fortunes (2,990 rps) / async-db (10,927) are capped at ~16-way concurrency
by sync thread-per-core blocking (the 64-conn pool sits 75% idle; CPU ~460% = 11
cores idle, waiting on PG). Park/resume frees the worker to keep many queries in
flight -> uses the whole pool -> the swerver (#1, 293k) model. This is the
EXPERIMENT to confirm MDA2AV#41 turns the old regression into a win; needs an arena run.

Keeps the recent sync wins (json-comp gzip cache, route-slice, /pipeline
short-circuit). Builds clean on the flat-state runtime (post-0.5.1 local V).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(vanilla-epoll): process-shared crud + json-comp caches (async X-Cache)

validate.sh failed `[crud cache-aside]: expected MISS then HIT, got MISS MISS`:
restoring the MDA2AV#32 async conversion brought back per-worker caches, but SO_REUSEPORT
routes the two probe GETs to different workers, so each MISSes its cold cache.
Move the crud + gz caches out of per-worker WorkerCtx into the shared SharedRO,
mutex-guarded (RwMutex) — the same process-shared model the sync path uses. Pool
stays per-worker (no lock). Builds clean; this unblocks the async benchmark.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(vanilla-epoll): un-starve the async pool (drop the >8 clamp) + decode_into

The MDA2AV#884 async-db regression was NOT the sub-ms-PG crossover — it was pool
starvation + load-shedding (per the regression analysis). DATABASE_MAX_CONN=256
across 16 workers should give 16 conns/worker, but a `min(8)` clamp forced 8 →
only 128 of the 256 budget used. With one-in-flight-per-conn and closed-loop load
(~64 client conns/worker), the 8-slot ceiling is hit constantly; park() then
SHEDS the overflow as an empty 200, so the closed-loop clients spin and real
throughput collapses to ~1 core (async-db -30%, fortunes -77%, api -75%). crud
was unaffected (+208%) only because it is cache-HIT served and never touches the
pool.

Fixes:
1. Drop the >8 clamp → use the full 256 budget = 16 conns/worker (2x in-flight),
   sized to Postgres max_connections.
2. Adopt request_parser.decode_into (no `!HttpRequest` boxing, ~13% of parse) —
   the same no-boxing entry the sync build now uses; recovers the json-comp/json
   non-DB delta that the async dispatch path was paying.

Builds clean (pg_async, post-0.5.1 local V; vanilla MDA2AV#44 with decode_into is now
in main). Follow-ups (not here): queue on pool-full instead of shedding empty
200s; hoist AsyncCtx out of the async_drain per-request loop (vanilla).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* feat(vanilla-epoll): pipeline async-db queries across requests (increment 3)

Wire the framework onto vanilla's cross-request pipelining. park() now picks the
least-loaded connection via acquire_pipelined() (shed only when all conns are at
the max_inflight cap) instead of acquire()'s one-in-flight-per-conn — the latter
starved the per-worker pool (2 conns/worker on the arena box ⇒ ceiling of 2, then
park sheds the overflow as empty 200s; PR MDA2AV#884's regression). async_submit's
shed-on-full bool is now checked. on_db_ready drops the exclusive release: a
pipelined connection is not held exclusively, its in-flight slot frees when
async_on_readable pops the reply, and the reactor (per-fd watch queue) runs the
connection's parked requests front-first so the FIFO reply aligns with each
request's Stash. async-db/fortunes/crud-list all flow through park(), so all gain
conns×N concurrency. Builds against the local pipelining vanilla.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ci(vanilla-epoll): pin vanilla to the pipelining branch to bench before merge

TEMPORARY: point the Dockerfile clone + cache-bust ADD at vanilla's
feat/pg-async-pipelining (PR MDA2AV#45) so this arena PR builds against the
cross-request pipelining library and can benchmark before MDA2AV#45 merges. Revert to
refs/heads/main once MDA2AV#45 lands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(vanilla-epoll): restore /pipeline skip-decode fast path

Local profiling (callgrind on a gcc-built binary, isolated gcannon) showed this
branch lost the `has_pipeline_prefix` fast path the MDA2AV#877 branch had: handle() ran
decode_into + parse_http1_request_line on EVERY request, so the highest-RPS
/pipeline test paid the full HTTP parse (~55% of the per-request CPU; the
in-handle parse alone ~17%) for a fixed response.

Restore it: match the raw `GET /pipeline ` prefix and blit pipeline_resp BEFORE
any parsing. The request is already framed by the reactor, so decode adds nothing
here. After: the per-request /pipeline profile has ZERO parse functions, and local
throughput rose from 90% to 96% of the bare-C epoll floor (gc none). The remaining
gap + the boehm-vs-gc-none delta is per-request allocation/GC (next).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
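
The skip-decode fast path above reduces to a prefix check ahead of any parsing. In this Python sketch the headers in `PIPELINE_RESP` and the `parse_and_dispatch` callback are assumptions; only the `GET /pipeline ` prefix and the "ok" body come from the commit messages.

```python
from typing import Callable

PIPELINE_PREFIX = b"GET /pipeline "
# Full response built once at import time; the header set is assumed.
PIPELINE_RESP = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 2\r\n"
    b"\r\n"
    b"ok"
)


def handle(raw: bytes, out: bytearray,
           parse_and_dispatch: Callable[[bytes, bytearray], None]) -> None:
    # Match the raw request prefix before any HTTP parsing happens.
    if raw.startswith(PIPELINE_PREFIX):
        out += PIPELINE_RESP  # copy the precomputed response and return
        return
    parse_and_dispatch(raw, out)  # every other route pays the normal parse
```

The trailing space in the prefix keeps a path such as `/pipelines` from matching.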

* ci(vanilla-epoll): build against vanilla main (pipelining + recv-path merged)

vanilla#45 (cross-request pipelining: driver + reactor + pool + the alloc-free
recv path) is merged to vanilla main, so drop the temporary feat/pg-async-pipelining
pin and clone main again (with the main-ref cache-bust). Keeps -gc none. The
async-db +329% / pipelined +1622% results were against this exact code (main now
== the merged branch), so no re-bench needed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(vanilla-epoll): arm pooled PG conns with watch_persistent

Park DB queries on the pooled connection via ac.watch_persistent so a
client disconnecting mid-query no longer closes (and forces a reconnect +
re-auth on) the pooled connection: the runtime drains the orphaned reply
in order and keeps the conn open for reuse. Both the initial park and the
not-ready re-arm use it (the single-watch path resets the slot, so the
re-arm must re-stamp the pool-owned flag).

Depends on vanilla watch_persistent (enghitalo/vanilla#47); land after it
merges + vanilla main is updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(vanilla-epoll): render DB responses into a reused per-worker buffer

render_async_db / render_fortunes / render_crud_list each allocated a fresh
response-body []u8 (4 KiB / 32 KiB / 8 KiB) per request. The binary ships
`-gc none`, so those buffers are never freed — a multi-GiB leak under DB load
(async-db measured ~12 KiB/request total, tens of GiB on the arena).

Build the body in a single per-worker scratch buffer (WorkerCtx.scratch), reset
to len 0 each response; it grows to a high-water mark then stays. Safe because a
worker serves one request at a time (no concurrency). Paired with the pg_async
per-connection frames-buffer pool, this takes async-db from 11,971 -> 1,263
bytes/request leaked under `-gc none` (-89.5%) locally; the Boehm build is dead
flat (0 B/req, 41 MiB) and `-gc none` is now FASTER than Boehm on async-db
(48.8K vs 44.2K req/s) since it no longer thrashes an ever-growing heap.

(render_fortunes still allocates its Fortune vector + per-row message copies;
that and the residual ~1.3 KiB/request of async-db allocs are a follow-up.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(vanilla-epoll): zero-alloc DB request path (Stage B, -gc none)

Eliminate the remaining per-request heap allocations on the DB routes (which leak
under the binary's `-gc none` build). With the pg_async/request_parser Stage B
(reused submit scratch, no .bytes() in the wire builders, no-alloc query parse),
async-db drops from 1,263 -> 159 bytes/request (11,971 -> 159 across Stage A+B,
-98.7%); the Boehm build is dead flat.

- Bind params: replace the per-request `[?[]u8(x.str().bytes()), ...]` literals in
  every start_* with reused per-worker buffers — param_scratch (int params as
  decimal bytes) + params_buf (the []?[]u8), refilled via push_int/push_bytes. The
  borrowed slices are copied by write_bind synchronously inside park, so they never
  outlive the call. param_scratch cap (256) ≫ worst case (5×20) so it never
  reallocates mid-request (which would dangle already-pushed slices).
- Query parse: qint parses i64 in place (parse_i64_slice); qstr_slice returns a
  borrowed []u8 view instead of .clone(). Shed-path fallbacks are module consts
  (no per-request `.bytes()`).
- Stash: a per-worker free-list (stash_pool) instead of `&Stash{}` per request;
  returned only on the terminal .done path — never on the not-ready re-arm, where
  it stays live as the watch udata (incl. a FIX 3 dead tombstone). Statement form
  for the borrow (a `&Struct{}` if-expression branch miscompiles under -g, #27485).
- /fortunes: reused fortunes_buf with BORROWED message views (no bytestr().clone()),
  an explicit byte comparator for the sort, and escape_html_into that escapes
  directly into the render scratch (no Builder/string per row).

Gated: all routes correct vs PG18 (async-db, fortunes sort+escape incl. the
<script> XSS row, crud list/get/create/update round-trip, cache MISS->HIT,
baseline11). Builds clean -prod and -g. Needs the pg_async/request_parser Stage B
(enghitalo/vanilla); land after that merges.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
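
The per-worker free-list for Stash objects mentioned above follows a common pattern, sketched here in Python. The `Stash` fields and the method names are hypothetical; the commit specifies only that a stash is returned on the terminal path and never on a re-arm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stash:
    """Hypothetical per-request state carried across park/resume."""
    kind: int = 0
    limit: int = 0


class StashPool:
    """Per-worker free-list: reuse Stash objects instead of allocating."""

    def __init__(self) -> None:
        self._free: List[Stash] = []

    def take(self) -> Stash:
        return self._free.pop() if self._free else Stash()

    def give_back(self, stash: Stash) -> None:
        # Only call on the terminal path; a re-armed watch still owns it.
        stash.kind = 0
        stash.limit = 0
        self._free.append(stash)
```

No lock is needed under the assumption that each worker owns its own pool.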

* fix(vanilla-epoll): handle i64::MIN in wi() integer formatter

wi()'s itoa negated n into a signed accumulator (`x = -x`), which overflows
for i64::MIN — its magnitude isn't representable as i64 — leaving x negative
so the digit loop never ran and only '-' was emitted. Build the magnitude in
u64 instead (-(n+1)+1, with the +1 done in u64).

Unreachable from current routes (ids are 32-bit, query ints are clamped) so
this is not a live bug, but it's a latent correctness hole in a general
integer helper. Verified against i64::MIN/MAX, 0, and assorted +/- values.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
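
Python integers do not overflow, so the sketch below simulates 64-bit two's-complement arithmetic explicitly to show why `x = -x` fails for i64::MIN and why building the magnitude in u64 works. All helper names are hypothetical.

```python
I64_MIN = -(1 << 63)
U64_MASK = (1 << 64) - 1


def wrap_i64(x: int) -> int:
    """Reduce a Python int to two's-complement i64, as fixed-width math would."""
    x &= U64_MASK
    return x - (1 << 64) if x >= (1 << 63) else x


def magnitude_buggy(n: int) -> int:
    # `x = -x` in a signed accumulator: for i64::MIN it wraps back to itself.
    return wrap_i64(-n) if n < 0 else n


def magnitude_u64(n: int) -> int:
    # -(n + 1) fits in i64 for every negative n; the final +1 happens in u64.
    if n < 0:
        return (wrap_i64(-(n + 1)) + 1) & U64_MASK
    return n
```

With the buggy form the accumulator stays negative, which is why the digit loop never ran and only '-' was emitted.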

* perf(vanilla-epoll): zero-alloc int responses on /baseline11 and /upload

Both built their body with `n.str()` — an int->string heap allocation on every
request that leaks under -gc none. callgrind on /baseline11 showed 1.002
allocs/request, all impl_i64_to_string; at 3.4M RPS that path alone was ~6 GiB
RSS in the arena. Format the int into the reused per-worker scratch via a new
emit_int() helper instead (the same render-scratch pattern the DB paths use).

The read DB paths (async-db, fortunes, crud-list) are already callgrind-clean
per request; this closes the general/plaintext response path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(vanilla-epoll): 503 (not 400/404) when the crud DB pool sheds under load

Under DB-pipeline saturation, park() returns the caller's fallback. The read
paths shed to a benign empty 200, but crud create/update/get shed to
bad_request (400) / not_found (404) — misreporting a backpressure shed as a
client error. At arena scale (4096 conns) this surfaced as ~1.4% "unexpected
status" on crud, while every other framework AND the previously-recorded
vanilla-epoll showed 0 with the SAME gcannon (its requests are well-formed —
confirmed by faithful fixture replay: 100% 2xx at 16 and 96 threads, fresh-conn
and keep-alive+reconnect; reproduced the 400 only by forcing pool saturation
with DATABASE_MAX_CONN=1).

Only the shed fallback for the three crud write/get paths becomes 503 Service
Unavailable — the honest backpressure status. Genuine 400 (malformed JSON body)
and 404 (missing item) are unchanged.

Reducing the shed itself (pool / max_inflight capacity, which trades memory) and
a holistic shed policy (read paths also shed to an empty 200) are deferred to a
measured investigation tracked upstream in vanilla.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf+fix(vanilla-epoll): zero-alloc chunked-body parse on /baseline11 (reused scratch)

/baseline11's chunked-POST path parsed the body with `dechunk(s string) string`,
which built a strings.Builder + .str() (and strconv_hex's trim_space) per request
— a permanent leak under -gc none. In the arena baseline mix (1/3 of the
templates are chunked POSTs, at ~3.8M req/s) this was the ~6 GiB RSS the MDA2AV#888
re-bench still showed after emit_int closed the GET path.

Replace with `(mut w WorkerCtx) body_int()` dechunking into a reused per-worker
scratch (WorkerCtx.dechunk_buf): byte-walk the chunked region (dechunk_into),
append data bytes via push_many, parse the integer in place (parse_i64_slice).
No allocation in steady state. Verified byte-identical to the old dechunk on
valid bodies (single/multi-chunk, hex/uppercase sizes, 0x100, chunk-extensions)
and safe on malformed input.

Also hardens a latent OOB read / DoS: the chunk-size range check is now
overflow-safe — `size > end - data_start` instead of `data_start + size > end`,
which wraps i32 for a crafted chunk size near 0x7fffffff, slipping past the guard
and feeding a ~2 GiB out-of-bounds read into push_many (remotely-triggerable
worker segfault). The old dechunk hit the same input as a bounds-checked panic;
this makes it a controlled, bounded reject. parse_hex_slice now accumulates in
i64 and saturates so the size value itself can't wrap. A 5-lens adversarial
review (correctness, scratch-reuse lifetime, zero-alloc, method/caller parity)
returned GO after this fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
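
The two forms of the chunk-size guard named above behave differently only when the sum wraps. The Python sketch below simulates 32-bit signed arithmetic to show the difference; the helper names and the saturation cap are assumptions, and the hex parser expects already-validated hex digits.

```python
def wrap_i32(x: int) -> int:
    """Reduce a Python int to two's-complement i32."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x


def rejects_wrapping(data_start: int, size: int, end: int) -> bool:
    """Old guard: `data_start + size > end`, with the sum computed in i32."""
    return wrap_i32(data_start + size) > end


def rejects_safe(data_start: int, size: int, end: int) -> bool:
    """New guard: `size > end - data_start`.

    The subtraction cannot overflow while 0 <= data_start <= end.
    """
    return size > end - data_start


def parse_hex_saturating(digits: bytes, cap: int = 0x7FFFFFFF) -> int:
    """Hex parse that clamps at cap instead of wrapping (cap is assumed)."""
    n = 0
    for b in digits:
        n = n * 16 + int(chr(b), 16)
        if n > cap:
            return cap
    return n
```

A crafted size of 0x7fffffff with `data_start=16` and `end=64` slips past the old guard because the sum wraps negative, while the new guard rejects it.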

* Benchmark results: vanilla-epoll

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>