vanilla-epoll: re-enable async-db (pg_async park/resume) on the flat-state runtime#884
Draft
enghitalo wants to merge 22 commits into
Draft
vanilla-epoll: re-enable async-db (pg_async park/resume) on the flat-state runtime#884enghitalo wants to merge 22 commits into
enghitalo wants to merge 22 commits into
Conversation
json-comp recompressed the gzip body on EVERY request even though the output
for a given (count, m) is fully deterministic — and gzip CPU, not allocation,
dominates that profile. Cache the COMPLETE gzipped response per (count, m) and
append the cached copy on a hit (bounded map, RwMutex). The benchmark hits only
a handful of (count, m) pairs, so the cache stays tiny.
Also route on the path WITHOUT allocating: a tos() view into the request buffer
instead of all_before('?')'s per-request string copy (one alloc per request on
the hot path), shaving GC churn off baseline/json too.
Local before/after (16-core loopback, gcannon, single listener):
json-comp 58K -> 390K req/s (+570%, 6.7x)
Correctness verified: gzip body decodes to the right items/count/total; the
cached response is byte-identical across requests; all other routes unchanged.
Applies to both the epoll and io_uring variants.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove the remaining small per-request allocations on the hot path:
• qint/qstr took a `string` key and called `key.bytes()` every request (one
[]u8 alloc per parameter — baseline parses a+b, async-db min+max+limit…).
Keys are now precomputed `const []u8` (qk_*), built once at init.
• /json/<n> and /crud/items/<id> parsed the id via route[n..].i64(), a
substring copy. parse_u_at() reads the digits straight from the path view.
Local before/after (16-core loopback) is within noise (baseline ~528K→530K,
json ~206K→212K) — these allocs are tiny next to the response builder MDA2AV#866
removed — but allocation scaled hard on the 64-core arena (json +322% there),
so this trims more GC churn for that environment at zero cost. Note: @[manualfree]
is a no-op under the GC build the arena uses (`v -prod` = Boehm GC; manualfree
only affects -autofree), so reducing allocations is the lever, not manualfree.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…scape
Folds the DB-path work into this PR so everything lands together:
• async-db uses a PostgreSQL prepared statement (PQprepare/PQexecPrepared via
db.pg, lazily prepared per pooled connection) instead of exec_param_many's
per-request server-side SQL re-parse — local +9%.
• escape_html (fortunes) does ONE pass with a no-alloc fast path instead of
replace_each's five full-string passes — local +27% fortunes.
DB profiles remain bound by the stdlib db.pg driver (text protocol), so this
narrows the gap without closing it. Both backends.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Single-element array push (`arr << x`) is 4-7x slower on post-0.5.1 V (vlang/v#27468) while bulk push_many, allocation and indexed writes are unaffected. The two hot single-element `<<` sites are now bulk writes: - wi() built integer digits with `out << tmp[i]` per digit; it now itoa's back-to-front into the [20]u8 scratch and flushes with one push_many. - write_json_response() pushed the item separator `,` and closing `}` one byte at a time; the closing `}` is now fused with the separator into a single '},' / '}' push_many. Output is byte-identical (verified across counts 0..4096 and edge-value integers). This makes the JSON hot path fast on both the 0.5.1 release and current master, independent of the upstream codegen regression. Both epoll and io_uring backends. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Build V from source at the 0.5.1 tag instead of the prebuilt release zip. Plain `make` can't build an old tag: its latest_vc step `git pull`s the newest vlang/vc bootstrap, which no longer matches 0.5.1's vlib (fails with `unknown ident \`native\``). So pin vc to the commit cut for 0.5.1 (vlang/vc f461dfeb = "[v:master] 0c3183c - V 0.5.1") and run make's own bootstrap recipe (cc -> v1 -> v2 -> v). Drop curl/unzip from the build deps. Pinned by tag, not a master commit, because post-0.5.1 master carries a codegen regression (single-element array push 4-7x slower, vlang/v#27468). Both backends; verified the source-built compiler serves /json and /pipeline correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The static handler copied each asset's full prebuilt response (up to ~300 KB) into the per-connection write_buf every request — a userspace copy plus a large *scanned* write_buf that grows the GC's stop-the-world cost at high conn counts (why vanilla sat ~4x behind nginx/swerver on the static profile). Preload each asset's fd once (O_RDONLY, page-cached, borrowed for the server's life) and a precomputed response head; serve the head into write_buf and stream the body zero-copy via core.queue_file (sendfile(2), already wired through the epoll backend's deferred-send + EPOLLOUT path). write_buf no longer grows, the body is never copied, and the kernel pushes file pages straight to the socket — the same model nginx and swerver use. Local (vendor.js 307 KB, 64c, wrk): 25.7K -> 59.3K req/s, 7.36 -> 16.97 GB/s (2.3x). Output verified byte-identical (md5) incl. keep-alive. epoll only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bodies) The lib now streams (drains) request bodies larger than 1 MiB instead of buffering them, so for a large upload req.body is empty — but the byte count the upload profile wants is the declared Content-Length. Answer by req.content_length() (falls back to the buffered body length when absent, which also covers small bodies that still take the buffered path). Depends on enghitalo/vanilla#31 (adds HttpRequest.content_length() + the engine drain); the Dockerfile clones lib main, so that PR must merge before this builds. Local (source-built V 0.5.1): upload single-conn 45 req/s / 907 MB/s, 32c 303 req/s / 6.1 GB/s — matching the top upload servers; RSS 14 MB (was ~1 GB buffering). epoll only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt_length() (drain MDA2AV#31 merged); the prior run cloned vanilla before it landed
…_many Replace the remaining `out << <[]u8>` appends (static header, error consts, the four crud_* results, and the json-comp gzip-cache hit/store) with a wb() helper that calls push_many, uniform with the existing ws/wi. The bit-shift `<<` in the gz-cache key is unrelated and kept as is. Note: V already lowers `array << array` to array_push_many, so this is codegen- neutral — a consistency / regression-safety change (the whole write path now takes push_many's fast path explicitly, robust if `<<` ever regresses for arrays the way the single-element path did, vlang/v#27468). The hot single-element `<<` was already armored by ws/wi. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lo/vanilla#32) Convert the framework from the blocking db.pg ConnectionPool to vanilla's native async Postgres driver (pg_async, vanilla#39) on the epoll async runtime. The DB endpoints now PARK on the PG socket (ac.watch) and resume in a continuation instead of blocking a worker thread per query — closing the async-db gap (MDA2AV#32). - ServerConfig: request_handler → async_handler + make_state. Each worker owns a per-worker pg_async.PgPool (no cross-worker sharing, no locks) plus its own cache-aside and json-comp caches; the dataset/prefixes/static assets stay shared read-only. - async-db, fortunes, crud (list/get/create/update) issue a query, park, and render in a single resume continuation that switches on a small per-request stash. crud_list folds page+total into ONE window-count query (count(*) OVER()) instead of two round-trips. crud_get keeps a per-worker cache-aside (X-Cache). - DB responses are now hand-built (ws/wi/wb), and JSONB (tags) is emitted RAW from its binary form — no json.encode reflection, no decode/re-encode. - Drops the db.pg dependency entirely, so the framework also builds on master V (master removed pg.ConnectionPool); the non-DB hot paths are unchanged. Validated on V master against PostgreSQL 18 (items 100k + fortune 199): every endpoint correct (async-db items incl. binary jsonb, sorted fortunes, crud list/get/create/update, X-Cache). Throughput: async-db ~14.3k rps @ 4.35ms p50; /json ~376k rps. Per-worker caches warm under load (a re-GET may MISS across workers under SO_REUSEPORT — by design, vs the old shared+mutex cache). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…1 build fix) The previous run failed because the framework's Docker cloned vanilla main BEFORE the fix landed: V's `net` declares C.socket with typed-enum params on the 0.5.1 tag, clashing with http_server.socket's int C.socket (socket_tcp.c.v). vanilla PR MDA2AV#40 removes the net imports (socket_tcp → C.htons; pg_async → raw libc dial), verified to compile under `v -prod .` on the true 0.5.1 tag for both vanilla-epoll and vanilla-io_uring. This empty commit re-runs validate so it re-clones the fixed vanilla. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ixes The Docker `RUN git clone … vanilla` layer was cached indefinitely on the self-hosted runner, so re-runs kept building against a STALE vanilla checkout — which is why MDA2AV#877 stayed red even after the build fix (vanilla MDA2AV#40) merged: the build never re-cloned to get it. Add `ADD https://api.github.com/.../refs/heads/main` before the clone in both vanilla Dockerfiles. The fetched ref (main's SHA) changes whenever vanilla main moves, invalidating this layer's cache and forcing a fresh clone. Adding the step also re-clones on this build (new layer structure), so it now picks up MDA2AV#40. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The crud cache-aside was per-worker (WorkerCtx), but validate.sh's crud check does two GETs to /crud/items/42 and requires X-Cache MISS then HIT. With SO_REUSEPORT the two requests land on different workers, so a per-worker cache returns MISS both times → validation fails. Move the cache-aside (and the json-comp gzip cache) into the process-shared `Shared` (renamed from SharedRO), guarded by RwMutexes since workers are separate threads — restoring the original shared-cache semantics. The async Postgres pool stays per-worker (make_state); only the caches are shared. Verified against the real pgdb-seed.sql + dataset.json: GET /crud/items/42 now returns MISS then HIT; async-db (count=limit), crud list (5 items, total 9986, page 1), and fortunes (202 <tr>) all match validate.sh's checks. Compiles under `v -prod .` on the true V 0.5.1 tag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…st CPUs Pair with vanilla's cpuset-aware max_thread_pool_size: compute the per-worker Postgres pool size against core.max_thread_pool_size (usable cores) instead of runtime.nr_cpus() (host count). Under api-N the engine now spawns N workers, so per_worker = total/N gives a sane pool (e.g. 64/4=16, 64/16=4) instead of 64/128=1 — matching the async path's threads≈cores model. Experiment for MDA2AV#32: test whether removing the 128-on-N-cores oversubscription recovers the async-db / api-16 regression. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… pg_async conversion) Three clean, post-cache-bust benchmarks agree the native async pg_async path is a net loss on the arena's LOCAL low-latency DB profiles: epoll-async vs io_uring-sync showed sync winning api-4 ~4.9×, fortunes ~3.6×, api-16 ~1.6×, async-db ~1.2× (io_uring even handicapped by the cpuset change). The async path is bound by DB concurrency (pool conns) and never beats sync libpq's concurrency-via-threads for sub-ms queries; cpuset tuning only traded api-16 for api-4. The only async win was crud, which is cache-bound (skips the DB) — preserved by the sync framework too. Restore main.v to the pre-conversion sync version (d1a0e73): db.pg ConnectionPool + request_handler, keeping ALL the sync-path wins (pipelined, static via sendfile, upload streaming-drain, json-comp gzip cache, zero-alloc routing, shared X-Cache). pg_async stays in the vanilla library as a capability for the case it actually wins (latency-bound / network Postgres). The Dockerfile vanilla-clone cache-bust stays. Verified: builds under `v -prod .` on the V 0.5.1 tag against current vanilla main. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The pipelined profile (the arena's highest-RPS test, ~35M rps) is a fixed plaintext "ok". Match it on `target` immediately after the path is sliced and blit a precomputed full-response constant + return — before the '?'-scan, the route slice, and write_resp's 6-part piecewise header build. requests/pipeline.raw is `GET /pipeline` with no query, so the exact-match is correct; the now-redundant route=='/pipeline' arm is dropped from the dispatch chain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…content) callgrind on the render path showed ~26% of its instructions were the zero-fill of strings.new_builder(32768) — a flat 32 KB block for a ~1.5 KB response, re- zeroed every request as the GC reuses it. Size it from the actual rows instead (160 + 96/row + message bytes); an outlier grows once. The other builders here were already content-sized. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t-state runtime Restore the native-pg_async async DB path (per-worker PgPool via make_state; async_handler; submit -> ac.watch(pg fd) -> .suspend -> single on_db_ready continuation that pumps the result, renders by kind, releases the conn). This is the proven MDA2AV#32 conversion (was reverted at 84f3dc9 because it REGRESSED the arena on the old map-based reactor + per-request malloc), re-applied now that PR MDA2AV#41 replaced that with the flat fd-indexed reactor (no hashmap, no per-request alloc) — the overhead that sank it is gone. Why: fortunes (2,990 rps) / async-db (10,927) are capped at ~16-way concurrency by sync thread-per-core blocking (the 64-conn pool sits 75% idle; CPU ~460% = 11 cores idle, waiting on PG). Park/resume frees the worker to keep many queries in flight -> uses the whole pool -> the swerver (#1, 293k) model. This is the EXPERIMENT to confirm MDA2AV#41 turns the old regression into a win; needs an arena run. Keeps the recent sync wins (json-comp gzip cache, route-slice, /pipeline short-circuit). Builds clean on the flat-state runtime (post-0.5.1 local V). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Author
|
/benchmark -f vanilla-epoll |
Contributor
|
👋 |
…ache) validate.sh failed `[crud cache-aside]: expected MISS then HIT, got MISS MISS`: restoring the MDA2AV#32 async conversion brought back per-worker caches, but SO_REUSEPORT routes the two probe GETs to different workers, so each MISSes its cold cache. Move the crud + gz caches out of per-worker WorkerCtx into the shared SharedRO, mutex-guarded (RwMutex) — the same process-shared model the sync path uses. Pool stays per-worker (no lock). Builds clean; this unblocks the async benchmark. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Author
|
/benchmark -f vanilla-epoll |
Contributor
|
👋 |
Contributor
Benchmark ResultsFramework:
Full log |
…code_into The MDA2AV#884 async-db regression was NOT the sub-ms-PG crossover — it was pool starvation + load-shedding (per the regression analysis). DATABASE_MAX_CONN=256 across 16 workers should give 16 conns/worker, but a `min(8)` clamp forced 8 → only 128 of the 256 budget used. With one-in-flight-per-conn and closed-loop load (~64 client conns/worker), the 8-slot ceiling is hit constantly; park() then SHEDS the overflow as an empty 200, so the closed-loop clients spin and real throughput collapses to ~1 core (async-db -30%, fortunes -77%, api -75%). crud was unaffected (+208%) only because it is cache-HIT served and never touches the pool. Fixes: 1. Drop the >8 clamp → use the full 256 budget = 16 conns/worker (2x in-flight), sized to Postgres max_connections. 2. Adopt request_parser.decode_into (no `!HttpRequest` boxing, ~13% of parse) — the same no-boxing entry the sync build now uses; recovers the json-comp/json non-DB delta that the async dispatch path was paying. Builds clean (pg_async, post-0.5.1 local V; vanilla MDA2AV#44 with decode_into is now in main). Follow-ups (not here): queue on pool-full instead of shedding empty 200s; hoist AsyncCtx out of the async_drain per-request loop (vanilla). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Author
|
/benchmark -f vanilla-epoll |
Contributor
|
👋 |
Contributor
Benchmark ResultsFramework:
Full log |
Contributor
Benchmark ResultsFramework:
Full log |
This was referenced Jun 17, 2026
MDA2AV
pushed a commit
that referenced
this pull request
Jun 19, 2026
#888) * vanilla: cache json-comp gzip responses + zero-alloc routing json-comp recompressed the gzip body on EVERY request even though the output for a given (count, m) is fully deterministic — and gzip CPU, not allocation, dominates that profile. Cache the COMPLETE gzipped response per (count, m) and append the cached copy on a hit (bounded map, RwMutex). The benchmark hits only a handful of (count, m) pairs, so the cache stays tiny. Also route on the path WITHOUT allocating: a tos() view into the request buffer instead of all_before('?')'s per-request string copy (one alloc per request on the hot path), shaving GC churn off baseline/json too. Local before/after (16-core loopback, gcannon, single listener): json-comp 58K -> 390K req/s (+570%, 6.7x) Correctness verified: gzip body decodes to the right items/count/total; the cached response is byte-identical across requests; all other routes unchanged. Applies to both the epoll and io_uring variants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * vanilla: precompute query-key bytes + parse path ints in place Remove the remaining small per-request allocations on the hot path: • qint/qstr took a `string` key and called `key.bytes()` every request (one []u8 alloc per parameter — baseline parses a+b, async-db min+max+limit…). Keys are now precomputed `const []u8` (qk_*), built once at init. • /json/<n> and /crud/items/<id> parsed the id via route[n..].i64(), a substring copy. parse_u_at() reads the digits straight from the path view. Local before/after (16-core loopback) is within noise (baseline ~528K→530K, json ~206K→212K) — these allocs are tiny next to the response builder #866 removed — but allocation scaled hard on the 64-core arena (json +322% there), so this trims more GC churn for that environment at zero cost. Note: @[manualfree] is a no-op under the GC build the arena uses (`v -prod` = Boehm GC; manualfree only affects -autofree), so reducing allocations is the lever, not manualfree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Benchmark results: vanilla-epoll * vanilla: DB path — prepared statement (async-db) + single-pass HTML escape Folds the DB-path work into this PR so everything lands together: • async-db uses a PostgreSQL prepared statement (PQprepare/PQexecPrepared via db.pg, lazily prepared per pooled connection) instead of exec_param_many's per-request server-side SQL re-parse — local +9%. • escape_html (fortunes) does ONE pass with a no-alloc fast path instead of replace_each's five full-string passes — local +27% fortunes. DB profiles remain bound by the stdlib db.pg driver (text protocol), so this narrows the gap without closing it. Both backends. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Benchmark results: vanilla-io_uring * vanilla: armor hot-path byte writes against the V `<<` regression Single-element array push (`arr << x`) is 4-7x slower on post-0.5.1 V (vlang/v#27468) while bulk push_many, allocation and indexed writes are unaffected. The two hot single-element `<<` sites are now bulk writes: - wi() built integer digits with `out << tmp[i]` per digit; it now itoa's back-to-front into the [20]u8 scratch and flushes with one push_many. - write_json_response() pushed the item separator `,` and closing `}` one byte at a time; the closing `}` is now fused with the separator into a single '},' / '}' push_many. Output is byte-identical (verified across counts 0..4096 and edge-value integers). This makes the JSON hot path fast on both the 0.5.1 release and current master, independent of the upstream codegen regression. Both epoll and io_uring backends. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * vanilla: build pinned V 0.5.1 from source (pinned vc bootstrap) Build V from source at the 0.5.1 tag instead of the prebuilt release zip. Plain `make` can't build an old tag: its latest_vc step `git pull`s the newest vlang/vc bootstrap, which no longer matches 0.5.1's vlib (fails with `unknown ident \`native\``). So pin vc to the commit cut for 0.5.1 (vlang/vc f461dfeb = "[v:master] 0c3183c - V 0.5.1") and run make's own bootstrap recipe (cc -> v1 -> v2 -> v). Drop curl/unzip from the build deps. Pinned by tag, not a master commit, because post-0.5.1 master carries a codegen regression (single-element array push 4-7x slower, vlang/v#27468). Both backends; verified the source-built compiler serves /json and /pipeline correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * vanilla-epoll: serve static assets with sendfile(2) (zero-copy) The static handler copied each asset's full prebuilt response (up to ~300 KB) into the per-connection write_buf every request — a userspace copy plus a large *scanned* write_buf that grows the GC's stop-the-world cost at high conn counts (why vanilla sat ~4x behind nginx/swerver on the static profile). Preload each asset's fd once (O_RDONLY, page-cached, borrowed for the server's life) and a precomputed response head; serve the head into write_buf and stream the body zero-copy via core.queue_file (sendfile(2), already wired through the epoll backend's deferred-send + EPOLLOUT path). write_buf no longer grows, the body is never copied, and the kernel pushes file pages straight to the socket — the same model nginx and swerver use. Local (vendor.js 307 KB, 64c, wrk): 25.7K -> 59.3K req/s, 7.36 -> 16.97 GB/s (2.3x). Output verified byte-identical (md5) incl. keep-alive. epoll only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * vanilla-epoll: answer /upload by Content-Length (engine drains large bodies) The lib now streams (drains) request bodies larger than 1 MiB instead of buffering them, so for a large upload req.body is empty — but the byte count the upload profile wants is the declared Content-Length. Answer by req.content_length() (falls back to the buffered body length when absent, which also covers small bodies that still take the buffered path). Depends on enghitalo/vanilla#31 (adds HttpRequest.content_length() + the engine drain); the Dockerfile clones lib main, so that PR must merge before this builds. Local (source-built V 0.5.1): upload single-conn 45 req/s / 907 MB/s, 32c 303 req/s / 6.1 GB/s — matching the top upload servers; RSS 14 MB (was ~1 GB buffering). epoll only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci: re-trigger validate — vanilla main now provides HttpRequest.content_length() (drain #31 merged); the prior run cloned vanilla before it landed * refactor(vanilla-epoll): route write-buffer appends through wb()/push_many Replace the remaining `out << <[]u8>` appends (static header, error consts, the four crud_* results, and the json-comp gzip-cache hit/store) with a wb() helper that calls push_many, uniform with the existing ws/wi. The bit-shift `<<` in the gz-cache key is unrelated and kept as is. Note: V already lowers `array << array` to array_push_many, so this is codegen- neutral — a consistency / regression-safety change (the whole write path now takes push_many's fast path explicitly, robust if `<<` ever regresses for arrays the way the single-element path did, vlang/v#27468). The hot single-element `<<` was already armored by ws/wi. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(vanilla-epoll): async-db via the native pg_async driver (enghitalo/vanilla#32) Convert the framework from the blocking db.pg ConnectionPool to vanilla's native async Postgres driver (pg_async, vanilla#39) on the epoll async runtime. The DB endpoints now PARK on the PG socket (ac.watch) and resume in a continuation instead of blocking a worker thread per query — closing the async-db gap (#32). - ServerConfig: request_handler → async_handler + make_state. Each worker owns a per-worker pg_async.PgPool (no cross-worker sharing, no locks) plus its own cache-aside and json-comp caches; the dataset/prefixes/static assets stay shared read-only. - async-db, fortunes, crud (list/get/create/update) issue a query, park, and render in a single resume continuation that switches on a small per-request stash. crud_list folds page+total into ONE window-count query (count(*) OVER()) instead of two round-trips. crud_get keeps a per-worker cache-aside (X-Cache). - DB responses are now hand-built (ws/wi/wb), and JSONB (tags) is emitted RAW from its binary form — no json.encode reflection, no decode/re-encode. - Drops the db.pg dependency entirely, so the framework also builds on master V (master removed pg.ConnectionPool); the non-DB hot paths are unchanged. Validated on V master against PostgreSQL 18 (items 100k + fortune 199): every endpoint correct (async-db items incl. binary jsonb, sorted fortunes, crud list/get/create/update, X-Cache). Throughput: async-db ~14.3k rps @ 4.35ms p50; /json ~376k rps. Per-worker caches warm under load (a re-GET may MISS across workers under SO_REUSEPORT — by design, vs the old shared+mutex cache). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci: re-trigger #877 — vanilla #40 merged (net-import 0.5.1 build fix) The previous run failed because the framework's Docker cloned vanilla main BEFORE the fix landed: V's `net` declares C.socket with typed-enum params on the 0.5.1 tag, clashing with http_server.socket's int C.socket (socket_tcp.c.v). vanilla PR #40 removes the net imports (socket_tcp → C.htons; pg_async → raw libc dial), verified to compile under `v -prod .` on the true 0.5.1 tag for both vanilla-epoll and vanilla-io_uring. This empty commit re-runs validate so it re-clones the fixed vanilla. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(ci): cache-bust the vanilla clone so the build picks up library fixes The Docker `RUN git clone … vanilla` layer was cached indefinitely on the self-hosted runner, so re-runs kept building against a STALE vanilla checkout — which is why #877 stayed red even after the build fix (vanilla #40) merged: the build never re-cloned to get it. Add `ADD https://api.github.com/.../refs/heads/main` before the clone in both vanilla Dockerfiles. The fetched ref (main's SHA) changes whenever vanilla main moves, invalidating this layer's cache and forcing a fresh clone. Adding the step also re-clones on this build (new layer structure), so it now picks up #40. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(vanilla-epoll): share the crud + json-comp caches across workers The crud cache-aside was per-worker (WorkerCtx), but validate.sh's crud check does two GETs to /crud/items/42 and requires X-Cache MISS then HIT. With SO_REUSEPORT the two requests land on different workers, so a per-worker cache returns MISS both times → validation fails. Move the cache-aside (and the json-comp gzip cache) into the process-shared `Shared` (renamed from SharedRO), guarded by RwMutexes since workers are separate threads — restoring the original shared-cache semantics. The async Postgres pool stays per-worker (make_state); only the caches are shared. Verified against the real pgdb-seed.sql + dataset.json: GET /crud/items/42 now returns MISS then HIT; async-db (count=limit), crud list (5 items, total 9986, page 1), and fortunes (202 <tr>) all match validate.sh's checks. Compiles under `v -prod .` on the true V 0.5.1 tag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): size the per-worker pool by usable cores, not host CPUs Pair with vanilla's cpuset-aware max_thread_pool_size: compute the per-worker Postgres pool size against core.max_thread_pool_size (usable cores) instead of runtime.nr_cpus() (host count). Under api-N the engine now spawns N workers, so per_worker = total/N gives a sane pool (e.g. 64/4=16, 64/16=4) instead of 64/128=1 — matching the async path's threads≈cores model. Experiment for #32: test whether removing the 128-on-N-cores oversubscription recovers the async-db / api-16 regression. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * revert(vanilla-epoll): restore the sync db.pg DB path (drop the async pg_async conversion) Three clean, post-cache-bust benchmarks agree the native async pg_async path is a net loss on the arena's LOCAL low-latency DB profiles: epoll-async vs io_uring-sync showed sync winning api-4 ~4.9×, fortunes ~3.6×, api-16 ~1.6×, async-db ~1.2× (io_uring even handicapped by the cpuset change). The async path is bound by DB concurrency (pool conns) and never beats sync libpq's concurrency-via-threads for sub-ms queries; cpuset tuning only traded api-16 for api-4. The only async win was crud, which is cache-bound (skips the DB) — preserved by the sync framework too. Restore main.v to the pre-conversion sync version (d1a0e73): db.pg ConnectionPool + request_handler, keeping ALL the sync-path wins (pipelined, static via sendfile, upload streaming-drain, json-comp gzip cache, zero-alloc routing, shared X-Cache). pg_async stays in the vanilla library as a capability for the case it actually wins (latency-bound / network Postgres). The Dockerfile vanilla-clone cache-bust stays. Verified: builds under `v -prod .` on the V 0.5.1 tag against current vanilla main. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): short-circuit /pipeline with a precomputed constant The pipelined profile (the arena's highest-RPS test, ~35M rps) is a fixed plaintext "ok". Match it on `target` immediately after the path is sliced and blit a precomputed full-response constant + return — before the '?'-scan, the route slice, and write_resp's 6-part piecewise header build. requests/pipeline.raw is `GET /pipeline` with no query, so the exact-match is correct; the now-redundant route=='/pipeline' arm is dropped from the dispatch chain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): right-size the fortunes render builder (32KB -> content) callgrind on the render path showed ~26% of its instructions were the zero-fill of strings.new_builder(32768) — a flat 32 KB block for a ~1.5 KB response, re- zeroed every request as the GC reuses it. Size it from the actual rows instead (160 + 96/row + message bytes); an outlier grows once. The other builders here were already content-sized. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): re-enable async-db (pg_async park/resume) on flat-state runtime Restore the native-pg_async async DB path (per-worker PgPool via make_state; async_handler; submit -> ac.watch(pg fd) -> .suspend -> single on_db_ready continuation that pumps the result, renders by kind, releases the conn). This is the proven #32 conversion (was reverted at 84f3dc9 because it REGRESSED the arena on the old map-based reactor + per-request malloc), re-applied now that PR #41 replaced that with the flat fd-indexed reactor (no hashmap, no per-request alloc) — the overhead that sank it is gone. Why: fortunes (2,990 rps) / async-db (10,927) are capped at ~16-way concurrency by sync thread-per-core blocking (the 64-conn pool sits 75% idle; CPU ~460% = 11 cores idle, waiting on PG). Park/resume frees the worker to keep many queries in flight -> uses the whole pool -> the swerver (#1, 293k) model. This is the EXPERIMENT to confirm #41 turns the old regression into a win; needs an arena run. Keeps the recent sync wins (json-comp gzip cache, route-slice, /pipeline short-circuit). Builds clean on the flat-state runtime (post-0.5.1 local V). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(vanilla-epoll): process-shared crud + json-comp caches (async X-Cache) validate.sh failed `[crud cache-aside]: expected MISS then HIT, got MISS MISS`: restoring the #32 async conversion brought back per-worker caches, but SO_REUSEPORT routes the two probe GETs to different workers, so each MISSes its cold cache. Move the crud + gz caches out of per-worker WorkerCtx into the shared SharedRO, mutex-guarded (RwMutex) — the same process-shared model the sync path uses. Pool stays per-worker (no lock). Builds clean; this unblocks the async benchmark. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(vanilla-epoll): un-starve the async pool (drop the >8 clamp) + decode_into The #884 async-db regression was NOT the sub-ms-PG crossover — it was pool starvation + load-shedding (per the regression analysis). DATABASE_MAX_CONN=256 across 16 workers should give 16 conns/worker, but a `min(8)` clamp forced 8 → only 128 of the 256 budget used. With one-in-flight-per-conn and closed-loop load (~64 client conns/worker), the 8-slot ceiling is hit constantly; park() then SHEDS the overflow as an empty 200, so the closed-loop clients spin and real throughput collapses to ~1 core (async-db -30%, fortunes -77%, api -75%). crud was unaffected (+208%) only because it is cache-HIT served and never touches the pool. Fixes: 1. Drop the >8 clamp → use the full 256 budget = 16 conns/worker (2x in-flight), sized to Postgres max_connections. 2. Adopt request_parser.decode_into (no `!HttpRequest` boxing, ~13% of parse) — the same no-boxing entry the sync build now uses; recovers the json-comp/json non-DB delta that the async dispatch path was paying. Builds clean (pg_async, post-0.5.1 local V; vanilla #44 with decode_into is now in main). Follow-ups (not here): queue on pool-full instead of shedding empty 200s; hoist AsyncCtx out of the async_drain per-request loop (vanilla). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(vanilla-epoll): pipeline async-db queries across requests (increment 3) Wire the framework onto vanilla's cross-request pipelining. park() now picks the least-loaded connection via acquire_pipelined() (shed only when all conns are at the max_inflight cap) instead of acquire()'s one-in-flight-per-conn — the latter starved the per-worker pool (2 conns/worker on the arena box ⇒ ceiling of 2, then park sheds the overflow as empty 200s; PR #884's regression). async_submit's shed-on-full bool is now checked. on_db_ready drops the exclusive release: a pipelined connection is not held exclusively, its in-flight slot frees when async_on_readable pops the reply, and the reactor (per-fd watch queue) runs the connection's parked requests front-first so the FIFO reply aligns with each request's Stash. async-db/fortunes/crud-list all flow through park(), so all gain conns×N concurrency. Builds against the local pipelining vanilla. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(vanilla-epoll): pin vanilla to the pipelining branch to bench before merge TEMPORARY: point the Dockerfile clone + cache-bust ADD at vanilla's feat/pg-async-pipelining (PR #45) so this arena PR builds against the cross-request pipelining library and can benchmark before #45 merges. Revert to refs/heads/main once #45 lands. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): restore /pipeline skip-decode fast path Local profiling (callgrind on a gcc-built binary, isolated gcannon) showed this branch lost the `has_pipeline_prefix` fast path the #877 branch had: handle() ran decode_into + parse_http1_request_line on EVERY request, so the highest-RPS /pipeline test paid the full HTTP parse (~55% of the per-request CPU; the in-handle parse alone ~17%) for a fixed response. Restore it: match the raw `GET /pipeline ` prefix and blit pipeline_resp BEFORE any parsing. The request is already framed by the reactor, so decode adds nothing here. After: the per-request /pipeline profile has ZERO parse functions, and local throughput rose from 90% to 96% of the bare-C epoll floor (gc none). The remaining gap + the boehm-vs-gc-none delta is per-request allocation/GC (next). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * ci(vanilla-epoll): build against vanilla main (pipelining + recv-path merged) vanilla#45 (cross-request pipelining: driver + reactor + pool + the alloc-free recv path) is merged to vanilla main, so drop the temporary feat/pg-async-pipelining pin and clone main again (with the main-ref cache-bust). Keeps -gc none. The async-db +329% / pipelined +1622% results were against this exact code (main now == the merged branch), so no re-bench needed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): arm pooled PG conns with watch_persistent Park DB queries on the pooled connection via ac.watch_persistent so a client disconnecting mid-query no longer closes (and forces a reconnect + re-auth on) the pooled connection: the runtime drains the orphaned reply in order and keeps the conn open for reuse. Both the initial park and the not-ready re-arm use it (the single-watch path resets the slot, so the re-arm must re-stamp the pool-owned flag). Depends on vanilla watch_persistent (enghitalo/vanilla#47); land after it merges + vanilla main is updated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): render DB responses into a reused per-worker buffer render_async_db / render_fortunes / render_crud_list each allocated a fresh response-body []u8 (4 KiB / 32 KiB / 8 KiB) per request. The binary ships `-gc none`, so those buffers are never freed — a multi-GiB leak under DB load (async-db measured ~12 KiB/request total, tens of GiB on the arena). Build the body in a single per-worker scratch buffer (WorkerCtx.scratch), reset to len 0 each response; it grows to a high-water mark then stays. Safe because a worker serves one request at a time (no concurrency). Paired with the pg_async per-connection frames-buffer pool, this takes async-db from 11,971 -> 1,263 bytes/request leaked under `-gc none` (-89.5%) locally; the Boehm build is dead flat (0 B/req, 41 MiB) and `-gc none` is now FASTER than Boehm on async-db (48.8K vs 44.2K req/s) since it no longer thrashes an ever-growing heap. (render_fortunes still allocates its Fortune vector + per-row message copies; that and the residual ~1.3 KiB/request of async-db allocs are a follow-up.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): zero-alloc DB request path (Stage B, -gc none) Eliminate the remaining per-request heap allocations on the DB routes (which leak under the binary's `-gc none` build). With the pg_async/request_parser Stage B (reused submit scratch, no .bytes() in the wire builders, no-alloc query parse), async-db drops from 1,263 -> 159 bytes/request (11,971 -> 159 across Stage A+B, -98.7%); the Boehm build is dead flat. - Bind params: replace the per-request `[?[]u8(x.str().bytes()), ...]` literals in every start_* with reused per-worker buffers — param_scratch (int params as decimal bytes) + params_buf (the []?[]u8), refilled via push_int/push_bytes. The borrowed slices are copied by write_bind synchronously inside park, so they never outlive the call. param_scratch cap (256) ≫ worst case (5×20) so it never reallocates mid-request (which would dangle already-pushed slices). - Query parse: qint parses i64 in place (parse_i64_slice); qstr_slice returns a borrowed []u8 view instead of .clone(). Shed-path fallbacks are module consts (no per-request `.bytes()`). - Stash: a per-worker free-list (stash_pool) instead of `&Stash{}` per request; returned only on the terminal .done path — never on the not-ready re-arm, where it stays live as the watch udata (incl. a FIX 3 dead tombstone). Statement form for the borrow (a `&Struct{}` if-expression branch miscompiles under -g, #27485). - /fortunes: reused fortunes_buf with BORROWED message views (no bytestr().clone()), an explicit byte comparator for the sort, and escape_html_into that escapes directly into the render scratch (no Builder/string per row). Gated: all routes correct vs PG18 (async-db, fortunes sort+escape incl. the <script> XSS row, crud list/get/create/update round-trip, cache MISS->HIT, baseline11). Builds clean -prod and -g. Needs the pg_async/request_parser Stage B (enghitalo/vanilla); land after that merges. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(vanilla-epoll): handle i64::MIN in wi() integer formatter wi()'s itoa negated n into a signed accumulator (`x = -x`), which overflows for i64::MIN — its magnitude isn't representable as i64 — leaving x negative so the digit loop never ran and only '-' was emitted. Build the magnitude in u64 instead (-(n+1)+1, with the +1 done in u64). Unreachable from current routes (ids are 32-bit, query ints are clamped) so this is not a live bug, but it's a latent correctness hole in a general integer helper. Verified against i64::MIN/MAX, 0, and assorted +/- values. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(vanilla-epoll): zero-alloc int responses on /baseline11 and /upload Both built their body with `n.str()` — an int->string heap allocation on every request that leaks under -gc none. callgrind on /baseline11 showed 1.002 allocs/request, all impl_i64_to_string; at 3.4M RPS that path alone was ~6 GiB RSS in the arena. Format the int into the reused per-worker scratch via a new emit_int() helper instead (the same render-scratch pattern the DB paths use). The read DB paths (async-db, fortunes, crud-list) are already callgrind-clean per request; this closes the general/plaintext response path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(vanilla-epoll): 503 (not 400/404) when the crud DB pool sheds under load Under DB-pipeline saturation, park() returns the caller's fallback. The read paths shed to a benign empty 200, but crud create/update/get shed to bad_request (400) / not_found (404) — misreporting a backpressure shed as a client error. At arena scale (4096 conns) this surfaced as ~1.4% "unexpected status" on crud, while every other framework AND the previously-recorded vanilla-epoll showed 0 with the SAME gcannon (its requests are well-formed — confirmed by faithful fixture replay: 100% 2xx at 16 and 96 threads, fresh-conn and keep-alive+reconnect; reproduced the 400 only by forcing pool saturation with DATABASE_MAX_CONN=1). Only the shed fallback for the three crud write/get paths becomes 503 Service Unavailable — the honest backpressure status. Genuine 400 (malformed JSON body) and 404 (missing item) are unchanged. Reducing the shed itself (pool / max_inflight capacity, which trades memory) and a holistic shed policy (read paths also shed to an empty 200) are deferred to a measured investigation tracked upstream in vanilla. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf+fix(vanilla-epoll): zero-alloc chunked-body parse on /baseline11 (reused scratch) /baseline11's chunked-POST path parsed the body with `dechunk(s string) string`, which built a strings.Builder + .str() (and strconv_hex's trim_space) per request — a permanent leak under -gc none. In the arena baseline mix (1/3 of the templates are chunked POSTs, at ~3.8M req/s) this was the ~6 GiB RSS the #888 re-bench still showed after emit_int closed the GET path. Replace with `(mut w WorkerCtx) body_int()` dechunking into a reused per-worker scratch (WorkerCtx.dechunk_buf): byte-walk the chunked region (dechunk_into), append data bytes via push_many, parse the integer in place (parse_i64_slice). No allocation in steady state. Verified byte-identical to the old dechunk on valid bodies (single/multi-chunk, hex/uppercase sizes, 0x100, chunk-extensions) and safe on malformed input. Also hardens a latent OOB read / DoS: the chunk-size range check is now overflow-safe — `size > end - data_start` instead of `data_start + size > end`, which wraps i32 for a crafted chunk size near 0x7fffffff, slipping past the guard and feeding a ~2 GiB out-of-bounds read into push_many (remotely-triggerable worker segfault). The old dechunk hit the same input as a bounds-checked panic; this makes it a controlled, bounded reject. parse_hex_slice now accumulates in i64 and saturates so the size value itself can't wrap. A 5-lens adversarial review (correctness, scratch-reuse lifetime, zero-alloc, method/caller parity) returned GO after this fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Benchmark results: vanilla-epoll --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Experiment — benchmark vs #877 (the sync baseline).
What
Switches the
vanilla-epollDB path from the synchronousdb.pgconnection pool back to the nativepg_asyncpark-and-resume model: per-workerPgPoolviamake_state, anasync_handlerthat submits the query andac.watch(pg_fd)→.suspend, and a singleon_db_readycontinuation that pumps the result, renders by kind, and releases the connection. Non-DB routes (/pipeline,/baseline11,/json,/static,/upload) return.doneimmediately.Why
On the sync path,
fortunes(~3.0k rps) andasync-db(~10.9k rps) are capped at ~16-way concurrency by thread-per-core blocking — each worker thread blocks onacquire → exec → release, so the 64-connection pool sits ~75% idle and the server runs at ~460% CPU (≈11 cores idle, parked inlibpq recvwaiting on Postgres). Park-and-resume frees the worker to keep many queries in flight → uses the whole pool → the model the currentasync-dbleader (swerver, ~370k) uses.Why now (it was reverted before)
This is the #32 conversion that was reverted (regressed the arena) because the old map-based reactor + a per-request malloc per watch cost more than the sync path won at sub-ms local-PG latency. enghitalo/vanilla#41 replaced that with a flat fd-indexed reactor (no hashmap, no per-request allocation) — the exact overhead that sank it. This PR is the experiment to confirm #41 turns the regression into a win.
Notes for review
/pipelineshort-circuit) plus the async DB re-conversion — the delta vs vanilla: json-comp cache, zero-alloc routing, DB prepared statement + HTML escape #877 is the DB path going async. Build verified clean against the flat-state runtime.validate.shbefore); only the runtime underneath changed.What to watch in the benchmark
fortunes,async-db,crud(target 4–12× as concurrency lifts from ~16 to pool size).baseline,json,json-comp(Add spring framework #2),pipelined,static(now ~1.2× from Metadata improvements to add more info #1) — these are non-DB and return.doneimmediately, so they should match vanilla: json-comp cache, zero-alloc routing, DB prepared statement + HTML escape #877./benchmark -f vanilla-epoll🤖 Generated with Claude Code