#888)
* vanilla: cache json-comp gzip responses + zero-alloc routing
json-comp recompressed the gzip body on EVERY request even though the output
for a given (count, m) is fully deterministic — and gzip CPU, not allocation,
dominates that profile. Cache the COMPLETE gzipped response per (count, m) and
append the cached copy on a hit (bounded map, RwMutex). The benchmark hits only
a handful of (count, m) pairs, so the cache stays tiny.
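A minimal sketch of the caching idea above, written in Rust rather than V so it can be checked in isolation. The names `ResponseCache`, `get_or_build` and the `MAX_ENTRIES` cap are hypothetical, and `build` stands in for "render the JSON, gzip it, prepend the headers"; the actual handler's structure may differ.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical cap; the commit only says the map is bounded.
const MAX_ENTRIES: usize = 64;

/// Complete, already-compressed responses keyed by (count, m).
pub struct ResponseCache {
    map: RwLock<HashMap<(u32, u32), Arc<Vec<u8>>>>,
}

impl ResponseCache {
    pub fn new() -> Self {
        ResponseCache { map: RwLock::new(HashMap::new()) }
    }

    /// Returns the cached response, calling `build` only on a miss.
    pub fn get_or_build<F>(&self, count: u32, m: u32, build: F) -> Arc<Vec<u8>>
    where
        F: FnOnce() -> Vec<u8>,
    {
        // Hit: shared read lock only, no compression work.
        if let Some(hit) = self.map.read().unwrap().get(&(count, m)) {
            return Arc::clone(hit);
        }
        // Miss: build outside the lock so readers are not stalled on gzip.
        let fresh = Arc::new(build());
        let mut map = self.map.write().unwrap();
        if let Some(existing) = map.get(&(count, m)) {
            // Another thread stored this key first; reuse its copy.
            return Arc::clone(existing);
        }
        // Bounded: once full, serve uncached instead of growing.
        if map.len() < MAX_ENTRIES {
            map.insert((count, m), Arc::clone(&fresh));
        }
        fresh
    }
}
```

On a hit the caller only appends the returned bytes to its write buffer, which is why the gzip cost disappears from the steady-state profile.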
Also route on the path WITHOUT allocating: a tos() view into the request buffer
instead of all_before('?')'s per-request string copy (one alloc per request on
the hot path), shaving GC churn off baseline/json too.
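The routing change can be illustrated with a borrowed split, sketched here in Rust under the assumption that the request target is available as a byte slice. `split_target` is a hypothetical name; it plays the role the `tos()` view plays in the V code, returning views into the original buffer instead of a copied string.

```rust
/// Splits a request target into (path, query) without copying.
/// Both results borrow from `target`; the query is empty when there is no '?'.
pub fn split_target(target: &[u8]) -> (&[u8], &[u8]) {
    match target.iter().position(|&b| b == b'?') {
        Some(i) => (&target[..i], &target[i + 1..]),
        // No '?': the whole target is the path, the query is an empty view.
        None => (target, &target[target.len()..]),
    }
}
```

Because both halves are views, the route match and the query parse can run with no allocation on the hot path.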
Local before/after (16-core loopback, gcannon, single listener):
json-comp 58K -> 390K req/s (+570%, 6.7x)
Correctness verified: gzip body decodes to the right items/count/total; the
cached response is byte-identical across requests; all other routes unchanged.
Applies to both the epoll and io_uring variants.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* vanilla: precompute query-key bytes + parse path ints in place
Remove the remaining small per-request allocations on the hot path:
• qint/qstr took a `string` key and called `key.bytes()` every request (one
[]u8 alloc per parameter — baseline parses a+b, async-db min+max+limit…).
Keys are now precomputed `const []u8` (qk_*), built once at init.
• /json/<n> and /crud/items/<id> parsed the id via route[n..].i64(), a
substring copy. parse_u_at() reads the digits straight from the path view.
Local before/after (16-core loopback) is within noise (baseline ~528K→530K,
json ~206K→212K) — these allocs are tiny next to the response builder #866
removed — but allocation scaled hard on the 64-core arena (json +322% there),
so this trims more GC churn for that environment at zero cost. Note: @[manualfree]
is a no-op under the GC build the arena uses (`v -prod` = Boehm GC; manualfree
only affects -autofree), so reducing allocations is the lever, not manualfree.
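As an illustration of parsing a path integer in place, the Rust sketch below reads digits directly from the path bytes at a given offset. It borrows the name `parse_u_at` from the commit, but the signature and the overflow handling are assumptions, not the V helper's actual definition.

```rust
/// Parses an unsigned decimal integer starting at byte offset `start`.
/// Stops at the first non-digit. Returns None when there is no digit at
/// `start`, when `start` is out of range, or when the value overflows u64.
pub fn parse_u_at(path: &[u8], start: usize) -> Option<u64> {
    let mut value: u64 = 0;
    let mut seen_digit = false;
    for &b in path.get(start..)? {
        if !b.is_ascii_digit() {
            break;
        }
        value = value.checked_mul(10)?.checked_add((b - b'0') as u64)?;
        seen_digit = true;
    }
    if seen_digit { Some(value) } else { None }
}
```

For `/json/<n>` the caller would pass the offset just past the `/json/` prefix, so no substring is created.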
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Benchmark results: vanilla-epoll
* vanilla: DB path — prepared statement (async-db) + single-pass HTML escape
Folds the DB-path work into this PR so everything lands together:
• async-db uses a PostgreSQL prepared statement (PQprepare/PQexecPrepared via
db.pg, lazily prepared per pooled connection) instead of exec_param_many's
per-request server-side SQL re-parse — local +9%.
• escape_html (fortunes) does ONE pass with a no-alloc fast path instead of
replace_each's five full-string passes — local +27% fortunes.
DB profiles remain bound by the stdlib db.pg driver (text protocol), so this
narrows the gap without closing it. Both backends.
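The single-pass escape with a no-alloc fast path might look like the following Rust sketch. The set of five escaped characters is an assumption (the commit only says the old code made five passes), and `Cow` is used to model "return the input untouched when nothing needs escaping".

```rust
use std::borrow::Cow;

/// Assumed escape set: & < > " '
fn needs_escape(b: u8) -> bool {
    matches!(b, b'&' | b'<' | b'>' | b'"' | b'\'')
}

/// Escapes HTML in one pass. Inputs with nothing to escape are returned
/// borrowed, so the common case allocates nothing.
pub fn escape_html(input: &str) -> Cow<'_, str> {
    let first = match input.bytes().position(needs_escape) {
        Some(i) => i,
        None => return Cow::Borrowed(input), // fast path
    };
    let mut out = String::with_capacity(input.len() + 16);
    // Everything before the first special byte is copied verbatim.
    out.push_str(&input[..first]);
    for ch in input[first..].chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            other => out.push(other),
        }
    }
    Cow::Owned(out)
}
```

Most fortune rows contain no special characters, so the fast path is the one that matters for throughput.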
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Benchmark results: vanilla-io_uring
* vanilla: armor hot-path byte writes against the V `<<` regression
Single-element array push (`arr << x`) is 4-7x slower on post-0.5.1 V
(vlang/v#27468) while bulk push_many, allocation and indexed writes are
unaffected. The two hot single-element `<<` sites are now bulk writes:
- wi() built integer digits with `out << tmp[i]` per digit; it now itoa's
back-to-front into the [20]u8 scratch and flushes with one push_many.
- write_json_response() pushed the item separator `,` and closing `}`
one byte at a time; the closing `}` is now fused with the separator
into a single '},' / '}' push_many.
Output is byte-identical (verified across counts 0..4096 and edge-value
integers). This makes the JSON hot path fast on both the 0.5.1 release and
current master, independent of the upstream codegen regression. Both
epoll and io_uring backends.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* vanilla: build pinned V 0.5.1 from source (pinned vc bootstrap)
Build V from source at the 0.5.1 tag instead of the prebuilt release zip.
Plain `make` can't build an old tag: its latest_vc step `git pull`s the
newest vlang/vc bootstrap, which no longer matches 0.5.1's vlib (fails with
`unknown ident \`native\``). So pin vc to the commit cut for 0.5.1
(vlang/vc f461dfeb = "[v:master] 0c3183c - V 0.5.1") and run make's own
bootstrap recipe (cc -> v1 -> v2 -> v). Drop curl/unzip from the build deps.
Pinned by tag, not a master commit, because post-0.5.1 master carries a
codegen regression (single-element array push 4-7x slower, vlang/v#27468).
Both backends; verified the source-built compiler serves /json and /pipeline
correctly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* vanilla-epoll: serve static assets with sendfile(2) (zero-copy)
The static handler copied each asset's full prebuilt response (up to ~300 KB)
into the per-connection write_buf every request — a userspace copy plus a large
*scanned* write_buf that grows the GC's stop-the-world cost at high conn counts
(why vanilla sat ~4x behind nginx/swerver on the static profile).
Preload each asset's fd once (O_RDONLY, page-cached, borrowed for the server's
life) and a precomputed response head; serve the head into write_buf and stream
the body zero-copy via core.queue_file (sendfile(2), already wired through the
epoll backend's deferred-send + EPOLLOUT path). write_buf no longer grows, the
body is never copied, and the kernel pushes file pages straight to the socket —
the same model nginx and swerver use.
Local (vendor.js 307 KB, 64c, wrk): 25.7K -> 59.3K req/s, 7.36 -> 16.97 GB/s
(2.3x). Output verified byte-identical (md5) incl. keep-alive. epoll only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* vanilla-epoll: answer /upload by Content-Length (engine drains large bodies)
The lib now streams (drains) request bodies larger than 1 MiB instead of
buffering them, so for a large upload req.body is empty — but the byte count the
upload profile wants is the declared Content-Length. Answer by
req.content_length() (falls back to the buffered body length when absent, which
also covers small bodies that still take the buffered path).
Depends on enghitalo/vanilla#31 (adds HttpRequest.content_length() + the engine
drain); the Dockerfile clones lib main, so that PR must merge before this builds.
Local (source-built V 0.5.1): upload single-conn 45 req/s / 907 MB/s, 32c 303
req/s / 6.1 GB/s — matching the top upload servers; RSS 14 MB (was ~1 GB
buffering). epoll only.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* ci: re-trigger validate — vanilla main now provides HttpRequest.content_length() (drain #31 merged); the prior run cloned vanilla before it landed
* refactor(vanilla-epoll): route write-buffer appends through wb()/push_many
Replace the remaining `out << <[]u8>` appends (static header, error consts, the
four crud_* results, and the json-comp gzip-cache hit/store) with a wb() helper
that calls push_many, uniform with the existing ws/wi. The bit-shift `<<` in the
gz-cache key is unrelated and kept as is.
Note: V already lowers `array << array` to array_push_many, so this is
codegen-neutral: a consistency / regression-safety change (the whole write path now
takes push_many's fast path explicitly, robust if `<<` ever regresses for arrays
the way the single-element path did, vlang/v#27468). The hot single-element `<<`
was already armored by ws/wi.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(vanilla-epoll): async-db via the native pg_async driver (enghitalo/vanilla#32)
Convert the framework from the blocking db.pg ConnectionPool to vanilla's native
async Postgres driver (pg_async, vanilla#39) on the epoll async runtime. The DB
endpoints now PARK on the PG socket (ac.watch) and resume in a continuation
instead of blocking a worker thread per query — closing the async-db gap (#32).
- ServerConfig: request_handler → async_handler + make_state. Each worker owns a
per-worker pg_async.PgPool (no cross-worker sharing, no locks) plus its own
cache-aside and json-comp caches; the dataset/prefixes/static assets stay
shared read-only.
- async-db, fortunes, crud (list/get/create/update) issue a query, park, and
render in a single resume continuation that switches on a small per-request
stash. crud_list folds page+total into ONE window-count query (count(*) OVER())
instead of two round-trips. crud_get keeps a per-worker cache-aside (X-Cache).
- DB responses are now hand-built (ws/wi/wb), and JSONB (tags) is emitted RAW
from its binary form — no json.encode reflection, no decode/re-encode.
- Drops the db.pg dependency entirely, so the framework also builds on master V
(master removed pg.ConnectionPool); the non-DB hot paths are unchanged.
Validated on V master against PostgreSQL 18 (items 100k + fortune 199): every
endpoint correct (async-db items incl. binary jsonb, sorted fortunes, crud
list/get/create/update, X-Cache). Throughput: async-db ~14.3k rps @ 4.35ms p50;
/json ~376k rps. Per-worker caches warm under load (a re-GET may MISS across
workers under SO_REUSEPORT — by design, vs the old shared+mutex cache).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* ci: re-trigger #877 — vanilla #40 merged (net-import 0.5.1 build fix)
The previous run failed because the framework's Docker build cloned vanilla main BEFORE
the fix landed: V's `net` declares C.socket with typed-enum params on the 0.5.1
tag, clashing with http_server.socket's int C.socket (socket_tcp.c.v). vanilla
PR #40 removes the net imports (socket_tcp → C.htons; pg_async → raw libc dial),
verified to compile under `v -prod .` on the true 0.5.1 tag for both
vanilla-epoll and vanilla-io_uring. This empty commit re-runs validate so it
re-clones the fixed vanilla.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(ci): cache-bust the vanilla clone so the build picks up library fixes
The Docker `RUN git clone … vanilla` layer was cached indefinitely on the
self-hosted runner, so re-runs kept building against a STALE vanilla checkout —
which is why #877 stayed red even after the build fix (vanilla #40) merged: the
build never re-cloned to get it.
Add `ADD https://api.github.com/.../refs/heads/main` before the clone in both
vanilla Dockerfiles. The fetched ref (main's SHA) changes whenever vanilla main
moves, invalidating this layer's cache and forcing a fresh clone. Adding the
step also re-clones on this build (new layer structure), so it now picks up #40.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(vanilla-epoll): share the crud + json-comp caches across workers
The crud cache-aside was per-worker (WorkerCtx), but validate.sh's crud check
does two GETs to /crud/items/42 and requires X-Cache MISS then HIT. With
SO_REUSEPORT the two requests land on different workers, so a per-worker cache
returns MISS both times → validation fails.
Move the cache-aside (and the json-comp gzip cache) into the process-shared
`Shared` (renamed from SharedRO), guarded by RwMutexes since workers are
separate threads — restoring the original shared-cache semantics. The async
Postgres pool stays per-worker (make_state); only the caches are shared.
Verified against the real pgdb-seed.sql + dataset.json: GET /crud/items/42 now
returns MISS then HIT; async-db (count=limit), crud list (5 items, total 9986,
page 1), and fortunes (202 <tr>) all match validate.sh's checks. Compiles under
`v -prod .` on the true V 0.5.1 tag.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* perf(vanilla-epoll): size the per-worker pool by usable cores, not host CPUs
Pair with vanilla's cpuset-aware max_thread_pool_size: compute the per-worker
Postgres pool size against core.max_thread_pool_size (usable cores) instead of
runtime.nr_cpus() (host count). Under api-N the engine now spawns N workers, so
per_worker = total/N gives a sane pool (e.g. 64/4=16, 64/16=4) instead of
64/128=1 — matching the async path's threads≈cores model.
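The sizing arithmetic above, as a small Rust sketch. `per_worker_pool` is a hypothetical name, and the floor of one connection per worker is an assumption inferred from the "64/128=1" figure.

```rust
/// Splits a total connection budget across workers.
/// Assumption: every worker keeps at least one connection.
pub fn per_worker_pool(total_conns: usize, workers: usize) -> usize {
    std::cmp::max(1, total_conns / std::cmp::max(1, workers))
}
```

With a budget of 64, four workers get 16 connections each and sixteen workers get 4 each, while 128 workers collapse to the 1-connection floor.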
Experiment for #32: test whether removing the 128-on-N-cores oversubscription
recovers the async-db / api-16 regression.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* revert(vanilla-epoll): restore the sync db.pg DB path (drop the async pg_async conversion)
Three clean, post-cache-bust benchmarks agree the native async pg_async path is a
net loss on the arena's LOCAL low-latency DB profiles: epoll-async vs io_uring-sync
showed sync winning api-4 ~4.9×, fortunes ~3.6×, api-16 ~1.6×, async-db ~1.2×
(io_uring even handicapped by the cpuset change). The async path is bound by DB
concurrency (pool conns) and never beats sync libpq's concurrency-via-threads for
sub-ms queries; cpuset tuning only traded api-16 for api-4. The only async win was
crud, which is cache-bound (skips the DB) — preserved by the sync framework too.
Restore main.v to the pre-conversion sync version (d1a0e73): db.pg ConnectionPool
+ request_handler, keeping ALL the sync-path wins (pipelined, static via sendfile,
upload streaming-drain, json-comp gzip cache, zero-alloc routing, shared X-Cache).
pg_async stays in the vanilla library as a capability for the case it actually wins
(latency-bound / network Postgres). The Dockerfile vanilla-clone cache-bust stays.
Verified: builds under `v -prod .` on the V 0.5.1 tag against current vanilla main.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* perf(vanilla-epoll): short-circuit /pipeline with a precomputed constant
The pipelined profile (the arena's highest-RPS test, ~35M rps) is a fixed
plaintext "ok". Match it on `target` immediately after the path is sliced and
blit a precomputed full-response constant + return — before the '?'-scan, the
route slice, and write_resp's 6-part piecewise header build. requests/pipeline.raw
is `GET /pipeline` with no query, so the exact-match is correct; the now-redundant
route=='/pipeline' arm is dropped from the dispatch chain.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* perf(vanilla-epoll): right-size the fortunes render builder (32KB -> content)
callgrind on the render path showed ~26% of its instructions were the zero-fill
of strings.new_builder(32768): a flat 32 KB block for a ~1.5 KB response,
re-zeroed on every request as the GC reuses it. Size it from the actual rows instead
(160 + 96/row + message bytes); an outlier grows once. The other builders here
were already content-sized.
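The sizing rule (160 + 96/row + message bytes) written out as a Rust sketch. `fortunes_capacity` is a hypothetical name; the constants are the ones quoted in this commit.

```rust
/// Estimated render size for the fortunes page:
/// 160 bytes of fixed markup, 96 bytes of per-row markup, plus the messages.
pub fn fortunes_capacity(messages: &[&str]) -> usize {
    let message_bytes: usize = messages.iter().map(|m| m.len()).sum();
    160 + 96 * messages.len() + message_bytes
}
```

A caller would pass the result to something like `String::with_capacity`, so only an unusually long row (for example one that expands heavily when escaped) triggers a single regrow.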
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* perf(vanilla-epoll): re-enable async-db (pg_async park/resume) on flat-state runtime
Restore the native-pg_async async DB path (per-worker PgPool via make_state;
async_handler; submit -> ac.watch(pg fd) -> .suspend -> single on_db_ready
continuation that pumps the result, renders by kind, releases the conn). This is
the proven #32 conversion (was reverted at 84f3dc9 because it REGRESSED the arena
on the old map-based reactor + per-request malloc), re-applied now that PR #41
replaced that with the flat fd-indexed reactor (no hashmap, no per-request alloc)
— the overhead that sank it is gone.
Why: fortunes (2,990 rps) / async-db (10,927) are capped at ~16-way concurrency
by sync thread-per-core blocking (the 64-conn pool sits 75% idle; CPU ~460% = 11
cores idle, waiting on PG). Park/resume frees the worker to keep many queries in
flight -> uses the whole pool -> the swerver (#1, 293k) model. This is the
EXPERIMENT to confirm #41 turns the old regression into a win; needs an arena run.
Keeps the recent sync wins (json-comp gzip cache, route-slice, /pipeline
short-circuit). Builds clean on the flat-state runtime (post-0.5.1 local V).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(vanilla-epoll): process-shared crud + json-comp caches (async X-Cache)
validate.sh failed `[crud cache-aside]: expected MISS then HIT, got MISS MISS`:
restoring the #32 async conversion brought back per-worker caches, but SO_REUSEPORT
routes the two probe GETs to different workers, so each MISSes its cold cache.
Move the crud + gz caches out of per-worker WorkerCtx into the shared SharedRO,
mutex-guarded (RwMutex) — the same process-shared model the sync path uses. Pool
stays per-worker (no lock). Builds clean; this unblocks the async benchmark.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(vanilla-epoll): un-starve the async pool (drop the >8 clamp) + decode_into
The #884 async-db regression was NOT the sub-ms-PG crossover — it was pool
starvation + load-shedding (per the regression analysis). DATABASE_MAX_CONN=256
across 16 workers should give 16 conns/worker, but a `min(8)` clamp forced 8 →
only 128 of the 256 budget used. With one-in-flight-per-conn and closed-loop load
(~64 client conns/worker), the 8-slot ceiling is hit constantly; park() then
SHEDS the overflow as an empty 200, so the closed-loop clients spin and real
throughput collapses to ~1 core (async-db -30%, fortunes -77%, api -75%). crud
was unaffected (+208%) only because it is cache-HIT served and never touches the
pool.
Fixes:
1. Drop the >8 clamp → use the full 256 budget = 16 conns/worker (2x in-flight),
sized to Postgres max_connections.
2. Adopt request_parser.decode_into (no `!HttpRequest` boxing, ~13% of parse) —
the same no-boxing entry the sync build now uses; recovers the json-comp/json
non-DB delta that the async dispatch path was paying.
Builds clean (pg_async, post-0.5.1 local V; vanilla #44 with decode_into is now
in main). Follow-ups (not here): queue on pool-full instead of shedding empty
200s; hoist AsyncCtx out of the async_drain per-request loop (vanilla).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(vanilla-epoll): pipeline async-db queries across requests (increment 3)
Wire the framework onto vanilla's cross-request pipelining. park() now picks the
least-loaded connection via acquire_pipelined() (shed only when all conns are at
the max_inflight cap) instead of acquire()'s one-in-flight-per-conn — the latter
starved the per-worker pool (2 conns/worker on the arena box ⇒ ceiling of 2, then
park sheds the overflow as empty 200s; PR #884's regression). async_submit's
shed-on-full bool is now checked. on_db_ready drops the exclusive release: a
pipelined connection is not held exclusively, its in-flight slot frees when
async_on_readable pops the reply, and the reactor (per-fd watch queue) runs the
connection's parked requests front-first so the FIFO reply aligns with each
request's Stash. async-db/fortunes/crud-list all flow through park(), so all gain
conns×N concurrency. Builds against the local pipelining vanilla.
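A simplified Rust model of the least-loaded selection described above. The `Pool` and `Conn` types and the `on_reply` method are hypothetical; only the policy (pick the connection with the fewest in-flight queries, shed when every connection is at the cap) is taken from the commit.

```rust
pub struct Conn {
    pub inflight: usize,
}

pub struct Pool {
    pub conns: Vec<Conn>,
    pub max_inflight: usize,
}

impl Pool {
    /// Reserves a slot on the least-loaded connection and returns its index.
    /// Returns None when every connection is at `max_inflight` (caller sheds).
    pub fn acquire_pipelined(&mut self) -> Option<usize> {
        let cap = self.max_inflight;
        let (idx, conn) = self
            .conns
            .iter_mut()
            .enumerate()
            .min_by_key(|(_, c)| c.inflight)?;
        if conn.inflight >= cap {
            return None;
        }
        conn.inflight += 1;
        Some(idx)
    }

    /// Frees one in-flight slot when a reply has been popped.
    pub fn on_reply(&mut self, idx: usize) {
        let conn = &mut self.conns[idx];
        conn.inflight = conn.inflight.saturating_sub(1);
    }
}
```

With two connections and a cap of N, the ceiling moves from 2 queries in flight to 2×N, which is the conns×N concurrency the commit refers to.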
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* ci(vanilla-epoll): pin vanilla to the pipelining branch to bench before merge
TEMPORARY: point the Dockerfile clone + cache-bust ADD at vanilla's
feat/pg-async-pipelining (PR #45) so this arena PR builds against the
cross-request pipelining library and can benchmark before #45 merges. Revert to
refs/heads/main once #45 lands.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* perf(vanilla-epoll): restore /pipeline skip-decode fast path
Local profiling (callgrind on a gcc-built binary, isolated gcannon) showed this
branch lost the `has_pipeline_prefix` fast path the #877 branch had: handle() ran
decode_into + parse_http1_request_line on EVERY request, so the highest-RPS
/pipeline test paid the full HTTP parse (~55% of the per-request CPU; the
in-handle parse alone ~17%) for a fixed response.
Restore it: match the raw `GET /pipeline ` prefix and blit pipeline_resp BEFORE
any parsing. The request is already framed by the reactor, so decode adds nothing
here. After: the per-request /pipeline profile has ZERO parse functions, and local
throughput rose from 90% to 96% of the bare-C epoll floor (gc none). The remaining
gap + the boehm-vs-gc-none delta is per-request allocation/GC (next).
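The skip-decode fast path reduces to a byte-prefix check ahead of the parser, sketched below in Rust. The function name is hypothetical and the exact bytes of the precomputed response are an assumption (the source only says it is a fixed plaintext "ok").

```rust
/// Raw prefix of the request line; the trailing space excludes longer paths.
const PIPELINE_PREFIX: &[u8] = b"GET /pipeline ";

/// Assumed shape of the precomputed full response.
const PIPELINE_RESP: &[u8] =
    b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 2\r\n\r\nok";

/// Appends the constant response and returns true when the request matches.
/// Returns false so the caller can fall through to the normal parse.
pub fn try_pipeline_fast_path(request: &[u8], out: &mut Vec<u8>) -> bool {
    if request.starts_with(PIPELINE_PREFIX) {
        out.extend_from_slice(PIPELINE_RESP);
        return true;
    }
    false
}
```

Since the reactor has already framed the request, the matching branch never needs the decoded request at all.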
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* ci(vanilla-epoll): build against vanilla main (pipelining + recv-path merged)
vanilla#45 (cross-request pipelining: driver + reactor + pool + the alloc-free
recv path) is merged to vanilla main, so drop the temporary feat/pg-async-pipelining
pin and clone main again (with the main-ref cache-bust). Keeps -gc none. The
async-db +329% / pipelined +1622% results were against this exact code (main now
== the merged branch), so no re-bench needed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* perf(vanilla-epoll): arm pooled PG conns with watch_persistent
Park DB queries on the pooled connection via ac.watch_persistent so a
client disconnecting mid-query no longer closes (and forces a reconnect +
re-auth on) the pooled connection: the runtime drains the orphaned reply
in order and keeps the conn open for reuse. Both the initial park and the
not-ready re-arm use it (the single-watch path resets the slot, so the
re-arm must re-stamp the pool-owned flag).
Depends on vanilla watch_persistent (enghitalo/vanilla#47); land after it
merges + vanilla main is updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* perf(vanilla-epoll): render DB responses into a reused per-worker buffer
render_async_db / render_fortunes / render_crud_list each allocated a fresh
response-body []u8 (4 KiB / 32 KiB / 8 KiB) per request. The binary ships
`-gc none`, so those buffers are never freed — a multi-GiB leak under DB load
(async-db measured ~12 KiB/request total, tens of GiB on the arena).
Build the body in a single per-worker scratch buffer (WorkerCtx.scratch), reset
to len 0 each response; it grows to a high-water mark then stays. Safe because a
worker serves one request at a time (no concurrency). Paired with the pg_async
per-connection frames-buffer pool, this takes async-db from 11,971 -> 1,263
bytes/request leaked under `-gc none` (-89.5%) locally; the Boehm build is dead
flat (0 B/req, 41 MiB) and `-gc none` is now FASTER than Boehm on async-db
(48.8K vs 44.2K req/s) since it no longer thrashes an ever-growing heap.
(render_fortunes still allocates its Fortune vector + per-row message copies;
that and the residual ~1.3 KiB/request of async-db allocs are a follow-up.)
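The reuse pattern can be sketched in Rust as follows. `WorkerCtx` and `scratch` mirror the names in this commit, but the `render` body is a stand-in that joins rows with newlines, not the real response builder.

```rust
pub struct WorkerCtx {
    scratch: Vec<u8>,
}

impl WorkerCtx {
    pub fn new() -> Self {
        WorkerCtx { scratch: Vec::new() }
    }

    /// Builds a body in the reused buffer and returns a view of it.
    /// `clear` resets the length to 0 but keeps the allocation, so the
    /// buffer grows to a high-water mark and then stops allocating.
    pub fn render(&mut self, rows: &[&str]) -> &[u8] {
        self.scratch.clear();
        for row in rows {
            self.scratch.extend_from_slice(row.as_bytes());
            self.scratch.push(b'\n');
        }
        &self.scratch
    }

    pub fn capacity(&self) -> usize {
        self.scratch.capacity()
    }
}
```

The returned view is only valid until the next `render` call, which matches the one-request-at-a-time assumption stated above.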
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* perf(vanilla-epoll): zero-alloc DB request path (Stage B, -gc none)
Eliminate the remaining per-request heap allocations on the DB routes (which leak
under the binary's `-gc none` build). With the pg_async/request_parser Stage B
(reused submit scratch, no .bytes() in the wire builders, no-alloc query parse),
async-db drops from 1,263 -> 159 bytes/request (11,971 -> 159 across Stage A+B,
-98.7%); the Boehm build is dead flat.
- Bind params: replace the per-request `[?[]u8(x.str().bytes()), ...]` literals in
every start_* with reused per-worker buffers — param_scratch (int params as
decimal bytes) + params_buf (the []?[]u8), refilled via push_int/push_bytes. The
borrowed slices are copied by write_bind synchronously inside park, so they never
outlive the call. param_scratch cap (256) ≫ worst case (5×20) so it never
reallocates mid-request (which would dangle already-pushed slices).
- Query parse: qint parses i64 in place (parse_i64_slice); qstr_slice returns a
borrowed []u8 view instead of .clone(). Shed-path fallbacks are module consts
(no per-request `.bytes()`).
- Stash: a per-worker free-list (stash_pool) instead of `&Stash{}` per request;
returned only on the terminal .done path, never on the not-ready re-arm, where
it stays live as the watch udata (incl. a FIX 3 dead tombstone). The borrow uses
statement form because a `&Struct{}` if-expression branch miscompiles under -g (#27485).
- /fortunes: reused fortunes_buf with BORROWED message views (no bytestr().clone()),
an explicit byte comparator for the sort, and escape_html_into that escapes
directly into the render scratch (no Builder/string per row).
Gated: all routes correct vs PG18 (async-db, fortunes sort+escape incl. the
<script> XSS row, crud list/get/create/update round-trip, cache MISS->HIT,
baseline11). Builds clean -prod and -g. Needs the pg_async/request_parser Stage B
(enghitalo/vanilla); land after that merges.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(vanilla-epoll): handle i64::MIN in wi() integer formatter
wi()'s itoa negated n into a signed accumulator (`x = -x`), which overflows
for i64::MIN — its magnitude isn't representable as i64 — leaving x negative
so the digit loop never ran and only '-' was emitted. Build the magnitude in
u64 instead (-(n+1)+1, with the +1 done in u64).
Unreachable from current routes (ids are 32-bit, query ints are clamped) so
this is not a live bug, but it's a latent correctness hole in a general
integer helper. Verified against i64::MIN/MAX, 0, and assorted +/- values.
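A Rust sketch of the same fix: format back-to-front into a 20-byte scratch and compute the magnitude in u64 so the minimum value never overflows. `write_int` is a hypothetical stand-in for wi().

```rust
/// Appends the decimal form of `n` to `out` with a single bulk append.
pub fn write_int(out: &mut Vec<u8>, n: i64) {
    // 19 digits plus an optional sign fit in 20 bytes.
    let mut scratch = [0u8; 20];
    let mut pos = scratch.len();
    // -(n + 1) is always representable for negative n; the final +1 is
    // done in u64, so i64::MIN yields 2^63 instead of overflowing.
    let mut mag: u64 = if n < 0 { (-(n + 1)) as u64 + 1 } else { n as u64 };
    loop {
        pos -= 1;
        scratch[pos] = b'0' + (mag % 10) as u8;
        mag /= 10;
        if mag == 0 {
            break;
        }
    }
    if n < 0 {
        pos -= 1;
        scratch[pos] = b'-';
    }
    out.extend_from_slice(&scratch[pos..]);
}
```

Negating in the signed type first is the failure mode described above: the magnitude of the minimum value does not fit, so the digit loop sees a non-positive number.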
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* perf(vanilla-epoll): zero-alloc int responses on /baseline11 and /upload
Both built their body with `n.str()` — an int->string heap allocation on every
request that leaks under -gc none. callgrind on /baseline11 showed 1.002
allocs/request, all impl_i64_to_string; at 3.4M RPS that path alone was ~6 GiB
RSS in the arena. Format the int into the reused per-worker scratch via a new
emit_int() helper instead (the same render-scratch pattern the DB paths use).
The read DB paths (async-db, fortunes, crud-list) are already callgrind-clean
per request; this closes the general/plaintext response path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(vanilla-epoll): 503 (not 400/404) when the crud DB pool sheds under load
Under DB-pipeline saturation, park() returns the caller's fallback. The read
paths shed to a benign empty 200, but crud create/update/get shed to
bad_request (400) / not_found (404) — misreporting a backpressure shed as a
client error. At arena scale (4096 conns) this surfaced as ~1.4% "unexpected
status" on crud, while every other framework AND the previously-recorded
vanilla-epoll showed 0 with the SAME gcannon. Its requests are well-formed:
faithful fixture replay gave 100% 2xx at 16 and 96 threads (fresh-conn and
keep-alive+reconnect), and the 400 reproduced only when pool saturation was
forced with DATABASE_MAX_CONN=1.
Only the shed fallback for the three crud write/get paths becomes 503 Service
Unavailable — the honest backpressure status. Genuine 400 (malformed JSON body)
and 404 (missing item) are unchanged.
Reducing the shed itself (pool / max_inflight capacity, which trades memory) and
a holistic shed policy (read paths also shed to an empty 200) are deferred to a
measured investigation tracked upstream in vanilla.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* perf+fix(vanilla-epoll): zero-alloc chunked-body parse on /baseline11 (reused scratch)
/baseline11's chunked-POST path parsed the body with `dechunk(s string) string`,
which built a strings.Builder + .str() (and strconv_hex's trim_space) per request
— a permanent leak under -gc none. In the arena baseline mix (1/3 of the
templates are chunked POSTs, at ~3.8M req/s) this was the ~6 GiB RSS the #888
re-bench still showed after emit_int closed the GET path.
Replace with `(mut w WorkerCtx) body_int()` dechunking into a reused per-worker
scratch (WorkerCtx.dechunk_buf): byte-walk the chunked region (dechunk_into),
append data bytes via push_many, parse the integer in place (parse_i64_slice).
No allocation in steady state. Verified byte-identical to the old dechunk on
valid bodies (single/multi-chunk, hex/uppercase sizes, 0x100, chunk-extensions)
and safe on malformed input.
Also hardens a latent OOB read / DoS: the chunk-size range check is now
overflow-safe — `size > end - data_start` instead of `data_start + size > end`,
which wraps i32 for a crafted chunk size near 0x7fffffff, slipping past the guard
and feeding a ~2 GiB out-of-bounds read into push_many (remotely-triggerable
worker segfault). The old dechunk hit the same input as a bounds-checked panic;
this makes it a controlled, bounded reject. parse_hex_slice now accumulates in
i64 and saturates so the size value itself can't wrap. A 5-lens adversarial
review (correctness, scratch-reuse lifetime, zero-alloc, method/caller parity)
returned GO after this fix.
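The two hardening steps can be shown side by side in Rust. The function names other than `parse_hex_slice` are hypothetical, and `wrapping_add` is used only to model the fixed-width i32 wrap the commit describes.

```rust
/// Parses hex digits into i64, saturating at i64::MAX instead of wrapping.
/// Returns None for an empty slice or any non-hex byte.
pub fn parse_hex_slice(digits: &[u8]) -> Option<i64> {
    if digits.is_empty() {
        return None;
    }
    let mut value: i64 = 0;
    for &b in digits {
        let d = (b as char).to_digit(16)? as i64;
        value = value.saturating_mul(16).saturating_add(d);
    }
    Some(value)
}

/// The unsafe form: `data_start + size` wraps for a size near 0x7fffffff
/// and the wrapped (negative) sum passes the bound.
pub fn chunk_fits_wrapping(data_start: i32, size: i32, end: i32) -> bool {
    data_start.wrapping_add(size) <= end
}

/// The overflow-safe form: compare the size against the remaining bytes.
/// With 0 <= data_start <= end the subtraction cannot overflow.
pub fn chunk_fits(data_start: i32, size: i64, end: i32) -> bool {
    size >= 0
        && data_start >= 0
        && data_start <= end
        && size <= (end - data_start) as i64
}
```

A crafted size of 0x7fffffff passes the wrapping check but is rejected by the subtraction form, which is the difference between an out-of-bounds read and a bounded reject.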
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Benchmark results: vanilla-epoll
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Summary
Allocation/CPU-reduction and large-I/O follow-ups to the zero-alloc handler (#866), in
frameworks/vanilla-epoll/main.v (+ vanilla-io_uring where noted), consolidated into one PR.

Allocation / GC-pressure (both backends)
- json-comp: the output for a given (count, m) is deterministic and gzip CPU dominates, so the complete gzipped response is cached per (count, m) (bounded map + RwMutex) and appended on a hit. Compress once, reuse.
- Routing splits the path at `?` via a tos() view into the request buffer instead of all_before('?')'s per-request copy.
- qint/qstr take a precomputed const []u8 (qk_*) instead of key.bytes() per call; path ints (/json/<n>, /crud/items/<id>) are parsed in place (parse_u_at) instead of via route[n..] substrings.
- async-db uses PQprepare/PQexecPrepared (lazily prepared per pooled connection) instead of exec_param_many's per-request server-side SQL re-parse.
- escape_html (fortunes) does one pass with a no-alloc fast path instead of replace_each's five full-string passes.
- Single-element `out << b` in the JSON paths is replaced with push_many/indexed writes (wi itoa into a stack scratch + one push_many; fused object `},` separators). Single-element `<<` is 4–7× slower on post-0.5.1 V (vlang/v#27468); this keeps the hot path fast across V versions. Output is byte-identical (verified across counts 0..4096 + edge ints).

Large I/O (epoll only; io_uring is a follow-up)
- /upload answered by Content-Length (streaming drain): the engine now drains large request bodies into a fixed buffer (recv + discard) instead of buffering the whole 20 MB into a per-conn read buffer (which grew into a big scanned GC block and caused the 64-core stop-the-world cliff). The handler answers from req.content_length() (lib change merged as enghitalo/vanilla#31), falling back to body.len for small/buffered bodies (crud/json POSTs are unaffected; only >1 MiB drains). Local (vs the upload leader zix): single-conn 45 req/s / 895 MB/s (zix 45 / 903); 32-parallel 303 req/s / 6.1 GB/s (zix 336 / 6.7), within ~10%, so the 86× gap is closed; server RSS 12 MB (was ~928 MB while buffering).
- Static assets via sendfile(2): preload each asset's O_RDONLY fd + a precomputed response head; `out << f.header` then core.queue_file(f.fd, 0, f.size) streams the body zero-copy from the page cache, so the write buffer never grows into a large scanned block (the static "cliff"). Local vendor.js (307 KB, 64 conns): 25.7K → 59.3K req/s (2.3×), 7.36 → 16.97 GB/s; md5-identical, keep-alive OK.

Build
- V 0.5.1 is built from source with a pinned vc bootstrap (vlang/vc@f461dfeb): reproducible builds, and it avoids the post-0.5.1 master `<<` regression (#27468) that otherwise slows every path. Matches the-benchmarker pin (the-benchmarker/web-frameworks#9466).

Arena results (vanilla-epoll, 64-core, Δ vs #866's zero-alloc baseline)
Highlights: upload now leads the profile. The streaming drain (#7) took it from ~30 to 11,109 req/s, past the prior leader (~4.3K), at −89% memory. static +460% at −90% memory (sendfile(2), #8; the buffer-growth cliff is gone). json-comp leads (+1,380%, gzip-response cache). pipelined +1,407% (zero-alloc routing + `<<`-armor). The DB-bound profiles (api-4/16, async-db) are flat by design: #877 doesn't touch the DB engine; closing those is the async DB driver's job (below).

Lib dependencies (all merged to enghitalo/vanilla main)
- noscan_data flag behind `$if vanilla_noscan ?` so the lib builds on 0.5.1.
- The streaming drain (frame_head_len + HttpRequest.content_length()), which #7 here depends on.

The DB-bound profiles stay flat because they're driver-bound (db.pg's synchronous text protocol; veb on the same stack is slower, so vanilla already maxes it). Closing that gap is the async, epoll-integrated DB driver: the async runtime (ac.watch(fd, …) + continuations) is now merged to enghitalo/vanilla main (#34 → #36 → #37, Linux epoll + macOS kqueue), and a measured PoC showed ~6× per-thread on the DB path. The remaining piece, a thin db_async consumer on that runtime, is tracked in enghitalo/vanilla#32. Correctness verified for all paths here.

Supersedes #878 (folded in here).
🤖 Generated with Claude Code