Run parameters: 3 runs per (library, scenario), per-scenario worker counts and durations, Postgres 16 on docker-compose. Host: Apple Silicon (darwin/arm64), Go 1.26.0. Libraries: River v0.35.0, rstudio/platform-lib v3.0.6. Date: 2026-04-21. Runs analyzed: 54 across 18 scenario × library pairs.
This is the third benchmark pass, combining:
- Variance pass: 3 runs per cell across 7 scenarios (original 5 +
steady_under/steady_balanced). - Item 1: platform-lib adapter now supports exponential backoff (
ExponentialBackoffRiverLike), andrate_limit_pressurewas re-run with it enabled for apples-to-apples retry-shape comparison. - Item 2: new
noisy_neighbor_saturatedscenario (100 Hz / 10 workers / 90% tenant skew) to force real starvation pressure. - Item 3: new
high_scalescenario (300 Hz / 100 workers / 10 tenants, under-capacity) to test LISTEN/NOTIFY scaling with absolute rate. - Items 4 (
crash_recovery) and 5 (rscacheintegration) are documented indocs/FUTURE.mdas deferred with concrete design notes.
| Finding | Where it matters | Evidence | Confidence |
|---|---|---|---|
| LISTEN/NOTIFY responsiveness — platform-lib 10–60× faster on pickup latency when not saturated | notify_latency, steady_under, high_scale, noisy_neighbor |
p95 gap consistent 16–63×, tight run-to-run variance | High |
| Saturated throughput advantage — platform-lib completes 25–30% more jobs in the same wall-clock under backlog | burst, steady_over, noisy_neighbor_saturated |
Completions per second consistent across runs | Medium — may be adapter-implementation sensitive |
| LISTEN/NOTIFY advantage scales with rate — at 300 Hz the gap widens to 63×, not narrows | high_scale |
4.5ms vs 285ms p95 at 300 Hz | High |
| Retry shape, not count — both libraries discard the same jobs; their backoff shapes differ by design | rate_limit_pressure (both configurations) |
Identical failure counts; p95 1000× different by retry policy | High, but the backoff implementations are not identical-by-construction |
| Natural FIFO fairness — both libraries produce ~1.0 fairness ratios even under saturated skew | noisy_neighbor, noisy_neighbor_saturated |
Per-tenant p95 near equal; FIFO by design | High |
| Goroutine footprint — River uses ~2× more goroutines than platform-lib | All scenarios | Maxes consistent across runs | High |
Biggest architectural distinction: platform-lib prioritizes pickup latency responsiveness; River prioritizes durability, leader-election resilience, and operational batteries-included. Both are correct choices for their designs.
When workers are not saturated (service rate > enqueue rate), pickup latency is the LISTEN/NOTIFY round-trip cost. Across four distinct scenarios at different rates:
| Scenario | platform-lib p95 | River p95 | Ratio |
|---|---|---|---|
notify_latency (20 Hz, zero-work) |
5.7 ms | 102.7 ms | 18× |
steady_under (30 Hz, 20 workers) |
6.3 ms | 101.5 ms | 16× |
noisy_neighbor (50 Hz, 20 workers) |
4.7 ms | 92.8 ms | 20× |
high_scale (300 Hz, 100 workers) |
4.5 ms | 285.0 ms | 63× |
high_scale is the new data point. At 6× higher enqueue rate than the other unsaturated scenarios:
- platform-lib's pickup p95 actually improves (4.5 ms vs 5.7 ms) — higher rate keeps the LISTEN/NOTIFY path warm.
- River's pickup p95 degrades (285 ms vs 102 ms) — more jobs arrive between polling ticks, so the last-arrived in each tick waits longer.
The LISTEN/NOTIFY advantage scales with load. It's not an "at idle" effect; it's structural.
| Library | Scenario | Runs | Throughput (c/s) | Pickup p50 | Pickup p95 (med, min–max) | Pickup p99 |
|---|---|---|---|---|---|---|
| platform-lib | notify_latency |
3 | 10.0 | 4.0 | 5.7 (5.7–7.1) | 7.3 |
| platform-lib | steady_under |
3 | 15.0 | 3.5 | 6.3 (5.3–7.7) | 8.3 |
| platform-lib | noisy_neighbor |
3 | 24.9 | 2.8 | 4.7 (4.6–6.6) | 7.2 |
| platform-lib | high_scale |
3 | 149.5 | 2.0 | 4.5 (4.3–4.7) | 6.1 |
| River | notify_latency |
3 | 10.0 | 51.4 | 102.7 (102.1–103.2) | 105.1 |
| River | steady_under |
3 | 15.0 | 37.9 | 101.5 (101.4–101.7) | 104.8 |
| River | noisy_neighbor |
3 | 25.0 | 44.9 | 92.8 (90.7–95.3) | 103.8 |
| River | high_scale |
3 | 149.9 | 71.1 | 285.0 (276.5–291.4) | 315.4 |
All run-to-run spreads are within ±5% of median.
When enqueue rate exceeds service rate, pickup latency is dominated by queue-wait time, not library coordination. But platform-lib still consistently delivers more work per unit time.
| Scenario | Lib | Throughput (c/s) | Pickup p95 | Completed |
|---|---|---|---|---|
steady_over (50 Hz / 10 workers) |
platform-lib | 24.9 | 6973 ms | 1499 |
| River | 25.0 | 15927 ms | 1499 | |
burst (50 Hz + 10× spike × 5s / 10 workers) |
platform-lib | 40.8 | 48015 ms | 2574 |
| River | 32.8 | 48953 ms | 2081 | |
noisy_neighbor_saturated (100 Hz / 10 workers / 90% skew) |
platform-lib | 39.2 | 33261 ms | 2480 |
| River | 31.0 | 39416 ms | 1916 |
In burst and noisy_neighbor_saturated, platform-lib completes 25–30% more jobs in the same wall-clock. The pickup p95 numbers are in the 30-50 second range for both libraries because the queue has built up tens of seconds of backlog.
Why more completions? Under saturation, every worker slot that's not waiting on LISTEN/NOTIFY handoff is one more job completed. platform-lib's LISTEN/NOTIFY-first dispatch keeps workers busier. River's polling interval includes idle gaps that compound across thousands of jobs.
Caveat: these numbers are sensitive to adapter-implementation choices. My platform-lib QueueStore uses FOR UPDATE SKIP LOCKED with a fast-path idle count — decisions a production-tuned store might make differently. A batch-pop store could produce different throughput numbers.
With one tenant enqueuing 90% of 100 Hz against 10 workers (saturated), does FIFO still produce fair latencies?
| Library | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| platform-lib | 1.18 | 1.17 | 1.18 |
| River | 0.97 | 0.97 | 0.98 |
A ratio near 1.0 means no starvation. Notably, platform-lib's 1.18 means the noisy tenant has slightly slower p95 than the quiet tenant — the opposite of starvation. This is because noisy-tenant enqueues are time-clustered at the tail of the run (they keep arriving at 90 Hz until t=30s), while quiet-tenant enqueues are sparse, so the quiet tenant's last enqueue lands earlier in the run and therefore experiences less queue wait. FIFO is temporally fair.
River's 0.97 is essentially perfect fairness — the small deviation is sampling noise.
Neither library has per-tenant priority or fairness; both rely on FIFO. Under saturation, FIFO produces proportionate delays that roughly match enqueue-time arrival. No tenant is systematically disadvantaged.
| Library | Total completed | Noisiest tenant completed | Share |
|---|---|---|---|
| platform-lib | 2480 | ~2236 | 90% |
| River | 1916 | ~1727 | 90% |
Both libraries complete work proportionate to enqueue share (the noisy tenant contributed 90% of enqueues and gets 90% of completions). This is a direct consequence of FIFO + no per-tenant worker allocation.
For true fairness enforcement you'd need to add a per-tenant worker cap or weighted fair-queuing on top of either library. Neither provides it natively.
Both libraries configured with MaxAttempts=3. platform-lib adapter tested both without backoff (immediate re-enqueue) and with ExponentialBackoffRiverLike (matches River's 1s/16s/81s spacing).
| Config | Completed | Failed events | Discarded | Pickup p50 | Pickup p95 |
|---|---|---|---|---|---|
| platform-lib (immediate retry) | 626 | 2454 | 273 | 4.5 ms | 15.9 ms |
| platform-lib (River-like backoff) | 626 | 2457 | 273 | 4.0 ms | 17017 ms |
| River (native backoff) | 626 | 2457 | 273 | 621.8 ms | 18693 ms |
With retry shapes matched, the outcome counts are identical. What differs:
- platform-lib p50 stays at ~4ms even with backoff; River's p50 is ~622ms.
- platform-lib p95 is slightly lower than River's (17.0s vs 18.7s) at matched backoff, with much tighter variance (17016–17017 vs 18633–18837).
My platform-lib adapter implements backoff with a goroutine-based time.AfterFunc pattern — when a job fails and is scheduled for retry, the worker slot is freed immediately and the retry fires on a background goroutine. Fresh enqueues don't contend with retry-waiting jobs for worker attention.
River implements backoff via Postgres scheduled_at — the retry job sits in the DB with a future timestamp and is eligible to be claimed by any worker once that timestamp has passed. This approach is durable across process restarts (a crash loses nothing) but means fresh enqueues at t+1s are competing with retry jobs for worker attention, pushing up pickup p50.
Both are defensible designs:
- Goroutine-based backoff (platform-lib adapter): lower pickup latency, but retry is lost if the process dies during backoff. Appropriate for non-durable retry semantics.
- DB-based backoff (River native): slightly higher pickup latency under retry pressure, but durable across restarts. Appropriate for production systems where retry-loss is unacceptable.
Caveat: this goroutine-based backoff is an adapter choice, not platform-lib's native behavior. A production platform-lib user building retry semantics would likely choose DB-based backoff (same pattern as River) for durability. This benchmark measures my implementation, not platform-lib's inherent approach.
| Scenario | platform-lib goroutines | River goroutines | platform-lib RSS MB | River RSS MB |
|---|---|---|---|---|
notify_latency |
14 | 40 | 18.2 | 18.3 |
steady_under |
36 | 62 | 18.6 | 18.5 |
noisy_neighbor |
44 | 72 | 18.8 | 19.2 |
steady_over |
35 | 60 | 19.1 | 18.8 |
steady_balanced |
55 | 74 | 19.3 | 19.1 |
burst |
35 | 60 | 23.1 | 19.1 |
noisy_neighbor_saturated |
35 | 60 | 19.0 | 19.1 |
rate_limit_pressure (no backoff) |
32 | 56 | 19.0 | 19.3 |
rate_limit_pressure (backoff) |
193 | 58 | 19.0 | 19.3 |
high_scale |
182 | 198 | 35.9 | 24.4 |
Two observations:
- River consistently runs ~2× more goroutines than platform-lib in most scenarios. Likely due to River's periodic/leader-election/metrics goroutines. Roughly flat across load.
- platform-lib goroutines spike under retry-with-backoff and under high enqueue rate. This reflects my adapter's goroutine-per-backoff-job implementation — each scheduled retry creates a goroutine that sleeps then re-enqueues. At rate_limit with backoff, 273 discards × 3 attempts = 819 goroutines spawned across the run. A pool-based scheduler would cap this.
Based on this data, the case for platform-lib is strongest when:
- Low-latency pickup is a product requirement. Near-zero queue-idle latency matters: user-facing "optimistic action" paths, live-UI notifications, any scenario where a few hundred ms of queue lag is visible in UX.
- High rate of small jobs. The LISTEN/NOTIFY advantage widens with rate (63× at 300 Hz vs 18× at 20 Hz in this benchmark). If your workload is many-short-jobs, platform-lib's dispatch overhead stays flat.
- The architectural adjacents fit. platform-lib's unique value is the integrated cache + queue + broadcaster set. If you're building a system that wants all three unified, the ecosystem fit is better than bolting cache onto River.
The case for staying on River is strongest when:
- Adapter complexity matters. The River adapter in this repo is ~300 lines and uses
riverpgxv5.New(pool)for a production-tested Postgres driver plusrivermigrate.Migratefor automatic schema setup. The platform-lib adapter is ~1,080 lines because platform-lib ships theQueueStoreinterface but not a Postgres implementation — I had to hand-write ~440 lines of Postgres CRUD and a hand-maintained SQL schema. For a team that wants fewer moving parts, River gives you more out of the box. - Leader election is needed. During these benchmark runs, River's logs showed a live
leadership.Electorsubsystem (e.g.,Current leader stepping down because the reelection deadline elapsed). River ships this for multi-instance coordination; platform-lib treats it as a composition concern you'd wire on top ofrsnotify. - Durable retry / crash resilience matters. River's DB-based retry persists through restarts. platform-lib's retry story is "build your own" — this benchmark's adapter uses
time.AfterFunc-style goroutines which would lose retries on crash. A durable platform-lib retry would need a scheduler layer on top ofrsqueue.Push. - You're doing 100–1000 Hz steady-state and can tolerate 100–300 ms pickup latency. River's pickup at moderate rate is below 150ms p99 at lower rates and ~315ms p99 at 300 Hz — acceptable for most SaaS workloads.
What I am NOT claiming (and did not measure in this benchmark):
- Relative quality of middleware ecosystems, CLI tooling, metrics integrations, observability hooks, or documentation. River has
river-cliandrivermetrics; platform-lib hasrsqueue/metricswith typed hook interfaces you implement yourself. A head-to-head feature matrix would be a separate research task. - Transactional enqueue asymmetry. I initially implied platform-lib was weaker here, but
plqueue.Queue.WithDbTx(ctx, tx)supports the same pattern as River'sInsertTx. Both can enqueue atomically with an application transaction.
For Keavi specifically: the LISTEN/NOTIFY advantage matters (the SSE and conversational paths care about sub-100ms responsiveness), but River's out-of-the-box adapter simplicity and leader-election subsystem also matter (Keavi has 55 workers across many job types; building a custom QueueStore + scheduler + leader coordination is work you don't have to do today). A hybrid approach — stay on River for queuing, adopt platform-lib's cache module independently — remains the recommendation from the earlier audit. See project_future_platformlib_cache.md in Keavi's memory.
Deferred work is documented in docs/FUTURE.md with concrete design notes. Ordered by informational value vs. effort:
- Jitter in
ExponentialBackoffRiverLike— matches River's ±10% jitter so retry-shape is fully apples-to-apples. ~30 min. - Multi-process scenario — two adapter processes consuming the same queue. Exercises cross-process LISTEN/NOTIFY. ~1 hour.
crash_recovery— SIGKILL the worker process mid-burst; measure duplicates and resume time. Needs parent/child process split. ~2–4 hours.rscache+ AddressedPush integration — a capability benchmark of platform-lib's unique cache-integrated queue. Not head-to-head (River has no equivalent). ~2–4 hours.- DB-backed backoff in the platform-lib adapter — mirror River's durability model so backoff-enabled comparisons are truly apples-to-apples. Requires building a scheduler layer on top of rsqueue's
Push. ~1–2 hours.
See docs/METHODOLOGY.md for the workload model and metric definitions. Raw JSONL data lives in results/ (gitignored; reproducible via scripts/run-all.sh and scripts/run-items-2-3.sh).
Things this benchmark doesn't measure:
- Cross-process coordination (deferred).
- Crash resilience / recovery behavior (deferred).
- Real-world workload patterns (synthetic
time.Sleepapproximates work). - Long-running stability (30s runs — memory leaks / GC pressure over hours are invisible).
- Failure modes beyond the three simulated error classes.
Things this benchmark measures well:
- LISTEN/NOTIFY pickup latency, under four different pressure regimes, with tight run-to-run variance.
- Throughput differences at saturation.
- Retry-shape semantics (with the backoff asymmetry honestly documented).
- Natural fairness under tenant skew.
MIT. See LICENSE.