Benchmarking Walkthrough

This document walks you through running benchmarks and interpreting results. Your results will vary based on your hardware, but this gives you an idea of what to expect.

Test System

OS:     Linux 6.18.5 (NixOS)
CPU:    AMD Ryzen Threadripper PRO 3945WX 12-Cores
Cores:  12 physical, 24 logical (hyperthreading)
RAM:    128 GB
Go:     go1.25.5 linux/amd64

Step 1: Verify Installation

First, make sure everything builds and tests pass:

$ go build ./...
$ go test ./...

Expected output:

ok      github.com/randomizedcoder/some-go-benchmarks/internal/cancel   0.003s
ok      github.com/randomizedcoder/some-go-benchmarks/internal/combined 0.002s
ok      github.com/randomizedcoder/some-go-benchmarks/internal/queue    0.004s
ok      github.com/randomizedcoder/some-go-benchmarks/internal/tick     0.735s

Using the Makefile

The project includes a Makefile with convenient targets:

$ make help

Output:

Available targets:

Build & Test:
  build          - Build all packages
  test           - Run all tests
  race           - Run tests with race detector
  lint           - Run golangci-lint
  check          - Run build, test, and race

All Benchmarks:
  bench          - Run all benchmarks with memory stats
  bench-count    - Run benchmarks 10 times (for variance)
  bench-variance - Run benchmarks and save for benchstat
  bench-race     - Run benchmarks with race detector

Category Benchmarks:
  bench-cancel   - Cancel check: context vs atomic
  bench-tick     - Tick check: ticker implementations
  bench-queue    - Queue: single goroutine push+pop
  bench-pipeline - Pipeline: 2-goroutine SPSC producer/consumer
  bench-mpsc     - MPSC: N producers -> 1 consumer (channel contention)
  bench-lfr      - go-lock-free-ring comparison (SPSC vs MPSC)
  bench-combined - Combined loop: cancel + tick + queue

Cleanup:
  clean          - Remove generated files

Quick Start:

# Run all benchmarks
$ make bench

# Run specific category
$ make bench-lfr      # go-lock-free-ring comparison
$ make bench-mpsc     # Channel contention with multiple producers
$ make bench-pipeline # 2-goroutine producer/consumer

Step 2: Run Basic Benchmarks

Cancel Package

$ go test -bench=. -benchmem ./internal/cancel

Output:

goos: linux
goarch: amd64
pkg: github.com/randomizedcoder/some-go-benchmarks/internal/cancel
cpu: AMD Ryzen Threadripper PRO 3945WX 12-Cores
BenchmarkCancel_Context_Done_Direct-24          138030020                8.232 ns/op           0 B/op          0 allocs/op
BenchmarkCancel_Atomic_Done_Direct-24           1000000000               0.3575 ns/op          0 B/op          0 allocs/op
BenchmarkCancel_Context_Done_Interface-24       143021458                8.193 ns/op           0 B/op          0 allocs/op
BenchmarkCancel_Atomic_Done_Interface-24        1000000000               0.3751 ns/op          0 B/op          0 allocs/op
BenchmarkCancel_Context_Done_Parallel-24        1000000000               0.6508 ns/op          0 B/op          0 allocs/op
BenchmarkCancel_Atomic_Done_Parallel-24         1000000000               0.07654 ns/op         0 B/op          0 allocs/op
BenchmarkCancel_Atomic_Reset-24                 279049110                4.501 ns/op           0 B/op          0 allocs/op
PASS
ok      github.com/randomizedcoder/some-go-benchmarks/internal/cancel   7.361s

How to read this:

Column	Meaning
`-24`	Using 24 CPU threads (GOMAXPROCS)
`138030020`	Number of iterations run
`8.232 ns/op`	8.232 nanoseconds per operation
`0 B/op`	Zero bytes allocated per operation
`0 allocs/op`	Zero heap allocations per operation

Key insight: Atomic is 23x faster than Context (0.36 ns vs 8.23 ns)

Tick Package

$ go test -bench=. -benchmem ./internal/tick

Output:

BenchmarkTick_Std_Direct-24             13369196                86.24 ns/op           0 B/op          0 allocs/op
BenchmarkTick_Batch_Direct-24           209211277                5.627 ns/op          0 B/op          0 allocs/op
BenchmarkTick_Atomic_Direct-24          41821100                25.71 ns/op           0 B/op          0 allocs/op
BenchmarkTick_TSC_Direct-24             131311492                9.436 ns/op          0 B/op          0 allocs/op

Performance ranking:

Implementation	ns/op	Speedup vs Std
StdTicker	86.24	1x (baseline)
AtomicTicker	25.71	3.4x
TSCTicker	9.44	9.1x
BatchTicker	5.63	15.3x

Combined Benchmarks (Most Realistic)

$ go test -bench=. -benchmem ./internal/combined

Output:

BenchmarkCombined_CancelTick_Standard-24        13146752                90.10 ns/op            0 B/op          0 allocs/op
BenchmarkCombined_CancelTick_Optimized-24       45594999                26.75 ns/op            0 B/op          0 allocs/op
BenchmarkCombined_FullLoop_Standard-24           9150345               130.2 ns/op             0 B/op          0 allocs/op
BenchmarkCombined_FullLoop_Optimized-24         19513278                62.86 ns/op            0 B/op          0 allocs/op

Key insight: Combined optimizations give 2.1x speedup on the full loop (130 ns → 63 ns)

Queue Benchmarks: Goroutine Patterns

Queue performance varies dramatically based on goroutine topology.

Single Goroutine (No Contention)

$ go test -bench=BenchmarkQueue -benchmem ./internal/queue

Output:

goos: linux
goarch: amd64
pkg: github.com/randomizedcoder/some-go-benchmarks/internal/queue
cpu: AMD Ryzen Threadripper PRO 3945WX 12-Cores
BenchmarkQueue_Channel_PushPop_Direct-24           30932498                38.96 ns/op            0 B/op          0 allocs/op
BenchmarkQueue_RingBuffer_PushPop_Direct-24        32920832                35.89 ns/op            0 B/op          0 allocs/op
BenchmarkQueue_Channel_PushPop_Interface-24        27947314                43.26 ns/op            0 B/op          0 allocs/op
BenchmarkQueue_RingBuffer_PushPop_Interface-24     30313048                40.37 ns/op            0 B/op          0 allocs/op

Note: The in-repo RingBuffer has SPSC guards that add overhead. An unguarded ring buffer achieves ~9.5 ns/op.

SPSC: 1 Producer → 1 Consumer (2 Goroutines)

This is the classic producer/consumer pattern—the most common Go concurrency pattern:

$ go test -bench=BenchmarkPipeline -benchmem ./internal/combined

Output:

goos: linux
goarch: amd64
pkg: github.com/randomizedcoder/some-go-benchmarks/internal/combined
cpu: AMD Ryzen Threadripper PRO 3945WX 12-Cores
BenchmarkPipeline_Channel-24             7858700               127.9 ns/op            0 B/op           0 allocs/op
BenchmarkPipeline_RingBuffer-24         11740012               146.8 ns/op            0 B/op           0 allocs/op

Key insight: The guarded RingBuffer is slower than channels due to SPSC guard overhead. An unguarded lock-free ring buffer achieves ~39 ns/op (3.3x faster than channels).

MPSC: N Producers → 1 Consumer (Channel Lock Contention)

Multiple goroutines sending to one consumer shows channel lock contention:

$ go test -bench=BenchmarkMPSC -benchmem ./internal/combined

Output:

goos: linux
goarch: amd64
pkg: github.com/randomizedcoder/some-go-benchmarks/internal/combined
cpu: AMD Ryzen Threadripper PRO 3945WX 12-Cores
BenchmarkMPSC_Channel_2Producers-24       180842              5922 ns/op              0 B/op           0 allocs/op
BenchmarkMPSC_Channel_4Producers-24       119090             26351 ns/op              0 B/op           0 allocs/op
BenchmarkMPSC_Channel_8Producers-24       171520             49074 ns/op              0 B/op           0 allocs/op

Lock contention scaling:

Producers	Latency	vs 1 Producer
1 (SPSC)	128 ns	baseline
2	5.9 µs	46x slower
4	26 µs	200x slower
8	49 µs	380x slower

Key insight: Channel lock contention scales poorly. For high-throughput fan-in patterns, use go-lock-free-ring.

go-lock-free-ring Comparison

The go-lock-free-ring library provides a sharded MPSC ring buffer. Let's compare:

$ go test -bench=BenchmarkLFR -benchmem ./internal/combined

Output:

goos: linux
goarch: amd64
pkg: github.com/randomizedcoder/some-go-benchmarks/internal/combined
cpu: AMD Ryzen Threadripper PRO 3945WX 12-Cores
BenchmarkLFR_SPSC_Channel-24                     4372419               247.8 ns/op             0 B/op          0 allocs/op
BenchmarkLFR_SPSC_OurRing-24                    33405556                36.53 ns/op            0 B/op          0 allocs/op
BenchmarkLFR_SPSC_ShardedRing1-24               10212240               114.1 ns/op             8 B/op          1 allocs/op
BenchmarkLFR_MPSC_Channel_4P-24                    85386             35337 ns/op               0 B/op          0 allocs/op
BenchmarkLFR_MPSC_ShardedRing_4P_4S-24           2179492               539.4 ns/op           412 B/op         51 allocs/op
BenchmarkLFR_MPSC_Channel_8P-24                    58347             47067 ns/op               1 B/op          0 allocs/op
BenchmarkLFR_MPSC_ShardedRing_8P_8S-24           2596642               464.0 ns/op           412 B/op         51 allocs/op

SPSC Comparison (1 Producer → 1 Consumer):

These are cross-goroutine benchmarks (separate producer/consumer goroutines):

Implementation	Latency	Allocs	Speedup
Channel	248 ns	0	baseline
go-lock-free-ring (1 shard)	114 ns	1	2.2x
Our SPSC Ring	36.5 ns	0	6.8x

Note: go-lock-free-ring's native BenchmarkProducerConsumer shows 31 ns/op, but that's in the same goroutine. Cross-goroutine polling adds ~80ns coordination overhead.

Why Our SPSC Ring is Faster in Cross-Goroutine Tests:

CAS vs Store: go-lock-free-ring uses CompareAndSwap to safely handle multiple producers (3-10x slower than simple Store)
Sequence numbers: go-lock-free-ring uses per-slot sequence numbers to prevent race conditions (extra atomic ops)
Boxing: go-lock-free-ring uses any type causing allocations; our ring uses generics (zero allocs)

Tradeoff: Our ring is faster because it makes dangerous assumptions (single producer, x86 memory model). go-lock-free-ring is slower because it's provably race-free.

MPSC Comparison (N Producers → 1 Consumer):

Producers	Channel	go-lock-free-ring	Speedup
4	35.3 µs	539 ns	65x
8	47.1 µs	464 ns	101x

Key insight: go-lock-free-ring's sharded design eliminates lock contention, providing 65-101x speedup over channels for multi-producer scenarios!

Choosing the Right Queue:

Your Pattern	Best Choice	Why
1 producer, 1 consumer	Our SPSC Ring	Fastest (36.5 ns), zero allocs
N producers, 1 consumer	go-lock-free-ring	Sharding eliminates contention
Simple/infrequent	Channel	Simplicity over speed

Step 3: Use CLI Tools

The CLI tools provide easier-to-read output with throughput analysis.

Context Cancellation Comparison

$ go run ./cmd/context -n 5000000

Output:

Benchmarking cancellation check (5000000 iterations)
─────────────────────────────────────────────────

Results:
  Context:  43.74395ms (8.75 ns/op)
  Atomic:   1.640922ms (0.33 ns/op)

  Speedup:  26.66x

Throughput (theoretical max):
  Context:  114.30 M ops/sec
  Atomic:   3047.07 M ops/sec

Combined Cancel + Tick (Most Realistic)

$ go run ./cmd/context-ticker -n 5000000

Output:

Benchmarking combined cancel+tick check (5000000 iterations)
─────────────────────────────────────────────────────────

This simulates a hot loop that checks for cancellation
and periodic timing on every iteration:

  for {
      if cancel.Done() { return }
      if ticker.Tick() { doPeriodicWork() }
      processItem()
  }

Results:
─────────────────────────────────────────────────────────
  Standard (ctx + time.Ticker):
    Total: 465.769925ms, Per-op: 93.15 ns

  Optimized (atomic + AtomicTicker):
    Total: 134.594392ms, Per-op: 26.92 ns
    Speedup: 3.46x

  Ultra (atomic + BatchTicker):
    Total: 25.06717ms, Per-op: 5.01 ns
    Speedup: 18.58x

Impact Analysis:
─────────────────────────────────────────────────────────
  Savings per iteration: 66.24 ns

  At 100K ops/sec: save 6.62 ms/sec (0.66% of 1 core)
  At 1000K ops/sec: save 66.24 ms/sec (6.62% of 1 core)
  At 10000K ops/sec: save 662.35 ms/sec (66.24% of 1 core)

What this tells you:

At 1M operations/second, you save 66ms of CPU time per second
At 10M operations/second, you save 662ms — that's 66% of a CPU core!

Step 4: Variance Analysis

Run benchmarks multiple times to check consistency:

$ go test -bench=BenchmarkCancel_Atomic_Done_Direct -count=5 ./internal/cancel

Output:

BenchmarkCancel_Atomic_Done_Direct-24           1000000000               0.3794 ns/op
BenchmarkCancel_Atomic_Done_Direct-24           1000000000               0.4376 ns/op
BenchmarkCancel_Atomic_Done_Direct-24           1000000000               0.3601 ns/op
BenchmarkCancel_Atomic_Done_Direct-24           1000000000               0.3526 ns/op
BenchmarkCancel_Atomic_Done_Direct-24           1000000000               0.3450 ns/op

Analysis:

Range: 0.345 - 0.438 ns/op
Variance: ~27% (the 0.44 is an outlier)
Most results cluster around 0.35-0.38 ns

Tip: Use benchstat for statistical analysis:

$ go install golang.org/x/perf/cmd/benchstat@latest
$ go test -bench=. -count=10 ./internal/cancel > results.txt
$ benchstat results.txt

Step 5: Environment Tuning

With GOMAXPROCS=1

Reduce Go scheduler noise by using a single thread:

$ GOMAXPROCS=1 go test -bench=BenchmarkCancel_Atomic_Done_Direct -benchmem ./internal/cancel

Output:

BenchmarkCancel_Atomic_Done_Direct      1000000000               0.4111 ns/op          0 B/op          0 allocs/op

Notice: -24 suffix is now missing (single-threaded).

With CPU Pinning

$ taskset -c 0 GOMAXPROCS=1 go test -bench=BenchmarkCancel_Atomic_Done_Direct ./internal/cancel

With High Priority

$ sudo nice -n -20 go test -bench=. ./internal/cancel

Maximum Isolation

$ sudo nice -n -20 taskset -c 0 GOMAXPROCS=1 go test -bench=. ./internal/cancel

Step 6: Understanding the Results

Summary Table

Component	Standard	Optimized	Speedup
Cancel check	8.2 ns	0.36 ns	23x
Tick check	86 ns	5.6 ns (batch)	15x
Queue SPSC (2 goroutines)	248 ns	36.5 ns	6.8x
Queue MPSC (8 producers)	47 µs	464 ns	101x
Combined loop	130 ns	63 ns	2.1x

When Do These Optimizations Matter?

Operations/sec	Standard CPU	Optimized CPU	Savings
100K	0.9%	0.3%	0.6%
1M	9%	3%	6%
10M	90%	30%	60%

Rule of thumb: If you're doing >1M operations/second in a hot loop, these optimizations matter significantly.

Step 7: Profiling (Optional)

CPU Profile

$ go test -bench=BenchmarkCombined -cpuprofile=cpu.prof ./internal/combined
$ go tool pprof -http=:8080 cpu.prof

Opens a web UI showing where time is spent.

Memory Profile

$ go test -bench=BenchmarkQueue -memprofile=mem.prof ./internal/queue
$ go tool pprof -http=:8080 mem.prof

All benchmarks should show 0 allocations.

Common Issues

High Variance

Symptom: Results vary by >10% between runs.

Causes:

Background processes (browser, IDE)
CPU frequency scaling
Thermal throttling

Fix:

# Kill background apps, then:
sudo cpupower frequency-set -g performance
sudo nice -n -20 taskset -c 0 GOMAXPROCS=1 go test -bench=. ./internal/...

Unexpected Results

Symptom: Optimized version is slower than standard.

Possible causes:

SPSC guards: RingBuffer has safety checks that add overhead
Warm-up: First run may include JIT/cache warming
Measurement noise: Run with -count=10 and use benchstat

Next Steps

Read the code: Look at internal/cancel/atomic.go to see how simple the optimization is
Try in your code: Replace ctx.Done() checks with AtomicCanceler
Measure your application: Profile to see if these hot paths are actually your bottleneck
Don't over-optimize: If you're not doing millions of ops/sec, standard patterns are fine

Quick Reference

# Run all benchmarks
make bench

# Run specific package
go test -bench=. ./internal/cancel

# Multiple runs for variance
go test -bench=. -count=10 ./internal/... > results.txt

# Compare with benchstat
benchstat results.txt

# CLI tools
go run ./cmd/context -n 10000000
go run ./cmd/ticker -n 10000000
go run ./cmd/context-ticker -n 10000000
go run ./cmd/channel -n 10000000

# Maximum isolation
sudo nice -n -20 taskset -c 0 GOMAXPROCS=1 go test -bench=. ./internal/...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmarking Walkthrough

Test System

Step 1: Verify Installation

Using the Makefile

Step 2: Run Basic Benchmarks

Cancel Package

Tick Package

Combined Benchmarks (Most Realistic)

Queue Benchmarks: Goroutine Patterns

Single Goroutine (No Contention)

SPSC: 1 Producer → 1 Consumer (2 Goroutines)

MPSC: N Producers → 1 Consumer (Channel Lock Contention)

go-lock-free-ring Comparison

Step 3: Use CLI Tools

Context Cancellation Comparison

Combined Cancel + Tick (Most Realistic)

Step 4: Variance Analysis

Step 5: Environment Tuning

With GOMAXPROCS=1

With CPU Pinning

With High Priority

Maximum Isolation

Step 6: Understanding the Results

Summary Table

When Do These Optimizations Matter?

Step 7: Profiling (Optional)

CPU Profile

Memory Profile

Common Issues

High Variance

Unexpected Results

Next Steps

Quick Reference

FilesExpand file tree

WALKTHROUGH.md

Latest commit

History

WALKTHROUGH.md

File metadata and controls

Benchmarking Walkthrough

Test System

Step 1: Verify Installation

Using the Makefile

Step 2: Run Basic Benchmarks

Cancel Package

Tick Package

Combined Benchmarks (Most Realistic)

Queue Benchmarks: Goroutine Patterns

Single Goroutine (No Contention)

SPSC: 1 Producer → 1 Consumer (2 Goroutines)

MPSC: N Producers → 1 Consumer (Channel Lock Contention)

go-lock-free-ring Comparison

Step 3: Use CLI Tools

Context Cancellation Comparison

Combined Cancel + Tick (Most Realistic)

Step 4: Variance Analysis

Step 5: Environment Tuning

With GOMAXPROCS=1

With CPU Pinning

With High Priority

Maximum Isolation

Step 6: Understanding the Results

Summary Table

When Do These Optimizations Matter?

Step 7: Profiling (Optional)

CPU Profile

Memory Profile

Common Issues

High Variance

Unexpected Results

Next Steps

Quick Reference