The kernel can't do this because the kernel is part of the system under test.
- Trust boundary problem: If you modify the kernel to be deterministic, you're no longer testing with a real kernel. Kernel bugs (scheduler races, memory pressure, OOM killer behavior) affect real distributed systems. You want to test with those, not around them.
- Instruction-level control: True determinism requires controlling CPU-level non-determinism - RDTSC, RDRAND, instruction timing, memory ordering, interrupt delivery. The kernel can intercept syscalls, but it can't intercept what the CPU does between syscalls. A hypervisor (especially in TCG/emulation mode) controls every instruction.
- Unmodified binaries: The hypervisor approach tests your actual production binary. Kernel approaches require recompilation, special libraries, or LD_PRELOAD hacks - so you're not testing what you ship.
- Multi-node is natural: A hypervisor manages multiple VMs with virtual networks trivially. Kernel-based approaches struggle with true multi-node simulation, where you need to control message ordering across processes.
- Time virtualization: Hypervisors virtualize TSC/HPET completely. The kernel can hook clock_gettime(), but applications can still issue RDTSC directly, and you can't fake time for the kernel's own timers.
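The limits of library-level time hooks can be sketched with a userspace analogy. In the toy below, patching one clock API stands in for LD_PRELOAD-ing clock_gettime(), while a second, unpatched clock stands in for an application issuing RDTSC directly - an illustration of the gap, not the actual mechanism:

```python
import time
from unittest import mock

# Hooking one time API (analogous to interposing on clock_gettime) freezes
# only that path; code reading a different clock still sees real time,
# much as an application issuing RDTSC bypasses a libc-level hook entirely.
with mock.patch("time.time", return_value=0.0):
    assert time.time() == 0.0        # hooked path: fake, frozen time
    assert time.monotonic() > 0.0    # unhooked path: real time leaks through
```

A hypervisor sits below both paths, so there is no unhooked clock left to leak.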
This argument assumes you can swap out non-determinism cleanly. But the bugs you're hunting ARE the non-deterministic behaviors.
- Race conditions manifest FROM scheduler non-determinism. If you replace the scheduler with a deterministic one, you're not triggering the same interleavings that cause production bugs. You want to explore different scheduler behaviors, not eliminate them.
- The kernel can't control its own inputs. The hypervisor controls what the kernel sees (time, interrupts, I/O completion order) while letting the kernel respond naturally. A "deterministic kernel" changes how the kernel responds - a different thing entirely.
- You'd be testing a different system. Replace the scheduler, timer subsystem, random pool, and memory allocator with deterministic versions and you've written a new kernel. Your tests pass, production still fails.
The right mental model:
- Hypervisor = control the environment (inputs), observe real system behavior
- Kernel hooks = change the system, hope it's "close enough"
The hypervisor approach is: "let me replay exactly what happened." The kernel approach is: "let me prevent it from happening non-deterministically." Only one of those finds production bugs.
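The "replay exactly what happened" model can be sketched in a few lines: record the environment's non-deterministic inputs while the system runs naturally, then feed the identical trace back to the unmodified system. (A toy sketch; the functions and values are illustrative, not Bloodhound's interface.)

```python
import random

def system_step(rng_value, state):
    """A toy 'system' whose behavior depends on a non-deterministic input."""
    return state + rng_value

def record_run(seed, steps=5):
    """Record: capture the environment's inputs as the system runs naturally."""
    rng = random.Random(seed)
    trace, state = [], 0
    for _ in range(steps):
        value = rng.randrange(100)   # stands in for a timer tick / packet arrival
        trace.append(value)
        state = system_step(value, state)
    return trace, state

def replay_run(trace):
    """Replay: feed back the identical inputs; the unmodified system repeats itself."""
    state = 0
    for value in trace:
        state = system_step(value, state)
    return state

trace, original = record_run(seed=42)
assert replay_run(trace) == original   # same inputs -> same behavior
```

The system code is never touched; only its inputs are captured and replayed.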
In TLA+, prophecy variables allow modeling non-deterministic choices by predetermining outcomes. Theoretically, you could modify the kernel scheduler (and other non-deterministic components) to accept these prophecy variables as inputs.
The hypervisor approach achieves the same result more elegantly: the kernel does NOT need to be modified.
The hypervisor controls inputs at the hardware boundary:
- Timer interrupts (TSC, HPET) delivered at exact virtual cycles
- RDRAND/RDSEED trapped and replaced with seeded values
- I/O completion interrupts delivered in controlled order
- Network packets delivered deterministically
Given identical inputs, the unmodified kernel scheduler makes identical decisions. The "prophecy" is encoded in the controlled hardware inputs, and the kernel's normal algorithm follows deterministically.
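The "prophecy lives in the inputs" idea can be sketched with a toy round-robin scheduler. The scheduler's code is never modified; its decisions depend only on which timer interrupts it receives, so a fixed interrupt sequence replays a fixed schedule. (An illustrative sketch; the encoding of interrupts as a boolean list is an assumption of this example.)

```python
from collections import deque

def schedule(tasks, interrupt_at):
    """A toy, unmodified round-robin scheduler. Its only non-deterministic
    input is the timer interrupt sequence, which the 'hypervisor' controls."""
    ready = deque(tasks)
    timeline = []
    for tick in range(len(interrupt_at)):
        timeline.append(ready[0])       # the front task runs this tick
        if interrupt_at[tick]:          # hypervisor-controlled timer interrupt
            ready.rotate(-1)            # preempt: move current task to the back
    return timeline

interrupts = [False, True, False, True, True, False]   # the 'prophecy'
run1 = schedule(["A", "B", "C"], interrupts)
run2 = schedule(["A", "B", "C"], interrupts)
assert run1 == run2   # identical inputs, identical scheduling decisions
```

Varying the interrupt sequence explores different interleavings; fixing it replays one exactly.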
Why hypervisor beats kernel-level prophecy variables:
| Approach | Modification | Completeness |
|---|---|---|
| Hypervisor | Minimal, at hardware boundary | Complete by construction |
| Kernel prophecy | Every non-deterministic subsystem | Must identify all sources |
The hypervisor approach is the minimum modification that achieves complete determinism. Kernel prophecy variables would require maximum modification for the same result.
Yes, but subtractively rather than additively.
What we do (subtractive):
```
CONFIG_SMP=n           # Single CPU only
CONFIG_PREEMPT_NONE=y  # No preemption
# No hardware RNG - use hypervisor-provided randomness
```
We remove kernel features that are hard to control from the hypervisor.
What prophecy variables would require (additive):
- Modify scheduler to accept schedule input
- Modify allocator to accept allocation decisions
- Thread prophecy variables through every subsystem
Why subtractive is more practical:
- Still running real kernel code paths, just fewer of them
- Config change vs. code change - much simpler
- Production can run the same kernel with CONFIG_SMP=y - the algorithms are identical, just with more parallelism
The hypervisor still does the heavy lifting. We took pragmatic shortcuts that limit what we can test (no SMP bugs) but kept the kernel largely unmodified.
SMP = Symmetric Multi-Processing: multiple CPU cores that share memory and can execute code simultaneously.
```
Single CPU:          SMP (4 cores):
┌─────┐              ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ CPU │              │ CPU │ │ CPU │ │ CPU │ │ CPU │
└──┬──┘              └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘
   │                    └───────┴───┬───┴───────┘
   │                                │
┌──┴──┐              ┌──────────────┴──────────────┐
│ RAM │              │         Shared RAM          │
└─────┘              └─────────────────────────────┘
```
Why SMP introduces hard-to-control non-determinism:
- Memory ordering: Core 1 writes A then B. Core 2 might see B before A due to CPU caching/store buffers.
- Cache coherence races: Both cores read X, both increment, both write back. You get X+1 instead of X+2.
- Atomic operation timing: compare_and_swap on two cores hitting the same cache line - who wins?
This happens inside the CPU, below the instruction level. Even a hypervisor can't see it without simulating the entire CPU microarchitecture.
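The lost-update race above can be made concrete by enumerating interleavings explicitly. This toy models each "core" as two micro-steps (read, then write-back) and shows that the result depends entirely on the interleaving - the non-determinism a real CPU hides below the instruction level. (A sketch for illustration; real hardware interleaves at cache-coherence granularity, not neat micro-steps.)

```python
import itertools

def run_interleaving(order):
    """Simulate two 'cores' each doing read -> increment -> write on shared X.
    `order` lists which core executes its next micro-step at each point."""
    x = 0
    regs = {0: None, 1: None}   # per-core register holding the value it read
    step = {0: 0, 1: 0}         # next micro-step per core: 0 = read, 1 = write
    for core in order:
        if step[core] == 0:
            regs[core] = x              # read shared X into a register
        else:
            x = regs[core] + 1          # write back register + 1
        step[core] += 1
    return x

# Each core runs its read then its write; try every possible interleaving.
results = {run_interleaving(o) for o in set(itertools.permutations([0, 0, 1, 1]))}
assert results == {1, 2}   # some interleavings lose an update: X+1, not X+2
```

With a single vCPU these micro-steps can never interleave, which is exactly why CONFIG_SMP=n sidesteps the problem.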
No. This is a fundamental limitation of the hypervisor approach without a full CPU simulator.
What Bloodhound can test:
| Bug Type | Supported |
|---|---|
| Distributed systems (network, timing) | Yes |
| Single-core race conditions (thread interleaving) | Yes |
| SMP memory ordering, lock-free bugs | No |
Why SMP is so hard: You'd need to simulate the CPU's memory model - cache coherence, store buffers, memory barriers. That's not hypervisor-level anymore, that's a full CPU simulator. The non-determinism is below the instruction level.
The pragmatic choice: Run single vCPU (CONFIG_SMP=n). You can still test thread interleaving (scheduler switches between threads on one core), just not true parallel memory access bugs. This covers the vast majority of distributed systems bugs.
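Single-core thread interleaving is still explorable because the scheduler decides who runs at each switch point. The sketch below models threads as generators and drives them with a seeded scheduler, so each seed is one exact interleaving and any failing seed replays its bug. (Illustrative only; Bloodhound's actual scheduling control is at the hypervisor, not in Python.)

```python
import random

def run_with_seed(seed):
    """Two logical 'threads' on one vCPU, modeled as generators. A seeded
    scheduler picks which one runs at every yield point, so a seed fully
    determines the interleaving - and the same seed replays it exactly."""
    rng = random.Random(seed)
    shared = {"x": 0}

    def thread():
        seen = shared["x"]      # check-then-act: racy under some interleavings
        yield                   # yield point: the scheduler may switch here
        shared["x"] = seen + 1

    live = [thread(), thread()]
    while live:
        t = rng.choice(live)    # deterministic 'scheduler' decision
        try:
            next(t)
        except StopIteration:
            live.remove(t)
    return shared["x"]

# Explore seeds; a seed that loses an update reproduces the bug on every re-run.
failing = [s for s in range(100) if run_with_seed(s) != 2]
assert all(run_with_seed(s) == run_with_seed(s) for s in failing)
```

This is "same seed = same bug" in miniature: exploration is random, reproduction is exact.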
TLB (Translation Lookaside Buffer) = cache of virtual-to-physical address translations. On real hardware:
- TLB state depends on access history
- TLB miss → page table walk → different timing
- Eviction policy is microarchitectural (not visible to software)
- Can cause non-deterministic page faults, timing variations
Antithesis reportedly had to patch their guest kernel to work around TLB non-determinism.
QEMU TCG mode solves this automatically.
TCG (Tiny Code Generator) = software CPU emulation. There's no real hardware TLB - QEMU does address translation in software (softmmu). Same inputs → same translations → deterministic.
```
Real Hardware:         QEMU TCG:
┌─────────────────┐    ┌─────────────────┐
│       CPU       │    │    Host CPU     │
│ ┌─────────────┐ │    │                 │
│ │ Hardware TLB│ │    │ ┌─────────────┐ │
│ │ (opaque,    │ │    │ │ Software TLB│ │
│ │  non-det)   │ │    │ │ (code, det) │ │
│ └─────────────┘ │    │ └─────────────┘ │
└─────────────────┘    └─────────────────┘
```
The tradeoff: TCG is ~10-100x slower than KVM, but we get determinism "for free" on things like TLB behavior, without kernel patches.
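Why a software TLB is deterministic can be shown in miniature: it is ordinary code over explicit data structures, so identical inputs always produce identical translations - there is no hidden microarchitectural eviction state. (A toy model of the idea; QEMU's softmmu is far more elaborate.)

```python
def translate(page_table, vaddr, tlb, page_size=4096):
    """Toy softmmu: translate a virtual address through a software TLB.
    A miss triggers a deterministic page-table walk; all state is visible."""
    vpn, offset = divmod(vaddr, page_size)
    if vpn not in tlb:                  # software 'TLB miss'
        tlb[vpn] = page_table[vpn]      # deterministic page-table walk
    return tlb[vpn] * page_size + offset

page_table = {0: 7, 1: 3}               # virtual page -> physical frame
tlb = {}
assert translate(page_table, 4100, tlb) == 3 * 4096 + 4   # miss, then fill
assert translate(page_table, 4100, tlb) == 3 * 4096 + 4   # hit, same answer
```

Because fills and hits are plain dictionary operations, replaying the same access sequence replays the same TLB state - no kernel patches required.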
KVM (Kernel-based Virtual Machine) runs guest code directly on the CPU using hardware virtualization (VT-x/AMD-V). It's fast but:
- Real CPU non-determinism leaks through: TLB behavior, branch prediction, speculative execution timing
- Can't control instruction-level timing: KVM doesn't count instructions
- Hardware RNG accessible: Guest can still execute RDRAND
QEMU TCG mode interprets every guest instruction in software. Slower, but:
- Full control over timing (instruction counting)
- No hardware microarchitecture non-determinism
- Can trap and replace any instruction (RDRAND, RDTSC)
For deterministic simulation testing, correctness trumps performance. Simulating 60 seconds of cluster time in 6 wall-clock seconds (TCG) instead of 0.6 seconds (KVM) is an acceptable cost when it guarantees reproducibility.
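Instruction counting is the key capability TCG adds. The toy interpreter below shows the shape of it: because every "instruction" passes through software, a timer interrupt can be delivered at an exact instruction count, identically on every run. (A sketch of the principle; QEMU's actual mechanism is its icount mode, not this loop.)

```python
def run(program, interrupt_every):
    """Toy TCG-style interpreter: executes one 'instruction' at a time,
    counts them, and delivers a timer interrupt at exact counts - something
    direct execution (KVM) cannot do, since it never sees each instruction."""
    acc, count, interrupts = 0, 0, []
    for instr in program:
        acc += instr                    # 'execute' the instruction
        count += 1
        if count % interrupt_every == 0:
            interrupts.append(count)    # interrupt at a precise virtual cycle
    return acc, interrupts

acc, irqs = run([1, 2, 3, 4, 5, 6], interrupt_every=2)
assert irqs == [2, 4, 6]   # interrupts land at identical points on every run
```

Under KVM, interrupt delivery points drift with real-time hardware behavior; here they are a pure function of the program and the interrupt schedule.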
| Aspect | Traditional Testing | Jepsen | Bloodhound (DST) |
|---|---|---|---|
| Timing bugs | Hard to reproduce | Statistical | Deterministic |
| Coverage | Random/manual | Random with analysis | Systematic exploration |
| Reproduction | "Works on my machine" | Sometimes | Same seed = same bug |
| Detection | Manual assertions | Consistency checks | Automatic property checks |
| Kernel tested | Real | Real | Real (mostly) |
| Speed | Fast | Medium | Slower (TCG overhead) |
When to use Bloodhound:
- Distributed systems with complex failure modes
- Need to reproduce timing-sensitive bugs
- Want systematic exploration of failure scenarios
When NOT to use Bloodhound:
- Low-level lock-free data structures (need SMP)
- Pure performance testing
- Simple unit tests
- FoundationDB Testing - Simulation testing at Apple
- TigerBeetle Simulation
- Jepsen - Distributed systems testing and analysis
- CLAUDE.md - Bloodhound project philosophy