Frequently Asked Questions

Why use a hypervisor instead of modifying the kernel for determinism?

The kernel can't do this because the kernel is part of the system under test.

  1. Trust boundary problem: If you modify the kernel to be deterministic, you're no longer testing with a real kernel. Kernel bugs (scheduler races, memory pressure, OOM killer behavior) affect real distributed systems. You want to test with those, not around them.

  2. Instruction-level control: True determinism requires controlling CPU-level non-determinism - RDTSC, RDRAND, instruction timing, memory ordering, interrupt delivery. The kernel can intercept syscalls, but it can't intercept what the CPU does between syscalls (see the sketch after this list). A hypervisor (especially TCG/emulation mode) controls every instruction.

  3. Unmodified binaries: The hypervisor approach tests your actual production binary. Kernel approaches require recompilation, special libraries, or LD_PRELOAD hacks - so you're not testing what you ship.

  4. Multi-node is natural: A hypervisor manages multiple VMs with virtual networks trivially. Kernel-based approaches struggle with true multi-node simulation where you need to control message ordering across processes.

  5. Time virtualization: Hypervisors virtualize TSC/HPET completely. The kernel can hook clock_gettime() but applications can still RDTSC directly, and you can't fake time for the kernel's own timers.
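
To make point 2 concrete, here is a tiny user-space program (illustrative, not from the Bloodhound sources) that reads the timestamp counter and the hardware RNG directly. No syscall is ever made, so there is nothing for the kernel to hook; only an emulating hypervisor sees these instructions.

/* Compile with: gcc -mrdrnd rdtsc_demo.c */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void) {
    /* RDTSC: timestamp counter, read straight from the CPU. */
    uint64_t tsc = __rdtsc();

    /* RDRAND: hardware random number, also straight from the CPU. */
    unsigned long long r = 0;
    if (!_rdrand64_step(&r))
        return 1;  /* hardware RNG not ready */

    printf("tsc=%llu rand=%llu\n", (unsigned long long)tsc, r);
    return 0;
}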


Can't the kernel provide determinism for testing while running normally in production?

This argument assumes you can swap out non-determinism cleanly. But the bugs you're hunting ARE the non-deterministic behaviors.

  1. Race conditions manifest FROM scheduler non-determinism. If you replace the scheduler with a deterministic one, you're not triggering the same interleavings that cause production bugs. You want to explore different scheduler behaviors, not eliminate them.

  2. The kernel can't control its own inputs. The hypervisor controls what the kernel sees (time, interrupts, I/O completion order) while letting the kernel respond naturally. A "deterministic kernel" changes how the kernel responds - different thing entirely.

  3. You'd be testing a different system. Replace the scheduler, timer subsystem, random pool, memory allocator with deterministic versions and you've written a new kernel. Your tests pass, production still fails.

The right mental model:

  • Hypervisor = control the environment (inputs), observe real system behavior
  • Kernel hooks = change the system, hope it's "close enough"

The hypervisor approach is: "let me replay exactly what happened." The kernel approach is: "let me prevent it from happening non-deterministically." Only one of those finds production bugs.
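
As a toy illustration of the replay mindset - a sketch of the idea, not Bloodhound's actual log format - record mode captures every external input with its virtual delivery time, and replay mode re-injects the same log:

/* Toy record/replay event log (illustrative format). Given the same log,
 * the guest receives the same inputs at the same virtual times, so it
 * re-executes the same run. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint64_t virtual_cycle;  /* when the input was delivered */
    uint32_t kind;           /* e.g. timer interrupt, packet arrival */
    uint64_t payload;        /* the input data, or a handle to it */
} event_t;

/* Record mode: append each external input as it is injected. */
void record_event(FILE *log, const event_t *ev) {
    fwrite(ev, sizeof *ev, 1, log);
}

/* Replay mode: read back the next input to re-inject. */
int next_event(FILE *log, event_t *ev) {
    return fread(ev, sizeof *ev, 1, log) == 1;
}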


What about TLA+ prophecy variables? Couldn't those make the kernel deterministic?

In TLA+, prophecy variables allow modeling non-deterministic choices by predetermining outcomes. Theoretically, you could modify the kernel scheduler (and other non-deterministic components) to accept these prophecy variables as inputs.

The hypervisor approach achieves the same result more elegantly: the kernel does NOT need to be modified.

The hypervisor controls inputs at the hardware boundary:

  • Timer interrupts (TSC, HPET) delivered at exact virtual cycles
  • RDRAND/RDSEED trapped and replaced with seeded values
  • I/O completion interrupts delivered in controlled order
  • Network packets delivered deterministically

Given identical inputs, the unmodified kernel scheduler makes identical decisions. The "prophecy" is encoded in the controlled hardware inputs, and the kernel's normal algorithm follows deterministically.
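
As a rough sketch of what "RDRAND trapped and replaced with seeded values" means on the hypervisor side (the handler name and the xorshift PRNG are illustrative, not QEMU's actual helpers):

/* Hypervisor-side RDRAND trap (sketch). In TCG mode the emulator decides
 * what the instruction returns; a PRNG seeded from the simulation seed
 * makes every run with that seed identical. */
#include <stdint.h>

static uint64_t rng_state;  /* initialized once from the simulation seed */

void seed_guest_rng(uint64_t seed) { rng_state = seed ? seed : 1; }

/* Called in place of real hardware when the guest executes RDRAND. */
uint64_t handle_rdrand(void) {
    rng_state ^= rng_state << 13;  /* xorshift64, purely illustrative */
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}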

Why hypervisor beats kernel-level prophecy variables:

Approach          Modification                        Completeness
Hypervisor        Minimal, at hardware boundary       Complete by construction
Kernel prophecy   Every non-deterministic subsystem   Must identify all sources

The hypervisor approach is the minimum modification that achieves complete determinism. Kernel prophecy variables would require maximum modification for the same result.


Does Bloodhound modify the guest kernel?

Yes, but subtractively rather than additively.

What we do (subtractive):

CONFIG_SMP=n           # Single CPU only
CONFIG_PREEMPT_NONE=y  # No preemption
# No hardware RNG - use hypervisor-provided randomness

We remove kernel features that are hard to control from the hypervisor.

What prophecy variables would require (additive):

  • Modify scheduler to accept schedule input
  • Modify allocator to accept allocation decisions
  • Thread prophecy variables through every subsystem

Why subtractive is more practical:

  1. Still running real kernel code paths, just fewer of them
  2. Config change vs. code change - much simpler
  3. Production can run the same kernel with CONFIG_SMP=y - the algorithms are identical, just with more parallelism

The hypervisor still does the heavy lifting. We took pragmatic shortcuts that limit what we can test (no SMP bugs) but kept the kernel largely unmodified.


What is SMP?

SMP = Symmetric Multi-Processing

Multiple CPU cores that share memory and can execute code simultaneously.

Single CPU:          SMP (4 cores):
┌─────┐              ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ CPU │              │ CPU │ │ CPU │ │ CPU │ │ CPU │
└──┬──┘              └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘
   │                    └───────┴───┬───┴───────┘
   │                                │
┌──┴──┐              ┌──────────────┴──────────────┐
│ RAM │              │         Shared RAM          │
└─────┘              └─────────────────────────────┘

Why SMP introduces hard-to-control non-determinism:

  1. Memory ordering: Core 1 writes A then B. Core 2 might see B before A due to CPU caching/store buffers.

  2. Cache coherence races: Both cores read X, both increment, both write back. You get X+1 instead of X+2.

  3. Atomic operation timing: compare_and_swap on two cores hitting the same cacheline - who wins?

This happens inside the CPU, below the instruction level. Even a hypervisor can't see it without simulating the entire CPU microarchitecture.
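
The classic store-buffer litmus test makes the memory-ordering point concrete. A sketch using C11 relaxed atomics: no interleaving of the four operations can yield r1 == 0 && r2 == 0, yet real SMP hardware can produce exactly that.

/* Store-buffer litmus test. On two real cores, each store can sit in its
 * core's store buffer while the other core loads, so r1 == 0 && r2 == 0
 * is an allowed outcome - one no single-core interleaving can produce.
 * Compile with: gcc -pthread litmus.c */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int x, y;  /* zero-initialized */
int r1, r2;

void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d\n", r1, r2);  /* 0,0 only on real SMP */
    return 0;
}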


Can Bloodhound find SMP/multi-core bugs?

No. This is a fundamental limitation of the hypervisor approach without a full CPU simulator.

What Bloodhound can test:

Bug Type                                             Supported
Distributed systems (network, timing)                Yes
Single-core race conditions (thread interleaving)    Yes
SMP memory ordering, lock-free bugs                  No

Why SMP is so hard: You'd need to simulate the CPU's memory model - cache coherence, store buffers, memory barriers. That's not hypervisor-level anymore, that's a full CPU simulator. The non-determinism is below the instruction level.

The pragmatic choice: Run single vCPU (CONFIG_SMP=n). You can still test thread interleaving (scheduler switches between threads on one core), just not true parallel memory access bugs. This covers the vast majority of distributed systems bugs.
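
For example, the lost-update race below needs only preemption, not parallelism, so a single-vCPU deterministic run can still trigger it - and replay it with the same seed (illustrative sketch):

/* Lost update: counter++ compiles to load/add/store. If the scheduler
 * preempts one thread between the load and the store, an increment is
 * lost - even on a single core. Compile with: gcc -pthread lost_update.c */
#include <pthread.h>
#include <stdio.h>

long counter;  /* deliberately not atomic */

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;  /* racy read-modify-write */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter=%ld (expected 2000000)\n", counter);
    return 0;
}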


What about TLB-related non-determinism?

TLB (Translation Lookaside Buffer) = cache of virtual-to-physical address translations. On real hardware:

  • TLB state depends on access history
  • TLB miss → page table walk → different timing
  • Eviction policy is microarchitectural (not visible to software)
  • Can cause non-deterministic page faults, timing variations

Antithesis reportedly had to patch their guest kernel to work around TLB non-determinism.

QEMU TCG mode solves this automatically.

TCG (Tiny Code Generator) = software CPU emulation. There's no real hardware TLB - QEMU does address translation in software (softmmu). Same inputs → same translations → deterministic.

Real Hardware:              QEMU TCG:
┌─────────────────┐         ┌─────────────────┐
│ CPU             │         │ Host CPU        │
│ ┌─────────────┐ │         │                 │
│ │ Hardware TLB│ │         │ ┌─────────────┐ │
│ │ (opaque,    │ │         │ │ Software TLB│ │
│ │  non-det)   │ │         │ │ (code, det) │ │
│ └─────────────┘ │         │ └─────────────┘ │
└─────────────────┘         └─────────────────┘

The tradeoff: TCG is ~10-100x slower than KVM, but we get determinism "for free" on things like TLB behavior, without kernel patches.
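
A software TLB is an ordinary data structure, so its behavior is fully determined by the code that manipulates it. A simplified sketch in the spirit of softmmu (not QEMU's actual implementation; the identity page-table walk is a stand-in):

/* Direct-mapped software TLB (simplified sketch, not QEMU's code).
 * Lookup, refill, and eviction are plain code acting on plain memory,
 * so identical guest execution yields identical TLB behavior - there is
 * no opaque hardware state. */
#include <stdint.h>

#define TLB_ENTRIES 256
#define PAGE_SHIFT  12

typedef struct {
    uint64_t vpage;  /* virtual page number this entry maps */
    uint64_t ppage;  /* physical page number it maps to */
    int      valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Stand-in for a software page-table walk (identity map for the sketch). */
static uint64_t page_table_walk(uint64_t vpage) { return vpage; }

uint64_t translate(uint64_t vaddr) {
    uint64_t vpage = vaddr >> PAGE_SHIFT;
    tlb_entry_t *e = &tlb[vpage % TLB_ENTRIES];   /* deterministic index */
    if (!e->valid || e->vpage != vpage) {         /* deterministic miss */
        e->vpage = vpage;
        e->ppage = page_table_walk(vpage);        /* deterministic refill */
        e->valid = 1;
    }
    return (e->ppage << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
}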


Why not just use KVM for better performance?

KVM (Kernel-based Virtual Machine) runs guest code directly on the CPU using hardware virtualization (VT-x/AMD-V). It's fast but:

  1. Real CPU non-determinism leaks through: TLB behavior, branch prediction, speculative execution timing
  2. Can't control instruction-level timing: KVM doesn't count instructions
  3. Hardware RNG accessible: Guest can still execute RDRAND

QEMU TCG mode interprets every guest instruction in software. Slower, but:

  • Full control over timing (instruction counting)
  • No hardware microarchitecture non-determinism
  • Can trap and replace any instruction (RDRAND, RDTSC)

For deterministic simulation testing, correctness trumps performance: a simulation covering 60 seconds of virtual time that completes in 6 wall-clock seconds (TCG) instead of 0.6 (KVM) is an acceptable cost when it guarantees reproducibility.
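
"Full control over timing" means virtual time can be derived purely from the instruction count, the principle behind QEMU's -icount option. A minimal sketch (the constants and function names are illustrative):

/* Virtual time derived from the instruction count (sketch of the idea
 * behind QEMU's -icount). Re-running the same guest executes the same
 * instructions, so every timestamp and timer expiry lands on the same
 * virtual nanosecond regardless of host speed. */
#include <stdint.h>

static uint64_t icount;   /* guest instructions executed so far */
#define NS_PER_INSN 2     /* e.g. -icount shift=1 means 2^1 ns per insn */

/* Called from the emulator's execution loop as instructions retire. */
void retire_instruction(void) { icount++; }

/* What the guest sees when it reads the TSC or a clock. */
uint64_t virtual_time_ns(void) { return icount * NS_PER_INSN; }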


How does Bloodhound compare to other testing approaches?

Aspect          Traditional Testing      Jepsen                  Bloodhound (DST)
Timing bugs     Hard to reproduce        Statistical             Deterministic
Coverage        Random/manual            Random with analysis    Systematic exploration
Reproduction    "Works on my machine"    Sometimes               Same seed = same bug
Detection       Manual assertions        Consistency checks      Automatic property checks
Kernel tested   Real                     Real                    Real (mostly)
Speed           Fast                     Medium                  Slower (TCG overhead)

When to use Bloodhound:

  • Distributed systems with complex failure modes
  • Need to reproduce timing-sensitive bugs
  • Want systematic exploration of failure scenarios

When NOT to use Bloodhound:

  • Low-level lock-free data structures (need SMP)
  • Pure performance testing
  • Simple unit tests

Further Reading