Version: 0.1.0-alpha | Status: Pre-release | Last Updated: 2026-03-24
Silent Data Corruption (SDC) occurs when hardware faults produce incorrect computation results without raising any detectable error signal. In GPU clusters running AI workloads, SDC can corrupt training runs spanning weeks of compute time, produce subtly wrong inference outputs, and propagate through model weights and checkpoints in ways that are extremely difficult to diagnose after the fact.
Traditional error-detection mechanisms -- ECC memory, parity checks, watchdog timers -- catch only a subset of hardware faults. "Mercurial cores" (Google, 2021) and "small-delay defects" can produce arithmetically wrong results that pass all hardware error checks. At fleet scale (thousands of GPUs running continuously under thermal stress), the probability of at least one GPU exhibiting SDC at any moment approaches certainty.
SENTINEL is a multi-layered detection framework that continuously validates the computational integrity of every GPU in a cluster. It combines:
- Active probing -- deterministic micro-benchmarks that test every arithmetic unit, memory subsystem, and register file against known-good answers.
- Passive monitoring -- statistical analysis of production inference and training outputs to catch SDC that manifests as anomalous model behavior.
- Cross-GPU validation -- selective Triple Modular Redundancy (TMR) where the same computation is run on three GPUs and results are compared by majority vote.
- Bayesian attribution -- a fleet-wide correlation engine that fuses signals from all sources to compute a per-GPU reliability score and automatically quarantine suspect hardware.
- Tamper-evident audit trail -- a cryptographically chained ledger of every detection event, quarantine decision, and operator action for compliance and forensics.
+=========================================================================+
| GPU CLUSTER |
| |
| +------------------+ +------------------+ +------------------+ |
| | GPU Node 1 | | GPU Node 2 | | GPU Node N | |
| | +------+ +-----+ | | +------+ +-----+ | | +------+ +-----+ | |
| | |Probe | |Inf. | | | |Probe | |Train| | | |Probe | |Inf. | | |
| | |Agent | |Mon. | | | |Agent | |Mon. | | | |Agent | |Mon. | | |
| | +--+---+ +--+--+ | | +--+---+ +--+--+ | | +--+---+ +--+--+ | |
| +----|---------|----| +----|---------|----| +----|---------|----| |
| | | | | | | |
+-------|---------|------------|---------|------------|---------|------+ |
| | | | | | |
gRPC | gRPC | gRPC | gRPC | gRPC | gRPC | |
Stream Stream Stream Stream Stream Stream |
| | | | | | |
+---v---------v------------v---------v------------v---------v---+ |
| CORRELATION ENGINE (Rust) | |
| | |
| +------------------+ +------------------+ +--------------+ | |
| | Bayesian | | Temporal | | Trust | | |
| | Attribution | | Correlation | | Graph & | | |
| | Model | | Windows | | TMR | | |
| +------------------+ +------------------+ +--------------+ | |
| +------------------+ +------------------+ +--------------+ | |
| | Quarantine | | Pattern | | Alert | | |
| | State Machine | | Matcher | | Manager | | |
| +--------+---------+ +------------------+ +--------------+ | |
+-----------|----------------------------------------------------+ |
| |
+-----------v--------------------------------------------------------+ |
| AUDIT LEDGER (Rust) | |
| +------------------+ +------------------+ +--------------+ | |
| | Merkle Hash | | Batch | | Compliance | | |
| | Chain | | Processor | | Reports | | |
| +------------------+ +------------------+ +--------------+ | |
+---------------------------------------------------------------------+ |
|
+---------------------------------------------------------------------+ |
| DATA STORES | |
| +-----------+ +-----------+ +-----------+ | |
| |PostgreSQL | | ScyllaDB | | Redis | | |
| |State & | |Time-series| |Cache & | | |
| |Metadata | |& Audit | |Pub/Sub | | |
| +-----------+ +-----------+ +-----------+ | |
+---------------------------------------------------------------------+ |
|
+---------------------------------------------------------------------+ |
| DASHBOARD (React/TypeScript) | |
| Real-time fleet health | SDC event timeline | GPU drill-down | |
+---------------------------------------------------------------------+ |
+=========================================================================+
SENTINEL employs a defense-in-depth strategy with five complementary detection layers. Each layer targets a different manifestation of SDC and operates on a different timescale, providing overlapping coverage so that a fault missed by one layer is likely caught by another.
Purpose: Directly test GPU arithmetic units, memory, and register files by running computations with known inputs and comparing outputs against pre-computed golden answers. This is the primary, most direct SDC detection mechanism.
Principle: If a GPU is producing correct arithmetic results for known inputs, it is overwhelmingly likely to be producing correct results for production workloads. Conversely, if a probe fails, the specific SM (Streaming Multiprocessor) that produced the wrong answer is identified, enabling fine-grained fault isolation.
| Probe | What It Tests | Tolerance | Default Period |
|---|---|---|---|
| FMA | Fused multiply-add units on every SM | Exact (0 ULP) | 60s |
| Tensor Core | HMMA (half-precision matrix multiply-accumulate) units | Exact (0 ULP) | 60s |
| Transcendental | sin, cos, exp, log SFUs (special function units) | 1 ULP (FP32), 2 ULP (FP16/BF16) | 120s |
| AES | Integer ALU and bitwise logic via AES-128 encrypt/decrypt | Exact (0 ULP) | 300s |
| Memory | Global memory via walking-ones, walking-zeros, MATS+ patterns | Exact | 600s |
| Register File | Register file via known-pattern write/read/verify | Exact | 120s |
| Shared Memory | Shared memory banks via address-as-data patterns | Exact | 300s |
Each probe kernel is launched with explicit SM affinity so that SENTINEL knows exactly which SM produced each result. This is implemented via the sm_affinity.cu module:
- On startup the probe agent queries the GPU's SM count via `cudaDeviceGetAttribute(cudaDevAttrMultiProcessorCount)`.
- For each probe execution, the scheduler selects the set of target SMs based on the `sm_coverage` parameter (1.0 = all SMs, 0.5 = half, rotating).
- The probe kernel uses cooperative groups and thread block affinity hints to pin each thread block to a specific SM. On Ampere and later architectures, `cudaLaunchAttributePreferredClusterSize` and MIG partition awareness are used for stricter isolation.
- Results are indexed by SM ID, so a failure report contains the exact SM that returned the wrong answer.
SM pinning enables SENTINEL to distinguish between a single faulty SM (localized defect) and GPU-wide corruption (systemic issue), which is critical for the correlation engine's attribution model.
Golden answers are the reference outputs against which probe results are compared. They must be:
- Computed at higher precision than the GPU to ensure the reference is correct. The golden answer generator (`tools/golden-answer-generator/`) uses Python's `mpmath` library for arbitrary-precision arithmetic.
- Architecture-aware: Different GPU architectures (Ampere, Hopper, Blackwell) may use different internal rounding modes for certain operations. Golden answers are generated per-architecture family.
- Verified: After generation, golden answers are independently verified by running the same computation on multiple known-good GPUs and confirming bit-exact agreement.
- Input selection: Probe inputs are carefully chosen to avoid rounding ambiguity. For FMA probes, inputs are selected such that `fma(a, b, c)` produces an exactly representable result in the target precision, eliminating any possibility of a legitimate rounding difference being misinterpreted as corruption.
For probes with non-zero ULP tolerance (transcendentals), the golden answer file includes both the expected value and the acceptable ULP range. The comparison logic computes the ULP distance between the GPU result and the golden answer and flags only results outside the tolerance window.
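The ULP comparison described above can be illustrated with a short sketch for FP32 values. This is not SENTINEL's comparison code; the function names are hypothetical, and NaN handling is deliberately left out.

```python
import struct

def float32_to_ordered_int(x: float) -> int:
    """Map a float32 bit pattern to an integer that increases with the float value."""
    (bits,) = struct.unpack("<i", struct.pack("<f", x))  # signed 32-bit view of the bits
    # Negative floats are stored as sign + magnitude; mirror them below zero
    # so that adjacent floats always map to adjacent integers.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFF)

def ulp_distance_f32(a: float, b: float) -> int:
    """Number of float32 steps separating a and b (0 means bit-identical, or +0/-0)."""
    return abs(float32_to_ordered_int(a) - float32_to_ordered_int(b))

def within_tolerance(gpu_result: float, golden: float, max_ulp: int) -> bool:
    """True if the GPU result lies inside the tolerance window around the golden answer."""
    return ulp_distance_f32(gpu_result, golden) <= max_ulp
```

With `max_ulp=0` this reduces to the exact comparison used by the FMA and Tensor Core probes; with `max_ulp=1` it matches the FP32 transcendental tolerance in the table above.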
Probes are scheduled by a priority-based scheduler (probe-agent/src/agent/scheduler.cpp) that respects the configurable overhead budget (default: 2% of GPU time). The scheduler:
- Maintains a priority queue ordered by each probe's next-due timestamp and priority level.
- Before launching a probe, estimates its execution time based on historical measurements.
- If launching the probe would exceed the overhead budget in the current measurement window, the probe is deferred to the next window.
- CUDA streams are used with low priority (`stream_priority: -1`) so probe kernels yield to production workloads.
- Optionally, CUDA graphs are pre-compiled for probe kernels to minimize launch overhead (CPU-side cost of kernel dispatch).
The overhead budget is enforced using a sliding-window counter of probe kernel execution time divided by wall-clock time. If the budget is exceeded, the scheduler backs off exponentially until the ratio falls below budget.
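A minimal sketch of the sliding-window budget check follows, assuming a window measured in seconds and probe durations recorded at completion time. The class and method names are hypothetical, and the exponential backoff step is omitted for brevity.

```python
from collections import deque

class OverheadBudget:
    """Tracks probe execution time inside a sliding wall-clock window."""

    def __init__(self, budget: float = 0.02, window_s: float = 1.0):
        self.budget = budget        # allowed fraction of GPU time (2% by default)
        self.window_s = window_s    # length of the measurement window in seconds
        self._runs = deque()        # (finish_time, duration) of recent probes

    def _probe_time_in_window(self, now: float) -> float:
        # Drop probe runs that finished before the window started.
        while self._runs and self._runs[0][0] <= now - self.window_s:
            self._runs.popleft()
        return sum(duration for _, duration in self._runs)

    def can_launch(self, now: float, estimated_s: float) -> bool:
        """True if a probe with the estimated duration fits in the remaining budget."""
        used = self._probe_time_in_window(now)
        return (used + estimated_s) / self.window_s <= self.budget

    def record(self, finish_time: float, duration_s: float) -> None:
        """Record a completed probe run."""
        self._runs.append((finish_time, duration_s))
```

A scheduler built on this sketch would call `can_launch` with the historical execution-time estimate before each launch and `record` after each completion.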
Purpose: Detect SDC that manifests as anomalous inference or training outputs, even when probes pass. This catches faults that are workload-dependent -- exercised only by specific input patterns or data-dependent control flow that probes do not cover.
The Exponentially Weighted Moving Average (EWMA) tracks key output statistics (logit means, variances, top-k token probabilities) over time. For each tracked statistic:
EWMA_t = alpha * x_t + (1 - alpha) * EWMA_{t-1}
Where alpha (default 0.01) controls the smoothing factor. A smaller alpha produces a smoother estimate that is slower to react to genuine distribution shifts, while a larger alpha is more responsive but more prone to false positives.
An anomaly is signaled when the current observation deviates from the EWMA by more than k standard deviations:
|x_t - EWMA_t| > k * sigma_EWMA
Where k (the sigma multiplier, default 3.5) controls sensitivity and sigma_EWMA is the EWMA-tracked standard deviation.
A warmup period (default 1000 samples) is enforced before anomaly detection activates, allowing the EWMA to converge to a stable baseline.
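The EWMA detector can be sketched as follows. This is an illustration under stated assumptions, not the monitor's actual code: the class name is hypothetical, the variance is tracked with a standard exponentially weighted recurrence, and each observation is tested against the baseline from before it was absorbed (so an outlier cannot mask itself).

```python
import math

class EwmaDetector:
    """EWMA mean/variance tracker with a k-sigma anomaly test and a warmup period."""

    def __init__(self, alpha: float = 0.01, k: float = 3.5, warmup: int = 1000):
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.mean = None   # EWMA of the statistic
        self.var = 0.0     # exponentially weighted variance
        self.n = 0         # number of observations seen

    def update(self, x: float) -> bool:
        """Feed one observation; return True if it is flagged as anomalous."""
        self.n += 1
        if self.mean is None:
            self.mean = x
            return False
        deviation = x - self.mean
        sigma = math.sqrt(self.var)
        anomalous = (
            self.n > self.warmup
            and sigma > 0.0
            and abs(deviation) > self.k * sigma
        )
        # Equivalent to mean = alpha * x + (1 - alpha) * mean.
        self.mean += self.alpha * deviation
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * deviation * deviation)
        return anomalous
```

One instance would be kept per tracked statistic (for example, logit mean) and per GPU.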
The Kullback-Leibler divergence measures how the current output distribution has drifted from a reference distribution:
D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))
Where P is the current distribution and Q is the reference. The reference distribution is estimated from a configurable sample count (default 10,000 samples) and updated periodically (default every 3600 seconds).
A sustained elevation in KL divergence that is not explained by a known model update or data distribution shift may indicate SDC-induced drift.
The inference monitor supports two density-estimation strategies for continuous distributions:
- Histogram (default): Fixed-width bins (default 256 bins) for logit or probability distributions.
- KDE: Kernel Density Estimation for smoother distribution estimates when sample counts are lower.
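For the histogram strategy, the KL computation can be sketched as below. The additive smoothing constant `eps` is an assumption introduced here to avoid `log(0)` on empty bins; the source does not say how empty bins are handled. Function names are hypothetical.

```python
import math

def histogram(samples, lo: float, hi: float, bins: int = 256):
    """Fixed-width histogram; out-of-range samples are clamped to the edge bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in samples:
        idx = min(bins - 1, max(0, int((s - lo) / width)))
        counts[idx] += 1
    return counts

def kl_divergence(p_counts, q_counts, eps: float = 1e-9) -> float:
    """D_KL(P || Q) in nats, where P is the current and Q the reference histogram."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    d = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        d += p * math.log(p / q)
    return d
```

Both histograms must use the same bin edges, so `lo`, `hi`, and `bins` would be fixed when the reference distribution is built.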
Output entropy (Shannon entropy of the softmax distribution) is tracked over a rolling window (default 1000 samples). SDC in certain compute paths can cause:
- Entropy collapse: Outputs become more confident (peaked distributions) due to corruption in normalization layers.
- Entropy explosion: Outputs become uniform due to corruption producing garbage logits.
Both directions are monitored. A sudden entropy shift that is not correlated with a model change is flagged as anomalous.
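As a reference point for the two failure directions, the Shannon entropy of a softmax distribution can be computed as in this sketch (the function name is hypothetical). Entropy ranges from near 0 for a fully peaked distribution to `log(V)` for a uniform distribution over `V` classes.

```python
import math

def softmax_entropy(logits) -> float:
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)                          # subtract the max to avoid overflow
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    # Terms whose probability underflows to zero contribute nothing.
    return -sum((e / z) * math.log(e / z) for e in exps if e > 0.0)
```

Entropy collapse would show up as values drifting toward 0, and entropy explosion as values drifting toward `log(V)`.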
When enabled (spectral.enabled: true), the inference monitor applies FFT-based spectral analysis to sequences of output statistics. This detects periodic or quasi-periodic corruption patterns that may indicate a clock-domain fault or a defect that activates only under specific thermal/voltage conditions.
An FFT window of 1024 samples (configurable, must be power of 2) is used. Anomalous spectral peaks -- frequencies that appear in the output statistics but have no corresponding frequency in the input distribution -- are flagged.
This detector has higher overhead than the others and is disabled by default.
Two-sample KS tests compare the distribution of outputs from a specific GPU against the fleet-wide distribution. A statistically significant difference (after Bonferroni correction for multiple comparisons across the fleet) suggests that one GPU is computing differently from its peers.
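A minimal sketch of the per-GPU test follows. It uses the standard large-sample critical value for the two-sample KS statistic rather than an exact p-value, and applies the Bonferroni correction by dividing the significance level by the number of GPUs compared. The function names and the default significance level are assumptions.

```python
import math
from bisect import bisect_right

def ks_statistic(a, b) -> float:
    """Maximum distance between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

def ks_reject(a, b, alpha: float = 0.05, num_comparisons: int = 1) -> bool:
    """True if the samples differ significantly after Bonferroni correction."""
    corrected = alpha / num_comparisons
    # Asymptotic critical value: c(alpha) * sqrt((n + m) / (n * m)).
    c = math.sqrt(-0.5 * math.log(corrected / 2.0))
    critical = c * math.sqrt((len(a) + len(b)) / (len(a) * len(b)))
    return ks_statistic(a, b) > critical
```

Here `a` would hold one GPU's sampled statistic and `b` the pooled fleet samples, with `num_comparisons` set to the number of GPUs tested in the same pass.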
Monitoring every inference request or training step would impose unacceptable overhead. The inference monitor uses configurable sampling:
- Inference: Default 1% sampling rate (`sampling_rate: 0.01`). For a server processing 10,000 requests/second, this means ~100 samples/second -- sufficient for statistical power while adding minimal latency.
- Training: Default 10% sampling rate (`sampling_rate: 0.1`). Training steps are less frequent but more expensive, so a higher sampling rate is feasible.
The sampling decision is made at the interceptor level before any analysis, so unsampled requests incur zero monitoring overhead.
Purpose: Provide definitive, ground-truth validation by running the same computation on three GPUs and comparing results via majority vote. TMR is the gold standard for fault detection but is expensive (3x compute cost), so SENTINEL uses it selectively.
Rather than running TMR on all production traffic, SENTINEL periodically selects "canary batches" -- representative inference requests or training micro-steps -- and replicates them across three GPUs.
The TMR scheduler (correlation-engine/src/trust/tmr_scheduler.rs) selects GPU triples using one of three strategies:
- `round_robin`: Systematic rotation through all GPUs ensures every GPU is tested over time.
- `random`: Random GPU triple selection provides unbiased coverage.
- `suspect_priority` (default): GPUs with lower reliability scores are tested more frequently. A GPU in SUSPECT state might be included in every TMR round, while a HEALTHY GPU with a perfect record is tested only occasionally.
TMR canaries run at a configurable interval (default every 600 seconds) with a per-GPU timeout of 30 seconds.
When a TMR canary completes:
- Results from all three GPUs are collected.
- If all three agree (bit-exact after accounting for tolerances), the canary passes.
- If two agree and one dissents, the dissenting GPU is flagged. The dissent event is a strong signal and directly updates the Bayesian model with a weighted failure.
- If all three disagree, the canary is inconclusive and logged for manual review.
Voting operates on fingerprints of the output tensors (xxHash for speed, or SHA-256 when a cryptographic hash is required) rather than on the raw tensor data, which minimizes network bandwidth.
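The voting rules above can be sketched as follows. SHA-256 from the standard library stands in for the fingerprint function; the function names and verdict strings are hypothetical.

```python
import hashlib
from collections import Counter

def fingerprint(tensor_bytes: bytes) -> str:
    """Fingerprint of a serialized output tensor."""
    return hashlib.sha256(tensor_bytes).hexdigest()

def tmr_vote(fingerprints: dict) -> tuple:
    """Majority vote over a mapping of GPU id -> fingerprint for exactly three GPUs.

    Returns (verdict, dissenting GPU id or None).
    """
    if len(fingerprints) != 3:
        raise ValueError("TMR requires results from exactly three GPUs")
    counts = Counter(fingerprints.values())
    winner, votes = counts.most_common(1)[0]
    if votes == 3:
        return ("pass", None)
    if votes == 2:
        dissenter = next(g for g, fp in fingerprints.items() if fp != winner)
        return ("dissent", dissenter)
    return ("inconclusive", None)  # all three disagree: log for manual review
```

Because only fixed-size digests are exchanged, the comparison cost is independent of tensor size.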
The trust graph (correlation-engine/src/trust/trust_graph.rs) tracks pairwise agreement history between GPUs. Each edge in the graph represents the number of times two GPUs agreed (or disagreed) in TMR rounds.
Over time, a GPU with a hardware defect will accumulate disagreements with many partners, while healthy GPUs will consistently agree with each other. The trust graph provides a secondary signal to the Bayesian model: a GPU that disagrees with many different partners is more likely to be faulty than one that disagreed with only one specific partner (which might indicate a fault in the other GPU).
Trust graph snapshots are persisted at configurable intervals (default every 3600 seconds) for historical analysis and forensics.
Purpose: Leverage GPU-internal diagnostic capabilities that are not accessible from normal CUDA kernels. This layer operates below the driver level.
SENTINEL's probe agent can trigger NVIDIA's GPU self-test utilities (where available through the NVML API) to run hardware-level diagnostics. These tests exercise internal GPU components (memory controllers, NVLink transceivers, PCIe interfaces) that are not reachable from user-mode CUDA kernels.
The built-in self-test (BIST) is typically run only during deep-test phases (when a GPU is already quarantined) because it is destructive to running workloads.
The telemetry subsystem (probe-agent/src/telemetry/thermal_monitor.cpp) continuously monitors GPU junction temperature, hotspot temperature, and thermal throttling events via NVML.
Thermal data is correlated with probe failures to detect thermally activated defects -- faults that manifest only when the GPU exceeds a certain temperature. This is a common failure mode for small-delay defects: timing margins shrink at higher temperatures, and a path that meets timing at 60 °C may fail at 85 °C.
Voltage monitoring (probe-agent/src/telemetry/voltage_monitor.cpp) tracks GPU core voltage, memory voltage, and any undervoltage/overvoltage events. Like thermal monitoring, voltage data is correlated with probe failures.
Some SDC-inducing defects are voltage-marginal: the path works at nominal voltage but fails when voltage droops under heavy load (di/dt events). By correlating probe failures with voltage telemetry, SENTINEL can identify voltage-sensitive faults.
Purpose: Detect data corruption that occurs anywhere in the pipeline, not just within the GPU arithmetic units. This includes corruption during data movement (PCIe, NVLink, host memory), storage I/O, and checkpoint serialization.
For workloads with known mathematical invariants, SENTINEL can verify those invariants at runtime:
- Softmax normalization: Output probabilities must sum to 1.0 (within floating-point tolerance). A softmax output that sums to 0.97 or 1.15 indicates corruption in the normalization path.
- Attention weight conservation: In transformer architectures, attention weights for each query should sum to 1.0 across all keys.
- Gradient norm consistency: In data-parallel training, gradient norms across ranks should be approximately equal after all-reduce. A rank whose gradient norm diverges significantly from peers is suspect.
- Loss function monotonicity: While training loss is not strictly monotonic, sustained anomalous loss trajectories (spikes that do not correlate with learning rate schedules or data distribution changes) may indicate corruption.
Invariant violations are reported as high-severity anomaly events to the correlation engine.
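As an illustration of the softmax and attention invariants, the sketch below flags rows whose entries do not sum to 1.0. The tolerance value and the function name are assumptions; a real tolerance would depend on the output precision.

```python
def row_sum_violations(rows, tol: float = 1e-3):
    """Return the indices of rows whose entries do not sum to 1.0 within tol.

    The comparison is written as `not (... <= tol)` so that a NaN sum,
    which fails every comparison, is also reported as a violation.
    """
    return [i for i, row in enumerate(rows) if not (abs(sum(row) - 1.0) <= tol)]
```

The same check applies to attention weights by passing one row per query.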
End-to-end data integrity is verified by computing cryptographic checksums at critical pipeline stages:
- Checkpoint integrity: Training checkpoints are hashed (BLAKE3 or SHA-256) at write time and verified at read time. The training monitor supports both `verify_on_save` and `verify_on_load` modes.
- Tensor fingerprinting: Output tensors can be fingerprinted using xxHash (fast, non-cryptographic) or SHA-256 (slower, cryptographic) for TMR comparison and historical anomaly tracking.
- Weight integrity: Model weights loaded for inference can be verified against a known-good hash to detect bit-flips during model loading or GPU-to-GPU transfer.
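A hash-at-write, verify-at-read check can be sketched with the standard library. SHA-256 is used here because BLAKE3 is not part of the Python standard library; the function names are hypothetical, and the stream can be any binary file object (for example, one returned by `open(path, "rb")`).

```python
import hashlib
import io

def hash_stream(stream: io.BufferedIOBase, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a binary stream, read in chunks so large checkpoints fit in memory."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

def verify_stream(stream: io.BufferedIOBase, expected_hex: str) -> bool:
    """True if the stream's content matches the hash recorded at write time."""
    return hash_stream(stream) == expected_hex
```

The digest computed at save time would be stored alongside the checkpoint metadata and passed as `expected_hex` at load time.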
The audit ledger provides storage-level integrity via its Merkle hash chain (see Audit Ledger below). Additionally, the training monitor validates checkpoint integrity across the entire storage path: GPU memory -> host memory -> filesystem/object storage -> load -> GPU memory.
Location: probe-agent/
The probe agent is a C++/CUDA binary that runs on every GPU node. It is responsible for executing probe kernels, collecting GPU telemetry, and streaming results to the correlation engine.
+---------------------------------------------------------------+
| Probe Agent Process |
| |
| +------------------+ +------------------+ |
| | Main Thread | | gRPC Client | |
| | - Config load | | Thread | |
| | - Scheduler | | - Batching | |
| | - Signal | | - Streaming | |
| | handling | | - Reconnect | |
| +--------+---------+ +--------+---------+ |
| | ^ |
| | schedule | results |
| v | |
| +--------+---------+ +--------+---------+ |
| | Probe Worker | | Telemetry | |
| | Thread(s) | | Collector | |
| | - CUDA stream | | Thread | |
| | per GPU | | - NVML polling | |
| | - Kernel launch | | - NVLink stats | |
| | - Result verify | | - PCIe stats | |
| +------------------+ +------------------+ |
+---------------------------------------------------------------+
- Main thread: Loads configuration, initializes the scheduler, handles POSIX signals (SIGTERM for graceful shutdown, SIGHUP for config reload).
- Probe worker threads: One per GPU on multi-GPU nodes. Each thread owns a low-priority CUDA stream and executes probe kernels sequentially according to the schedule. CUDA graphs are optionally used to reduce launch overhead.
- Telemetry collector thread: Polls NVML at a configurable interval (default 10 seconds) for temperature, power, ECC counters, NVLink/PCIe error counters, and retired page counts.
- gRPC client thread: Batches probe results (default batch size: 64, max flush interval: 1000ms) and streams them to the correlation engine over a persistent gRPC connection. Handles reconnection with exponential backoff if the connection drops.
The scheduler maintains a min-heap of probes ordered by their next-due timestamp. At each tick:
- Pop the next due probe.
- Check the overhead budget. If launching this probe would exceed the budget in the current 1-second measurement window, defer it by one period.
- Launch the probe kernel on the appropriate CUDA stream with SM affinity.
- Wait for completion (or timeout).
- Compare results against golden answers using the configured tolerance mode.
- Package the result (pass/fail, per-SM results, timing, telemetry snapshot) and enqueue it for the gRPC client thread.
- Re-insert the probe into the heap with `next_due = now + period`.
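The tick loop above can be sketched in Python with `heapq`, although the actual scheduler is C++. The names are hypothetical, and the kernel launch and budget check are passed in as callables so the sketch stays self-contained.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class ScheduledProbe:
    next_due: float                          # only field used for heap ordering
    name: str = field(compare=False)
    period_s: float = field(compare=False)

def tick(heap, now: float, run_probe, budget_ok):
    """Process every probe due at `now`; return a list of (name, passed)."""
    due = []
    while heap and heap[0].next_due <= now:
        due.append(heapq.heappop(heap))
    results = []
    for probe in due:
        if budget_ok(probe):
            results.append((probe.name, run_probe(probe)))
        # Whether it ran or was deferred, the probe is due again one period later.
        probe.next_due = now + probe.period_s
        heapq.heappush(heap, probe)
    return results
```

Collecting the due probes before re-inserting any of them keeps a probe from being popped twice in the same tick.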
In addition to probe results, the agent collects GPU telemetry that is correlated with probe outcomes by the correlation engine:
- Thermal: Junction temperature, hotspot delta, throttle state, fan speed.
- Power: Instantaneous power draw, voltage, current.
- Memory: ECC corrected/uncorrected error counts, retired page counts.
- Interconnect: NVLink CRC error counts, PCIe replay counts.
- Utilization: SM utilization, memory bandwidth utilization (to contextualize overhead measurements).
Location: inference-monitor/
The inference monitor is a Python library that integrates with inference serving frameworks to sample and analyze model outputs for SDC anomalies.
The inference monitor uses a pluggable interceptor pattern to support multiple inference frameworks:
+------------------------------------------------------------------+
| Inference Server Process |
| |
| +--------------------+ |
| | Framework | +----------------------------+ |
| | (vLLM, TRT-LLM, | ----> | Interceptor | |
| | Triton, Generic) | | - Sample decision | |
| +--------------------+ | - Tensor capture | |
| +-------------+--------------+ |
| | |
| +-------------v--------------+ |
| | Analyzer Pipeline | |
| | +------------------------+ | |
| | | Logit Analyzer (EWMA) | | |
| | +------------------------+ | |
| | | KL Divergence | | |
| | +------------------------+ | |
| | | Entropy Analyzer | | |
| | +------------------------+ | |
| | | Spectral Analyzer | | |
| | +------------------------+ | |
| | | Statistical Tests (KS) | | |
| | +------------------------+ | |
| +-------------+--------------+ |
| | |
| +-------------v--------------+ |
| | gRPC Client (batched) | |
| +----------------------------+ |
+------------------------------------------------------------------+
Available interceptors:
| Interceptor | Framework | Integration Method |
|---|---|---|
| `vllm_interceptor.py` | vLLM | Monkey-patches the sampler output path |
| `trtllm_interceptor.py` | TensorRT-LLM | Hooks into the TRT-LLM Python runtime |
| `triton_interceptor.py` | Triton Inference Server | gRPC/HTTP model wrapper |
| `generic_interceptor.py` | Any framework | Wraps model `__call__` or `forward` |
Each interceptor:
- Decides whether to sample the current request (based on `sampling_rate`).
- If sampling, captures the output tensors (logits, probabilities, or generated tokens).
- Computes a fingerprint (xxHash or SHA-256) for TMR comparison.
- Passes the captured data to the analyzer pipeline.
Analyzers run sequentially on each sampled output. Each analyzer maintains its own state (running statistics, reference distributions) and independently decides whether the current sample is anomalous.
Anomaly events from any analyzer are batched (default batch size: 32, flush interval: 500ms) and streamed to the correlation engine. Each event includes:
- The anomaly type and severity.
- The GPU UUID that produced the output.
- The analyzer-specific evidence (e.g., EWMA deviation magnitude, KL divergence value).
- A timestamp and request identifier.
Location: training-monitor/
The training monitor is a Python library that hooks into training frameworks (PyTorch and JAX) to monitor training health and detect SDC-induced anomalies.
```python
import sentinel_training

# Wrap your model
monitor = sentinel_training.pytorch.TrainingMonitor(config_path="sentinel.yaml")
monitor.attach(model, optimizer)

# Training loop proceeds normally -- hooks fire automatically
for batch in dataloader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = model(batch)
    loss.backward()
    optimizer.step()
```

The PyTorch integration uses the following hooks:

| Hook | What It Monitors |
|---|---|
| `hooks.py` | Registers `register_full_backward_hook` on all parameters to intercept gradient tensors |
| `gradient_monitor.py` | Tracks per-parameter and global gradient L2 norms; flags spikes > k*sigma above rolling mean |
| `loss_monitor.py` | Tracks loss trajectory; detects spikes accounting for learning rate schedule |
| `ddp_divergence.py` | Compares gradient norms across DDP ranks via NCCL/Gloo all-reduce; flags divergence > threshold |
| `checkpoint_validator.py` | Hashes checkpoints at save/load time using BLAKE3 or SHA-256 |

The JAX integration provides the following modules:

| Module | What It Monitors |
|---|---|
| `gradient_monitor.py` | Monitors gradient norms via JAX's `jax.grad` wrapper |
| `pjit_monitor.py` | Hooks into `pjit`-compiled functions to monitor sharded computation |
| `transforms.py` | Custom JAX transforms for output fingerprinting |

In data-parallel training, all ranks should compute approximately the same gradient norms (they process different data but the model is synchronized via all-reduce). The cross-rank monitor:
- At configurable intervals (default every 10 steps), each rank broadcasts its local gradient L2 norm.
- The monitor computes the max relative divergence: `max(|norm_i - median|) / median`.
- If divergence exceeds the threshold (default 5%), an anomaly event is raised identifying the outlier rank and its associated GPU.
This is one of the most powerful SDC detection signals for training workloads: a single corrupted GPU in a DDP group will produce different gradients from all peers, and this divergence is measurable even when the corruption is small.
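The divergence check can be sketched as below, assuming the per-rank norms have already been gathered into a list indexed by rank. The function name is hypothetical.

```python
from statistics import median

def find_divergent_ranks(norms, threshold: float = 0.05):
    """Return the ranks whose gradient norm deviates from the median by more
    than `threshold` (relative). `norms[i]` is the L2 norm reported by rank i."""
    med = median(norms)
    if med == 0:
        # Relative divergence is undefined; any non-zero norm is an outlier.
        return [rank for rank, n in enumerate(norms) if n != 0]
    return [rank for rank, n in enumerate(norms) if abs(n - med) / med > threshold]
```

Using the median rather than the mean keeps a single corrupted rank from shifting the baseline it is compared against.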
Location: correlation-engine/
The correlation engine is the central intelligence of SENTINEL. It ingests events from all probe agents and monitors, applies statistical and causal analysis, maintains per-GPU reliability scores, and makes quarantine decisions.
Each GPU maintains a Bayesian belief state modeled as a Beta distribution:
P(GPU is reliable | evidence) ~ Beta(alpha, beta)
Where:
- `alpha` = prior_alpha + number of successful probes (weighted)
- `beta` = prior_beta + number of failures and anomalies (weighted)
The reliability score is the mean of the Beta distribution:
reliability = alpha / (alpha + beta)
The confidence interval is computed from the Beta CDF. The 95% lower bound of the reliability score is used for quarantine decisions, ensuring that GPUs are not quarantined due to statistical noise when few observations are available.
Prior selection: The default prior is Beta(100, 1), which encodes a strong prior belief that GPUs are reliable. This requires substantial evidence of failure before the reliability score drops significantly. The prior is configurable to match fleet-specific base rates.
Evidence weighting:
- A probe failure contributes weight 1.0 to beta (strong, direct evidence).
- An anomaly event from the inference or training monitor contributes weight 0.3 to beta (weaker, indirect evidence that could have non-SDC explanations).
- A successful probe contributes weight 1.0 to alpha.
- A TMR dissent contributes weight 2.0 to beta (very strong evidence -- the GPU disagreed with two independent peers).
Time decay: Old observations decay with a configurable half-life (default 168 hours / 7 days). This allows a GPU that experienced a transient issue to recover its reliability score over time, while persistent defects continue to accumulate evidence. Decay is implemented by scaling historical alpha/beta contributions by 2^(-t/half_life) where t is the age of the observation.
Per-SM Granularity: The model optionally tracks per-SM beliefs within each GPU, enabling identification of individual faulty SMs. This is used in the probe scheduling feedback loop: an SM with a lower reliability score is probed more frequently.
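The weighted update and time decay can be sketched as follows. This is an illustration, not the engine's Rust implementation: the class and event names are hypothetical, only the mean of the Beta distribution is computed (the lower confidence bound used for quarantine decisions would need the Beta quantile function), and the prior is assumed not to decay.

```python
from dataclasses import dataclass, field

# (weight added to alpha, weight added to beta) per event kind, from the text above.
WEIGHTS = {
    "probe_pass": (1.0, 0.0),
    "probe_fail": (0.0, 1.0),
    "anomaly": (0.0, 0.3),
    "tmr_dissent": (0.0, 2.0),
}

@dataclass
class GpuBelief:
    prior_alpha: float = 100.0
    prior_beta: float = 1.0
    half_life_h: float = 168.0
    events: list = field(default_factory=list)   # (time in hours, event kind)

    def observe(self, time_h: float, kind: str) -> None:
        if kind not in WEIGHTS:
            raise ValueError(f"unknown event kind: {kind}")
        self.events.append((time_h, kind))

    def posterior(self, now_h: float):
        """Return (alpha, beta) with each observation decayed by its age."""
        alpha, beta = self.prior_alpha, self.prior_beta
        for t, kind in self.events:
            decay = 2.0 ** (-(now_h - t) / self.half_life_h)
            d_alpha, d_beta = WEIGHTS[kind]
            alpha += decay * d_alpha
            beta += decay * d_beta
        return alpha, beta

    def reliability(self, now_h: float) -> float:
        alpha, beta = self.posterior(now_h)
        return alpha / (alpha + beta)
```

Under the default Beta(100, 1) prior, the mean starts at 100/101 (about 0.990). With no intervening passes, it falls below 0.95 only once the accumulated failure weight pushes beta above roughly 5.26, which is why a single probe failure does not change a GPU's state on its own.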
+------------------+
| |
| HEALTHY |<---------+
| score >= 0.95 | |
+--------+---------+ |
| |
score < 0.95 | | score >= 0.98
or | | AND 1000
anomalies | | consecutive
| | passes
+--------v---------+ |
| | |
| SUSPECT +----------+
| 0.80 <= score | (cleared)
+--------+---------+
|
score < 0.80|
or |
3 consecutive |
failures |
|
+--------v---------+
| |
| QUARANTINED +----------+
| removed from | |
| production | | passed
+--------+---------+ | deep test
| |
auto or | +---------+---------+
manual | | |
trigger +--------->+ DEEP_TEST |
| | BIST + stress |
| +---------+---------+
| |
max | | failed
quarantine| | deep test
time | |
exceeded | +---------v---------+
+--------->+ |
| CONDEMNED |
| permanent |
| removal |
+-------------------+
State transitions:
| From | To | Trigger | Reversible |
|---|---|---|---|
| HEALTHY | SUSPECT | Reliability score drops below 0.95, or multiple anomaly events in correlation window | Yes |
| SUSPECT | HEALTHY | Reliability score recovers above 0.98 AND 1000 consecutive probe passes | Yes |
| SUSPECT | QUARANTINED | Reliability score drops below 0.80, or 3 consecutive probe failures | Yes |
| QUARANTINED | DEEP_TEST | Automatic (scheduled) or manual operator trigger | Yes |
| QUARANTINED | CONDEMNED | Maximum quarantine time exceeded (default 720 hours / 30 days) | No |
| DEEP_TEST | HEALTHY | Deep test suite passes; reliability score reset to prior | Yes |
| DEEP_TEST | CONDEMNED | Deep test suite fails | No |
| CONDEMNED | (none) | Terminal state; requires hardware replacement | No |
All transitions are logged to the audit ledger with the triggering evidence, operator identity (if manual), and timestamp.
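The transition table can be expressed as a pure function, sketched below with the default thresholds. The function signature is hypothetical, and approval gates and audit logging are left out.

```python
from enum import Enum

class State(Enum):
    HEALTHY = "healthy"
    SUSPECT = "suspect"
    QUARANTINED = "quarantined"
    DEEP_TEST = "deep_test"
    CONDEMNED = "condemned"

def next_state(state, score, *, anomaly_burst=False, consecutive_failures=0,
               consecutive_passes=0, quarantine_hours=0.0,
               deep_test_requested=False, deep_test_passed=None):
    """Return the state that follows `state` given the current evidence."""
    if state is State.HEALTHY:
        if score < 0.95 or anomaly_burst:
            return State.SUSPECT
    elif state is State.SUSPECT:
        if score < 0.80 or consecutive_failures >= 3:
            return State.QUARANTINED
        if score >= 0.98 and consecutive_passes >= 1000:
            return State.HEALTHY
    elif state is State.QUARANTINED:
        if quarantine_hours > 720:           # default maximum quarantine time
            return State.CONDEMNED
        if deep_test_requested:
            return State.DEEP_TEST
    elif state is State.DEEP_TEST:
        if deep_test_passed is True:
            return State.HEALTHY
        if deep_test_passed is False:
            return State.CONDEMNED
    return state  # CONDEMNED is terminal; otherwise no trigger fired
```

Keeping the function free of side effects makes each transition easy to test and to replay from the audit ledger.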
Approval gates: The require_approval setting (default false) can require human approval for quarantine and reinstatement actions. When enabled, the state machine enters a pending state and emits an approval-request alert. The operator must approve via the dashboard or SDK before the transition completes.
The correlation engine uses configurable time windows (default 300 seconds) to correlate events from different sources. When multiple events arrive for the same GPU within a window, they are analyzed together:
- A probe failure AND an inference anomaly within the same window is a much stronger signal than either alone.
- Multiple probe failures across different probe types (e.g., FMA and Tensor Core) within a window suggest a systemic GPU issue rather than a single-unit defect.
- Probe failures across multiple GPUs on the same node within a window may indicate a node-level issue (power supply, cooling, PCIe bus).
The correlation buffer holds up to 100,000 events (configurable). Events older than the correlation window are flushed to storage and removed from the active buffer.
The pattern matcher (correlation-engine/src/correlation/pattern_matcher.rs) identifies fleet-wide failure patterns:
- Node-correlated failures: Multiple GPUs on the same node failing simultaneously, suggesting a shared-infrastructure issue.
- Topology-correlated failures: GPUs connected via the same NVLink switch or PCIe root complex failing together.
- Temporal clustering: A burst of failures across multiple GPUs fleet-wide, suggesting a software bug (driver, firmware) rather than hardware SDC.
- Probe-specific patterns: A single probe type failing across many GPUs, suggesting a golden answer error or probe bug rather than real SDC.
Pattern detection is critical for avoiding false-positive cascades where a software issue triggers fleet-wide quarantines.
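As a rough illustration of two of these patterns, the sketch below groups the failures from one window by node and by probe type. It is not taken from pattern_matcher.rs; the function name, the tuple layout, and both thresholds are assumptions chosen for the example.

```python
def classify_patterns(failures, min_gpus_per_node=2, min_gpus_per_probe=5):
    """Classify failures from one window.

    failures: iterable of (gpu_uuid, node, probe_type) tuples.
    Returns a list of (pattern_name, key) pairs.
    """
    gpus_by_node = {}
    gpus_by_probe = {}
    for gpu, node, probe in failures:
        gpus_by_node.setdefault(node, set()).add(gpu)
        gpus_by_probe.setdefault(probe, set()).add(gpu)

    patterns = []
    # Several distinct GPUs failing on one node points at shared infrastructure.
    for node, gpus in sorted(gpus_by_node.items()):
        if len(gpus) >= min_gpus_per_node:
            patterns.append(("node_correlated", node))
    # One probe type failing on many GPUs points at the probe or its golden answer.
    for probe, gpus in sorted(gpus_by_probe.items()):
        if len(gpus) >= min_gpus_per_probe:
            patterns.append(("probe_specific", probe))
    return patterns
```

A caller could hold back automatic quarantine for any GPU covered by a returned pattern and raise a single fleet-level alert instead, which is the false-positive cascade the text warns about.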
Location: audit-ledger/
The audit ledger is a tamper-evident, append-only log of every significant event in the SENTINEL system. It provides the compliance and forensics foundation.
Every audit entry is cryptographically chained to its predecessor:
Entry_N.hash = SHA-256( Entry_N.data || Entry_{N-1}.hash )
This means that modifying any historical entry would change its hash, which would break the chain for all subsequent entries. Verification requires only walking the chain forward and recomputing hashes.
The first entry in the chain uses a zero hash (0x0000...0000) as its predecessor.
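The chaining rule can be sketched in a few lines. This assumes each entry's data is already serialized to bytes and that the predecessor hash is the raw 32-byte digest; the function names are illustrative and the ledger's actual serialization is not specified here.

```python
import hashlib

ZERO_HASH = bytes(32)  # all-zero predecessor for the first entry


def entry_hash(data: bytes, prev_hash: bytes) -> bytes:
    """Entry_N.hash = SHA-256(Entry_N.data || Entry_{N-1}.hash)."""
    return hashlib.sha256(data + prev_hash).digest()


def build_chain(entries):
    """Return the list of chained hashes for a sequence of entry payloads."""
    hashes, prev = [], ZERO_HASH
    for data in entries:
        prev = entry_hash(data, prev)
        hashes.append(prev)
    return hashes


def first_broken_index(entries, stored_hashes):
    """Walk the chain forward; return the first mismatching index, or None."""
    prev = ZERO_HASH
    for i, (data, stored) in enumerate(zip(entries, stored_hashes)):
        if entry_hash(data, prev) != stored:
            return i
        prev = stored
    return None
```

Changing the data of entry `i` makes `first_broken_index` return `i`. An attacker who also rewrites the stored hash of entry `i` only moves the mismatch to entry `i + 1`, which is why tampering forces a rewrite of every later hash.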
For efficiency, entries are grouped into batches (default batch size: 1024 entries, flush interval: 60 seconds):
- Pending entries are sorted by timestamp for deterministic ordering.
- Each entry is chained to its predecessor (sequential hashing within the batch).
- A Merkle tree is computed over all entry hashes in the batch.
- The Merkle root is stored alongside the batch metadata.
The Merkle tree enables efficient integrity proofs: to prove that a specific entry exists and is unmodified, only O(log N) hashes are needed rather than the entire chain.
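A minimal sketch of the Merkle root and an O(log N) inclusion proof follows. The source does not state how odd-sized levels are padded or whether leaf and internal hashes are domain-separated, so this example assumes the simplest convention (duplicate the last node, no domain separation). A production design would need to pin those choices down.

```python
import hashlib


def _h(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(left + right).digest()


def _padded(level):
    """Duplicate the last node when a level has an odd number of nodes."""
    return level + [level[-1]] if len(level) % 2 else level


def merkle_levels(leaves):
    """Return every tree level, leaves first; the last level holds the root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        level = _padded(levels[-1])
        levels.append([_h(level[i], level[i + 1]) for i in range(0, len(level), 2)])
    return levels


def merkle_proof(levels, index):
    """Sibling hashes from leaf to root, each tagged with 'sibling is on the left'."""
    proof = []
    for level in levels[:-1]:
        level = _padded(level)
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_proof(leaf, proof, root):
    """Recompute the root from one leaf and its sibling path."""
    node = leaf
    for sibling, sibling_is_left in proof:
        node = _h(sibling, node) if sibling_is_left else _h(node, sibling)
    return node == root
```

For a default batch of 1024 entries the proof holds 10 sibling hashes, so a verifier needs 320 bytes plus the stored Merkle root instead of all 1024 entry hashes.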
The audit ledger supports two storage backends:
- ScyllaDB (default): Used for the time-series entry data. Provides horizontal scalability and configurable replication (default RF=3) with tunable consistency (QUORUM writes, QUORUM reads for audit queries).
- PostgreSQL: Used for metadata, batch summaries, and Merkle roots. Supports table partitioning for efficient retention management.
The audit metadata tables are partitioned by time range (audit-ledger/src/storage/migrations/003_partitioning.sql):
- Active partition: current month.
- Historical partitions: one per month.
- Expired partitions are dropped automatically when retention limits are reached.
The detail_retention_days setting (default 365 days) controls how long detailed entry data is kept. The summary_retention_days setting (default 0 = forever) controls Merkle root and batch summary retention. This separation allows operators to prune detailed data for storage efficiency while retaining the cryptographic proof chain indefinitely.
The audit ledger includes built-in compliance report generators:
- SOC 2 reports (audit-ledger/src/compliance/soc2_report.rs): Generates evidence reports mapped to SOC 2 Trust Services Criteria.
- ISO 27001 reports (audit-ledger/src/compliance/iso27001_report.rs): Generates evidence reports mapped to ISO 27001 Annex A controls.
Reports can be generated on-demand via the gRPC query service or scheduled for periodic generation.
The ledger runs automatic chain integrity checks at configurable intervals (default every 6 hours). Each check verifies the most recent N entries (default 10,000) by recomputing hashes and comparing against stored values. Any discrepancy triggers a CRITICAL alert.
GPU Node                                              Correlation Engine            Audit Ledger
--------                                              ------------------            ------------
Probe Agent  -- gRPC stream (ProbeResultBatch) -->    ProbeService
                                                          |
                                                          v
                                                      EventStore (ScyllaDB)
                                                          |
                                                          v
                                                      BayesianAttribution
                                                          |
                                                          v
                                                      QuarantineStateMachine
                                                          |
                                                          v
                                                      AlertManager
                                                          |
                                                          +-- gRPC -->              IngestService
                                                                                        |
                                                                                        v
                                                                                    ChainBuilder
                                                                                        |
                                                                                        v
                                                                                    ScyllaDB/PG
Inf. Monitor -- gRPC stream (AnomalyEventBatch) -->   AnomalyService
                                                          |
                                                          v
                                                      (same pipeline as above)
Train. Mon.  -- gRPC stream (AnomalyEventBatch) -->   AnomalyService
                                                          |
                                                          v
                                                      (same pipeline as above)
All agent-to-engine communication uses bidirectional gRPC streaming:
- Agent -> Engine: Probe results and anomaly events are streamed in batches. Batching reduces per-message overhead and network round-trips. Default batch sizes are 64 (probe agent) and 32 (monitors), with maximum flush intervals to ensure timely delivery even under low event rates.
- Engine -> Agent: Configuration updates are streamed via the ConfigService. When an operator changes a probe schedule or threshold via the dashboard, the update is pushed to all connected agents without requiring a restart.
The gRPC connections use persistent streams with keepalive pings (every 30 seconds) and automatic reconnection with exponential backoff (1s, 2s, 4s, 8s, max 60s).
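The reconnection schedule can be written as a small generator. This is a sketch of the stated sequence only; the function and parameter names are illustrative, and the source does not say whether jitter is applied.

```python
import itertools


def backoff_delays(initial=1.0, factor=2.0, cap=60.0):
    """Yield reconnect delays in seconds: 1, 2, 4, 8, ... capped at `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


print(list(itertools.islice(backoff_delays(), 8)))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

A client would create a fresh generator after each successful connection so that the next outage starts again at 1 second.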
Events are batched at multiple levels:
- Source batching: The probe agent and monitors batch events locally before sending (reduces gRPC call frequency).
- Correlation batching: The correlation engine buffers events in the temporal correlation window before making attribution decisions (ensures events from multiple sources for the same GPU are analyzed together).
- Audit batching: The audit ledger batches entries for efficient Merkle tree computation and storage writes (reduces per-entry overhead).
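Source batching with a size limit and a maximum flush interval can be sketched as follows. The class and its `tick` method are hypothetical; the 64-event size is the probe agent default mentioned earlier, while the 5-second age limit is a placeholder because the text does not give the default flush interval for agents.

```python
import time


class Batcher:
    """Collects events and emits a batch when it is full or too old."""

    def __init__(self, send, max_size=64, max_age_seconds=5.0, clock=time.monotonic):
        self.send = send                        # callable that takes a list of events
        self.max_size = max_size                # probe agent default from the text
        self.max_age_seconds = max_age_seconds  # placeholder value, not from the text
        self.clock = clock
        self.pending = []
        self.oldest = None                      # arrival time of the first pending event

    def add(self, event):
        if not self.pending:
            self.oldest = self.clock()
        self.pending.append(event)
        if len(self.pending) >= self.max_size:
            self.flush()

    def tick(self):
        """Call periodically so small batches still go out under low event rates."""
        if self.pending and self.clock() - self.oldest >= self.max_age_seconds:
            self.flush()

    def flush(self):
        if self.pending:
            batch, self.pending = self.pending, []
            self.send(batch)
```

The same size-or-age rule applies at the audit level, with 1024 entries and 60 seconds as the defaults.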
SENTINEL is designed to scale from small development clusters (a handful of GPUs) to large production fleets (10,000+ GPUs).
Probe agents scale trivially because they run independently on each node. There is no inter-agent communication. Each agent only communicates with the correlation engine.
- Per-node overhead: Probe agents are designed for < 2% GPU time overhead (configurable). CPU overhead is minimal (< 0.5 core per node).
- Multi-GPU nodes: A single probe agent process manages all GPUs on a node using one worker thread per GPU.
The correlation engine is the central bottleneck. Scaling strategies:
- Horizontal scaling: Multiple correlation engine replicas behind a load balancer, with consistent hashing by GPU UUID to ensure all events for a given GPU land on the same replica (necessary for stateful Bayesian tracking).
- HPA: In Kubernetes, the correlation engine Deployment uses Horizontal Pod Autoscaler based on CPU utilization and gRPC request rate.
- Partitioning: For very large fleets, GPUs can be partitioned across multiple correlation engine clusters by region, rack, or logical group. Each partition is independently managed.
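Routing by GPU UUID with consistent hashing can be sketched with a hash ring. This is a generic illustration, not SENTINEL's load-balancer configuration: the `HashRing` class, the replica names, and the virtual-node count are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Maps each GPU UUID to one replica; adding a replica moves few GPUs."""

    def __init__(self, replicas, vnodes=64):
        # Each replica owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (self._point(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _point(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def replica_for(self, gpu_uuid: str) -> str:
        """Pick the first ring point clockwise from the UUID's hash."""
        i = bisect.bisect_right(self._points, self._point(gpu_uuid)) % len(self._ring)
        return self._ring[i][1]
```

When a replica is added, only the GPUs that now hash to the new replica change owner. Their Bayesian state would still have to be handed over or rebuilt, which this sketch does not cover.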
Sizing estimates:
| Fleet Size | Correlation Engine Replicas | CPU Cores | Memory |
|---|---|---|---|
| 64 GPUs | 1 | 2 | 2 GB |
| 256 GPUs | 1 | 4 | 4 GB |
| 1,000 GPUs | 2-3 | 4 each | 4 GB each |
| 4,000 GPUs | 4-6 | 4 each | 8 GB each |
| 10,000 GPUs | 8-12 | 8 each | 16 GB each |
- ScyllaDB: Scales horizontally by adding nodes to the cluster. Time-series data (probe results, anomaly events) is the highest-volume data and benefits from ScyllaDB's write-optimized architecture.
- PostgreSQL: Handles state and metadata queries. For fleets > 5,000 GPUs, consider read replicas for dashboard queries.
- Redis: Used for caching (current GPU states, active correlation windows) and pub/sub (real-time dashboard updates). A single Redis instance handles fleets up to ~5,000 GPUs; Redis Cluster for larger deployments.
The audit ledger is designed as a single-writer system to maintain sequential hash chain integrity. Scaling strategies:
- Read replicas: Multiple read-only replicas serve compliance queries and dashboard requests.
- Batch throughput: With the default settings (1024 entries per batch, 60-second flush interval), one full batch per interval corresponds to about 17 entries/second (1024 / 60); higher rates require larger batches. At 10,000 GPUs with 7 probe types and 60-second probe intervals, the sustained event rate is approximately 1,167 events/second (10,000 × 7 / 60), or roughly 70,000 events per flush interval. This remains within batch processing capacity with appropriately sized batches.
- Storage partitioning: Time-based partitioning in PostgreSQL and TTL-based expiry in ScyllaDB keep storage costs manageable.
Each probe result message is approximately 200 bytes. Assuming 7 probe types per GPU and 60-second probe intervals, at fleet scale:
| Fleet Size | Probe Events/sec | Bandwidth (probe results only) |
|---|---|---|
| 64 GPUs | ~7 | ~1.5 KB/s |
| 1,000 GPUs | ~117 | ~23 KB/s |
| 10,000 GPUs | ~1,167 | ~230 KB/s |
Network bandwidth for SENTINEL telemetry is negligible relative to production GPU cluster traffic.
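The figures in the table follow from simple arithmetic, reproduced below. The constants are the values used in this section (7 probe types, 60-second probe intervals, roughly 200 bytes per message); the function names are illustrative.

```python
PROBE_TYPES = 7         # probe types per GPU
PROBE_INTERVAL_S = 60   # seconds between runs of each probe
MESSAGE_BYTES = 200     # approximate size of one probe result


def probe_events_per_second(gpus: int) -> float:
    return gpus * PROBE_TYPES / PROBE_INTERVAL_S


def bandwidth_kb_per_second(gpus: int) -> float:
    return probe_events_per_second(gpus) * MESSAGE_BYTES / 1000


for gpus in (64, 1_000, 10_000):
    print(gpus, round(probe_events_per_second(gpus)), round(bandwidth_kb_per_second(gpus), 1))
# 64 7 1.5
# 1000 117 23.3
# 10000 1167 233.3
```

The table rounds the last bandwidth figure down to ~230 KB/s. Anomaly events from the inference and training monitors are not included in these numbers.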
All gRPC connections between SENTINEL components use mutual TLS by default:
- Each component has its own TLS certificate and private key.
- The CA certificate is shared across all components.
- Clients verify the server certificate; servers verify the client certificate.
- TLS 1.3 is required; earlier versions are rejected.
Certificate paths are configured in sentinel.yaml under each component's tls section.
Probe results are HMAC-signed by the probe agent before transmission:
signature = HMAC-SHA256(probe_result_bytes, shared_secret)
The correlation engine verifies the HMAC before processing any probe result. This prevents an attacker from injecting false probe results to mask a compromised GPU or trigger false quarantines.
The HMAC key should be injected via a secrets manager (e.g., Kubernetes Secrets, HashiCorp Vault) rather than stored in configuration files. The hmac_key field in sentinel.yaml is intentionally left empty as a reminder.
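Signing and verification can be sketched with Python's standard `hmac` module. The function names are illustrative, and how the probe result is serialized to bytes is not specified here. Note that `hmac.new` takes the key first, the reverse of the argument order in the formula above.

```python
import hashlib
import hmac


def sign(probe_result_bytes: bytes, shared_secret: bytes) -> bytes:
    """Compute HMAC-SHA256 over the serialized probe result."""
    return hmac.new(shared_secret, probe_result_bytes, hashlib.sha256).digest()


def verify(probe_result_bytes: bytes, signature: bytes, shared_secret: bytes) -> bool:
    """Recompute the HMAC and compare in constant time."""
    expected = sign(probe_result_bytes, shared_secret)
    return hmac.compare_digest(expected, signature)
```

Using `hmac.compare_digest` rather than `==` avoids leaking, through timing, how many leading bytes of a forged signature were correct. In a deployment the secret would be read from the secrets manager at startup rather than hard-coded.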
The SENTINEL dashboard and SDK enforce RBAC:
| Role | Permissions |
|---|---|
| viewer | Read-only access to fleet health, event history, audit trail |
| operator | Viewer permissions + manual quarantine/reinstatement, threshold adjustment |
| admin | Operator permissions + configuration changes, RBAC management, compliance report generation |
| auditor | Read-only access to audit trail and compliance reports; cannot view or modify operational settings |
RBAC is enforced at the gRPC service level in the correlation engine and audit ledger.
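The role hierarchy in the table can be modeled as sets of permissions. The permission strings below are hypothetical names invented for this sketch; only the roles and their relationships come from the table.

```python
VIEWER = frozenset({"read_fleet_health", "read_event_history", "read_audit_trail"})
OPERATOR = VIEWER | {"manual_quarantine", "manual_reinstatement", "adjust_thresholds"}
ADMIN = OPERATOR | {"change_configuration", "manage_rbac", "generate_compliance_reports"}
# The auditor role is separate from the viewer -> operator -> admin ladder.
AUDITOR = frozenset({"read_audit_trail", "read_compliance_reports"})

ROLE_PERMISSIONS = {
    "viewer": VIEWER,
    "operator": OPERATOR,
    "admin": ADMIN,
    "auditor": AUDITOR,
}


def is_allowed(role: str, permission: str) -> bool:
    """Unknown roles get no permissions (deny by default)."""
    return permission in ROLE_PERMISSIONS.get(role, frozenset())
```

A gRPC service could run a check like `is_allowed(role, "manual_quarantine")` in a server interceptor before dispatching the call, which matches enforcement at the service level.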
The SENTINEL Kubernetes deployment includes network policies (deploy/kubernetes/network-policies.yaml) that restrict inter-component communication:
- Probe agents can only communicate with the correlation engine (port 50051).
- The correlation engine can communicate with the audit ledger (port 50052), data stores (PostgreSQL 5432, ScyllaDB 9042, Redis 6379), and the metrics endpoint.
- The audit ledger can communicate with data stores only.
- The dashboard can communicate with the correlation engine and audit ledger query services only.
- No component can communicate with external networks except for configured alerting channels (Slack webhook, PagerDuty API, SMTP).
The hash chain and per-batch Merkle roots in the audit ledger provide tamper evidence. Even if an attacker gains access to the database, modifying a historical entry without detection requires recomputing the entire hash chain from the modified entry forward. Any external verifier holding an earlier Merkle root would still detect the tampering.
Periodic chain verification (default every 6 hours) provides automated detection of any integrity violations.