- **Package:** sentinel-sdk
- **Status:** Pre-release Alpha
- **Minimum Python:** 3.10+
- **License:** Apache-2.0
The SENTINEL Python SDK provides both synchronous and asynchronous interfaces for querying GPU health, fleet status, audit trails, trust graphs, and managing quarantine directives in a SENTINEL deployment.
- Installation
- Quick Start
- Authentication
- Client API Reference
- Type Reference
- Sync vs Async Usage Patterns
- Error Handling
- Configuration via Environment Variables
- Integration Examples
- Troubleshooting
- Version Compatibility
## Installation

```
pip install sentinel-sdk
```

The SDK depends on:

- `grpcio >= 1.60.0`
- `grpcio-tools >= 1.60.0` (for protobuf stubs)
- `pydantic >= 2.0`
- `protobuf >= 4.25.0`

For async support, no additional dependencies are required -- grpcio ships with grpc.aio built in.

To install from source:

```
git clone https://github.com/sentinel-sdc/sentinel.git
cd sentinel/sdk/python
pip install -e ".[dev]"
```

## Quick Start

Synchronous usage:

```python
from sentinel_sdk import SentinelClient, TlsConfig

# Connect to the SENTINEL correlation engine.
client = SentinelClient.connect(
    "sentinel.example.com:443",
    tls_config=TlsConfig(ca_cert_path="/etc/sentinel/ca.pem"),
)

# Check a single GPU.
health = client.query_gpu_health("GPU-abcd1234-5678-9abc-def0-123456789abc")
print(f"GPU state: {health.state.name}, reliability: {health.reliability_score:.4f}")

# Check the entire fleet.
fleet = client.query_fleet_health()
print(f"Fleet: {fleet.healthy}/{fleet.total_gpus} healthy, SDC rate: {fleet.overall_sdc_rate:.6f}")

client.close()
```

Asynchronous usage:

```python
import asyncio
from sentinel_sdk import SentinelClient, TlsConfig

async def main():
    client = await SentinelClient.aconnect(
        "sentinel.example.com:443",
        tls_config=TlsConfig(ca_cert_path="/etc/sentinel/ca.pem"),
    )
    health = await client.aquery_gpu_health("GPU-abcd1234-5678-9abc-def0-123456789abc")
    print(f"GPU state: {health.state.name}")
    await client.aclose()

asyncio.run(main())
```

Both clients support context managers:

```python
with SentinelClient.connect("localhost:50051") as client:
    fleet = client.query_fleet_health()
    print(f"{fleet.total_gpus} GPUs tracked")
```

```python
async with await SentinelClient.aconnect("localhost:50051") as client:
    fleet = await client.aquery_fleet_health()
    print(f"{fleet.total_gpus} GPUs tracked")
```

## Authentication

SENTINEL supports three connection modes:
Insecure (development only):

```python
client = SentinelClient.connect("localhost:50051")
```

No TLS. Suitable only for local development. Never use in production.

Server-side TLS:

```python
client = SentinelClient.connect(
    "sentinel.example.com:443",
    tls_config=TlsConfig(
        ca_cert_path="/etc/sentinel/ca.pem",
    ),
)
```

The client verifies the server certificate against the provided CA. The server does not authenticate the client.

Mutual TLS (mTLS):

```python
client = SentinelClient.connect(
    "sentinel.example.com:443",
    tls_config=TlsConfig(
        ca_cert_path="/etc/sentinel/ca.pem",
        client_cert_path="/etc/sentinel/client.pem",
        client_key_path="/etc/sentinel/client-key.pem",
    ),
)
```

Both client and server authenticate each other. This is the recommended production configuration.
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TlsConfig:
    ca_cert_path: Optional[str] = None      # Path to CA certificate (PEM).
    client_cert_path: Optional[str] = None  # Path to client certificate (PEM).
    client_key_path: Optional[str] = None   # Path to client private key (PEM).
```

| Field | Description |
|---|---|
| ca_cert_path | Path to the PEM-encoded CA certificate used to verify the server. If None, the system default trust store is used. |
| client_cert_path | Path to the PEM-encoded client certificate for mTLS. Must be provided together with client_key_path. |
| client_key_path | Path to the PEM-encoded client private key for mTLS. Must be provided together with client_cert_path. |
```python
from dataclasses import dataclass, field

import grpc

@dataclass
class RetryConfig:
    max_retries: int = 3             # Maximum retry attempts.
    initial_backoff_s: float = 0.1   # Initial backoff in seconds.
    max_backoff_s: float = 10.0      # Maximum backoff cap in seconds.
    backoff_multiplier: float = 2.0  # Exponential backoff multiplier.
    # A mutable default needs default_factory (a bare list literal would
    # raise ValueError at class-definition time).
    retryable_status_codes: list[grpc.StatusCode] = field(default_factory=lambda: [
        grpc.StatusCode.UNAVAILABLE,
        grpc.StatusCode.DEADLINE_EXCEEDED,
        grpc.StatusCode.RESOURCE_EXHAUSTED,
    ])
```

| Field | Description |
|---|---|
| max_retries | Number of times to retry a failed RPC before raising. Set to 0 to disable retries. |
| initial_backoff_s | How long to wait before the first retry, in seconds. |
| max_backoff_s | Upper bound on backoff duration, in seconds. |
| backoff_multiplier | Each successive backoff is multiplied by this factor. |
| retryable_status_codes | gRPC status codes that trigger a retry. Non-matching codes raise immediately. |
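The fields above combine into a simple exponential-backoff sequence. The sketch below illustrates only the arithmetic implied by the defaults; `backoff_schedule` is a hypothetical helper, not part of the SDK's retry loop:

```python
# Illustrative sketch of RetryConfig's backoff arithmetic; not the
# SDK's actual retry implementation.
def backoff_schedule(max_retries: int = 3,
                     initial_backoff_s: float = 0.1,
                     max_backoff_s: float = 10.0,
                     backoff_multiplier: float = 2.0) -> list[float]:
    # Retry attempt i waits initial * multiplier**i seconds, capped at the max.
    return [min(initial_backoff_s * backoff_multiplier ** i, max_backoff_s)
            for i in range(max_retries)]

print(backoff_schedule())               # [0.1, 0.2, 0.4]
print(backoff_schedule(max_retries=8))  # later waits are capped at 10.0
```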
## Client API Reference

SentinelClient.connect(endpoint, *, tls_config=None, retry_config=None, default_timeout=30.0, options=None)

Create a synchronous client connected to the SENTINEL gRPC gateway.

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| endpoint | str | (required) | host:port of the SENTINEL gRPC gateway. |
| tls_config | Optional[TlsConfig] | None | TLS credentials. If None, an insecure channel is used. |
| retry_config | Optional[RetryConfig] | None | Retry behaviour. Defaults to 3 retries with exponential backoff. |
| default_timeout | float | 30.0 | Default per-RPC deadline in seconds. |
| options | Optional[list[tuple[str, Any]]] | None | Additional gRPC channel options (e.g., keepalive settings). |

Returns: SentinelClient

Example:

```python
client = SentinelClient.connect(
    "sentinel.prod.internal:443",
    tls_config=TlsConfig(ca_cert_path="ca.pem"),
    retry_config=RetryConfig(max_retries=5),
    default_timeout=15.0,
)
```

SentinelClient.aconnect(endpoint, *, tls_config=None, retry_config=None, default_timeout=30.0, options=None)
Create an asynchronous client connected to the SENTINEL gRPC gateway.
Same parameters as connect. Must be called with await.
Returns: SentinelClient
Example:

```python
client = await SentinelClient.aconnect(
    "sentinel.prod.internal:443",
    tls_config=TlsConfig(ca_cert_path="ca.pem"),
)
```

query_gpu_health(gpu_uuid) -> GpuHealth

Query the health of a specific GPU by its UUID. Returns the GPU's current lifecycle state, Bayesian reliability score, probe statistics, anomaly counts, and per-SM health breakdown.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| gpu_uuid | str | NVIDIA GPU UUID (e.g., "GPU-abcd1234-...") |
Returns: GpuHealth
Raises: NotFoundError if the GPU UUID is unknown to the system.
Example:

```python
health = client.query_gpu_health("GPU-abcd1234-5678-9abc-def0-123456789abc")
print(f"State: {health.state.name}")
print(f"Reliability: {health.reliability_score:.6f}")
print(f"Probes: {health.probe_pass_count} pass / {health.probe_fail_count} fail")
print(f"Anomalies: {health.anomaly_count}")

# Inspect per-SM health.
for sm in health.sm_health:
    if sm.probe_fail_count > 0:
        print(f"  SM {sm.sm.sm_id}: {sm.probe_fail_count} failures, "
              f"reliability={sm.reliability_score:.4f}")
```

aquery_gpu_health(gpu_uuid) -> GpuHealth

Async variant. Same parameters and return type.

```python
health = await client.aquery_gpu_health("GPU-abcd1234-5678-9abc-def0-123456789abc")
```

query_fleet_health(hostname_prefix="", model_filter="", state_filter=None) -> FleetHealthSummary

Query the fleet-wide health summary. Optionally filter by hostname prefix, GPU model, or lifecycle state.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| hostname_prefix | str | "" | Only include GPUs on hosts matching this prefix (e.g., "gpu-rack-01"). Empty string matches all. |
| model_filter | str | "" | Only include GPUs of this model (e.g., "H100"). Empty string matches all. |
| state_filter | Optional[list[GpuHealthState]] | None | Only include GPUs in these lifecycle states. None includes all states. |
Returns: FleetHealthSummary
Example:

```python
# Get full fleet summary.
fleet = client.query_fleet_health()
print(f"Total: {fleet.total_gpus} | Healthy: {fleet.healthy} | "
      f"Suspect: {fleet.suspect} | Quarantined: {fleet.quarantined}")
print(f"SDC rate: {fleet.overall_sdc_rate:.6f} events/GPU-hour")
print(f"Active agents: {fleet.active_agents}")

# Filter to just H100s on rack 3.
from sentinel_sdk.types import GpuHealthState

fleet = client.query_fleet_health(
    hostname_prefix="gpu-rack-03",
    model_filter="H100",
    state_filter=[GpuHealthState.SUSPECT, GpuHealthState.QUARANTINED],
)
```

aquery_fleet_health

Async variant. Same parameters and return type.

get_gpu_history(gpu_uuid, start_time, end_time, limit=1000, page_token="") -> GpuHistoryResponse
Retrieve historical health data for a GPU within a time range, including state transitions, correlation events, and reliability score time series.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| gpu_uuid | str | (required) | GPU UUID to query. |
| start_time | datetime | (required) | Start of the time range (UTC). |
| end_time | datetime | (required) | End of the time range (UTC). |
| limit | int | 1000 | Maximum number of events to return per page. |
| page_token | str | "" | Pagination token from a previous response. |
Returns: GpuHistoryResponse
Example:

```python
from datetime import datetime, timedelta

end = datetime.utcnow()
start = end - timedelta(days=7)

history = client.get_gpu_history(
    "GPU-abcd1234-5678-9abc-def0-123456789abc",
    start_time=start,
    end_time=end,
)

print(f"State transitions: {len(history.state_transitions)}")
for t in history.state_transitions:
    print(f"  {t.timestamp}: {t.from_state.name} -> {t.to_state.name} ({t.reason})")

print(f"Correlations: {len(history.correlations)}")
print(f"Reliability samples: {len(history.reliability_history)}")

# Pagination.
while history.next_page_token:
    history = client.get_gpu_history(
        "GPU-abcd1234-5678-9abc-def0-123456789abc",
        start_time=start,
        end_time=end,
        page_token=history.next_page_token,
    )
    # Process next page...
```

aget_gpu_history

Async variant. Same parameters and return type.
issue_quarantine(gpu_uuid, action, reason, initiated_by="sentinel-sdk", evidence=None, requires_approval=False) -> DirectiveResponse
Issue a quarantine directive to change a GPU's lifecycle state.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| gpu_uuid | str | (required) | GPU UUID to act upon. |
| action | QuarantineAction | (required) | The action to take. |
| reason | str | (required) | Human-readable reason for this action. |
| initiated_by | str | "sentinel-sdk" | Identifier for who/what initiated this directive. |
| evidence | Optional[list[str]] | None | List of supporting evidence references (event IDs, probe execution IDs, etc.). |
| requires_approval | bool | False | Whether this directive requires human approval before execution. |
Returns: DirectiveResponse
QuarantineAction values:

| Value | Description |
|---|---|
| QuarantineAction.QUARANTINE | Remove GPU from production workloads and place under investigation. |
| QuarantineAction.REINSTATE | Return a previously quarantined GPU to production. |
| QuarantineAction.CONDEMN | Permanently mark GPU as unreliable; schedule for hardware replacement. |
| QuarantineAction.SCHEDULE_DEEP_TEST | Initiate a comprehensive deep-diagnostic test suite on the GPU. |
Example:

```python
from sentinel_sdk.types import QuarantineAction

# Quarantine a suspect GPU.
resp = client.issue_quarantine(
    gpu_uuid="GPU-abcd1234-5678-9abc-def0-123456789abc",
    action=QuarantineAction.QUARANTINE,
    reason="Repeated FMA probe failures on SM 42",
    evidence=["probe-exec-001", "probe-exec-002", "anomaly-evt-007"],
    requires_approval=True,
)

if resp.accepted:
    print(f"Directive {resp.directive_id} accepted, GPU now {resp.resulting_state}")
else:
    print(f"Directive rejected: {resp.rejection_reason}")
```

aissue_quarantine

Async variant. Same parameters and return type.

query_audit_trail(filters) -> AuditQueryResponse
Query the tamper-evident audit trail with filtering and pagination.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| filters | AuditQueryFilters | Query filters (see below). |
AuditQueryFilters fields:
| Field | Type | Default | Description |
|---|---|---|---|
| gpu | Optional[GpuIdentifier] | None | Filter by GPU. |
| start_time | Optional[datetime] | None | Start of time range. |
| end_time | Optional[datetime] | None | End of time range. |
| entry_type | AuditEntryType | UNSPECIFIED | Filter by entry type. UNSPECIFIED returns all types. |
| limit | int | 100 | Maximum entries to return. |
| page_token | str | "" | Pagination token. |
| descending | bool | False | If True, return newest entries first. |
Returns: AuditQueryResponse
AuditEntryType values:
| Value | Description |
|---|---|
| PROBE_RESULT | A probe execution result. |
| ANOMALY_EVENT | An anomaly detection event. |
| QUARANTINE_ACTION | A quarantine lifecycle action. |
| CONFIG_CHANGE | A configuration change. |
| TMR_RESULT | A TMR canary result. |
| SYSTEM_EVENT | A system-level event (startup, shutdown, error). |
Example:

```python
from datetime import datetime, timedelta
from sentinel_sdk.types import AuditQueryFilters, AuditEntryType, GpuIdentifier

filters = AuditQueryFilters(
    gpu=GpuIdentifier(uuid="GPU-abcd1234-5678-9abc-def0-123456789abc"),
    start_time=datetime.utcnow() - timedelta(hours=24),
    entry_type=AuditEntryType.QUARANTINE_ACTION,
    limit=50,
    descending=True,
)

result = client.query_audit_trail(filters)
print(f"Found {result.total_count} entries (showing {len(result.entries)})")
for entry in result.entries:
    print(f"  [{entry.entry_id}] {entry.entry_type.name} at {entry.timestamp}")
```

aquery_audit_trail

Async variant. Same parameters and return type.
verify_chain(start_time=None, end_time=None, start_entry_id=0, end_entry_id=0, verify_merkle_roots=True) -> ChainVerificationResult
Verify the integrity of the audit chain. This checks that the hash chain is unbroken and optionally verifies Merkle roots of batches.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| start_time | Optional[datetime] | None | Start of verification range (reserved for future use). |
| end_time | Optional[datetime] | None | End of verification range (reserved for future use). |
| start_entry_id | int | 0 | Starting entry ID (0 = from genesis). |
| end_entry_id | int | 0 | Ending entry ID (0 = through latest). |
| verify_merkle_roots | bool | True | Whether to also verify Merkle roots of batches. |
Returns: ChainVerificationResult
Example:

```python
result = client.verify_chain()
if result.valid:
    print(f"Chain integrity verified: {result.entries_verified} entries, "
          f"{result.batches_verified} batches in {result.verification_time_ms}ms")
else:
    print(f"CHAIN BROKEN at entry {result.first_invalid_entry_id}: "
          f"{result.failure_description}")
```

averify_chain

Async variant. Same parameters and return type.

get_trust_graph() -> TrustGraphSnapshot
Retrieve a point-in-time snapshot of the GPU trust graph. The trust graph records pairwise comparison history from TMR canary runs.
Returns: TrustGraphSnapshot
Example:

```python
graph = client.get_trust_graph()
print(f"Trust graph: {graph.total_gpus} GPUs, {len(graph.edges)} edges")
print(f"Coverage: {graph.coverage_pct:.1f}%")
print(f"Trust scores: min={graph.min_trust_score:.4f}, mean={graph.mean_trust_score:.4f}")

# Find low-trust edges.
for edge in graph.edges:
    if edge.trust_score < 0.95:
        print(f"  Low trust: {edge.gpu_a.uuid[:16]}... <-> {edge.gpu_b.uuid[:16]}... "
              f"score={edge.trust_score:.4f} "
              f"({edge.agreement_count} agree / {edge.disagreement_count} disagree)")
```

aget_trust_graph

Async variant. Same return type.

update_config(update) -> ConfigAck
Push a dynamic configuration update to agents or subsystems.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| update | ConfigUpdate | The configuration update to apply. |
A ConfigUpdate carries exactly one of the following update payloads:
| Field | Type | Description |
|---|---|---|
| probe_schedule | ProbeScheduleUpdate | Replace the probe execution schedule. |
| overhead_budget | OverheadBudgetUpdate | Change the maximum GPU overhead budget. |
| sampling_rate | SamplingRateUpdate | Update a component's sampling rate. |
| threshold | ThresholdUpdate | Update a component's detection threshold. |
Returns: ConfigAck
Example:

```python
from sentinel_sdk.types import (
    ConfigUpdate, ProbeScheduleUpdate, ProbeScheduleEntry, ProbeType,
    OverheadBudgetUpdate, ThresholdUpdate,
)

# Update probe schedule.
ack = client.update_config(ConfigUpdate(
    update_id="cfg-001",
    initiated_by="operator:jane",
    reason="Increase FMA probe frequency during investigation",
    probe_schedule=ProbeScheduleUpdate(entries=[
        ProbeScheduleEntry(
            type=ProbeType.FMA,
            period_seconds=30,  # Every 30 seconds (was 60).
            sm_coverage=1.0,    # All SMs.
            priority=1,
            enabled=True,
            timeout_ms=5000,
        ),
        ProbeScheduleEntry(
            type=ProbeType.TENSOR_CORE,
            period_seconds=120,
            sm_coverage=0.25,
            priority=2,
            enabled=True,
            timeout_ms=10000,
        ),
    ]),
))

if ack.applied:
    print(f"Config {ack.update_id} applied by {ack.component_id} (v{ack.config_version})")
else:
    print(f"Config rejected: {ack.error}")

# Adjust overhead budget.
ack = client.update_config(ConfigUpdate(
    update_id="cfg-002",
    initiated_by="auto-tuner",
    reason="Reduce probe overhead during peak training",
    overhead_budget=OverheadBudgetUpdate(budget_pct=1.0),
))

# Adjust detection threshold.
ack = client.update_config(ConfigUpdate(
    update_id="cfg-003",
    initiated_by="operator:jane",
    reason="Lower KL divergence sensitivity",
    threshold=ThresholdUpdate(
        component="inference_monitor",
        parameter="kl_divergence_threshold",
        value=0.05,
    ),
))
```

aupdate_config

Async variant. Same parameters and return type.

stream_events(callback, hostname_filter="", action_filter=QuarantineAction.UNSPECIFIED)
Stream quarantine directives in real-time. This method blocks until the client is closed, the server terminates the stream, or a non-retryable error occurs. The SDK automatically reconnects on transient failures.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| callback | Callable[[QuarantineDirective], None] | (required) | Invoked for each incoming directive. |
| hostname_filter | str | "" | Only receive directives for this hostname. Empty = all. |
| action_filter | QuarantineAction | UNSPECIFIED | Only receive directives of this type. UNSPECIFIED = all. |
Example:

```python
from sentinel_sdk.types import QuarantineDirective

def on_directive(d: QuarantineDirective):
    print(f"[{d.timestamp}] {d.action.name} -> {d.gpu.uuid} ({d.reason})")
    if d.requires_approval:
        print(f"  Awaiting approval from {d.initiated_by}")

# This blocks forever (until client.close() is called).
client.stream_events(on_directive)
```

astream_events

Async variant. The callback may be a regular function or a coroutine function (`async def`). If it is a coroutine, it will be awaited.

```python
async def on_directive(d: QuarantineDirective):
    print(f"Directive: {d.action.name} -> {d.gpu.uuid}")
    # Can await other async operations here.

await client.astream_events(on_directive)
```

close()

Close the client and release all resources (synchronous).

aclose()

Close the client and release all resources (asynchronous). Must be awaited.
The client supports context managers:
```python
# Sync
with SentinelClient.connect("localhost:50051") as client:
    ...

# Async
async with await SentinelClient.aconnect("localhost:50051") as client:
    ...
```

## Type Reference

All types are Pydantic models defined in `sentinel_sdk.types`. They mirror the protobuf messages defined in `proto/sentinel/v1/`.
Severity values:

| Value | Int | Description |
|---|---|---|
| UNSPECIFIED | 0 | Not set. |
| INFO | 1 | Informational. |
| WARNING | 2 | Warning level. |
| HIGH | 3 | High severity. |
| CRITICAL | 4 | Critical severity. |
ProbeType values:

| Value | Int | Description |
|---|---|---|
| UNSPECIFIED | 0 | Not set. |
| FMA | 1 | Fused multiply-add determinism check. |
| TENSOR_CORE | 2 | Tensor Core matrix-multiply reproducibility check. |
| TRANSCENDENTAL | 3 | Transcendental function (sin, cos, exp, log) accuracy check. |
| AES | 4 | AES-based combinational logic exhaustive-path check. |
| MEMORY | 5 | GPU global memory integrity check (walking-ones / MATS+). |
| REGISTER_FILE | 6 | Register file integrity check via known-pattern writes. |
| SHARED_MEMORY | 7 | Shared memory integrity check. |
| Value | Int | Description |
|---|---|---|
| UNSPECIFIED | 0 | Not set. |
| PASS | 1 | Output matched expected golden value. |
| FAIL | 2 | Output did NOT match expected golden value. |
| ERROR | 3 | Execution error (kernel launch failure, etc.). |
| TIMEOUT | 4 | Probe did not complete within the allowed time window. |
GpuHealthState values:

| Value | Int | Description |
|---|---|---|
| UNSPECIFIED | 0 | Not set. |
| HEALTHY | 1 | Operating normally; no evidence of SDC. |
| SUSPECT | 2 | Anomalous signals detected; under increased monitoring. |
| QUARANTINED | 3 | Removed from production workloads pending investigation. |
| DEEP_TEST | 4 | Undergoing deep diagnostic testing. |
| CONDEMNED | 5 | Permanently marked as unreliable; must be replaced. |
QuarantineAction values:

| Value | Int | Description |
|---|---|---|
| UNSPECIFIED | 0 | Not set. |
| QUARANTINE | 1 | Remove from production; place under investigation. |
| REINSTATE | 2 | Return to production. |
| CONDEMN | 3 | Permanently mark as unreliable. |
| SCHEDULE_DEEP_TEST | 4 | Initiate deep diagnostic testing. |
| Value | Int | Description |
|---|---|---|
| UNSPECIFIED | 0 | Not set. |
| LOGIT_DRIFT | 1 | EWMA-smoothed logit distribution drift beyond threshold. |
| ENTROPY_ANOMALY | 2 | Output entropy abnormally high or low. |
| KL_DIVERGENCE | 3 | KL divergence between reference and observed exceeds limit. |
| GRADIENT_NORM_SPIKE | 4 | Gradient norm spike during training. |
| LOSS_SPIKE | 5 | Training loss spike not explained by learning-rate schedule. |
| CROSS_RANK_DIVERGENCE | 6 | Divergence between ranks in data-parallel training. |
| CHECKPOINT_DIVERGENCE | 7 | Checkpointed model differs from expected. |
| INVARIANT_VIOLATION | 8 | Mathematical invariant violated (e.g., softmax sums to 1). |
| Value | Int | Description |
|---|---|---|
| UNSPECIFIED | 0 | Not set. |
| INFERENCE_MONITOR | 1 | Inference monitoring sidecar. |
| TRAINING_MONITOR | 2 | Training monitoring hooks. |
| INVARIANT_CHECKER | 3 | Mathematical invariant checker. |
PatternType values:

| Value | Int | Description |
|---|---|---|
| UNSPECIFIED | 0 | Not set. |
| MULTI_SIGNAL | 1 | Multiple anomaly types on the same GPU in a time window. |
| SM_LOCALIZED | 2 | Probe failures and anomalies co-occurring on the same SM. |
| ENVIRONMENTAL | 3 | Thermal/power anomalies correlating with computation errors. |
| NODE_CORRELATED | 4 | Multiple GPUs on the same node showing correlated failures. |
| FIRMWARE_CORRELATED | 5 | Failures correlated with specific firmware or driver versions. |
| TMR_CONFIRMED | 6 | TMR dissent correlating with other signals. |
AuditEntryType values:

| Value | Int | Description |
|---|---|---|
| UNSPECIFIED | 0 | Not set. |
| PROBE_RESULT | 1 | A probe execution result. |
| ANOMALY_EVENT | 2 | An anomaly detection event. |
| QUARANTINE_ACTION | 3 | A quarantine lifecycle action. |
| CONFIG_CHANGE | 4 | A configuration change. |
| TMR_RESULT | 5 | A TMR canary result. |
| SYSTEM_EVENT | 6 | A system-level event. |
`GpuIdentifier` -- Uniquely identifies a GPU within the fleet.

| Field | Type | Description |
|---|---|---|
| uuid | str | NVIDIA UUID (e.g., "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"). |
| hostname | str | Hostname of the machine containing this GPU. |
| device_index | int | PCI device index on the host (0-based). |
| model | str | GPU model name (e.g., "NVIDIA H100 80GB HBM3"). |
| driver_version | str | Driver version string (e.g., "535.129.03"). |
| firmware_version | str | GPU firmware/VBIOS version. |
`SmIdentifier` -- Identifies a specific Streaming Multiprocessor on a GPU.

| Field | Type | Description |
|---|---|---|
| gpu | Optional[GpuIdentifier] | The GPU containing this SM. |
| sm_id | int | SM index within the GPU (0-based). |
`GpuHealth` -- Health status and Bayesian reliability model for a single GPU.

| Field | Type | Description |
|---|---|---|
| gpu | Optional[GpuIdentifier] | The GPU this health record describes. |
| state | GpuHealthState | Current lifecycle state. |
| reliability_score | float | Bayesian reliability score: alpha / (alpha + beta). Range [0.0, 1.0]. |
| alpha | float | Beta distribution alpha parameter (successes + prior). |
| beta | float | Beta distribution beta parameter (failures + prior). |
| last_probe_time | Optional[datetime] | Timestamp of the most recent probe execution. |
| last_anomaly_time | Optional[datetime] | Timestamp of the most recent attributed anomaly. |
| probe_pass_count | int | Cumulative count of passed probes (lifetime). |
| probe_fail_count | int | Cumulative count of failed probes (lifetime). |
| anomaly_count | int | Cumulative count of attributed anomalies (lifetime). |
| state_changed_at | Optional[datetime] | When the GPU last transitioned to its current state. |
| state_change_reason | str | Reason for the most recent state change. |
| sm_health | list[SmHealth] | Per-SM health breakdown. |
| anomaly_rate | float | Current anomaly rate (anomalies per hour, rolling window). |
| probe_failure_rate | float | Current probe failure rate (failures per hour, rolling window). |
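As a worked example of the score formula above. The `Beta(1, 1)` prior and the probe counts below are illustrative assumptions, not SDK defaults:

```python
# reliability_score = alpha / (alpha + beta), per the GpuHealth field
# definitions above.
def reliability_score(alpha: float, beta: float) -> float:
    return alpha / (alpha + beta)

# A GPU with 998 passed probes and 2 failed probes under a uniform
# Beta(1, 1) prior (an assumption for illustration):
alpha = 1 + 998  # prior successes + observed passes
beta = 1 + 2     # prior failures + observed failures
print(f"{reliability_score(alpha, beta):.4f}")  # 0.9970
```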
`SmHealth` -- Health status for a single Streaming Multiprocessor.

| Field | Type | Description |
|---|---|---|
| sm | Optional[SmIdentifier] | The SM this record describes. |
| reliability_score | float | Bayesian reliability score for this SM. |
| probe_pass_count | int | Count of passed probes on this SM. |
| probe_fail_count | int | Count of failed probes on this SM. |
| disabled | bool | Whether this SM is currently disabled/masked. |
| disable_reason | str | Reason for disabling (if applicable). |
`FleetHealthSummary` -- Aggregated health summary for the entire GPU fleet.

| Field | Type | Description |
|---|---|---|
| total_gpus | int | Total number of GPUs tracked. |
| healthy | int | Number of GPUs in HEALTHY state. |
| suspect | int | Number of GPUs in SUSPECT state. |
| quarantined | int | Number of GPUs in QUARANTINED state. |
| deep_test | int | Number of GPUs in DEEP_TEST state. |
| condemned | int | Number of GPUs in CONDEMNED state. |
| overall_sdc_rate | float | Fleet-wide estimated SDC rate (events per GPU-hour). |
| average_reliability_score | float | Fleet-wide average reliability score. |
| snapshot_time | Optional[datetime] | Timestamp of this summary snapshot. |
| active_agents | int | Number of active probe agents reporting. |
| rate_window_seconds | int | Time window over which rates are computed (seconds). |
`GpuHistoryResponse` -- Historical health data for a GPU.

| Field | Type | Description |
|---|---|---|
| state_transitions | list[StateTransition] | Historical state transitions. |
| correlations | list[CorrelationEvent] | Historical correlation events. |
| reliability_history | list[ReliabilitySample] | Reliability score time series (sampled). |
| next_page_token | str | Pagination token for the next page (empty if no more results). |
`StateTransition` -- A recorded GPU state transition.

| Field | Type | Description |
|---|---|---|
| from_state | GpuHealthState | Previous state. |
| to_state | GpuHealthState | New state. |
| timestamp | Optional[datetime] | When the transition occurred. |
| reason | str | Reason for the transition. |
| initiated_by | str | Who/what initiated the transition. |
`ReliabilitySample` -- A point-in-time reliability score sample.

| Field | Type | Description |
|---|---|---|
| timestamp | Optional[datetime] | When this sample was recorded. |
| reliability_score | float | Reliability score at this point in time. |
| alpha | float | Beta distribution alpha parameter. |
| beta | float | Beta distribution beta parameter. |
`CorrelationEvent` -- A correlation event linking multiple raw events into a higher-level finding.

| Field | Type | Description |
|---|---|---|
| event_id | str | Unique identifier (UUID v7). |
| events_correlated | list[str] | IDs of the raw events that were correlated. |
| pattern_type | PatternType | The type of pattern detected. |
| confidence | float | Confidence score [0.0, 1.0]. |
| attributed_gpu | Optional[GpuIdentifier] | GPU attributed as root cause (if determined). |
| attributed_sm | Optional[SmIdentifier] | SM attributed as root cause (if localized). |
| description | str | Human-readable description. |
| timestamp | Optional[datetime] | When this correlation was computed. |
| severity | Severity | Severity assessment. |
| recommended_action | str | Recommended action based on this correlation. |
`QuarantineDirective` -- A directive to change a GPU's lifecycle state.

| Field | Type | Description |
|---|---|---|
| directive_id | str | Unique identifier (UUID v7). |
| gpu | Optional[GpuIdentifier] | The GPU to act upon. |
| action | QuarantineAction | The action to take. |
| reason | str | Human-readable reason. |
| initiated_by | str | Who/what initiated this directive. |
| evidence | list[str] | References to supporting evidence. |
| timestamp | Optional[datetime] | When this directive was issued. |
| priority | int | Priority (lower = higher priority). |
| requires_approval | bool | Whether human approval is required. |
| approval | Optional[ApprovalStatus] | Approval status (populated after review). |
`ApprovalStatus` -- Approval tracking for directives requiring human sign-off.

| Field | Type | Description |
|---|---|---|
| approved | bool | Whether the directive has been approved. |
| reviewer | str | Who approved or rejected it. |
| review_time | Optional[datetime] | When the review decision was made. |
| comment | str | Optional comment from the reviewer. |
`DirectiveResponse` -- Response to a directive issuance.

| Field | Type | Description |
|---|---|---|
| directive_id | str | The directive ID that was processed. |
| accepted | bool | Whether the directive was accepted. |
| rejection_reason | str | Reason if not accepted. |
| resulting_state | str | The resulting GPU state after applying the directive. |
`AuditEntry` -- A single entry in the tamper-evident audit ledger.

| Field | Type | Description |
|---|---|---|
| entry_id | int | Monotonically increasing sequence number. |
| entry_type | AuditEntryType | Classification of this entry. |
| timestamp | Optional[datetime] | When this entry was recorded. |
| gpu | Optional[GpuIdentifier] | GPU associated with this entry. |
| data | bytes | Serialized protobuf of the underlying event. |
| previous_hash | bytes | Hash of the preceding entry in the chain. |
| entry_hash | bytes | SHA-256 hash of this entry. |
| merkle_root | bytes | Merkle root of the batch (last entry in batch only). |
`AuditQueryFilters` -- Filters for querying the audit trail. See Audit Trail.
`AuditQueryResponse` -- Response from an audit trail query.

| Field | Type | Description |
|---|---|---|
| entries | list[AuditEntry] | Matching audit entries. |
| next_page_token | str | Pagination token for the next page. |
| total_count | int | Total number of matching entries. |
`ChainVerificationResult` -- Response from chain verification.

| Field | Type | Description |
|---|---|---|
| valid | bool | Whether the chain is valid over the requested range. |
| first_invalid_entry_id | int | Entry ID where the first break was detected (if invalid). |
| failure_description | str | Description of the verification failure (if any). |
| entries_verified | int | Total entries verified. |
| batches_verified | int | Total batches verified. |
| verification_time_ms | int | Time taken in milliseconds. |
`TrustEdge` -- An edge in the GPU trust graph.

| Field | Type | Description |
|---|---|---|
| gpu_a | Optional[GpuIdentifier] | First GPU in the pair. |
| gpu_b | Optional[GpuIdentifier] | Second GPU in the pair. |
| agreement_count | int | Number of matching outputs. |
| disagreement_count | int | Number of differing outputs. |
| last_comparison | Optional[datetime] | Most recent comparison timestamp. |
| trust_score | float | Trust score: agreement / (agreement + disagreement). Range [0.0, 1.0]. |
`TrustGraphSnapshot` -- Point-in-time snapshot of the entire trust graph.

| Field | Type | Description |
|---|---|---|
| edges | list[TrustEdge] | All edges in the trust graph. |
| timestamp | Optional[datetime] | When this snapshot was taken. |
| coverage_pct | float | Percentage of all GPU pairs compared at least once. |
| total_gpus | int | Total GPUs in the graph. |
| min_trust_score | float | Minimum trust score across all edges. |
| mean_trust_score | float | Mean trust score across all edges. |
`ConfigUpdate` -- A dynamic configuration update.

| Field | Type | Description |
|---|---|---|
| update_id | str | Unique identifier for this config change. |
| initiated_by | str | Who initiated this change. |
| reason | str | Human-readable reason. |
| probe_schedule | Optional[ProbeScheduleUpdate] | New probe schedule (replaces entire schedule). |
| overhead_budget | Optional[OverheadBudgetUpdate] | New overhead budget. |
| sampling_rate | Optional[SamplingRateUpdate] | New sampling rate for a component. |
| threshold | Optional[ThresholdUpdate] | New threshold for a component parameter. |
`ConfigAck` -- Acknowledgement from a config update recipient.

| Field | Type | Description |
|---|---|---|
| update_id | str | The update_id being acknowledged. |
| applied | bool | Whether the update was applied successfully. |
| component_id | str | Hostname of the agent/component that processed the update. |
| error | str | Error message if not applied. |
| config_version | int | Effective configuration version after this update. |
`ProbeScheduleEntry` -- A single entry in the probe schedule.

| Field | Type | Description |
|---|---|---|
| type | ProbeType | Probe type. |
| period_seconds | int | Execution period in seconds. |
| sm_coverage | float | Fraction of SMs to cover per period [0.0, 1.0]. |
| priority | int | Scheduling priority (lower = higher priority). |
| enabled | bool | Whether this probe type is enabled. |
| timeout_ms | int | Maximum allowed execution time in milliseconds. |
The SDK provides both synchronous and asynchronous methods for every operation.
Synchronous methods are plain function calls; asynchronous methods carry an `a`
prefix and return coroutines.

| Synchronous | Asynchronous |
|---|---|
| `SentinelClient.connect()` | `await SentinelClient.aconnect()` |
| `client.query_gpu_health()` | `await client.aquery_gpu_health()` |
| `client.query_fleet_health()` | `await client.aquery_fleet_health()` |
| `client.get_gpu_history()` | `await client.aget_gpu_history()` |
| `client.issue_quarantine()` | `await client.aissue_quarantine()` |
| `client.query_audit_trail()` | `await client.aquery_audit_trail()` |
| `client.verify_chain()` | `await client.averify_chain()` |
| `client.get_trust_graph()` | `await client.aget_trust_graph()` |
| `client.update_config()` | `await client.aupdate_config()` |
| `client.stream_events()` | `await client.astream_events()` |
| `client.close()` | `await client.aclose()` |
Use synchronous when:
- Writing scripts, CLI tools, or notebooks
- Your application does not use asyncio
- You want the simplest possible code
Use asynchronous when:
- Your application already uses asyncio (e.g., FastAPI, aiohttp)
- You need to monitor multiple GPUs concurrently
- You are building a real-time dashboard
- You need to stream events without blocking the event loop
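The concurrency win from the async methods can be sketched without a live server. `aquery_gpu_health` below is a stand-in coroutine that simulates RPC latency, not the SDK call:

```python
import asyncio


async def aquery_gpu_health(gpu_uuid: str) -> str:
    """Stand-in for client.aquery_gpu_health(): simulates one RPC round trip."""
    await asyncio.sleep(0.01)
    return f"{gpu_uuid}: HEALTHY"


async def main() -> list[str]:
    gpus = ["GPU-0", "GPU-1", "GPU-2"]
    # All three queries share one round trip's latency instead of paying it three times.
    return await asyncio.gather(*(aquery_gpu_health(g) for g in gpus))


print(asyncio.run(main()))
```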
Do not call synchronous methods from within an async event loop -- they will
block the loop. Use the a-prefixed methods instead. If you must call sync
methods from async code, use `asyncio.to_thread()`:

```python
# Acceptable but not recommended:
result = await asyncio.to_thread(client.query_fleet_health)
```

All SDK exceptions inherit from `SentinelError`:
```text
SentinelError
+-- ConnectionError        # Cannot connect to endpoint.
+-- AuthenticationError    # Authentication/authorization failure.
+-- NotFoundError          # Requested resource not found.
+-- InvalidArgumentError   # Invalid argument passed to RPC.
```
Each exception has a `code` attribute containing the gRPC status code
(`grpc.StatusCode`).
| gRPC Status Code | SDK Exception |
|---|---|
| `NOT_FOUND` | `NotFoundError` |
| `UNAUTHENTICATED` | `AuthenticationError` |
| `PERMISSION_DENIED` | `AuthenticationError` |
| `INVALID_ARGUMENT` | `InvalidArgumentError` |
| `UNAVAILABLE` | `ConnectionError` |
| All others | `SentinelError` |
By default, these status codes trigger automatic retries with exponential backoff:
- `UNAVAILABLE` -- server temporarily unreachable
- `DEADLINE_EXCEEDED` -- RPC timed out
- `RESOURCE_EXHAUSTED` -- rate limited
All other errors raise immediately without retry.
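Exponential backoff means each retry waits a multiple of the previous delay. The base delay and multiplier below are illustrative assumptions, not documented SDK defaults:

```python
def backoff_delays(max_retries: int = 3, base: float = 0.5, factor: float = 2.0) -> list[float]:
    """Delay in seconds before each retry attempt: base * factor**attempt."""
    return [base * factor**attempt for attempt in range(max_retries)]


print(backoff_delays())  # [0.5, 1.0, 2.0]
```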
```python
from sentinel_sdk import SentinelClient, SentinelError, NotFoundError, ConnectionError

try:
    client = SentinelClient.connect("sentinel.example.com:443")
    health = client.query_gpu_health("GPU-nonexistent-uuid")
except NotFoundError as e:
    print(f"GPU not found: {e}")
except ConnectionError as e:
    print(f"Cannot reach SENTINEL: {e}")
except SentinelError as e:
    print(f"SENTINEL error (code={e.code}): {e}")
```

The SDK reads the following environment variables as defaults:
| Variable | Description | Default |
|---|---|---|
| `SENTINEL_ENDPOINT` | `host:port` of the gRPC gateway | (none -- must be provided) |
| `SENTINEL_CA_CERT` | Path to CA certificate (PEM) | (none) |
| `SENTINEL_CLIENT_CERT` | Path to client certificate (PEM) | (none) |
| `SENTINEL_CLIENT_KEY` | Path to client private key (PEM) | (none) |
| `SENTINEL_TIMEOUT` | Default RPC timeout in seconds | 30 |
| `SENTINEL_MAX_RETRIES` | Maximum retry attempts | 3 |
| `SENTINEL_LOG_LEVEL` | SDK log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) | `WARNING` |
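A sketch of how such env-var defaults are typically resolved; `resolve_timeout` below is illustrative, not SDK code:

```python
import os


def resolve_timeout(default: float = 30.0) -> float:
    """Read SENTINEL_TIMEOUT if set, otherwise fall back to the documented default."""
    raw = os.environ.get("SENTINEL_TIMEOUT")
    return float(raw) if raw is not None else default


os.environ.pop("SENTINEL_TIMEOUT", None)
print(resolve_timeout())  # 30.0
os.environ["SENTINEL_TIMEOUT"] = "45"
print(resolve_timeout())  # 45.0
```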
```python
import os

os.environ["SENTINEL_ENDPOINT"] = "sentinel.prod.internal:443"
os.environ["SENTINEL_CA_CERT"] = "/etc/sentinel/ca.pem"
```

```python
"""Check GPU health before and during a training run, with automated
quarantine if failures are detected."""
import time
from datetime import datetime

from sentinel_sdk import SentinelClient, TlsConfig
from sentinel_sdk.types import GpuHealthState, QuarantineAction

GPUS = [
    "GPU-aaaa1111-2222-3333-4444-555566667777",
    "GPU-bbbb1111-2222-3333-4444-555566667777",
    "GPU-cccc1111-2222-3333-4444-555566667777",
    "GPU-dddd1111-2222-3333-4444-555566667777",
]

client = SentinelClient.connect(
    "sentinel.prod.internal:443",
    tls_config=TlsConfig(ca_cert_path="/etc/sentinel/ca.pem"),
)

# Pre-flight check: ensure all GPUs are healthy.
print("Pre-flight health check...")
for gpu_uuid in GPUS:
    health = client.query_gpu_health(gpu_uuid)
    if health.state != GpuHealthState.HEALTHY:
        print(f"  ABORT: {gpu_uuid} is {health.state.name}")
        exit(1)
    print(f"  {gpu_uuid}: OK (reliability={health.reliability_score:.6f})")

print("All GPUs healthy. Starting training...")
# launch_training(GPUS)  # Your training code here.

# Periodic monitoring during training.
for step in range(1000):
    # ... your training step ...
    if step % 100 == 0:
        for gpu_uuid in GPUS:
            health = client.query_gpu_health(gpu_uuid)
            if health.state == GpuHealthState.SUSPECT:
                print(f"WARNING: {gpu_uuid} is SUSPECT at step {step}")
            elif health.state in (GpuHealthState.QUARANTINED, GpuHealthState.CONDEMNED):
                print(f"CRITICAL: {gpu_uuid} is {health.state.name} at step {step}")
                print("Saving checkpoint and stopping training...")
                # save_checkpoint(step)
                break

client.close()
```

```python
"""Pre-flight check for an inference service startup."""
from sentinel_sdk import SentinelClient
from sentinel_sdk.types import GpuHealthState


def preflight_check(gpu_uuid: str, min_reliability: float = 0.999) -> bool:
    """Returns True if the GPU is safe to use for inference."""
    client = SentinelClient.connect("sentinel.internal:443")
    try:
        health = client.query_gpu_health(gpu_uuid)
        if health.state != GpuHealthState.HEALTHY:
            print(f"GPU {gpu_uuid} is {health.state.name}, not HEALTHY")
            return False
        if health.reliability_score < min_reliability:
            print(f"GPU {gpu_uuid} reliability {health.reliability_score:.6f} "
                  f"below threshold {min_reliability}")
            return False
        if health.probe_failure_rate > 0:
            print(f"GPU {gpu_uuid} has active probe failures "
                  f"({health.probe_failure_rate:.2f}/hour)")
            return False
        return True
    finally:
        client.close()


# Usage:
if preflight_check("GPU-aaaa1111-2222-3333-4444-555566667777"):
    print("GPU is safe for inference")
    # start_inference_server()
else:
    print("GPU failed pre-flight check, selecting alternate GPU")
```

```python
"""Async dashboard backend that periodically polls fleet health."""
import asyncio
import json

from sentinel_sdk import SentinelClient, TlsConfig


async def dashboard_loop():
    client = await SentinelClient.aconnect(
        "sentinel.internal:443",
        tls_config=TlsConfig(ca_cert_path="ca.pem"),
    )
    try:
        while True:
            fleet = await client.aquery_fleet_health()
            graph = await client.aget_trust_graph()
            dashboard_data = {
                "timestamp": fleet.snapshot_time.isoformat() if fleet.snapshot_time else None,
                "total_gpus": fleet.total_gpus,
                "healthy": fleet.healthy,
                "suspect": fleet.suspect,
                "quarantined": fleet.quarantined,
                "condemned": fleet.condemned,
                "sdc_rate": fleet.overall_sdc_rate,
                "avg_reliability": fleet.average_reliability_score,
                "active_agents": fleet.active_agents,
                "trust_coverage": graph.coverage_pct,
                "min_trust": graph.min_trust_score,
                "mean_trust": graph.mean_trust_score,
            }
            print(json.dumps(dashboard_data, indent=2))
            # In production: push to WebSocket clients, write to DB, etc.
            await asyncio.sleep(10)
    finally:
        await client.aclose()


asyncio.run(dashboard_loop())
```

```python
"""Automated quarantine workflow: listen for suspect GPUs and quarantine them."""
import asyncio

from sentinel_sdk import SentinelClient, TlsConfig
from sentinel_sdk.types import (
    QuarantineDirective, QuarantineAction, GpuHealthState,
)


async def auto_quarantine():
    client = await SentinelClient.aconnect(
        "sentinel.internal:443",
        tls_config=TlsConfig(
            ca_cert_path="/etc/sentinel/ca.pem",
            client_cert_path="/etc/sentinel/client.pem",
            client_key_path="/etc/sentinel/client-key.pem",
        ),
    )
    RELIABILITY_THRESHOLD = 0.990
    CHECK_INTERVAL = 30  # seconds
    try:
        while True:
            fleet = await client.aquery_fleet_health(
                state_filter=[GpuHealthState.SUSPECT],
            )
            if fleet.suspect > 0:
                # Query individual suspect GPUs.
                # (In production, use the fleet response's per-GPU data.)
                print(f"Found {fleet.suspect} suspect GPUs, investigating...")
            await asyncio.sleep(CHECK_INTERVAL)
    finally:
        await client.aclose()


asyncio.run(auto_quarantine())
```

```python
"""Export audit trail to JSONL for compliance archival."""
import json
from datetime import datetime, timedelta

from sentinel_sdk import SentinelClient
from sentinel_sdk.types import AuditQueryFilters


def export_audit_trail(output_path: str, days: int = 30):
    client = SentinelClient.connect("sentinel.internal:443")
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(days=days)
    filters = AuditQueryFilters(
        start_time=start_time,
        end_time=end_time,
        limit=500,
        descending=False,
    )
    total_exported = 0
    with open(output_path, "w") as f:
        while True:
            result = client.query_audit_trail(filters)
            for entry in result.entries:
                record = {
                    "entry_id": entry.entry_id,
                    "type": entry.entry_type.name,
                    "timestamp": entry.timestamp.isoformat() if entry.timestamp else None,
                    "gpu_uuid": entry.gpu.uuid if entry.gpu else None,
                    "entry_hash": entry.entry_hash.hex(),
                    "previous_hash": entry.previous_hash.hex(),
                }
                f.write(json.dumps(record) + "\n")
                total_exported += 1
            if not result.next_page_token:
                break
            filters.page_token = result.next_page_token

    # Verify the chain covers the exported range.
    verification = client.verify_chain()
    print(f"Exported {total_exported} entries to {output_path}")
    print(f"Chain integrity: {'VALID' if verification.valid else 'INVALID'}")
    print(f"  Entries verified: {verification.entries_verified}")
    print(f"  Batches verified: {verification.batches_verified}")
    client.close()


export_audit_trail("audit_export.jsonl", days=90)
```

sentinel_sdk.ConnectionError: failed to connect to all addresses
Cause: The gRPC endpoint is unreachable.
Solutions:
- Verify the endpoint address and port.
- Check firewall rules allow traffic on the gRPC port (default: 50051 for insecure, 443 for TLS).
- Ensure the SENTINEL correlation engine is running.
sentinel_sdk.ConnectionError: Ssl handshake failed
Cause: Certificate mismatch or expired certificates.
Solutions:
- Verify the CA certificate matches the server's certificate chain.
- Ensure certificates are not expired: `openssl x509 -in ca.pem -noout -dates`.
- For mTLS, verify the client certificate is signed by the server's trusted CA.
sentinel_sdk.SentinelError: Deadline Exceeded
Cause: The RPC did not complete within the timeout.
Solutions:
- Increase `default_timeout` when connecting.
- For `verify_chain` on large ledgers, use a longer timeout.
- Check network latency to the SENTINEL endpoint.
sentinel_sdk.NotFoundError: GPU GPU-xxxx not found
Cause: The GPU UUID is not known to the SENTINEL system.
Solutions:
- Verify the UUID format: `GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx`.
- Ensure a probe agent is running on the host with this GPU.
- The GPU may not yet have been seen if the agent just started.
ModuleNotFoundError: No module named 'sentinel.v1'
Cause: The generated protobuf Python code is not installed.
Solutions:
- Ensure you installed sentinel-sdk with the protobuf stubs: `pip install "sentinel-sdk[stubs]"`.
- Or generate the stubs from the proto files:

```shell
python -m grpc_tools.protoc \
    --proto_path=proto \
    --python_out=sdk/python/src \
    --grpc_python_out=sdk/python/src \
    proto/sentinel/v1/*.proto
```
Enable SDK debug logging to diagnose issues:
```python
import logging

logging.basicConfig(level=logging.DEBUG)
logging.getLogger("sentinel_sdk").setLevel(logging.DEBUG)
```

| SDK Version | SENTINEL Server | Python | gRPC | Protobuf |
|---|---|---|---|---|
| 0.1.x (alpha) | 0.1.x | >= 3.10 | >= 1.60.0 | >= 4.25.0 |
| 0.2.x (planned) | 0.2.x | >= 3.10 | >= 1.62.0 | >= 4.25.0 |
The SDK follows the same versioning as the SENTINEL server. Patch-level mismatches are tolerated (e.g., SDK 0.1.3 with server 0.1.5), but major or minor version mismatches may result in incompatible protobuf schemas.
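The compatibility rule above (patch may differ, major.minor must match) can be expressed as a quick check; `compatible` is a hypothetical helper, not part of the SDK:

```python
def compatible(sdk_version: str, server_version: str) -> bool:
    """True when the major.minor components match; the patch component may differ."""
    def major_minor(v: str) -> tuple[int, int]:
        major, minor = v.split(".")[:2]
        return int(major), int(minor)
    return major_minor(sdk_version) == major_minor(server_version)


print(compatible("0.1.3", "0.1.5"))  # True  (patch mismatch is fine)
print(compatible("0.1.3", "0.2.0"))  # False (minor mismatch may break schemas)
```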
The gRPC API uses sentinel.v1 package versioning. Breaking changes will only
occur with a major package version bump (e.g., sentinel.v2).