Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 31 additions & 0 deletions .github/workflows/claude-security-review.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
name: Claude Security Review

on:
pull_request:
types: [opened, synchronize, reopened]

permissions:
contents: read
pull-requests: write

jobs:
security-review:
# Only run after maintainer approval for external PRs
if: github.event.pull_request.draft == false
runs-on: ubuntu-latest
timeout-minutes: 30

steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 0

- name: Run Claude Security Review
uses: anthropics/claude-code-security-review@main
with:
claude-api-key: ${{ secrets.CLAUDE_API_KEY }}
claude-model: claude-sonnet-4-6
claudecode-timeout: 25
run-every-commit: true
exclude-directories: "docs,gravity_e2e,cluster,examples,.github"
61 changes: 61 additions & 0 deletions docs/plans/2026-02-23-gsdk-fixes-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# GSDK Security Fixes Design

Date: 2026-02-23

## GSDK-001: Unauthenticated `/set_failpoint` Endpoint

**Problem:** `/set_failpoint` is registered on the plaintext HTTP router (`http_routes`) with no authentication middleware. The endpoint calls `fail::cfg()` directly, allowing remote injection of failpoints that control consensus behavior (e.g., `consensus::send::proposal` stops block production). Any attacker with network access to the validator's HTTP port (default 8080) can halt the validator.

**Fix:** Gate the `/set_failpoint` route behind `#[cfg(debug_assertions)]` so it is compiled out in release builds. This ensures the endpoint is only available during development/testing. Additionally, add a runtime warning on startup if the `failpoints` feature is enabled in a non-debug build configuration.

**Alternative considered:** Moving to `https_routes` with bearer token auth. Rejected because failpoints should never be available in production — even with authentication, the risk of credential compromise doesn't justify having the endpoint.

**Files:** `crates/api/src/https/mod.rs`

## GSDK-002: Unauthenticated `/mem_prof` Endpoint

**Problem:** `/mem_prof` is on the same unauthenticated HTTP router. It triggers jemalloc heap profiling via `mallctl("prof.active")` and `mallctl("prof.dump")`. Heap dumps capture all in-memory data including potentially private keys, DKG transcript material, and session tokens. Dump files may be written with default permissions (world-readable).

**Fix:** Same approach as GSDK-001 — gate behind `#[cfg(debug_assertions)]`. Additionally, if profiling is enabled in debug builds, ensure heap dump files are written with `0600` permissions using explicit `fs::set_permissions()` after dump.

**Files:** `crates/api/src/https/mod.rs`, `crates/api/src/https/heap_profiler.rs`

## GSDK-003: Consensus/DKG Endpoints on Plaintext HTTP

**Problem:** Seven routes serve sensitive consensus state over plaintext HTTP without TLS:
- `/dkg/randomness/:block_number` — per-block DKG randomness seeds
- `/consensus/latest_ledger_info` — current epoch state
- `/consensus/ledger_info/:epoch` — epoch transitions with validator sets
- `/consensus/block/:epoch/:round` — full consensus block data
- `/consensus/qc/:epoch/:round` — quorum certificates with aggregate BLS signatures
- `/consensus/validator_count/:epoch` — epoch validator count (public metadata)
- `/dkg/status` — current DKG session status

A passive MITM on the same network segment can collect DKG randomness, QC signatures, and full consensus block data.

**Fix:** Move sensitive routes to `https_routes` which enforces TLS via `ensure_https` middleware. Keep `/consensus/validator_count/:epoch` on HTTP as it's purely public metadata. Update `gravity_cli` DKG subcommands to use HTTPS URLs for these endpoints.

**Files:** `crates/api/src/https/mod.rs`, `bin/gravity_cli/src/dkg.rs`

## GSDK-004: SSRF in Sentinel Probe URL

**Problem:** `ProbeConfig.url` field in `sentinel.toml` is passed verbatim to `reqwest::get()` with no validation of scheme, host, or IP range. If an attacker gains write access to the config file (via compromised CI/CD, overly permissive file ACLs, or shared config management), they can set the probe URL to `http://169.254.169.254/latest/meta-data/iam/security-credentials/` and exfiltrate IAM role names through sentinel's alerting webhooks.

**Fix:** Add URL validation in `Config::load()`:
1. Parse URL with `url::Url::parse()`
2. Scheme allowlist: only `http` and `https`
3. Host IP range check: reject loopback (`127.0.0.0/8`), link-local (`169.254.0.0/16`), and RFC 1918 private ranges (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`)
4. Return hard error and refuse to start on validation failure

**Files:** `bin/sentinel/src/config.rs`, `bin/sentinel/src/probe.rs`


## Review Comments

### GSDK-001 & GSDK-002 & GSDK-003

**Review Comments** reviewer: Lightman; state: Accepted; comments: Moved the gate from individual route definitions in `mod.rs` to the server startup in `consensus_api.rs`. A single `#[cfg(debug_assertions)]` block prevents the entire HTTP/HTTPS API server from starting in release builds. This is cleaner than scattering annotations across 10+ lambdas and routes. The tx submission endpoints (`/tx/submit_tx`, `/tx/get_tx_by_hash`) are also not needed in production since transactions are submitted via the mempool p2p network.

### GSDK-004

**Review Comments** reviewer: Lightman; state: rejected; comments: (1) Loopback and private IP ranges must remain reachable — sentinel commonly probes localhost services (e.g., local node health endpoints), so blocking `127.0.0.0/8` and RFC 1918 ranges would break legitimate use cases. (2) The threat model does not apply — the operator who deploys and configures sentinel is responsible for the probe URLs in their own `sentinel.toml`. If an attacker already has write access to the config file, they have host-level access and URL validation provides no meaningful defense. There is no untrusted user input involved.
170 changes: 170 additions & 0 deletions docs/plans/2026-02-28-gsdk-fixes-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,170 @@
# GSDK Security Fixes Design (Round 2)

Date: 2026-02-28

## GSDK-006: Panic-Inducing `unwrap()` in Consensus-Critical Execution Loop

**Problem:** `start_execution()`, `start_commit_vote()`, and `start_commit()` in `reth_cli.rs` contain 15+ bare `.unwrap()` calls on fallible operations (transaction decoding, signer recovery, DB reads, vector indexing). A single malformed transaction or transient DB error crashes the entire validator node. An attacker who can inject a malformed transaction into the ordering layer could DoS all validators simultaneously.

**Fix:** Replace `.unwrap()` calls with proper error handling:
1. `TransactionSigned::decode_2718().unwrap()` / `recover_signer().unwrap()` → skip the transaction with `warn!()` if malformed.
2. `senders[idx].unwrap()` / `transactions[idx].unwrap()` → filter out `None` entries and log skipped transactions.
3. `provider.recover_block_number().unwrap()` → propagate the error up via `?` (the functions already return `Result`).
4. `exec_blocks.unwrap()` after partial error check → merge the error branch to handle all errors before unwrapping.
5. `block_ids.last().unwrap()` → guard with `is_empty()` check (already present but unwrap follows).
6. `.set_state().await.unwrap()` / `.wait_for_block_persistence().await.unwrap()` → propagate via `?`.

**Files:** `bin/gravity_node/src/reth_cli.rs`

**Review Comments** reviewer: Lightman; state: Accepted;


## GSDK-007: Panic in Relayer `get_last_state` on Missing Cached Result

**Problem:** `RelayerWrapper::get_last_state()` panics when `should_block_poll` returns `true` but no cached result exists. The panic message says "No cached result for uri" but the comment above says "fall through to poll". This is a logic bug: the panic should be a fallthrough to `poll_and_update_state()`.

**Fix:** Replace `panic!()` at L280 with:
```rust
warn!("No cached result for uri: {uri}, falling through to poll");
// Fall through to poll below
```
Remove the `return` path and let execution continue to `poll_and_update_state()`.

**Review Comments** reviewer: AlexYue; state: pending; comment: to be resolved

**Files:** `bin/gravity_node/src/relayer.rs`

## GSDK-008: Unvalidated Signer Recovery in Mempool External Transaction Ingestion

**Problem:** `add_external_txn()` calls `txn.recover_signer().unwrap()` on externally-submitted transactions. A crafted transaction with an invalid signature that passes `decode_2718` but fails `recover_signer` crashes the validator.

**Fix:** Replace `.unwrap()` with error handling:
```rust
let signer = match txn.recover_signer() {
Some(s) => s,
None => {
tracing::error!("Failed to recover signer for external transaction");
return false;
}
};
```

**Files:** `bin/gravity_node/src/mempool.rs`

**Review Comments** reviewer: Lightman; state: Accepted;


## GSDK-009: Verbose Internal Error Messages in HTTP API Responses

**Problem:** Error responses in `consensus.rs` and `dkg.rs` include `format!("{e:?}")` which leaks internal type names, schema names, and potentially file paths to API consumers.

**Fix:** Create a consistent error response pattern:
1. Log the detailed error server-side with `error!()`.
2. Return generic messages to clients: "Internal server error", "Resource not found", "Service unavailable".
3. Factor out into a helper function that takes a log-level error and returns a sanitized response.

**Files:** `crates/api/src/https/consensus.rs`, `crates/api/src/https/dkg.rs`

**Review Comments** reviewer: Lightman; state: Accepted;

## GSDK-010: `GLOBAL_CONFIG_STORAGE` `.unwrap()` Crashes Relayer During Startup Race

**Problem:** `GLOBAL_CONFIG_STORAGE.get().unwrap()` in `get_oracle_source_states()` panics if called before the storage is initialized. This can happen during node startup if the relayer is activated before the config storage is ready.

**Fix:** Replace `.get().unwrap()` with:
```rust
let config_storage = match GLOBAL_CONFIG_STORAGE.get() {
Some(cs) => cs,
None => {
warn!("GLOBAL_CONFIG_STORAGE not yet initialized, returning empty oracle states");
return vec![];
}
};
```

**Files:** `bin/gravity_node/src/relayer.rs`

**Review Comments** reviewer: AlexYue; state: Reject; comment: global variable cannot return empty

## GSDK-011: No Rate Limiting or Request Size Limits on HTTP/HTTPS Endpoints

**Problem:** The HTTP/HTTPS server has no rate limiting, body size limits, or connection limits. An attacker can flood endpoints or send oversized payloads to exhaust resources.

**Fix:** Add middleware layers to the router:
1. `axum::extract::DefaultBodyLimit::max(1_048_576)` (1 MB max request body)
2. `tower::limit::RateLimitLayer::new(100, Duration::from_secs(1))` (100 req/s)
3. `tower::limit::ConcurrencyLimitLayer::new(256)` (max concurrent requests)

**Files:** `crates/api/src/https/mod.rs`

**Review Comments** reviewer: Lightman; state: Accepted;

## GSDK-012: Sentinel Webhook URLs Not Validated for SSRF

**Problem:** `feishu_webhook` and `slack_webhook` in `AlertingConfig` are used directly in `reqwest::Client::post()` without SSRF validation. Same threat model as GSDK-004 but the fix was not applied to webhook URLs.

**Fix:** In `Config::load()`, after deserializing, call `validate_probe_url()` on each webhook URL:
```rust
if let Some(feishu) = &config.alerting.feishu_webhook {
if !feishu.is_empty() {
validate_probe_url(feishu)?;
}
}
if let Some(slack) = &config.alerting.slack_webhook {
if !slack.is_empty() {
validate_probe_url(slack)?;
}
}
```

**Files:** `bin/sentinel/src/config.rs`

**Review Comments** reviewer: Lightman; state: Reject; comment: sentinel is locally deployed, no changes for now

## GSDK-013: `ensure_https` Middleware Ineffective for Plain TCP Connections

**Problem:** `ensure_https` checks `req.uri().scheme_str()` but Axum/hyper does not populate the URI scheme for incoming TCP connections. When TLS is not configured (cert/key are `None`), the fallback path serves all routes (including "HTTPS-only" ones) over plain HTTP. The middleware may not reject these requests because `scheme_str()` returns `None` (not `"http"`).

**Fix:** Separate HTTP and HTTPS routes onto different listeners:
1. When TLS is configured: bind `https_routes` to the TLS listener and `http_routes` to a plaintext listener (or only serve HTTPS).
2. When TLS is NOT configured: do not register the sensitive `https_routes` at all. Log a startup warning that consensus/DKG endpoints are disabled without TLS.

**Alternative:** If a single port is required, check `ConnectInfo` or a custom `Extension` set by the TLS acceptor to detect whether the connection is encrypted.

**Files:** `crates/api/src/https/mod.rs`

**Review Comments** reviewer: Lightman; state: Accepted;

## GSDK-014: Address Parse `unwrap()` Can Crash Server on Invalid Bind Address

**Problem:** `self.address.parse().unwrap()` at L127 panics with no diagnostic message if the configured address is malformed.

**Fix:** Replace with `.parse().unwrap_or_else(|e| panic!("Invalid bind address '{}': {e}", self.address))` or return a `Result`.

**Files:** `crates/api/src/https/mod.rs`

**Review Comments** reviewer: Lightman; state: Accepted;

## GSDK-015: ReDoS Risk via User-Provided Regex in Sentinel Whitelist

**Problem:** Whitelist CSV accepts arbitrary regex patterns. A crafted evil regex can freeze the sentinel via catastrophic backtracking.

**Fix:** Use `RegexBuilder::new(pattern_str).size_limit(1 << 20).dfa_size_limit(1 << 20).build()` to limit compiled regex complexity. Log a warning and skip the rule if compilation exceeds the size limit.

**Files:** `bin/sentinel/src/whitelist.rs`

**Review Comments** reviewer: Lightman; state: Reject; comment: sentinel is locally deployed, no changes for now


## GSDK-016: Glob Pattern Injection in Sentinel File Monitoring

**Problem:** `file_patterns` config field is passed directly to `glob()`. A pattern like `/**/*` recursively scans the entire filesystem.

**Fix:** Validate patterns in `Config::load()`:
1. Reject patterns containing `..`
2. Reject patterns starting with `/` (require relative paths)
3. Add a maximum depth limit for glob results

**Files:** `bin/sentinel/src/config.rs`, `bin/sentinel/src/watcher.rs`

**Review Comments** reviewer: Lightman; state: Reject; comment: sentinel is locally deployed, no changes for now
72 changes: 72 additions & 0 deletions docs/security/2026-02-23-security-audit-report.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# Security Audit Report — gravity-sdk

**Date:** 2026-02-23
**Scope:** Gravity-specific code in `crates/api/`, `bin/gravity_node/`, `bin/gravity_cli/`, `bin/sentinel/`, `crates/block-buffer-manager/`, `aptos-core/consensus/` modifications
**Repository:** https://github.com/Galxe/gravity-sdk
**Release:** gravity-testnet-v1.0.0

---

## Summary

| Severity | Findings | Status | Commit |
|----------|----------|--------|--------|
| HIGH | 2 | All fixed | [`a0bf499`](https://github.com/Richard1048576/gravity-sdk/commit/a0bf499) |
| MEDIUM | 3 | All fixed | [`a0bf499`](https://github.com/Richard1048576/gravity-sdk/commit/a0bf499), [`21e6898`](https://github.com/Richard1048576/gravity-sdk/commit/21e6898) |
| **Total** | **5** | **All fixed** | |

---

## HIGH Severity (2)

### GSDK-001: Unauthenticated `/set_failpoint` on Plaintext HTTP

**File:** `crates/api/src/https/mod.rs:113`
**Issue:** `/set_failpoint` endpoint on unauthenticated plaintext HTTP router. Calls `fail::cfg()` directly, allowing arbitrary failpoint injection. 14+ failpoints throughout consensus layer (`consensus::send::proposal`, `consensus::process_proposal`, etc.). An attacker with network access to port 8080 can halt block production.
**Fix:** Gated behind `#[cfg(debug_assertions)]` so the route is compiled out in release builds. Added runtime warning on startup if failpoints feature is enabled in a release build.
**Commit:** `a0bf499`

### GSDK-002: Unauthenticated `/mem_prof` on Plaintext HTTP

**File:** `crates/api/src/https/mod.rs:114`
**Issue:** `/mem_prof` endpoint on unauthenticated plaintext HTTP. Triggers jemalloc heap dump that may capture in-memory private keys, DKG transcripts, and session material. Dump file may be world-readable.
**Fix:** Gated behind `#[cfg(debug_assertions)]` same as GSDK-001. Heap dump files now written with `0600` permissions.
**Commit:** `a0bf499`

---

## MEDIUM Severity (2)

### GSDK-003: Consensus/DKG Endpoints on Plaintext HTTP

**File:** `crates/api/src/https/mod.rs:105-112`
**Issue:** DKG randomness, quorum certificates, ledger info, and consensus block data served on plaintext HTTP without TLS. Passive MITM can collect DKG randomness values and aggregate BLS signatures from QCs.
**Fix:** Moved sensitive routes (`/dkg/randomness`, `/consensus/latest_ledger_info`, `/consensus/ledger_info`, `/consensus/block`, `/consensus/qc`) to `https_routes` with TLS enforcement. Only `/consensus/validator_count` (public metadata) remains on HTTP.
**Commit:** `a0bf499`

### GSDK-004: SSRF in Sentinel Probe URL

**File:** `bin/sentinel/src/probe.rs:34`, `bin/sentinel/src/config.rs:14`
**Issue:** Sentinel probe URL accepted without validation. Attacker who modifies `sentinel.toml` can cause sentinel to probe cloud metadata endpoints (`169.254.169.254`) and potentially exfiltrate IAM credentials via webhook alerts.
**Fix:** Added URL validation at `Config::load()` time: scheme allowlist (http/https only), reject reserved IP ranges (loopback, link-local, RFC 1918 private). Sentinel refuses to start if validation fails.
**Commit:** `a0bf499`

### GSDK-005: GSDK-004 Fix DNS Bypass

**File:** `bin/sentinel/src/config.rs:72`
**Issue:** The GSDK-004 fix only checked IP literals against the blocklist. When the host was a DNS hostname (e.g., `metadata.google.internal`, `169.254.169.254.nip.io`), `host.parse::<IpAddr>()` failed and all IP checks were bypassed. Additionally, a userinfo prefix (`http://x@169.254.169.254/`) was not stripped, allowing further bypass.
**Fix:** Resolve DNS hostnames via `ToSocketAddrs` and check all resolved IPs against the blocklist. Reject unresolvable hostnames (fail-closed). Strip userinfo component before extracting the host.
**Commit:** `21e6898`

---

## Commits

| Commit | Description | Files Changed |
|--------|-------------|---------------|
| [`a0bf499`](https://github.com/Richard1048576/gravity-sdk/commit/a0bf499) | GSDK-001/002/003/004: admin routes, plaintext HTTP, sentinel SSRF | 4 files |
| [`21e6898`](https://github.com/Richard1048576/gravity-sdk/commit/21e6898) | GSDK-005: fix DNS bypass in probe URL validation | 1 file |

## Design Documents

- [Security Fixes Design](../plans/2026-02-23-gsdk-fixes-design.md)
Loading
Loading