Attention-first infrastructure monitoring. Shows only what matters right now — stays silent otherwise.

infranow


Real-time infrastructure triage -- deterministic problem detection for Kubernetes and Prometheus.

What it is

infranow is a CLI/TUI tool that consumes Prometheus metrics and deterministically identifies the most important infrastructure problems right now. It runs 15 built-in detectors on a loop, ranks problems by severity and persistence, and presents them in an interactive terminal UI or as structured JSON.

When systems are healthy, the screen is empty. When something breaks, it appears immediately, ranked by importance. No dashboards. No graphs. No exploration.

What it is NOT

  • Not a dashboard or visualization tool
  • Not a metric collector or storage engine
  • Not an alerting system (no PagerDuty, Slack, webhooks)
  • Not an anomaly detection or ML/AI system
  • Not a replacement for Grafana, Datadog, or Prometheus itself
  • Not a historical analysis tool

infranow shows what is failing right now. Nothing else.

Project Status

Status: Alpha · v0.1.2 · Pre-1.0

Milestone                       Status
Core functionality              Complete
Test coverage >85%              Partial
Security audit                  Complete
golangci-lint config            Complete
CI pipeline (test/lint/scan)    Complete
Homebrew distribution           Complete
Safety model documented         Complete
API stability guarantees        Not yet
v1.0 release                    Planned

Pre-1.0: CLI flags and JSON output schemas may change between minor versions.

Safety Model

infranow is designed to be safe to run against any Prometheus instance, including production.

Zero Footprint

Property             Guarantee
Cluster writes       None. infranow never writes to Kubernetes.
CRDs / operators     None. No custom resources, no controllers, no agents.
Prometheus writes    None. Read-only PromQL queries via HTTP API.
Persistent state     None. All state is in-memory; exits clean.
Network listeners    None. No ports opened, no servers started.
Disk writes          Only when explicitly requested (--export-file, --save-baseline).

Read-Only by Design

infranow issues GET /api/v1/query requests to Prometheus. It cannot modify metrics, alerting rules, recording rules, or any cluster state. There is no write path in the codebase.

Bounded Resource Usage

  • Problem map is capped at 10,000 entries to prevent unbounded memory growth
  • Each detector runs with a configurable timeout (default 30s)
  • Concurrent detector execution is optionally bounded (--max-concurrency)
  • Stale problems are pruned after 1 minute without re-detection

Credential Safety

  • Prometheus URLs with embedded credentials are redacted in all UI and log output
  • Export files are written with restrictive permissions (0600)
  • No credentials are stored or cached
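Redacting embedded credentials amounts to stripping the userinfo portion of the URL before it reaches any UI or log line. A minimal sketch of that idea, assuming the approach described above; redactURL is an illustrative name, not infranow's API.

```go
package main

import (
	"fmt"
	"net/url"
)

// redactURL removes any user:password portion from a URL string.
// On parse failure it returns a fixed placeholder rather than risk
// echoing embedded credentials back to the caller.
func redactURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "<unparseable url>"
	}
	u.User = nil // drop userinfo entirely
	return u.String()
}

func main() {
	fmt.Println(redactURL("http://admin:s3cret@prom.example.com:9090"))
	// http://prom.example.com:9090
}
```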

Philosophy

Principiis obsta -- resist the beginnings.

infranow is designed to surface active failures before damage spreads, then go silent when the problem resolves. The core principles:

  • Silence is the success condition. Empty output means healthy systems.
  • Deterministic detection only. Every detector is a PromQL query with an explicit threshold. No statistical models, no learning, no probability. The same metrics always produce the same result.
  • Evidence over recommendations. Show the data, provide a hint, let the operator decide.
  • Attention is finite. Problems are ranked by a score that combines severity, blast radius, and persistence. The most important problem is always first.
  • Bounded scope. One Prometheus source per instance. Run multiple instances for multiple sources. No aggregation, no federation.

Quick start

# Install from source
go install github.com/ppiankov/infranow/cmd/infranow@latest

# Or build locally
git clone https://github.com/ppiankov/infranow.git
cd infranow
make build

# Run against a Prometheus instance
./bin/infranow monitor --prometheus-url http://localhost:9090

Usage

Monitor mode (TUI)

infranow monitor --prometheus-url http://localhost:9090

The interactive TUI displays problems ranked by importance. Keyboard controls:

Key             Action
q, Ctrl+C       Quit
p, Space        Pause/resume detection
s               Cycle sort: severity, recency, count
j/k, Up/Down    Scroll
g/G             Jump to top/bottom
/               Search/filter
Esc             Clear filter

JSON mode

infranow monitor --prometheus-url http://localhost:9090 --output json

Waits for the first detection cycle, then outputs all problems as JSON to stdout and exits. Suitable for CI/CD pipelines and scripting.

Baseline compare

# Save a baseline snapshot
infranow monitor --prometheus-url http://prom:9090 --output json --save-baseline baseline.json

# Compare against baseline, fail if new problems appear
infranow monitor --prometheus-url http://prom:9090 --output json \
  --compare-baseline baseline.json --fail-on-drift
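The drift check reduces to a set difference: --fail-on-drift exits non-zero when a problem ID appears now that was absent from the saved baseline. A sketch under that stated behavior; newProblems and the string-ID representation are assumptions, not infranow's actual types.

```go
package main

import "fmt"

// newProblems returns IDs present in current but not in baseline.
func newProblems(baseline, current []string) []string {
	seen := make(map[string]bool, len(baseline))
	for _, id := range baseline {
		seen[id] = true
	}
	var drift []string
	for _, id := range current {
		if !seen[id] {
			drift = append(drift, id)
		}
	}
	return drift
}

func main() {
	baseline := []string{"pod/api: CrashLoopBackOff"}
	current := []string{"pod/api: CrashLoopBackOff", "node/worker-1: DiskSpace"}
	// Only the problem absent from the baseline counts as drift.
	fmt.Println(newProblems(baseline, current)) // [node/worker-1: DiskSpace]
}
```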

Kubernetes port-forward

# Automatic port-forward management
infranow monitor --k8s-service prometheus-operated --k8s-namespace monitoring

# Custom ports
infranow monitor --k8s-service prometheus-operated \
  --k8s-namespace monitoring \
  --k8s-local-port 9091 --k8s-remote-port 9090

CI/CD gate

# Exit 1 if any CRITICAL or FATAL problems exist
infranow monitor --prometheus-url http://prom:9090 --output json --fail-on CRITICAL
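The gate logic itself is a severity-rank comparison: exit non-zero when any problem sits at or above the --fail-on threshold. A sketch of that rule using the README's ordering (WARNING < CRITICAL < FATAL); shouldFail is an illustrative name.

```go
package main

import "fmt"

var severityRank = map[string]int{"WARNING": 1, "CRITICAL": 2, "FATAL": 3}

// shouldFail reports whether any detected severity meets or exceeds
// the threshold given to --fail-on.
func shouldFail(detected []string, failOn string) bool {
	threshold := severityRank[failOn]
	for _, s := range detected {
		if severityRank[s] >= threshold {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldFail([]string{"WARNING", "CRITICAL"}, "CRITICAL")) // true
	fmt.Println(shouldFail([]string{"WARNING"}, "CRITICAL"))             // false
}
```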

All flags

infranow monitor [flags]

Connection:
  --prometheus-url string       Prometheus endpoint URL (required unless using --k8s-service)
  --prometheus-timeout duration Prometheus query timeout (default 30s)
  --k8s-service string          Kubernetes service name for port-forward
  --k8s-namespace string        Kubernetes namespace for service (default "monitoring")
  --k8s-local-port string       Local port for port-forward (default "9090")
  --k8s-remote-port string      Remote port for port-forward (default "9090")

Detection:
  --namespace string            Filter by namespace pattern (regex)
  --entity-type string          Filter by entity type
  --min-severity string         Minimum severity: FATAL, CRITICAL, WARNING (default "WARNING")
  --refresh-interval duration   Detection refresh rate (default 10s)
  --max-concurrency int         Max concurrent detector executions (0 = unlimited)
  --detector-timeout duration   Detector execution timeout (default 30s)

Output:
  --output string               Output format: table, json (default "table")
  --export-file string          Export problems to file

Baseline:
  --save-baseline string        Save problems snapshot to file
  --compare-baseline string     Compare current problems to baseline file
  --fail-on-drift               Exit 1 if new problems detected vs baseline

CI/CD:
  --fail-on string              Exit 1 if problems at/above severity (WARNING, CRITICAL, FATAL)
  --include-namespaces string   Comma-separated namespace patterns to include
  --exclude-namespaces string   Comma-separated namespace patterns to exclude

Global:
  --config string               Config file (default $HOME/.infranow.yaml)
  -v, --verbose                 Enable verbose logging

Detectors

infranow ships with 15 built-in detectors, each running independently at its own interval. The table below summarizes the core set.

Detector               | Metric                                                                             | Severity                   | Threshold                 | Interval
OOMKill                | kube_pod_container_status_restarts_total{reason="OOMKilled"}                       | CRITICAL                   | > 0 restarts in 5m window | 30s
CrashLoopBackOff       | kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}                | FATAL                      | Pod in state              | 30s
ImagePullBackOff       | kube_pod_container_status_waiting_reason{reason=~"ImagePullBackOff|ErrImagePull"}  | CRITICAL                   | Pod in state              | 30s
PodPending             | kube_pod_status_phase{phase="Pending"}                                             | CRITICAL                   | Pending > 5 minutes       | 30s
HighErrorRate          | rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])       | CRITICAL                   | > 5% error rate           | 30s
DiskSpace              | 1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)                     | WARNING / CRITICAL         | >= 90% / >= 95%           | 60s
HighMemoryPressure     | 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)                  | CRITICAL                   | > 90% usage               | 30s
LinkerdControlPlane    | kube_deployment_status_replicas_available{namespace="linkerd"}                     | FATAL                      | == 0 replicas             | 30s
LinkerdProxyInjection  | kube_pod_container_status_waiting_reason{namespace="linkerd"}                      | CRITICAL                   | CrashLoopBackOff          | 30s
IstioControlPlane      | kube_deployment_status_replicas_available{namespace="istio-system"}                | FATAL                      | == 0 replicas             | 30s
IstioSidecarInjection  | kube_pod_container_status_waiting_reason{namespace="istio-system"}                 | CRITICAL                   | CrashLoopBackOff          | 30s
LinkerdCertExpiry      | identity_cert_expiry_timestamp - time()                                            | WARNING / CRITICAL / FATAL | < 7d / < 48h / < 24h      | 60s
IstioCertExpiry        | citadel_server_root_cert_expiry_timestamp - time()                                 | WARNING / CRITICAL / FATAL | < 7d / < 48h / < 24h      | 60s

See docs/DETECTORS.md for detailed documentation.

Architecture

cmd/infranow/          Entry point. Minimal main.go, delegates to internal/cli.
internal/
  cli/                 Cobra commands. Root command + monitor subcommand.
  metrics/             MetricsProvider interface + PrometheusClient implementation.
  detector/            Detector interface + Registry + the concrete detector implementations.
  models/              Problem struct, Severity type, scoring logic.
  monitor/             Watcher (detection orchestrator) + Bubble Tea TUI.
  filter/              Post-detection namespace filtering (include/exclude globs).
  baseline/            Snapshot save/load and diff comparison.
  util/                Exit codes + Kubernetes port-forward via client-go.

Data flow:

Prometheus --> MetricsProvider --> Detectors --> Watcher --> TUI or JSON
                                                  |
                                          Filter + Baseline
                                                  |
                                           Exit code (CI/CD)

The Watcher runs each detector in its own goroutine at the detector's configured interval. Results are merged into a shared problem map (deduplicated by ID, count incremented on re-detection, pruned after 1 minute of staleness). The TUI subscribes to change notifications via a channel. JSON mode waits for the first detection cycle, then dumps and exits.

Problem score formula: severity_weight * (1 + blast_radius * 0.1) * (1 + persistence / 3600). Severity weights: WARNING=10, CRITICAL=50, FATAL=100.
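The formula transcribes directly into Go. The arithmetic and severity weights below are taken verbatim from the formula above; score and its parameter names are illustrative.

```go
package main

import "fmt"

// Severity weights as documented: WARNING=10, CRITICAL=50, FATAL=100.
var severityWeight = map[string]float64{"WARNING": 10, "CRITICAL": 50, "FATAL": 100}

// score ranks a problem; persistenceSec is how long it has been active.
func score(severity string, blastRadius int, persistenceSec float64) float64 {
	return severityWeight[severity] * (1 + float64(blastRadius)*0.1) * (1 + persistenceSec/3600)
}

func main() {
	// A FATAL problem touching 5 entities, active for one hour:
	// 100 * (1 + 0.5) * (1 + 1) = 300
	fmt.Println(score("FATAL", 5, 3600)) // 300
}
```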

Known limitations

  • No integration tests. Unit test coverage is >80% but there are no integration tests against a live Prometheus instance.
  • No config file support. The --config flag is accepted but not wired to anything yet.
  • Single Prometheus source. No federation, no multi-source aggregation. By design, but worth noting.
  • No custom detectors. Detector set is compiled in. No plugin system or config-driven detection yet.
  • Stale problem pruning is time-based. Problems disappear after 1 minute without re-detection, regardless of whether the underlying issue resolved.

Roadmap

v0.1.1

  • Increase test coverage to >80% across all packages (done)
  • Service mesh detectors for linkerd and istio (done: 6 detectors)
  • Certificate expiry detection with tiered severity (done)

v0.1.2 (current)

  • Trustwatch certificate and probe detectors (done: 2 detectors)
  • Security audit: SHA-pinned CI/CD actions, Trivy scanning, supply chain integrity (done)
  • Security hardening: context timeouts, SSRF prevention, file permissions, signal handling (done)
  • golangci-lint config with gocritic, gocyclo, revive (done)
  • Integration tests with docker-compose + Prometheus
  • Config file support (YAML)
  • Custom detector thresholds via config

v0.2.0

  • SARIF output for GitHub Code Scanning integration
  • Prometheus self-metrics endpoint
  • Detector plugin system

Future

  • Additional detectors (Kafka, databases, custom services)
  • Multi-Prometheus aggregation mode
  • Web UI for remote access

License

MIT License. See LICENSE.

Contributing

See CONTRIBUTING.md for development setup, coding standards, and the detector authoring guide.
