Version 0.3.3 — Deterministic resource analysis, policy-gated apply, and real-time monitoring for Kubernetes clusters.
```bash
# Install
go install github.com/ppiankov/kubenow/cmd/kubenow@latest
# Or download from releases: https://github.com/ppiankov/kubenow/releases/latest

# Monitor cluster problems (real-time TUI)
kubenow monitor

# Analyze over-provisioned resources
kubenow analyze requests-skew --prometheus-url http://localhost:9090

# High-resolution resource sampling for a workload
kubenow pro-monitor latch deployment/payment-api -n production

# Export recommendation as SSA patch
kubenow pro-monitor export deployment/payment-api -n production --format patch
```

Exit codes:
- 0 — Success
- 2 — Invalid input (bad flags, missing required args)
- 3 — Runtime error (cluster connection failed, query timeout)
A Kubernetes cluster analysis tool that combines:
- Deterministic cost analysis — evidence-based resource optimization using Prometheus metrics
- Pro-Monitor — policy-gated resource alignment with bounded Server-Side Apply
- Real-time monitoring — attention-first TUI for cluster problems
- LLM-assisted analysis — optional incident triage via any OpenAI-compatible API
What kubenow is not:
- Not an auto-scaler — presents evidence, never auto-adjusts resources without explicit consent
- Not a service mesh — queries existing APIs, installs nothing into the cluster
- Not an APM — no agents, no sidecars, no instrumentation
- Not a predictor — reports what would have worked historically, never what will work
- Not a replacement for monitoring — complements Prometheus, Grafana, and alerting
Status: Beta · v0.3.3 · Pre-1.0
| Milestone | Status |
|---|---|
| Core functionality | Complete |
| Test coverage >85% | Complete |
| Security audit | Complete |
| golangci-lint config | Complete |
| CI pipeline (test/lint/scan) | Complete |
| Homebrew distribution | Complete |
| Safety model documented | Complete |
| API stability guarantees | Partial |
| v1.0 release | Planned |
Pre-1.0: CLI flags and JSON output schemas may change between minor versions. Exit codes (0/2/3) are stable.
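Because the exit codes are stable, pipelines can branch on them. A minimal sketch — the `handle_exit` helper is illustrative, not part of kubenow:

```bash
# Map kubenow's documented exit codes (0/2/3) to pipeline actions.
# handle_exit is an illustrative helper, not part of kubenow itself.
handle_exit() {
  case "$1" in
    0) echo "ok" ;;
    2) echo "usage error: fix flags or required args" ;;
    3) echo "runtime error: check cluster/Prometheus connectivity" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# In a real pipeline (assumes kubenow is installed):
#   kubenow analyze requests-skew --prometheus-url http://127.0.0.1:9090
#   handle_exit $?
handle_exit 0   # prints "ok"
```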
kubenow is designed to be safe to run against production clusters. Every mode has structural guarantees — not just warnings.
kubenow installs nothing into your cluster. No agents, no sidecars, no CRDs, no webhooks. It reads existing APIs and exits. Uninstall means deleting the binary.
| Mode | Cluster Access | Writes? |
|---|---|---|
| monitor | Watch API (pods, events, nodes) | Never |
| analyze requests-skew | List API + Prometheus queries | Never |
| analyze node-footprint | List API + Prometheus queries | Never |
| pro-monitor latch | Metrics API (read) | Never |
| pro-monitor export | Read current workload | Never |
| pro-monitor apply | Server-Side Apply | Yes — only with policy file + confirmation |
Only pro-monitor apply can mutate cluster state. Before any mutation, every condition must pass:
- Admin policy file loaded and apply.enabled: true
- Safety rating meets policy minimum (UNSAFE always blocked)
- Namespace not denied by policy
- No HPA conflict detected (unless explicitly acknowledged)
- Latch data fresh (within policy max_latch_age, default 7 days)
- Change deltas within policy bounds (max_request_delta_percent, max_limit_delta_percent)
- Audit directory exists and is writable
- Rate limit not exceeded (global and per-workload)
- GitOps field manager conflict check (ArgoCD, Flux, Helm, Kustomize)
- User confirmation prompt
If any check fails, apply is denied. No partial applies.
Every apply attempt — successful or denied — creates an audit bundle:
```
20260221T143022Z__production__deployment__payment-api/
├── before.yaml     # workload state before apply
├── after.yaml      # workload state after apply
├── diff.patch      # unified diff of changes
└── decision.json   # full decision record (identity, evidence, guardrails, result)
```
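Because bundle directories are prefixed with a UTC timestamp, lexicographic sort is chronological, so the most recent apply can be located and reverted with standard tools. A sketch — the audit directory path is an assumption; use the path configured in your policy:

```bash
# Locate the newest audit bundle; the timestamp prefix makes lexicographic
# sort chronological. AUDIT_DIR is an assumed location, not kubenow's default.
AUDIT_DIR="${AUDIT_DIR:-$HOME/.kubenow/audit}"

# Demo with temporary directories standing in for real bundles:
AUDIT_DIR="$(mktemp -d)"
mkdir -p "$AUDIT_DIR/20260220T101500Z__staging__deployment__api"
mkdir -p "$AUDIT_DIR/20260221T143022Z__production__deployment__payment-api"

latest="$(ls -1d "$AUDIT_DIR"/*/ | sort | tail -n 1)"
echo "latest bundle: $latest"

# To revert the change it recorded (requires cluster access):
#   kubectl apply -f "$latest/before.yaml"
```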
Admins control maximum change magnitude via policy:
```yaml
apply:
  max_request_delta_percent: 25   # no single change > 25%
  max_limit_delta_percent: 25
  allow_limit_decrease: false     # limits can only increase
  min_safety_rating: SAFE         # block CAUTION/RISKY/UNSAFE
rate_limits:
  max_applies_per_hour: 5
  max_applies_per_workload: 2
```

Apply uses Kubernetes Server-Side Apply (SSA). Changes are standard resource patches — revert with kubectl apply using the before.yaml from the audit bundle, or let GitOps controllers reconcile back to the desired state.
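The delta bound amounts to a simple percentage check. A sketch in integer millicores — `within_delta` is illustrative, not kubenow's internal code:

```bash
# Illustrative check of max_request_delta_percent using integer millicores.
# within_delta is a sketch, not kubenow's actual implementation.
within_delta() {  # args: current proposed max_percent
  local cur=$1 new=$2 max=$3
  local diff=$(( new > cur ? new - cur : cur - new ))
  # diff/cur*100 <= max, kept in integer arithmetic to avoid rounding
  (( diff * 100 <= max * cur ))
}

within_delta 4000 3000 25 && echo "allowed" || echo "blocked"   # exactly -25%: allowed
within_delta 4000 2900 25 && echo "allowed" || echo "blocked"   # -27.5%: blocked
```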
Compares resource requests against actual Prometheus metrics over a configurable time window.
```bash
kubenow analyze requests-skew --prometheus-url http://prometheus:9090

# With namespace filtering and 7-day window
kubenow analyze requests-skew \
  --prometheus-url http://prometheus:9090 \
  --window 7d \
  --namespace-include "prod-*"

# SARIF output for CI integration
kubenow analyze requests-skew \
  --prometheus-url http://prometheus:9090 \
  --output sarif --export-file results.sarif

# Compare against saved baseline
kubenow analyze requests-skew \
  --prometheus-url http://prometheus:9090 \
  --compare-baseline baseline.json
```

Output:
```
=== Requests-Skew Analysis (Prometheus metrics only) ===
NAMESPACE  WORKLOAD          REQ CPU  P99 CPU  SKEW  SAFETY  IMPACT
prod       payment-api       4.0      3.8      8.0x  RISKY   HIGH (42.5)
prod       checkout-worker   2.0      0.5      6.7x  SAFE    MED (18.2)

Namespace Prometheus Status:
  production   368 series
  staging      142 series
  ads-fraud    no data — use pro-monitor latch for these workloads
```
Key features:
- Safety analysis: OOMKills, restarts, CPU throttling, spike patterns
- Safety ratings: SAFE, CAUTION, RISKY, UNSAFE with automatic margins
- Per-namespace Prometheus diagnostics with latch suggestions
- Obfuscation mode (--obfuscate) for sharing without exposing names
- Baseline comparison for tracking drift over time
- Output formats: table, JSON, SARIF
Bin-packing simulation to test alternative node configurations against historical data.
```bash
kubenow analyze node-footprint --prometheus-url http://prometheus:9090
```

Tests alternative topologies using a First-Fit Decreasing algorithm with feasibility checks and headroom calculation.
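First-Fit Decreasing can be sketched in a few lines: sort workloads by request size descending, then place each on the first node with room. The millicore values and node capacity below are synthetic, for illustration only:

```bash
# First-Fit Decreasing sketch: CPU requests in millicores, identical nodes.
# Values are synthetic — not kubenow's internals.
requests=(4000 2000 1500 3500 500)
node_cap=8000

# Sort descending, then place each request on the first node that fits.
sorted=($(printf '%s\n' "${requests[@]}" | sort -nr))
nodes=()  # remaining capacity per simulated node
for r in "${sorted[@]}"; do
  placed=0
  for i in "${!nodes[@]}"; do
    if (( nodes[i] >= r )); then
      nodes[i]=$(( nodes[i] - r ))
      placed=1
      break
    fi
  done
  (( placed )) || nodes+=( $(( node_cap - r )) )
done
echo "nodes needed: ${#nodes[@]}"   # prints "nodes needed: 2"
```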
Policy-gated resource alignment: latch, recommend, export, apply.
Samples workload resource usage at 1-5 second intervals via the Kubernetes Metrics API, capturing sub-scrape-interval spikes that Prometheus misses.
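A synthetic series shows why this matters: a spike shorter than the scrape interval falls between 30-second scrapes but is caught at 5-second sampling. The `usage_at` function and all numbers below are made up for illustration:

```bash
# Synthetic CPU series (millicores): a 10-second spike at t=40..49s.
# Illustrates sub-scrape-interval sampling; not kubenow's actual sampler.
usage_at() {
  local t=$1
  if (( t >= 40 && t < 50 )); then echo 3000; else echo 200; fi
}

max30=0
for t in 0 30 60 90; do                # 30s "scrape" interval
  v=$(usage_at "$t")
  if (( v > max30 )); then max30=$v; fi
done

max5=0
for t in $(seq 0 5 90); do             # 5s latch-style sampling
  v=$(usage_at "$t")
  if (( v > max5 )); then max5=$v; fi
done

echo "max seen at 30s: ${max30}m, at 5s: ${max5}m"   # prints "max seen at 30s: 200m, at 5s: 3000m"
```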
```bash
# Sample deployment for 30 minutes at 5-second intervals
kubenow pro-monitor latch deployment/payment-api -n production --duration 30m

# Sample a CRD-managed pod directly
kubenow pro-monitor latch pod/payments-main-db-2 -n production
```

The TUI shows real-time progress, and after completion computes a resource alignment recommendation with safety rating and confidence level.
CRD-managed workloads (CNPG, Strimzi, RabbitMQ, Redis, Elasticsearch) are automatically detected from pod labels and displayed with their operator type:
```
Workload:  pod/payments-main-db (CNPG)
Namespace: production
```
After latch completes, kubenow computes per-container resource recommendations:
- Safety ratings: SAFE (no signals), CAUTION (minor restarts), RISKY (OOMKills), UNSAFE (blocked)
- Confidence levels: HIGH (24h+ latch + Prometheus), MEDIUM (2h+ latch), LOW
- Policy bounds: admin-defined max delta percentages, minimum safety rating
- Evidence: sample count, gaps, percentiles (p50/p95/p99/max)
Export recommendations in multiple formats:
```bash
# SSA-compatible YAML patch (pipe to kubectl apply)
kubenow pro-monitor export deployment/payment-api -n production --format patch

# Full manifest with recommended values
kubenow pro-monitor export deployment/payment-api --format manifest

# Unified diff for review
kubenow pro-monitor export deployment/payment-api --format diff

# Machine-readable JSON
kubenow pro-monitor export deployment/payment-api --format json
```

Policy-gated mutation via Kubernetes Server-Side Apply. Requires an admin policy file.
Pre-flight checks before any mutation:
- Policy loaded and apply enabled
- Safety rating meets policy minimum
- Namespace allowed
- HPA not detected (unless acknowledged)
- Latch data fresh (within policy MaxLatchAge)
- Audit path writable
- Rate limit not exceeded
GitOps conflict detection: inspects managedFields for ArgoCD, Flux, Helm, and Kustomize field managers. Reports conflict rather than overwriting.
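Conceptually, this check scans managedFields manager names for known GitOps controllers. A simplified sketch — the manager list below is synthetic; a real check reads .metadata.managedFields from the API (visible via kubectl get --show-managed-fields -o yaml):

```bash
# Simplified sketch of GitOps field-manager detection. The manager list is
# synthetic; the real data lives in .metadata.managedFields on the workload.
managers="kubectl-client-side-apply argocd-controller"

conflict=""
for m in $managers; do
  case "$m" in
    argocd*|flux*|helm*|kustomize*) conflict="$m" ;;
  esac
done
echo "gitops manager: ${conflict:-none}"   # prints "gitops manager: argocd-controller"
```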
Press `l` during latch to view structural traffic topology:
- Services matching the workload's pod selector
- Ingress routes and TLS configuration
- Network policies (allowed sources)
- Namespace neighbors ranked by CPU usage
Shows possible traffic paths from Kubernetes API state, not measured traffic.
Admin-controlled guardrails via a policy file:
```bash
kubenow pro-monitor latch deployment/payment-api --policy /etc/kubenow/policy.yaml

# Validate policy without running
kubenow pro-monitor validate-policy --policy policy.yaml --check-paths
```

Three operating modes:
- Observe Only — no policy or disabled: view metrics, no recommendations
- Export Only — policy present, apply disabled: recommendations with bounds, export only
- Apply Ready — policy present, apply enabled: full latch-recommend-export-apply pipeline
Every apply operation creates a tamper-evident audit bundle:
- before.yaml / after.yaml — workload state snapshots
- diff.patch — unified diff of changes
- decision.json — full decision record (identity, evidence, guardrails, result)
Rate limiting: configurable global and per-workload apply limits per time window.
Terminal UI for cluster problems, designed like top for Kubernetes issues.
```bash
kubenow monitor
# Press 1/2/3 to sort, arrow keys to scroll, c to copy, q to quit
```

- Attention-first: empty screen when healthy, shows only broken things
- Watches for: OOMKills, CrashLoopBackOff, ImagePullBackOff, failed pods, node issues
- Service mesh health: linkerd/istio control plane failures and certificate expiry
- Sortable by severity, recency, or count
- Press c to dump everything to terminal for copying
Use --severity critical to filter for critical issues only.
Automatically detects linkerd and istio control plane failures and certificate expiry. Runs regardless of --namespace filter because mesh failures affect all namespaces. Silently skips if the mesh is not installed or RBAC denies access.
What's detected:
| Check | Severity | Condition |
|---|---|---|
| Control plane down | FATAL | Deployment in mesh namespace has 0 available replicas |
| Certificate expiry | WARNING | Cert expires within 7 days |
| Certificate expiry | CRITICAL | Cert expires within 48 hours |
| Certificate expiry | FATAL | Cert expires within 24 hours or already expired |
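The thresholds in the table reduce to a simple hours-to-expiry ladder. An illustrative sketch — `severity` is a made-up helper, not kubenow code:

```bash
# Hours-until-expiry -> severity, matching the thresholds in the table above.
# The severity helper is illustrative, not part of kubenow.
severity() {
  local hours=$1
  if   (( hours <= 24 ));     then echo FATAL      # <= 24h or already expired
  elif (( hours <= 48 ));     then echo CRITICAL   # <= 48h
  elif (( hours <= 24 * 7 )); then echo WARNING    # <= 7 days
  else                             echo OK
  fi
}

severity 12    # prints "FATAL"
severity 100   # prints "WARNING"
```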
Supported meshes: Linkerd (linkerd namespace), Istio (istio-system namespace)
RBAC requirements: Read access to Deployments and Secrets in the mesh namespace(s). If access is denied, monitoring is silently skipped — no errors are reported.
Disable with: --no-mesh
```bash
# Monitor without service mesh checks
kubenow monitor --no-mesh
```

Feed cluster snapshots into any OpenAI-compatible API for incident triage, pod debugging, compliance checks, and chaos suggestions.
```bash
# Incident triage
kubenow incident --llm-endpoint http://localhost:11434/v1 --model mixtral

# Pod debugging with filters
kubenow pod --llm-endpoint http://localhost:11434/v1 --model mixtral \
  --include-pods "payment-*" --namespace production

# Export report
kubenow incident --llm-endpoint https://api.openai.com/v1 --model gpt-4o \
  --output incident-report.md
```

Works with Ollama, OpenAI, Azure OpenAI, DeepSeek, Groq, Together, OpenRouter, or any /v1/chat/completions endpoint.
Available modes: incident, pod, teamlead, compliance, chaos
```
kubenow CLI
┌──────────────┬──────────────┬──────────────┬──────────┐
│ monitor      │ analyze      │ pro-monitor  │ LLM      │
│              │              │              │ modes    │
│ Real-time    │ requests-skew│ latch/export │ incident │
│ problem      │node-footprint│ apply/status │ pod      │
│ detection    │              │              │ teamlead │
│              │              │ Policy Engine│ compliance│
│              │              │ Audit Trail  │ chaos    │
│              │              │ Exposure Map │          │
└──────┬───────┴──────┬───────┴───────┬──────┴────┬─────┘
       │              │               │           │
       ▼              ▼               ▼           ▼
 ┌───────────┐  ┌──────────┐  ┌─────────────┐  ┌─────┐
 │Kubernetes │  │Prometheus│  │Kubernetes   │  │ LLM │
 │ Watch     │  │ API      │  │Metrics API  │  │ API │
 │ API       │  │          │  │+ SSA Apply  │  │     │
 └───────────┘  └──────────┘  └─────────────┘  └─────┘
```
See docs/architecture.md for details.
Requires Go >= 1.25
```bash
git clone https://github.com/ppiankov/kubenow
cd kubenow
make build
sudo mv bin/kubenow /usr/local/bin/
```

Download from GitHub Releases.
Available for Linux (amd64, arm64), macOS (amd64, arm64), and Windows (amd64).
```bash
kubenow version
# kubenow version 0.3.3
```

```bash
# Port-forward (recommended for local analysis)
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
kubenow analyze requests-skew --prometheus-url http://127.0.0.1:9090

# Auto-detect in-cluster Prometheus
kubenow analyze requests-skew --auto-detect-prometheus

# Via Kubernetes service
kubenow analyze requests-skew --k8s-service prometheus-operated --k8s-namespace monitoring
```

Use http://127.0.0.1:9090 (not http://prometheus:9090) for port-forward. Analysis is read-only.
- Missing metrics: Check container_cpu_usage_seconds_total exists in Prometheus
- Namespace has no Prometheus data: Use pro-monitor latch for workloads in unscraped namespaces
- Time window too old: Try --window 7d
- Prometheus unreachable: Test with curl http://127.0.0.1:9090/api/v1/query?query=up
- "No policy file found": Pass
--policy path/to/policy.yamlor set env var - "Audit path not writable": Ensure the audit directory exists and is writable
- "Latch data stale": Re-run latch — data expires after policy MaxLatchAge (default 7 days)
- "Apply denied: HPA detected": Pass
--acknowledge-hpaif HPA conflict is acceptable
```bash
# Silent JSON output for pipelines
kubenow analyze requests-skew \
  --prometheus-url http://prometheus:9090 \
  --silent --output json --export-file results.json

# SARIF for GitHub Security tab
kubenow analyze requests-skew \
  --prometheus-url http://prometheus:9090 \
  --output sarif --export-file results.sarif

# Fail pipeline on critical issues
kubenow analyze requests-skew \
  --prometheus-url http://prometheus:9090 \
  --fail-on critical
```

- Prometheus metrics required for requests-skew (Metrics API alone is insufficient for historical data)
- Pro-Monitor apply limited to Deployment, StatefulSet, DaemonSet (Pod apply blocked — managed by controllers)
- CRD operator detection relies on well-known pod labels; custom operators need app.kubernetes.io/managed-by
- LLM analysis quality depends on the model used
See CHANGELOG.md for version history. Planned:
- Auto-detect Prometheus in-cluster
- Cloud provider cost integration (AWS, GCP, Azure)
- Historical trend tracking
This tool follows the principles of Attention-First Software:
> The primary responsibility of software is to disappear once it works correctly.
- Deterministic analysis over prescriptive recommendations
- Evidence-based outputs ("this would have worked") not predictions ("you should do this")
- Actions are reversible; irreversible ones require explicit consent and structural safeguards
- Tools present evidence and let users decide — mirrors, not oracles
Read the full manifesto: MANIFESTO.md
See CONTRIBUTING.md for development setup, testing guidelines, and code style.
MIT License — see LICENSE for details.
