⚠️ Status: Experimental / Work In Progress (WIP)
This project is currently in an early development and research phase. It is not yet recommended for use in production environments.
Support teams frequently encounter situations where an etcd member is silently lagging or unhealthy before a full quorum loss event. Unfortunately, there is often no built-in mechanism in Rancher/RKE2 environments that surfaces this proactively.
etcdoc (formerly etcd-cluster-reliability-tool) is a lightweight daemon designed to track critical etcd member health indicators, such as:
- Database size growth rate
- Disk fsync and backend commit latency
- Leader election frequency
Architecture:
The tool is designed to run either as a Kubernetes DaemonSet on control-plane nodes, or as a standalone Local Container (via docker-compose or nerdctl). It locally scrapes the etcd /metrics endpoint, evaluates the data against predefined safety thresholds, and outputs actionable alerts to standard output (pod logs) and as Prometheus metrics.
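The scrape-and-evaluate loop described above can be sketched in Go as follows. This is an illustration under stated assumptions, not etcdoc's actual implementation: `parseGauges` and `breaches` are hypothetical helpers, the parser handles only unlabelled samples from the Prometheus text format, and the limits in `main` are example values. The metric names `etcd_server_proposals_pending` and `etcd_mvcc_db_total_size_in_bytes` are standard etcd metrics.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseGauges extracts unlabelled samples ("name value") from Prometheus
// text-format output. Comment lines and labelled series are skipped.
func parseGauges(text string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 || strings.Contains(fields[0], "{") {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		out[fields[0]] = v
	}
	return out
}

// breaches returns the sorted names of metrics whose sample exceeds its limit.
func breaches(samples, limits map[string]float64) []string {
	var names []string
	for name, limit := range limits {
		if v, ok := samples[name]; ok && v > limit {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Example scrape output; in practice this would come from the etcd /metrics endpoint.
	scrape := "# TYPE etcd_server_proposals_pending gauge\n" +
		"etcd_server_proposals_pending 7\n" +
		"etcd_mvcc_db_total_size_in_bytes 1.5e+09\n"

	// Example limits only, not the tool's real defaults.
	limits := map[string]float64{
		"etcd_server_proposals_pending":    5,
		"etcd_mvcc_db_total_size_in_bytes": 8e9,
	}
	for _, name := range breaches(parseGauges(scrape), limits) {
		fmt.Println("ALERT: threshold exceeded for", name)
	}
}
```

Latency indicators such as fsync duration are exposed by etcd as histograms, so evaluating them needs bucket handling that this sketch deliberately leaves out.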
- An RKE2 environment with access to the host filesystem where etcd TLS certificates are stored (the RKE2 default path is `/var/lib/rancher/rke2/server/tls/etcd/`).
A complete Kubernetes manifest is provided in manifests/etcdoc.yaml.
kubectl apply -f manifests/etcdoc.yaml

This manifest deploys:
- A `ServiceAccount`, `ClusterRole`, and `RoleBinding` for Kubernetes leader election.
- A `ConfigMap` for application settings.
- A `DaemonSet` targeting nodes with the `node-role.kubernetes.io/control-plane` label.
- A `Service` and `ServiceMonitor` (for Prometheus Operator integration).
For support engineers needing immediate answers on cluster health, etcdoc includes a one-shot diagnostic mode. This runs a single evaluation cycle and exits with a semantic status code (0 for healthy, 1 for unhealthy). You can run this directly on an RKE2 node using the packaged ctr binary, eliminating the need to install Docker.
# Via containerd (ctr) on an RKE2 node
sudo /var/lib/rancher/rke2/bin/ctr --address /run/k3s/containerd/containerd.sock --namespace k8s.io run --rm --net-host --user 0:0 \
--mount type=bind,src=/var/lib/rancher/rke2/server/tls/etcd,dst=/var/lib/rancher/rke2/server/tls/etcd,options=rbind:ro \
docker.io/library/etcdoc:local etcdoc-diag /etcdoc --once

The tool is configured via config.yaml or environment variables. Key thresholds include:
- `FSYNC_LATENCY_SECONDS`: Max allowed latency for disk fsync.
- `BACKEND_COMMIT_LATENCY_SECONDS`: Max allowed latency for backend commits.
- `MAX_LEADER_CHANGES_5M`: Max number of leader elections within a 5-minute window.
- `MAX_PENDING_PROPOSALS`: Max allowed pending raft proposals.
- `MAX_DB_SIZE_BYTES`: Max allowed database size in bytes (default 8GB).
The tool exposes its own health and operational metrics on port 8080:
- `/health`: Returns the last evaluated state in JSON format.
- `/metrics`: Exposes Prometheus metrics regarding the tool's internal operations and dispatched alerts. Critical alerting now relies on an external Prometheus/Alertmanager setup scraping these metrics.
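For quick checks from a script or another service, the `/health` endpoint could be polled with a small Go client like the one below. This is a sketch under assumptions: the JSON schema of the response is not documented here, so the body is decoded into a generic map, and the URL assumes the tool is reachable on localhost. `decodeHealth` and `fetchHealth` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// decodeHealth parses a /health response body into a generic map, because
// the exact JSON schema is not specified in this document.
func decodeHealth(body []byte) (map[string]any, error) {
	var state map[string]any
	if err := json.Unmarshal(body, &state); err != nil {
		return nil, fmt.Errorf("decode health JSON: %w", err)
	}
	return state, nil
}

// fetchHealth performs a GET against url and decodes the JSON response.
func fetchHealth(url string) (map[string]any, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return decodeHealth(body)
}

func main() {
	// Assumes etcdoc is listening on localhost:8080.
	state, err := fetchHealth("http://localhost:8080/health")
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	for k, v := range state {
		fmt.Printf("%s: %v\n", k, v)
	}
}
```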
To build the Go binary locally:
go build -o bin/etcdoc ./cmd/etcdoc

To build and push your own image:
# Build the image
docker build -t your-registry/etcdoc:latest .
# Push the image
docker push your-registry/etcdoc:latest

Note: Ensure you update the image reference in the DaemonSet manifest after pushing your custom image.