diff --git a/docs/operator-public-documentation/preview/monitoring/metrics.md b/docs/operator-public-documentation/preview/monitoring/metrics.md new file mode 100644 index 00000000..95540c0c --- /dev/null +++ b/docs/operator-public-documentation/preview/monitoring/metrics.md @@ -0,0 +1,315 @@ +# Metrics Reference + +This page documents the key metrics available when monitoring a DocumentDB cluster, organized by source. Each section includes the metric name, description, labels, and example PromQL queries. + +## Container Resource Metrics + +These metrics are collected via the kubelet/cAdvisor interface (or the OpenTelemetry `kubeletstats` receiver). They cover CPU, memory, network, and filesystem for the **postgres** and **documentdb-gateway** containers in each DocumentDB pod. + +### CPU + +| Metric | Type | Description | +|--------|------|-------------| +| `container_cpu_usage_seconds_total` | Counter | Cumulative CPU time consumed in seconds | +| `container_spec_cpu_quota` | Gauge | CPU quota (microseconds per `cpu_period`) | +| `container_spec_cpu_period` | Gauge | CPU CFS scheduling period (microseconds) | + +**Common labels:** `namespace`, `pod`, `container`, `node` + +#### Example Queries + +CPU usage rate per container over 5 minutes: + +```promql +rate(container_cpu_usage_seconds_total{ + container=~"postgres|documentdb-gateway", + pod=~".*documentdb.*" +}[5m]) +``` + +CPU utilization as a percentage of limit: + +```promql +(rate(container_cpu_usage_seconds_total{ + container="postgres", + pod=~".*documentdb.*" +}[5m]) +/ on(pod, container) +(container_spec_cpu_quota{ + container="postgres", + pod=~".*documentdb.*" +} +/ container_spec_cpu_period{ + container="postgres", + pod=~".*documentdb.*" +})) * 100 +``` + +Compare gateway vs. 
postgres CPU across all pods: + +```promql +sum by (container) ( + rate(container_cpu_usage_seconds_total{ + container=~"postgres|documentdb-gateway", + pod=~".*documentdb.*" + }[5m]) +) +``` + +### Memory + +| Metric | Type | Description | +|--------|------|-------------| +| `container_memory_working_set_bytes` | Gauge | Current working set memory (bytes) | +| `container_memory_rss` | Gauge | Resident set size (bytes) | +| `container_memory_cache` | Gauge | Page cache memory (bytes) | +| `container_spec_memory_limit_bytes` | Gauge | Memory limit (bytes) | + +**Common labels:** `namespace`, `pod`, `container`, `node` + +#### Example Queries + +Memory usage in MiB per container: + +```promql +container_memory_working_set_bytes{ + container=~"postgres|documentdb-gateway", + pod=~".*documentdb.*" +} / 1024 / 1024 +``` + +Memory utilization as a percentage of limit: + +```promql +(container_memory_working_set_bytes{ + container=~"postgres|documentdb-gateway", + pod=~".*documentdb.*" +} +/ container_spec_memory_limit_bytes{ + container=~"postgres|documentdb-gateway", + pod=~".*documentdb.*" +}) * 100 +``` + +Top 5 pods by memory usage: + +```promql +topk(5, + sum by (pod) ( + container_memory_working_set_bytes{ + container=~"postgres|documentdb-gateway", + pod=~".*documentdb.*" + } + ) +) +``` + +### Network + +| Metric | Type | Description | +|--------|------|-------------| +| `container_network_receive_bytes_total` | Counter | Bytes received | +| `container_network_transmit_bytes_total` | Counter | Bytes transmitted | + +**Common labels:** `namespace`, `pod`, `interface` + +#### Example Queries + +Network throughput (bytes/sec) per pod: + +```promql +sum by (pod) ( + rate(container_network_receive_bytes_total{ + pod=~".*documentdb.*" + }[5m]) + + rate(container_network_transmit_bytes_total{ + pod=~".*documentdb.*" + }[5m]) +) +``` + +### Filesystem + +| Metric | Type | Description | +|--------|------|-------------| +| `container_fs_usage_bytes` | Gauge | Filesystem 
usage (bytes) | +| `container_fs_reads_bytes_total` | Counter | Filesystem read bytes | +| `container_fs_writes_bytes_total` | Counter | Filesystem write bytes | + +**Common labels:** `namespace`, `pod`, `container`, `device` + +#### Example Queries + +Disk I/O rate for the postgres container: + +```promql +rate(container_fs_writes_bytes_total{ + container="postgres", + pod=~".*documentdb.*" +}[5m]) +``` + +## Operator Metrics (controller-runtime) + +The DocumentDB operator binary exposes standard controller-runtime metrics on its metrics endpoint. These track reconciliation performance and work queue health. + +### Reconciliation + +| Metric | Type | Description | +|--------|------|-------------| +| `controller_runtime_reconcile_total` | Counter | Total reconciliations | +| `controller_runtime_reconcile_errors_total` | Counter | Total reconciliation errors | +| `controller_runtime_reconcile_time_seconds` | Histogram | Time spent in reconciliation | + +**Common labels:** `controller` (e.g., `documentdb-controller`, `backup`, `scheduledbackup`, `certificate-controller`, `persistentvolume`), `result` (`success`, `error`, `requeue`, `requeue_after`) + +#### Example Queries + +Reconciliation error rate by controller: + +```promql +sum by (controller) ( + rate(controller_runtime_reconcile_errors_total[5m]) +) +``` + +P95 reconciliation latency for the DocumentDB controller: + +```promql +histogram_quantile(0.95, + sum by (le) ( + rate(controller_runtime_reconcile_time_seconds_bucket{ + controller="documentdb-controller" + }[5m]) + ) +) +``` + +Reconciliation throughput (reconciles/sec): + +```promql +sum by (controller) ( + rate(controller_runtime_reconcile_total[5m]) +) +``` + +### Work Queue + +| Metric | Type | Description | +|--------|------|-------------| +| `workqueue_depth` | Gauge | Current number of items in the queue | +| `workqueue_adds_total` | Counter | Total items added | +| `workqueue_queue_duration_seconds` | Histogram | Time items spend in queue before 
processing | +| `workqueue_work_duration_seconds` | Histogram | Time spent processing items | +| `workqueue_retries_total` | Counter | Total retries | + +**Common labels:** `name` (queue name, maps to controller name) + +#### Example Queries + +Work queue depth by controller: + +```promql +workqueue_depth{name=~"documentdb-controller|backup|scheduledbackup|certificate-controller"} +``` + +Average time items spend waiting in queue: + +```promql +rate(workqueue_queue_duration_seconds_sum{name="documentdb-controller"}[5m]) +/ rate(workqueue_queue_duration_seconds_count{name="documentdb-controller"}[5m]) +``` + +## CNPG / PostgreSQL Metrics + +CloudNative-PG exposes PostgreSQL-level metrics from each managed pod. These are available when CNPG monitoring is enabled. For the full list, see the [CloudNative-PG monitoring docs](https://cloudnative-pg.io/documentation/current/monitoring/). + +### Replication + +| Metric | Type | Description | +|--------|------|-------------| +| `cnpg_pg_replication_lag` | Gauge | Replication lag in seconds | +| `cnpg_pg_replication_streaming_replicas` | Gauge | Number of streaming replicas | + +#### Example Queries + +Replication lag per pod: + +```promql +cnpg_pg_replication_lag{pod=~".*documentdb.*"} +``` + +### Connections + +| Metric | Type | Description | +|--------|------|-------------| +| `cnpg_pg_stat_activity_count` | Gauge | Active backend connections by state | + +#### Example Queries + +Active connections by state: + +```promql +sum by (state) ( + cnpg_pg_stat_activity_count{pod=~".*documentdb.*"} +) +``` + +### Storage + +| Metric | Type | Description | +|--------|------|-------------| +| `cnpg_pg_database_size_bytes` | Gauge | Total database size | +| `cnpg_pg_stat_bgwriter_buffers_checkpoint` | Counter | Buffers written during checkpoints | + +#### Example Queries + +Database size in GiB: + +```promql +cnpg_pg_database_size_bytes{pod=~".*documentdb.*"} / 1024 / 1024 / 1024 +``` + +### Cluster Health + +| Metric | Type | 
Description | +|--------|------|-------------| +| `cnpg_collector_up` | Gauge | 1 if the CNPG metrics collector is running | +| `cnpg_pg_postmaster_start_time` | Gauge | PostgreSQL start timestamp | + +#### Example Queries + +Detect pods where the metrics collector is down: + +```promql +cnpg_collector_up{pod=~".*documentdb.*"} == 0 +``` + +## Gateway Metrics (Future) + +The DocumentDB Gateway does not currently expose application-level metrics. When implemented, expect metrics like: + +| Metric | Type | Description | +|--------|------|-------------| +| `documentdb_gateway_requests_total` | Counter | Total API requests (labels: `method`, `status`) | +| `documentdb_gateway_request_duration_seconds` | Histogram | Request latency | +| `documentdb_gateway_active_connections` | Gauge | Current connection count | +| `documentdb_gateway_read_operations_total` | Counter | Read operations (labels: `database`, `collection`) | +| `documentdb_gateway_write_operations_total` | Counter | Write operations (labels: `database`, `collection`) | +| `documentdb_gateway_errors_total` | Counter | Error count (labels: `error_type`, `operation`) | + +These will be collected via Prometheus scraping (`/metrics` endpoint) or OTLP push. See the [telemetry design document](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/telemetry/telemetry-design.md) for the planned implementation. 
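Once these land, dashboards and alerts can be built with standard counter-rate and histogram queries. As an illustrative sketch only (the metric and label names below are the proposed ones from the table above and are **not yet implemented**), P95 request latency per method would look like:

```promql
histogram_quantile(0.95,
  sum by (le, method) (
    rate(documentdb_gateway_request_duration_seconds_bucket[5m])
  )
)
```
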
+ +## OpenTelemetry Metric Names + +When using the OpenTelemetry `kubeletstats` receiver, metric names use the OpenTelemetry naming convention instead of Prometheus-style names: + +| OpenTelemetry Name | Prometheus Equivalent | +|---|---| +| `k8s.container.cpu.time` | `container_cpu_usage_seconds_total` | +| `k8s.container.memory.usage` | `container_memory_working_set_bytes` | +| `k8s.container.cpu.limit` | `container_spec_cpu_quota` | +| `k8s.container.memory.limit` | `container_spec_memory_limit_bytes` | +| `k8s.pod.network.io` | `container_network_*_bytes_total` | + +When writing queries, use the naming convention matching your collection method. The telemetry playground uses the OpenTelemetry names; a direct Prometheus scrape of cAdvisor uses Prometheus names. diff --git a/docs/operator-public-documentation/preview/monitoring/overview.md b/docs/operator-public-documentation/preview/monitoring/overview.md new file mode 100644 index 00000000..6a67361b --- /dev/null +++ b/docs/operator-public-documentation/preview/monitoring/overview.md @@ -0,0 +1,211 @@ +# Monitoring Overview + +This guide describes how to monitor DocumentDB clusters running on Kubernetes using OpenTelemetry, Prometheus, and Grafana. 
+ +## Prerequisites + +- A running Kubernetes cluster with the DocumentDB operator installed +- [Helm 3](https://helm.sh/docs/intro/install/) for deploying Prometheus and Grafana +- [kubectl](https://kubernetes.io/docs/tasks/tools/) configured for your cluster +- [`jq`](https://jqlang.github.io/jq/) for processing JSON in verification commands +- (Optional) [OpenTelemetry Operator](https://opentelemetry.io/docs/kubernetes/operator/) for managed collector deployments + +## Architecture + +A DocumentDB pod contains two containers: + +- **PostgreSQL container** — the DocumentDB engine (PostgreSQL with DocumentDB extensions) +- **Gateway container** — MongoDB-compatible API sidecar + +The recommended monitoring stack collects infrastructure metrics from these containers and stores them for visualization and alerting. + +``` +┌──────────────────────────────────────────────────────┐ +│ Grafana │ +│ (dashboards & alerts) │ +└──────────────────┬───────────────────────────────────┘ + │ +┌──────────────────┴───────────────────────────────────┐ +│ Prometheus │ +│ (metrics storage) │ +└──────────────────┬───────────────────────────────────┘ + │ remote write +┌──────────────────┴───────────────────────────────────┐ +│ OpenTelemetry Collector │ +│ Receivers: kubeletstats, k8s_cluster, prometheus │ +│ Processors: resource detection, attribute enrichment │ +│ Exporters: prometheusremotewrite │ +└──────────────────┬───────────────────────────────────┘ + │ scrape +┌──────────────────┴───────────────────────────────────┐ +│ DocumentDB Pods │ +│ ┌──────────────┐ ┌──────────────┐ │ +│ │ PostgreSQL │ │ Gateway │ │ +│ │ container │ │ container │ │ +│ └──────────────┘ └──────────────┘ │ +└──────────────────────────────────────────────────────┘ +``` + +### Collector deployment modes + +The [telemetry design document](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/telemetry/telemetry-design.md) recommends the OpenTelemetry Collector as a 
**DaemonSet** (one collector per node) for single-tenant clusters. This provides: + +- Lower resource overhead — one collector per node instead of one per pod +- Node-level metrics visibility (CPU, memory, filesystem) +- Simpler configuration and management + +The [telemetry playground](https://github.com/documentdb/documentdb-kubernetes-operator/tree/main/documentdb-playground/telemetry) implements a **Deployment** (one collector per namespace) instead, which is better suited for multi-tenant setups requiring per-namespace metric isolation. Choose the mode that fits your isolation requirements. + +## Prometheus Integration + +### Operator Metrics + +The DocumentDB operator exposes a metrics endpoint via controller-runtime. By default: + +- **Bind address**: controlled by `--metrics-bind-address` (default `0`, disabled) +- **Secure mode**: `--metrics-secure=true` serves via HTTPS with authn/authz +- **Certificates**: supply `--metrics-cert-path` for custom TLS, otherwise self-signed certs are generated + +To enable metrics scraping, set the bind address in the operator deployment (for example, `:8443` for HTTPS or `:8080` for HTTP). + +### CNPG Cluster Metrics + +The underlying CloudNative-PG cluster exposes PostgreSQL metrics on each pod. These are collected by the OpenTelemetry Collector's Prometheus receiver via Kubernetes service discovery. Key metric sources: + +| Source | Method | Metrics | +|--------|--------|---------| +| kubelet/cAdvisor | `kubeletstats` receiver | Container CPU, memory, network, filesystem | +| Kubernetes API | `k8s_cluster` receiver | Pod status, restart counts, resource requests/limits | +| Application endpoints | `prometheus` receiver | Custom application metrics (when available) | + +### ServiceMonitor / PodMonitor + +The operator does not ship a metrics `Service` or `ServiceMonitor` by default. 
If you use the Prometheus Operator and want to scrape controller-runtime metrics, create a `Service` and `ServiceMonitor` matching your deployment. For example, with a Helm release named `documentdb`: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: documentdb-operator-metrics + namespace: documentdb-operator + labels: + app: documentdb +spec: + selector: + app: documentdb # must match your Helm release name + ports: + - name: metrics + port: 8443 + targetPort: 8443 +--- +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: documentdb-operator + namespace: documentdb-operator +spec: + selector: + matchLabels: + app: documentdb # must match the Service labels above + endpoints: + - port: metrics + scheme: https + tlsConfig: + insecureSkipVerify: true # use a proper CA bundle in production +``` + +!!! note + Adjust the `app` label to match your Helm release name. The operator must be started with `--metrics-bind-address=:8443` for the endpoint to be available. 
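The collector pipeline described in the Architecture section can be sketched as an `OpenTelemetryCollector` resource managed by the OpenTelemetry Operator. This is a minimal illustration, not a production configuration: the resource name, namespace, and remote-write endpoint are assumptions to adjust for your environment, and TLS verification against the kubelet is disabled for brevity.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: documentdb-metrics   # illustrative name
  namespace: monitoring      # adjust to your monitoring namespace
spec:
  mode: daemonset            # one collector per node, as recommended for single-tenant clusters
  env:
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  config:
    receivers:
      kubeletstats:
        collection_interval: 30s
        auth_type: serviceAccount
        endpoint: https://${env:K8S_NODE_NAME}:10250
        insecure_skip_verify: true   # verify the kubelet certificate in production
    processors:
      batch: {}
    exporters:
      prometheusremotewrite:
        # assumed Prometheus remote-write endpoint; replace with yours
        endpoint: http://prometheus-server.monitoring.svc:9090/api/v1/write
    service:
      pipelines:
        metrics:
          receivers: [kubeletstats]
          processors: [batch]
          exporters: [prometheusremotewrite]
```

The collector's service account additionally needs RBAC permission to read `nodes/stats` and `nodes/proxy` for the `kubeletstats` receiver to scrape the kubelet.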

## Key Metrics

### Container Resource Metrics

| Metric | Description | Container |
|--------|-------------|-----------|
| `container_cpu_usage_seconds_total` | Cumulative CPU time consumed | postgres, documentdb-gateway |
| `container_memory_working_set_bytes` | Current memory usage | postgres, documentdb-gateway |
| `container_spec_memory_limit_bytes` | Memory limit | postgres, documentdb-gateway |
| `container_network_receive_bytes_total` | Network bytes received | pod-level |
| `container_fs_reads_bytes_total` | Filesystem read bytes | postgres |

### Controller-Runtime Metrics

| Metric | Description |
|--------|-------------|
| `controller_runtime_reconcile_total` | Total reconciliations by controller and result |
| `controller_runtime_reconcile_errors_total` | Total reconciliation errors |
| `controller_runtime_reconcile_time_seconds` | Reconciliation duration histogram |
| `workqueue_depth` | Current depth of the work queue |
| `workqueue_adds_total` | Total items added to the work queue |

### CNPG / PostgreSQL Metrics

When CNPG monitoring is enabled, additional PostgreSQL-level metrics are available:

| Metric | Description |
|--------|-------------|
| `cnpg_collector_up` | Whether the CNPG metrics collector is running |
| `cnpg_pg_replication_lag` | Replication lag in seconds |
| `cnpg_pg_stat_activity_count` | Number of active connections |
| `cnpg_pg_database_size_bytes` | Database size |

For the full CNPG metrics reference, see the [CloudNative-PG monitoring documentation](https://cloudnative-pg.io/documentation/current/monitoring/). 
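These key metrics translate directly into alert rules. If you run the Prometheus Operator, a `PrometheusRule` along these lines is a reasonable starting point; the rule names, namespace, and thresholds below are placeholders to tune against your own SLOs, not recommended values:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: documentdb-alerts    # illustrative name
  namespace: monitoring      # adjust to your monitoring namespace
spec:
  groups:
    - name: documentdb
      rules:
        # Streaming replica falling behind the primary
        - alert: DocumentDBReplicationLagHigh
          expr: cnpg_pg_replication_lag{pod=~".*documentdb.*"} > 30
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Replication lag above 30s on {{ $labels.pod }}"
        # Sustained reconcile failures in the operator
        - alert: DocumentDBReconcileErrors
          expr: sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m])) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Reconciliation errors in controller {{ $labels.controller }}"
```
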
+ +## Telemetry Playground + +The [`documentdb-playground/telemetry/`](https://github.com/documentdb/documentdb-kubernetes-operator/tree/main/documentdb-playground/telemetry) directory contains a complete reference implementation with: + +- Multi-tenant namespace isolation (separate Prometheus + Grafana per team) +- OpenTelemetry Collector configurations for cAdvisor metric scraping +- Automated Grafana dashboard provisioning scripts +- AKS cluster setup with the OpenTelemetry Operator + +Run the quickstart: + +```bash +cd documentdb-playground/telemetry/scripts/ + +# One-time infrastructure setup +./create-cluster.sh --install-all + +# Deploy multi-tenant DocumentDB + monitoring +./deploy-multi-tenant-telemetry.sh + +# Create Grafana dashboards +./setup-grafana-dashboards.sh sales-namespace + +# Access Grafana +kubectl port-forward -n sales-namespace svc/grafana-sales 3001:3000 & +# Open http://localhost:3001 (playground default: admin / admin123 — change in production) +``` + +See the [telemetry design document](https://github.com/documentdb/documentdb-kubernetes-operator/blob/main/documentdb-playground/telemetry/telemetry-design.md) for the full architecture rationale including DaemonSet vs. sidecar trade-offs, OTLP receiver plans, and future application-level metrics. 
+ +## Verification + +After deploying the monitoring stack, confirm that metrics are flowing: + +```bash +# Check that the OpenTelemetry Collector pods are running +kubectl get pods -l app.kubernetes.io/name=opentelemetry-collector + +# Verify Prometheus is receiving metrics (port-forward first) +kubectl port-forward svc/prometheus-server 9090:80 & +curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result | length' + +# Confirm DocumentDB container metrics are present +curl -s 'http://localhost:9090/api/v1/query?query=container_cpu_usage_seconds_total{pod=~".*documentdb.*"}' \ + | jq '.data.result | length' +``` + +If no metrics appear, check: + +- The collector's service account has RBAC access to the kubelet metrics API +- Namespace label filters in the collector config match your DocumentDB namespace +- The Prometheus remote-write endpoint is reachable from the collector + +## Next Steps + +- [Metrics Reference](metrics.md) — detailed metric descriptions and PromQL query examples +- [CloudNative-PG Monitoring](https://cloudnative-pg.io/documentation/current/monitoring/) — upstream PostgreSQL metrics diff --git a/mkdocs.yml b/mkdocs.yml index efcb23f1..6dcd7e3f 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -10,6 +10,9 @@ nav: - Get Started: preview/index.md - Advanced Configuration: preview/advanced-configuration/README.md - Backup and Restore: preview/backup-and-restore.md + - Monitoring: + - Overview: preview/monitoring/overview.md + - Metrics Reference: preview/monitoring/metrics.md - FAQ: preview/faq.md - Tools: - Kubectl Plugin: preview/kubectl-plugin.md