docs: Add high availability documentation for local HA configuration #294

hossain-rayhan wants to merge 1 commit into documentdb:main from
## Conversation
Pull request overview
Adds new High Availability (HA) documentation to the public “preview” docs set, and updates MkDocs configuration/navigation to publish the new section (including Mermaid diagram rendering).
Changes:
- Adds an HA overview page describing HA options and decision guidance.
- Adds a local HA configuration guide (instances, anti-affinity, failover behavior, testing/troubleshooting).
- Updates `mkdocs.yml` to include HA navigation entries and Mermaid-related Markdown configuration.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| mkdocs.yml | Adds HA nav entries and enables Mermaid/tabbed content features for the docs site. |
| docs/operator-public-documentation/preview/high-availability/overview.md | Introduces HA overview content and diagrams for local/multi-region/multi-cloud concepts. |
| docs/operator-public-documentation/preview/high-availability/local-ha.md | Documents local HA configuration, failover timeline, and operational testing/troubleshooting steps. |
> !!! note "Zero Data Loss"
>     Because replication is synchronous, a committed write exists on at least one replica before acknowledgment. Failover promotes a replica with all committed data.
The "Zero Data Loss" note assumes synchronous replication, but the operator doesn’t configure synchronous replication for single-cluster deployments in the generated CNPG Cluster spec. This should be updated to avoid overstating durability guarantees (or updated to describe the exact configuration that enforces synchronous/quorum writes, if applicable).
> **Solution**:
> ```bash
> # Check operator logs
> kubectl logs -n documentdb-operator -l app.kubernetes.io/name=documentdb-operator --tail=100
> ```
The operator Helm chart’s pod template labels the operator pods with `app: <release-name>` (it does not include `app.kubernetes.io/name` on the pod template). This `kubectl logs` selector likely won’t match any pods. Prefer `kubectl logs deployment/documentdb-operator -n <ns>` or a selector that matches the actual pod labels (e.g., `-l app=<helm-release-name>`).
Suggested change:

```diff
- kubectl logs -n documentdb-operator -l app.kubernetes.io/name=documentdb-operator --tail=100
+ kubectl logs deployment/documentdb-operator -n documentdb-operator --tail=100
```
> | **Scope** | Single Kubernetes cluster |
> | **Instances** | 1-3 instances (primary + replicas) |
> | **Failover** | Automatic, typically < 30 seconds |
> | **Data Loss** | Zero (synchronous replication) |
The overview states Local HA has "Zero (synchronous replication)" data loss / RPO 0, but the operator’s CNPG Cluster spec generation does not configure synchronous replication for the single-cluster case (it only sets instances). Unless synchronous replication is actually enforced elsewhere, this should be revised to reflect async replication/possible replication lag, or the docs should explicitly describe what configuration guarantees zero RPO.
Suggested change:

```diff
- | **Data Loss** | Zero (synchronous replication) |
+ | **Data Loss** | Potentially minimal (async replication by default; RPO 0 requires synchronous replication configuration) |
```
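For context, if zero RPO is actually desired, CNPG does support quorum-based synchronous replication on the generated Cluster. A sketch of what that stanza looks like, assuming CNPG >= 1.24 and that the operator exposes (or can be patched to set) this field — neither of which this PR confirms:

```yaml
# Hypothetical CNPG Cluster fragment -- assumes the operator allows
# overriding the generated Cluster spec, which is not confirmed here
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-documentdb-node-1
spec:
  instances: 3
  postgresql:
    synchronous:
      method: any   # quorum-based synchronous replication
      number: 1     # at least one replica must confirm each commit
```

Only with a configuration like this (or an equivalent guarantee) would the "Zero Data Loss" wording be accurate.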
> Local high availability (HA) deploys multiple DocumentDB instances within a single Kubernetes cluster, providing automatic failover and zero data loss during instance failures.
>
> ## Overview
>
> Local HA uses synchronous replication between a primary instance and one or two replicas. When the primary fails, a replica is automatically promoted to primary.
This page claims local HA provides "zero data loss" and describes replication as synchronous. The operator’s CNPG Cluster spec for single-cluster deployments sets the instance count but does not configure synchronous replication, so this statement is likely inaccurate and could mislead users about durability guarantees. Please align the wording with the actual behavior (e.g., async replication with potential lag) or document the specific settings that guarantee zero RPO.
```yaml title="documentdb-node-affinity.yaml"
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: my-documentdb
  namespace: documentdb
spec:
  instancesPerNode: 3
  affinity:
    enablePodAntiAffinity: true
    topologyKey: kubernetes.io/hostname # (1)!
```
This node anti-affinity example also omits required spec fields (e.g., spec.resource.storage.pvcSize). Like the zone example, either present it as a partial snippet to merge into an existing manifest, or include the required fields to make it directly runnable.
> | HA Type | Typical RPO |
> |---------|-------------|
> | Local HA | 0 (synchronous) |
> | Multi-Region | Seconds (replication lag) |
> | Multi-Cloud | Seconds (replication lag) |
The RPO table lists Local HA as 0 (synchronous), but the operator’s default single-cluster CNPG Cluster spec does not set synchronous replication configuration. This row should be updated to avoid promising a zero RPO unless that guarantee is actually enforced.
> G --> K[See Multi-Cloud Guide]
> H --> L[See Multi-Region Guide]
The decision tree points to "See Multi-Cloud Guide" / "See Multi-Region Guide" but doesn’t provide links (and these guides aren’t present in the preview docs nav). Consider replacing these nodes with actual links to existing guidance (e.g., the multi-cloud deployment README in documentdb-playground, and the AKS Fleet multi-region guide/scripts) or mark them explicitly as “coming soon” to avoid a dead-end.
Suggested change:

```diff
- G --> K[See Multi-Cloud Guide]
- H --> L[See Multi-Region Guide]
+ G --> K[See Multi-Cloud Guide (coming soon)]
+ H --> L[See Multi-Region Guide (coming soon)]
```
> default: latest
>
> extra_javascript:
>   - https://unpkg.com/mermaid@10/dist/mermaid.min.js
Loading Mermaid from a third-party CDN (unpkg.com) introduces a supply-chain risk and can break doc builds in restricted/offline environments. Consider vendoring Mermaid into the docs assets (or pinning an exact patch version with integrity checks) and referencing the local file instead of a remote URL.
Suggested change:

```diff
- - https://unpkg.com/mermaid@10/dist/mermaid.min.js
+ - js/mermaid.min.js
```
> storage:
>   size: 10Gi
>   storageClassName: managed-csi
The manifest example uses `spec.storage.size` and `spec.storage.storageClassName`, but the DocumentDB API uses `spec.resource.storage.pvcSize` and `spec.resource.storage.storageClass`. As written, this example won’t pass CRD validation / won’t configure storage as intended.
Suggested change:

```diff
- storage:
-   size: 10Gi
-   storageClassName: managed-csi
+ resource:
+   storage:
+     pvcSize: 10Gi
+     storageClass: managed-csi
```
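Putting the suggestion together, a complete minimal manifest would read roughly as follows (a sketch assembled from the fields quoted in this review; any defaults beyond these are assumptions):

```yaml
# Sketch: minimal DocumentDB manifest with the corrected storage fields
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: my-documentdb
  namespace: documentdb
spec:
  instancesPerNode: 3
  resource:
    storage:
      pvcSize: 10Gi          # per review: pvcSize, not size
      storageClass: managed-csi  # per review: storageClass, not storageClassName
```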
```yaml title="documentdb-zone-affinity.yaml"
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: my-documentdb
  namespace: documentdb
spec:
  instancesPerNode: 3
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone # (1)!
```
This zone anti-affinity example omits required spec fields (notably spec.resource.storage.pvcSize, and any other required fields). If it’s intended as a patch snippet, it should say so explicitly; otherwise include a complete minimal DocumentDB spec so users can apply it without validation errors.
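A merged version of the zone example with the required storage fields might look like this (a sketch; it assumes `spec.resource.storage.pvcSize` is the only missing required field, which the operator's CRD would need to confirm):

```yaml
# Sketch: zone anti-affinity example made self-contained
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: my-documentdb
  namespace: documentdb
spec:
  instancesPerNode: 3
  resource:
    storage:
      pvcSize: 10Gi   # required field omitted from the original example
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone
```

Alternatively, the page could label the original snippet as a partial patch to merge into an existing manifest.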
My Agent says:

Author: Rayhan Hossain | Files changed: 3 (+477 lines)

**Summary**: Adds two new documentation pages (overview.md and local-ha.md) under a new "High Availability" section, plus mkdocs.yml updates for navigation and Mermaid support.

🔴 **Critical**

All three YAML examples use a `storage` field directly under `spec`, but the actual API uses `resource.storage` with different field names:

- ❌ In the docs (local-ha.md lines ~37-45): `spec.storage.size` / `spec.storage.storageClassName`
- ✅ Correct per API (documentdb_types.go): `spec.resource.storage.pvcSize` / `spec.resource.storage.storageClass`

This affects the first YAML example and both affinity YAML examples. Users copy-pasting these will get validation errors.

🟠 **Major**
The "Test 1: Verify Instance Distribution" section uses `kubectl get pods -n documentdb -l documentdb.io/cluster=my-documentdb`. But pods are managed by CNPG and carry the label `cnpg.io/cluster`, not `documentdb.io/cluster`; the `documentdb.io/cluster` label is only present on other resources, not the CNPG-managed pods. Use `kubectl get pods -n documentdb -l cnpg.io/cluster=my-documentdb-node-1` instead. (Note: the CNPG cluster name is derived from the DocumentDB name, typically `{name}-node-{n}`.)
The FAQ (faq.md) links to advanced-configuration/README.md#high-availability, and the advanced config page has only a minimal HA section.

🟡 **Minor**
`extra_javascript`: This pulls mermaid from the unpkg.com CDN at runtime. If the CDN is unavailable, diagrams won't render. Consider pinning to an exact version.
The overview mentions multi-region and multi-cloud deployment types with detailed tables, but "Next Steps" only links to Local HA.
The table says `failoverDelay` is not configurable, but since it defaults to 0 (instant failover), users may eventually want to tune it.

🟢 **Nitpick**
✅ **What looks good**
## Summary

Adds comprehensive high availability documentation covering local HA concepts, configuration, and automatic failover behavior.

## Changes

**New Files**

- `docs/operator-public-documentation/preview/high-availability/overview.md` - HA concepts overview
- `docs/operator-public-documentation/preview/high-availability/local-ha.md` - Local HA configuration guide

**Updated Files**

- `mkdocs.yml` - Added High Availability navigation section and enabled Mermaid diagram support

**Documentation Highlights**

HA Overview:

Local HA Guide:

- (`instancesPerNode`: 1-3)
- (`stopDelay` is currently the only configurable timeout)

**Notes**

`kubectl get documentdb <name> -o jsonpath='{.status.connectionString}'`