docs: Add high availability documentation for local HA configuration #294

Open
hossain-rayhan wants to merge 1 commit into documentdb:main from hossain-rayhan:rayhan/docs-local-ha

@hossain-rayhan
Collaborator

Summary

Adds comprehensive high availability documentation covering local HA concepts, configuration, and automatic failover behavior.

Changes

New Files

  • docs/operator-public-documentation/preview/high-availability/overview.md - HA concepts overview
  • docs/operator-public-documentation/preview/high-availability/local-ha.md - Local HA configuration guide

Updated Files

  • mkdocs.yml - Added High Availability navigation section and enabled Mermaid diagram support

Documentation Highlights

HA Overview:

  • Explains three HA types: Local HA, Multi-Region, Multi-Cloud
  • RTO/RPO concepts and trade-offs
  • Decision tree for selecting appropriate HA strategy
  • Architecture diagrams using Mermaid

Local HA Guide:

  • Instance configuration (instancesPerNode: 1-3)
  • Pod anti-affinity setup (zone-level and node-level distribution)
  • Automatic failover process with CNPG timing parameters
  • Documents configurable vs default parameters (stopDelay is currently the only configurable timeout)
  • Testing procedures and troubleshooting

Notes

  • Failover timing parameters are inherited from CloudNative-PG
  • Manual failover for local HA is not currently supported via kubectl plugin (plugin is for multi-cluster promotion only)
  • Connection string can be retrieved from kubectl get documentdb <name> -o jsonpath='{.status.connectionString}'

Copilot AI review requested due to automatic review settings March 6, 2026 20:40
Contributor

Copilot AI left a comment

Pull request overview

Adds new High Availability (HA) documentation to the public “preview” docs set, and updates MkDocs configuration/navigation to publish the new section (including Mermaid diagram rendering).

Changes:

  • Adds an HA overview page describing HA options and decision guidance.
  • Adds a local HA configuration guide (instances, anti-affinity, failover behavior, testing/troubleshooting).
  • Updates mkdocs.yml to include HA navigation entries and Mermaid-related Markdown configuration.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.

| File | Description |
|------|-------------|
| mkdocs.yml | Adds HA nav entries and enables Mermaid/tabbed content features for the docs site. |
| docs/operator-public-documentation/preview/high-availability/overview.md | Introduces HA overview content and diagrams for local/multi-region/multi-cloud concepts. |
| docs/operator-public-documentation/preview/high-availability/local-ha.md | Documents local HA configuration, failover timeline, and operational testing/troubleshooting steps. |

Comment on lines +176 to +178
!!! note "Zero Data Loss"
    Because replication is synchronous, a committed write exists on at least one replica before acknowledgment. Failover promotes a replica with all committed data.

Copilot AI Mar 6, 2026

The "Zero Data Loss" note assumes synchronous replication, but the operator doesn’t configure synchronous replication for single-cluster deployments in the generated CNPG Cluster spec. This should be updated to avoid overstating durability guarantees (or updated to describe the exact configuration that enforces synchronous/quorum writes, if applicable).
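
For reference, here is a minimal sketch of what enforcing quorum-based synchronous replication looks like on a plain CloudNative-PG `Cluster` resource. It uses CNPG's own API, not the DocumentDB CRD. Whether the DocumentDB operator can pass such settings through is not established anywhere in this PR, and the resource name and storage size below are placeholders.

```yaml
# Illustrative CloudNative-PG Cluster; NOT what the DocumentDB operator generates today.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-sync-cluster   # hypothetical name
spec:
  instances: 3
  # Quorum-based synchronous replication: while enough standbys are
  # available, a commit is acknowledged only after at least one of
  # them has confirmed it.
  minSyncReplicas: 1
  maxSyncReplicas: 1
  storage:
    size: 10Gi                 # placeholder size
```

Without settings of this kind, replicas stream changes asynchronously, so a failover can lose the most recent commits. Newer CNPG releases also offer a `.spec.postgresql.synchronous` stanza for the same purpose; which form applies depends on the CNPG version the operator bundles.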

**Solution**:
```bash
# Check operator logs
kubectl logs -n documentdb-operator -l app.kubernetes.io/name=documentdb-operator --tail=100
```

Copilot AI Mar 6, 2026

The operator Helm chart’s pod template labels the operator pods with app: <release-name> (it does not include app.kubernetes.io/name on the pod template). This kubectl logs selector likely won’t match any pods. Prefer kubectl logs deployment/documentdb-operator -n <ns> or a selector that matches the actual pod labels (e.g., -l app=<helm-release-name>).

Suggested change
- kubectl logs -n documentdb-operator -l app.kubernetes.io/name=documentdb-operator --tail=100
+ kubectl logs deployment/documentdb-operator -n documentdb-operator --tail=100

| **Scope** | Single Kubernetes cluster |
| **Instances** | 1-3 instances (primary + replicas) |
| **Failover** | Automatic, typically < 30 seconds |
| **Data Loss** | Zero (synchronous replication) |

Copilot AI Mar 6, 2026

The overview states Local HA has "Zero (synchronous replication)" data loss / RPO 0, but the operator’s CNPG Cluster spec generation does not configure synchronous replication for the single-cluster case (it only sets instances). Unless synchronous replication is actually enforced elsewhere, this should be revised to reflect async replication/possible replication lag, or the docs should explicitly describe what configuration guarantees zero RPO.

Suggested change
- | **Data Loss** | Zero (synchronous replication) |
+ | **Data Loss** | Potentially minimal (async replication by default; RPO 0 requires synchronous replication configuration) |

Comment on lines +12 to +17
Local high availability (HA) deploys multiple DocumentDB instances within a single Kubernetes cluster, providing automatic failover and zero data loss during instance failures.

## Overview

Local HA uses synchronous replication between a primary instance and one or two replicas. When the primary fails, a replica is automatically promoted to primary.

Copilot AI Mar 6, 2026

This page claims local HA provides "zero data loss" and describes replication as synchronous. The operator’s CNPG Cluster spec for single-cluster deployments sets the instance count but does not configure synchronous replication, so this statement is likely inaccurate and could mislead users about durability guarantees. Please align the wording with the actual behavior (e.g., async replication with potential lag) or document the specific settings that guarantee zero RPO.

Comment on lines +92 to +103
```yaml title="documentdb-node-affinity.yaml"
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: my-documentdb
  namespace: documentdb
spec:
  instancesPerNode: 3
  affinity:
    enablePodAntiAffinity: true
    topologyKey: kubernetes.io/hostname # (1)!
```

Copilot AI Mar 6, 2026

This node anti-affinity example also omits required spec fields (e.g., spec.resource.storage.pvcSize). Like the zone example, either present it as a partial snippet to merge into an existing manifest, or include the required fields to make it directly runnable.

Comment on lines +128 to +132
| HA Type | Typical RPO |
|---------|-------------|
| Local HA | 0 (synchronous) |
| Multi-Region | Seconds (replication lag) |
| Multi-Cloud | Seconds (replication lag) |

Copilot AI Mar 6, 2026

The RPO table lists Local HA as 0 (synchronous), but the operator’s default single-cluster CNPG Cluster spec does not set synchronous replication configuration. This row should be updated to avoid promising a zero RPO unless that guarantee is actually enforced.

Comment on lines +152 to +153
G --> K[See Multi-Cloud Guide]
H --> L[See Multi-Region Guide]

Copilot AI Mar 6, 2026

The decision tree points to "See Multi-Cloud Guide" / "See Multi-Region Guide" but doesn’t provide links (and these guides aren’t present in the preview docs nav). Consider replacing these nodes with actual links to existing guidance (e.g., the multi-cloud deployment README in documentdb-playground, and the AKS Fleet multi-region guide/scripts) or mark them explicitly as “coming soon” to avoid a dead-end.

Suggested change
- G --> K[See Multi-Cloud Guide]
- H --> L[See Multi-Region Guide]
+ G --> K["See Multi-Cloud Guide (coming soon)"]
+ H --> L["See Multi-Region Guide (coming soon)"]

default: latest

extra_javascript:
- https://unpkg.com/mermaid@10/dist/mermaid.min.js

Copilot AI Mar 6, 2026

Loading Mermaid from a third-party CDN (unpkg.com) introduces a supply-chain risk and can break doc builds in restricted/offline environments. Consider vendoring Mermaid into the docs assets (or pinning an exact patch version with integrity checks) and referencing the local file instead of a remote URL.

Suggested change
- - https://unpkg.com/mermaid@10/dist/mermaid.min.js
+ - js/mermaid.min.js

Comment on lines +47 to +49
  storage:
    size: 10Gi
    storageClassName: managed-csi

Copilot AI Mar 6, 2026

The manifest example uses spec.storage.size and spec.storage.storageClassName, but the DocumentDB API uses spec.resource.storage.pvcSize and spec.resource.storage.storageClass. As written, this example won’t pass CRD validation / won’t configure storage as intended.

Suggested change
-   storage:
-     size: 10Gi
-     storageClassName: managed-csi
+   resource:
+     storage:
+       pvcSize: 10Gi
+       storageClass: managed-csi

Comment on lines +73 to +84
```yaml title="documentdb-zone-affinity.yaml"
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: my-documentdb
  namespace: documentdb
spec:
  instancesPerNode: 3
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone # (1)!
```

Copilot AI Mar 6, 2026

This zone anti-affinity example omits required spec fields (notably spec.resource.storage.pvcSize, and any other required fields). If it’s intended as a patch snippet, it should say so explicitly; otherwise include a complete minimal DocumentDB spec so users can apply it without validation errors.

@xgerman
Collaborator

xgerman commented Mar 9, 2026

My Agent says:
PR #294 Review: "docs: Add high availability documentation for local HA configuration"

Author: Rayhan Hossain | Files changed: 3 (+477 lines)


Summary

Adds two new documentation pages (overview.md and local-ha.md) under a new "High Availability" section, plus mkdocs.yml updates for
mermaid diagrams, tabbed content, and new nav entries. The content is well-structured with decision trees, trade-off tables, failover
timelines, and troubleshooting guides.


🔴 Critical

  1. Incorrect YAML spec structure in local-ha.md

All three YAML examples use a storage field directly under spec, but the actual API uses resource.storage with different field names:

❌ In the docs (local-ha.md lines ~37-45):

spec:
  instancesPerNode: 3
  storage:
    size: 10Gi
    storageClassName: managed-csi

✅ Correct per API (documentdb_types.go):

spec:
  nodeCount: 1
  instancesPerNode: 3
  resource:
    storage:
      pvcSize: 10Gi
      storageClass: managed-csi

Issues:

  • storage should be nested under resource
  • size → pvcSize
  • storageClassName → storageClass
  • nodeCount is a required field (no omitempty) but is missing from all examples
  • exposeViaService is also missing (compare with documentdb-cluster.yaml at repo root)

This affects the first YAML example and both affinity YAML examples. Users copy-pasting these will get validation errors.
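
Putting the corrections above together with the anti-affinity settings from the PR's examples, a corrected manifest might look like the sketch below. Field names are taken from the review comments and the PR's own snippets. The manifest may still lack other required fields, so validate it against the CRD before relying on it.

```yaml
# Sketch of a corrected manifest; field names follow the review comments above.
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: my-documentdb
  namespace: documentdb
spec:
  nodeCount: 1                 # required field per the review
  instancesPerNode: 3          # one primary plus two replicas
  resource:
    storage:
      pvcSize: 10Gi
      storageClass: managed-csi
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone   # or kubernetes.io/hostname for node-level spread
  # exposeViaService is intentionally left out: its shape is not spelled out
  # in this review. See documentdb-cluster.yaml at the repo root.
```

The two affinity examples would then differ only in `topologyKey`, which avoids maintaining two near-identical manifests in the docs.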


🟠 Major

  1. Incorrect pod label selector in testing section

The "Test 1: Verify Instance Distribution" section uses:

kubectl get pods -n documentdb -l documentdb.io/cluster=my-documentdb

But pods are managed by CNPG and carry the label cnpg.io/cluster, not documentdb.io/cluster. The documentdb.io/cluster label is only
applied to PVs. The correct selector should be:

kubectl get pods -n documentdb -l cnpg.io/cluster=my-documentdb-node-1

(Note: the CNPG cluster name is derived from the DocumentDB name, typically {name}-node-{n}.)

  2. Existing docs not cross-referenced to new HA pages

The FAQ (faq.md) links to advanced-configuration/README.md#high-availability, and the advanced config page has a minimal HA section.
These should be updated to link to (or redirect to) the new comprehensive HA documentation. Otherwise users find the shallow version
first and miss the detailed docs.


🟡 Minor

  1. External CDN dependency for Mermaid JS

extra_javascript:
- https://unpkg.com/mermaid@10/dist/mermaid.min.js

This pulls mermaid from unpkg.com CDN at runtime. If the CDN is unavailable, diagrams won't render. Consider pinning to an exact version
(e.g., mermaid@10.9.1) for reproducibility. Also, the pymdownx.superfences mermaid fence already handles rendering with Material theme's
built-in mermaid support — the external JS may be redundant and could cause conflicts.

  2. Overview "Next Steps" section is sparse

The overview mentions multi-region and multi-cloud deployment types with detailed tables, but "Next Steps" only links to Local HA.
Consider adding placeholder links or "coming soon" notes for multi-region and multi-cloud guides to set expectations.

  3. failoverDelay documented as "Configurable: No"

The table says failoverDelay is not configurable, but since it defaults to 0 (instant failover), users may eventually want to tune it. A
brief note explaining why it's not exposed (e.g., "instant failover is preferred for local HA") would be helpful.
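
On the Mermaid item above, here is a sketch of an mkdocs.yml fragment that relies on the Material theme's built-in Mermaid handling through a superfences custom fence, with the external script kept only as a commented-out, version-pinned fallback. It assumes the site uses Material for MkDocs, as the review implies. The exact version shown is the one suggested above, not one verified against this repo.

```yaml
# mkdocs.yml fragment (sketch; assumes the Material for MkDocs theme)
markdown_extensions:
  - pymdownx.superfences:
      custom_fences:
        # Render ```mermaid fences as diagrams via the theme's built-in support.
        - name: mermaid
          class: mermaid
          format: !!python/name:pymdownx.superfences.fence_code_format

# Only if the external script turns out to be necessary, pin an exact version:
# extra_javascript:
#   - https://unpkg.com/mermaid@10.9.1/dist/mermaid.min.js
```

If the diagrams render correctly with the custom fence alone, dropping `extra_javascript` removes both the CDN dependency and the risk of two Mermaid initializations conflicting.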


🟢 Nitpick

  1. The mermaid flowchart in overview.md using ~~~ spacers is a mermaid-specific trick that may not render consistently across all
    mermaid versions.

  2. In the failover sequence diagram, "Wait failoverDelay (default: 0s)" followed immediately by "Mark TargetPrimary pending" could be
    simplified since the delay is 0 — showing it may confuse readers into thinking there's an actual wait.

  3. Consider adding a note to the overview page about the relationship between DocumentDB HA and CloudNative-PG, since CNPG is the
    underlying mechanism. Users debugging issues may need to know to look at CNPG resources.


✅ What looks good

  • Clear information architecture with overview → specific configuration pages
  • Mermaid diagrams (decision tree, architecture, failover timeline) are excellent
  • RTO/RPO tables with concrete numbers
  • Troubleshooting section with symptoms/causes/solutions format
  • Proper use of MkDocs admonitions (!!! tip, !!! warning, !!! danger)
  • Annotated code blocks (# (1)!) for inline explanations
  • mkdocs.yml extensions are well-chosen for the content needs
