diff --git a/docs/operator-public-documentation/preview/high-availability/local-ha.md b/docs/operator-public-documentation/preview/high-availability/local-ha.md
new file mode 100644
index 00000000..dfee5f01
--- /dev/null
+++ b/docs/operator-public-documentation/preview/high-availability/local-ha.md
@@ -0,0 +1,289 @@
+---
+title: Local High Availability
+description: Configure local high availability for DocumentDB with multiple instances, pod anti-affinity, and automatic failover.
+tags:
+ - high-availability
+ - configuration
+ - failover
+---
+
+# Local High Availability
+
+Local high availability (HA) deploys multiple DocumentDB instances within a single Kubernetes cluster, providing automatic failover and zero data loss during instance failures.
+
+## Overview
+
+Local HA uses synchronous replication between a primary instance and one or two replicas. When the primary fails, a replica is automatically promoted to primary.
+
+```mermaid
+flowchart LR
+ subgraph zone1[Zone A]
+ P[Primary]
+ end
+ subgraph zone2[Zone B]
+ R1[Replica 1]
+ end
+ subgraph zone3[Zone C]
+ R2[Replica 2]
+ end
+
+ App([Application]) --> P
+ P -->|Sync Replication| R1
+ P -->|Sync Replication| R2
+```
+
+## Instance Configuration
+
+Configure the number of instances using the `instancesPerNode` field:
+
+```yaml title="documentdb-ha.yaml"
+apiVersion: documentdb.io/preview
+kind: DocumentDB
+metadata:
+ name: my-documentdb
+ namespace: documentdb
+spec:
+ instancesPerNode: 3 # (1)!
+ storage:
+ size: 10Gi
+ storageClassName: managed-csi
+```
+
+1. Valid values: `1` (no HA), `2` (primary + 1 replica), `3` (primary + 2 replicas, recommended for production)
+
+### Instance Count Options
+
+| Instances | Configuration | Use Case |
+|-----------|---------------|----------|
+| `1` | Single instance, no replicas | Development, testing |
+| `2` | Primary + 1 replica | Cost-sensitive production |
+| `3` | Primary + 2 replicas | **Recommended** for production |
+
+!!! tip "Why 3 instances?"
+ Three instances provide quorum-based failover. With 2 instances, the system cannot distinguish between a network partition and a failed primary. With 3 instances, the system can achieve consensus and safely promote a replica.
+
+## Pod Anti-Affinity
+
+Pod anti-affinity ensures DocumentDB instances are distributed across failure domains (nodes, zones) for resilience.
+
+### Zone-Level Distribution (Recommended)
+
+Distribute instances across availability zones:
+
+```yaml title="documentdb-zone-affinity.yaml"
+apiVersion: documentdb.io/preview
+kind: DocumentDB
+metadata:
+ name: my-documentdb
+ namespace: documentdb
+spec:
+ instancesPerNode: 3
+ affinity:
+ enablePodAntiAffinity: true
+ topologyKey: topology.kubernetes.io/zone # (1)!
+```
+
+1. Distributes pods across different availability zones. Requires a cluster with nodes in multiple zones.
+
+### Node-Level Distribution
+
+For clusters without multiple zones, distribute across nodes:
+
+```yaml title="documentdb-node-affinity.yaml"
+apiVersion: documentdb.io/preview
+kind: DocumentDB
+metadata:
+ name: my-documentdb
+ namespace: documentdb
+spec:
+ instancesPerNode: 3
+ affinity:
+ enablePodAntiAffinity: true
+ topologyKey: kubernetes.io/hostname # (1)!
+```
+
+1. Distributes pods across different nodes. Requires at least 3 nodes in the cluster.
+
+### Affinity Configuration Reference
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `enablePodAntiAffinity` | boolean | Enable/disable pod anti-affinity |
+| `topologyKey` | string | Kubernetes topology label for distribution |
+| `podAntiAffinityType` | string | `preferred` (default) or `required` |
+
+!!! warning "Required vs Preferred"
+ Using `required` anti-affinity prevents scheduling if constraints cannot be met. Use `preferred` (default) to allow scheduling even when ideal placement isn't possible.
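A minimal sketch combining the fields from the reference table above. Note that placing `podAntiAffinityType` directly under `spec.affinity` is an assumption based on that table; check the API reference for the exact schema:

```yaml title="documentdb-required-affinity.yaml"
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: my-documentdb
  namespace: documentdb
spec:
  instancesPerNode: 3
  affinity:
    enablePodAntiAffinity: true
    topologyKey: topology.kubernetes.io/zone
    podAntiAffinityType: required # hard constraint: pods stay Pending if zones are unavailable
```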
+
+## Automatic Failover
+
+DocumentDB uses CloudNative-PG's failover mechanism to automatically detect primary failure and promote a replica. No manual intervention is required for local HA failover.
+
+### Failover Timeline
+
+```mermaid
+sequenceDiagram
+ participant App as Application
+ participant P as Primary
+ participant R as Replica
+ participant Op as Operator
+
+ Note over P: Primary fails
+ App->>P: Connection fails
+ Op->>P: Readiness probe fails
+ Op->>Op: Wait failoverDelay (default: 0s)
+ Op->>P: Mark TargetPrimary pending
+ P->>P: Fast shutdown (up to 30s)
+ Op->>R: Leader election
+ R->>R: Promote to primary
+ Op->>App: Update service endpoint
+ App->>R: Reconnect to new primary
+ Note over R: New Primary
+```
+
+### Failover Timing Parameters
+
+DocumentDB inherits these timing controls from CloudNative-PG:
+
+| Parameter | Default | Configurable | Description |
+|-----------|---------|--------------|-------------|
+| `failoverDelay` | 0 seconds | No | Delay before initiating failover after detecting unhealthy primary |
+| `stopDelay` | 30 seconds | **Yes** | Time allowed for graceful PostgreSQL shutdown |
+| `switchoverDelay` | 3600 seconds | No | Time allowed for the primary to shut down gracefully during a planned switchover |
+| `livenessProbeTimeout` | 30 seconds | No | Time allowed for liveness probe response |
+
+!!! note "Current Configuration"
+ Currently, only `stopDelay` is configurable via `spec.timeouts.stopDelay`. Other parameters use CloudNative-PG default values. Additional timing parameters may be exposed in future releases.
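As a sketch of the `spec.timeouts.stopDelay` field mentioned in the note above (the value `60` here is illustrative, not a recommendation):

```yaml title="documentdb-stop-delay.yaml"
apiVersion: documentdb.io/preview
kind: DocumentDB
metadata:
  name: my-documentdb
  namespace: documentdb
spec:
  instancesPerNode: 3
  timeouts:
    stopDelay: 60 # seconds allowed for graceful PostgreSQL shutdown (default: 30)
```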
+
+### Failover Process
+
+The failover process occurs in two phases:
+
+**Phase 1: Primary Shutdown**
+
+1. Readiness probe detects the primary is unhealthy
+2. After `failoverDelay` (default: 0s), operator marks `TargetPrimary` as pending
+3. Primary pod initiates fast shutdown (up to `stopDelay` seconds)
+4. WAL receivers on replicas stop to prevent timeline discrepancies
+
+**Phase 2: Promotion**
+
+1. Leader election selects the most up-to-date replica
+2. Selected replica promotes to primary and begins accepting writes
+3. Kubernetes service endpoints update to point to new primary
+4. Former primary restarts as a replica when recovered
+
+!!! note "Zero Data Loss"
+ Because replication is synchronous, a committed write exists on at least one replica before acknowledgment. Failover promotes a replica with all committed data.
+
+### RTO and RPO Impact
+
+| Scenario | RTO Impact | RPO Impact |
+|----------|------------|------------|
+| Fast shutdown succeeds | Seconds to tens of seconds | Zero data loss |
+| Fast shutdown times out | Up to `stopDelay` (30s default) | Possible data loss |
+| Network partition | Depends on quorum | Zero if quorum maintained |
+
+!!! tip "Tuning for RTO vs RPO"
+ Lower `stopDelay` values favor faster recovery (RTO) but may increase data loss risk (RPO). Higher values prioritize data safety but may delay recovery.
+
+## Testing High Availability
+
+Verify your HA configuration works correctly.
+
+### Test 1: Verify Instance Distribution
+
+```bash
+# Check pod distribution across zones/nodes
+kubectl get pods -n documentdb -l documentdb.io/cluster=my-documentdb \
+ -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,ZONE:.metadata.labels.topology\\.kubernetes\\.io/zone
+```
+
+Expected output shows pods on different nodes/zones:
+```
+NAME NODE ZONE
+my-documentdb-1 node-1 zone-a
+my-documentdb-2 node-2 zone-b
+my-documentdb-3 node-3 zone-c
+```
+
+### Test 2: Simulate Failure
+
+!!! danger "Production Warning"
+ Only perform failure testing in non-production environments or during planned maintenance windows.
+
+```bash
+# Delete the primary pod to simulate failure
+kubectl delete pod my-documentdb-1 -n documentdb
+
+# Watch failover (in another terminal)
+kubectl get pods -n documentdb -w
+
+# Check pod status after failover
+kubectl get pods -n documentdb -l documentdb.io/cluster=my-documentdb
+```
+
+### Test 3: Application Connectivity
+
+```bash
+# Get the connection string from DocumentDB status
+CONNECTION_STRING=$(kubectl get documentdb my-documentdb -n documentdb -o jsonpath='{.status.connectionString}')
+echo "Connection string: $CONNECTION_STRING"
+
+# Test application can reconnect after failover
+mongosh "$CONNECTION_STRING" --eval "print('Connection successful')"
+```
+
+## Troubleshooting
+
+### Pods Not Distributing Across Zones
+
+**Symptom**: Multiple DocumentDB pods scheduled on the same node or zone.
+
+**Cause**: Anti-affinity set to `preferred` and insufficient nodes/zones available.
+
+**Solution**:
+
+1. Add nodes in additional zones
+2. Change to `required` anti-affinity (may prevent scheduling if constraints can't be met)
+
+```bash
+# Check node zone labels
+kubectl get nodes -L topology.kubernetes.io/zone
+```
+
+### Failover Taking Too Long
+
+**Symptom**: Failover takes longer than expected.
+
+**Possible Causes**:
+
+- `stopDelay` set to a high value
+- Storage latency affecting shutdown
+- Network issues delaying probe failures
+
+**Solution**:
+```bash
+# Check operator logs
+kubectl logs -n documentdb-operator -l app.kubernetes.io/name=documentdb-operator --tail=100
+
+# Check events
+kubectl get events -n documentdb --sort-by='.lastTimestamp' | tail -20
+```
+
+### Replica Not Catching Up
+
+**Symptom**: Replica shows increasing replication lag.
+
+**Possible Causes**:
+
+- Network bandwidth limitations
+- Storage I/O bottleneck on the replica
+- High write load on the primary
+
+**Solution**:
+```bash
+# Check replica pod resources
+kubectl top pod my-documentdb-2 -n documentdb
+
+# Check pod logs for replication issues
+kubectl logs my-documentdb-2 -n documentdb --tail=50
+```
+
diff --git a/docs/operator-public-documentation/preview/high-availability/overview.md b/docs/operator-public-documentation/preview/high-availability/overview.md
new file mode 100644
index 00000000..1111890a
--- /dev/null
+++ b/docs/operator-public-documentation/preview/high-availability/overview.md
@@ -0,0 +1,169 @@
+---
+title: High Availability Overview
+description: Understanding high availability options for DocumentDB on Kubernetes - local HA, multi-region, and multi-cloud deployments.
+tags:
+ - high-availability
+ - architecture
+ - disaster-recovery
+search:
+ boost: 2
+---
+
+# High Availability Overview
+
+High availability (HA) ensures your DocumentDB deployment remains accessible and operational despite component failures. This guide covers the HA options available and helps you choose the right approach for your requirements.
+
+## What is High Availability?
+
+High availability in DocumentDB means:
+
+- **Automatic failover**: When a primary instance fails, a replica is automatically promoted
+- **Data durability**: Data is replicated across multiple instances before acknowledging writes
+- **Minimal downtime**: Recovery happens automatically without manual intervention
+- **Continuous operation**: Applications experience a brief interruption rather than an extended outage
+
+## Types of High Availability
+
+DocumentDB supports three levels of high availability, each providing different trade-offs between complexity, cost, and resilience:
+
+```mermaid
+flowchart LR
+ subgraph Local["Local HA (Single Cluster)"]
+ direction TB
+ P1[Primary] --> R1[Replica 1]
+ P1 --> R2[Replica 2]
+ end
+
+ subgraph MultiRegion["Multi-Region (Same Cloud)"]
+ direction TB
+ subgraph Region1[Region A]
+ P2[Primary Cluster]
+ end
+ subgraph Region2[Region B]
+ S2[Standby Cluster]
+ end
+ P2 -.->|Async Replication| S2
+ end
+
+ subgraph MultiCloud["Multi-Cloud"]
+ direction TB
+ subgraph Cloud1[Azure]
+ P3[Primary]
+ end
+ subgraph Cloud2[GCP]
+ S3a[Replica]
+ end
+ subgraph Cloud3[AWS]
+ S3b[Replica]
+ end
+ P3 -.->|Cross-Cloud| S3a
+ P3 -.->|Cross-Cloud| S3b
+ end
+
+ Local ~~~ MultiRegion ~~~ MultiCloud
+```
+
+### Local High Availability
+
+Local HA runs multiple DocumentDB instances within a single Kubernetes cluster, distributed across availability zones.
+
+| Aspect | Details |
+|--------|---------|
+| **Scope** | Single Kubernetes cluster |
+| **Instances** | 1-3 instances (primary + replicas) |
+| **Failover** | Automatic, typically < 30 seconds |
+| **Data Loss** | Zero (synchronous replication) |
+| **Use Case** | Standard production deployments |
+
+**Best for:** Most production workloads requiring high availability without geographic distribution.
+
+[Configure Local HA →](local-ha.md)
+
+### Multi-Region Deployment
+
+Multi-region runs DocumentDB clusters across multiple regions within the same cloud provider, connected via the cloud's native networking.
+
+| Aspect | Details |
+|--------|---------|
+| **Scope** | Multiple regions, single cloud provider |
+| **Networking** | Azure Fleet, VNet peering |
+| **Failover** | Manual promotion required |
+| **Data Loss** | Minimal (async replication lag) |
+| **Use Case** | Disaster recovery, data locality |
+
+**Best for:** Disaster recovery requirements, regulatory compliance requiring data in specific regions, or reducing latency for geographically distributed users.
+
+### Multi-Cloud Deployment
+
+Multi-cloud runs DocumentDB across different cloud providers (Azure, AWS, GCP), connected via service mesh.
+
+| Aspect | Details |
+|--------|---------|
+| **Scope** | Multiple cloud providers |
+| **Networking** | Istio service mesh |
+| **Failover** | Manual promotion required |
+| **Data Loss** | Minimal (async replication lag) |
+| **Use Case** | Vendor independence, maximum resilience |
+
+**Best for:** Organizations requiring cloud vendor independence, maximum disaster resilience, or hybrid cloud strategies.
+
+## RTO and RPO Concepts
+
+When planning for high availability, understand these key metrics:
+
+### Recovery Time Objective (RTO)
+
+**RTO** is the maximum acceptable time your application can be unavailable after a failure.
+
+| HA Type | Typical RTO |
+|---------|-------------|
+| Local HA | < 30 seconds |
+| Multi-Region | Minutes (manual failover) |
+| Multi-Cloud | Minutes (manual failover) |
+
+### Recovery Point Objective (RPO)
+
+**RPO** is the maximum acceptable amount of data loss measured in time.
+
+| HA Type | Typical RPO |
+|---------|-------------|
+| Local HA | 0 (synchronous) |
+| Multi-Region | Seconds (replication lag) |
+| Multi-Cloud | Seconds (replication lag) |
+
+## Decision Tree
+
+Use this guide to select the appropriate HA strategy:
+
+```mermaid
+flowchart TD
+    A[Start] --> B{Need geographic<br/>distribution?}
+    B -->|No| C{Need zone-level<br/>resilience?}
+    B -->|Yes| D{Need cloud vendor<br/>independence?}
+
+    C -->|Yes| E[Local HA<br/>Multiple instances]
+    C -->|No| F[Single instance<br/>Development only]
+
+    D -->|Yes| G[Multi-Cloud<br/>Istio mesh]
+    D -->|No| H[Multi-Region<br/>Same cloud provider]
+
+ E --> I[Configure instancesPerNode: 3]
+ F --> J[Configure instancesPerNode: 1]
+ G --> K[See Multi-Cloud Guide]
+ H --> L[See Multi-Region Guide]
+```
+
+## Trade-offs Summary
+
+| Factor | Local HA | Multi-Region | Multi-Cloud |
+|--------|----------|--------------|-------------|
+| **Complexity** | Low | Medium | High |
+| **Cost** | $ | $$ | $$$ |
+| **RTO** | Seconds | Minutes | Minutes |
+| **RPO** | Zero | Replication lag | Replication lag |
+| **Blast radius** | Zone outage | Region outage | Cloud outage |
+| **Network latency** | Minimal | Regional | Variable |
+
+## Next Steps
+
+- [Configure Local HA](local-ha.md) - Set up high availability within a single cluster
diff --git a/mkdocs.yml b/mkdocs.yml
index b9af440f..71b3b550 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -2,12 +2,15 @@ site_name: DocumentDB-Kubernetes-Operator
repo_url: https://github.com/documentdb/documentdb-kubernetes-operator
edit_uri: edit/main/docs/operator-public-documentation/
docs_dir: docs/operator-public-documentation
+
# Material theme is required for mike's version selector dropdown;
# the readthedocs theme does not support it.
theme:
name: material
features:
- content.code.copy
+ - content.tabs.link
+ - content.code.annotate
# mike (https://github.com/jimporter/mike) enables versioned docs on GitHub Pages.
# Each release gets a frozen snapshot (e.g., /0.2.0/), and "latest" always points
@@ -18,12 +21,18 @@ extra:
provider: mike
default: latest
+extra_javascript:
+ - https://unpkg.com/mermaid@10/dist/mermaid.min.js
+
nav:
- Preview:
- Get Started: preview/index.md
- Advanced Configuration: preview/advanced-configuration/README.md
- Backup and Restore: preview/backup-and-restore.md
- API Reference: preview/api-reference.md
+ - High Availability:
+ - Overview: preview/high-availability/overview.md
+ - Local HA: preview/high-availability/local-ha.md
- FAQ: preview/faq.md
- Tools:
- Kubectl Plugin: preview/kubectl-plugin.md
@@ -33,5 +42,15 @@ plugins:
markdown_extensions:
- admonition
+ - pymdownx.details
+ - pymdownx.superfences:
+ custom_fences:
+ - name: mermaid
+ class: mermaid
+ format: !!python/name:pymdownx.superfences.fence_code_format
+ - pymdownx.tabbed:
+ alternate_style: true
+ - attr_list
+ - md_in_html
- toc:
permalink: true