A Kubernetes operator for automated Kafka backup and disaster recovery. Built with Rust using kube-rs for high performance and reliability.
- Scheduled Backups - Cron-based automatic backups with configurable retention
- Point-in-Time Recovery (PITR) - Restore data to any specific timestamp
- Multi-Cloud Storage - Support for PVC, S3, Azure Blob Storage, and GCS
- Azure Workload Identity - Secure, secretless authentication for Azure
- Compression - LZ4 and Zstd compression support with configurable levels
- Checkpointing - Resumable backups that survive pod restarts
- Rate Limiting - Control backup/restore throughput to minimize cluster impact
- Circuit Breaker - Automatic failure detection and recovery
- Topic Mapping - Restore to different topic names or partitions
- Consumer Offset Management - Reset and rollback consumer group offsets
- Prometheus Metrics - Full observability with built-in metrics endpoint
- Kubernetes cluster (1.26+)
- Helm 3.x
- A running Kafka cluster
# Add the OSO DevOps Helm repository
helm repo add oso https://osodevops.github.io/helm-charts/
helm repo update
# Install the operator
helm install kafka-backup-operator oso/kafka-backup-operator \
--namespace kafka-backup-system \
--create-namespaceapiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
name: my-backup
namespace: kafka
spec:
kafkaCluster:
bootstrapServers:
- kafka-bootstrap:9092
topics:
- orders
- events
storage:
storageType: pvc
pvc:
claimName: kafka-backups
schedule: "0 */6 * * *" # Every 6 hours
compression: zstdkubectl apply -f backup.yamlThe operator provides four CRDs for managing Kafka backup and restore operations:
| CRD | Short Name | Description |
|---|---|---|
KafkaBackup |
kb |
Define backup schedules and configurations |
KafkaRestore |
kr |
Trigger restore operations from backups |
KafkaOffsetReset |
kor |
Reset consumer group offsets |
KafkaOffsetRollback |
korb |
Rollback offsets after failed restores |
apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
name: production-backup
spec:
kafkaCluster:
bootstrapServers:
- kafka-bootstrap:9092
securityProtocol: SASL_SSL
tlsSecret:
name: kafka-tls
saslSecret:
name: kafka-sasl
mechanism: SCRAM-SHA-512
topics:
- orders
- inventory
- events
storage:
storageType: azure
azure:
container: kafka-backups
accountName: mystorageaccount
prefix: production
useWorkloadIdentity: true
schedule: "0 2 * * *" # Daily at 2 AM
compression: zstd
compressionLevel: 3
checkpoint:
enabled: true
intervalSecs: 30apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaBackup
metadata:
name: s3-backup
spec:
kafkaCluster:
bootstrapServers:
- kafka:9092
topics:
- my-topic
storage:
storageType: s3
s3:
bucket: my-kafka-backups
region: eu-west-1
prefix: backups
credentialsSecret:
name: aws-credentials
accessKeyIdKey: AWS_ACCESS_KEY_ID
secretAccessKeyKey: AWS_SECRET_ACCESS_KEY
schedule: "0 */4 * * *"apiVersion: kafka.oso.sh/v1alpha1
kind: KafkaRestore
metadata:
name: restore-orders
spec:
backupRef:
name: production-backup
backupId: "production-backup-20251210-020000" # Optional: specific backup
kafkaCluster:
bootstrapServers:
- kafka-bootstrap:9092
topics:
- orders
# Optional: Point-in-time recovery
pitr:
endTime: "2025-12-10T12:00:00Z"
# Optional: Restore to different topic
topicMapping:
orders: orders-restored
# Safety: Create snapshot before restore
rollback:
snapshotBeforeRestore: true
autoRollbackOnFailure: trueapiVersion: kafka.oso.sh/v1alpha1
kind: KafkaOffsetReset
metadata:
name: reset-consumer
spec:
kafkaCluster:
bootstrapServers:
- kafka-bootstrap:9092
consumerGroup: my-consumer-group
topics:
- orders
resetStrategy: earliest # earliest, latest, timestamp, offsetKey configuration options for the Helm chart:
# values.yaml
replicaCount: 1
image:
repository: ghcr.io/osodevops/kafka-backup-operator
tag: "" # Defaults to appVersion
serviceAccount:
create: true
annotations: {}
# Azure Workload Identity
azureWorkloadIdentity:
enabled: false
clientId: ""
# Logging
logging:
level: "info,kafka_backup_operator=debug"
# Metrics
metrics:
enabled: true
serviceMonitor:
enabled: false
interval: 30s
# Resources
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512MiFor secure, secretless authentication to Azure Blob Storage:
# Enable Workload Identity on AKS
az aks update --resource-group myRG --name myAKS \
--enable-oidc-issuer --enable-workload-identity
# Create managed identity
az identity create --resource-group myRG --name kafka-backup-identity
# Assign Storage Blob Data Contributor role
az role assignment create \
--assignee-object-id $(az identity show -g myRG -n kafka-backup-identity --query principalId -o tsv) \
--role "Storage Blob Data Contributor" \
--scope /subscriptions/.../storageAccounts/mystorageaccount
# Create federated credential
az identity federated-credential create \
--resource-group myRG \
--identity-name kafka-backup-identity \
--name kafka-backup-fedcred \
--issuer $(az aks show -g myRG -n myAKS --query oidcIssuerProfile.issuerUrl -o tsv) \
--subject system:serviceaccount:kafka-backup-system:kafka-backup-operator \
--audience api://AzureADTokenExchange
# Install with Workload Identity enabled
helm install kafka-backup-operator oso/kafka-backup-operator \
--namespace kafka-backup-system \
--set azureWorkloadIdentity.enabled=true \
--set azureWorkloadIdentity.clientId=$(az identity show -g myRG -n kafka-backup-identity --query clientId -o tsv)See docs/azure-workload-identity.md for detailed setup instructions.
The operator exposes Prometheus metrics on port 8080:
| Metric | Description |
|---|---|
kafka_backup_reconciliations_total |
Total reconciliation attempts |
kafka_backup_reconcile_duration_seconds |
Reconciliation duration histogram |
kafka_backup_backups_total |
Total backups by status |
kafka_backup_backup_size_bytes |
Backup size in bytes |
kafka_backup_backup_records |
Records processed |
kafka_backup_restores_total |
Total restores by status |
# Enable in Helm values
metrics:
serviceMonitor:
enabled: true
interval: 30s
labels:
release: prometheus- Normal Operation:
KafkaBackupruns on schedule, storing backups to cloud storage - Disaster Occurs: Kafka cluster fails or data is corrupted
- Recovery:
- Identify the backup to restore from:
kubectl get kafkabackup my-backup -o yaml - Create a
KafkaRestoreresource pointing to the backup - Monitor progress:
kubectl get kafkarestore -w
- Identify the backup to restore from:
- Post-Recovery: Optionally reset consumer offsets with
KafkaOffsetReset
┌─────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ kafka-backup-operator │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Backup │ │ Restore │ │ Offset Reset/ │ │ │
│ │ │ Controller │ │ Controller │ │ Rollback Ctrl │ │ │
│ │ └──────┬──────┘ └──────┬──────┘ └──────────┬──────────┘ │ │
│ │ │ │ │ │ │
│ │ └───────────────┼────────────────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────────┐ │ │
│ │ │ kafka-backup-core │ │ │
│ │ │ (Rust lib) │ │ │
│ │ └─────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Kafka Cluster │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Cloud Storage │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ S3 │ │ Azure │ │ GCS │ │
│ │ │ │ Blob │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────┘
# Clone the repository
git clone https://github.com/osodevops/kafka-backup-operator.git
cd kafka-backup-operator
# Build
cargo build --release
# Generate CRDs
cargo run --bin crdgen > deploy/crds/all.yaml
# Run tests
cargo testSee minikube/README.md for local development setup with Confluent for Kubernetes.
Contributions are welcome! Please read our Contributing Guide for details on the process for submitting pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- kafka-backup-core - The core backup library
- OSO DevOps Helm Charts - Helm chart repository