feat(clickhouse): enhance addon with HA, backup, security, ops and observability #2550

Open
realzyy wants to merge 1 commit into apecloud:main from realzyy:feat/clickhouse-addon-enhancements

@realzyy realzyy commented Apr 8, 2026

Summary

This PR brings the ClickHouse addon closer to feature parity with ClickHouse Cloud, covering 21 items across HA, backup/restore, security, operations, observability, and read-write separation.

High Availability

  • Pod Anti-Affinity + TopologySpreadConstraints for CH Server and Keeper (multi-AZ spread)
  • PodDisruptionBudget for CH Server and Keeper
  • livenessProbe + startupProbe for CH Server (deadlock recovery, slow-start protection)
  • nodeSelector and priorityClassName exposed via values.yaml
  • podManagementPolicy: OrderedReady for make-before-break rolling updates
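
For Keeper, a PodDisruptionBudget only helps if it keeps a Raft majority alive. The following is a minimal sketch of that arithmetic, not the chart's actual logic: the helper names `quorum_size` and `max_safe_disruptions` are hypothetical, and the PDB values this PR ships are not shown in the description.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a Raft ensemble with `replicas` voting members."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return replicas // 2 + 1


def max_safe_disruptions(replicas: int) -> int:
    """Most pods that may be down at once while a majority is still up."""
    return replicas - quorum_size(replicas)


if __name__ == "__main__":
    # Hypothetical ensemble sizes, for illustration only.
    for n in (1, 3, 5):
        print(f"replicas={n} quorum={quorum_size(n)} "
              f"max_unavailable={max_safe_disruptions(n)}")
```

Under this reasoning a 3-node Keeper ensemble tolerates one voluntary disruption at a time, and a single-node ensemble tolerates none.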

Backup & Restore

  • Fix: remove hardcoded exit 1 in incremental-restore.sh that blocked TLS-mode restores
  • Backup schedules enabled by default; full (weekly) and incremental (Mon–Sat) schedules configurable via values.yaml
  • Enable UNDROP TABLE support (allow_experimental_undrop_table_query)
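
To illustrate how a weekly full schedule and a Mon–Sat incremental schedule partition the week, here is a rough sketch. The cron strings are assumptions (the description does not state the chart defaults or which day the full backup runs), and the parser only handles plain values, lists, and ranges in the day-of-week field.

```python
from datetime import date

# Hypothetical cron expressions; the chart's real defaults are not given in this PR text.
FULL_CRON = "0 2 * * 0"           # assumed: Sunday 02:00
INCREMENTAL_CRON = "0 2 * * 1-6"  # assumed: Monday to Saturday 02:00


def cron_weekdays(expr: str) -> set[int]:
    """Expand the day-of-week field of a 5-field cron expression (0 = Sunday).

    Supports "*", single values, comma lists, and ranges; step syntax is not handled.
    """
    field = expr.split()[4]
    days: set[int] = set()
    for part in field.split(","):
        if part == "*":
            return set(range(7))
        if "-" in part:
            lo, hi = part.split("-")
            days.update(d % 7 for d in range(int(lo), int(hi) + 1))
        else:
            days.add(int(part) % 7)  # cron also accepts 7 for Sunday
    return days


def backup_kind(day: date) -> str:
    """Return which schedule (if any) fires on the given calendar day."""
    cron_day = day.isoweekday() % 7  # isoweekday: Mon=1..Sun=7, cron: Sun=0
    if cron_day in cron_weekdays(FULL_CRON):
        return "full"
    if cron_day in cron_weekdays(INCREMENTAL_CRON):
        return "incremental"
    return "none"
```

With these assumed expressions every day of the week is covered by exactly one schedule, which is the property worth checking after overriding the values.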

Security

  • Transparent Data Encryption (TDE) config option using AES-256-GCM-SIV, controlled by values.yaml
  • Default IP allowlist tightened to cluster-internal CIDRs (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
  • Three pre-built user role profiles: readonly, monitoring, ingest
  • rotate-password OpsDefinition: updates ClickHouse SQL users + patches K8s Secret via in-cluster curl
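
A quick way to reason about the tightened allowlist is to test sample client addresses against the three private ranges. This sketch uses Python's standard `ipaddress` module; the function name `is_allowed` is hypothetical and is not part of the addon.

```python
import ipaddress

# The three private IPv4 ranges named in the allowlist item, in full CIDR form.
CLUSTER_INTERNAL_CIDRS = ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
_NETWORKS = [ipaddress.ip_network(cidr) for cidr in CLUSTER_INTERNAL_CIDRS]


def is_allowed(client_ip: str) -> bool:
    """True if client_ip falls inside any of the cluster-internal ranges."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in _NETWORKS)


if __name__ == "__main__":
    for ip in ("10.42.0.7", "172.32.0.1", "8.8.8.8"):
        print(ip, is_allowed(ip))
```

Note that 172.16.0.0/12 ends at 172.31.255.255, so an address such as 172.32.0.1 is outside the allowlist.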

Operations (OpsDefinitions)

  • vscale-check: pre/post memory safety check before vertical scaling
  • pre-scale-in-shard: migrates data off a shard before removal (dry-run supported)
  • diagnose: cluster health report — replica sync, merge queue, Keeper status, slow queries Top-10
  • RBAC: grants kb-<cmpdName> service account Secret patch permission for password rotation
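
The description does not spell out the rule vscale-check applies, so the following is only one plausible shape for a pre-scale memory safety check: refuse a new memory limit that would leave less than a given headroom above current usage. The function name, parameters, and the 20% default are all assumptions.

```python
GIB = 2**30


def vscale_is_safe(current_rss_bytes: int,
                   target_limit_bytes: int,
                   headroom: float = 0.2) -> bool:
    """Hypothetical check: True if the target limit keeps `headroom` free above current RSS."""
    if target_limit_bytes <= 0:
        raise ValueError("target_limit_bytes must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return current_rss_bytes <= target_limit_bytes * (1 - headroom)


if __name__ == "__main__":
    # Scaling down to 8 GiB while using 6 GiB leaves 25% free: passes.
    print(vscale_is_safe(6 * GIB, 8 * GIB))
    # Using 7 GiB would leave only 12.5% free: fails.
    print(vscale_is_safe(7 * GIB, 8 * GIB))
```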

Observability

  • PrometheusRule template with common ClickHouse alerts (replica lag, merge queue, connections, memory)
  • Grafana dashboard — Query Insights: slow query rate, P99/P95 latency, failed queries
  • Grafana dashboard — Keeper health: leader elections, request latency, connections
  • Grafana dashboard — AsyncMetrics: RSS/virtual memory, disk usage, CPU time, MergeTree bytes on disk
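
For readers unfamiliar with the P99/P95 panels, the sketch below shows what such a percentile means using the nearest-rank definition. It is purely illustrative: the dashboards in this PR presumably compute these values from ClickHouse or Prometheus data, and the exact queries are not shown in the description.

```python
import math
from typing import Iterable


def percentile(samples: Iterable[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical query latencies in milliseconds.
    latencies_ms = [12, 15, 11, 240, 14, 13, 18, 16, 17, 900]
    print("P95:", percentile(latencies_ms, 95))
    print("P99:", percentile(latencies_ms, 99))
```

A handful of slow queries dominates the high percentiles even when the median is low, which is why the Query Insights panels track P95 and P99 rather than averages.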

Read-Write Separation

  • New clickhouse-readonly ComponentDefinition with readonly=2 profile (SELECT allowed, DDL/DML blocked)
  • Two new cluster topologies: standalone-with-readonly, cluster-with-readonly
  • ComponentVersion compatibility rules updated

Test plan

  • helm template renders without errors on all modified templates
  • All 4 new OpsDefinitions (diagnose, vscale-check, rotate-password, pre-scale-in-shard) reach Succeed status in k3d
  • Readonly component: SELECT succeeds, INSERT/CREATE TABLE return READONLY Code 164
  • TLS-mode incremental restore completes without exit-1 error
  • Backup schedules reflect values.yaml overrides in rendered BackupPolicyTemplate

🤖 Generated with Claude Code

…servability improvements

High Availability:
- Add Pod Anti-Affinity and TopologySpreadConstraints for CH Server and Keeper
- Add PodDisruptionBudget for CH Server and Keeper
- Add livenessProbe and startupProbe for CH Server
- Expose nodeSelector and priorityClassName via values.yaml
- Set podManagementPolicy to OrderedReady (make-before-break rolling update)

Backup & Restore:
- Fix TLS mode restore: remove hardcoded exit-1 block in incremental-restore.sh
- Enable backup schedules by default with configurable cron via values.yaml
- Decouple full (weekly) and incremental (daily Mon-Sat) backup schedules
- Enable UNDROP TABLE support (allow_experimental_undrop_table_query)

Security:
- Add Transparent Data Encryption (TDE) config option (AES-256-GCM-SIV)
- Tighten default IP allowlist to cluster-internal CIDRs only
- Add readonly/monitoring/ingest user role profiles
- Add rotate-password OpsDefinition with K8s Secret patching via in-cluster curl

Operations:
- Add vscale-check OpsDefinition (pre/post memory safety check)
- Add pre-scale-in-shard OpsDefinition (data migration before shard removal)
- Add diagnose OpsDefinition (replica sync, merge queue, Keeper health, slow queries)
- Add RBAC for OpsDefinition workload pods to patch Secrets

Observability:
- Add PrometheusRule template for common ClickHouse alerts
- Upgrade Grafana dashboard: Query Insights panels (slow queries, P99 latency)
- Add Keeper health panels to Grafana dashboard
- Add AsyncMetrics system resource panels (memory, disk, CPU, MergeTree bytes)

Read-Write Separation:
- Add clickhouse-readonly ComponentDefinition (readonly=2 profile, no DDL/DML)
- Add standalone-with-readonly and cluster-with-readonly cluster topologies
- Register readonly component in ComponentVersion with compatibility rules

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@realzyy realzyy requested review from a team as code owners April 8, 2026 03:24
@apecloud-bot apecloud-bot added the chart-release Auto release chart when PR merged main. label Apr 8, 2026
@realzyy realzyy added the nopick Not auto cherry-pick when PR merged label Apr 8, 2026