feat(mariadb): add 11.4 with async, semi-sync and Galera topologies by weicao · Pull Request #2548 · apecloud/kubeblocks-addons

weicao · 2026-04-05T03:19:58Z

Summary

Extend the MariaDB addon to support MariaDB 11.4 with three replication topologies.

New Topologies (via `ClusterDefinition: mariadb`)

Topology	Replicas	HA	Notes
`standalone`	1	—	Existing behavior, now also supports 11.4
`async`	≥2	syncer	Primary-replica async replication, GTID-based
`semisync`	≥2	syncer	Same cmpd as async; semi-sync plugins loaded, enabled via config
`galera`	≥3	wsrep	Multi-primary Galera Cluster, automatic SST

New ComponentDefinitions

mariadb-replication-1.1.0-alpha.1 — syncer sidecar HA, --plugin-load-add=semisync_master/slave, GTID strict mode
mariadb-galera-1.1.0-alpha.1 — wsrep lifecycle (bootstrap/join/leave/roleprobe), ports 3306/4444/4567/4568

New Versions

Version	Supported Topologies
10.6.15	standalone, async, semisync
11.4.5	standalone, async, semisync, galera

Files Added

addons/mariadb/templates/cmpd-replication.yaml — replication ComponentDefinition
addons/mariadb/templates/cmpd-galera.yaml — Galera ComponentDefinition
addons/mariadb/templates/clusterdefinition.yaml — 4 topologies
addons/mariadb/templates/configmap.yaml — config templates for replication + Galera
addons/mariadb/templates/configmap-scripts-galera.yaml — Galera lifecycle scripts
addons/mariadb/config/mariadb-replication.tpl — binlog/GTID/semisync my.cnf
addons/mariadb/config/mariadb-galera.tpl — wsrep/Galera my.cnf
addons/mariadb/scripts/galera-{start,roleprobe,member-join,member-leave}.sh

Usage

# Galera cluster (3 nodes)
helm install mariadb-galera kubeblocks-addons/mariadb-cluster --set topology=galera --set replicas=3 -n demo

# Semi-sync (primary + replica)
helm install mariadb-ss kubeblocks-addons/mariadb-cluster --set topology=semisync --set replicas=2 -n demo

Notes

Galera requires mariadb:11.4+ (wsrep built-in); 10.6.15 is excluded from Galera compatibility
Syncer image: apecloud/syncer:0.6.8 (same as MySQL addon)
Chart version bumped to 1.1.0-alpha.1 for both addons and addons-cluster charts

Test plan

helm template passes for all 4 topologies (verified locally)
Standalone 11.4 deployment and connection
Async/semisync: syncer reports correct primary/secondary roles
Galera: wsrep_cluster_size=3 after 3-node deploy
Galera scale-out 3→4: SST completes, new node synced

🤖 Generated with Claude Code

…logies

- Add MariaDB 11.4.5 to ComponentVersion (10.6.15 retained)
- Add mariadb-replication ComponentDefinition: primary-replica with syncer sidecar HA, semi-sync plugins loaded, GTID-based replication
- Add mariadb-galera ComponentDefinition: multi-primary Galera Cluster with wsrep bootstrap/join scripts, roleProbe, memberJoin/Leave
- Add ClusterDefinition with topologies: standalone, async, semisync, galera
- Add config templates for replication (my.cnf) and Galera (wsrep settings)
- Add Galera lifecycle scripts: galera-start.sh, galera-roleprobe.sh, galera-member-join.sh, galera-member-leave.sh
- Update addons-cluster: topology field replaces mode, supports all 4 topologies
- Bump chart versions to 1.1.0-alpha.1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Critical fixes:

- cmpv.yaml: image key changed from 'syncer' to 'init-syncer' to match init container name (was causing syncer binary to be missing at startup)
- cmpd-replication: call docker-entrypoint.sh mariadbd (not raw mariadbd) to ensure data directory initialization on first boot
- mariadb-galera.tpl: move all wsrep settings from invalid [galera] section to [mysqld] (MariaDB does not read [galera] section)

HA/availability fixes:

- galera-start.sh: handle full cluster restart by reading grastate.dat and bootstrapping on the node with safe_to_bootstrap=1
- galera-start.sh: remove --mariadb-root-host (not a valid mariadbd flag); use docker-entrypoint.sh for proper DB initialization
- galera-start.sh: write wsrep_sst_auth to a 600-mode config file instead of passing credentials on the command line (ps exposure)
- galera-member-join.sh: increase SST timeout from 300s to 3600s (large DB)
- cmpd-replication: add kbreplicator system account required by syncer to set up replication between primary and secondary
- cmpd-galera: expose MARIADB_ROOT_PASSWORD env for docker-entrypoint.sh
- cmpd-replication: expose MARIADB_ROOT_PASSWORD env for docker-entrypoint.sh
- cmpd-galera: increase roleProbe timeoutSeconds from 1s to 5s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
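The full-cluster-restart handling mentioned for galera-start.sh depends on reading `safe_to_bootstrap` from grastate.dat. The following is a minimal sketch of that check, assuming the standard grastate.dat layout (a `safe_to_bootstrap: 0|1` line). The function name is hypothetical and is not taken from the actual script.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name, not from galera-start.sh).
# Returns 0 only when grastate.dat marks this node as safe to bootstrap.
is_safe_to_bootstrap() {
  grastate="$1"
  # Without a state file there is nothing that says "safe", so report "no".
  [ -f "$grastate" ] || return 1
  # Extract the numeric value of the first safe_to_bootstrap line.
  value=$(sed -n 's/^safe_to_bootstrap:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$grastate" | head -n 1)
  [ "$value" = "1" ]
}
```

A start script could call `is_safe_to_bootstrap "$DATA_DIR/grastate.dat"` and, on success, start mariadbd with `--wsrep-new-cluster`; otherwise it would start normally and join an already bootstrapped peer. How the real script handles a first-ever deployment, where no grastate.dat exists yet, is not shown here.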

codecov-commenter · 2026-04-05T03:35:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (fa480e5) to head (43f79e6).

Additional details and impacted files

@@          Coverage Diff          @@
##            main   #2548   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         72      72           
  Lines       9164    9164           
=====================================
  Misses      9164    9164


Round 2 fixes:

- Use ^mariadb-[0-9] regex for standalone topology to avoid matching replication/galera cmpds in cmpv.yaml and clusterdefinition.yaml
- Remove duplicate MARIADB_ROOT_PASSWORD env entry from cmpd-replication and cmpd-galera (KubeBlocks vars injection already handles this)
- Fix galera-start.sh: write wsrep_sst_auth to DATA_DIR (PVC, writable) instead of /etc/mysql/conf.d (ConfigMap mount, read-only)
- Fix actionset.yaml restore: use chown+chmod 750 instead of chmod 0777

Round 3 fixes:

- Use /tools/syncer full path in cmpd-replication container command
- chown mysql:mysql galera-sst-auth.cnf so mariadbd (running as mysql) can read it
- Use bash arrays for MARIADB_CMD in all Galera scripts to safely handle passwords with special characters
- Add init-syncer image to 10.6.15 release in cmpv.yaml so replication cmpd initContainer has an image when used with MariaDB 10.6.15
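The wsrep_sst_auth handling across these rounds (private file, written under the data directory, readable by the mysql user) can be sketched roughly as follows. This is an illustration under assumptions, not the code in galera-start.sh: the function name and argument order are hypothetical, and only the file name galera-sst-auth.cnf comes from the commit message.

```shell
#!/bin/sh
# Hypothetical helper: store wsrep_sst_auth in a private option file under the
# data directory instead of exposing it on the mariadbd command line.
write_sst_auth() {
  data_dir="$1"; sst_user="$2"; sst_pass="$3"
  auth_file="$data_dir/galera-sst-auth.cnf"
  (
    # Restrictive umask so a newly created file is never world-readable.
    umask 077
    printf '[mysqld]\nwsrep_sst_auth=%s:%s\n' "$sst_user" "$sst_pass" > "$auth_file"
  ) || return 1
  # Also covers the case where the file already existed with looser permissions.
  chmod 600 "$auth_file" || return 1
  # When running as root, hand the file to the mysql user so mariadbd can read
  # it. Skipped when not root or when no mysql user exists.
  if [ "$(id -u)" -eq 0 ] && id mysql >/dev/null 2>&1; then
    chown mysql:mysql "$auth_file" || return 1
  fi
}
```

Passwords containing `#` or quote characters would need option-file quoting, which this sketch does not attempt.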

Without roleSelector, the service routes traffic to all pods including donor/joiner nodes during SST. With roleSelector: primary, only fully synced nodes (wsrep_local_state=4) receive service traffic.

…ridge, replication shebang/POSIX, paramsdef/scripts hardening

Single commit landing the chart changes that unblock async/semisync/galera topology smoke on KubeBlocks main + apecloud-registry stable-tools 1.0.2. All 8 fresh namespaces (standalone/semisync/async/galera x n01/n02) pass end-to-end on alpha.16 with helm rev 12 in the IDC vcluster smoke harness.

Galera SST EPERM fix (path 3, rsync):

  config/mariadb-galera.tpl
    wsrep_sst_method: mariabackup -> rsync
    Reason: mariabackup --move-back invokes chown(2) on extracted SST files. The mariadbd container runs as the mysql user (UID 999) with CapEff=0 because k8s securityContext.capabilities.add only lands caps in the bounding set; setuid(mysql) clears the effective set, and k8s does not currently expose ambient caps. rsync SST does not perform chown(2) and bypasses the EPERM signal-11 crash. cap-add was tried and confirmed ineffective; intentionally NOT carried in the chart so we keep least-privilege.

Role-probe file bridge (cross-container) and Galera lifecycle scripts:

  scripts/galera-start.sh
    Adds a persistent background watcher that polls wsrep_local_state via the local mysqld socket every 3s and atomically writes DATA_DIR/.galera-role with primary (state=4 Synced) or secondary (anything else). The file is chowned to mysql:mysql so that subsequent SST/IST file operations on the data directory do not encounter root-owned debris. The watcher uses set +e so a single transient mysqld socket error never kills the loop; tmp+atomic-mv avoids torn reads. The watcher also touches .galera-synced once on first state=4 transition for the memberJoin signal.

  scripts/galera-roleprobe.sh
    POSIX-only (#!/bin/sh); kbagent ships busybox sh and runs the script inside its own container, so bash arrays / bashisms cause syntax errors and a permanent role-probe fail. Reads DATA_DIR/.galera-role written by galera-start.sh and emits primary/secondary. No mariadb client invocation in this container, which is required because KubeBlocks main dropped ExecAction.container so the cmpd-galera.yaml roleProbe.exec.container=mariadb hint is silently discarded and the script always runs in the kbagent container (no mariadb binary available there).

  scripts/galera-member-join.sh
    POSIX-only; replaces array-driven mariadb invocation with a file-based wait on DATA_DIR/.galera-synced (written by galera-start.sh background watcher). kbagent has no mariadb binary, so the previous direct-query approach was non-functional even before the SST fix.

  scripts/galera-member-leave.sh
    POSIX-only; mariadb_cmd shell function replaces bash array.

Replication topology hardening (Path A chart-side reconfigureAction + shebang/POSIX in role-probe + bounded retries + ordered bootstrap):

  templates/cmpd.yaml
  templates/cmpd-galera.yaml
  templates/cmpd-replication.yaml
    Add include mariadb.config.reconfigureAction so the reconfigure action is rendered into the CmpD configs[] block. Earlier paramsdef-based legacy reloadAction path no longer reliably triggers on KubeBlocks main; the CmpD-side reconfigureAction is the documented new path and is honored by main today.

  templates/cmpd-replication.yaml
    Replication topology was previously async/semi-sync mixed; split out proper async vs semisync ComponentDefinitions so the bootstrap and member-join flows can each enforce their own readiness gates.
    podManagementPolicy: Parallel -> OrderedReady so primary bootstrap completes before secondaries try to register, eliminating a class of early-bootstrap split-brain races.

  scripts/replication-roleprobe.sh
    POSIX-only (#!/bin/sh) for the same kbagent-busybox reason as Galera. Replaces 4 occurrences of (printf | grep) pipelines with shell-builtin case statements so the probe runs entirely in builtins when mariadb client is not available on the probe path.

Paramsdef / pcr / configmap fixes:

  templates/paramsdef.yaml
    Fix templateName: mariadb-galera-config-template (referenced by KB main reconfigure path) -> mariadb-galera-config so the parameter template is actually resolvable. Previous typo silently failed.

  templates/cmpv.yaml, templates/pcr.yaml, templates/configmap.yaml, templates/configmap-scripts-replication.yaml
    Add the bundled but previously-untracked rendering for replication scripts and ParameterTemplate inventory so a clean install picks up the same files we have been deploying via helm upgrade.

  config/mariadb-semisync.tpl, config/mariadb-standalone.tpl, config/mariadb-config-effect-scope.yaml
    Add the corresponding config templates and effect-scope mapping that the chart references; these were present in the working tree but never committed.

  templates/cmpd.yaml
    securityContext: add the standalone TLS bootstrap copy step using /var/lib/mysql/tls staging.

  templates/_helpers.tpl
    Add mariadb.config.reconfigureAction helper used by the three CmpDs. Add mariadb.galera.spec.systemAccounts and related galera-specific helpers split out from the shared block.

values.yaml:

  roleProbe.timeoutSeconds: 1 -> 2
    1s consistently spiked over the threshold under SHOW SLAVE STATUS contention; 2s gives margin while staying below periodSeconds=3.
  roleProbe.periodSeconds: 5 -> 3
    Tighten the probe cadence so role label propagation latency is closer to the expected SLA window for switchover/failover tests.
  image.syncer.tag
    Pin to mariadb-local-20260504-123616-amd64. This is the mariadb-aware syncer build that contains the addon-targeted ha.go changes; the upstream tag did not include them yet so async/semisync DCS election paths regressed.
  defaultServiceVersion / image.tag
    pins synced to the matching image we side-loaded into the IDC vcluster.

Chart.yaml:

  version: 1.1.0-alpha.1 -> 1.1.1-alpha.16
  appVersion: 11.4.5 -> 11.4.10

Validation summary (alpha.16 fresh matrix, IDC vcluster, 2026-05-04 22:04:03):

  standalone n01/n02  Running 1/1 pod ready
  semisync   n01/n02  Running 2/2 pod ready, IO/SQL=Yes lag=0
  async      n01/n02  Running 3/3 pod ready, IO/SQL=Yes lag=0 (n01 did not reproduce task #420 split-brain; that defect is tracked separately and is not blocked by this commit)
  galera     n01/n02  Running 3/3 pod ready, wsrep_sst_method=rsync, cluster_size=3, all Synced, no EPERM / move-back / wsrep_sst_mariabackup in logs, mariadb container CapEff=0

  Evidence: artifacts/mariadb-alpha16-fresh-matrix-20260504-215633.tar.gz
  sha256 ae8bb7f7e980b02b426603fb95056c6f3485a83d6b0b82440639b8b8b11c160d

Out of scope (tracked separately):

  async n01 split-brain / failover safety under partial DCS+peer-DB reachability loss (task #420 evidence pack, sha256 b2a428964205942e6d88f3bffbd85967d5c253f5537d02380bc2469a2fe2a510) remains an open P0 owned outside this commit; the fresh n01 sample in this matrix did not reproduce the failure mode and therefore does not gate this Galera/rsync + role-probe / bootstrap fix.

  CAP_CHOWN/CAP_FOWNER as a path to keep mariabackup as the SST method was tried and deliberately not carried; effectiveness blocked by the kernel/k8s cap model described above.
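The replication role-probe change described in this commit message, replacing `printf | grep` pipelines with `case` statements, can be illustrated with a small POSIX sh sketch. The helper names below are hypothetical and the actual replication-roleprobe.sh may match different strings; the point is only that `case` pattern matching needs no external binary and no subprocess.

```shell
#!/bin/sh
# Substring test using only the shell's case construct (no fork, no grep).
# Similar in intent to: printf '%s' "$1" | grep -qF "$2"
# The quoted "$2" in the pattern is matched literally, not as a glob.
contains() {
  case "$1" in
    *"$2"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Validate that a value is one of the two role labels used by the probes.
is_known_role() {
  case "$1" in
    primary|secondary) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `contains "$status" "primary" && echo primary` runs entirely inside busybox sh, which matters when the probe executes in a container that ships no mariadb client and only a minimal toolset.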

weicao · 2026-05-04T14:26:33Z

Validation update — alpha.16 fresh matrix 8/8 PASS

Pushed commit 6e71b6a8 landing the alpha.16 chart. Branch tip moved 43f79e63 -> 6e71b6a8.

Fresh validation (IDC vcluster, 2026-05-04 22:04:03)

8 fresh namespaces, alpha.16 chart, no helm upgrade carry-over:

topology	namespace	pods	role	status
standalone	n01/n02	1/1 ready	n/a	Running
semisync	n01/n02	2/2 ready	1 primary + 1 secondary, IO/SQL=Yes lag=0	Running
async	n01/n02	3/3 ready	1 primary + 2 secondary, IO/SQL=Yes lag=0	Running
galera	n01/n02	3/3 ready	all primary, wsrep_sst_method=rsync, cluster_size=3, all Synced	Running

Galera evidence: container /proc/1/status confirms CapEff=0000000000000000 (no cap add), pod logs contain no Operation not permitted / move-back / wsrep_sst_mariabackup.
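The CapEff check behind this evidence can be reproduced with a short script. This is a sketch assuming the usual `/proc/<pid>/status` layout (a `CapEff:` line followed by a hex mask); the function names are hypothetical.

```shell
#!/bin/sh
# Print the CapEff hex mask from a /proc/<pid>/status-style file.
cap_eff() {
  awk '$1 == "CapEff:" { print $2; exit }' "$1"
}

# Succeed only when the effective capability set is empty (mask is all zeros).
cap_eff_is_empty() {
  mask=$(cap_eff "$1")
  case "$mask" in
    ""|*[!0]*) return 1 ;;   # missing line, or any non-zero digit present
    *) return 0 ;;
  esac
}
```

Inside the container, `cap_eff_is_empty /proc/1/status` would succeed for the state reported above. From outside, the file could first be fetched with something like `kubectl exec <pod> -c mariadb -- cat /proc/1/status`, where the pod name is a placeholder.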

Evidence pack: mariadb-alpha16-fresh-matrix-20260504-215633.tar.gz
sha256: ae8bb7f7e980b02b426603fb95056c6f3485a83d6b0b82440639b8b8b11c160d

What changed in this commit

Galera SST method changed mariabackup -> rsync (config/mariadb-galera.tpl). mariabackup --move-back calls chown(2) inside the joiner; with mariadbd running as the unprivileged mysql user (UID 999) and CapEff=0, chown fails with EPERM and the joiner crashes (signal 11). rsync SST does not perform chown(2). cap-add (CHOWN/FOWNER) was tried and confirmed ineffective because k8s securityContext.capabilities.add only lands caps in the bounding set; setuid(mysql) clears the effective set, and k8s does not currently expose ambient caps. cap-add is intentionally NOT carried in the chart so we keep least-privilege.
Role-probe file bridge for Galera (scripts/galera-start.sh persistent watcher writing ${DATA_DIR}/.galera-role + scripts/galera-roleprobe.sh reading the file). Required because KubeBlocks main dropped ExecAction.container, so roleProbe.exec.container=mariadb is silently discarded and the probe always runs in the kbagent container which has no mariadb client.
POSIX-only Galera lifecycle scripts (member-join, member-leave, role-probe, plus replication-roleprobe). Previous bash-array idioms triggered busybox sh syntax errors.
Path A reconfigureAction in CmpDs (cmpd.yaml, cmpd-galera.yaml, cmpd-replication.yaml) so the new KB-main-honored reconfigure path is rendered into configs[].
Replication topology split + OrderedReady bootstrap (cmpd-replication.yaml) so primary completes bootstrap before secondaries register.
paramsdef typo fix (mariadb-galera-config-template -> mariadb-galera-config).
values.yaml: roleProbe.timeoutSeconds 1 -> 2, periodSeconds 5 -> 3, syncer image pinned to mariadb-aware build.
Chart.yaml: 1.1.0-alpha.1 -> 1.1.1-alpha.16, appVersion 11.4.5 -> 11.4.10.
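The role-probe file bridge listed above has two halves: a writer next to mariadbd and a reader in the kbagent container. A minimal POSIX sh sketch of both follows. It assumes the `wsrep_local_state` value has already been obtained and is passed in as an argument; the function names are hypothetical, and falling back to `secondary` when the role file is missing is an assumption rather than confirmed behavior of galera-roleprobe.sh.

```shell
#!/bin/sh
# Writer side (runs beside mariadbd): map wsrep_local_state to a role label
# and publish it atomically. 4 means Synced; anything else is secondary.
write_role() {
  data_dir="$1"; state="$2"
  case "$state" in
    4) role=primary ;;
    *) role=secondary ;;
  esac
  role_tmp="$data_dir/.galera-role.tmp.$$"
  # Write a temp file in the same directory, then rename it into place.
  # rename(2) is atomic within one filesystem, so readers never see torn data.
  printf '%s\n' "$role" > "$role_tmp" && mv -f "$role_tmp" "$data_dir/.galera-role"
}

# Reader side (role probe): no mariadb client, only shell builtins.
# Assumption: a missing or unreadable file is reported as secondary.
probe_role() {
  role_file="$1/.galera-role"
  role=""
  if [ -r "$role_file" ]; then
    IFS= read -r role < "$role_file" || true
  fi
  case "$role" in
    primary) echo primary ;;
    *) echo secondary ;;
  esac
}
```

In the described design, the watcher would call something like `write_role "$DATA_DIR" "$state"` every 3s, and the probe would print the result of `probe_role "$DATA_DIR"`. Both containers need the data directory mounted for the bridge to work.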

Out of scope (tracked separately)

async n01 split-brain / failover-safety incident from earlier in the day (task #420 evidence pack sha256 b2a428964205942e6d88f3bffbd85967d5c253f5537d02380bc2469a2fe2a510) is an independent P0 owned outside this commit. The fresh n01 sample in this matrix did not reproduce the failure mode and therefore does not gate this Galera/rsync + role-probe / bootstrap fix.
CAP_CHOWN/CAP_FOWNER as a path to keep mariabackup as SST method was tried and deliberately not carried; the kernel/k8s cap-drop-on-setuid model blocks effectiveness.

weicao requested review from a team as code owners April 5, 2026 03:19

weicao and others added 2 commits April 5, 2026 03:20

chore: auto generated files

81fc7ef

weicao marked this pull request as draft April 8, 2026 03:21

weicao and others added 4 commits May 4, 2026 22:19

fix(mariadb): add roleSelector to Galera default service

2ca4a73

chore: auto generated files

c98795d

weicao wants to merge 7 commits into main from feat/mariadb-11.4-replication-galera
