feat(mariadb): add 11.4 with async, semi-sync and Galera topologies #2548
Draft
…logies
- Add MariaDB 11.4.5 to ComponentVersion (10.6.15 retained)
- Add mariadb-replication ComponentDefinition: primary-replica with syncer sidecar HA, semi-sync plugins loaded, GTID-based replication
- Add mariadb-galera ComponentDefinition: multi-primary Galera Cluster with wsrep bootstrap/join scripts, roleProbe, memberJoin/Leave
- Add ClusterDefinition with topologies: standalone, async, semisync, galera
- Add config templates for replication (my.cnf) and Galera (wsrep settings)
- Add Galera lifecycle scripts: galera-start.sh, galera-roleprobe.sh, galera-member-join.sh, galera-member-leave.sh
- Update addons-cluster: topology field replaces mode, supports all 4 topologies
- Bump chart versions to 1.1.0-alpha.1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Critical fixes:
- cmpv.yaml: image key changed from 'syncer' to 'init-syncer' to match init container name; the mismatch was causing the syncer binary to be missing at startup
- cmpd-replication: call docker-entrypoint.sh mariadbd (not raw mariadbd) to ensure data directory initialization on first boot
- mariadb-galera.tpl: move all wsrep settings from invalid [galera] section to [mysqld]; MariaDB does not read [galera] section
HA/availability fixes:
- galera-start.sh: handle full cluster restart by reading grastate.dat and bootstrapping on the node with safe_to_bootstrap=1
- galera-start.sh: remove --mariadb-root-host (not a valid mariadbd flag); use docker-entrypoint.sh for proper DB initialization
- galera-start.sh: write wsrep_sst_auth to a 600-mode config file instead of passing credentials on the command line (ps exposure)
- galera-member-join.sh: increase SST timeout from 300s to 3600s (large DB)
- cmpd-replication: add kbreplicator system account required by syncer to set up replication between primary and secondary
- cmpd-galera: expose MARIADB_ROOT_PASSWORD env for docker-entrypoint.sh
- cmpd-replication: expose MARIADB_ROOT_PASSWORD env for docker-entrypoint.sh
- cmpd-galera: increase roleProbe timeoutSeconds from 1s to 5s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
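The full-cluster-restart handling described above hinges on reading safe_to_bootstrap from grastate.dat. A minimal POSIX sh sketch of that check follows; the function name, the fallback to 0 when the file or key is missing, and the commented call site are assumptions for illustration, not the contents of galera-start.sh.

```shell
# Sketch: print the safe_to_bootstrap flag stored in a Galera grastate.dat.
# Prints 0 when the file is unreadable or the key is absent (assumed fallback).
read_safe_to_bootstrap() {
    grastate_file="$1"
    if [ ! -r "$grastate_file" ]; then
        echo 0
        return 0
    fi
    # Keep only the first "safe_to_bootstrap: <0|1>" line and extract the digit.
    value=$(sed -n 's/^safe_to_bootstrap:[[:space:]]*\([01]\).*$/\1/p' "$grastate_file" | head -n 1)
    echo "${value:-0}"
}

# Hypothetical call site (DATA_DIR default is an assumption):
# if [ "$(read_safe_to_bootstrap "${DATA_DIR:-/var/lib/mysql}/grastate.dat")" = "1" ]; then
#     exec docker-entrypoint.sh mariadbd --wsrep-new-cluster
# fi
```

Treating a missing file as "not safe" keeps a fresh or wiped node from bootstrapping a second cluster by accident; the first-boot case would need its own branch.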
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #2548 +/- ##
=====================================
Coverage 0.00% 0.00%
=====================================
Files 72 72
Lines 9164 9164
=====================================
Misses 9164 9164
Round 2 fixes:
- Use ^mariadb-[0-9] regex for standalone topology to avoid matching replication/galera cmpds in cmpv.yaml and clusterdefinition.yaml
- Remove duplicate MARIADB_ROOT_PASSWORD env entry from cmpd-replication and cmpd-galera (KubeBlocks vars injection already handles this)
- Fix galera-start.sh: write wsrep_sst_auth to DATA_DIR (PVC, writable) instead of /etc/mysql/conf.d (ConfigMap mount, read-only)
- Fix actionset.yaml restore: use chown+chmod 750 instead of chmod 0777
Round 3 fixes:
- Use /tools/syncer full path in cmpd-replication container command
- chown mysql:mysql galera-sst-auth.cnf so mariadbd (running as mysql) can read it
- Use bash arrays for MARIADB_CMD in all Galera scripts to safely handle passwords with special characters
- Add init-syncer image to 10.6.15 release in cmpv.yaml so replication cmpd initContainer has an image when used with MariaDB 10.6.15
Without roleSelector, the service routes traffic to all pods including donor/joiner nodes during SST. With roleSelector: primary, only fully synced nodes (wsrep_local_state=4) receive service traffic.
…ridge, replication shebang/POSIX, paramsdef/scripts hardening
Single commit landing the chart changes that unblock async/semisync/galera
topology smoke on KubeBlocks main + apecloud-registry stable-tools 1.0.2.
All 8 fresh namespaces (standalone/semisync/async/galera x n01/n02) pass
end-to-end on alpha.16 with helm rev 12 in the IDC vcluster smoke harness.
Galera SST EPERM fix (path 3, rsync):
config/mariadb-galera.tpl
wsrep_sst_method: mariabackup -> rsync
Reason: mariabackup --move-back invokes chown(2) on extracted SST
files. The mariadbd container runs as the mysql user (UID 999) with
CapEff=0 because k8s securityContext.capabilities.add only lands caps
in the bounding set; setuid(mysql) clears the effective set, and k8s
does not currently expose ambient caps. rsync SST does not perform
chown(2) and bypasses the EPERM signal-11 crash. cap-add was tried
and confirmed ineffective; intentionally NOT carried in the chart so
we keep least-privilege.
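
The CapEff=0 observation can be reproduced from a process status file. The sketch below assumes the Linux /proc/<pid>/status layout; the function names and the commented call site are illustrative and are not part of the chart.

```shell
# Sketch: print the effective capability mask from a /proc/<pid>/status file.
cap_eff_of() {
    awk '$1 == "CapEff:" { print $2; exit }' "$1"
}

# Succeeds only when the mask is present and every hex digit is 0.
cap_eff_is_empty() {
    case "$(cap_eff_of "$1")" in
        ''|*[!0]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical check inside the mariadb container:
# cap_eff_is_empty /proc/1/status && echo "no effective capabilities"
```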
Role-probe file bridge (cross-container) and Galera lifecycle scripts:
scripts/galera-start.sh
Adds a persistent background watcher that polls wsrep_local_state via
the local mysqld socket every 3s and atomically writes
DATA_DIR/.galera-role with primary (state=4 Synced) or secondary
(anything else). The file is chowned to mysql:mysql so that
subsequent SST/IST file operations on the data directory do not
encounter root-owned debris. The watcher uses set +e so a single
transient mysqld socket error never kills the loop; tmp+atomic-mv
avoids torn reads. The watcher also touches .galera-synced once on
first state=4 transition for the memberJoin signal.
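
A minimal sketch of the watcher logic described above, under stated assumptions: the function names are hypothetical, the client is assumed to authenticate through its defaults over the local socket, and the chown step is omitted for brevity.

```shell
# Map wsrep_local_state to a role: 4 (Synced) is primary, anything else secondary.
state_to_role() {
    case "$1" in
        4) echo primary ;;
        *) echo secondary ;;
    esac
}

# Write the role via a temp file plus rename so readers never see a torn file.
write_role() {
    role_file="$1"
    role="$2"
    tmp_file="${role_file}.tmp.$$"
    printf '%s\n' "$role" > "$tmp_file" && mv -f "$tmp_file" "$role_file"
}

# Assumed query helper; prints the numeric state or nothing on error.
query_wsrep_local_state() {
    mariadb -N -B -e "SHOW STATUS LIKE 'wsrep_local_state'" 2>/dev/null | awk '{ print $2 }'
}

# Background loop: tolerate transient errors, poll every 3s.
watch_role() {
    data_dir="$1"
    set +e
    while :; do
        state=$(query_wsrep_local_state)
        write_role "$data_dir/.galera-role" "$(state_to_role "$state")"
        # Touch the join marker once, on the first Synced observation.
        if [ "$state" = "4" ] && [ ! -e "$data_dir/.galera-synced" ]; then
            touch "$data_dir/.galera-synced"
        fi
        sleep 3
    done
}

# Hypothetical start: watch_role "$DATA_DIR" &
```

An empty query result maps to secondary, so a socket error degrades the reported role instead of stopping the loop.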
scripts/galera-roleprobe.sh
POSIX-only (#!/bin/sh); kbagent ships busybox sh and runs the script
inside its own container, so bash arrays / bashisms cause syntax
errors and a permanent role-probe fail.
Reads DATA_DIR/.galera-role written by galera-start.sh and emits
primary/secondary. No mariadb client invocation in this container,
which is required because KubeBlocks main dropped ExecAction.container
so the cmpd-galera.yaml roleProbe.exec.container=mariadb hint is
silently discarded and the script always runs in the kbagent
container (no mariadb binary available there).
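
The probe side of the file bridge can be sketched in builtins only, which is what a busybox sh environment needs. The function name and the choice to report secondary when the file is missing or holds an unexpected value are assumptions, not confirmed behavior of galera-roleprobe.sh.

```shell
# Sketch: emit the role recorded in the bridge file, using shell builtins only.
probe_role() {
    role_file="$1"
    role=""
    if [ -r "$role_file" ]; then
        # read returns non-zero on a missing trailing newline but still sets role.
        IFS= read -r role < "$role_file" || true
    fi
    case "$role" in
        primary) printf 'primary' ;;
        *) printf 'secondary' ;;   # assumed fallback for missing or unknown content
    esac
}

# Hypothetical call site:
# probe_role "${DATA_DIR:-/var/lib/mysql}/.galera-role"
```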
scripts/galera-member-join.sh
POSIX-only; replaces array-driven mariadb invocation with a file-based
wait on DATA_DIR/.galera-synced (written by galera-start.sh
background watcher). kbagent has no mariadb binary, so the previous
direct-query approach was non-functional even before the SST fix.
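
The file-based wait can be sketched as a bounded polling loop. The function name, argument order, and the commented call site are hypothetical; the 3600s value comes from the SST timeout mentioned in an earlier commit on this PR.

```shell
# Sketch: wait until a marker file exists or the timeout (seconds) elapses.
# Usage: wait_for_file <path> <timeout-seconds> [poll-interval-seconds]
wait_for_file() {
    path="$1"
    timeout="$2"
    interval="${3:-1}"
    elapsed=0
    while [ ! -e "$path" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Hypothetical call site:
# wait_for_file "$DATA_DIR/.galera-synced" 3600 3 || { echo "SST did not finish" >&2; exit 1; }
```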
scripts/galera-member-leave.sh
POSIX-only; mariadb_cmd shell function replaces bash array.
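
One way a shell function can stand in for a bash array is to forward "$@" unchanged, which preserves arguments with spaces or special characters. In the sketch below the MARIADB_ROOT_USER variable and the use of the MYSQL_PWD environment variable (read by the MariaDB client) are assumptions and may differ from what galera-member-leave.sh does.

```shell
# Sketch: POSIX replacement for a bash array holding the client command line.
mariadb_cmd() {
    # The password travels through the environment rather than argv,
    # and "$@" keeps each caller argument as a single word.
    MYSQL_PWD="${MARIADB_ROOT_PASSWORD:-}" mariadb --user="${MARIADB_ROOT_USER:-root}" "$@"
}

# Hypothetical call site:
# mariadb_cmd -N -B -e "SHOW STATUS LIKE 'wsrep_cluster_size'"
```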
Replication topology hardening (Path A chart-side reconfigureAction +
shebang/POSIX in role-probe + bounded retries + ordered bootstrap):
templates/cmpd.yaml
templates/cmpd-galera.yaml
templates/cmpd-replication.yaml
Add include mariadb.config.reconfigureAction so the reconfigure
action is rendered into the CmpD configs[] block. The earlier
paramsdef-based legacy reloadAction path no longer triggers reliably
on KubeBlocks main; the CmpD-side reconfigureAction is the documented
new path and is honored by main today.
templates/cmpd-replication.yaml
Replication topology was previously async/semi-sync mixed; split out
proper async vs semisync ComponentDefinitions so the bootstrap and
member-join flows can each enforce their own readiness gates.
podManagementPolicy: Parallel -> OrderedReady so primary bootstrap
completes before secondaries try to register, eliminating a class of
early-bootstrap split-brain races.
scripts/replication-roleprobe.sh
POSIX-only (#!/bin/sh) for the same kbagent-busybox reason as
Galera. Replaces 4 occurrences of (printf | grep) pipelines with
shell-builtin case statements so the probe runs entirely in builtins
when mariadb client is not available on the probe path.
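
The pipeline-to-builtin replacement can be illustrated with a substring test. This is a generic sketch of the technique; the helper name is hypothetical and the probe's actual match strings are not shown in this message.

```shell
# Sketch: substring match with a case statement instead of printf | grep.
# Usage: contains <haystack> <needle>
contains() {
    case "$1" in
        *"$2"*) return 0 ;;   # quoting "$2" makes the needle literal, not a glob
        *) return 1 ;;
    esac
}

# Before: printf '%s' "$status" | grep -q 'Yes'
# After:  contains "$status" 'Yes'
```

Besides avoiding two process spawns per check, the builtin form does not depend on grep being present in the probe container.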
Paramsdef / pcr / configmap fixes:
templates/paramsdef.yaml
Fix templateName: mariadb-galera-config-template (referenced by KB
main reconfigure path) -> mariadb-galera-config so the parameter
template is actually resolvable. Previous typo silently failed.
templates/cmpv.yaml, templates/pcr.yaml, templates/configmap.yaml,
templates/configmap-scripts-replication.yaml
Add the bundled but previously-untracked rendering for replication
scripts and ParameterTemplate inventory so a clean install picks up
the same files we have been deploying via helm upgrade.
config/mariadb-semisync.tpl, config/mariadb-standalone.tpl,
config/mariadb-config-effect-scope.yaml
Add the corresponding config templates and effect-scope mapping that
the chart references; these were present in the working tree but
never committed.
templates/cmpd.yaml
securityContext: add the standalone TLS bootstrap copy step using
/var/lib/mysql/tls staging.
templates/_helpers.tpl
Add mariadb.config.reconfigureAction helper used by the three CmpDs.
Add mariadb.galera.spec.systemAccounts and related galera-specific
helpers split out from the shared block.
values.yaml:
roleProbe.timeoutSeconds: 1 -> 2
Probe latency consistently spiked above the 1s timeout under SHOW
SLAVE STATUS contention; 2s gives margin while staying below
periodSeconds=3.
roleProbe.periodSeconds: 5 -> 3
Tighten the probe cadence so role label propagation latency is closer
to the expected SLA window for switchover/failover tests.
image.syncer.tag
Pin to mariadb-local-20260504-123616-amd64. This is the mariadb-aware
syncer build that contains the addon-targeted ha.go changes; the
upstream tag did not include them yet so async/semisync DCS election
paths regressed.
defaultServiceVersion / image.tag pins synced to the matching image we
side-loaded into the IDC vcluster.
Chart.yaml:
version: 1.1.0-alpha.1 -> 1.1.1-alpha.16
appVersion: 11.4.5 -> 11.4.10
Validation summary (alpha.16 fresh matrix, IDC vcluster, 2026-05-04
22:04:03):
standalone n01/n02 Running 1/1 pod ready
semisync n01/n02 Running 2/2 pod ready, IO/SQL=Yes lag=0
async n01/n02 Running 3/3 pod ready, IO/SQL=Yes lag=0
(n01 did not reproduce the task #420
split-brain; that defect is tracked
separately and does not block this
commit)
galera n01/n02 Running 3/3 pod ready, wsrep_sst_method=rsync,
cluster_size=3, all Synced, no EPERM /
move-back / wsrep_sst_mariabackup in
logs, mariadb container CapEff=0
Evidence: artifacts/mariadb-alpha16-fresh-matrix-20260504-215633.tar.gz
sha256 ae8bb7f7e980b02b426603fb95056c6f3485a83d6b0b82440639b8b8b11c160d
Out of scope (tracked separately):
async n01 split-brain / failover safety under partial DCS+peer-DB
reachability loss (task #420 evidence pack, sha256
b2a428964205942e6d88f3bffbd85967d5c253f5537d02380bc2469a2fe2a510)
remains an open P0 owned outside this commit; the fresh n01 sample in
this matrix did not reproduce the failure mode and therefore does not
gate this Galera/rsync + role-probe / bootstrap fix.
CAP_CHOWN/CAP_FOWNER as a path to keep mariabackup as the SST method
was tried and deliberately not carried; effectiveness blocked by the
kernel/k8s cap model described above.
Validation update: alpha.16 fresh matrix 8/8 PASS
Pushed commit
Fresh validation (IDC vcluster, 2026-05-04 22:04:03)
8 fresh namespaces, alpha.16 chart, no helm upgrade carry-over:
Galera evidence: container
Evidence pack:
What changed in this commit
Out of scope (tracked separately)
Summary
Extend the MariaDB addon to support MariaDB 11.4 with three replication topologies.

New Topologies (via ClusterDefinition: mariadb)
- standalone
- async
- semisync
- galera

New ComponentDefinitions
- mariadb-replication-1.1.0-alpha.1: syncer sidecar HA, --plugin-load-add=semisync_master/slave, GTID strict mode
- mariadb-galera-1.1.0-alpha.1: wsrep lifecycle (bootstrap/join/leave/roleprobe), ports 3306/4444/4567/4568

New Versions

Files Added
- addons/mariadb/templates/cmpd-replication.yaml: replication ComponentDefinition
- addons/mariadb/templates/cmpd-galera.yaml: Galera ComponentDefinition
- addons/mariadb/templates/clusterdefinition.yaml: 4 topologies
- addons/mariadb/templates/configmap.yaml: config templates for replication + Galera
- addons/mariadb/templates/configmap-scripts-galera.yaml: Galera lifecycle scripts
- addons/mariadb/config/mariadb-replication.tpl: binlog/GTID/semisync my.cnf
- addons/mariadb/config/mariadb-galera.tpl: wsrep/Galera my.cnf
- addons/mariadb/scripts/galera-{start,roleprobe,member-join,member-leave}.sh

Usage

Notes
- mariadb:11.4+ (wsrep built-in); 10.6.15 is excluded from Galera compatibility
- apecloud/syncer:0.6.8 (same as MySQL addon)
- 1.1.0-alpha.1 for both addons and addons-cluster charts

Test plan
- helm template passes for all 4 topologies (verified locally)
- wsrep_cluster_size=3 after 3-node deploy

🤖 Generated with Claude Code