feat(mariadb): add 11.4 replication and semisync hardening #2633

Open
weicao wants to merge 23 commits into main from feat/mariadb-alpha37-semisync-fencing-pr

@weicao weicao commented May 9, 2026

Summary

  • Add MariaDB 11.4 standalone, replication, semisync, and Galera chart resources.
  • Version the replication script ConfigMap and wire replication and semisync ComponentDefinitions to the same versioned script name.
  • Harden semisync startup, role publication, switchover fencing, preStop handling, rebuilt-old-primary rejoin fencing, and fresh bootstrap role publication.
  • Gate semisync primary role publication on both user-facing root and internal-root local writes while read_only=0.
  • Keep remote root table writes fenced during secondary states while preserving the privileges needed for follow and status repair, including REPLICATION MASTER ADMIN.
  • During switchover, require candidate root writes, candidate syncer role, old-primary secondary syncer role, and Primary Service routing before action success.
  • Repair kubeblocks.kb_health_check duplicate-key follow failures through a local maintenance path that temporarily opens read_only and restores fail-closed state afterward.
  • Add shell specs for replication member join, role probe, switchover, semisync rejoin fencing, and standalone template mapping.
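The switchover requirement in the list above (candidate root writes, candidate syncer role, old-primary secondary syncer role, and Primary Service routing) can be read as one all-or-nothing gate. The following is a minimal sketch of that idea, not the code in replication-switchover.sh. The `check_*` function names, the timeout argument, and the polling interval are all hypothetical, and the placeholder checks fail closed until they are replaced with real probes.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: the action reports success only when all four
# conditions hold at the same time. The check_* functions are placeholders.

check_candidate_root_write()    { return 1; }  # placeholder: write as root on the candidate
check_candidate_syncer_role()   { return 1; }  # placeholder: candidate syncer reports primary
check_old_primary_syncer_role() { return 1; }  # placeholder: old primary syncer reports secondary
check_primary_service_route()   { return 1; }  # placeholder: Primary Service routes to the candidate

# switchover_gate [timeout_seconds] [poll_interval_seconds]
switchover_gate() {
  local deadline=$(( SECONDS + ${1:-60} ))
  local check failed
  while :; do
    failed=""
    for check in check_candidate_root_write check_candidate_syncer_role \
                 check_old_primary_syncer_role check_primary_service_route; do
      if ! "$check"; then
        failed=$check
        break
      fi
    done
    if [ -z "$failed" ]; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      echo "switchover gate timed out; last failing check: $failed" >&2
      return 1
    fi
    sleep "${2:-1}"
  done
}
```

Re-evaluating every check on each pass, rather than remembering earlier successes, means a condition that flaps during the transition cannot contribute to a stale success.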

Local validation

  • git diff --check
  • bash -n addons/mariadb/scripts/replication-switchover.sh
  • helm lint addons/mariadb
  • helm lint /tmp/mariadb-1.1.1-alpha.58.tgz
  • helm template mariadb addons/mariadb
  • helm template mariadb /tmp/mariadb-1.1.1-alpha.58.tgz
  • shellspec -I shellspec --helperdir shellspec addons/mariadb/scripts-ut-spec passed: 135 examples, 0 failures
  • PR body and branch commits were checked for attribution text.

Current retest package

  • Latest commit: 1b4979a9.
  • Chart: 1.1.1-alpha.58.
  • Patch diff sha256: 796fca1e5b63f592d0b5ee26bb1cbff92a958a769f6594772411e3f79782d91f.
  • Package sha256: 59605d81654983a005040d1a1ebd0086ccb1dfc693f0096bedd914393ad7bc8a.
  • Rendered manifest sha256: adde0882e6511e4e6d93e3b5095e676e9a59babe8079525b687f14ca0af231f0.
  • Rendered roleProbe script sha256: 281166e54676d251ace1746945df724a92371c8e3acf722791b292a8b7d53d4a.
  • Rendered switchover script sha256: 3f8eebe9f3497787b8ff8c80e9c6e511d5c34e42f417ebd2303b9e7f99c5d25c.
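A reviewer who wants to confirm that a local artifact matches one of the digests above could use a small helper like the one below. This is a sketch that assumes GNU coreutils `sha256sum` is available; `verify_sha256` is a hypothetical helper name, not part of the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
# Returns 0 on match, 1 on mismatch or if the file cannot be read.
verify_sha256() {
  local file=$1 expected=$2 actual
  [ -r "$file" ] || return 1
  actual=$(sha256sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}
```

For example, assuming the package was built to the path used in the local validation commands:

```bash
verify_sha256 /tmp/mariadb-1.1.1-alpha.58.tgz \
  59605d81654983a005040d1a1ebd0086ccb1dfc693f0096bedd914393ad7bc8a \
  && echo "package digest matches"
```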

Latest focused evidence

  • With chart 1.1.1-alpha.57, the live gate run and the bootstrap/role publication run each passed, but only within their own single scope. Bootstrap eventually published one primary role and one Primary Service endpoint, and all test resources were then cleaned up.
  • With chart 1.1.1-alpha.57, the switchover and role-transition under load run (1 sample) failed. Only that scope failed, and the namespace was preserved for inspection. The OpsRequest started and then failed with an action timeout. The writer recorded 64 successful writes, 35 failures, 12 no-primary samples, and no confirmed double-writable samples.
  • The preserved scene did not support a broad data-loss conclusion. It did show an inconsistent failed state: the Primary Service and the Kubernetes role label pointed at pod0, while the leader record and the SQL write side had moved to pod1. On the old pod, the replication SQL thread had also stopped on a duplicate-key error (1062) in kubeblocks.kb_health_check, and candidate root writes returned a permission error during the fault window.
  • The follow-up fix in chart 1.1.1-alpha.58 gates action success on candidate root writes, syncer role agreement on both pods, old-primary follow repair, and Primary Service routing.
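The inconsistent state described above comes down to four views of "who is primary" that disagreed. A minimal sketch of that comparison follows; `role_views_agree` is a hypothetical helper, and how each view is actually read (Service endpoints, pod label, leader record, SQL write probe) is left out.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every argument names the same pod.
# Intended argument order (an assumption for illustration):
#   service_endpoint role_label leader_record sql_writer
role_views_agree() {
  local first=$1 view
  for view in "$@"; do
    [ "$view" = "$first" ] || return 1
  done
}
```

Applied to the preserved scene, the call would be `role_views_agree pod0 pod0 pod1 pod1`, which returns non-zero because the Service and label still name pod0 while the leader record and write side name pod1.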

Boundary

  • These results cover the listed scopes only.
  • This is not a release pass claim, not a long-running soak claim, and not a full operation matrix claim.

weicao and others added 2 commits May 9, 2026 15:05
Add MariaDB 11.4 standalone, replication, semisync, and Galera chart resources.

Harden semisync startup, role publication, switchover fencing, and script distribution.

Add shell specs for replication member join, role probe, switchover, and standalone template mapping.
@weicao weicao requested review from a team and leon-ape as code owners May 9, 2026 07:06

codecov-commenter commented May 9, 2026

Codecov Report

❌ Patch coverage is 0% with 1546 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (69b3b6d) to head (1b4979a).

Files with missing lines                                  Patch %   Lines
...adb/scripts-ut-spec/replication_switchover_spec.sh     0.00%     460 Missing ⚠️
...db/scripts-ut-spec/replication_member_join_spec.sh     0.00%     432 Missing ⚠️
...iadb/scripts-ut-spec/replication_roleprobe_spec.sh     0.00%     336 Missing ⚠️
...pts-ut-spec/semisync_rejoin_fence_template_spec.sh     0.00%     281 Missing ⚠️
...cripts-ut-spec/standalone_template_mapping_spec.sh     0.00%     37 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2633     +/-   ##
=======================================
  Coverage   0.00%   0.00%             
=======================================
  Files         73      78      +5     
  Lines       9197   10743   +1546     
=======================================
- Misses      9197   10743   +1546     

weicao added 20 commits May 9, 2026 17:06
Keep the KubeBlocks health-check table schema on fresh replicas and clear only local rows before starting or repairing SQL replication. This prevents the replica repair path from changing a duplicate-key error into a missing-table replication error.
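One way to picture this commit together with the maintenance path from the summary: delete the rows but never drop the table, and always restore `read_only` afterward. The sketch below is an illustration under assumptions, not the PR's script. `run_sql` and `repair_health_check_rows` are hypothetical names, the default `run_sql` assumes a local `mariadb` client with suitable credentials, and disabling `sql_log_bin` for the session is an assumed way of keeping the cleanup local to the replica.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a fail-closed local repair.

# Assumed transport to the local server; replace as needed.
run_sql() { mariadb --batch -e "$1"; }

repair_health_check_rows() {
  local rc=0
  # Open the local write window. If this fails, read_only is unchanged.
  run_sql "SET GLOBAL read_only = 0" || return 1
  # Keep the table schema; remove only the rows. sql_log_bin = 0 keeps the
  # DELETE out of this server's binary log (assumption, see lead-in).
  run_sql "SET SESSION sql_log_bin = 0; DELETE FROM kubeblocks.kb_health_check" || rc=1
  # Fail closed: restore read_only whether or not the DELETE succeeded.
  run_sql "SET GLOBAL read_only = 1" || rc=1
  return "$rc"
}
```

Using DELETE rather than DROP is the point of the commit message: the replication stream still finds the table it expects, so a duplicate-key stop is not traded for a missing-table stop.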
Require internal local admin read-only privileges before role decisions.

Track primary read/write readiness after local root unlock and read_only repair.

Repair syncer primary reconciliation when the listener is already exposed but local write readiness is missing.
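The readiness rule in these commit messages, combined with the summary's requirement that both root accounts can write locally while read_only=0, can be sketched as a single predicate. This is an illustration only: `query_read_only`, `probe_local_write`, and `can_publish_primary` are hypothetical names, and the placeholder probes fail closed until replaced with real queries.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: a pod may publish the primary role only when the
# server is writable and both root accounts can write locally.

# Placeholder: print the value of @@global.read_only (0 or 1). A real version
# might run: mariadb -N -B -e 'SELECT @@global.read_only'
query_read_only() { return 1; }

# Placeholder: attempt a local write as the named account ($1).
probe_local_write() { return 1; }

can_publish_primary() {
  local read_only
  read_only=$(query_read_only) || return 1   # cannot tell: do not publish
  [ "$read_only" = "0" ] || return 1
  probe_local_write root || return 1          # user-facing root
  probe_local_write internal-root || return 1 # internal root
}
```

Treating an unanswered `read_only` query the same as `read_only=1` keeps the decision conservative when the local server is not yet reachable.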

2 participants