
feat(valkey): add Valkey addon with sentinel-based replication #2592

Open
weicao wants to merge 31 commits into main from feat/valkey-addon
@weicao weicao commented Apr 20, 2026

Closes #2591

Summary

  • Add a Valkey addon supporting a sentinel-based replication topology
  • Support Valkey 8.x and 9.x with separate ComponentDefinitions
  • Cover TLS, ACL, backup/restore, reconfiguring, and switchover (auto + targeted)

Script Highlights

  • valkey-start.sh: bootstrap-primary election via sentinel query + lexicographic fallback; cascading-chain detection with follow_slave_to_master()
  • valkey-sentinel-start.sh: _background_monitor_discovery for simultaneous pod restart recovery; TLS-aware sentinel replication
  • switchover.sh: strict candidate verification — OpsRequest fails if wrong node elected
  • valkey-member-leave.sh: config-epoch–based sentinel selection; skip RESET on fast-path
  • check-role.sh: sync-stall detection with SIGTERM recovery

Test Results (EKS ap-southeast-1, KubeBlocks 1.0.2)

All 20 documented operations pass. Full suite: 620 PASS, 0 FAIL across smoke, regression, chaos, TLS, upgrade, sentinel, ACL, custom-secret, and rebuild suites.

Adds a new Valkey addon (Chart version 0.1.0) supporting a sentinel-based
replication topology (3 data pods + 3 sentinel pods) for Valkey 8.x and 9.x.

Key design points:

valkey-start.sh: quorum-based sentinel query
- query_sentinel_quorum_for_master requires strict majority (>= N/2+1) of
  sentinels to agree on the same master before trusting the result
- prevents split-brain during FAILOVER convergence windows where different
  sentinels transiently hold different epoch/master values
- verifies quorum-elected pod actually reports role=master (sentinel can lag
  between election and data pod confirming demotion)
- falls back to pod scan, then lexicographic heuristic; follows slave's
  master_host chain if heuristic pod is itself a slave (prevents cascading
  sub-slave topology that sentinel does not auto-correct)
- always writes masterauth before the replicaof check so that if sentinel
  later demotes this pod it can authenticate to the new master
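
The strict-majority rule above can be sketched as follows. This is a minimal illustration, not the code in valkey-start.sh: the helper name `quorum_master`, its stdin format (one reported master host per line), and the pod names are assumptions made for the example.

```bash
# Hypothetical helper (not from the addon): reads one "master host" vote per
# line on stdin, one line per sentinel that answered, and prints the winner
# only if at least N/2+1 of the N sentinels (N passed as $1) agree on it.
quorum_master() {
  local total="$1"
  local need=$(( total / 2 + 1 ))
  # Drop empty votes, count identical answers, most frequent first.
  awk 'NF' | sort | uniq -c | sort -rn | awk -v need="$need" '
    NR == 1 && $1 >= need { print $2; found = 1 }
    END { exit found ? 0 : 1 }'
}

# Usage: two of three sentinels agree, so the result is trusted.
printf 'valkey-0\nvalkey-0\nvalkey-1\n' | quorum_master 3
```

A non-zero exit status means no strict majority exists, which is the point where the caller would move on to the pod scan and then the lexicographic heuristic.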

valkey-member-leave.sh: safe FAILOVER + deferred RESET
- selects the highest config-epoch sentinel (avoids isolated/stale sentinels
  that fell behind after repeated failovers and have num-slaves: 0)
- checks sentinel's current master opinion before issuing FAILOVER; skips
  both FAILOVER and RESET when sentinel already elected a new master
- skips SENTINEL RESET for slave departures and when FAILOVER did not
  complete: premature RESET zeros num-slaves temporarily, causing pods
  restarting in that window to fall through to the heuristic path and
  potentially elect a second standalone master
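
A sketch of the "highest config-epoch sentinel" selection, assuming the epochs have already been collected into `<sentinel-host> <config-epoch>` lines. The helper name and the input format are hypothetical, and how the script actually gathers the epochs is not shown here.

```bash
# Hypothetical helper (not from the addon): picks the sentinel with the
# highest config-epoch from "<sentinel-host> <config-epoch>" lines on stdin.
# Malformed lines are ignored; on a tie the first sentinel seen wins.
pick_highest_epoch_sentinel() {
  awk '
    NF == 2 && $2 ~ /^[0-9]+$/ && (best == "" || $2 + 0 > max) {
      max = $2 + 0
      best = $1
    }
    END { if (best != "") print best; else exit 1 }'
}

# Usage: the sentinel that fell behind (epoch 3) is not chosen.
printf 'sentinel-0 3\nsentinel-1 7\nsentinel-2 5\n' | pick_highest_epoch_sentinel
```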

switchover.sh: strict candidate confirmation + TLS fixes
- _do_set_replica_priority captures stdout only (stderr to /dev/null) and
  strips \r; wrapped in call_func_with_retry 3 3
- execute_sentinel_failover strips \r from output; checks exit code
- wait_for_new_master || return 1 for targeted switchover (no silent OK)
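
The stdout-only capture, `\r` stripping, and retry wrapper can be illustrated roughly as below. `retry` is a stand-in for `call_func_with_retry`; its argument order (max attempts, then sleep seconds, then the command) is an assumption, as is the `capture_clean` helper name.

```bash
# Stand-in for the addon's call_func_with_retry. The argument order
# (attempts, delay in seconds, command...) is assumed, not confirmed.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Do not sleep after the final failed attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}

# Hypothetical helper: run a command, discard stderr, propagate a non-zero
# exit code, and strip carriage returns from the captured stdout.
capture_clean() {
  local out
  out="$("$@" 2>/dev/null)" || return 1
  printf '%s\n' "$out" | tr -d '\r'
}

# Usage: retry 3 3 capture_clean some_cli_command arg1 arg2
```

Stripping `\r` matters because a reply such as `OK\r` would otherwise fail a plain string comparison against `OK`.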

valkey-sentinel-start.sh: background master discovery + TLS
- _background_monitor_discovery loop runs in background after sentinel starts;
  polls indefinitely until master found, then calls SENTINEL MONITOR
- tls-replication yes in append_dynamic_conf TLS block (sentinel cannot
  connect to data pods with port 0 / TLS-only without this)
- TLS args added to _sentinel_cli and _find_master_fqdn
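
The background discovery loop might look roughly like this. It is a sketch only: the function signature is hypothetical, and the probe and register steps are passed in as command names so the example stays self-contained. In the real script the register step would issue SENTINEL MONITOR.

```bash
# Hypothetical sketch of a discovery loop (not the addon's implementation).
#   $1 = poll interval in seconds
#   $2 = command that prints the master address, or prints nothing / fails
#   $3 = command called once with the discovered master
background_monitor_discovery() {
  local interval="$1" find_cmd="$2" register_cmd="$3" master
  while :; do
    master="$("$find_cmd")" || master=""
    if [ -n "$master" ]; then
      "$register_cmd" "$master"
      return $?
    fi
    # No master yet: keep polling indefinitely.
    sleep "$interval"
  done
}

# Usage (run in the background once sentinel is up):
#   background_monitor_discovery 2 my_find_master my_register_master &
```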

Backup/restore: physical backup via kopia; post-restore script
re-registers the restored pod with sentinel using a retry loop.

All scripts updated for ShellSpec unit-test framework compatibility
(ut_mode block, load_common_library(), ${__SOURCED__} guard).

Closes #2589
@weicao weicao requested review from a team as code owners April 20, 2026 05:46
weicao added 19 commits April 21, 2026 04:21
Three issues fixed in switchover_with_sentinel():

1. Idempotency on double-call: if candidate is already master (KB
   reconcile retry or prior auto-failover), return success immediately
   instead of erroring on role != slave.

2. All-Sentinel priority confirmation: add wait_sentinel_sees_priority()
   that polls ALL Sentinels until they reflect the updated replica-priority
   before issuing SENTINEL FAILOVER. Avoids stale-cache race where the
   Sentinel receiving FAILOVER hasn't refreshed its replica info yet and
   promotes the wrong pod.

3. Deferred priority restore: restore replica-priority=100 AFTER
   wait_for_new_master() completes, not before. Restoring too early
   equalises candidate and non-candidate priorities during Sentinel's
   async slave-selection window (~1s after FAILOVER OK), allowing
   Sentinel to pick the wrong replica by run_id/offset.
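
The ordering these three fixes establish can be sketched as control flow. Every step below is a logging placeholder, and the names `set_candidate_priority`, `restore_priorities`, and `candidate_is_master` are invented for the illustration; only the ordering and the early-return behaviour are the point.

```bash
# Placeholder steps: each only logs its name. Real versions would talk to
# Sentinel and the data pods; these names are partly hypothetical.
set_candidate_priority()      { echo "bias-priority"; }
wait_sentinel_sees_priority() { echo "wait-priority"; }
execute_sentinel_failover()   { echo "failover"; }
wait_for_new_master()         { echo "wait-master"; }
restore_priorities()          { echo "restore-priority"; }
candidate_is_master()         { return 1; }

switchover_sketch() {
  local rc=0
  # 1. Idempotency: nothing to do if the candidate is already master.
  candidate_is_master && return 0
  set_candidate_priority || return 1
  # 2. Wait until all Sentinels reflect the new priority before FAILOVER.
  wait_sentinel_sees_priority || { restore_priorities; return 1; }
  if ! execute_sentinel_failover; then
    restore_priorities
    return 1
  fi
  # 3. Restore priorities only after the new master is confirmed (or the
  #    wait has failed), never before.
  wait_for_new_master || rc=1
  restore_priorities
  return "$rc"
}

switchover_sketch
```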

Also set switchover.timeoutSeconds=-1 in cmpd.yaml so the kbagent
exec layer does not cut short the 30s Sentinel cache poll and the
300s master-confirmation wait.

ShellSpec unit tests updated: 21 examples, 0 failures.
Integration verified: 2× full suite (smoke+chaos+regression+tls+
upgrade+sentinel+acl+custom-secret+rebuild), PASS 660/FAIL 4/SKIP 7,
results bit-for-bit identical across both runs.

Add 7 new ShellSpec cases to valkey_switchover_spec.sh:

wait_for_new_master():
- success: target becomes master before timeout
- timeout: max_wait=0 returns failure with WARNING
- exclude_fqdn: skips old master, accepts new master on different node
- empty expected_fqdn: accepts any new master
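
For readers unfamiliar with the function under test, the four behaviours listed above correspond to a polling loop of roughly this shape. The signature, argument order, and the probe callback are assumptions for illustration; the real wait_for_new_master() may differ.

```bash
# Hypothetical sketch (signature assumed, not taken from switchover.sh).
#   $1 = expected master FQDN ("" accepts any new master)
#   $2 = FQDN to exclude (the old master)
#   $3 = max wait in seconds
#   $4 = command that prints the FQDN currently reported as master
wait_for_new_master_sketch() {
  local expected="$1" exclude="$2" max_wait="$3" probe="$4"
  local waited=0 current
  while [ "$waited" -lt "$max_wait" ]; do
    current="$("$probe")" || current=""
    if [ -n "$current" ] && [ "$current" != "$exclude" ]; then
      if [ -z "$expected" ] || [ "$current" = "$expected" ]; then
        echo "$current"
        return 0
      fi
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  echo "WARNING: no new master within ${max_wait}s" >&2
  return 1
}
```

With `max_wait=0` the loop body never runs, so the function goes straight to the WARNING and returns failure, which matches the timeout case.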

switchover_with_sentinel() success path:
- ordering test: verify restore happens after wfnm (symmetric with existing
  failure-path ordering test)

switchover_with_sentinel() cleanup:
- execute_sentinel_failover failure: priorities restored, wfnm not called

switchover_with_sentinel() no-candidate:
- candidate_fqdn="" path: no priority bias set, delegates directly to Sentinel

Total: 28 examples, 0 failures.

49th example: verifies that when valkey-cli exits non-zero the function
returns failure and emits both stdout "Promoting" and stderr "ERROR: ...failed on".

50th example: verifies that when valkey-cli exits non-zero during REPLICAOF
the function returns failure and emits "ERROR: REPLICAOF command failed on".

51st example: verifies that when execute_sentinel_failover fails with no
candidate specified, set_replica_priority is never called (no priorities
to restore) and the function returns failure.

…lTrigger

- Add dynamicParameters list (maxmemory-policy, appendonly, loglevel, etc.)
  and staticParameters (databases, tcp-backlog) to ParametersDefinition
- Add reloadAction.shellTrigger that sends inline Redis protocol (AUTH +
  CONFIG SET) via nc to apply dynamic param changes without pod restart;
  uses argv ($0=key, $1=value) as KB passes params, not stdin
- Fix externalManaged comment in cmpd.yaml: clarifies that this flag is
  REQUIRED for KB to create ComponentParameters and invoke shellTrigger
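
A rough sketch of the inline-protocol payload such a shellTrigger could send. The helper name, password handling, host, and port are assumptions; only the payload construction is exercised here. Inline commands are space-separated, so a value containing spaces would need different handling.

```bash
# Hypothetical helper (not from the addon): builds an inline-protocol
# payload that authenticates and then applies one dynamic parameter.
build_config_set_payload() {
  local password="$1" key="$2" value="$3"
  printf 'AUTH %s\r\nCONFIG SET %s %s\r\n' "$password" "$key" "$value"
}

# Why $0/$1: with `sh -c SCRIPT a b`, SCRIPT sees a as $0 and b as $1.
sh -c 'echo "key=$0 value=$1"' maxmemory-policy allkeys-lru

# Sending it (host, port, and the -w timeout flag are assumptions):
#   build_config_set_payload "$PASSWORD" "$0" "$1" | nc -w 2 127.0.0.1 6379
```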
@weicao weicao force-pushed the feat/valkey-addon branch from 8e7a8dc to c2182be Compare April 21, 2026 05:10
@weicao weicao added the nopick Not auto cherry-pick when PR merged label Apr 29, 2026
codecov-commenter commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 0% with 1549 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (091d07c) to head (dc11d6b).
⚠️ Report is 10 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...s/valkey/scripts-ut-spec/valkey_switchover_spec.sh | 0.00% | 743 Missing ⚠️ |
| ...ns/valkey/scripts-ut-spec/valkey_self_heal_spec.sh | 0.00% | 233 Missing ⚠️ |
| addons/valkey/scripts-ut-spec/valkey_start_spec.sh | 0.00% | 213 Missing ⚠️ |
| ...valkey/scripts-ut-spec/valkey_member_leave_spec.sh | 0.00% | 127 Missing ⚠️ |
| addons/valkey/scripts-ut-spec/check_role_spec.sh | 0.00% | 121 Missing ⚠️ |
| ...ons/valkey/scripts-ut-spec/valkey_sync_acl_spec.sh | 0.00% | 96 Missing ⚠️ |
| addons/valkey/scripts-ut-spec/utils.sh | 0.00% | 16 Missing ⚠️ |

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2592     +/-   ##
=======================================
  Coverage   0.00%   0.00%             
=======================================
  Files         73      80      +7     
  Lines       9197   10746   +1549     
=======================================
- Misses      9197   10746   +1549     
