feat(valkey): add Valkey addon with sentinel-based replication #2592
Open
Adds a new Valkey addon (Chart version 0.1.0) supporting a sentinel-based
replication topology (3 data pods + 3 sentinel pods) for Valkey 8.x and 9.x.
Key design points:
valkey-start.sh: quorum-based sentinel query
- query_sentinel_quorum_for_master requires a strict majority (at least
floor(N/2)+1 of N sentinels) to agree on the same master before trusting
the result
- prevents split-brain during FAILOVER convergence windows where different
sentinels transiently hold different epoch/master values
- verifies quorum-elected pod actually reports role=master (sentinel can lag
between election and data pod confirming demotion)
- falls back to pod scan, then lexicographic heuristic; follows slave's
master_host chain if heuristic pod is itself a slave (prevents cascading
sub-slave topology that sentinel does not auto-correct)
- always writes masterauth before the replicaof check so that if sentinel
later demotes this pod it can authenticate to the new master
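
A minimal sketch of the strict-majority rule described above, assuming the sentinel replies have already been collected as plain master names. `quorum_master` and its argument layout are hypothetical; the PR's `query_sentinel_quorum_for_master` may gather and compare replies differently, and the follow-up `role=master` check is not shown.

```shell
# Sketch only: quorum_master is a hypothetical name, not the PR's function.
# Usage: quorum_master TOTAL_SENTINELS REPLY...
# Prints the master named by a strict majority of sentinels; returns 1 otherwise.
quorum_master() {
  local total=$1 needed top count name
  shift
  needed=$(( total / 2 + 1 ))        # strict majority, e.g. 2 of 3
  # Drop empty replies (unreachable sentinels), then tally identical answers.
  top=$(printf '%s\n' "$@" | sed '/^$/d' | sort | uniq -c | sort -rn | head -n 1)
  [ -n "$top" ] || return 1
  count=$(printf '%s\n' "$top" | awk '{print $1}')
  name=$(printf '%s\n' "$top" | awk '{print $2}')
  [ "$count" -ge "$needed" ] || return 1
  printf '%s\n' "$name"
}
```

With three sentinels, two matching replies are enough; a 1/1 split or a single reply is rejected, which is the case where the script would fall back to the pod scan.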
valkey-member-leave.sh: safe FAILOVER + deferred RESET
- selects the highest config-epoch sentinel (avoids isolated/stale sentinels
that fell behind after repeated failovers and have num-slaves: 0)
- checks sentinel's current master opinion before issuing FAILOVER; skips
both FAILOVER and RESET when sentinel already elected a new master
- skips SENTINEL RESET for slave departures and when FAILOVER did not
complete: premature RESET zeros num-slaves temporarily, causing pods
restarting in that window to fall through to the heuristic path and
potentially elect a second standalone master
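
The sentinel selection above can be sketched as follows, assuming each sentinel's `config-epoch` has already been read (for example from `SENTINEL MASTER <name>`) and written as one `host epoch` pair per line. `pick_highest_epoch_sentinel` is a hypothetical name and the input format is an assumption, not the PR's actual interface.

```shell
# Sketch only: reads "sentinel_host config_epoch" lines on stdin and prints
# the host with the highest numeric epoch. Lines without a numeric epoch are
# ignored; on ties the first line seen wins. Returns 1 if nothing qualifies.
pick_highest_epoch_sentinel() {
  awk '
    $2 ~ /^[0-9]+$/ && (host == "" || $2 + 0 > best) { best = $2 + 0; host = $1 }
    END { if (host == "") exit 1; print host }
  '
}
```

A sentinel that fell behind after repeated failovers reports a lower epoch and is therefore never chosen as the one to receive FAILOVER.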
switchover.sh: strict candidate confirmation + TLS fixes
- _do_set_replica_priority captures stdout only (stderr to /dev/null) and
strips \r; wrapped in call_func_with_retry 3 3
- execute_sentinel_failover strips \r from output; checks exit code
- wait_for_new_master || return 1 for targeted switchover (no silent OK)
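
As an illustration of the output handling above, here is a minimal sketch of a carriage-return filter and a retry wrapper. `strip_cr` and `retry` are hypothetical stand-ins; the signature of the PR's `call_func_with_retry` is not shown in this conversation, so the `attempts delay` argument order is an assumption.

```shell
# Sketch only: remove carriage returns so replies compare cleanly with "OK".
strip_cr() {
  tr -d '\r'
}

# Sketch only: run a command up to ATTEMPTS times, sleeping DELAY seconds
# between failed attempts. Returns 0 on the first success, 1 otherwise.
# Usage: retry ATTEMPTS DELAY command [args...]
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while :; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$(( i + 1 ))
    sleep "$delay"
  done
}
```

A call such as `retry 3 3 some_function` would mirror the `3 3` arguments mentioned above, under the assumption that they mean three attempts with a three-second pause.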
valkey-sentinel-start.sh: background master discovery + TLS
- _background_monitor_discovery loop runs in background after sentinel starts;
polls indefinitely until master found, then calls SENTINEL MONITOR
- tls-replication yes in append_dynamic_conf TLS block (sentinel cannot
connect to data pods with port 0 / TLS-only without this)
- TLS args added to _sentinel_cli and _find_master_fqdn
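
The background discovery loop can be sketched roughly as below. `discover_then_monitor` is a hypothetical name, and the probe and register callbacks are placeholders for whatever `_find_master_fqdn` and the `SENTINEL MONITOR` call do in the real script.

```shell
# Sketch only: poll PROBE until it succeeds and prints a non-empty master
# address, then hand that address to REGISTER. Polls indefinitely by design.
# Usage: discover_then_monitor PROBE_FN REGISTER_FN INTERVAL_SECONDS
discover_then_monitor() {
  local probe=$1 register=$2 interval=$3 master
  while :; do
    master=$("$probe") && [ -n "$master" ] && break
    sleep "$interval"
  done
  "$register" "$master"
}
```

In a start script this would be launched with a trailing `&` once sentinel is up, so that sentinel startup is not blocked when all pods restart at once and no master is discoverable yet.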
Backup/restore: physical backup via kopia; post-restore script
re-registers the restored pod with sentinel using a retry loop.
All scripts updated for ShellSpec unit-test framework compatibility
(ut_mode block, load_common_library(), ${__SOURCED__} guard).
Closes #2589
Three issues fixed in switchover_with_sentinel():
1. Idempotency on double-call: if candidate is already master (KB reconcile retry or prior auto-failover), return success immediately instead of erroring on role != slave.
2. All-Sentinel priority confirmation: add wait_sentinel_sees_priority() that polls ALL Sentinels until they reflect the updated replica-priority before issuing SENTINEL FAILOVER. Avoids stale-cache race where the Sentinel receiving FAILOVER hasn't refreshed its replica info yet and promotes the wrong pod.
3. Deferred priority restore: restore replica-priority=100 AFTER wait_for_new_master() completes, not before. Restoring too early equalises candidate and non-candidate priorities during Sentinel's async slave-selection window (~1s after FAILOVER OK), allowing Sentinel to pick the wrong replica by run_id/offset.
Also set switchover.timeoutSeconds=-1 in cmpd.yaml so the kbagent exec layer does not cut short the 30s Sentinel cache poll and the 300s master-confirmation wait.
ShellSpec unit tests updated: 21 examples, 0 failures.
Integration verified: 2× full suite (smoke+chaos+regression+tls+upgrade+sentinel+acl+custom-secret+rebuild), PASS 660/FAIL 4/SKIP 7, results bit-for-bit identical across both runs.
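
The ordering of these three fixes can be sketched as follows. This is an illustration only: `switchover_sketch` and the five helper names are hypothetical stand-ins that just print a trace, and the behaviour when the priority poll itself fails (skip FAILOVER, still restore) is an assumption.

```shell
# Placeholder helpers: each prints a trace line instead of talking to Valkey.
get_role()                    { echo slave; }
bias_priorities()             { echo "bias $1"; }
wait_sentinels_see_priority() { echo sentinels-confirmed; }
trigger_failover()            { echo failover; }
wait_new_master()             { echo "wait-new-master $1"; }
restore_priorities()          { echo restore; }

# Sketch only: idempotent early exit, confirm priorities on all sentinels,
# fail over, wait for the new master, and only then restore priorities.
switchover_sketch() {
  local candidate=$1 rc=1
  # Fix 1: a repeated call against an already-promoted candidate succeeds.
  if [ "$(get_role "$candidate")" = "master" ]; then
    return 0
  fi
  bias_priorities "$candidate" || return 1
  # Fix 2 and the wait both gate on the previous step succeeding.
  if wait_sentinels_see_priority "$candidate" \
     && trigger_failover \
     && wait_new_master "$candidate"; then
    rc=0
  fi
  # Fix 3: restore runs last, on success and on failure alike.
  restore_priorities
  return "$rc"
}
```

Running `switchover_sketch valkey-1` with the placeholder helpers prints the steps in order, ending with `restore`; if `trigger_failover` fails, the wait is skipped but priorities are still restored.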
Add 7 new ShellSpec cases to valkey_switchover_spec.sh:
wait_for_new_master():
- success: target becomes master before timeout
- timeout: max_wait=0 returns failure with WARNING
- exclude_fqdn: skips old master, accepts new master on different node
- empty expected_fqdn: accepts any new master
switchover_with_sentinel() success path:
- ordering test: verify restore happens after wfnm (symmetric with existing failure-path ordering test)
switchover_with_sentinel() cleanup:
- execute_sentinel_failover failure: priorities restored, wfnm not called
switchover_with_sentinel() no-candidate:
- candidate_fqdn="" path: no priority bias set, delegates directly to Sentinel
Total: 28 examples, 0 failures.
…r custom port tests (41 examples)
…y test (44 examples)
…e guard (46 examples)
49th example: verifies that when valkey-cli exits non-zero the function returns failure and emits both stdout "Promoting" and stderr "ERROR: ...failed on".
50th example: verifies that when valkey-cli exits non-zero during REPLICAOF the function returns failure and emits "ERROR: REPLICAOF command failed on".
51st example: verifies that when execute_sentinel_failover fails with no candidate specified, set_replica_priority is never called (no priorities to restore) and the function returns failure.
…lTrigger
- Add dynamicParameters list (maxmemory-policy, appendonly, loglevel, etc.) and staticParameters (databases, tcp-backlog) to ParametersDefinition
- Add reloadAction.shellTrigger that sends inline Redis protocol (AUTH + CONFIG SET) via nc to apply dynamic param changes without pod restart; uses argv ($0=key, $1=value) as KB passes params, not stdin
- Fix externalManaged comment in cmpd.yaml: clarifies that this flag is REQUIRED for KB to create ComponentParameters and invoke shellTrigger
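
A minimal sketch of the inline-protocol payload such a trigger could send, assuming a password, key, and value are available as plain strings. `build_config_set_payload` is a hypothetical name, and the argument order here is chosen for the example rather than copied from the PR's shellTrigger.

```shell
# Sketch only: print inline commands, one per CRLF-terminated line.
# Usage: build_config_set_payload PASSWORD KEY VALUE
# AUTH is omitted when PASSWORD is empty.
build_config_set_payload() {
  local password=$1 key=$2 value=$3
  if [ -n "$password" ]; then
    printf 'AUTH %s\r\n' "$password"
  fi
  printf 'CONFIG SET %s %s\r\n' "$key" "$value"
}
```

The output would then be piped to `nc` against the data pod's address and port, both of which are deployment-specific. Inline commands are split on whitespace, so a value containing spaces would need quoting that this sketch does not attempt.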
8e7a8dc to c2182be
…lf-target races (#2615)
…od_role mocking (#2616)
Codecov Report
❌ Patch coverage is [...]
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2592     +/-   ##
=======================================
  Coverage   0.00%   0.00%
=======================================
  Files         73      80      +7
  Lines       9197   10746   +1549
=======================================
- Misses      9197   10746   +1549
Closes #2591
Summary
Script Highlights
- valkey-start.sh: bootstrap-primary election via sentinel query + lexicographic fallback; cascading-chain detection with follow_slave_to_master()
- valkey-sentinel-start.sh: _background_monitor_discovery for simultaneous pod restart recovery; TLS-aware sentinel replication
- switchover.sh: strict candidate verification; OpsRequest fails if wrong node elected
- valkey-member-leave.sh: config-epoch-based sentinel selection; skip RESET on fast-path
- check-role.sh: sync-stall detection with SIGTERM recovery
Test Results (EKS ap-southeast-1, KubeBlocks 1.0.2)
All 20 documented operations pass. Full suite: 620 PASS, 0 FAIL across smoke, regression, chaos, TLS, upgrade, sentinel, ACL, custom-secret, and rebuild suites.