feat(move-tables): build cooperative cutover orchestration (#8209)#1704
Conversation
…ables (#8209) Adds the cutover orchestration entry point with T0-T6 protocol structure as labeled blocks. Protocol logic lands in the next commit. Replaces the //TODO: cutover here seam with the actual call site. Ref: database-infrastructure#8209
Postpone gate (flag file + socket, no heartbeat-lag check), RENAME on the privileged source connection with no retry, drain-GTID capture on the same connection, 100ms GTID-containment drain poll with timeout, CutOverCompleteFlag set before return, on-success hook. Drain GTID persistence is #8210. Hook env vars are #8211. Ref: database-infrastructure#8209
…er (#8209)
Adds 3 pure-unit + 3 testcontainers-backed tests covering moveTablesCutOver
protocol mechanics:
1. NoopShortCircuits - Noop returns immediately; no hooks, no flag
2. OnBeforeCutOverHookAborts* - T0 hook failure aborts BEFORE T1 RENAME
(proxy: nil inspector.db would panic)
3. PostponeGateFiresOnce - OnBeginPostponed fires exactly once;
IsPostponingCutOver resets to 0
4. TestHappyPath - full T0-T6: hook order, T4 flag set,
source-side rename verified both sides
5. TestRenameFailurePropagates - T1 RENAME error wrapped; flag stays 0;
no OnSuccess
6. TestDrainTimeoutPropagates - T3 timeout wrapped via patched package
vars (t.Cleanup restored); flag stays 0
End-to-end acceptance criteria (DML drain under load, pt-table-checksum
parity) are validated against the multi-cluster test bed separately.
Ref: database-infrastructure#8209
There was a problem hiding this comment.
Pull request overview
This PR implements phase 1.5 of move-tables mode by adding a dedicated moveTablesCutOver() orchestrator that executes the cooperative cutover protocol (T0–T6). It replaces the previous move-tables cutover placeholder behavior, while keeping the standard --alter migration cutover path unchanged.
Changes:
- Adds
moveTablesCutOver()to perform cooperative cutover (postpone gate, T0 hooks, T1 RENAME on source, T2 capture drain GTID, T3 drain poll, T4 flag set, T5 OnSuccess). - Removes the temporary “cutover filename” mechanism and wires
MoveTables()to call the new orchestrator. - Adds unit + integration tests for the move-tables cooperative cutover and strengthens some move-tables-related applier behavior/tests.
Show a summary per file
| File | Description |
|---|---|
| go/logic/migrator.go | Implements moveTablesCutOver(), wires it into MoveTables(), and adjusts logging/error wrapping. |
| go/logic/migrator_test.go | Updates tests to match updated applier method signatures. |
| go/logic/migrator_move_tables_cutover_test.go | Adds new unit/integration tests for cooperative cutover orchestration. |
| go/logic/applier.go | Improves error wrapping, documents move-tables changelog no-op behavior, and checks sqlRows.Err(). |
| go/logic/applier_test.go | Adds move-tables tests/cleanup and updates calls for new applier signatures. |
| go/cmd/gh-ost/main.go | Removes the temporary --target-cutover-filename flag. |
| go/base/context.go | Removes TmpCutoverFilename from move-tables config struct. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 7/7 changed files
- Comments generated: 4
- Add mgtr.checkAbort() at top of T3 drain poll loop so abort/panic stops the drain immediately instead of waiting the full 60s timeout - Assert NoError on CREATE DATABASE IF NOT EXISTS in applier_test.go SetupSuite (was silently discarding the error with _, _ =) - Rewrite T2 comment to frame @@GLOBAL.gtid_executed as a readability divergence (making scope unambiguous in SQL) rather than asserting session-vs-global semantics that may not hold in MySQL 8.0 Ref: database-infrastructure#8209
zacharysierakowski
left a comment
There was a problem hiding this comment.
Great job here! Love how easy the diff was to follow.
I do think eventually (maybe at the very end, after we get a round or two of reviews on the final feature branch) we may need to remove the comments/references to the design doc (like "T3" etc.). But for now I love having them there for reviewing.
I left some comments but it's looking close!
…ents Fixes the cutover rename+gtid selection and drain logic
|
Here's a demo of the cutover/drain working as expected: cutover-demo.mov |
| mgtr.migrationContext.Log.Debugf("T4: CutOverCompleteFlag set") | ||
|
|
||
| // ----- T5: on-success hook ----- | ||
| // Hook unlocks user_rw@target via db-user-management and flips the |
There was a problem hiding this comment.
I think we should revise these comments before merging this feature branch to the main branch.
They contain some info and URLs that are only relevant/accessible to GitHub internal users.
There was a problem hiding this comment.
Yea, I mentioned the same thing!
|
Going to merge this to keep our branches from getting too stacked. Can followup with a separate PR to remove internal references at some point |
Demo for branch:
cutover-demo.mov
Implements phase 1.5 of the move-tables epic by adding a dedicated
moveTablesCutOver()orchestrator that follows the T0-T6 cooperative cutover protocol from coop_cutover.md §1.3. This is a behavior change for move-tables mode (was a//TODO: cutover hereno-op before); standard--altermigrations are unaffected._ghk)GH_OST_DRAIN_GTID/GH_OST_TARGET_*hook env varspt-table-checksumparity) is test-bed scope and lives in gh-ost-tablemove-poc/localtests.What approach did you choose and why?
A new package-local
moveTablesCutOver()was built from scratch alongside the standardcutOver()(not as a branch inside it) because every internal call fromcutOver()—throttle,atomicCutOver,waitForEventsUpToLock, heartbeat-lag — assumes a single MySQL cluster. With move-tables the applier writes target and the streamer reads source, so each of those subsystems either nil-panics (throttler isnilper gh-ost-edge-cases.md MT #6), drains nothing (sentinel goes to target's binlog, not source's, per MT #4), or never fires (heartbeat is skipped in §1.2). Reuse would have required guarded branches in five hot spots; a sibling function keeps the standard path untouched.The drain itself follows the GTID-after-RENAME approach the design doc settled on (over
LOCK TABLES, sentinel detection, or queue-length polling — all rejected in move_table_mode.md §1.5 "What not to do"): T1 issuesRENAME TABLEon the privileged source connection (mgtr.inspector.db), T2 reads@@global.gtid_executedon the same connection to capture the drain GTID as a closed superset of every committed write, T3 pollsapplier.CurrentCoordinatesagainst that GTID usingGTIDBinlogCoordinates.SmallerThan, T4 setsCutOverCompleteFlagsocanStopStreaming()returns true (MT #5), T5 firesOnSuccess, T6 returns.The PR ships in three commits intentionally:
2db53320feat(move-tables): add moveTablesCutOver skeleton and wire into MoveTables (#8209)— Adds the function with all T0-T6 transitions as labeled comment blocks; replaces the//TODO: cutover hereseam ingo/logic/migrator.go. No protocol logic yet, so the diff is reviewable as pure structure.60acf2c2feat(move-tables): implement T0-T6 cooperative cutover protocol (#8209)— Fills in the protocol: postpone gate, T0 hook, T1 RENAME viasqlutils.ExecNoPrepare(no retry — RENAME is not idempotent), T2 GTID capture, T3 drain poll with timeout, T4 flag set, T5 OnSuccess hook.54652ab2test(move-tables): add unit and integration tests for moveTablesCutOver (#8209)— Six tests in one new file; details below.Scope (#8209 only):
moveTablesCutOver()ingo/logic/migrator.go:931-1061— protocol orchestrator.go/logic/migrator.go:33-42:moveTablesCutOverDrainTimeout = 60 * time.SecondandmoveTablesCutOverDrainPollInterval = 100 * time.Millisecond. The 60s timeout is a starting default and is up for review with Daniel/Eric — the inline comment atgo/logic/migrator.go:33-41explains whyCutOverLockTimeoutSecondswas not reused (itsSetCutOverLockTimeoutSecondssetter caps it at 1–10s, too short for a real drain under load). Both constants arevar, notconst, so tests can patch them witht.Cleanuprestore without adding a production seam.go/logic/migrator.go:991-995: the design specifies@@gtid_executed, this PR uses@@global.gtid_executed. The unqualified form resolves to session scope (only this session's GTIDs), but the drain needs the global GTID set (all committed transactions). The comment cites the pinned design URL so any future reviewer sees exactly what changed and why.go/logic/migrator.go:901-903: the//TODO: cutover hereseam inMoveTables()is replaced withmgtr.moveTablesCutOver(). The redundantmgtr.hooksExecutor.OnSuccess(false)placeholder that previously sat betweenfinalCleanup()and the "Done moving tables" log line (base-branchMoveTables()L899-901) is removed in the same commit, becauseOnSuccessnow fires at T5 inside the protocol. Removing it is in-scope for #8209: editingMoveTables()is move-tables-specific work; the standard-modeMigrate()path is not touched.go/logic/migrator_move_tables_cutover_test.go(328 lines, one new file). The test-suite driver isTestMoveTablesCutOverso-run MoveTablesCutOvermatches both the pure-unitTestMoveTablesCutOver_*tests and the suite'sTestMoveTablesCutOver/<method>sub-tests in one invocation. The SetupSuite block duplicatesMigratorTestSuiteto keep #8209 additive; a shared helper is a clean follow-up.Test → criterion mapping:
TestMoveTablesCutOver_NoopShortCircuitsTestMoveTablesCutOver_OnBeforeCutOverHookAbortsBeforeRenamenil inspector.dbwould panic if RENAME were attemptedTestMoveTablesCutOver_PostponeGateFiresOnBeginPostponedOnceIsPostponingCutOverpre and post state both asserted at 0)MoveTablesCutOverSuite.TestHappyPathER_NO_SUCH_TABLE")MoveTablesCutOverSuite.TestRenameFailurePropagatesOnSuccessMoveTablesCutOverSuite.TestDrainTimeoutPropagatest.Cleanup, wrapped err, flag stays 0Explicitly NOT in this PR (belongs to adjacent issues):
_ghk(#8210) — required for crash-safe resume between T2 and T3. The PR notes this at the doc comment header ofmoveTablesCutOver()(go/logic/migrator.go:925-929). Without it, a crash between T2 and T3 forces the operator to restart from row-copy.GH_OST_DRAIN_GTID,GH_OST_TARGET_*. Hooks fire today with the standard env vars only.initiateThrottler()early-returns in move-tables mode (go/logic/migrator.go:1627-1631) somgtr.throttleris nil; this is whymoveTablesCutOver()must not call the standardcutOver()path (gh-ost-edge-cases.md MT still experimenting, yes? go-mysql binlog parser looks good #6).GetTargetDatabaseName()inverted branches (#8207 scope) — the tests in this PR use identical source/target DB names (test), which masks that bug per gh-ost-edge-cases.md MT test #7.test+test_other); cross-cluster GTID drain across separate binlogs cannot be exercised here. Validation against the multi-cluster bed ingithub/gh-ost-tablemove-poc/localtests/is acceptance work tracked outside this PR.Which feature flags are involved in this change?
IsMoveTablesMode()predicate (#8167); this PR adds no new flags.Which environments does this change target?
script/build.Risk assessment
IsMoveTablesMode()is true. Standard--altermigrations bypass it entirely:MoveTables()andmoveTablesCutOver()are package-local and unreachable fromMigrate(). The only line touched inMoveTables()itself is the replacement of//TODO: cutover hereand the removal of the now-redundantOnSuccessplaceholder; nothing inMigrate()was modified. The 60s drain timeout is the one tuning parameter that could need adjustment based on real-world drain durations under load — flagged for Daniel/Eric in the inline comment so future operators can find the decision rationale.How did/will you validate this change?
go test -run MoveTablesCutOver -v ./go/logic/... 2>&1 | tail -40sleepWhileTrue1s tick).mysql:8.0.42).ok github.com/github/gh-ost/go/logic 26.502s.go build ./...clean.gofmt -wclean.go vet ./go/logic/...clean.RENAME TABLE \test`.`testing` TO `test`.`_testing_del`), T2 captured a real drain GTID (ff2fb2f9-6532-11f1-818e-560e4160caa3:1-14`), T3 reported "drain complete; applier caught up to drain GTID", and T4 set the flag.Known pre-existing test failures (not introduced by this PR)
Sharing the same root causes documented for #8206 on the base branch (
move-tables/1.2-skip-ghost-tables):GetTargetDatabaseName()produces"Incorrect database name ''"when source ≠ target — inverted branches atgo/base/context.go:411, owned by #8207.max_binlog_cache_sizedefaults trip some streamer-heavy suite tests.This PR's own tests (
-run MoveTablesCutOver) sidestep all three by using identical source/target DB names, not exercising the streamer at scale, and gracefully skipping under-shortwhen testcontainers is unavailable.Are there related full stack changes?
If something goes wrong, what are the mitigation and rollback strategies?
move-tables/1.2-skip-ghost-tables. The change is fully contained to gh-ost'sgo/logicpackage.--move-tables(per #8167). Operators on standard--alterare not affected; if a move-tables run hits an unexpected drain timeout, rerunning with a largermoveTablesCutOverDrainTimeout(currentlyvar, easy to change) or aborting via the existingpanicflag are the immediate levers.Reviewers: Please scrutinize the T2
@@global.gtid_executeddivergence (go/logic/migrator.go:991-995) and the 60s drain-timeout default (go/logic/migrator.go:33-41) — both are documented inline but are the two places I'd most expect pushback. Boundary issues #8210 / #8211 / #8212 are not addressed here by design (called out in the doc comment at the head ofmoveTablesCutOver).