Skip to content

feat(move-tables): build cooperative cutover orchestration (#8209)#1704

Merged
zacharysierakowski merged 9 commits into
feature-move-tablesfrom
womoruyi/move-tables-1.5-cutover
Jun 12, 2026
Merged

feat(move-tables): build cooperative cutover orchestration (#8209)#1704
zacharysierakowski merged 9 commits into
feature-move-tablesfrom
womoruyi/move-tables-1.5-cutover

Conversation

@womoruyi

@womoruyi womoruyi commented Jun 11, 2026

Copy link
Copy Markdown

Demo for branch:

cutover-demo.mov

Implements phase 1.5 of the move-tables epic by adding a dedicated moveTablesCutOver() orchestrator that follows the T0-T6 cooperative cutover protocol from coop_cutover.md §1.3. This is a behavior change for move-tables mode (was a //TODO: cutover here no-op before); standard --alter migrations are unaffected.

What approach did you choose and why?

A new package-local moveTablesCutOver() was built from scratch alongside the standard cutOver() (not as a branch inside it) because every internal call from cutOver()throttle, atomicCutOver, waitForEventsUpToLock, heartbeat-lag — assumes a single MySQL cluster. With move-tables the applier writes target and the streamer reads source, so each of those subsystems either nil-panics (throttler is nil per gh-ost-edge-cases.md MT #6), drains nothing (sentinel goes to target's binlog, not source's, per MT #4), or never fires (heartbeat is skipped in §1.2). Reuse would have required guarded branches in five hot spots; a sibling function keeps the standard path untouched.

The drain itself follows the GTID-after-RENAME approach the design doc settled on (over LOCK TABLES, sentinel detection, or queue-length polling — all rejected in move_table_mode.md §1.5 "What not to do"): T1 issues RENAME TABLE on the privileged source connection (mgtr.inspector.db), T2 reads @@global.gtid_executed on the same connection to capture the drain GTID as a closed superset of every committed write, T3 polls applier.CurrentCoordinates against that GTID using GTIDBinlogCoordinates.SmallerThan, T4 sets CutOverCompleteFlag so canStopStreaming() returns true (MT #5), T5 fires OnSuccess, T6 returns.

The PR ships in three commits intentionally:

  1. 2db53320 feat(move-tables): add moveTablesCutOver skeleton and wire into MoveTables (#8209) — Adds the function with all T0-T6 transitions as labeled comment blocks; replaces the //TODO: cutover here seam in go/logic/migrator.go. No protocol logic yet, so the diff is reviewable as pure structure.
  2. 60acf2c2 feat(move-tables): implement T0-T6 cooperative cutover protocol (#8209) — Fills in the protocol: postpone gate, T0 hook, T1 RENAME via sqlutils.ExecNoPrepare (no retry — RENAME is not idempotent), T2 GTID capture, T3 drain poll with timeout, T4 flag set, T5 OnSuccess hook.
  3. 54652ab2 test(move-tables): add unit and integration tests for moveTablesCutOver (#8209) — Six tests in one new file; details below.

Scope (#8209 only):

  1. New moveTablesCutOver() in go/logic/migrator.go:931-1061 — protocol orchestrator.
  2. Package-level constants for drain tuning in go/logic/migrator.go:33-42: moveTablesCutOverDrainTimeout = 60 * time.Second and moveTablesCutOverDrainPollInterval = 100 * time.Millisecond. The 60s timeout is a starting default and is up for review with Daniel/Eric — the inline comment at go/logic/migrator.go:33-41 explains why CutOverLockTimeoutSeconds was not reused (its SetCutOverLockTimeoutSeconds setter caps it at 1–10s, too short for a real drain under load). Both constants are var, not const, so tests can patch them with t.Cleanup restore without adding a production seam.
  3. T2 divergence from the design doc, documented inline at go/logic/migrator.go:991-995: the design specifies @@gtid_executed, this PR uses @@global.gtid_executed. The unqualified form resolves to session scope (only this session's GTIDs), but the drain needs the global GTID set (all committed transactions). The comment cites the pinned design URL so any future reviewer sees exactly what changed and why.
  4. Wire-up in go/logic/migrator.go:901-903: the //TODO: cutover here seam in MoveTables() is replaced with mgtr.moveTablesCutOver(). The redundant mgtr.hooksExecutor.OnSuccess(false) placeholder that previously sat between finalCleanup() and the "Done moving tables" log line (base-branch MoveTables() L899-901) is removed in the same commit, because OnSuccess now fires at T5 inside the protocol. Removing it is in-scope for #8209: editing MoveTables() is move-tables-specific work; the standard-mode Migrate() path is not touched.
  5. Six new tests in go/logic/migrator_move_tables_cutover_test.go (328 lines, one new file). The test-suite driver is TestMoveTablesCutOver so -run MoveTablesCutOver matches both the pure-unit TestMoveTablesCutOver_* tests and the suite's TestMoveTablesCutOver/<method> sub-tests in one invocation. The SetupSuite block duplicates MigratorTestSuite to keep #8209 additive; a shared helper is a clean follow-up.

Test → criterion mapping:

# Test Layer Maps to
1 TestMoveTablesCutOver_NoopShortCircuits unit Noop semantics: skip protocol entirely, no hooks, flag stays 0
2 TestMoveTablesCutOver_OnBeforeCutOverHookAbortsBeforeRename unit T0 abort safety (coop_cutover §1.3 "non-zero return aborts cutover") + proxy: nil inspector.db would panic if RENAME were attempted
3 TestMoveTablesCutOver_PostponeGateFiresOnBeginPostponedOnce unit Postpone-gate idempotency + Edge Case Quality #5 (IsPostponingCutOver pre and post state both asserted at 0)
4 MoveTablesCutOverSuite.TestHappyPath integration Full T0-T6: hook ordering, T4 flag set, source-side RENAME verified both sides (#8209 AC: "writes against the source's original table name fail with ER_NO_SUCH_TABLE")
5 MoveTablesCutOverSuite.TestRenameFailurePropagates integration T1 failure handling: wrapped err, flag stays 0, no OnSuccess
6 MoveTablesCutOverSuite.TestDrainTimeoutPropagates integration T3 timeout handling: package vars patched via t.Cleanup, wrapped err, flag stays 0

Explicitly NOT in this PR (belongs to adjacent issues):

  • Drain-GTID persistence to _ghk (#8210) — required for crash-safe resume between T2 and T3. The PR notes this at the doc comment header of moveTablesCutOver() (go/logic/migrator.go:925-929). Without it, a crash between T2 and T3 forces the operator to restart from row-copy.
  • Hook env-var enrichment (#8211) — GH_OST_DRAIN_GTID, GH_OST_TARGET_*. Hooks fire today with the standard env vars only.
  • Target-server throttling (#8212) — initiateThrottler() early-returns in move-tables mode (go/logic/migrator.go:1627-1631) so mgtr.throttler is nil; this is why moveTablesCutOver() must not call the standard cutOver() path (gh-ost-edge-cases.md MT still experimenting, yes? go-mysql binlog parser looks good #6).
  • GetTargetDatabaseName() inverted branches (#8207 scope) — the tests in this PR use identical source/target DB names (test), which masks that bug per gh-ost-edge-cases.md MT test  #7.
  • Multi-cluster end-to-end validation — the in-repo test suite uses one MySQL container with two databases (test + test_other); cross-cluster GTID drain across separate binlogs cannot be exercised here. Validation against the multi-cluster bed in github/gh-ost-tablemove-poc/localtests/ is acceptance work tracked outside this PR.

Which feature flags are involved in this change?

  • None. Move-tables mode itself is gated by the existing IsMoveTablesMode() predicate (#8167); this PR adds no new flags.

Which environments does this change target?

  • N/A — gh-ost is an operator tool, not a service. Built and shipped via script/build.

Risk assessment

  • Low risk. The new orchestrator only runs when IsMoveTablesMode() is true. Standard --alter migrations bypass it entirely: MoveTables() and moveTablesCutOver() are package-local and unreachable from Migrate(). The only line touched in MoveTables() itself is the replacement of //TODO: cutover here and the removal of the now-redundant OnSuccess placeholder; nothing in Migrate() was modified. The 60s drain timeout is the one tuning parameter that could need adjustment based on real-world drain durations under load — flagged for Daniel/Eric in the inline comment so future operators can find the decision rationale.

How did/will you validate this change?

  • Testsgo test -run MoveTablesCutOver -v ./go/logic/... 2>&1 | tail -40
    • 3 unit tests pass in ~1.0s (postpone gate test does a real sleepWhileTrue 1s tick).
    • 3 integration tests pass in ~0.4s of actual test time (the surrounding ~25s is testcontainers spinning up mysql:8.0.42).
    • Last clean run: all 6 PASS, ok github.com/github/gh-ost/go/logic 26.502s.
  • Othergo build ./... clean. gofmt -w clean. go vet ./go/logic/... clean.
  • Other — Manual proof-of-life from the integration logs: T1 RENAME executed via the inspector connection (RENAME TABLE \test`.`testing` TO `test`.`_testing_del`), T2 captured a real drain GTID (ff2fb2f9-6532-11f1-818e-560e4160caa3:1-14`), T3 reported "drain complete; applier caught up to drain GTID", and T4 set the flag.

Known pre-existing test failures (not introduced by this PR)

Sharing the same root causes documented for #8206 on the base branch (move-tables/1.2-skip-ghost-tables):

  • GetTargetDatabaseName() produces "Incorrect database name ''" when source ≠ target — inverted branches at go/base/context.go:411, owned by #8207.
  • max_binlog_cache_size defaults trip some streamer-heavy suite tests.
  • Testcontainers startup is occasionally flaky in CI when Docker is overloaded.

This PR's own tests (-run MoveTablesCutOver) sidestep all three by using identical source/target DB names, not exercising the streamer at scale, and gracefully skipping under -short when testcontainers is unavailable.

Are there related full stack changes?

  • No — this is an operator-tool change inside the gh-ost Go binary.

If something goes wrong, what are the mitigation and rollback strategies?

  • Rollback — Revert the merge commit on move-tables/1.2-skip-ghost-tables. The change is fully contained to gh-ost's go/logic package.
  • Operator workaround — Move-tables is opt-in via --move-tables (per #8167). Operators on standard --alter are not affected; if a move-tables run hits an unexpected drain timeout, rerunning with a larger moveTablesCutOverDrainTimeout (currently var, easy to change) or aborting via the existing panic flag are the immediate levers.

Reviewers: Please scrutinize the T2 @@global.gtid_executed divergence (go/logic/migrator.go:991-995) and the 60s drain-timeout default (go/logic/migrator.go:33-41) — both are documented inline but are the two places I'd most expect pushback. Boundary issues #8210 / #8211 / #8212 are not addressed here by design (called out in the doc comment at the head of moveTablesCutOver).

womoruyi added 3 commits June 10, 2026 23:53
…ables (#8209)

Adds the cutover orchestration entry point with T0-T6 protocol structure
as labeled blocks. Protocol logic lands in the next commit. Replaces the
//TODO: cutover here seam with the actual call site.

Ref: database-infrastructure#8209
Postpone gate (flag file + socket, no heartbeat-lag check), RENAME on
the privileged source connection with no retry, drain-GTID capture on
the same connection, 100ms GTID-containment drain poll with timeout,
CutOverCompleteFlag set before return, on-success hook.

Drain GTID persistence is #8210. Hook env vars are #8211.

Ref: database-infrastructure#8209
…er (#8209)

Adds 3 pure-unit + 3 testcontainers-backed tests covering moveTablesCutOver
protocol mechanics:

  1. NoopShortCircuits             - Noop returns immediately; no hooks, no flag
  2. OnBeforeCutOverHookAborts*    - T0 hook failure aborts BEFORE T1 RENAME
                                     (proxy: nil inspector.db would panic)
  3. PostponeGateFiresOnce         - OnBeginPostponed fires exactly once;
                                     IsPostponingCutOver resets to 0
  4. TestHappyPath                 - full T0-T6: hook order, T4 flag set,
                                     source-side rename verified both sides
  5. TestRenameFailurePropagates   - T1 RENAME error wrapped; flag stays 0;
                                     no OnSuccess
  6. TestDrainTimeoutPropagates    - T3 timeout wrapped via patched package
                                     vars (t.Cleanup restored); flag stays 0

End-to-end acceptance criteria (DML drain under load, pt-table-checksum
parity) are validated against the multi-cluster test bed separately.

Ref: database-infrastructure#8209
Copilot AI review requested due to automatic review settings June 11, 2026 01:21

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements phase 1.5 of move-tables mode by adding a dedicated moveTablesCutOver() orchestrator that executes the cooperative cutover protocol (T0–T6). It replaces the previous move-tables cutover placeholder behavior, while keeping the standard --alter migration cutover path unchanged.

Changes:

  • Adds moveTablesCutOver() to perform cooperative cutover (postpone gate, T0 hooks, T1 RENAME on source, T2 capture drain GTID, T3 drain poll, T4 flag set, T5 OnSuccess).
  • Removes the temporary “cutover filename” mechanism and wires MoveTables() to call the new orchestrator.
  • Adds unit + integration tests for the move-tables cooperative cutover and strengthens some move-tables-related applier behavior/tests.
Show a summary per file
File Description
go/logic/migrator.go Implements moveTablesCutOver(), wires it into MoveTables(), and adjusts logging/error wrapping.
go/logic/migrator_test.go Updates tests to match updated applier method signatures.
go/logic/migrator_move_tables_cutover_test.go Adds new unit/integration tests for cooperative cutover orchestration.
go/logic/applier.go Improves error wrapping, documents move-tables changelog no-op behavior, and checks sqlRows.Err().
go/logic/applier_test.go Adds move-tables tests/cleanup and updates calls for new applier signatures.
go/cmd/gh-ost/main.go Removes the temporary --target-cutover-filename flag.
go/base/context.go Removes TmpCutoverFilename from move-tables config struct.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 7/7 changed files
  • Comments generated: 4

Comment thread go/logic/migrator.go
Comment thread go/logic/migrator.go
Comment thread go/logic/migrator.go Outdated
Comment thread go/logic/applier_test.go
@womoruyi womoruyi requested review from danieljoos and ericyan June 11, 2026 02:09
- Add mgtr.checkAbort() at top of T3 drain poll loop so abort/panic
  stops the drain immediately instead of waiting the full 60s timeout
- Assert NoError on CREATE DATABASE IF NOT EXISTS in applier_test.go
  SetupSuite (was silently discarding the error with _, _ =)
- Rewrite T2 comment to frame @@GLOBAL.gtid_executed as a readability
  divergence (making scope unambiguous in SQL) rather than asserting
  session-vs-global semantics that may not hold in MySQL 8.0

Ref: database-infrastructure#8209
@zacharysierakowski zacharysierakowski changed the base branch from move-tables/1.2-skip-ghost-tables to feature-move-tables June 11, 2026 15:21
Comment thread go/logic/migrator.go
Comment thread go/logic/migrator.go Outdated
Comment thread go/logic/migrator.go
Comment thread go/logic/migrator.go Outdated
Comment thread go/logic/migrator.go Outdated

@zacharysierakowski zacharysierakowski left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job here! Love how easy the diff was to follow.

I do think eventually (maybe at the very end, after we get a round or two of reviews on the final feature branch) we may need to remove the comments/references to the design doc (like "T3" etc.). But for now I love having them there for reviewing.

I left some comments but it's looking close!

…ents

Fixes the cutover rename+gtid selection and drain logic

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 8/8 changed files
  • Comments generated: 5

Comment thread go/logic/migrator.go
Comment thread go/logic/migrator.go
Comment thread go/logic/migrator.go
Comment thread go/base/context.go
Comment thread script/move-tables/README.md
@zacharysierakowski

Copy link
Copy Markdown

Here's a demo of the cutover/drain working as expected:

cutover-demo.mov

Comment thread go/logic/migrator.go
mgtr.migrationContext.Log.Debugf("T4: CutOverCompleteFlag set")

// ----- T5: on-success hook -----
// Hook unlocks user_rw@target via db-user-management and flips the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should revise these comments before merging this feature branch to the main branch.
They contain some info and URLs that are only relevant/accessible to GitHub internal users.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I mentioned the same thing!

@danieljoos danieljoos left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@zacharysierakowski

Copy link
Copy Markdown

Going to merge this to keep our branches from getting too stacked. Can followup with a separate PR to remove internal references at some point

@zacharysierakowski zacharysierakowski merged commit 865282c into feature-move-tables Jun 12, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature-move-tables PRs that are associated with the new move-tables feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants