Skip to content

cancelchecker: fix data race in GoroutineCPUHandle registration#164603

Draft
ZhouXing19 wants to merge 1 commit intocockroachdb:masterfrom
ZhouXing19:GoroutineCPUHandle-data-race
Draft

cancelchecker: fix data race in GoroutineCPUHandle registration#164603
ZhouXing19 wants to merge 1 commit intocockroachdb:masterfrom
ZhouXing19:GoroutineCPUHandle-data-race

Conversation

@ZhouXing19
Copy link
Collaborator

@ZhouXing19 ZhouXing19 commented Mar 1, 2026

Fixes #164346
Fixes #164347
Fixes #164348
Fixes #164349
Fixes #164350
Fixes #164352
Fixes #164353
Fixes #164354
Fixes #164355
Fixes #164594

The row-based CancelChecker.Reset() was eagerly calling SQLCPUHandle.RegisterGoroutine() during Reset(), which runs on the main goroutine. The resulting GoroutineCPUHandle was keyed to the main goroutine's ID. When the CancelChecker was later used inside a ParallelUnorderedSynchronizer worker goroutine (via a Columnarizer wrapping a row-based processor), it called measureAndAdmit() on the main goroutine's handle. Meanwhile, the main goroutine's vectorized CancelChecker lazily registered and obtained the same handle (same goroutine ID). Both goroutines then called measureAndAdmit() concurrently on unsynchronized fields (cpuAccounted, pauseDur, etc.), triggering a data race detected by the race detector.

The fix defers RegisterGoroutine() from Reset() to the first Check() call, matching the pattern already used in the vectorized cancel checker (colexecutils.CancelChecker) since PR #163771. This ensures each goroutine registers its own handle on the goroutine that actually uses it.

Release note: None

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

The row-based CancelChecker.Reset() was eagerly calling
SQLCPUHandle.RegisterGoroutine() during Reset(), which runs on the main
goroutine. The resulting GoroutineCPUHandle was keyed to the main
goroutine's ID. When the CancelChecker was later used inside a
ParallelUnorderedSynchronizer worker goroutine (via a Columnarizer
wrapping a row-based processor), it called measureAndAdmit() on the main
goroutine's handle. Meanwhile, the main goroutine's vectorized
CancelChecker lazily registered and obtained the same handle (same
goroutine ID). Both goroutines then called measureAndAdmit() concurrently
on unsynchronized fields (cpuAccounted, pauseDur, etc.), triggering a
data race detected by the race detector.

The fix defers RegisterGoroutine() from Reset() to the first Check()
call, matching the pattern already used in the vectorized cancel checker
(colexecutils.CancelChecker) since PR cockroachdb#163771. This ensures each
goroutine registers its own handle on the goroutine that actually uses
it.

Fixes cockroachdb#164346
Fixes cockroachdb#164347
Fixes cockroachdb#164348
Fixes cockroachdb#164349
Fixes cockroachdb#164350
Fixes cockroachdb#164352
Fixes cockroachdb#164353
Fixes cockroachdb#164354
Fixes cockroachdb#164355
Fixes cockroachdb#164594

Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@trunk-io
Copy link
Contributor

trunk-io bot commented Mar 1, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

@blathers-crl
Copy link

blathers-crl bot commented Mar 1, 2026

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment