logictest: add retry to flaky regional_by_row_cascade EXPLAIN #164575
mw5h wants to merge 1 commit into cockroachdb:master
The full EXPLAIN (VERBOSE) check in regional_by_row_cascade intermittently fails because, despite clearing the stats cache before the query, the optimizer occasionally does not see the freshly collected stats in time. This causes the plan output to differ (e.g. "missing stats" instead of actual row count estimates), which cascades into a full EXPLAIN mismatch.

A retry was originally added in cockroachdb#156823 for a count-based query in the same file, then removed in cockroachdb#157039 when the metamorphic batch size issue was identified. The metamorphic fix did not fully resolve the flake; it continues to appear across master, release-26.1, and release-26.1.0-rc branches.

Add retry to the full EXPLAIN query to handle the rare stats propagation timing window.

Fixes: cockroachdb#164001
Release note: None
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
The difference in the failures #164001 and #164399 is actually not the stats issue
It seems that, in rare cases, we choose to use the locking scan as the initial read in the DELETE. I don't see any interesting metamorphic variables. Also, we do specify the column family definition, so it's not related to column family randomization. I wonder whether @michae2 has any hypothesis. Perhaps a retry would fix the flake, but I wonder whether we should just remove the whole output and just count the number of parallel readers (which this test is really about) with something like
We expect to use a LOS, but in rare cases we don't. I don't think the LOS depends on stats, but rather on regions... my guess is that when these failures happen the gateway node is missing region(s) for some reason. I'm hoping that #161999 and #164001 and company are the same as #161528 and have just been fixed by #163883.

Summary
Add a `retry` directive to the full `EXPLAIN (VERBOSE)` query in `regional_by_row_cascade` to handle a rare stats propagation timing issue that causes intermittent test failures across master, release-26.1, and release-26.1.0-rc.
Fixes: #164001
Test plan
The test itself is the test plan: the `retry` directive allows the EXPLAIN query to be retried with backoff if the output doesn't match on the first attempt.