Skip to content

maria: require resync on partial row events#4530

Open
dtunikov wants to merge 10 commits into
mainfrom
fix/dbi-815/mysql-fail-on-maria-partial-row-data-event
Open

maria: require resync on partial row events#4530
dtunikov wants to merge 10 commits into
mainfrom
fix/dbi-815/mysql-fail-on-maria-partial-row-data-event

Conversation

@dtunikov

@dtunikov dtunikov commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

We don't support partial row events. MariaDB 12.3+ emits those when payload size is bigger than binlog_row_event_fragment_threshold (default 1Gb).
This PR handles this event type and requires a resync.
In addition, it introduces UnsupportedBinlogEventCounter(eventType) metric to keep track of skipped events.

@dtunikov dtunikov requested a review from a team as a code owner July 2, 2026 11:24
@claude

claude Bot commented Jul 2, 2026

Copy link
Copy Markdown

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

- add metric to count unsupported event types
@dtunikov dtunikov changed the title fail loudly on partial row maria event maria: require resync on partial row events Jul 2, 2026
Comment thread flow/connectors/mysql/cdc.go Outdated
Comment thread flow/connectors/mysql/cdc.go
Comment thread flow/e2e/clickhouse_mysql_test.go
Comment thread flow/e2e/mysql.go Fixed
Comment thread flow/alerting/classifier.go Outdated
Comment on lines +167 to +168
"unsupported MariaDB PARTIAL_ROW_DATA_EVENT (type %d): fragmented oversized row events cannot be processed; a resync is required",
e.EventType)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be good to also supply schema/table name

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that generic event doesn't provide this out-of-the-box and implementing our own parser may be out-of-scope, although technically possible according to Claude:

Pull the table ID out of the first fragment. The first fragment's content embeds the original rows event verbatim: its own 19-byte common header, then the rows-event post-header, which begins with a 6-byte little-endian table ID (same layout RowsEvent.Decode reads at row_event.go:1021). So inside ev.Data:

// ev.Data layout for PARTIAL_ROW_DATA_EVENT (172):
// [0:4] seq_no, [4:8] total_fragments, [8] flags,
// [9:17] orig size (only if flags&0x01, i.e. first fragment), then content.
pos := 9
if ev.Data[8]&0x01 != 0 {
    pos += 8
}
content := ev.Data[pos:]
// content = original rows event: 19-byte header, then post-header
tableID := mysql.FixedLengthInt(content[19 : 19+6])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would also start failing on tables outside of the pipe now without parsing the table name. Could either handle all at once (parsing and skipping if not included) or see if an escalation comes and handle both then

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added logic to filter out partial raw failures related to the tables that aren't included in the pipe
the code is not very simple, but the e2e test covers it well
i can rollback if you think that it's too much complexity for such feature (as we actually don't know how often this event happens in production yet)

Comment thread flow/e2e/clickhouse_mysql_test.go Outdated

@jgao54 jgao54 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few nit comment but lgtm overall

Comment thread flow/connectors/mysql/cdc.go Outdated
Comment thread flow/e2e/clickhouse_mysql_test.go
Comment on lines +167 to +168
"unsupported MariaDB PARTIAL_ROW_DATA_EVENT (type %d): fragmented oversized row events cannot be processed; a resync is required",
e.EventType)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would also start failing on tables outside of the pipe now without parsing the table name. Could either handle all at once (parsing and skipping if not included) or see if an escalation comes and handle both then

Comment thread flow/connectors/mysql/cdc.go Outdated
default:
c.logger.Warn("unknown generic event", slog.Any("type", event.Header.EventType))
otelManager.Metrics.UnsupportedBinlogEventCounter.Add(ctx, 1, metric.WithAttributeSet(attribute.NewSet(
attribute.String(otel_metrics.BinlogEventTypeKey, strconv.Itoa(int(event.Header.EventType))),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be an attribute string + itoa? If you're planning to include non-generic events here from the parent switch-case or other branches, these ones would be better as Generic_%d.

@dtunikov dtunikov Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to int
i don't think Generic_ prefix makes a lot of sense here, because we do have event type in a header for each incoming event (regardless generic or not)
and over time some of those Generic event might become distinct Events on go-mysql side
so, I would keep it as an integer instead (on go-mysql level it's a byte, but int should be ok for the label)
image

@codecov

codecov Bot commented Jul 3, 2026

Copy link
Copy Markdown

❌ 2 Tests Failed:

Tests completed Failed Passed Skipped
2954 2 2952 343
View the top 1 failed test(s) by shortest run time
github.com/PeerDB-io/peerdb/flow/e2e::TestPeerFlowE2ETestSuiteMariaDB_CH/Test_MariaDB_PartialRowEvent
Stack Traces | 195s run time
=== RUN   TestPeerFlowE2ETestSuiteMariaDB_CH/Test_MariaDB_PartialRowEvent
=== PAUSE TestPeerFlowE2ETestSuiteMariaDB_CH/Test_MariaDB_PartialRowEvent
=== CONT  TestPeerFlowE2ETestSuiteMariaDB_CH/Test_MariaDB_PartialRowEvent
2026/07/03 13:45:02 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mach_xsk2ix5z.test_mysql_schema_changes
    clickhouse_mysql_test.go:1957: WaitFor waiting for in-pipe row past out-of-pipe partial row event 2026-07-03 13:45:16.564518412 +0000 UTC m=+485.130611153
2026/07/03 13:45:16 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mach_1iwesbnb.test_blobs
2026/07/03 13:45:16 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_mach_6zelra5m.test_bit
    clickhouse_mysql_test.go:1957: UNEXPECTED TIMEOUT waiting for in-pipe row past out-of-pipe partial row event 2026-07-03 13:48:17.177314151 +0000 UTC m=+665.743406874
--- FAIL: TestPeerFlowE2ETestSuiteMariaDB_CH/Test_MariaDB_PartialRowEvent (194.78s)
View the full list of 1 ❄️ flaky test(s)
github.com/PeerDB-io/peerdb/flow/e2e::TestPeerFlowE2ETestSuiteMariaDB_CH

Flake rate in main: 14.29% (Passed 24 times, Failed 4 times)

Stack Traces | 0.01s run time
=== RUN   TestPeerFlowE2ETestSuiteMariaDB_CH
=== PAUSE TestPeerFlowE2ETestSuiteMariaDB_CH
=== CONT  TestPeerFlowE2ETestSuiteMariaDB_CH
--- FAIL: TestPeerFlowE2ETestSuiteMariaDB_CH (0.01s)
2026/07/03 13:49:14 INFO Received AWS credentials from peer for connector: ci x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/07/03 13:49:14 INFO fetched schema x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} table=e2e_test_machcl_7ipudwcu.test_composite_pkey_ordering

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

❌ Test Failure

Analysis: The PR's own new test Test_MariaDB_PartialRowEvent fails deterministically across all three matrix jobs (mysql-gtid/pos, MySQL 6.0/7.0/8.0) after ~190s, indicating a real bug in the PR's partial-row-event resync logic rather than a flaky failure.
Confidence: 0.9

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

return b
}
// first fragment: total=9, seq=1, flags=FL_ORIG_EVENT_SIZE, then origSize + embedded rows event
firstFragment := decode("09000000010000000126200000000000002cc0476a" +

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are real payloads captured during local testing, so I think we can keep this unit test in addition to e2e one (would be faster to run locally when we make changes to parsing logic)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants