maria: require resync on partial row events#4530
Conversation
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
- add metric to count unsupported event types
| "unsupported MariaDB PARTIAL_ROW_DATA_EVENT (type %d): fragmented oversized row events cannot be processed; a resync is required", | ||
| e.EventType) |
There was a problem hiding this comment.
nit: would be good to also supply schema/table name
There was a problem hiding this comment.
I see that generic event doesn't provide this out-of-the-box and implementing our own parser may be out-of-scope, although technically possible according to Claude:
Pull the table ID out of the first fragment. The first fragment's content embeds the original rows event verbatim: its own 19-byte common header, then the rows-event post-header, which begins with a 6-byte little-endian table ID (same layout RowsEvent.Decode reads at row_event.go:1021). So inside ev.Data:
// ev.Data layout for PARTIAL_ROW_DATA_EVENT (172):
// [0:4] seq_no, [4:8] total_fragments, [8] flags,
// [9:17] orig size (only if flags&0x01, i.e. first fragment), then content.
pos := 9
if ev.Data[8]&0x01 != 0 {
pos += 8
}
content := ev.Data[pos:]
// content = original rows event: 19-byte header, then post-header
tableID := mysql.FixedLengthInt(content[19 : 19+6])
There was a problem hiding this comment.
We would also start failing on tables outside of the pipe now without parsing the table name. Could either handle all at once (parsing and skipping if not included) or see if an escalation comes and handle both then
There was a problem hiding this comment.
added logic to filter out partial raw failures related to the tables that aren't included in the pipe
the code is not very simple, but the e2e test covers it well
i can rollback if you think that it's too much complexity for such feature (as we actually don't know how often this event happens in production yet)
jgao54
left a comment
There was a problem hiding this comment.
a few nit comment but lgtm overall
| "unsupported MariaDB PARTIAL_ROW_DATA_EVENT (type %d): fragmented oversized row events cannot be processed; a resync is required", | ||
| e.EventType) |
There was a problem hiding this comment.
We would also start failing on tables outside of the pipe now without parsing the table name. Could either handle all at once (parsing and skipping if not included) or see if an escalation comes and handle both then
| default: | ||
| c.logger.Warn("unknown generic event", slog.Any("type", event.Header.EventType)) | ||
| otelManager.Metrics.UnsupportedBinlogEventCounter.Add(ctx, 1, metric.WithAttributeSet(attribute.NewSet( | ||
| attribute.String(otel_metrics.BinlogEventTypeKey, strconv.Itoa(int(event.Header.EventType))), |
There was a problem hiding this comment.
Does this need to be an attribute string + itoa? If you're planning to include non-generic events here from the parent switch-case or other branches, these ones would be better as Generic_%d.
There was a problem hiding this comment.
changed to int
i don't think Generic_ prefix makes a lot of sense here, because we do have event type in a header for each incoming event (regardless generic or not)
and over time some of those Generic event might become distinct Events on go-mysql side
so, I would keep it as an integer instead (on go-mysql level it's a byte, but int should be ok for the label)

❌ 2 Tests Failed:
View the top 1 failed test(s) by shortest run time
View the full list of 1 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
❌ Test FailureAnalysis: The PR's own new test Test_MariaDB_PartialRowEvent fails deterministically across all three matrix jobs (mysql-gtid/pos, MySQL 6.0/7.0/8.0) after ~190s, indicating a real bug in the PR's partial-row-event resync logic rather than a flaky failure. |
| return b | ||
| } | ||
| // first fragment: total=9, seq=1, flags=FL_ORIG_EVENT_SIZE, then origSize + embedded rows event | ||
| firstFragment := decode("09000000010000000126200000000000002cc0476a" + |
There was a problem hiding this comment.
these are real payloads captured during local testing, so I think we can keep this unit test in addition to e2e one (would be faster to run locally when we make changes to parsing logic)
We don't support partial row events. MariaDB 12.3+ emits those when payload size is bigger than
binlog_row_event_fragment_threshold(default 1Gb).This PR handles this event type and requires a resync.
In addition, it introduces
UnsupportedBinlogEventCounter(eventType)metric to keep track of skipped events.