agent: detect dropped RSP_DATA packets in read_memory via seq gap#106
Merged
Conversation
`FlashAgentClient.read_memory` would silently return short data when a
RSP_DATA packet got CRC-rejected mid-stream by `recv_packet`. Surfaced
backing up an hi3516av300:emmc camera's partitions over the new agent
eMMC reader: at 921600 baud a 9 MiB stream consistently dropped ~5%
of packets, and the client appended the surviving bytes contiguously
without realising data was missing, then accepted the agent's final
ACK_OK as success. Two back-to-back dumps disagreed on kernel/romfs
SHAs while uboot/env (both small enough to land in single packet bursts)
were byte-identical.
The agent already stamps the 2-byte seq prefix on every RSP_DATA in
`agent/main.c::handle_read` (pkt[0..1] = seq). Just validate it on the
host side. On a gap, raise loudly with the offset, the expected and
received seq numbers, and the actual addr the gap landed at — the
caller can chunk + retry.
Also added a final size check at ACK_OK: if `len(received) != size`,
raise. Catches the (rare) case of agent ACKing early without any seq
gap on the way.
Verified end-to-end with the hi3516av300:emmc partition dump:
- 115200 baud + 64-KiB chunks + byte-level idle-drain in the host
retry path → 0 retries, both two back-to-back dumps SHA256-identical
across all four partitions (uboot 320 KiB, env 64 KiB, kernel
2240 KiB, romfs 9216 KiB).
- 921600 baud single-shot → seq gap is now raised after the first
lost packet (was: silent short read).
3 new tests in TestReadMemorySeqValidation:
- test_normal_sequential_seq_passes (regression: clean seq stream
still works)
- test_seq_gap_raises (the bug fix)
- test_size_mismatch_at_ack_raises (belt-and-braces total-size check)
Followup work captured in kaeru:
`agent-read-seq-gap-detection-2026-05-16`. The host transport's
chunked-retry helper (which lived in /tmp/backup_emmc.py for this
validation) is a reasonable candidate for a public
`FlashAgentClient.read_memory_chunked` if more callers hit the same
problem.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
`FlashAgentClient.read_memory` silently returned short data when a `RSP_DATA` packet was CRC-rejected mid-stream by `recv_packet`. Surfaced backing up an `hi3516av300:emmc` camera with the new agent eMMC reader: at 921600 baud a 9 MiB stream consistently dropped ~5 % of packets, the client appended the surviving bytes contiguously without realising data was missing, then accepted the agent's final `ACK_OK` as success. Two back-to-back dumps disagreed on the kernel/romfs SHAs while uboot/env (small enough to land in single packet bursts) were byte-identical.
What
The agent already stamps the 2-byte seq prefix on every RSP_DATA in `agent/main.c::handle_read` (`pkt[0..1] = seq`). The client now validates it. On a gap, it raises loudly with the offset, the expected and received seq numbers, and the actual addr the gap landed at — the caller can chunk + retry:
```
RuntimeError: Read packet seq gap at offset 64512 (addr=0x1486d600):
expected seq=235, got seq=236; likely UART corruption on a long stream.
Try a smaller read or a lower baud.
```
Also added a final size check at ACK_OK: if `len(received) != size`, raise. Catches the (rare) case of an agent ACKing early without any seq gap on the way.
End-to-end validation on real hardware
Same partitions, dumped end-to-end with `fast=False` (115200 baud) + 64-KiB chunks + a byte-level UART-silence drain in the host retry path — zero retries needed across both runs.
(Pre-fix on the same hardware: same two runs produced 4 of 4 partitions matching only for the two small partitions; kernel was complete in run 1 but truncated in run 2; romfs was truncated in both with different sizes and different SHAs.)
Test plan
Followups noted in kaeru
`agent-read-seq-gap-detection-2026-05-16` captures the chunked-retry + byte-level-drain pattern that the host-side script used during validation; a reasonable candidate for a future public `FlashAgentClient.read_memory_chunked` if more callers hit the same problem.
🤖 Generated with Claude Code