profiles: per-board variant overrides; loud diagnostic when agent stays silent #102

Merged
widgetii merged 1 commit into master from profile-board-variants
May 15, 2026
@widgetii
Member

Why

Two hi3516av300 cameras side-by-side. Same chip silicon, same defib code, same boot protocol, same TAIL ACKs from the bootrom. The agent uploaded to 0x81000000 ran cleanly on the SPI NOR variant and hung silently on the eMMC variant. The bootrom source (`OpenIPC/openhisilicon: bootrom/hi3516av300/re/bootloader.c:uart0_recv_payload`) confirmed the protocol is fine — `((foreign_fn)frame[6])()` is called after every ACKed TAIL. What differed was `DDRSTEP0` — defib's per-chip DDR init is calibrated for ONE board and doesn't bring DDR up on the other variant, so the bootrom faithfully jumps to 0x81000000 but the CPU fetches garbage there and hangs without writing a single byte to UART. Full investigation captured in kaeru `av300-ddr-init-is-per-board-not-per-chip-2026-05-15`.

What

Two things, both keyed off the realisation that DDR init is per-board, not per-chip.

Per-board variant support in profiles

Optional `variants` map in profile JSON. Each variant overrides matching top-level keys (typically `DDRSTEP0`, sometimes `PRESTEP0`):

```json
{
"name": "hi3516av300",
"DDRSTEP0": [...],
...
"variants": {
"emmc": { "DDRSTEP0": [...board-specific DDR init...] }
}
}
```

CLI accepts `--chip hi3516av300:emmc` consistently. Variant suffix gets stripped in chip-keyed lookups (`firmware_url`, `get_cached_path`, `get_agent_binary`) — those resources are per-chip. Profile loader pops `variants` before pydantic validation and merges in via `dict.update()`, so the `SoCProfile` model itself stays variant-unaware. Aliases still work transparently (`hi3516dv300_alias:emmc` resolves the alias chain and then applies the variant on the final target).
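
The parse-then-merge flow described above can be sketched roughly as follows. This is a minimal illustration, not defib's actual loader: `parse_chip_variant` is named in this PR, but its behaviour for a bare chip name (returning `None` for the variant) is an assumption, and `apply_variant` plus the placeholder `DDRSTEP0` values are hypothetical.

```python
from typing import Any, Optional


def parse_chip_variant(spec: str) -> tuple[str, Optional[str]]:
    """Split 'chip:variant' into its parts; variant is None when absent (assumed)."""
    chip, sep, variant = spec.partition(":")
    return chip, (variant if sep else None)


def apply_variant(raw: dict[str, Any], variant: Optional[str]) -> dict[str, Any]:
    """Hypothetical helper: pop 'variants' and overlay the chosen one."""
    merged = dict(raw)                      # shallow copy; the input stays untouched
    variants = merged.pop("variants", {})   # removed so schema validation never sees it
    if variant is not None:
        if variant not in variants:
            known = ", ".join(sorted(variants)) or "none"
            raise ValueError(f"unknown variant {variant!r} (known: {known})")
        merged.update(variants[variant])    # whole-key replacement, not a deep merge
    return merged


raw = {
    "name": "hi3516av300",
    "DDRSTEP0": [1, 2, 3],                  # placeholder values, not real DDR init
    "variants": {"emmc": {"DDRSTEP0": [9, 9]}},
}
chip, variant = parse_chip_variant("hi3516av300:emmc")
profile = apply_variant(raw, variant)
```

Because `dict.update()` replaces each key wholesale, a variant has to carry the complete `DDRSTEP0` list rather than a diff against the base profile.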

Loud diagnostic when agent stays silent

`defib agent upload` / `defib agent flash` used to print one red line when the agent never returned READY:
```
Agent not responding
```

Now it explains what's happening:
```
Agent not responding
Boot-protocol upload completed but the agent never sent READY.

Most common cause: the chip profile's DDR init (PRESTEP0/DDRSTEP0)
doesn't match this board's DDR layout. The bootrom faithfully calls
the agent at 0x81000000, but DDR isn't backed there, so the CPU
fetches garbage and hangs silently (no UART output).

No board variants declared for hi3516av300.

Manual workaround (vendor U-Boot must be intact in flash):

  1. power-cycle the camera
  2. hold Ctrl-C to break U-Boot autoboot
  3. at the U-Boot prompt: loady 0x81000000
  4. YMODEM-send agent-.bin
  5. go 0x81000000
```

When variants are declared, the message names them and suggests `defib agent upload -c hi3516av300:` as the next step. JSON output mode wraps the same text under a `diagnostic` key.
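
As a rough sketch of how the variant-dependent part of that message might be assembled: the function names `agent_silent_hint` and `as_json` are hypothetical (the PR's real helper is `_agent_not_responding_message`), and suggesting the first declared variant in the `Try:` line is an assumption. Only the two message strings and the `diagnostic` key come from this PR.

```python
import json


def agent_silent_hint(chip: str, variants: list[str]) -> str:
    """Hypothetical: build the variant-specific lines of the diagnostic."""
    if not variants:
        return f"No board variants declared for {chip}."
    names = ", ".join(variants)
    return (
        f"Known board variants for {chip}: {names}\n"
        f"  Try: defib agent upload -c {chip}:{variants[0]} ..."
    )


def as_json(text: str) -> str:
    """Wrap the same text under a 'diagnostic' key for JSON output mode."""
    return json.dumps({"diagnostic": text})


hint_without = agent_silent_hint("hi3516av300", [])
hint_with = agent_silent_hint("hi3516av300", ["emmc"])
```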

Not in scope

The actual `hi3516av300:emmc` variant DATA — we don't have working DDR init bytes for the eMMC board. Extracting them from that board's vendor U-Boot is the follow-up task. This PR ships only the infrastructure plus the diagnostic.

Test plan

  • `uv run pytest tests/ -x --ignore=tests/fuzz` — 517 passed, 2 skipped (21 new tests: 17 in test_profiles.py, 4 in test_cli.py)
  • `uv run ruff check` on changed files — clean
  • `uv run mypy` on `src/defib/profiles/`, `firmware.py`, `agent/client.py` — no issues
  • `defib list-chips` / `defib agent upload --help` smoke-test — unchanged
  • Real-hardware findings from the investigation are preserved in kaeru (`av300-ddr-init-is-per-board-not-per-chip-2026-05-15`, `cv500-agent-jumps-from-vendor-uboot-not-bootrom`).

🤖 Generated with Claude Code

…ys silent

DDR init (PRESTEP0/DDRSTEP0) in defib's chip profiles is calibrated for
ONE board variant per chip. Two hi3516av300 cameras side-by-side: same
profile, same boot protocol, same TAIL ACKs — agent at 0x81000000
runs on the SPI NOR variant, hangs silently on the eMMC variant because
DDR isn't backed there. Bootrom source (OpenIPC/openhisilicon
bootrom/hi3516av300/re/bootloader.c:uart0_recv_payload) confirms the
bootrom faithfully calls the HEAD's load address after every ACKed
TAIL — the protocol is fine, the per-board DDR setup isn't.

This wires up the infrastructure to declare per-board overrides without
duplicating whole profile files, and replaces the "Agent not responding"
one-liner with a diagnostic that names the actual cause and a fix.

* `parse_chip_variant("hi3516av300:emmc") → ("hi3516av300", "emmc")` —
  colon syntax, accepted by `--chip` consistently.
* Profile JSON gains an optional `variants` map keyed by variant name;
  entries override matching top-level keys (typically DDRSTEP0,
  PRESTEP0). Schema itself stays variant-unaware — variants are popped
  before pydantic validation and merged in via dict.update().
* `list_variants(chip)` for surfacing options to humans.
* `get_agent_binary`, `firmware_url`, `get_cached_path`, `download_firmware`
  strip the variant suffix — those resources are per-chip, not per-board.
* New CLI helper `_agent_not_responding_message(chip, uboot_address)`
  names the DDR-mismatch root cause first, lists declared variants if
  any (with a concrete `-c chip:variant` next-step), and includes the
  vendor-U-Boot loady fallback (loady → YMODEM → go). Used by both
  `defib agent upload` and `defib agent flash`.
* 21 new tests in test_profiles.py and test_cli.py covering parsing,
  override merging, alias-chain transparency, unknown-variant errors,
  variant stripping in chip-keyed lookups, and the diagnostic content.

No variant data is shipped yet — the eMMC av300 variant needs DDR init
bytes extracted from a working vendor U-Boot on that board, which is a
follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@widgetii widgetii merged commit f49379d into master May 15, 2026
13 checks passed
@widgetii widgetii deleted the profile-board-variants branch May 15, 2026 14:05
widgetii added a commit that referenced this pull request May 15, 2026
## Why

We have two hi3516av300 cameras on the bench: one with SPI NOR flash
(ether8), one with eMMC (ether1). They use different DDR chips, and the
OpenIPC U-Boot for hi3516av300 ships an SPL targeting SPI NOR boards. On
the eMMC board, defib's boot protocol completes every stage with ACKs
from the bootrom, but the bootrom faithfully calls the agent at
0x81000000 — DDR isn't backed there, the CPU fetches garbage, and the
link goes silent (0 bytes for 30s, no READY).

Two pieces here. The first builds on #102's variant infrastructure to
carry an SPL blob; the second is the actual extracted-vendor variant.

## What

### `SPL_BLOB` schema addition

Optional profile field naming a binary file (resolved relative to the
profile JSON's directory). The loader reads it into `profile.spl_data`.
The agent-upload CLI prefers `profile.spl_data` over the downloaded
U-Boot when set:

```python
if profile.spl_data is not None:
    spl_data = profile.spl_data        # variant SPL takes precedence
else:
    spl_data = cached_fw.read_bytes()  # fall back to OpenIPC U-Boot first 20K
```

Variant declaration looks like:

```json
{
  "name": "hi3516av300",
  "...": "...",
  "variants": {
    "emmc": { "SPL_BLOB": "hi3516av300-emmc-spl.bin" }
  }
}
```
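
The loader side of `SPL_BLOB` could look something like the sketch below. It is an illustration under assumptions: `load_spl_blob` is a hypothetical name, raising `FileNotFoundError` for a missing blob is a guess at the error path, and the demo uses a throwaway temp directory with a stand-in blob rather than the real profile data.

```python
import tempfile
from pathlib import Path
from typing import Any, Optional


def load_spl_blob(profile_path: Path, raw: dict[str, Any]) -> Optional[bytes]:
    """Read the file named by SPL_BLOB, resolved against the profile JSON's directory."""
    blob_name = raw.get("SPL_BLOB")
    if blob_name is None:
        return None                          # profile carries no SPL override
    blob_path = profile_path.parent / blob_name
    if not blob_path.is_file():
        raise FileNotFoundError(f"SPL_BLOB {blob_name!r} not found at {blob_path}")
    return blob_path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    profile_path = Path(tmp) / "hi3516av300.json"
    (Path(tmp) / "demo-spl.bin").write_bytes(b"\x00" * 16)   # stand-in blob
    spl_data = load_spl_blob(profile_path, {"SPL_BLOB": "demo-spl.bin"})
    no_blob = load_spl_blob(profile_path, {})
```

Resolving against the profile's own directory keeps the JSON relocatable: the blob travels with the profile file instead of depending on the current working directory.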

### `hi3516av300:emmc` variant

20480 bytes extracted from a working eMMC av300 board's vendor U-Boot
(eMMC offset 0, truncated at the gzip boundary at 0x5000). Lives at
`src/defib/profiles/data/hi3516av300-emmc-spl.bin`.

End-to-end verified on real hardware:

| Camera | `--chip` | Result |
|---|---|---|
| SPI NOR av300 | `hi3516av300` | agent READY at t=0.3s |
| eMMC av300 | `hi3516av300` | 0 bytes for 30s (pre-existing failure mode) |
| eMMC av300 | `hi3516av300:emmc` | **agent READY at t=0.3s** |

### Failure-diagnostic content update

The diagnostic message from #102 now actually has a real variant to
suggest:

```
Known board variants for hi3516av300: emmc
  Try: defib agent upload -c hi3516av300:emmc ...
```

## Extraction recipe

Captured in kaeru `hi3516av300-emmc-variant-shipped-2026-05-15` for
the next board family that hits this:

1. Catch vendor U-Boot prompt (^C bombardment)
2. `mmc dev 0` then `mmc read 0 0x82000000 0 0x40` (note: this U-Boot 2016.11 wants `mmc read DEV addr blk# cnt`, not `mmc read addr blk# cnt`)
3. `loady 0x81000000` the defib agent, then `go 0x81000000`
4. `agent.read_memory(0x82000000, 0x6000)` to pull the bytes back
5. Truncate at the byte before the `\x1f\x8b\x08` gzip signature (0x5000 here) to drop the gzipped U-Boot tail
6. Drop into `src/defib/profiles/data/<chip>-<variant>-spl.bin` and add a variant block
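
Step 5 of the recipe can be sketched as below. `trim_spl` is a hypothetical helper, and the `dump` here is synthetic filler standing in for the bytes returned by `agent.read_memory(0x82000000, 0x6000)`; it is not real SPL data.

```python
GZIP_MAGIC = b"\x1f\x8b\x08"   # gzip ID1, ID2, and CM=8 (deflate)


def trim_spl(dump: bytes) -> bytes:
    """Return the bytes preceding the first gzip header in a raw dump."""
    offset = dump.find(GZIP_MAGIC)
    if offset < 0:
        raise ValueError("no gzip signature found; cannot locate the SPL boundary")
    return dump[:offset]


# Synthetic stand-in for the 0x6000-byte read: 0x5000 bytes of filler,
# then a gzip header, then padding.
dump = b"\xaa" * 0x5000 + GZIP_MAGIC + b"\x00" * (0x6000 - 0x5000 - len(GZIP_MAGIC))
spl = trim_spl(dump)
```

A three-byte signature can in principle occur by chance inside the SPL itself, so it is worth checking that the offset found lands on a plausible boundary (0x5000 on this board) before shipping the blob.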

## Test plan

- [x] `uv run pytest tests/ -x --ignore=tests/fuzz`: **522 passed, 2 skipped** (5 new tests in test_profiles.py covering blob resolution, missing-blob error path, blob-via-variant, real av300:emmc, real av300 base)
- [x] `uv run ruff check` + `mypy` on changed files: clean
- [x] Real-hardware: eMMC av300 reaches READY at t=0.3s with `--chip hi3516av300:emmc` (was 0 bytes for 30s before this PR)
- [x] Real-hardware: SPI NOR av300 still reaches READY at t=0.3s with the base `--chip hi3516av300` (no regression)
- [x] `*.bin` gitignore got a negation rule for `src/defib/profiles/data/*.bin` so SPL blobs don't get hidden

## Aside

Found a separate routeros power-controller bug while iterating:
`power_off → power_on` over a port that was already off restores it to
"off" because `power_off` saves the current state (off) and
`power_on` restores it. Worked around in test scripts via
`_set_poe(port, 'forced-on')`. Worth fixing separately.
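
The save/restore behaviour described above can be reproduced with a toy model. `PortPower` is purely hypothetical and is not the real routeros power-controller code; it only mirrors the mechanism as stated here.

```python
from typing import Optional


class PortPower:
    """Toy model of a save/restore power controller (hypothetical)."""

    def __init__(self, state: str) -> None:
        self.state = state
        self._saved: Optional[str] = None

    def power_off(self) -> None:
        self._saved = self.state     # remembers "off" if the port is already off
        self.state = "off"

    def power_on(self) -> None:
        # Restores whatever power_off saved, even when that was "off".
        if self._saved is not None:
            self.state = self._saved

    def force_on(self) -> None:
        self.state = "forced-on"     # mirrors the _set_poe(port, 'forced-on') workaround


port = PortPower("off")
port.power_off()
port.power_on()
stuck_state = port.state             # the port never comes back up

port.force_on()
fixed_state = port.state
```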

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Dmitry Ilyin <widgetii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>