`skills/test-agent/agents/test-agent/NOTES.md` (35 additions & 0 deletions)
@@ -259,6 +259,41 @@
- Run 1: pass (`lib/instances` 193.279s)
- Run 2: pass (`lib/instances` 261.633s)
- Run 3: pass (`lib/instances` 173.573s)

## 2026-04-07 - PR #184 follow-up CI round on `codex/standby-compression-delay`

### Initial CI red signature
- Linux `test` job failed on `TestDockerForwardChainRestored`.
- Failure:
  - `ensureDockerForwardJump should have restored the DOCKER-FORWARD jump`
  - raw `iptables -C FORWARD -j DOCKER-FORWARD` exited non-zero in the test after re-initialization.

### Root cause and fix
- The Docker-forward recovery path and the test both used plain `iptables` invocations with no wait for the xtables lock.
- Under parallel CI activity, a transient lock holder can cause checks/deletes/inserts to fail immediately and make the test observe a missing rule even though the recovery logic is otherwise correct.
- Fix:
  - Added a small `newIPTablesCommand` helper in `lib/network/bridge_linux.go` that uses `iptables -w 5 ...` with the existing `CAP_NET_ADMIN` setup.
  - Switched the bridge NAT/FORWARD rule management and `ensureDockerForwardJump` commands to that helper.
  - Updated `TestDockerForwardChainRestored` in `lib/instances/network_test.go` to use `iptables -w 5` for its direct host-global mutations/checks.

+
### Secondary flake surfaced during Deft reruns
280
+
- A subsequent Deft full-suite rerun exposed a post-restore guest exec race in `TestCloudHypervisorStandbyRestoreCompressionScenarios`:
281
+
-`receive response (stdout=0, stderr=0): rpc error: code = DeadlineExceeded desc = stream terminated by RST_STREAM with error code: CANCEL`
282
+
- The compression integration harness was only waiting for the exec agent socket and then issuing marker reads/writes immediately after restore.
283
+
- Fix:
284
+
- Added a no-op post-restore guest exec readiness probe in `waitForRunningAndExecReady`.
285
+
- Added a small retry wrapper for the compression integration test’s guest marker read/write commands so transient post-restore transport resets do not fail the scenario immediately.
286
+
287
+
### Validation
288
+
- Deft targeted loop:
289
+
-`go test -count=20 -run '^TestDockerForwardChainRestored$' -v ./lib/instances`
290
+
- Result: pass
291
+
- Deft targeted loop:
292
+
-`go test -count=10 -run '^TestCloudHypervisorStandbyRestoreCompressionScenarios$' -tags containers_image_openpgp -timeout=30m ./lib/instances`
293
+
- Result: pass
294
+
- Local sanity:
295
+
-`go test ./lib/instances -count=1`
296
+
- Result: pass (`117.538s`)
- `exec-agent not ready for instance ... within 15s (last state: Initializing)`

### Additional flakes reproduced during Deft full-suite verification