fix(agent): recover from panics in per-file review and comment-pool goroutines (#171)#182
fix(agent): recover from panics in per-file review and comment-pool goroutines (#171)#182eldar702 wants to merge 3 commits into
Conversation
| defer func() { | ||
| if r := recover(); r != nil { | ||
| fmt.Fprintf(stdout.Writer(), "[ocr] CommentWorkerPool panic: %v\n", r) | ||
| } | ||
| }() |
There was a problem hiding this comment.
Both panic recovery handlers log only %v of the recovered value, which typically yields just the panic message (e.g., "nil pointer dereference") without any stack trace. In concurrent goroutines where panics are being swallowed, this makes production debugging extremely difficult. Consider capturing the stack trace via debug.Stack() from runtime/debug.
Same suggestion applies to the subtask panic handler below.
Suggestion:
| defer func() { | |
| if r := recover(); r != nil { | |
| fmt.Fprintf(stdout.Writer(), "[ocr] CommentWorkerPool panic: %v\n", r) | |
| } | |
| }() | |
| defer func() { | |
| if r := recover(); r != nil { | |
| fmt.Fprintf(stdout.Writer(), "[ocr] CommentWorkerPool panic: %v\n%s\n", r, debug.Stack()) | |
| } | |
| }() |
Addresses OpenCodeReview finding on alibaba#182: both panic-recovery handlers logged only %v of the recovered value (the panic message, no stack), making swallowed panics in concurrent goroutines hard to debug in production. Capture the stack via debug.Stack() (runtime/debug) in both the CommentWorkerPool and per-file subtask recover handlers. go build/vet clean; internal/agent tests pass. [AI-assisted]
|
Good catch — addressed in |
lizhengfeng101
left a comment
There was a problem hiding this comment.
PR #182 Code Review
Changes: 2 files, +52/-0
Overview
This PR adds recover() guards to two goroutine entry points in agent.go:
- Site A:
CommentWorkerPool.Submit— comment worker pool goroutines - Site B:
dispatchSubtasks— per-file review goroutines
The goal is to prevent a panic in a single file review from crashing the entire process (CWE-248), converting panics into warnings consistent with the existing error handling path.
Positives
- Correct defer ordering: Both
recover()defers are registered last (run first in LIFO), ensuring panic recovery happens before semaphore release andwg.Done()— no deadlocks or goroutine leaks. - Consistent with existing error path: Site B reuses the
subtask_errorwarning type andsubtaskFailedcounter, so theall-failedrollup logic remains correct. recordWarningis thread-safe: Confirmed it useswarningsMumutex protection (agent.go:306-312).- Good test quality: The test case clearly validates panic isolation (panicking closure doesn't affect healthy closure). RED→GREEN evidence in PR description is thorough.
- Excellent PR description: Deep problem analysis, clear fix rationale, honest test-coverage disclosure.
Issues to Address
1. PR description claims "No new imports" — inaccurate
The PR body states "No new imports", but runtime/debug is a newly added import (not previously present in agent.go). Minor inaccuracy, but the description should be precise.
2. Site B is missing telemetry reporting
The existing error path has a telemetry.ErrorEvent call (agent.go:421-422), but the panic recovery path does not. This means panics will appear in warnings and stderr but not in telemetry/tracing.
Suggest adding telemetry in the recover block, using ctx (not fileCtx, since cancel() runs before recover in LIFO order and fileCtx may already be cancelled):
if r := recover(); r != nil {
atomic.AddInt64(&a.subtaskFailed, 1)
fmt.Fprintf(stdout.Writer(), "[ocr] Subtask panic for %s: %v\n%s\n", d.NewPath, r, debug.Stack())
a.recordWarning("subtask_error", d.NewPath, fmt.Sprintf("panic: %v", r))
telemetry.ErrorEvent(ctx, "subtask.panic", fmt.Errorf("panic: %v", r),
telemetry.AnyToAttr("file.path", d.NewPath))
}3. Site A silently swallows panics
CommentWorkerPool.Submit's recover only prints a log. The caller of Await() cannot distinguish between "0 comments because nothing to report" and "0 comments because processing panicked". Consider whether a similar warning/error counting mechanism is needed, or at minimum document this behavior in Await's docstring.
4. What if operations inside recover themselves panic?
If a.recordWarning or debug.Stack() itself panics (extremely rare but theoretically possible), the second panic would not be caught by recover and would still crash the process. A conservative nested recover could guard against this, but this is a very low-priority edge case.
Summary
This is a high-quality defensive fix addressing a real process crash risk. Defer ordering, concurrency safety, and consistency with the existing error path are all handled well. The main suggestion is to add telemetry reporting in Site B so panics and errors have consistent observability. The remaining issues are lower priority and can be addressed as follow-ups.
Recommendation: merge after adding telemetry.
… comment-pool panic swallow (alibaba#171)
|
Thanks for the careful review — addressed in
|
Problem (CWE-248: Uncaught Exception)
ocrreviews each file in its own goroutine, and the comment worker pool runseach submitted unit of work in its own goroutine too. Both sites defer only
wg.Done()+ semaphore release — neither has arecover():CommentWorkerPool.Submit(internal/agent/agent.go)dispatchSubtasks(internal/agent/agent.go)In Go, an unrecovered panic in any goroutine terminates the entire process.
So a panic while reviewing a single file (a malformed diff, a nil deref deep in
a tool/LLM path, etc.) aborts the whole
ocrrun and discards every otherfile's results. This directly defeats the per-file fault isolation the code
already implements for
errorreturns: anerrorfrom one file is counted,recorded as a
subtask_errorwarning, and lets the other files finish — but apanicfrom the same code path crashes everything. The isolation is onlyhalf-built.
Fix
Add a
recover()defer to both goroutine sites. The crux is defer ordering:Go runs deferred calls LIFO, so the recover defer is registered last in each
goroutine and therefore runs first on unwind — it recovers before the
semaphore-release and
wg.Done()defers run, and all of them still execute.Result: the semaphore is always drained and the
WaitGroupalways decremented,so there is no goroutine/semaphore leak and
Await()/wg.Wait()neverdeadlocks. The recover defer itself does not re-panic.
CommentWorkerPool.Submit— on recover, log the panic to stdout. Thismatches the depth of the existing error handling on that path (a stderr log);
the pool has no per-task counter to update.
dispatchSubtasks— on recover, convert the panic into the existingcounted warning shape:
atomic.AddInt64(&a.subtaskFailed, 1)+a.recordWarning("subtask_error", d.NewPath, fmt.Sprintf("panic: %v", r))+a stderr line. It reuses the existing
subtask_errorwarning type (no newtype invented), so panics and errors surface identically downstream and the
end-of-run all-failed rollup (
failed > 0 && failed == dispatched) stayscorrect. The recover defer is registered before the timeout-
cancel()defer,so
cancel()still runs first on unwind (the file context is cancelled beforethe panic is contained) — no ordering hazard.
No new imports (
fmt,sync/atomic, and thestdoutpackage were alreadyimported). The happy path is unchanged; no signatures change.
Tests — genuine RED → GREEN under
-raceTestCommentWorkerPool_PanicIsIsolated(newinternal/agent/worker_pool_test.go)submits a panicking closure plus a healthy closure and asserts that
Await()does not crash and returns the healthy result (
len == 1).RED (before the Site A fix) — the panic in one closure crashes the whole
test process:
GREEN (after the fix) — the panic is contained, the healthy closure's result
still arrives, race detector clean:
Also verified:
go build ./...(clean),go vet ./internal/agent/...(clean),gofmt -l internal/agent/agent.go internal/agent/worker_pool_test.go(empty),and the full
go test -race ./internal/agent/...suite passes.Test-coverage honesty
Submittakes a caller-supplied closure, so Site A is directly unit-testedwith no new production seam.
dispatchSubtaskscalls the privateexecuteSubtask, which needs a fully-wiredAgent+ LLM client + parsed diffs;forcing a panic there in a unit test would require adding a net-new production
seam purely for the test. Rather than reshape production code for a test hook,
Site B is covered by the identical
recover()idiom + code review (itconverts the panic into the same
subtask_errorwarning path that is alreadyexercised for
errorreturns). This asymmetry is intentional and called out soreviewers know Site B is pattern-covered, not unit-covered.
Fixes #171
AI-assisted disclosure: this change was prepared with AI assistance (analysis,
patch, and test authored with an AI coding assistant) and reviewed by the
submitter. The RED→GREEN
-raceoutput above is from a real local run on thechecked-out code.
Updates since opening (review follow-ups)
6d00066— capturedebug.Stack()in both recover logs (OpenCodeReview bot suggestion). This adds one import,runtime/debug.b5e2ca3— Site B panic-recover now callstelemetry.ErrorEvent(ctx, "subtask.panic", …)for telemetry parity with the error path (uses the parentctx, not the already-cancelledfileCtx); andCommentWorkerPool.Awaitnow documents that a panicking unit of work is recovered+logged but not surfaced in the result.Correction: the "No new imports" note above is superseded —
runtime/debugis now imported (fordebug.Stack()).fmt,sync/atomic,stdout, andtelemetrywere already imported.