refactor: gracefully abort restic tasks on shutdown#947
Conversation
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5860e2bab4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/server/modules/lifecycle/shutdown.ts (1)
12-26:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThe local-agent unmount guard is inverted.
This only tries to unmount
LOCAL_AGENT_IDvolumes whenenableLocalAgentisfalse, so the normal local-agent shutdown path skips the cleanup entirely.Suggested fix
const unmountVolumes = async () => { - if (!config.flags.enableLocalAgent) { - const volumes = await db.query.volumesTable.findMany({ - where: { AND: [{ status: "mounted" }, { agentId: LOCAL_AGENT_ID }] }, - }); - - for (const volume of volumes) { - const { status, error } = await withContext({ organizationId: volume.organizationId }, () => - volumeService.unmountVolume(volume.shortId, { persistStatus: false }), - ).catch((error) => ({ status: "error" as const, error: toMessage(error) })); - - logger.info(`Volume ${volume.name} unmount status: ${status}${error ? `, error: ${error}` : ""}`); - } + if (!config.flags.enableLocalAgent) { + return; + } + + const volumes = await db.query.volumesTable.findMany({ + where: { AND: [{ status: "mounted" }, { agentId: LOCAL_AGENT_ID }] }, + }); + + for (const volume of volumes) { + const { status, error } = await withContext({ organizationId: volume.organizationId }, () => + volumeService.unmountVolume(volume.shortId, { persistStatus: false }), + ).catch((error) => ({ status: "error" as const, error: toMessage(error) })); + + logger.info(`Volume ${volume.name} unmount status: ${status}${error ? `, error: ${error}` : ""}`); } };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/server/modules/lifecycle/shutdown.ts` around lines 12 - 26, The guard in unmountVolumes is inverted: it currently runs the LOCAL_AGENT_ID unmount logic only when config.flags.enableLocalAgent is false, skipping cleanup on normal local-agent shutdown; change the condition to run the findMany/unmount loop when enableLocalAgent is true (i.e., remove or invert the negation) so volumes from db.query.volumesTable.findMany filtered by { agentId: LOCAL_AGENT_ID, status: "mounted" } are iterated and each unmounted via withContext(..., () => volumeService.unmountVolume(volume.shortId, { persistStatus: false })) with the same error handling and logger call preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/server/core/repository-mutex.ts`:
- Around line 148-153: runWithLease should avoid invoking the operation if the
lease's abort signal is already aborted: check lease.signal.aborted immediately
after acquiring the lease and before calling operation({ signal: lease.signal
}); if aborted, still call lease.release() and propagate the abort (e.g. reject
with an AbortError/DOMException or rethrow the signal.reason) instead of
invoking RepositoryOperation; update the runWithLease implementation
(referencing runWithLease, RepositoryLease, RepositoryOperation, lease.signal,
lease.release) to perform this pre-check and proper abort propagation.
In `@app/server/modules/backups/backups.service.ts`:
- Around line 454-514: The catch path for repoMutex.runShared currently treats
shutdown rejections as failures because it only checks
abortController.signal.aborted in one place; update the catch handling (the
block surrounding the repoMutex.runShared call and the similar block at 515-526)
so that if abortController.signal.aborted (or the rejection is an AbortError)
you call handleBackupCancellation(scheduleId, ctx.organizationId, <optional
message>) and taskStore.cancel(task.id, message) and set domainHandlerCompleted
= true, otherwise continue to call handleBackupFailure/taskStore.fail; reference
repoMutex.runShared, abortController.signal, handleBackupCancellation,
handleBackupFailure, taskStore.cancel, and taskStore.fail and apply the same
change to both occurrences.
In `@app/server/modules/backups/helpers/backup-maintenance.ts`:
- Around line 114-130: The mirror copy is incorrectly taking an exclusive lock
on the destination repository; change the lock type for the mirror repository
entries in the repoMutex.runMany calls used by mirror sync to "shared" (i.e.,
where you currently pass { repositoryId: mirrorRepository.id, type: "exclusive",
operation: `mirror_sync:${scheduleId}` } and similar entries) so both source and
mirror use "shared" locks while calling restic.copy (and mirror-related
operations such as those near the other runMany block referenced in the
comment); keep exclusive locks reserved only for maintenance
(prune/rebuild-index).
In `@app/server/modules/lifecycle/shutdown.ts`:
- Around line 28-30: The shutdown function currently awaits
Promise.allSettled([Scheduler.stop(), repoMutex.shutdown(), unmountVolumes(),
stopApplicationRuntime()]) and ignores results so any rejected step is lost;
modify shutdown in shutdown.ts to collect the settled results, find any rejected
entries (e.g., from repoMutex.shutdown() or stopApplicationRuntime()), log or
aggregate their errors, and rethrow (or throw an AggregateError) so failures
propagate to callers like app/server/plugins/bootstrap.ts; reference the
shutdown function and the Promise.allSettled call and ensure the code returns a
rejected promise when any step failed.
In `@app/server/modules/repositories/__tests__/repositories.service.test.ts`:
- Around line 435-456: The test calls the shared singleton repoMutex.shutdown(),
causing later tests to fail; change the test to use an isolated mutex instance
(or a test-only reset) instead of the imported repoMutex: create or inject a
fresh RepositoryMutex for this test and pass it to the code under test (or
reinitialize the module so repositoriesService.dumpSnapshot uses the fresh
mutex), ensure the holder uses that fresh instance (the holderStarted Promise
using repoMutex.runExclusive), and avoid calling shutdown() on the global
repoMutex so other tests are unaffected.
In `@app/server/modules/repositories/restore-executor.ts`:
- Around line 117-134: The catch currently treats any rejection from
repoMutex.runShared as a failure; change it to map abort-style rejections to
cancellation: in the catch block check if the outer signal is aborted OR the
caught error is an abort error (e.g. error?.name === "AbortError" || error?.code
=== "ERR_ABORTED" or use a helper isAbortError) and return { status:
"cancelled", message: "Restore was cancelled" }; otherwise keep returning {
status: "failed", error: toMessage(error) }. Update the catch around
repoMutex.runShared/restore executor accordingly (references:
repoMutex.runShared, signal, executeControllerRestore, executeAgentRestore).
In `@app/server/plugins/bootstrap.ts`:
- Around line 9-24: runGracefulShutdown currently swallows errors by catching
them and not propagating them, so runSignalShutdown always calls
process.exit(0); change runGracefulShutdown (and its use in runSignalShutdown)
so shutdown errors are propagated or converted to a non-zero exit code: let
shutdownPromise rethrow or return a rejected promise when shutdown() fails, and
in runSignalShutdown await runGracefulShutdown(reason).catch(err => {
logger.error("Error during shutdown", err); process.exit(1); }) (or rethrow to
let the caller handle exit code) so a failing shutdown results in process.exit
with a non-zero code; update references to runGracefulShutdown,
runSignalShutdown, shutdownPromise and shutdown accordingly.
In `@packages/core/src/restic/commands/copy.ts`:
- Around line 66-73: The post-run check of options.signal?.aborted wrongly
reports a cancellation after safeExec completes; instead determine cancellation
from safeExec's result/error itself and ensure cleanupTemporaryKeys runs in a
finally block. Change the flow in copy.ts so you call safeExec(...) and catch
its error; if the caught error indicates an abort (e.g. error.name ===
'AbortError' or a specific abort sentinel returned by safeExec) then log via
logger.warn and rethrow; otherwise treat a successful safeExec result as success
even if options.signal is now aborted. Also move cleanupTemporaryKeys(sourceEnv,
deps) and cleanupTemporaryKeys(destEnv, deps) into a finally block so they
always run regardless of success/abort; reference safeExec,
cleanupTemporaryKeys, options.signal, and logger.warn when making these edits.
In `@packages/core/src/restic/commands/delete-snapshots.ts`:
- Around line 39-42: The current abort check inside deleteSnapshots() uses
options.signal?.aborted and throws even after restic forget --prune has
completed; change it to only treat the operation as aborted when the safeExec
result indicates an aborted process (or when safeExec returns an explicit
aborted flag) rather than relying solely on options.signal?.aborted — locate the
aborting branch around cleanupTemporaryKeys() in deleteSnapshots(), consult the
safeExec call that ran restic forget --prune, and gate the throw on the safeExec
result (or an explicit aborted boolean from safeExec) so successful deletes
still run cache invalidation and stats refresh.
In `@packages/core/src/restic/commands/dump.ts`:
- Around line 68-73: The code currently aborts the local abortController
(abortDump) before calling safeSpawn/onSpawn which can cause stream to be unset
and surface a generic "Failed to initialize restic dump stream" instead of
propagating cancellation; update the logic in the dump routine to first
short-circuit when options.signal?.aborted (or when options.signal fires
before/while spawning) by returning/throwing an abort-specific error (e.g.,
propagate the original AbortSignal reason or throw an AbortError) instead of
proceeding to safeSpawn, and also ensure the abortDump handler and the
post-spawn !stream check map their failure path to the same abort-specific
error; apply the same change to the analogous branch around lines 115-117 so all
early/shutdown aborts consistently propagate cancellation rather than a generic
initialization failure.
In `@packages/core/src/restic/commands/forget.ts`:
- Around line 63-69: The current code checks extra.signal?.aborted after
awaiting safeExec and cleanupTemporaryKeys, which can incorrectly report an
already-completed restic forget/prune as "Operation aborted"; instead, determine
cancellation from the safeExec outcome itself (e.g., the returned result or any
thrown abort-specific error) before deciding to throw. Update the forget.ts
block that calls safeExec and cleanupTemporaryKeys so that you inspect the
safeExec result (res, exit code, or thrown AbortError) to detect aborts, only
throw "Operation aborted" when safeExec indicates it was terminated by the
signal, and always perform cleanupTemporaryKeys regardless of abort detection.
---
Outside diff comments:
In `@app/server/modules/lifecycle/shutdown.ts`:
- Around line 12-26: The guard in unmountVolumes is inverted: it currently runs
the LOCAL_AGENT_ID unmount logic only when config.flags.enableLocalAgent is
false, skipping cleanup on normal local-agent shutdown; change the condition to
run the findMany/unmount loop when enableLocalAgent is true (i.e., remove or
invert the negation) so volumes from db.query.volumesTable.findMany filtered by
{ agentId: LOCAL_AGENT_ID, status: "mounted" } are iterated and each unmounted
via withContext(..., () => volumeService.unmountVolume(volume.shortId, {
persistStatus: false })) with the same error handling and logger call preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4016cf83-68ad-4827-a3e2-cadb91a8cca0
📒 Files selected for processing (24)
app/server.tsapp/server/core/__tests__/repository-mutex.test.tsapp/server/core/repository-mutex.tsapp/server/modules/agents/agents-manager.tsapp/server/modules/backups/__tests__/backups.service.execution.test.tsapp/server/modules/backups/backup-executor.tsapp/server/modules/backups/backups.service.tsapp/server/modules/backups/helpers/backup-maintenance.tsapp/server/modules/lifecycle/__tests__/shutdown.test.tsapp/server/modules/lifecycle/shutdown.tsapp/server/modules/repositories/__tests__/repositories.service.test.tsapp/server/modules/repositories/helpers/doctor.tsapp/server/modules/repositories/repositories.service.tsapp/server/modules/repositories/restore-executor.tsapp/server/plugins/bootstrap.tsapp/test/helpers/repository-mutex.tspackages/core/src/restic/commands/copy.tspackages/core/src/restic/commands/delete-snapshots.tspackages/core/src/restic/commands/dump.tspackages/core/src/restic/commands/forget.tspackages/core/src/restic/commands/ls.tspackages/core/src/restic/commands/snapshots.tspackages/core/src/restic/commands/stats.tspackages/core/src/restic/commands/tag-snapshots.ts
💤 Files with no reviewable changes (1)
- app/server.ts
5860e2b to
e674a8f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e674a8f6bf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c27e717e0f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 729f6822d5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Refactored the repo mutex to be the owner of the abort signal. On a shutdown signal, a simple repoMutex.shutdown() cancels all the in-flight restic operations leaving no orphaned locks behind
Summary by CodeRabbit
Release Notes
New Features
Improvements