Skip to content

refactor: gracefully abort restic tasks on shutdown#947

Draft
nicotsx wants to merge 6 commits into
mainfrom
06-05-refactor_shutdown-mutex
Draft

refactor: gracefully abort restic tasks on shutdown#947
nicotsx wants to merge 6 commits into
mainfrom
06-05-refactor_shutdown-mutex

Conversation

@nicotsx

@nicotsx nicotsx commented Jun 5, 2026

Copy link
Copy Markdown
Owner

Refactored the repo mutex to be the owner of the abort signal. On a shutdown signal, a simple repoMutex.shutdown() cancels all the in-flight restic operations leaving no orphaned locks behind

Summary by CodeRabbit

Release Notes

  • New Features

    • Added graceful shutdown handling for process termination signals (SIGTERM/SIGINT).
  • Improvements

    • Enhanced support for cancelling long-running operations (backups, restores, and repository operations).
    • Improved cleanup and lifecycle management during shutdown and operation termination.

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eb052770-132c-4cb9-9706-c9b629562bc1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

nicotsx commented Jun 5, 2026

Copy link
Copy Markdown
Owner Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@nicotsx nicotsx marked this pull request as ready for review June 5, 2026 15:55

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5860e2bab4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread app/server/modules/backups/backups.service.ts
Comment thread app/server/modules/backups/backup-executor.ts Outdated
Comment thread app/server/modules/repositories/repositories.service.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
app/server/modules/lifecycle/shutdown.ts (1)

12-26: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The local-agent unmount guard is inverted.

This only tries to unmount LOCAL_AGENT_ID volumes when enableLocalAgent is false, so the normal local-agent shutdown path skips the cleanup entirely.

Suggested fix
 const unmountVolumes = async () => {
-	if (!config.flags.enableLocalAgent) {
-		const volumes = await db.query.volumesTable.findMany({
-			where: { AND: [{ status: "mounted" }, { agentId: LOCAL_AGENT_ID }] },
-		});
-
-		for (const volume of volumes) {
-			const { status, error } = await withContext({ organizationId: volume.organizationId }, () =>
-				volumeService.unmountVolume(volume.shortId, { persistStatus: false }),
-			).catch((error) => ({ status: "error" as const, error: toMessage(error) }));
-
-			logger.info(`Volume ${volume.name} unmount status: ${status}${error ? `, error: ${error}` : ""}`);
-		}
+	if (!config.flags.enableLocalAgent) {
+		return;
+	}
+
+	const volumes = await db.query.volumesTable.findMany({
+		where: { AND: [{ status: "mounted" }, { agentId: LOCAL_AGENT_ID }] },
+	});
+
+	for (const volume of volumes) {
+		const { status, error } = await withContext({ organizationId: volume.organizationId }, () =>
+			volumeService.unmountVolume(volume.shortId, { persistStatus: false }),
+		).catch((error) => ({ status: "error" as const, error: toMessage(error) }));
+
+		logger.info(`Volume ${volume.name} unmount status: ${status}${error ? `, error: ${error}` : ""}`);
 	}
 };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/server/modules/lifecycle/shutdown.ts` around lines 12 - 26, The guard in
unmountVolumes is inverted: it currently runs the LOCAL_AGENT_ID unmount logic
only when config.flags.enableLocalAgent is false, skipping cleanup on normal
local-agent shutdown; change the condition to run the findMany/unmount loop when
enableLocalAgent is true (i.e., remove or invert the negation) so volumes from
db.query.volumesTable.findMany filtered by { agentId: LOCAL_AGENT_ID, status:
"mounted" } are iterated and each unmounted via withContext(..., () =>
volumeService.unmountVolume(volume.shortId, { persistStatus: false })) with the
same error handling and logger call preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/server/core/repository-mutex.ts`:
- Around line 148-153: runWithLease should avoid invoking the operation if the
lease's abort signal is already aborted: check lease.signal.aborted immediately
after acquiring the lease and before calling operation({ signal: lease.signal
}); if aborted, still call lease.release() and propagate the abort (e.g. reject
with an AbortError/DOMException or rethrow the signal.reason) instead of
invoking RepositoryOperation; update the runWithLease implementation
(referencing runWithLease, RepositoryLease, RepositoryOperation, lease.signal,
lease.release) to perform this pre-check and proper abort propagation.

In `@app/server/modules/backups/backups.service.ts`:
- Around line 454-514: The catch path for repoMutex.runShared currently treats
shutdown rejections as failures because it only checks
abortController.signal.aborted in one place; update the catch handling (the
block surrounding the repoMutex.runShared call and the similar block at 515-526)
so that if abortController.signal.aborted (or the rejection is an AbortError)
you call handleBackupCancellation(scheduleId, ctx.organizationId, <optional
message>) and taskStore.cancel(task.id, message) and set domainHandlerCompleted
= true, otherwise continue to call handleBackupFailure/taskStore.fail; reference
repoMutex.runShared, abortController.signal, handleBackupCancellation,
handleBackupFailure, taskStore.cancel, and taskStore.fail and apply the same
change to both occurrences.

In `@app/server/modules/backups/helpers/backup-maintenance.ts`:
- Around line 114-130: The mirror copy is incorrectly taking an exclusive lock
on the destination repository; change the lock type for the mirror repository
entries in the repoMutex.runMany calls used by mirror sync to "shared" (i.e.,
where you currently pass { repositoryId: mirrorRepository.id, type: "exclusive",
operation: `mirror_sync:${scheduleId}` } and similar entries) so both source and
mirror use "shared" locks while calling restic.copy (and mirror-related
operations such as those near the other runMany block referenced in the
comment); keep exclusive locks reserved only for maintenance
(prune/rebuild-index).

In `@app/server/modules/lifecycle/shutdown.ts`:
- Around line 28-30: The shutdown function currently awaits
Promise.allSettled([Scheduler.stop(), repoMutex.shutdown(), unmountVolumes(),
stopApplicationRuntime()]) and ignores results so any rejected step is lost;
modify shutdown in shutdown.ts to collect the settled results, find any rejected
entries (e.g., from repoMutex.shutdown() or stopApplicationRuntime()), log or
aggregate their errors, and rethrow (or throw an AggregateError) so failures
propagate to callers like app/server/plugins/bootstrap.ts; reference the
shutdown function and the Promise.allSettled call and ensure the code returns a
rejected promise when any step failed.

In `@app/server/modules/repositories/__tests__/repositories.service.test.ts`:
- Around line 435-456: The test calls the shared singleton repoMutex.shutdown(),
causing later tests to fail; change the test to use an isolated mutex instance
(or a test-only reset) instead of the imported repoMutex: create or inject a
fresh RepositoryMutex for this test and pass it to the code under test (or
reinitialize the module so repositoriesService.dumpSnapshot uses the fresh
mutex), ensure the holder uses that fresh instance (the holderStarted Promise
using repoMutex.runExclusive), and avoid calling shutdown() on the global
repoMutex so other tests are unaffected.

In `@app/server/modules/repositories/restore-executor.ts`:
- Around line 117-134: The catch currently treats any rejection from
repoMutex.runShared as a failure; change it to map abort-style rejections to
cancellation: in the catch block check if the outer signal is aborted OR the
caught error is an abort error (e.g. error?.name === "AbortError" || error?.code
=== "ERR_ABORTED" or use a helper isAbortError) and return { status:
"cancelled", message: "Restore was cancelled" }; otherwise keep returning {
status: "failed", error: toMessage(error) }. Update the catch around
repoMutex.runShared/restore executor accordingly (references:
repoMutex.runShared, signal, executeControllerRestore, executeAgentRestore).

In `@app/server/plugins/bootstrap.ts`:
- Around line 9-24: runGracefulShutdown currently swallows errors by catching
them and not propagating them, so runSignalShutdown always calls
process.exit(0); change runGracefulShutdown (and its use in runSignalShutdown)
so shutdown errors are propagated or converted to a non-zero exit code: let
shutdownPromise rethrow or return a rejected promise when shutdown() fails, and
in runSignalShutdown await runGracefulShutdown(reason).catch(err => {
logger.error("Error during shutdown", err); process.exit(1); }) (or rethrow to
let the caller handle exit code) so a failing shutdown results in process.exit
with a non-zero code; update references to runGracefulShutdown,
runSignalShutdown, shutdownPromise and shutdown accordingly.

In `@packages/core/src/restic/commands/copy.ts`:
- Around line 66-73: The post-run check of options.signal?.aborted wrongly
reports a cancellation after safeExec completes; instead determine cancellation
from safeExec's result/error itself and ensure cleanupTemporaryKeys runs in a
finally block. Change the flow in copy.ts so you call safeExec(...) and catch
its error; if the caught error indicates an abort (e.g. error.name ===
'AbortError' or a specific abort sentinel returned by safeExec) then log via
logger.warn and rethrow; otherwise treat a successful safeExec result as success
even if options.signal is now aborted. Also move cleanupTemporaryKeys(sourceEnv,
deps) and cleanupTemporaryKeys(destEnv, deps) into a finally block so they
always run regardless of success/abort; reference safeExec,
cleanupTemporaryKeys, options.signal, and logger.warn when making these edits.

In `@packages/core/src/restic/commands/delete-snapshots.ts`:
- Around line 39-42: The current abort check inside deleteSnapshots() uses
options.signal?.aborted and throws even after restic forget --prune has
completed; change it to only treat the operation as aborted when the safeExec
result indicates an aborted process (or when safeExec returns an explicit
aborted flag) rather than relying solely on options.signal?.aborted — locate the
aborting branch around cleanupTemporaryKeys() in deleteSnapshots(), consult the
safeExec call that ran restic forget --prune, and gate the throw on the safeExec
result (or an explicit aborted boolean from safeExec) so successful deletes
still run cache invalidation and stats refresh.

In `@packages/core/src/restic/commands/dump.ts`:
- Around line 68-73: The code currently aborts the local abortController
(abortDump) before calling safeSpawn/onSpawn which can cause stream to be unset
and surface a generic "Failed to initialize restic dump stream" instead of
propagating cancellation; update the logic in the dump routine to first
short-circuit when options.signal?.aborted (or when options.signal fires
before/while spawning) by returning/throwing an abort-specific error (e.g.,
propagate the original AbortSignal reason or throw an AbortError) instead of
proceeding to safeSpawn, and also ensure the abortDump handler and the
post-spawn !stream check map their failure path to the same abort-specific
error; apply the same change to the analogous branch around lines 115-117 so all
early/shutdown aborts consistently propagate cancellation rather than a generic
initialization failure.

In `@packages/core/src/restic/commands/forget.ts`:
- Around line 63-69: The current code checks extra.signal?.aborted after
awaiting safeExec and cleanupTemporaryKeys, which can incorrectly report an
already-completed restic forget/prune as "Operation aborted"; instead, determine
cancellation from the safeExec outcome itself (e.g., the returned result or any
thrown abort-specific error) before deciding to throw. Update the forget.ts
block that calls safeExec and cleanupTemporaryKeys so that you inspect the
safeExec result (res, exit code, or thrown AbortError) to detect aborts, only
throw "Operation aborted" when safeExec indicates it was terminated by the
signal, and always perform cleanupTemporaryKeys regardless of abort detection.

---

Outside diff comments:
In `@app/server/modules/lifecycle/shutdown.ts`:
- Around line 12-26: The guard in unmountVolumes is inverted: it currently runs
the LOCAL_AGENT_ID unmount logic only when config.flags.enableLocalAgent is
false, skipping cleanup on normal local-agent shutdown; change the condition to
run the findMany/unmount loop when enableLocalAgent is true (i.e., remove or
invert the negation) so volumes from db.query.volumesTable.findMany filtered by
{ agentId: LOCAL_AGENT_ID, status: "mounted" } are iterated and each unmounted
via withContext(..., () => volumeService.unmountVolume(volume.shortId, {
persistStatus: false })) with the same error handling and logger call preserved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4016cf83-68ad-4827-a3e2-cadb91a8cca0

📥 Commits

Reviewing files that changed from the base of the PR and between 34fc4f0 and 5860e2b.

📒 Files selected for processing (24)
  • app/server.ts
  • app/server/core/__tests__/repository-mutex.test.ts
  • app/server/core/repository-mutex.ts
  • app/server/modules/agents/agents-manager.ts
  • app/server/modules/backups/__tests__/backups.service.execution.test.ts
  • app/server/modules/backups/backup-executor.ts
  • app/server/modules/backups/backups.service.ts
  • app/server/modules/backups/helpers/backup-maintenance.ts
  • app/server/modules/lifecycle/__tests__/shutdown.test.ts
  • app/server/modules/lifecycle/shutdown.ts
  • app/server/modules/repositories/__tests__/repositories.service.test.ts
  • app/server/modules/repositories/helpers/doctor.ts
  • app/server/modules/repositories/repositories.service.ts
  • app/server/modules/repositories/restore-executor.ts
  • app/server/plugins/bootstrap.ts
  • app/test/helpers/repository-mutex.ts
  • packages/core/src/restic/commands/copy.ts
  • packages/core/src/restic/commands/delete-snapshots.ts
  • packages/core/src/restic/commands/dump.ts
  • packages/core/src/restic/commands/forget.ts
  • packages/core/src/restic/commands/ls.ts
  • packages/core/src/restic/commands/snapshots.ts
  • packages/core/src/restic/commands/stats.ts
  • packages/core/src/restic/commands/tag-snapshots.ts
💤 Files with no reviewable changes (1)
  • app/server.ts

Comment thread app/server/core/repository-mutex.ts
Comment thread app/server/modules/backups/backups.service.ts
Comment thread app/server/modules/backups/helpers/backup-maintenance.ts
Comment thread app/server/modules/lifecycle/shutdown.ts
Comment thread app/server/plugins/bootstrap.ts
Comment thread packages/core/src/restic/commands/copy.ts Outdated
Comment thread packages/core/src/restic/commands/delete-snapshots.ts Outdated
Comment thread packages/core/src/restic/commands/dump.ts
Comment thread packages/core/src/restic/commands/forget.ts Outdated
@nicotsx nicotsx force-pushed the 06-05-refactor_shutdown-mutex branch from 5860e2b to e674a8f Compare June 5, 2026 17:29

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e674a8f6bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread app/server/modules/repositories/restore-executor.ts
Comment thread app/server/modules/repositories/helpers/doctor.ts
Comment thread packages/core/src/restic/commands/tag-snapshots.ts Outdated
Comment thread app/server/modules/backups/helpers/backup-maintenance.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c27e717e0f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread app/server/modules/lifecycle/shutdown.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 729f6822d5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread app/server/modules/agents/agents-manager.ts
Comment thread app/server/modules/agents/agents-manager.ts
Comment thread app/server/modules/repositories/repositories.service.ts Outdated
@nicotsx nicotsx marked this pull request as draft June 6, 2026 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant