Bug: Fleet sync jobs stuck in 'running' status after crash/restart #1729

@mrveiss

Description

Problem

With the new DB-persisted fleet sync tracking (#1707), if the SLM backend process dies mid-sync (e.g. OOM kill, crash, or unclean restart), the job row stays in status='running' indefinitely. There is no startup reconciliation to detect stale jobs and mark them as failed.

Expected

On SLM backend startup, any fleet sync job with status='running' that was created more than N minutes ago should be marked as failed with a message like "interrupted by service restart".

Fix

Add a startup hook in the main.py lifespan (or the code-sync module's init) that:

  1. Queries fleet_sync_jobs WHERE status = 'running'
  2. Marks each as failed, setting completed_at = now() and an error message such as "interrupted by service restart"
  3. Logs a warning for each recovered job
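The steps above could be sketched roughly as below. This is a minimal illustration using the stdlib sqlite3 module; the real backend likely uses a different DB layer, and the `error_message` column name (and the exact #1707 schema) is an assumption based on this issue's text:

```python
import sqlite3
import logging
from datetime import datetime, timezone

logger = logging.getLogger("code_sync")

def reconcile_stale_sync_jobs(conn: sqlite3.Connection) -> int:
    """Mark any fleet sync jobs still marked 'running' as failed.

    Intended to run once at SLM backend startup, before any new sync
    can start. Column names are assumptions; adapt to the #1707 schema.
    Returns the number of jobs recovered.
    """
    now = datetime.now(timezone.utc).isoformat()

    # Step 1: find jobs left in 'running' by a previous process.
    stale_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM fleet_sync_jobs WHERE status = 'running'"
        )
    ]

    # Step 2: mark each as failed with a completion timestamp and message.
    for job_id in stale_ids:
        conn.execute(
            "UPDATE fleet_sync_jobs "
            "SET status = 'failed', "
            "    error_message = 'interrupted by service restart', "
            "    completed_at = ? "
            "WHERE id = ?",
            (now, job_id),
        )
        # Step 3: log a warning for each recovered job.
        logger.warning(
            "Recovered stale fleet sync job %s (marked failed)", job_id
        )

    conn.commit()
    return len(stale_ids)
```

The age threshold from the Expected section ("created more than N minutes ago") could be added as an extra `created_at < ?` predicate; since the hook runs before the backend accepts new sync requests, every 'running' row at startup is stale by construction, so the simple form above may be sufficient.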

Impact

Severity: low — cosmetic/reporting, no functional harm. Jobs show incorrect status in API/UI.

Discovered During

Implementing #1707 — fleet sync job DB persistence

Location

autobot-slm-backend/api/code_sync.py — needs startup reconciliation hook
