fix: expire stale in-flight cron records #438

Open
captainsafia wants to merge 1 commit into main from safia/fix-cron-limits


@captainsafia
Collaborator

Summary

  • expire in-flight Oz run records when they exceed a max-attempt limit or max-age threshold
  • route expired records through the workflow failure handler so progress comments are updated before KV cleanup
  • add production env overrides for the expiration limits
  • add poller tests for attempt expiration, age expiration, failure-handler errors, and disabled limits
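The expiration check summarized in the bullets above could look roughly like the following. This is a minimal sketch, not the code from this PR: `ExpirationLimits`, `expiration_reason`, and the environment variable names `CRON_MAX_ATTEMPTS` and `CRON_MAX_AGE_SECONDS` are hypothetical, and treating `0` as "limit disabled" and using a strict "greater than" comparison are assumptions.

```python
import os
import time
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ExpirationLimits:
    # Hypothetical container. In this sketch a value of 0 disables that limit.
    max_attempts: int = 0
    max_age_seconds: float = 0.0

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ExpirationLimits":
        # Hypothetical variable names; the PR does not list the real ones.
        return cls(
            max_attempts=int(env.get("CRON_MAX_ATTEMPTS", "0")),
            max_age_seconds=float(env.get("CRON_MAX_AGE_SECONDS", "0")),
        )


def expiration_reason(
    attempts: int,
    created_at: float,
    limits: ExpirationLimits,
    now: Optional[float] = None,
) -> Optional[str]:
    """Return why a record should expire, or None if it should be retried."""
    now = time.time() if now is None else now
    # Attempt limit: expire once the record has exceeded the allowed attempts.
    if limits.max_attempts > 0 and attempts > limits.max_attempts:
        return "max_attempts"
    # Age limit: expire once the record is older than the threshold.
    if limits.max_age_seconds > 0 and now - created_at > limits.max_age_seconds:
        return "max_age"
    return None
```

Returning a reason string rather than a boolean lets the caller pass the cause along to whatever failure handling it uses.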

Validation

  • python3 -m unittest tests.test_poll_runs
  • python3 -m pytest tests — 344 passed, 1 local urllib3/LibreSSL warning
  • Production deploy smoke check confirmed high-attempt stale records were pruned and remaining queue records are below the new thresholds

Refs #437

Conversation: https://staging.warp.dev/conversation/29bdb7c4-b503-41bc-8776-0872d193af48

Co-Authored-By: Oz <oz-agent@warp.dev>

Add max-attempt and max-age limits for in-flight Oz runs so the cron poller can drain stale queue records instead of retrying them indefinitely.

Co-Authored-By: Oz <oz-agent@warp.dev>
@oz-for-oss
Contributor

oz-for-oss Bot commented May 6, 2026

@captainsafia

I'm starting a first review of this pull request.

You can view the conversation on Warp.

I completed the review and no human review was requested for this pull request.

Comment /oz-review on this pull request to retrigger a review (up to 3 times on the same pull request).

Powered by Oz

Contributor

@oz-for-oss oz-for-oss Bot left a comment


Overview

The PR adds expiration limits for in-flight cron records by attempt count and age, routes expired records through the workflow failure handler, and covers the new paths with poller tests.

Concerns

  • No blocking correctness or security concerns found in the annotated diff.

Verdict

Found: 0 critical, 0 important, 0 suggestions

Approve


@captainsafia
Collaborator Author

@peicodes @vkodithala This applies a fix to the cron that runs every minute to poll the status of active Oz runs and update the corresponding GitHub comments. Previously, the cron iterated through in-flight run states in a stable order and retried each one until it reached a terminal state or successfully applied the result. That caused an issue when older runs got stuck in a bad state, for example because the Oz API kept returning errors while retrieving the run/artifacts or because the generated review payload could no longer be applied to GitHub. Since we process the cache in the same order each tick, those stuck records could repeatedly consume the cron execution window and prevent newer runs from being processed. This change avoids that by expiring in-flight records after a max number of attempts or max age, routing them through the existing failure handling path, and then removing them from the queue.
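To illustrate the behavior described in the comment above, here is a minimal sketch of one poller tick over an in-memory queue. It is not the PR's implementation: `poll_once`, `try_apply`, `on_failure`, and the record shape are hypothetical, only the attempt limit is modeled, and removing the record even when the failure handler raises is an assumption made for this example.

```python
from typing import Callable, Dict, List


def poll_once(
    queue: Dict[str, dict],
    max_attempts: int,
    try_apply: Callable[[dict], bool],
    on_failure: Callable[[dict], None],
) -> List[str]:
    """Process every in-flight record once; return the keys that were expired."""
    expired: List[str] = []
    # sorted() copies the keys, so deleting from the dict while looping is safe.
    for key in sorted(queue):  # stable order on every tick
        record = queue[key]
        if max_attempts > 0 and record["attempts"] > max_attempts:
            try:
                on_failure(record)  # report the failure before cleanup
            except Exception:
                # Assumption: a handler error must not keep the record queued.
                pass
            del queue[key]
            expired.append(key)
            continue
        record["attempts"] += 1
        if try_apply(record):
            del queue[key]  # terminal state reached and result applied
    return expired
```

With a limit in place, a stuck record is retried a bounded number of times and then leaves the queue, so it can no longer consume the execution window ahead of newer runs on every tick.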
