Configure BullMQ retry strategy and dead letter handling for notification jobs #569

@chrispaskvan

Summary

The BullMQ notification worker re-throws errors to let BullMQ handle retries, but no retry configuration is set. BullMQ defaults to 0 retries, meaning transient failures cause permanent notification loss with no recovery path.

Problem

In helpers/subscriber.js, the worker catches errors and re-throws them:

throw err; // Re-throw to let BullMQ handle retries

However, in helpers/publisher.js, the Queue is created with no defaultJobOptions:

this.#queue = new Queue(topic, {
    connection: client,
    // No defaultJobOptions — BullMQ defaults apply
});

BullMQ defaults:

  • attempts: 1 — no retries; the job runs once and if it fails, it is marked as failed permanently
  • backoff: none
  • removeOnFail: false — failed jobs stay in Redis indefinitely but are never retried
  • removeOnComplete: false — completed jobs stay in Redis indefinitely

Additionally, notification.controller.js has a bare catch block in #send() that swallows all errors and sends a fallback message. The job always appears successful to BullMQ, so even if retries were configured, they would never trigger.

Consequences

  • A transient Twilio API failure → permanent notification loss for that user
  • A brief Bungie outage during token refresh in NotificationController.#send() → permanent notification loss
  • A Redis hiccup during ClaimCheck.updatePhoneNumber() → notification sent but status not tracked
  • Failed jobs accumulate in Redis with no alerting, no dead letter processing, no cleanup
  • Completed jobs also accumulate in Redis, consuming memory indefinitely
  • The comment "Re-throw to let BullMQ handle retries" is misleading since retries are not configured

Proposed Solution

1. Configure defaultJobOptions on the Queue in publisher.js

Add retry configuration with exponential backoff to the Queue constructor.
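A minimal sketch of what the retry options could look like, assuming the values from the acceptance criteria. BullMQ's `attempts` counts the first run as well, so three retries at 5s, 10s, and 20s correspond to `attempts: 4`. The `retryDelays` helper is hypothetical and only reproduces BullMQ's built-in exponential formula (`delay * 2^(attemptsMade - 1)`) to make the schedule visible.

```javascript
// Retry defaults for notification jobs (values taken from the acceptance criteria).
const retryOptions = {
    attempts: 4, // total attempts: 1 initial run + 3 retries
    backoff: { type: 'exponential', delay: 5000 }, // base delay in milliseconds
};

// Hypothetical helper: lists the wait (in ms) before each retry.
function retryDelays({ attempts, backoff }) {
    return Array.from({ length: attempts - 1 }, (_, i) => backoff.delay * 2 ** i);
}

console.log(retryDelays(retryOptions)); // [ 5000, 10000, 20000 ]

// In publisher.js the object would be passed along the lines of:
// new Queue(topic, { connection: client, defaultJobOptions: retryOptions });
```

If only three total attempts are intended, `attempts: 3` would yield the 5s and 10s delays and never reach the 20s one.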

2. Refactor #send() error handling in notification.controller.js

  • Separate the "Xur has closed shop" fallback (business logic) from transient error handling (infrastructure)
  • Only catch business-logic errors (e.g., Xur not available); let transient errors propagate for BullMQ retry
  • Use BullMQ's UnrecoverableError for permanent failures (user not found, invalid phone number, Bungie account issues)
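The separation described above could be sketched as follows. This is not the actual `#send()` implementation: `buildMessage`, `deliver`, `XurUnavailableError`, and `UserNotFoundError` are hypothetical names, and `UnrecoverableError` is a local stand-in for the class BullMQ exports so that the sketch runs without dependencies.

```javascript
// Local stand-in; in the worker this would be
// `import { UnrecoverableError } from 'bullmq'`.
class UnrecoverableError extends Error {}

// Hypothetical error types; real ones might live in notification.error.js.
class XurUnavailableError extends Error {}
class UserNotFoundError extends Error {}

// Hypothetical shape of #send(): both collaborators are injected for illustration.
async function send(buildMessage, deliver) {
    let message;
    try {
        message = await buildMessage();
    } catch (err) {
        if (err instanceof XurUnavailableError) {
            // Business fallback: the job still completes successfully.
            message = 'Xur has closed shop.';
        } else if (err instanceof UserNotFoundError) {
            // Permanent failure: tell BullMQ not to retry.
            throw new UnrecoverableError(err.message);
        } else {
            // Transient or unknown failure: propagate so BullMQ can retry.
            throw err;
        }
    }
    await deliver(message);
    return message;
}
```

Only the business-logic branch produces the fallback message; every other error leaves the function, so the job is no longer reported as successful when delivery could not be prepared.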

3. Classify retryable vs. permanent errors

  • Retryable: Twilio transient errors (5xx), Bungie transient errors (429, 503), Redis connection errors, network timeouts
  • Permanent: User not found (authentication fails permanently), invalid phone number, Bungie account issues
  • Leverage the existing isTransient property on ResponseError and isTransientError() from helpers/retry.js
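One possible shape for the classification, as a sketch only. It assumes errors expose an HTTP `status` and a Node-style `code`, which may not match the project's error classes, and it is not the existing `isTransientError()` from helpers/retry.js. An explicit `isTransient` flag, when present, takes precedence.

```javascript
// Network-level error codes treated as retryable (an illustrative selection).
const NETWORK_CODES = new Set(['ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT']);

// Hypothetical classifier: true means "let BullMQ retry".
function isRetryable(err) {
    // Prefer an explicit flag, such as the one on ResponseError.
    if (typeof err.isTransient === 'boolean') return err.isTransient;
    // Connection resets, refusals, and timeouts.
    if (NETWORK_CODES.has(err.code)) return true;
    // Rate limiting (429) and server-side failures (5xx).
    const status = err.status;
    return status === 429 || (status >= 500 && status < 600);
}
```

Anything the function does not recognize is treated as permanent here; whether unknown errors should instead default to retryable is a design choice this sketch does not settle.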

4. Add dead letter alerting

  • Listen on QueueEvents for jobs whose retries are exhausted; BullMQ emits a dedicated retries-exhausted event for this, whereas failedReason on the failed event only carries the error message
  • Emit a metric to Application Insights: trackMetric("notification.job.exhausted", 1)
  • Log with full context: jobId, notificationType, phoneNumber, claimCheckNumber, error details
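A hedged sketch of the alerting hook. It swaps in the worker-level `failed` event rather than `QueueEvents`, because that event receives the job object and so gives access to the job data needed for logging. The `trackMetric(name, value)` and `log.error(context, message)` signatures, and the field names on `job.data`, are assumptions for illustration.

```javascript
// Builds a handler intended for: worker.on('failed', handler).
// `trackMetric` and `log` are injected; their signatures are assumptions.
function createExhaustedHandler({ trackMetric, log }) {
    return (job, err) => {
        // The job may be undefined if it was removed; skip while retries remain.
        if (!job || job.attemptsMade < (job.opts.attempts ?? 1)) return;

        trackMetric('notification.job.exhausted', 1);
        log.error({
            jobId: job.id,
            notificationType: job.data.notificationType,
            phoneNumber: job.data.phoneNumber,
            claimCheckNumber: job.data.claimCheckNumber,
            attemptsMade: job.attemptsMade,
            error: err.message,
        }, 'Notification job exhausted all retries');
    };
}
```

Jobs failed through `UnrecoverableError` stop before `attemptsMade` reaches `attempts`, so this check does not count them as exhausted; they would need separate handling if they should also raise an alert.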

5. Add job lifecycle cleanup

  • removeOnComplete: clean up completed jobs after 24 hours (retain max 1000)
  • removeOnFail: retain failed jobs for 7 days (max 500) for investigation
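The retention rules above might translate into the following options, as a sketch. BullMQ's `removeOnComplete` and `removeOnFail` accept an object with `age` (in seconds) and `count`; the constant and variable names are illustrative.

```javascript
const DAY_IN_SECONDS = 24 * 60 * 60;

// Retention settings intended to be merged into defaultJobOptions.
const cleanupOptions = {
    // Keep completed jobs for up to 24 hours, and at most 1000 of them.
    removeOnComplete: { age: DAY_IN_SECONDS, count: 1000 },
    // Keep failed jobs for up to 7 days, and at most 500, for investigation.
    removeOnFail: { age: 7 * DAY_IN_SECONDS, count: 500 },
};
```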

Files to Modify

| File | Change |
| --- | --- |
| helpers/publisher.js | Add defaultJobOptions with retry config to Queue constructor |
| helpers/subscriber.js | Import and use UnrecoverableError for permanent failures |
| notifications/notification.controller.js | Refactor #send() error handling to separate business fallbacks from transient errors |
| notifications/notification.error.js | Add permanent vs. transient error distinction |
| helpers/application-insights.js | Add metric for exhausted retries |

Acceptance Criteria

  • Failed notification jobs are retried up to 3 times (4 attempts in total, since BullMQ's attempts includes the first run) with exponential backoff (5s, 10s, 20s)
  • Permanent errors (user not found, invalid phone) skip retries via UnrecoverableError
  • Completed jobs are cleaned up after 24 hours (max 1000 retained)
  • Failed jobs are retained for 7 days (max 500) for investigation
  • Exhausted retries emit an Application Insights metric and a structured log entry
  • The Xur "closed shop" fallback only triggers on business-logic errors, not transient API failures
  • Unit test: enqueue a job, make worker fail twice then succeed → job completes on 3rd attempt
  • Unit test: enqueue a job with a permanent error → job fails immediately without retries
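For the two unit tests, a small helper can stand in for a flaky processor. This is a sketch of one way to drive the "fail twice then succeed" scenario; `failTimes` is a hypothetical name and is not part of BullMQ or the existing test suite.

```javascript
// Hypothetical test helper: returns a processor that throws on its first
// `failures` calls and resolves with the given result afterwards.
function failTimes(failures, result) {
    let calls = 0;
    return async () => {
        calls += 1;
        if (calls <= failures) {
            throw new Error(`simulated failure ${calls}`);
        }
        return { result, attempt: calls };
    };
}
```

A test could pass `failTimes(2, 'sent')` as the worker's processor and assert that the job completes with `attempt` equal to 3; for the permanent-error case, the processor would throw `UnrecoverableError` on its first call instead.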
