Summary
The BullMQ notification worker re-throws errors to let BullMQ handle retries, but no retry configuration is set. BullMQ defaults to 0 retries, meaning transient failures cause permanent notification loss with no recovery path.
Problem
In helpers/subscriber.js, the worker catches errors and re-throws them:
throw err; // Re-throw to let BullMQ handle retries
However, in helpers/publisher.js, the Queue is created with no defaultJobOptions:
this.#queue = new Queue(topic, {
  connection: client,
  // No defaultJobOptions, so BullMQ defaults apply
});
BullMQ defaults:
- attempts: 1 (no retries; the job runs once and, if it fails, is marked as failed permanently)
- backoff: none
- removeOnFail: false (failed jobs stay in Redis indefinitely but are never retried)
- removeOnComplete: false (completed jobs stay in Redis indefinitely)
Additionally, notification.controller.js has a bare catch block in #send() that swallows all errors and sends a fallback message. The job always appears successful to BullMQ, so even if retries were configured, they would never trigger.
Consequences
- A transient Twilio API failure → permanent notification loss for that user
- A brief Bungie outage during token refresh in NotificationController.#send() → permanent notification loss
- A Redis hiccup during ClaimCheck.updatePhoneNumber() → notification sent but status not tracked
- Failed jobs accumulate in Redis with no alerting, no dead letter processing, no cleanup
- Completed jobs also accumulate in Redis, consuming memory indefinitely
- The comment "Re-throw to let BullMQ handle retries" is misleading since retries are not configured
Proposed Solution
1. Configure defaultJobOptions on the Queue in publisher.js
Add retry configuration with exponential backoff to the Queue constructor.
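A minimal sketch of what that configuration could look like. The values (5 attempts, 1-second base delay) are assumptions for illustration, not figures agreed in this issue, and `retryDelays` is a hypothetical helper that only exists here to show the resulting schedule. It assumes BullMQ's built-in exponential strategy, which waits `delay * 2^(attemptsMade - 1)` before each retry.

```javascript
// Hypothetical values: 5 attempts and a 1-second base delay are assumptions.
const defaultJobOptions = {
  attempts: 5, // 1 initial run + up to 4 retries
  backoff: { type: "exponential", delay: 1000 }, // base delay in ms
};

// Delay before each retry under the exponential strategy:
// delay * 2^(attemptsMade - 1).
function retryDelays({ attempts, backoff }) {
  const delays = [];
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(backoff.delay * 2 ** (attemptsMade - 1));
  }
  return delays;
}

console.log(retryDelays(defaultJobOptions)); // [ 1000, 2000, 4000, 8000 ]

// In publisher.js the object would sit next to the existing connection:
// this.#queue = new Queue(topic, { connection: client, defaultJobOptions });
```

With these numbers a job that keeps failing is given up on after roughly 15 seconds in total, which may be too short for a longer Bungie or Twilio outage; the base delay is the knob to tune.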
2. Refactor #send() error handling in notification.controller.js
- Separate the "Xur has closed shop" fallback (business logic) from transient error handling (infrastructure)
- Only catch business-logic errors (e.g., Xur not available); let transient errors propagate for BullMQ retry
- Use BullMQ's UnrecoverableError for permanent failures (user not found, invalid phone number, Bungie account issues)
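One possible shape for the narrowed catch, as a sketch only. `XurUnavailableError`, `buildMessage`, `deliver`, and `fallbackMessage` are hypothetical names introduced so the example runs standalone; the real #send() will have its own structure.

```javascript
// Hypothetical error type marking the "Xur has closed shop" business case.
class XurUnavailableError extends Error {}

// Sketch of the refactored #send(); dependencies are injected for illustration.
async function send({ buildMessage, deliver, fallbackMessage }) {
  let message;
  try {
    message = await buildMessage();
  } catch (err) {
    // Business fallback only. Anything else propagates, so the job fails
    // and BullMQ can retry it.
    if (!(err instanceof XurUnavailableError)) throw err;
    message = fallbackMessage;
  }
  return deliver(message);
}
```

The point of the design is that the fallback message is sent only when the failure is a known business condition; a Twilio or Bungie outage now surfaces as a failed job instead of a silent fallback.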
3. Classify retryable vs. permanent errors
- Retryable: Twilio transient errors (5xx), Bungie transient errors (429, 503), Redis connection errors, network timeouts
- Permanent: User not found (authentication fails permanently), invalid phone number, Bungie account issues
- Leverage the existing isTransient property on ResponseError and isTransientError() from helpers/retry.js
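The classification could be applied in one place by wrapping the job handler, roughly as below. This is a sketch under assumptions: `isRetryable` is a stand-in for the existing isTransientError() (whose signature is not shown in this issue), the network error codes are illustrative, and the local `UnrecoverableError` class only exists so the example runs without dependencies. The real code would import it from `bullmq`.

```javascript
// Stand-in so the sketch runs standalone; the real code would use
// `const { UnrecoverableError } = require("bullmq");`
class UnrecoverableError extends Error {
  constructor(message) {
    super(message);
    this.name = "UnrecoverableError";
  }
}

// Hypothetical classifier. Assumes errors expose `isTransient` (as ResponseError
// does) or a Node-style `code` for network failures.
const TRANSIENT_CODES = new Set(["ECONNRESET", "ECONNREFUSED", "ETIMEDOUT"]);
function isRetryable(err) {
  return err.isTransient === true || TRANSIENT_CODES.has(err.code);
}

// Wrap the job handler so permanent failures skip the remaining attempts.
function withErrorClassification(handler) {
  return async (job) => {
    try {
      return await handler(job);
    } catch (err) {
      if (isRetryable(err)) throw err; // BullMQ schedules the next attempt
      throw new UnrecoverableError(err.message); // fail without further retries
    }
  };
}
```

Treating unknown errors as permanent is a deliberate choice in this sketch; the opposite default (retry unless known to be permanent) is equally defensible and may suit notifications better.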
4. Add dead letter alerting
- Listen for the failed event on QueueEvents where failedReason indicates all retries are exhausted
- Emit a metric to Application Insights: trackMetric("notification.job.exhausted", 1)
- Log with full context: jobId, notificationType, phoneNumber, claimCheckNumber, error details
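A sketch of the exhaustion check and the alert it triggers. It assumes the handler receives the job object and the error, as a Worker `failed` listener does; a QueueEvents `failed` listener receives only the job id and failedReason, so the job would need to be loaded first. The `job.data` field names, `trackMetric`, and `logger` are injected stand-ins rather than the project's actual helpers.

```javascript
// True when a failed job has no attempts left, or failed with UnrecoverableError.
function isExhausted(job, err) {
  const maxAttempts = job.opts?.attempts ?? 1;
  return job.attemptsMade >= maxAttempts || err?.name === "UnrecoverableError";
}

// Hypothetical handler; returns true when an alert was emitted.
function onJobFailed(job, err, { trackMetric, logger }) {
  if (!job || !isExhausted(job, err)) return false;
  trackMetric("notification.job.exhausted", 1);
  logger.error("Notification job exhausted all retries", {
    jobId: job.id,
    notificationType: job.data?.notificationType,
    phoneNumber: job.data?.phoneNumber,
    claimCheckNumber: job.data?.claimCheckNumber,
    error: err?.message,
  });
  return true;
}

// Possible wiring in subscriber.js (not executed here):
// worker.on("failed", (job, err) => onJobFailed(job, err, { trackMetric, logger }));
```

Failures that still have attempts left are ignored, so the metric counts lost notifications rather than individual failed attempts.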
5. Add job lifecycle cleanup
- removeOnComplete: clean up completed jobs after 24 hours (retain max 1000)
- removeOnFail: retain failed jobs for 7 days (max 500) for investigation
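The retention rules above map onto BullMQ's keep-jobs object form, where `age` is expressed in seconds and `count` caps how many jobs are kept. A sketch, with `lifecycleOptions` and `HOUR` as illustrative names:

```javascript
const HOUR = 60 * 60; // `age` is expressed in seconds

const lifecycleOptions = {
  removeOnComplete: { age: 24 * HOUR, count: 1000 }, // 24 hours, max 1000 jobs
  removeOnFail: { age: 7 * 24 * HOUR, count: 500 }, // 7 days, max 500 jobs
};

console.log(lifecycleOptions.removeOnComplete.age); // 86400
console.log(lifecycleOptions.removeOnFail.age); // 604800
```

These keys would be added to the Queue's defaultJobOptions so that every job inherits them.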
Files to Modify
| File | Change |
| --- | --- |
| helpers/publisher.js | Add defaultJobOptions with retry config to Queue constructor |
| helpers/subscriber.js | Import and use UnrecoverableError for permanent failures |
| notifications/notification.controller.js | Refactor #send() error handling to separate business fallbacks from transient errors |
| notifications/notification.error.js | Add permanent vs. transient error distinction |
| helpers/application-insights.js | Add metric for exhausted retries |
Acceptance Criteria