Configure BullMQ retry strategy and dead letter handling for notification jobs #569

## Summary

The BullMQ notification worker re-throws errors so that BullMQ can retry them, but no retry options are configured. BullMQ's default is `attempts: 1` (a single run, zero retries), so a transient failure causes permanent notification loss with no recovery path.

## Problem

In `helpers/subscriber.js`, the worker catches errors and re-throws them:

```js
throw err; // Re-throw to let BullMQ handle retries
```

However, in `helpers/publisher.js`, the Queue is created with no `defaultJobOptions`:

```js
this.#queue = new Queue(topic, {
    connection: client,
    // No defaultJobOptions — BullMQ defaults apply
});
```

**BullMQ defaults:**

- `attempts: 1` — no retries; the job runs once and if it fails, it is marked as failed permanently
- `backoff: none`
- `removeOnFail: false` — failed jobs stay in Redis indefinitely but are never retried
- `removeOnComplete: false` — completed jobs stay in Redis indefinitely

**Additionally**, `notification.controller.js` has a bare `catch` block in `#send()` that swallows *all* errors and sends a fallback message. The job always appears successful to BullMQ, so even if retries were configured, they would never trigger.

### Consequences

- A transient Twilio API failure → permanent notification loss for that user
- A brief Bungie outage during token refresh in `NotificationController.#send()` → permanent notification loss
- A Redis hiccup during `ClaimCheck.updatePhoneNumber()` → notification sent but status not tracked
- Failed jobs accumulate in Redis with no alerting, no dead letter processing, no cleanup
- Completed jobs also accumulate in Redis, consuming memory indefinitely
- The comment "Re-throw to let BullMQ handle retries" is misleading since retries are not configured

## Proposed Solution

### 1. Configure `defaultJobOptions` on the Queue in `publisher.js`

Add retry configuration with exponential backoff to the Queue constructor.
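
A minimal sketch of what this configuration might look like, using the values from the acceptance criteria. It assumes "retried up to 3 times" means four attempts in total, and that BullMQ's built-in exponential strategy waits `delay * 2^(attemptsMade - 1)` before each retry. `expectedBackoffDelays` is a hypothetical helper, included only to sanity-check the schedule.

```js
// Assumed values: 1 initial run + 3 retries, 5s base delay.
const defaultJobOptions = {
    attempts: 4,
    backoff: { type: "exponential", delay: 5000 },
};

// Hypothetical helper: lists the wait before each retry, assuming
// BullMQ's exponential strategy is delay * 2^(attemptsMade - 1).
function expectedBackoffDelays({ attempts, backoff }) {
    const delays = [];
    for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
        delays.push(backoff.delay * 2 ** (attemptsMade - 1));
    }
    return delays;
}

console.log(expectedBackoffDelays(defaultJobOptions)); // → [ 5000, 10000, 20000 ]
```

The object would then be passed next to `connection` in the existing constructor call, as `new Queue(topic, { connection: client, defaultJobOptions })`.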

### 2. Refactor `#send()` error handling in `notification.controller.js`

- Separate the "Xur has closed shop" fallback (business logic) from transient error handling (infrastructure)
- Only catch business-logic errors (e.g., Xur not available); let transient errors propagate for BullMQ retry
- Use BullMQ's `UnrecoverableError` for permanent failures (user not found, invalid phone number, Bungie account issues)
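
One possible shape for the refactored handler is sketched below. It is not the current `#send()` implementation: `XurUnavailableError`, `buildMessage`, and `deliver` are hypothetical names standing in for the business-logic error, the Bungie lookup, and the Twilio call, and the fallback wording is illustrative.

```js
// Hypothetical error type marking the business condition "Xur is not available".
class XurUnavailableError extends Error {}

// buildMessage and deliver are injected stand-ins (assumptions) for the
// Bungie lookup and the Twilio send performed by the real controller.
async function send({ buildMessage, deliver }) {
    let message;
    try {
        message = await buildMessage();
    } catch (err) {
        if (!(err instanceof XurUnavailableError)) {
            // Transient or unknown failure: propagate so the job fails
            // and BullMQ can apply the configured retry policy.
            throw err;
        }
        // Business-logic fallback only.
        message = "Xur has closed shop.";
    }
    // Delivery errors are deliberately not caught here.
    await deliver(message);
    return message;
}
```

The important difference from a bare `catch` is that only the named business error produces the fallback; every other error reaches the worker and marks the job as failed.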

### 3. Classify retryable vs. permanent errors

- **Retryable:** Twilio transient errors (5xx), Bungie transient errors (429, 503), Redis connection errors, network timeouts
- **Permanent:** User not found (authentication fails permanently), invalid phone number, Bungie account issues
- Leverage the existing `isTransient` property on `ResponseError` and `isTransientError()` from `helpers/retry.js`
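
The classification above could be expressed as a small pure function, sketched here under assumptions: errors are taken to expose an optional `isTransient` flag, a numeric `status`, or a Node-style `code`, and `classifyError` is a hypothetical name rather than the existing `isTransientError()` helper. Unknown errors default to retryable because the attempt limit bounds the cost of a wrong guess.

```js
// Node-style codes treated as connection or timeout problems (assumed list).
const RETRYABLE_CODES = new Set(["ECONNREFUSED", "ECONNRESET", "ETIMEDOUT"]);

// Hypothetical classifier: returns "retryable" or "permanent".
function classifyError(err) {
    // An explicit flag (as on ResponseError) wins over any heuristic.
    if (err.isTransient === true) return "retryable";
    if (err.isTransient === false) return "permanent";

    if (typeof err.status === "number") {
        // 429 and 5xx are worth retrying; other 4xx responses are not.
        if (err.status === 429 || err.status >= 500) return "retryable";
        if (err.status >= 400) return "permanent";
    }

    if (RETRYABLE_CODES.has(err.code)) return "retryable";

    // Unknown shape: retry, since attempts are capped.
    return "retryable";
}
```

In the worker, a `"permanent"` result would then be rethrown as BullMQ's `UnrecoverableError` so that the remaining attempts are skipped.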

### 4. Add dead letter alerting

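- Listen for the `failed` event (on the worker or via `QueueEvents`) and treat a job as exhausted once its `attemptsMade` has reached the configured `attempts`; `failedReason` holds only the error message, so it cannot signal exhaustion by itself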
- Emit a metric to Application Insights: `trackMetric("notification.job.exhausted", 1)`
- Log with full context: `jobId`, `notificationType`, `phoneNumber`, `claimCheckNumber`, error details
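
A rough sketch of such a handler follows, written for the worker's `failed` event, which receives the job and the error. It assumes `attemptsMade` has already been incremented for the failed run when the event fires, that the context fields live on `job.data`, and that `trackMetric` and `logger` are injected wrappers; all of these should be checked against the real helpers.

```js
// True once the job has used every configured attempt.
function isExhausted(job) {
    const maxAttempts = job.opts?.attempts ?? 1;
    return job.attemptsMade >= maxAttempts;
}

// Hypothetical factory: trackMetric(name, value) and logger.error(obj)
// stand in for the Application Insights and logging helpers.
function createFailedHandler({ trackMetric, logger }) {
    return (job, err) => {
        // The job may be missing, or may still have retries left.
        if (!job || !isExhausted(job)) return;

        trackMetric("notification.job.exhausted", 1);
        logger.error({
            jobId: job.id,
            notificationType: job.data?.notificationType,
            phoneNumber: job.data?.phoneNumber,
            claimCheckNumber: job.data?.claimCheckNumber,
            attemptsMade: job.attemptsMade,
            error: err?.message,
        });
    };
}
```

It would be registered with something like `worker.on("failed", createFailedHandler({ trackMetric, logger }))`. Jobs that fail through `UnrecoverableError` stop before their attempts are used up, so they would need a separate check if they should also raise an alert.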

### 5. Add job lifecycle cleanup

- `removeOnComplete`: clean up completed jobs after 24 hours (retain max 1000)
- `removeOnFail`: retain failed jobs for 7 days (max 500) for investigation
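
These retention rules map onto BullMQ's keep-jobs options, where `age` is expressed in seconds and `count` caps how many jobs are kept. The fragment below is a sketch of values only; `cleanupOptions` is a hypothetical name, and the object is meant to be merged into `defaultJobOptions`.

```js
const HOUR_SECONDS = 60 * 60;
const DAY_SECONDS = 24 * HOUR_SECONDS;

// Hypothetical fragment to merge into defaultJobOptions.
const cleanupOptions = {
    // Completed jobs: keep at most 1000, none older than 24 hours.
    removeOnComplete: { age: 24 * HOUR_SECONDS, count: 1000 },
    // Failed jobs: keep at most 500, none older than 7 days.
    removeOnFail: { age: 7 * DAY_SECONDS, count: 500 },
};
```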

## Files to Modify

| File | Change |
|------|--------|
| `helpers/publisher.js` | Add `defaultJobOptions` with retry config to Queue constructor |
| `helpers/subscriber.js` | Import and use `UnrecoverableError` for permanent failures |
| `notifications/notification.controller.js` | Refactor `#send()` error handling to separate business fallbacks from transient errors |
| `notifications/notification.error.js` | Add permanent vs. transient error distinction |
| `helpers/application-insights.js` | Add metric for exhausted retries |

## Acceptance Criteria

- [ ] Failed notification jobs are retried up to 3 times with exponential backoff (5s, 10s, 20s)
- [ ] Permanent errors (user not found, invalid phone) skip retries via `UnrecoverableError`
- [ ] Completed jobs are cleaned up after 24 hours (max 1000 retained)
- [ ] Failed jobs are retained for 7 days (max 500) for investigation
- [ ] Exhausted retries emit an Application Insights metric and a structured log entry
- [ ] The Xur "closed shop" fallback only triggers on business-logic errors, not transient API failures
- [ ] Unit test: enqueue a job, make worker fail twice then succeed → job completes on 3rd attempt
- [ ] Unit test: enqueue a job with a permanent error → job fails immediately without retries

