fix(redis): timeout cachedFetchJson inflight fetcher (#3539) #3561
koala73 wants to merge 4 commits into
Conversation
…ey for isolate lifetime (#3539)

Wraps the fetcher in `cachedFetchJson` and `cachedFetchJsonWithMeta` with a 30 s `Promise.race` timeout so the inflight Map slot is guaranteed to settle even if the fetcher truly never resolves (no internal timeout, no `AbortController`).

Pre-fix, the inflight entry was only deleted in `.finally()`, so a hung fetcher persisted in the Vercel isolate forever — every subsequent caller for that key was handed the same unresolved promise. This was mostly mitigated when fetchers behaved (used `UPSTREAM_TIMEOUT_MS`), but a future caller forgetting its own timeout was a permanent hot-key trap. The cache layer is treated as the last line of defense.

Tests cover (a) a hung fetcher rejects with the timeout error and the inflight slot is released so the next caller re-fetches successfully, (b) prompt fetchers don't trip the timeout (no leaked timers / unhandled rejections), and (c) `cachedFetchJsonWithMeta` enforces the same bound.
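For orientation, a minimal sketch of the pattern that commit message describes — an assumed shape, not the literal contents of server/_shared/redis.ts; `getCachedJson` and `setCachedJson` stand in for the module's Redis helpers, and the exact option handling in the real file may differ:

// Sketch only: inflight dedup plus a Promise.race timeout as the last line of defense.
declare function getCachedJson<T>(key: string): Promise<T | null>;
declare function setCachedJson(key: string, value: unknown): Promise<void>;

const fetcherTimeoutMs = 30_000;
const inflight = new Map<string, Promise<unknown>>();

function withFetcherTimeout<T>(promise: Promise<T>, key: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`cachedFetchJson timeout after ${fetcherTimeoutMs}ms for "${key}"`)),
      fetcherTimeoutMs,
    );
  });
  // Whichever side settles first wins; always clear the timer so no handle leaks.
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

async function cachedFetchJson<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
  const cached = await getCachedJson<T>(key);
  if (cached !== null) return cached;

  const existing = inflight.get(key);
  if (existing) return existing as Promise<T>; // follower reuses the leader's promise

  const promise = withFetcherTimeout(fetcher(), key)
    .then(async (result) => {
      await setCachedJson(key, result);
      return result;
    })
    .finally(() => {
      // Guaranteed to run within the timeout even if the fetcher hangs,
      // so the key can never stay poisoned for the isolate lifetime.
      inflight.delete(key);
    });

  inflight.set(key, promise);
  return promise;
}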
Greptile Summary

This PR fixes a permanent inflight-map poisoning bug (#3539) by wrapping the fetcher in a `Promise.race` timeout so a hung fetcher can no longer hold its inflight slot forever.

Confidence Score: 4/5 — safe to merge; the core fix is correct and well-tested, and only P2 style suggestions remain. No P0 or P1 issues found. The two P2 style suggestions: the error message reads the module-level timeout variable at fire-time rather than capture-time, and it hard-codes "cachedFetchJson" for both call sites. Neither affects correctness in production. No files require special attention beyond the two P2 notes on server/_shared/redis.ts.

Important Files Changed
Sequence Diagram

sequenceDiagram
participant C1 as Caller 1 (leader)
participant C2 as Caller 2 (follower)
participant CF as cachedFetchJson
participant WFT as withFetcherTimeout
participant F as Fetcher
participant T as setTimeout(30s)
participant IM as inflight Map
participant R as Redis
C1->>CF: cachedFetchJson(key)
CF->>R: getCachedJson(key) → null
CF->>WFT: withFetcherTimeout(fetcher(), key)
WFT->>F: fetcher() [already started]
WFT->>T: setTimeout(reject, 30s)
CF->>IM: inflight.set(key, promise)
CF-->>C1: promise
C2->>CF: cachedFetchJson(key)
CF->>R: getCachedJson(key) → null
CF->>IM: inflight.get(key) → existing
CF-->>C2: existing (same promise)
alt Fetcher resolves in time
F-->>WFT: result
WFT->>T: clearTimeout(timer)
WFT-->>CF: result
CF->>R: setCachedJson(key, result)
CF->>IM: inflight.delete(key)
CF-->>C1: result
CF-->>C2: result
else Fetcher hangs → timeout fires
T-->>WFT: reject(TimeoutError)
WFT-->>CF: TimeoutError
CF->>IM: inflight.delete(key)
CF-->>C1: ✗ TimeoutError
CF-->>C2: ✗ TimeoutError
Note over IM: Slot released — next caller starts fresh
end
Reviews (1): Last reviewed commit: "fix(redis): timeout inflight fetcher to ..."
const timeout = new Promise<never>((_, reject) => {
  timer = setTimeout(() => {
    reject(new Error(`cachedFetchJson timeout after ${fetcherTimeoutMs}ms for "${key}"`));
  }, fetcherTimeoutMs);
Error message captures timeout duration at fire-time, not creation-time
fetcherTimeoutMs is read inside the setTimeout callback via closure, so the message reflects the value of the variable when the timer fires, not the value passed to setTimeout. In production this never matters (the value is constant), but in tests where __setFetcherTimeoutForTests and __resetFetcherTimeoutForTests are called, if the reset races ahead of the timer firing the message could print the wrong duration. Capturing the value eagerly at call-time makes the message self-consistent with the actual delay used.
Suggested change:

-  const timeout = new Promise<never>((_, reject) => {
-    timer = setTimeout(() => {
-      reject(new Error(`cachedFetchJson timeout after ${fetcherTimeoutMs}ms for "${key}"`));
-    }, fetcherTimeoutMs);
+  const timeout = new Promise<never>((_, reject) => {
+    const timeoutMs = fetcherTimeoutMs;
+    timer = setTimeout(() => {
+      reject(new Error(`cachedFetchJson timeout after ${timeoutMs}ms for "${key}"`));
+    }, timeoutMs);
Pushing back on this one — timeoutMs in withFetcherTimeout is a local function parameter (line 287/290), captured at function-call time, not a live read against the module-level fetcherTimeoutDefaultMs. The closure inside setTimeout reads the parameter, which is immutable across the lifetime of that race. __resetFetcherTimeoutForTests mutates fetcherTimeoutDefaultMs, but that read happens at call site (line 335 / line 421) before withFetcherTimeout is invoked — by then the value is already snapshotted into the local parameter. So the message is already self-consistent with the actual delay. Closing as no-op; happy to revisit if I'm missing a vector.
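For illustration, a tiny self-contained example of that distinction (hypothetical names — `moduleLevelMs` and `schedule` are not the PR's identifiers): a default parameter is snapshotted when the function is called, so a later mutation of the module-level variable cannot change what the pending closure prints.

let moduleLevelMs = 30_000; // plays the role of the module-level default

function schedule(timeoutMs: number = moduleLevelMs): void {
  // `timeoutMs` was copied from `moduleLevelMs` at call time; the closure below
  // reads the frozen parameter, not the live module variable.
  setTimeout(() => console.log(`timeout after ${timeoutMs}ms`), timeoutMs);
}

schedule();            // snapshots 30_000 into the parameter
moduleLevelMs = 5_000; // later mutation (analogous to a test reset) has no effect
// ~30 s later the pending timer still logs "timeout after 30000ms".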
function withFetcherTimeout<T>(promise: Promise<T>, key: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`cachedFetchJson timeout after ${fetcherTimeoutMs}ms for "${key}"`));
    }, fetcherTimeoutMs);
  });
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
Error message hard-codes "cachedFetchJson" for both call sites
withFetcherTimeout is shared by both cachedFetchJson and cachedFetchJsonWithMeta, but the rejection message always says "cachedFetchJson timeout …". This makes the timeout error from cachedFetchJsonWithMeta misleading in logs. Adding a callerName parameter (or even a short tag) would make it trivial to distinguish which call site timed out in production log searches.
Fixed in 5cca589 — added a callerName: 'cachedFetchJson' | 'cachedFetchJsonWithMeta' parameter to withFetcherTimeout, both call sites now brand their own error string. Tests tightened to assert the exact prefix on each helper.
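A sketch of what the branded helper could look like after that change (an assumed shape — commit 5cca589 is authoritative; `fetcherTimeoutMs` stands in for the module-level default described in this PR):

const fetcherTimeoutMs = 30_000; // module-level default, per the PR description

type CacheHelperName = 'cachedFetchJson' | 'cachedFetchJsonWithMeta';

function withFetcherTimeout<T>(
  promise: Promise<T>,
  key: string,
  callerName: CacheHelperName, // each call site passes its own name
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${callerName} timeout after ${fetcherTimeoutMs}ms for "${key}"`)),
      fetcherTimeoutMs,
    );
  });
  // The error prefix now names the helper that actually timed out,
  // so production log searches can distinguish the two call sites.
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}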
…s for LLM/multi-stage callers

PR review: the hardcoded 30 s ceiling pre-empted callers that intentionally allow longer fetcher work. Concrete victims:

- deductSituation (deduct-situation.ts:121) wraps callLlmReasoning with DEDUCT_TIMEOUT_MS = 120_000 — at 30 s the cache wrapper rejected before the LLM's own 120 s timeout/fallback could complete; deductSituation didn't catch the rejection, so users saw a cold 500.
- analyzeStock (analyze-stock.ts:1185) chains 2× UPSTREAM_TIMEOUT_MS sequenced upstream fetches plus a 20 s LLM overlay inside one cached fetcher; the realistic worst case (~40 s) exceeds 30 s.

Fix: add opts.timeoutMs to cachedFetchJson and cachedFetchJsonWithMeta (see the sketch below). The default stays at 30 s for the common HTTP-fetch case so a forgetful new caller still can't poison the inflight Map. Long-running callers pass an explicit budget set above their internal bound (deduct: 125 s, analyze-stock: 60 s) so the safety net is the last resort, not the active enforcer. Also added test coverage for the override path on both helpers.
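A hedged sketch of the override path for one of the long-running callers named above — the helper signatures (`cachedFetchJson` options object, `callLlmReasoning`) and the key shape are assumptions for illustration, not the committed code:

// Assumed signatures for illustration only; the real module may differ.
declare function cachedFetchJson<T>(
  key: string,
  fetcher: () => Promise<T>,
  opts?: { timeoutMs?: number },
): Promise<T>;
declare function callLlmReasoning(prompt: string, opts: { timeoutMs: number }): Promise<string>;

const DEDUCT_TIMEOUT_MS = 120_000; // the fetcher's own LLM bound, per the commit message

async function deductSituationExample(situationId: string, prompt: string): Promise<string> {
  // The cache-layer ceiling (125 s) sits above the fetcher's internal 120 s bound,
  // so the safety net only fires if the fetcher itself never settles.
  return cachedFetchJson(
    `deduct:${situationId}`, // hypothetical key shape
    () => callLlmReasoning(prompt, { timeoutMs: DEDUCT_TIMEOUT_MS }),
    { timeoutMs: 125_000 },
  );
}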
Greptile P2 on #3561: withFetcherTimeout is shared by both cachedFetchJson and cachedFetchJsonWithMeta but the rejection always said "cachedFetchJson timeout …", making cachedFetchJsonWithMeta timeout errors misleading in logs. Add a `callerName` parameter so each call site brands its own error and Sentry log searches can grep the right helper. Tests tightened to assert the exact prefix on each helper.
Per residual-risk note on #3561 review: this helper only releases the inflight slot — the underlying fetcher promise keeps running in the background until isolate recycle. Document the limitation so the next reader knows what NOT to claim, and so a future "add real cancellation" PR has its scope laid out.
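One way that limitation could be spelled out at the helper (a hypothetical doc comment, not the committed wording):

/**
 * Races the fetcher against a timeout so the inflight Map slot always settles.
 *
 * Limitation: on timeout this only releases the inflight slot and rejects the
 * waiting callers — it does NOT cancel the underlying fetcher. That promise
 * keeps running in the background until the isolate is recycled. Real
 * cancellation (an AbortController plumbed into every fetcher) is out of scope
 * here and is left for a follow-up PR.
 */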
Summary
- Wraps the fetcher in `cachedFetchJson` and `cachedFetchJsonWithMeta` with a 30 s `Promise.race` timeout so the inflight Map slot is guaranteed to settle even if the fetcher truly never resolves (no internal timeout, no `AbortController`).
- Pre-fix, the inflight entry was only deleted in `.finally()`. A hung fetcher persisted in the Vercel isolate forever — every subsequent caller for that key was handed the same unresolved promise, permanently poisoning the key.
- Fetchers that behave and use `UPSTREAM_TIMEOUT_MS` (typically 5–15 s) never see the 30 s ceiling. The trap was a future caller forgetting to set their own timeout.
- `__setFetcherTimeoutForTests` / `__resetFetcherTimeoutForTests` exports follow the existing `__resetKeyPrefixCacheForTests` pattern so unit tests can exercise the timeout branch without sleeping for 30 s.
- The timer is `clearTimeout`'d via `.finally()` whenever the fetcher wins, so we don't leak handles or keep the isolate awake unnecessarily.

Closes #3539.
Test plan
- `npx tsx --test tests/redis-caching.test.mjs` — 21/21 pass, including 3 new cases: a hung fetcher rejects with the timeout error and the inflight slot is released; prompt fetchers don't trip the timeout (no leaked timers / unhandled rejections); `cachedFetchJsonWithMeta` enforces the same bound.
- `npx tsc --noEmit -p tsconfig.api.json` clean.
- `npx biome lint server/_shared/redis.ts tests/redis-caching.test.mjs` clean.
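As a concrete picture of that test pattern, a sketch of one case — written in TypeScript for readability even though the real suite is tests/redis-caching.test.mjs, with the import path and exact call shapes assumed rather than copied from the PR:

import test from 'node:test';
import assert from 'node:assert/strict';
import {
  cachedFetchJson,
  __setFetcherTimeoutForTests,
  __resetFetcherTimeoutForTests,
} from '../server/_shared/redis.js'; // assumed import path

test('hung fetcher times out and releases the inflight slot', async () => {
  __setFetcherTimeoutForTests(10); // exercise the timeout branch without sleeping 30 s
  try {
    const neverSettles = () => new Promise<never>(() => {});
    await assert.rejects(cachedFetchJson('test:hung-key', neverSettles), /timeout after 10ms/);

    // Slot released: a second call runs a fresh fetcher instead of reusing the hung promise.
    const value = await cachedFetchJson('test:hung-key', async () => ({ ok: true }));
    assert.deepEqual(value, { ok: true });
  } finally {
    __resetFetcherTimeoutForTests();
  }
});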