
fix(redis): timeout cachedFetchJson inflight fetcher (#3539) #3561

Open
koala73 wants to merge 4 commits into main from fix/cached-fetch-inflight-timeout

Conversation

koala73 (Owner) commented May 2, 2026

Summary

  • Wraps the fetcher in cachedFetchJson and cachedFetchJsonWithMeta with a 30 s Promise.race timeout so the inflight Map slot is guaranteed to settle even if the fetcher truly never resolves (no internal timeout, no AbortController).
  • Pre-fix, the inflight entry was only deleted in .finally(). A hung fetcher persisted in the Vercel isolate forever — every subsequent caller for that key was handed the same unresolved promise, permanently poisoning the key.
  • Treats the cache layer as the last line of defense: well-behaved fetchers using their own UPSTREAM_TIMEOUT_MS (typically 5–15 s) never see the 30 s ceiling. The trap was a future caller forgetting to set their own timeout.
  • Test-only __setFetcherTimeoutForTests / __resetFetcherTimeoutForTests exports follow the existing __resetKeyPrefixCacheForTests pattern so unit tests can exercise the timeout branch without sleeping for 30 s (a usage sketch follows this list).
  • Timer is clearTimeout'd via .finally() whenever the fetcher wins so we don't leak handles or keep the isolate awake unnecessarily.
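
For reference, a minimal sketch of how the test-only override could be exercised, assuming a (key, fetcher) call signature for cachedFetchJson, the import path shown, and a stubbed Redis layer; the real suite additionally isolates modules via importRedisFresh:

```ts
// Hypothetical test sketch; the export names come from this PR, everything
// else (path, key, signature) is assumed for illustration.
import { test } from 'node:test';
import assert from 'node:assert/strict';
import {
  cachedFetchJson,
  __setFetcherTimeoutForTests,
  __resetFetcherTimeoutForTests,
} from '../server/_shared/redis';

test('hung fetcher times out and releases the inflight slot', async () => {
  __setFetcherTimeoutForTests(50); // shrink the 30 s ceiling so the test is fast
  try {
    const hung = () => new Promise<never>(() => {}); // never settles
    await assert.rejects(cachedFetchJson('poisoned:key', hung), /timeout after 50ms/);
    // The slot was released, so a well-behaved follow-up fetcher can claim the key.
    const result = await cachedFetchJson('poisoned:key', async () => ({ ok: true }));
    assert.deepEqual(result, { ok: true });
  } finally {
    __resetFetcherTimeoutForTests();
  }
});
```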

Closes #3539.

Test plan

  • npx tsx --test tests/redis-caching.test.mjs — 21/21 pass, including 3 new cases:
    • hung fetcher rejects with the timeout error and the inflight slot is released so a follow-up caller successfully re-fetches (the core regression),
    • prompt fetchers don't trip the timeout (no leaked timers / unhandled rejections after the timeout window passes),
    • cachedFetchJsonWithMeta enforces the same bound.
  • npx tsc --noEmit -p tsconfig.api.json clean.
  • npx biome lint server/_shared/redis.ts tests/redis-caching.test.mjs clean.

…ey for isolate lifetime (#3539)

Wraps the fetcher in `cachedFetchJson` and `cachedFetchJsonWithMeta` with a
30s `Promise.race` timeout so the inflight Map slot is guaranteed to settle
even if the fetcher truly never resolves (no internal timeout, no
AbortController).

Pre-fix: the inflight entry was only deleted in `.finally()`, so a hung fetcher
persisted in the Vercel isolate forever — every subsequent caller for that
key was handed the same unresolved promise. Mostly mitigated when fetchers
behave (use `UPSTREAM_TIMEOUT_MS`), but a future caller forgetting the
timeout was a permanent hot-key trap. Treats the cache layer as the last
line of defense.

Tests cover (a) a hung fetcher rejects with the timeout error and the
inflight slot is released so the next caller re-fetches successfully, (b)
prompt fetchers don't trip the timeout (no leaked timers / unhandled
rejections), and (c) `cachedFetchJsonWithMeta` enforces the same bound.
vercel bot commented May 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| worldmonitor | Ready | Preview, Comment | May 3, 2026 4:08am |


greptile-apps bot (Contributor) commented May 2, 2026

Greptile Summary

This PR fixes a permanent inflight-map poisoning bug (#3539) by wrapping the fetcher() call in both cachedFetchJson and cachedFetchJsonWithMeta with a 30 s Promise.race timeout, ensuring the .finally() that deletes the inflight entry always fires even when the upstream hangs indefinitely. Test-only escape hatches (__setFetcherTimeoutForTests / __resetFetcherTimeoutForTests) follow the established module-override pattern and are backed by three targeted new test cases.

Confidence Score: 4/5

Safe to merge; core fix is correct and well-tested, only P2 style suggestions remain.

No P0 or P1 issues found. Two P2 style suggestions: the error message reads the module-level timeout variable at fire-time rather than capture-time, and it hard-codes "cachedFetchJson" for both call sites. Neither affects correctness in production.

No files require special attention beyond the two P2 style notes on server/_shared/redis.ts.

Important Files Changed

| Filename | Overview |
| --- | --- |
| server/_shared/redis.ts | Adds withFetcherTimeout helper wrapping both cachedFetchJson and cachedFetchJsonWithMeta fetcher calls; two minor P2 issues: the error message captures fetcherTimeoutMs at fire-time and hard-codes "cachedFetchJson" even for the WithMeta call site. |
| tests/redis-caching.test.mjs | Adds three well-structured test cases covering hung fetcher rejection, clean timer clearance on fast resolution, and the WithMeta variant; test isolation via importRedisFresh and try/finally resets is solid. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant C1 as Caller 1 (leader)
    participant C2 as Caller 2 (follower)
    participant CF as cachedFetchJson
    participant WFT as withFetcherTimeout
    participant F as Fetcher
    participant T as setTimeout(30s)
    participant IM as inflight Map
    participant R as Redis

    C1->>CF: cachedFetchJson(key)
    CF->>R: getCachedJson(key) → null
    CF->>WFT: withFetcherTimeout(fetcher(), key)
    WFT->>F: fetcher() [already started]
    WFT->>T: setTimeout(reject, 30s)
    CF->>IM: inflight.set(key, promise)
    CF-->>C1: promise

    C2->>CF: cachedFetchJson(key)
    CF->>R: getCachedJson(key) → null
    CF->>IM: inflight.get(key) → existing
    CF-->>C2: existing (same promise)

    alt Fetcher resolves in time
        F-->>WFT: result
        WFT->>T: clearTimeout(timer)
        WFT-->>CF: result
        CF->>R: setCachedJson(key, result)
        CF->>IM: inflight.delete(key)
        CF-->>C1: result
        CF-->>C2: result
    else Fetcher hangs → timeout fires
        T-->>WFT: reject(TimeoutError)
        WFT-->>CF: TimeoutError
        CF->>IM: inflight.delete(key)
        CF-->>C1: ✗ TimeoutError
        CF-->>C2: ✗ TimeoutError
        Note over IM: Slot released — next caller starts fresh
    end
```


Comment thread: server/_shared/redis.ts (outdated)

Comment on lines +287 to +290:

```ts
const timeout = new Promise<never>((_, reject) => {
  timer = setTimeout(() => {
    reject(new Error(`cachedFetchJson timeout after ${fetcherTimeoutMs}ms for "${key}"`));
  }, fetcherTimeoutMs);
```
greptile-apps bot (Contributor):

P2 Error message captures timeout duration at fire-time, not creation-time

fetcherTimeoutMs is read inside the setTimeout callback via closure, so the message reflects the value of the variable when the timer fires, not the value passed to setTimeout. In production this never matters (the value is constant), but in tests where __setFetcherTimeoutForTests and __resetFetcherTimeoutForTests are called, if the reset races ahead of the timer firing the message could print the wrong duration. Capturing the value eagerly at call-time makes the message self-consistent with the actual delay used.

Suggested change — before:

```ts
const timeout = new Promise<never>((_, reject) => {
  timer = setTimeout(() => {
    reject(new Error(`cachedFetchJson timeout after ${fetcherTimeoutMs}ms for "${key}"`));
  }, fetcherTimeoutMs);
```

After:

```ts
const timeout = new Promise<never>((_, reject) => {
  const timeoutMs = fetcherTimeoutMs;
  timer = setTimeout(() => {
    reject(new Error(`cachedFetchJson timeout after ${timeoutMs}ms for "${key}"`));
  }, timeoutMs);
```

koala73 (Owner, Author) replied:

Pushing back on this one — timeoutMs in withFetcherTimeout is a local function parameter (line 287/290), captured at function-call time, not a live read against the module-level fetcherTimeoutDefaultMs. The closure inside setTimeout reads the parameter, which is immutable across the lifetime of that race. __resetFetcherTimeoutForTests mutates fetcherTimeoutDefaultMs, but that read happens at call site (line 335 / line 421) before withFetcherTimeout is invoked — by then the value is already snapshotted into the local parameter. So the message is already self-consistent with the actual delay. Closing as no-op; happy to revisit if I'm missing a vector.
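
For readers outside the diff, a standalone illustration of the distinction being argued here (hypothetical names, not the PR's code):

```ts
// A function parameter is snapshotted at invocation; a closure over a
// module-level variable reads whatever it holds when the callback fires.
let moduleTimeoutMs = 30_000;

function withTimeout(timeoutMs: number): void {
  // `timeoutMs` was frozen when withTimeout was called; later mutation of
  // moduleTimeoutMs cannot change what this closure sees.
  setTimeout(() => console.log(`param: ${timeoutMs}`), 10); // "param: 30000"
}

withTimeout(moduleTimeoutMs); // snapshots 30000 into the parameter
setTimeout(() => console.log(`module: ${moduleTimeoutMs}`), 10); // "module: 50"
moduleTimeoutMs = 50; // mutated synchronously, before either timer fires
```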

Comment thread: server/_shared/redis.ts (outdated)

Comment on lines +285 to +295:

```ts
function withFetcherTimeout<T>(promise: Promise<T>, key: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`cachedFetchJson timeout after ${fetcherTimeoutMs}ms for "${key}"`));
    }, fetcherTimeoutMs);
  });
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```
greptile-apps bot (Contributor):

P2 Error message hard-codes "cachedFetchJson" for both call sites

withFetcherTimeout is shared by both cachedFetchJson and cachedFetchJsonWithMeta, but the rejection message always says "cachedFetchJson timeout …". This makes the timeout error from cachedFetchJsonWithMeta misleading in logs. Adding a callerName parameter (or even a short tag) would make it trivial to distinguish which call site timed out in production log searches.

koala73 (Owner, Author) replied:

Fixed in 5cca589 — added a callerName: 'cachedFetchJson' | 'cachedFetchJsonWithMeta' parameter to withFetcherTimeout, both call sites now brand their own error string. Tests tightened to assert the exact prefix on each helper.
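
A minimal sketch of the described fix, assuming the timeoutMs-as-parameter shape from the pushback above; the actual code in 5cca589 may differ in detail:

```ts
// Hypothetical reconstruction: callerName brands the error per call site.
type CallerName = 'cachedFetchJson' | 'cachedFetchJsonWithMeta';

function withFetcherTimeout<T>(
  promise: Promise<T>,
  key: string,
  callerName: CallerName,
  timeoutMs: number,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Branding lets log searches distinguish which helper timed out.
      reject(new Error(`${callerName} timeout after ${timeoutMs}ms for "${key}"`));
    }, timeoutMs);
  });
  return Promise.race([promise, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```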

…s for LLM/multi-stage callers

PR review: hardcoded 30s ceiling pre-empted callers that intentionally allow
longer fetcher work. Concrete victims:

- deductSituation (deduct-situation.ts:121) wraps callLlmReasoning with
  DEDUCT_TIMEOUT_MS = 120_000 — at 30s the cache wrapper rejected before the
  LLM's own 120s timeout/fallback could complete; deductSituation didn't
  catch the rejection so users saw a cold 500.
- analyzeStock (analyze-stock.ts:1185) chains 2x UPSTREAM_TIMEOUT_MS sequenced
  upstream fetches plus a 20s LLM overlay inside one cached fetcher; realistic
  worst-case ~40s exceeds 30s.

Fix: add opts.timeoutMs to cachedFetchJson and cachedFetchJsonWithMeta.
Default stays at 30s for the common HTTP-fetch case so a forgetful new caller
still can't poison the inflight Map. Long-running callers pass an explicit
budget set above their internal bound (deduct: 125s, analyze-stock: 60s) so
the safety net is the LAST resort, not the active enforcer.

Also added test coverage for the override path on both helpers.
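
As a hypothetical illustration of the override (the opts ordering and stubs below are assumed; DEDUCT_TIMEOUT_MS and callLlmReasoning are the names cited above):

```ts
// Stubs so the sketch type-checks; the real signatures live in the codebase.
declare function callLlmReasoning(prompt: string): Promise<unknown>;
declare function cachedFetchJson<T>(
  key: string,
  fetcher: () => Promise<T>,
  opts?: { timeoutMs?: number },
): Promise<T>;

const DEDUCT_TIMEOUT_MS = 120_000; // deduct-situation's own LLM bound

const situation = await cachedFetchJson(
  'deduct:situation',                       // illustrative key
  () => callLlmReasoning('…'),              // fetcher with its own 120 s timeout/fallback
  { timeoutMs: DEDUCT_TIMEOUT_MS + 5_000 }, // cache ceiling sits above the internal bound
);
```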
koala73 added 2 commits May 3, 2026 08:01
Greptile P2 on #3561: withFetcherTimeout is shared by both cachedFetchJson
and cachedFetchJsonWithMeta but the rejection always said "cachedFetchJson
timeout …", making cachedFetchJsonWithMeta timeout errors misleading in
logs. Add a `callerName` parameter so each call site brands its own error
and Sentry log searches can grep the right helper.

Tests tightened to assert the exact prefix on each helper.
Per residual-risk note on #3561 review: this helper only releases the
inflight slot — the underlying fetcher promise keeps running in the
background until isolate recycle. Document the limitation so the next
reader knows what NOT to claim, and so a future "add real cancellation"
PR has its scope laid out.
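
The gist of that limitation, sketched as the kind of comment the commit adds (wording assumed, not quoted):

```ts
/**
 * NOTE: the timeout only releases the inflight Map slot when the race loses.
 * It does NOT cancel the underlying fetcher, which keeps running in the
 * background until the isolate recycles. Real cancellation (e.g. threading
 * an AbortController through to the fetcher) is a separate, future change.
 */
```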


Development

Successfully merging this pull request may close these issues.

cachedFetchJson inflight Map has no timeout — hung fetcher poisons key for isolate lifetime
