feat(aiobs): dedicated AI endpoint routing#3773
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Prompt To Fix All With AIFix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
packages/node/src/__tests__/dedicated-ai-endpoint.spec.ts:40-77
**Prefer parameterised tests for the routing cases**
The first test ("routes batched `$ai_*` events to the dedicated AI path") and the last test ("routes `$ai_*` events to the normal batch path when disabled") cover the exact same scenario — one `$ai_generation` event captured via `capture()` / `shutdown()` — differing only in the `_internal_dedicatedAiEndpoint` flag and the expected URL. Per the team's style, these should be consolidated into a single `it.each` table. Similarly, a disabled-path counterpart for the `captureImmediate` case is absent; adding it to an `it.each` would close that coverage gap at the same time.
### Issue 2 of 2
packages/core/src/posthog-core-stateless.ts:1198-1215
**Second-channel error silently dropped when both channels fail**
`firstError ??= err` captures only the first error encountered. If `DEFAULT_EVENT_CHANNEL` flushes successfully but `AI_EVENT_CHANNEL` throws, the error is rethrown correctly. However, if both channels fail (the analytics flush throws first), the AI channel's error is silently discarded. Callers see the analytics error and have no indication that AI events were also undelivered. Using `AggregateError` (or a simple array-join) would preserve both failure reasons without changing the throw semantics.
Reviews (1): Last reviewed commit: "feat: dedicated ai endpoint routing" | Re-trigger Greptile |
| it('routes batched $ai_* events to the dedicated AI path', async () => { | ||
| const ph = newClient({ _internal_dedicatedAiEndpoint: true }) | ||
| ph.capture({ distinctId: 'u1', event: '$ai_generation', properties: { $ai_model: 'gpt-4' } }) | ||
| await ph.shutdown() | ||
|
|
||
| expect(batchSentTo(AI_URL)?.map((e: any) => e.event)).toEqual(['$ai_generation']) | ||
| expect(callsTo(ANALYTICS_URL)).toHaveLength(0) | ||
| }) | ||
|
|
||
| it('keeps non-AI events on the normal path, in a separate batch from AI events', async () => { | ||
| const ph = newClient({ _internal_dedicatedAiEndpoint: true }) | ||
| ph.capture({ distinctId: 'u1', event: '$ai_generation', properties: {} }) | ||
| ph.capture({ distinctId: 'u1', event: 'button_clicked', properties: {} }) | ||
| await ph.shutdown() | ||
|
|
||
| expect(batchSentTo(AI_URL)?.map((e: any) => e.event)).toEqual(['$ai_generation']) | ||
| expect(batchSentTo(ANALYTICS_URL)?.map((e: any) => e.event)).toEqual(['button_clicked']) | ||
| }) | ||
|
|
||
| it('routes immediate $ai_* captures to the dedicated AI path', async () => { | ||
| const ph = newClient({ _internal_dedicatedAiEndpoint: true }) | ||
| await ph.captureImmediate({ distinctId: 'u1', event: '$ai_embedding', properties: {} }) | ||
|
|
||
| expect(batchSentTo(AI_URL)?.[0].event).toBe('$ai_embedding') | ||
| expect(callsTo(ANALYTICS_URL)).toHaveLength(0) | ||
| await ph.shutdown() | ||
| }) | ||
| }) | ||
|
|
||
| describe('when disabled (default)', () => { | ||
| it('routes $ai_* events to the normal batch path', async () => { | ||
| const ph = newClient() | ||
| ph.capture({ distinctId: 'u1', event: '$ai_generation', properties: {} }) | ||
| await ph.shutdown() | ||
|
|
||
| expect(batchSentTo(ANALYTICS_URL)?.map((e: any) => e.event)).toEqual(['$ai_generation']) | ||
| expect(callsTo(AI_URL)).toHaveLength(0) | ||
| }) |
There was a problem hiding this comment.
Prefer parameterised tests for the routing cases
The first test ("routes batched $ai_* events to the dedicated AI path") and the last test ("routes $ai_* events to the normal batch path when disabled") cover the exact same scenario — one $ai_generation event captured via capture() / shutdown() — differing only in the _internal_dedicatedAiEndpoint flag and the expected URL. Per the team's style, these should be consolidated into a single it.each table. Similarly, a disabled-path counterpart for the captureImmediate case is absent; adding it to an it.each would close that coverage gap at the same time.
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/node/src/__tests__/dedicated-ai-endpoint.spec.ts
Line: 40-77
Comment:
**Prefer parameterised tests for the routing cases**
The first test ("routes batched `$ai_*` events to the dedicated AI path") and the last test ("routes `$ai_*` events to the normal batch path when disabled") cover the exact same scenario — one `$ai_generation` event captured via `capture()` / `shutdown()` — differing only in the `_internal_dedicatedAiEndpoint` flag and the expected URL. Per the team's style, these should be consolidated into a single `it.each` table. Similarly, a disabled-path counterpart for the `captureImmediate` case is absent; adding it to an `it.each` would close that coverage gap at the same time.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| this.clearFlushTimer() | ||
| await this._initPromise | ||
|
|
||
| let queue = this.getPersistedProperty<PostHogQueueItem[]>(PostHogPersistedProperty.Queue) || [] | ||
| // Guarantee all flushes run but errors are still thrown at the end | ||
| let firstError: unknown | ||
| for (const channel of this.eventChannels()) { | ||
| try { | ||
| await this._flushChannel(channel) | ||
| } catch (err) { | ||
| firstError ??= err | ||
| } | ||
| } | ||
|
|
||
| if (firstError !== undefined) { | ||
| throw firstError | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Second-channel error silently dropped when both channels fail
firstError ??= err captures only the first error encountered. If DEFAULT_EVENT_CHANNEL flushes successfully but AI_EVENT_CHANNEL throws, the error is rethrown correctly. However, if both channels fail (the analytics flush throws first), the AI channel's error is silently discarded. Callers see the analytics error and have no indication that AI events were also undelivered. Using AggregateError (or a simple array-join) would preserve both failure reasons without changing the throw semantics.
Prompt To Fix With AI
This is a comment left during a code review.
Path: packages/core/src/posthog-core-stateless.ts
Line: 1198-1215
Comment:
**Second-channel error silently dropped when both channels fail**
`firstError ??= err` captures only the first error encountered. If `DEFAULT_EVENT_CHANNEL` flushes successfully but `AI_EVENT_CHANNEL` throws, the error is rethrown correctly. However, if both channels fail (the analytics flush throws first), the AI channel's error is silently discarded. Callers see the analytics error and have no indication that AI events were also undelivered. Using `AggregateError` (or a simple array-join) would preserve both failure reasons without changing the throw semantics.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
yeah good point with AggregateError
There was a problem hiding this comment.
AggregateError was my first intuition too but there are many things that depend on the error being a specific PostHog error class downstream, such as:
- The public contract of
flushguarantees it throws@throws PostHogFetchHttpError / PostHogFetchNetworkError, so it'd need a major bump for this change - Many flow control utilities check against the PostHog error classes specifically
- Some other utils such as loggers and tests also check against class identity
There was a problem hiding this comment.
Fair points — the AggregateError suggestion would break the documented throws contract and the downstream class-identity checks. The current firstError ??= err behavior is a reasonable tradeoff, especially behind an @internal flag.
One lightweight option that preserves the existing contract while keeping the second error visible: log the dropped error rather than silently discarding it:
| this.clearFlushTimer() | |
| await this._initPromise | |
| let queue = this.getPersistedProperty<PostHogQueueItem[]>(PostHogPersistedProperty.Queue) || [] | |
| // Guarantee all flushes run but errors are still thrown at the end | |
| let firstError: unknown | |
| for (const channel of this.eventChannels()) { | |
| try { | |
| await this._flushChannel(channel) | |
| } catch (err) { | |
| firstError ??= err | |
| } | |
| } | |
| if (firstError !== undefined) { | |
| throw firstError | |
| } | |
| } | |
| // Guarantee all flushes run but errors are still thrown at the end | |
| let firstError: unknown | |
| for (const channel of this.eventChannels()) { | |
| try { | |
| await this._flushChannel(channel) | |
| } catch (err) { | |
| if (firstError !== undefined) { | |
| this._events.emit('error', err) | |
| } | |
| firstError ??= err | |
| } | |
| } |
This way callers still get the original PostHog error class from the first failure, but the second failure isn't completely invisible — it surfaces through the existing error event that the SDK already uses for non-fatal issues. No contract change, no class-identity breakage.
That said, if you're comfortable accepting the silent drop given the @internal gating, the current code is also fine as-is.
Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.
There was a problem hiding this comment.
I see, lets add a TODO about this behavior then at least, so we can consider a refactor in the next major
|
Size Change: +2.53 kB (+0.02%) Total Size: 16.7 MB
ℹ️ View Unchanged
|
| /** | ||
| * Routes `$ai_*` events to a dedicated capture-ai endpoint in their own batch. | ||
| * | ||
| * @internal Not ready for use — the backend endpoint and ingress routing are still being | ||
| * rolled out. Do not enable; behaviour and naming may change or be removed without notice. | ||
| */ | ||
| _internal_dedicatedAiEndpoint?: boolean |
There was a problem hiding this comment.
How will we roll this out? asking users to enable _internal_dedicatedAiEndpoint?
There was a problem hiding this comment.
The idea was to test this with test projects or team 2 in prod, and once ready just remove the guard and ship a new minor.
There was a problem hiding this comment.
is the goal to test the SDK or the backend?
if just the backend, we can do this with the python SDK first and here we dont need that, but a full swap
if the goal is to test also the SDK, that means we will need something similar for every SDK
if the HTTP contract is the same, i dont think we need that tbh, its just important that the backend is battletested
|
if the goal is to dogfood with PostHog/posthog-python#656 |
Got it, will address all feedback and re-request a review, but hold off on merging this until we've tested with Python |
|
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the |
|
This PR was closed due to lack of activity. Feel free to reopen if it's still relevant. |
Problem
We want to fully split out AI events ingestion, which requires us to send them to a separate
/i/v0/ai/batch/endpoint.Changes
$ai_*events to a dedicated endpoint. Covers both batchedcapture()andcaptureImmediate().@internaloption_internal_dedicatedAiEndpoint(off by default; not ready for general use while the backend route + ingress roll out)posthog-js(web),posthog-react-native, or other SDKsRelease info Sub-libraries affected
Libraries affected
(
@posthog/corealso bumps — it's the shared batcher posthog-node depends on, but isn't a checkbox above.)Checklist
If releasing new changes
pnpm changesetto generate a changeset file🤖 Agent context
Authored with Claude Code (agent), paired with Carlos. Design was iterated heavily: the routing lives at the core
enqueue/flush layer (not in@posthog/ai) so it also catches the LangChain / OpenAI-Agents paths that callcapture()directly; the AI policy lives only inposthog-nodewhile@posthog/coreholds only the generic mechanism, so RN/web are provably unaffected. A sidecar-client approach was rejected (double-shutdown data loss in serverless, config drift); server-side event-name routing was rejected by ingestion. We target the existing/batch/handler via a dedicated path (not the dead multipart/i/v0/aiendpoint). Automated tests run:@posthog/coreandposthog-nodeJest suites (green) plus a new 4-case node spec. Requires human review; no manual testing claimed.