Enhance Pinecone integration by adding codebase indexing and context retrieval functionality. Update repository connection event name for consistency. #52
Walkthrough
This pull request integrates code indexing into the repository connection flow and adds context retrieval functionality. Changes include: invoking `indexCodebase` in the `indexRepo` handler, adding a new `retrieveContext` function for querying embeddings with filters, updating the Pinecone index name from v1 to v2, and renaming the repository connection event identifier.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 3 passed
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apps/web/modules/repository/action/index.ts (1)
47-73: ⚠️ Potential issue | 🟠 Major
`inngest.send` fires even when webhook creation fails, triggering a full index of an un-persisted repository.
The `prisma.repository.create` call is guarded by `if(webhook)`, but `inngest.send` is outside that block and always executes. When `webhook` is falsy, no DB record exists, yet `indexRepo` will still:
- Fetch all repository file contents from GitHub
- Generate embeddings for every file
- Upsert vectors into Pinecone
This produces orphaned Pinecone vectors for a repo that was never saved. Move `inngest.send` inside the `if(webhook)` block.
🐛 Proposed fix
     if(webhook){
       await prisma.repository.create({
         data:{
           githubId:BigInt(githubId),
           name:repo,
           owner,
           fullName:`${owner}/${repo}`,
           url:`https://github.com/${owner}/${repo}`,
           userId:session.user.id
         }
       })
    +
    +  try {
    +    await inngest.send({
    +      name: "repository-connected",
    +      data:{
    +        owner,
    +        repo,
    +        userId: session.user.id
    +      }
    +    })
    +  } catch (error) {
    +    console.error("Failed to trigger repository indexing:", error)
    +  }
     }
    -  try {
    -    await inngest.send({
    -      name: "repository-connected",
    -      data:{
    -        owner,
    -        repo,
    -        userId: session.user.id
    -      }
    -    })
    -
    -  } catch (error) {
    -    console.error("Failed to trigger repository indexing:", error)
    -
    -  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web/modules/repository/action/index.ts` around lines 47 - 73, The outgoing event is sent regardless of webhook creation, causing indexing for repos not persisted; move the call to inngest.send (the "repository-connected" event) inside the if(webhook) block immediately after prisma.repository.create so it only fires when a repository was successfully created/persisted (refer to webhook, prisma.repository.create, inngest.send and the "repository-connected" event name).
apps/web/inngest/functions/index.ts (1)
22-41: ⚠️ Potential issue | 🟠 Major
Step output for large repositories will exceed Inngest's 4 MB payload limit and cause function failures.
`files` from `step.run("fetch-files")` is stored in Inngest's step state before truncation occurs inside `indexCodebase`. The untruncated, base64-decoded file contents from `getRepoFileContents` (which recursively fetches all text files) will exceed Inngest's per-step output limit of 4 MB for sufficiently large repositories.
When exceeded, the `"fetch-files"` step fails explicitly, triggering Inngest's error and retry system. After max retries, the entire function run fails unless caught with error handling (SDK v3.12.0+).
Consider:
- Fetching, embedding, and upserting inside a single `step.run` to avoid storing raw content as step output.
- Truncating content during the fetch step before returning.
- Processing files in smaller batches per `step.run` with a manifest step.
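The truncation and batching options above could look roughly like the sketch below. It is only an illustration under stated assumptions: the `RepoFile` shape, the helper names `truncateFiles` and `batchBySize`, and both budget constants are hypothetical and not taken from this PR, and the size estimate is a deliberately pessimistic upper bound rather than an exact byte count.

```typescript
// Hypothetical file shape; the real return type of getRepoFileContents may differ.
interface RepoFile {
  path: string;
  content: string;
}

// Assumed budgets, chosen only to stay well below a 4 MB step output limit.
const MAX_FILE_CHARS = 8000;
const MAX_BATCH_BYTES = 1000000;

// Upper bound on the UTF-8 size of the serialized file:
// every UTF-16 code unit encodes to at most 3 bytes.
function upperBoundBytes(file: RepoFile): number {
  return JSON.stringify(file).length * 3;
}

// Truncate each file's content before it is returned from a step.
function truncateFiles(files: RepoFile[], maxChars: number = MAX_FILE_CHARS): RepoFile[] {
  return files.map((file) => ({
    path: file.path,
    content: file.content.slice(0, maxChars),
  }));
}

// Greedily group files so each batch stays within maxBytes.
// A single file larger than maxBytes still gets a batch of its own.
function batchBySize(files: RepoFile[], maxBytes: number = MAX_BATCH_BYTES): RepoFile[][] {
  const batches: RepoFile[][] = [];
  let current: RepoFile[] = [];
  let currentBytes = 0;

  for (const file of files) {
    const size = upperBoundBytes(file);
    if (current.length > 0 && currentBytes + size > maxBytes) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(file);
    currentBytes += size;
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}
```

With helpers like these, each batch could be embedded and upserted in its own step, so no single step output carries the whole repository. Whether that fits the existing `indexCodebase` signature would need to be checked against the current code.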
🧹 Nitpick comments (2)
apps/web/lib/pinecone/pinecone.ts (1)
8-8: Ensure the `v2` index exists in Pinecone and plan for existing `v1` data.
Renaming the target index to `v2` means:
- Any vectors already stored in `supercode-vector-embeddings-v1` are silently abandoned; context retrieval will return no results for previously indexed repos.
- The `supercode-vector-embeddings-v2` index must be provisioned in the Pinecone console with the correct dimension (matching `text-embedding-004` output, 768 dims) before this is deployed.
Consider triggering a re-index of all connected repositories after deployment, or keeping a fallback reference to the v1 index during the transition.
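One way the fallback option might be structured is sketched below. To keep the example self-contained it does not import the Pinecone SDK: `QueryableIndex`, `Match`, and `queryWithFallback` are hypothetical names, and the request shape is a simplified stand-in for what the real index client accepts.

```typescript
// Minimal structural types so the sketch runs without the Pinecone SDK.
// These are assumptions, not the SDK's actual type definitions.
interface Match {
  id: string;
  score?: number;
}

interface QueryRequest {
  vector: number[];
  topK: number;
  filter?: Record<string, unknown>;
}

interface QueryableIndex {
  query(request: QueryRequest): Promise<{ matches: Match[] }>;
}

// Query the primary (v2) index first; consult the fallback (v1) index
// only when the primary returns no matches at all.
async function queryWithFallback(
  primary: QueryableIndex,
  fallback: QueryableIndex,
  request: QueryRequest
): Promise<Match[]> {
  const fromPrimary = await primary.query(request);
  if (fromPrimary.matches.length > 0) {
    return fromPrimary.matches;
  }
  const fromFallback = await fallback.query(request);
  return fromFallback.matches;
}
```

The fallback costs a second query for every repository that has not been re-indexed yet, so it probably makes sense only as a temporary measure until a re-index into v2 has completed.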
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web/lib/pinecone/pinecone.ts` at line 8, The Pinecone index name was changed to "supercode-vector-embeddings-v2" which will break retrieval for existing vectors in "supercode-vector-embeddings-v1"; before deploying, provision a new Pinecone index named "supercode-vector-embeddings-v2" with dimension 768 (matching text-embedding-004) in the Pinecone console and update any initialization code that references the index name (the string "supercode-vector-embeddings-v2" in pinecone.ts). Additionally, add a migration/compatibility plan in the code that either (a) falls back to "supercode-vector-embeddings-v1" when no results are found from v2 (by checking both index names in your Pinecone client/init function), or (b) triggers a re-index of all repos into v2 after deployment (implement a reindex function or job and call it post-deploy).
apps/web/modules/pinecone/rag/index.ts (1)
54-66: `retrieveContext` has no error handling and an unsafe type cast on metadata content.
Two issues:
- Neither `generateEmbedding` nor `pineconeIndex.query` is wrapped in try/catch. Any network failure, quota error, or Pinecone unavailability propagates raw to callers. `indexCodebase` uses per-file try/catch for the same `generateEmbedding` call; apply consistent error handling here.
- `match.metadata?.content as string` is an unchecked cast. `RecordMetadata` values are typed as `string | number | boolean | string[]`. A non-string truthy value survives `.filter(Boolean)` and breaks callers expecting `string[]`.
♻️ Proposed fix
    -export async function retrieveContext(query: string, repoId: string, topK:number=5){
    -
    -    const embedding = await generateEmbedding(query);
    -
    -    const results = await pineconeIndex.query({
    -        vector: embedding,
    -        filter: {repoId},
    -        topK,
    -        includeMetadata:true
    -    })
    -
    -    return results.matches.map(match=>match.metadata?.content as string).filter(Boolean)
    -
    -}
    +export async function retrieveContext(query: string, repoId: string, topK: number = 5): Promise<string[]> {
    +  try {
    +    const embedding = await generateEmbedding(query);
    +
    +    const results = await pineconeIndex.query({
    +      vector: embedding,
    +      filter: { repoId },
    +      topK,
    +      includeMetadata: true,
    +    });
    +
    +    return results.matches
    +      .map(match => match.metadata?.content)
    +      .filter((c): c is string => typeof c === 'string' && Boolean(c));
    +  } catch (error) {
    +    console.error('Failed to retrieve context:', error);
    +    return [];
    +  }
    +}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/web/modules/pinecone/rag/index.ts` around lines 54 - 66, Wrap the body of retrieveContext in a try/catch and mirror the error-handling approach used in indexCodebase: catch errors from generateEmbedding and pineconeIndex.query, log or rethrow a contextual Error (do not let raw errors leak), and return an empty array on failure; also remove the unsafe cast match.metadata?.content as string and instead validate and normalize metadata content: if typeof content === 'string' push it, if Array.isArray(content) then filter for string elements and spread them into the results, otherwise ignore non-string values so only real strings are returned.
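The normalization described in this prompt goes slightly beyond the proposed fix, since it also flattens `string[]` values. A minimal sketch of that step is shown below, assuming the metadata value union quoted in the comment; `MetadataValue`, `MatchLike`, and `extractContent` are hypothetical names used only for illustration.

```typescript
// Mirrors the value union quoted in the review comment; not imported from the SDK.
type MetadataValue = string | number | boolean | string[];

interface MatchLike {
  metadata?: Record<string, MetadataValue>;
}

// Collect only real, non-empty strings from each match's `content` field.
// Plain strings are kept, string arrays are flattened, everything else is ignored.
function extractContent(matches: MatchLike[]): string[] {
  const out: string[] = [];
  for (const match of matches) {
    const content = match.metadata?.content;
    if (typeof content === "string") {
      if (content.length > 0) {
        out.push(content);
      }
    } else if (Array.isArray(content)) {
      for (const item of content) {
        if (typeof item === "string" && item.length > 0) {
          out.push(item);
        }
      }
    }
  }
  return out;
}
```

A helper like this could replace the `.map(...).filter(...)` chain inside `retrieveContext`, leaving the try/catch from the proposed fix unchanged.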
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- apps/web/inngest/functions/index.ts
- apps/web/lib/pinecone/pinecone.ts
- apps/web/modules/pinecone/rag/index.ts
- apps/web/modules/repository/action/index.ts