arch(indexing): dedup identical repos + per-repo isolation in the shared Pinecone index #324

Description

@DevanshuNEU

Summary

OCI indexes every repo into a single shared Pinecone index. Isolation is enforced only by a repo_id metadata filter at query time: it is app-enforced, with no namespaces and no RLS safety net behind it (see oci/architecture/security.md and consistency-model.md). This is correct today, but it has two scaling problems:

1. No dedup for identical repos (storage cost + recall dilution)

If 100 users each index github.com/calcom/cal.com, OCI creates 100 separate repo_ids and stores 100 copies of identical vectors in the shared index. Storage grows linearly with users-per-public-repo, and the shared index accumulates near-duplicate vectors that dilute recall at scale.

Direction: dedup public repos by git_url + commit SHA so identical public code shares one index; private repos stay per-owner. Requires a public-vs-private access model and fork/staleness handling.
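A minimal sketch of what a dedup key could look like, under stated assumptions. The names `normalize_git_url` and `index_key`, the `is_public` and `owner_id` parameters, and the normalization rules are all hypothetical and are not taken from the OCI codebase. It handles only https-style URLs (scp-style SSH URLs would need separate parsing) and lowercases the path, which is safe for GitHub but not for every host.

```python
import hashlib
from urllib.parse import urlsplit


def normalize_git_url(git_url: str) -> str:
    """Reduce an https-style git URL to 'host/path' (hypothetical rules).

    Drops the scheme, a trailing slash, and a trailing '.git', and
    lowercases host and path. Lowercasing the path is an assumption that
    holds for GitHub but not necessarily for other hosts.
    """
    url = git_url.strip()
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{host}{path.lower()}"


def index_key(git_url: str, commit_sha: str, is_public: bool, owner_id: str) -> str:
    """Derive a storage key for one indexed snapshot of a repo.

    Public repos share one key per (git_url, commit SHA), so identical
    public code maps to one index. Private repos mix in the owner, so
    they stay per-owner even when URL and SHA match.
    """
    scope = "public" if is_public else f"private:{owner_id}"
    raw = f"{scope}|{normalize_git_url(git_url)}|{commit_sha.lower()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:32]
```

With a key like this, 100 users indexing the same public commit would resolve to one key instead of 100 repo_ids. Fork handling, staleness, and who is allowed to mark a repo public are left open here, as in the issue.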

2. No hard per-repo isolation (and orphaned vectors on delete)

Isolation rides entirely on every query carrying the correct repo_id filter. There is no database-layer fence (service-role key bypasses RLS). Separately, repo deletion has no Pinecone delete step (consistency-model.md anomaly #2) -- every delete orphans that repo's vectors permanently.

Direction: switch to a Pinecone namespace per repo. This gives hard isolation (query scoped to namespace), makes delete a cheap delete_all on the namespace (fixes the orphaned-vector leak), and reduces cross-repo recall dilution.
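A rough sketch of the namespace-per-repo direction, assuming `index` is a Pinecone Python client `Index` object (or anything with compatible `query` and `delete` methods). The helper names and the `repo-<id>` naming scheme are hypothetical and would be settled in the ADR.

```python
from typing import Any, Sequence


def namespace_for(repo_id: str) -> str:
    """Map a repo_id to its namespace name (hypothetical naming scheme)."""
    return f"repo-{repo_id}"


def query_repo(index: Any, repo_id: str, vector: Sequence[float], top_k: int = 10) -> Any:
    """Query only the vectors that belong to one repo.

    The namespace scopes the search itself, so isolation no longer
    depends on a repo_id metadata filter being attached to the query.
    """
    return index.query(
        vector=list(vector),
        top_k=top_k,
        namespace=namespace_for(repo_id),
        include_metadata=True,
    )


def delete_repo_vectors(index: Any, repo_id: str) -> None:
    """Drop every vector for a repo by clearing its namespace."""
    index.delete(delete_all=True, namespace=namespace_for(repo_id))
```

Calling `delete_repo_vectors` from the repo deletion path is what would close the orphaned-vector leak. Vectors already stored in the shared default namespace would still need a one-off migration or cleanup, which this sketch does not cover.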

Why now

Surfaced during the #323 design gate (per-user MCP auth). #323 makes the identity correct; this issue makes the storage partition correct and cheap at scale. They are independent: #323 is the MCP auth boundary, this is the indexing/storage layer.

Scope note

This is a storage-architecture decision (touches indexing partition strategy, the public/private access model, who pays to index, fork/staleness, and the delete path). Warrants its own ADR before implementation. Not part of #323.

Labels: dogfood-finding (surfaced by using OCI on OCI itself)
