diff --git a/docs/adapters/benchmark_import.md b/docs/adapters/benchmark_import.md new file mode 100644 index 0000000..0e327be --- /dev/null +++ b/docs/adapters/benchmark_import.md @@ -0,0 +1,361 @@ +# Benchmark import adapter — specification + +> Specification document for `layerlens.instrument.adapters.frameworks.benchmark_import`. +> Companion to the user-facing reference at `docs/adapters/frameworks-benchmark_import.md`. +> The depth audit at `A:/tmp/adapter-depth-audit.md` §1.18 flagged this adapter +> as having no spec; this document closes that gap. + +| | | +|---|---| +| Spec ID | ADP-074 (benchmark import) | +| Module | `layerlens.instrument.adapters.frameworks.benchmark_import` | +| Source | `src/layerlens/instrument/adapters/frameworks/benchmark_import/adapter.py` | +| Adapter type | **Data importer** — does NOT extend `BaseAdapter`; not a runtime instrumentation adapter | +| Status | Functional, ports byte-near-identical from ateam | +| Tests | None dedicated — covered by `tests/instrument/adapters/frameworks/test_bulk_ported_smoke.py::test_benchmark_import_adapter_independent` | + +--- + +## 1. Overview + +The benchmark import adapter is a **data importer**. It reads benchmark +suites from external sources (HuggingFace Datasets, HELM result JSON, +local CSV/JSON/Parquet files), normalizes the records to LayerLens' +canonical evaluation schema, and persists them so the platform's +evaluation runner can consume them like any other dataset. + +This adapter is **architecturally distinct** from every other adapter in +`frameworks/`. The other 17 are runtime instrumentation adapters that +extend `BaseAdapter`, wrap an in-process framework, and emit telemetry +events as the wrapped framework runs. `BenchmarkImportAdapter` does none +of that: + +* It does NOT extend `BaseAdapter`. It is a standalone class. +* It does NOT have a `connect()` / `disconnect()` lifecycle. +* It does NOT emit any telemetry events. +* It does NOT wrap any third-party framework's runtime methods. +* It is closer in shape to the `langfuse` adapter's `importer.py` submodule + than to LangChain's callback handler. + +In ATEAM the smoke test for this adapter +(`test_benchmark_import_adapter_independent`) explicitly documents this +non-`BaseAdapter` shape; the audit (§1.18) classifies the architectural +inconsistency as intentional but flags that the framework-architecture +spec lacks a "data importer" carve-out for it. This spec provides that +carve-out. + +The adapter exists because LayerLens evaluation spaces need a way to +import published benchmark suites — MMLU, SQuAD, HumanEval, and similar +HuggingFace datasets — without forcing every customer to write their own +ETL. The downstream evaluation runner (in atlas-app) iterates the +imported benchmark and produces the `model.invoke` / `evaluation.score` +events through standard provider adapters. + +--- + +## 2. Capability surface + +`BenchmarkImportAdapter` does NOT use the `AdapterCapability` enum and +does NOT publish an `AdapterInfo` record. The capability surface is +expressed instead through the public method set listed in §3. + +What the adapter CAN do, sourced directly from `adapter.py`: + +| Capability | Source | +|------------------------------------------------|------------------| +| Stream a HuggingFace dataset split | `import_huggingface` (line 101–171). Uses `datasets.load_dataset(streaming=True)` to avoid materializing the whole split. | +| Parse HELM JSON results | `import_helm` (line 175–246). Handles the top-level-list, `{results: …}`, and `{scenarios: […]}` shapes. | +| Read a CSV file | `import_csv` (line 250–295). Uses stdlib `csv.DictReader`; `delimiter` is configurable. | +| Read a JSON file (array or object-with-key) | `import_json` (line 297–349). The `records_key` argument selects the array inside an object payload (defaults to `records`, falling back to `data`). | +| Read a Parquet file | `import_parquet` (line 351–405). Lazily imports `pyarrow.parquet`; returns a clear error if `pyarrow` isn't installed. | +| Apply schema-mapping renames | `_apply_schema_mapping` (line 419–431). Pure key-renaming; no value transforms. | +| List / look up imported benchmarks | `list_benchmarks` (line 409), `get_benchmark` (line 413). | +| Persist via an injected store | `_persist` (line 433–443). If the constructor receives a `store=` argument with `insert_row(table, row)`, the adapter writes one row to the `benchmarks` table and one row per record to the `benchmark_records` table. | + +--- + +## 3. Contract + +### 3.1 Public API + +| Member | Description | +|-------------------------------------------------|----------------------------------------------------------------------------| +| `__init__(store: Any \| None = None)` | Construct with optional persistent store. Without a store, imports stay in memory. | +| `import_huggingface(dataset_name, split="test", subset=None, schema_mapping=None, max_records=None, tags=None) -> ImportResult` | Streamed HuggingFace import. | +| `import_helm(path, schema_mapping=None, tags=None) -> ImportResult` | HELM JSON import. | +| `import_csv(path, schema_mapping=None, delimiter=",", max_records=None, tags=None) -> ImportResult` | CSV import. | +| `import_json(path, schema_mapping=None, records_key=None, max_records=None, tags=None) -> ImportResult` | JSON import. | +| `import_parquet(path, schema_mapping=None, max_records=None, tags=None) -> ImportResult` | Parquet import. | +| `list_benchmarks() -> list[BenchmarkMetadata]` | All metadata held in this adapter instance. | +| `get_benchmark(benchmark_id) -> BenchmarkMetadata \| None` | Single lookup. | + +### 3.2 Public data models + +`BenchmarkMetadata` and `ImportResult` are Pydantic v1/v2-compatible +models defined in the same module (`adapter.py:34–63`). + +`BenchmarkMetadata`: + +| Field | Type | Default | +|----------------------|--------------------|--------------------------------------------| +| `benchmark_id` | `str` | `f"bench-{uuid4().hex[:12]}"` | +| `name` | `str` | required | +| `source` | `str` | required (`huggingface`/`helm`/`csv`/`json`/`parquet`) | +| `source_identifier` | `str` | empty | +| `version` | `str` | `"1.0.0"` | +| `record_count` | `int` | 0 | +| `schema_mapping` | `dict[str, str]` | `{}` | +| `imported_at` | `str` (ISO-8601) | `datetime.now(UTC).isoformat()` | +| `imported_by` | `str` | empty | +| `tags` | `list[str]` | `[]` | + +`ImportResult`: + +| Field | Type | Default | +|--------------------|-------------------------------|---------| +| `success` | `bool` | `True` | +| `benchmark_id` | `str` | empty | +| `records_imported` | `int` | 0 | +| `records_skipped` | `int` | 0 | +| `duration_ms` | `float` | 0.0 | +| `errors` | `list[str]` | `[]` | +| `metadata` | `Optional[BenchmarkMetadata]` | `None` | + +### 3.3 Schema mapping + +`schema_mapping` is a `dict[str, str]` of `source_field → canonical_field`. +The mapping is applied identically across all five import methods via +`_apply_schema_mapping`. The mapping is a pure key rename — values pass +through unmodified, and source keys NOT in the mapping pass through with +their original names. + +There is **no automatic heuristic mapping** in the source. The +user-reference doc at `docs/adapters/frameworks-benchmark_import.md` +mentions "automatic heuristics" — this is not implemented in code today +(`_apply_schema_mapping` short-circuits to `return record` if the mapping +is None or empty). Treat that doc claim as aspirational pending a future +enhancement. + +The recommended canonical fields, when imports will be consumed by the +LayerLens evaluation runner: + +| Canonical field | Common source field aliases | +|--------------------|--------------------------------------------------------| +| `prompt` | `question`, `input`, `query` | +| `expected_output` | `answer`, `target`, `reference`, `ground_truth` | +| `difficulty` | `difficulty`, `level` | +| `category` | `category`, `subject`, `topic` | + +### 3.4 Persistence contract + +The `store` constructor argument is duck-typed: any object with a single +method `insert_row(table_name: str, row: dict)` will be accepted. The +adapter writes: + +1. One row to table `"benchmarks"` containing `metadata.model_dump()`. +2. One row per imported record to table `"benchmark_records"`, each + record dict pre-augmented with `record["benchmark_id"] = metadata.benchmark_id`. + +If `store` is `None`, the metadata is held in `self._benchmarks` (an +in-memory dict keyed by `benchmark_id`) but the records themselves are +NOT retained — they are produced and discarded inside the `import_*` +method. This is intentional: holding millions of HuggingFace records in +process memory would defeat the streaming load. Callers who need the +records without a store should pass a callable shim that captures them +(e.g. an in-test list). + +### 3.5 Error handling + +Every `import_*` method follows the same error-handling shape: + +* `ImportError` from a lazy dependency import (`datasets`, `pyarrow`) + → return `ImportResult(success=False, errors=["… library not installed …"])`. +* `FileNotFoundError`, `json.JSONDecodeError`, etc. → return + `ImportResult(success=False, errors=[…])`. +* Any other exception → return `ImportResult(success=False, errors=[f"X import failed: {exc}"])`. + +The methods do NOT raise. A caller can rely on the `result.success` flag +and the `result.errors` list. Persistence failures inside `_persist` are +swallowed and logged at DEBUG (`adapter.py:441–443`); the `ImportResult` +will still report `success=True` even though nothing reached the store. +This is a known gap — see §7 roadmap. + +--- + +## 4. What we do NOT support + +* **Not a `BaseAdapter`.** No `connect()`, `disconnect()`, `health_check()`, + `get_adapter_info()`, `serialize_for_replay()`, sinks, capture config, + or capability declarations. The class is freestanding. +* **No telemetry events.** `BenchmarkImportAdapter` does NOT emit + `embedding.create`, `model.invoke`, `evaluation.score`, or any other + event type. The downstream evaluation runner is responsible for + generating events when it iterates the imported benchmark. +* **No live mid-run feedback.** This is a one-shot ETL: call + `import_csv(...)`, get an `ImportResult`. There is no streaming + interface, no progress callback, no per-batch event. For very large + datasets the only way to bound memory is via `max_records`. +* **No MTEB importer.** The user-reference doc's parent-directory listing + references "MTEB" generically; there is no `import_mteb(...)` method. + MTEB datasets reachable through HuggingFace can be loaded via + `import_huggingface("mteb/...")`, but there is no MTEB-specific + schema mapping or score normalization. Adding native MTEB support is + on the roadmap. +* **No MMLU-specific importer.** Same situation as MTEB — load it as a + HuggingFace dataset (`import_huggingface("cais/mmlu", subset=...)`). + No native MMLU-aware mapping ships today. +* **No BIG-bench, GPQA, HumanEval, SWE-bench, AGIEval, etc. native + importers.** Same answer: any of these reachable through HuggingFace + Datasets work via `import_huggingface`, but no benchmark-specific + schema knowledge ships. +* **No automatic schema-heuristic detection.** The reference doc + describes case-insensitive aliasing; the source does not implement it. +* **No cross-benchmark deduplication.** Reimporting the same dataset + produces a fresh `benchmark_id` and a fresh row in the `benchmarks` + table. There is no dedupe key. +* **No re-import / incremental update.** No cursor, no `since`, no diff + mode. Every import is a fresh full-load (within `max_records`). +* **No per-record validation against the canonical evaluation schema.** + The adapter renames keys via `schema_mapping`; it does NOT verify that + a `prompt` field exists, that values are non-empty, or that types are + what the evaluation runner expects. +* **No per-record signing or attestation.** Imported records are stored + as plain rows. There is no merkle root, no source-hash, no provenance + envelope. +* **No multi-tenant scoping inside the adapter.** The adapter has no + `org_id` argument, does not stamp `org_id` onto rows, and does not + inject tenant context into the `store.insert_row` call. Multi-tenant + scoping is the responsibility of the **store** the caller injects — + see §5. +* **No remote-source authentication.** HuggingFace public datasets work + out of the box. Gated datasets require `HUGGINGFACE_HUB_TOKEN` set in + the environment of the hosting process; the adapter does not surface + this as an explicit kwarg. +* **No format-conversion outputs.** The adapter does not export to a + different format. Records flow in, get persisted, and live in whatever + shape the store keeps them. +* **No sample (in `samples/instrument/benchmark_import/`) for + HuggingFace or HELM imports.** Only a CSV sample exists today. + +--- + +## 5. BYOK and multi-tenancy + +**BYOK is not applicable.** The adapter does not call any LLM provider +API and does not need a model API key. HuggingFace dataset loading uses +the `datasets` library's own credential resolution +(`HUGGINGFACE_HUB_TOKEN` env var) for gated datasets; this is outside +the LayerLens BYOK scope per the project's "BYOK = model API keys only" +convention. + +**Multi-tenancy** is the responsibility of the injected `store`. The +adapter is tenant-agnostic by design — it does not know about +organizations, users, or tenant scoping. A multi-tenant deployment must: + +1. Inject a `store` whose `insert_row` is itself tenant-aware (e.g. + wraps the underlying SQL with a `tenant_id` column populated from + request context). +2. Serialize benchmark imports per tenant — do NOT share a single + `BenchmarkImportAdapter` instance across tenants because + `self._benchmarks` (the in-memory metadata cache) does not carry + tenant scoping. Callers running on a per-request basis should + construct a fresh adapter per import. +3. If the platform later wants to surface "all benchmarks for org X", + that lookup must go through the store, not through `list_benchmarks()` + on a long-lived adapter instance. + +The `imported_by` field on `BenchmarkMetadata` is meant to record the +user who triggered the import; it currently defaults to empty and must +be populated by the caller before persistence. The platform's CLI/API +layer is the right place to set it. + +--- + +## 6. Test coverage + +* **Dedicated tests:** none. +* **Bulk smoke coverage:** `tests/instrument/adapters/frameworks/test_bulk_ported_smoke.py::test_benchmark_import_adapter_independent` + (lines 170–189). The smoke test exercises: + * `BenchmarkMetadata(name="test", source="csv")` constructs and the + `benchmark_id` has the expected `bench-` prefix. + * `ImportResult(success=True, benchmark_id=…)` constructs. + * `BenchmarkImportAdapter()` constructs. + + The smoke test does NOT exercise any actual import path. It is a + module-importability and dataclass-shape check, not a behavioral test. + +* **What's missing vs ateam:** ateam ships a dedicated integration test + exercising at least the CSV path end-to-end (audit §1.18). That file + was not ported. Restoring it is on the Tier-2 test-restoration queue + (audit §3 item 16). + +* **Sample:** `samples/instrument/benchmark_import/main.py` exercises + `import_csv` end-to-end against a temporary CSV. There are no samples + for HuggingFace, HELM, JSON, or Parquet paths. + +A complete test pass for v1.7 would cover: + +1. CSV import with explicit `schema_mapping`; verify `records_imported`, + `metadata.source == "csv"`, all rows reachable through the schema + mapping. +2. CSV import with `max_records=N`; verify the cap. +3. CSV import with custom `delimiter="\t"`. +4. JSON import — array form, object-with-`records`-key form, + object-with-default-`data`-key form. +5. JSON import where the file is not valid JSON → `success=False` and + a meaningful error string. +6. HELM import — the three accepted shapes (top-level list, + `{results: …}`, `{scenarios: […]}`). +7. HELM import where the file is missing → `success=False`. +8. Parquet import without `pyarrow` installed → returns the expected + error string. (Achievable in CI by uninstalling pyarrow in a separate + tox env.) +9. HuggingFace import without `datasets` installed → returns the + expected error string. +10. HuggingFace import with a small public dataset (`squad`, + `max_records=2`) — gated behind `LAYERLENS_RUN_NETWORK_TESTS=1` so + it doesn't run in default CI. +11. `_persist` with an injected stub store; verify the two table writes. +12. `_persist` failure path — store raises; verify the result is still + returned (currently `success=True`, see §3.5 known gap). +13. `list_benchmarks` / `get_benchmark` round-trip after a successful + import. + +--- + +## 7. Roadmap + +* **v1.7 — Native MTEB importer.** `import_mteb(task_name, ...)` that + knows the MTEB task taxonomy and writes a benchmark per task. Likely + built on top of `import_huggingface` with task-specific schema + mappings. +* **v1.7 — Native MMLU importer.** Subject-aware mapping + (`subject` → `category`), automatic answer-letter normalization + (`A`/`B`/`C`/`D` → choice text). +* **v1.7 — `imported_by` plumbing.** Wire the CLI / API layer to + populate `imported_by` from the authenticated user. +* **v1.7 — Persist-failure surface.** `_persist` errors should set + `result.success = False` and append to `result.errors` instead of + being swallowed at DEBUG. +* **v1.7 — Restored test parity with ateam** (Tier-2 item 16). +* **v1.8 — Schema-heuristic detection.** Implement the case-insensitive + alias matching that the user-reference doc currently describes + aspirationally. +* **v1.8 — Per-record validation.** Optional `validate=True` flag that + rejects rows missing the canonical `prompt` field after mapping; + populate `records_skipped` accordingly. +* **v1.8 — Source-hash provenance.** Stamp `metadata.source_hash = + sha256(file_or_dataset_bytes)` for attestation. Required for + audit-trail benchmarks shipped to regulated tenants. +* **v1.9 — Native HumanEval / SWE-bench / GPQA importers.** Mirrors the + MMLU/MTEB pattern. +* **No-date-set — Incremental import.** Cursor-based reimport for + benchmarks that grow over time. Practical only once the store layer + exposes a `last_imported_record_id` lookup. +* **No-date-set — Cross-benchmark dedupe.** Decision pending product + input on whether identical question text across benchmarks should + collapse or remain distinct. + +--- + +*End of spec.* diff --git a/docs/adapters/embedding.md b/docs/adapters/embedding.md new file mode 100644 index 0000000..0484093 --- /dev/null +++ b/docs/adapters/embedding.md @@ -0,0 +1,332 @@ +# Embedding adapter — specification + +> Specification document for `layerlens.instrument.adapters.frameworks.embedding`. +> Companion to the user-facing reference at `docs/adapters/frameworks-embedding.md`. +> This file describes WHAT the adapter is contracted to do, what it explicitly +> does NOT do, and what is on the roadmap. The depth audit at +> `A:/tmp/adapter-depth-audit.md` §1.17 flagged this adapter as having no +> dedicated spec; this document closes that gap. + +| | | +|---|---| +| Spec ID | ADP-060 (embedding) + ADP-061 (vector store) | +| Module | `layerlens.instrument.adapters.frameworks.embedding` | +| Source | `src/layerlens/instrument/adapters/frameworks/embedding/` (`embedding_adapter.py`, `vector_store_adapter.py`) | +| Adapter type | Provider — runtime wrapping of embedding-API calls and vector-store queries | +| Status | Functional, ports byte-near-identical from ateam | +| Tests | None dedicated — covered indirectly via `tests/instrument/adapters/frameworks/test_bulk_ported_smoke.py` | + +--- + +## 1. Overview + +The embedding adapter is a runtime instrumentation adapter that wraps the +client-side methods used to **create vector embeddings** and **query vector +stores**. It targets the data-preparation and retrieval halves of a typical +RAG pipeline, not the generative half (the LLM call itself is covered by the +provider adapters in `layerlens.instrument.adapters.providers.*`). + +The module ships two distinct `BaseAdapter` subclasses — they are independent +and either can be used without the other. + +* **`EmbeddingAdapter`** — wraps client methods that turn text into vectors. + Today: OpenAI `client.embeddings.create`, Cohere `client.embed`, and any + `sentence_transformers.SentenceTransformer.encode`. + *(Source: `embedding_adapter.py:118–143`.)* +* **`VectorStoreAdapter`** — wraps client methods that retrieve vectors by + similarity. Today: Pinecone `Index.query`, Weaviate `collection.query.near_vector` + and `collection.query.near_text`, and Chroma `collection.query`. + *(Source: `vector_store_adapter.py:114–145`.)* + +Both adapters use the same monkey-patch + restore pattern as the rest of the +framework adapters: `connect()` registers the wrapper, `disconnect()` calls +`_restore_originals()` to put the original methods back. They emit dict events +through `BaseAdapter.emit_dict_event` rather than typed Pydantic events; the +adapter is therefore subject to the cross-cutting "typed events" gap called out +in audit §2 finding 1. + +This adapter is conceptually different from a future RAG-retrieval adapter: +it captures the API call to the embedding/vector-store backend, but it does +NOT correlate retrieved chunks with the downstream LLM call that consumed +them. That correlation is the responsibility of the agent-framework adapter +(LangChain, LlamaIndex, etc.) sitting above it. + +--- + +## 2. Capability surface + +Both classes declare a single capability via `AdapterInfo.capabilities`: + +| Class | Declared capability | Source | +|--------------------|----------------------|-------------------------------------| +| `EmbeddingAdapter` | `TRACE_MODELS` | `embedding_adapter.py:101–103` | +| `VectorStoreAdapter` | `TRACE_TOOLS` | `vector_store_adapter.py:97–99` | + +The choice of `TRACE_MODELS` for embeddings reflects that an embedding API +call is a model invocation against a hosted embedding model +(`text-embedding-3-small`, `embed-english-v3.0`, etc.). The choice of +`TRACE_TOOLS` for vector stores reflects that a similarity query is treated +by the schema as a retrieval tool call rather than a model call. + +**Capabilities NOT declared:** + +* `STREAMING` — embedding APIs return whole vectors per call; there is no + intermediate token stream to capture. This is consistent with the + underlying provider APIs (OpenAI `embeddings.create` returns a synchronous + `CreateEmbeddingResponse`; Cohere `embed` returns a list; sentence-transformers + returns a numpy/torch tensor in one shot). +* `REPLAY` — both classes implement `serialize_for_replay()` (returning a + `ReplayableTrace` containing `self._trace_events`), but neither declares + the capability. This is the same `serialize_for_replay()`-without-`REPLAY`- + declaration gap that the depth audit flagged across all adapters + (audit §2 finding 3); it is tracked in PR #119. +* `TRACE_HANDOFFS`, `TRACE_STATE`, `TRACE_PROTOCOL_EVENTS` — not relevant to + the embedding/vector-store call surface. + +--- + +## 3. Contract + +### 3.1 Public API — `EmbeddingAdapter` + +| Member | Description | +|---------------------------------|-----------------------------------------------------------------------------------| +| `__init__(stratix=, capture_config=, *, org_id=)` | Standard `BaseAdapter` constructor. | +| `connect()` | Marks the adapter healthy; takes no other action (no networking, no discovery). | +| `disconnect()` | Restores all originals, marks disconnected, closes sinks. | +| `health_check() -> AdapterHealth` | Returns adapter status (`HEALTHY` / `DISCONNECTED` / etc.) and error count. | +| `get_adapter_info() -> AdapterInfo` | Returns name, version, framework, capabilities, author, description. | +| `serialize_for_replay() -> ReplayableTrace` | Returns the in-memory `_trace_events` for replay. | +| `wrap_openai(client) -> client` | Patches `client.embeddings.create`. Returns the same client. | +| `wrap_cohere(client) -> client` | Patches `client.embed`. Returns the same client. | +| `wrap_sentence_transformer(model) -> model` | Patches `model.encode`. Returns the same model. | + +Each `wrap_*` method is **idempotent against missing methods** but **not +idempotent against repeat calls**: calling `wrap_openai(client)` twice on the +same client wraps the wrapper. Callers should `disconnect()` (which restores) +between repeated wrappings, or wrap each client exactly once. + +### 3.2 Public API — `VectorStoreAdapter` + +| Member | Description | +|---------------------------------|-----------------------------------------------------------------------------------| +| Lifecycle | Same as `EmbeddingAdapter` (mirror methods). | +| `wrap_pinecone(index) -> index` | Patches `index.query`. | +| `wrap_weaviate(collection) -> collection` | Patches `collection.query.near_vector` and `collection.query.near_text`.| +| `wrap_chroma(collection) -> collection` | Patches `collection.query`. | + +### 3.3 Events emitted + +All events are emitted via `emit_dict_event(event_type, payload)`. They are +NOT typed Pydantic events; that migration is tracked under audit §3 Tier-1 +recommendation 5. + +#### `embedding.create` (L3) + +Emitted by `EmbeddingAdapter` after every wrapped call returns. + +| Field | Type | Source | Notes | +|-----------------|---------|-----------------------------------------------------------------------|-------| +| `provider` | str | Hardcoded per wrapper: `"openai"` / `"cohere"` / `"sentence_transformers"` | | +| `model` | str | `kwargs["model"]` (OpenAI/Cohere) or literal `"local"` (ST) | OpenAI defaults to `"unknown"` if model not in kwargs (`embedding_adapter.py:151`). Cohere defaults to `"embed-english-v3.0"` (line 188). | +| `batch_size` | int | `len(input)` if list, else `1` | | +| `dimensions` | int \| None | `len(result.data[0].embedding)` (OpenAI), `len(result.embeddings[0])` (Cohere), `result.shape[1]` (ST) | None if response shape unrecognized. | +| `total_tokens` | int | `result.usage.total_tokens` (OpenAI only) | Cohere/ST do not surface token counts in the same way and the field is omitted from those payloads. | +| `latency_ms` | float | Wall-clock measurement around the wrapped call (`time.monotonic`) | Rounded to 2 decimals. | + +#### `retrieval.query` (L3) + +Emitted by `VectorStoreAdapter` after every wrapped query returns. The +**event type is `retrieval.query`, not `vector_store.query`** — the existing +user-reference doc disagrees with the source on this point; the source +(`vector_store_adapter.py:169, 201, 235`) is the authority. + +| Field | Type | Notes | +|-----------------|---------|-------------------------------------------------------------------------------| +| `provider` | str | `"pinecone"` / `"weaviate"` / `"chroma"` | +| `top_k` / `n_results` / `limit` | int | Field name varies by provider to match the underlying SDK kwarg. | +| `match_count` / `result_count` | int | Number of hits returned. | +| `has_filter` / `has_where` | bool | Whether a metadata filter was supplied (Pinecone `filter`, Chroma `where`). | +| `namespace` | str | Pinecone-specific. | +| `query_type` | str | Weaviate-specific: `"near_vector"` or `"near_text"`. | +| `score_min`/`score_max`/`score_mean` | float \| None | Pinecone — extracted from `result.matches[*].score`. | +| `distance_min`/`distance_max` | float \| None | Chroma — from `result["distances"][0]`. | +| `latency_ms` | float | Always present. | + +### 3.4 Lifecycle + +Both adapters inherit the standard `BaseAdapter` lifecycle. `connect()` is +trivial (no remote handshake, no SDK discovery — embedding/vector clients +are passed in by the caller). `disconnect()` restores wrapped methods via +`_restore_originals()` and is the only operation that matters for clean +teardown. + +### 3.5 Error handling + +The wrappers do NOT swallow exceptions. If the wrapped client raises, the +exception propagates to the caller. The wrapper's `start = time.monotonic()` +is reached but the `emit_dict_event` after the call is not, so failed calls +are NOT emitted as `embedding.create` / `retrieval.query` events. The +underlying `BaseAdapter._error_count` is also not incremented on these +exceptions because the wrappers do not invoke +`_record_error()`. This is a known gap shared with most ported adapters +(audit §2 finding for "no PolicyViolationEvent path"). + +--- + +## 4. What we do NOT support + +Each item below is an explicit non-goal as of this spec. None of these +behaviors should be inferred from the source — they are not present. + +* **No async embedding clients.** The OpenAI v1 SDK exposes + `AsyncOpenAI().embeddings.create()`; this adapter only wraps the sync + `OpenAI` client (`hasattr(client, "embeddings")` check at + `embedding_adapter.py:120`). The async coroutine method is NOT patched. + Async support is on the v1.7 roadmap. +* **No streaming.** Embedding APIs return whole vectors; there are no + intermediate stream events. The `STREAMING` capability is not declared + and never will be for this adapter. +* **No retrieval-correlation events.** Vector-store queries emit a single + `retrieval.query` event with aggregate scores — not per-document + correlation IDs. An LLM consuming the retrieved chunks has no way to + reference them back to a specific `retrieval.query` event without help + from the framework adapter above (LangChain/LlamaIndex). +* **No content capture.** The wrapper records *batch_size* but does not + record the input texts themselves, nor the returned vectors. This is + intentional for privacy reasons and is consistent with `CaptureConfig`'s + `capture_content=False` default for production presets. There is no + knob today to opt into capturing input text or returned vectors. +* **No re-ranker support.** Cohere `rerank`, Voyage `rerank`, and other + re-ranking endpoints are NOT wrapped. Only `embed`/`encode` paths. +* **No Voyage / Mistral-embed / Anthropic-embed / NVIDIA-embed / GTE / BGE + hosted endpoints.** Only the three providers listed in §1 are wrapped. + Adding new providers requires a new `wrap_(client)` method + plus a `_make__wrapper(original)` factory mirroring the + existing pattern. +* **No vector-store mutation tracking.** Pinecone `upsert` / `delete`, + Weaviate `data.insert`, Chroma `add` / `update` are NOT wrapped. Only + read-side `query` operations are instrumented. Tracking writes is on + the v1.7 roadmap. +* **No Qdrant, Milvus, FAISS, or pgvector adapters.** Only Pinecone / + Weaviate / Chroma are supported today. +* **No batch-vs-single distinction event.** OpenAI lets you submit a + batch in a single `embeddings.create`; we record `batch_size` but do + not split the call into per-item events. +* **No typed Pydantic event payloads.** All events are emitted as plain + dicts via `emit_dict_event`. Migration to typed events is tracked at + audit §3 Tier-1 item 5 (cross-adapter, not adapter-specific). +* **No OTel `gen_ai.*` semantic conventions.** Audit §2 finding 2 applies. + PR #125 tracks the cross-adapter migration. +* **No `org_id` in event envelopes by default.** The constructor accepts + `org_id=` and stores it on `self._org_id`, but the emit path does NOT + inject it into the event payload. PR #118 tracks the cross-adapter fix. + +--- + +## 5. BYOK and multi-tenancy + +The embedding adapter integrates with provider keys indirectly. The caller +is responsible for constructing the upstream client (`OpenAI()`, `cohere.Client()`, +`SentenceTransformer(...)`); the adapter never instantiates a provider client +of its own and never reads `OPENAI_API_KEY` / `COHERE_API_KEY` / similar +environment variables. Whatever credential the caller's client has, that's +what the wrapped call uses. + +This means BYOK key resolution for embeddings happens **outside** the +adapter's scope, in two layers above it: + +1. The caller's application chooses which key to use (org-scoped BYOK key, + platform-managed key, or environment fallback). +2. The platform's BYOK store (atlas-app `/api/v1/model-keys`) is responsible + for materializing the per-org key into the OpenAI/Cohere client before + the embedding call is made. + +The adapter's only contribution to multi-tenancy is the `org_id` +constructor argument (`embedding_adapter.py:69`, +`vector_store_adapter.py:65`). Today this is stored on the adapter +instance but **not stamped onto emitted events** — the cross-adapter +`org_id` propagation work tracked by PR #118 is the fix. + +For SaaS deployments where a single shared sink is fed by per-tenant +adapter instances, the recommended pattern until PR #118 lands is to use +distinct adapter instances per tenant and let the sink layer (which knows +its own tenant context from its API-key-driven session) attach the +`org_id` envelope. Do NOT share a single `EmbeddingAdapter` instance +across tenants. + +--- + +## 6. Test coverage + +* **Dedicated tests:** none. +* **Bulk smoke coverage:** `tests/instrument/adapters/frameworks/test_bulk_ported_smoke.py:65–67` + imports `EmbeddingAdapter` and exercises construction, `connect()`, + metadata, and capability declaration via `_PARAMETRIZE_CASES`. The + smoke pass for `VectorStoreAdapter` is implicit — it inherits the same + base class but is not explicitly listed in the parametrize cases. +* **What's missing vs ateam:** ateam ships `tests/adapters/embedding/test_integration.py` + exercising provider wrapping with mocked clients (audit §1.17). That + file was NOT ported. Restoring it is on the Tier-2 test-restoration + queue (audit §3 item 16). +* **Sample:** `samples/instrument/embedding/main.py` exercises + `wrap_openai` end-to-end with a real OpenAI client. There is no + vector-store sample. + +A complete test pass for v1.7 would cover, at minimum: + +1. Construction with each `CaptureConfig` preset. +2. `wrap_openai(client)` with a stubbed client that returns a + `CreateEmbeddingResponse`-shaped object — verify the emitted event has + the expected `provider`, `model`, `batch_size`, `dimensions`, + `total_tokens`, `latency_ms`. +3. Same for Cohere (`embed`) and sentence-transformers (`encode`). +4. `wrap_pinecone(index)` with a stub returning `matches=[Mock(score=…)]` — + verify `score_min/max/mean` aggregation. +5. `wrap_weaviate(collection)` for both `near_vector` and `near_text`. +6. `wrap_chroma(collection)` with the dict-shaped response. +7. `disconnect()` restores all originals (assert `client.embeddings.create + is original`). +8. Exception in wrapped call does NOT emit an event but does NOT swallow. +9. `serialize_for_replay()` returns a `ReplayableTrace` with the events + accumulated during the session. +10. Pydantic-compat smoke (`requires_pydantic = PydanticCompat.V1_OR_V2`) + holds for both classes. + +--- + +## 7. Roadmap + +The following items are explicitly planned for v1.7 or later. Nothing in +§1–§6 should be read as committing to any of them. + +* **v1.7 — Async client support.** Add `wrap_async_openai(client)` and + equivalents for AsyncCohere. Wrappers will be `async def` and `await` + the underlying call before emitting. +* **v1.7 — Vector-store write instrumentation.** Wrap Pinecone `upsert`, + Weaviate `data.insert`, Chroma `add`/`update`. Emit + `vector_store.upsert` / `vector_store.delete` events. +* **v1.7 — Voyage AI provider.** Add `wrap_voyage(client)` matching the + shape of the existing OpenAI wrapper. +* **v1.7 — Restored test parity with ateam** (Tier-2 item 16). +* **v1.8 — Qdrant + pgvector adapters.** Both have widely-used Python + clients with stable query surfaces; wrapping is straightforward. +* **v1.8 — Typed Pydantic events** (cross-adapter, audit Tier-1 item 5). + Define `EmbeddingCreateEvent` and `RetrievalQueryEvent` Pydantic models; + switch the wrappers from `emit_dict_event` to `emit_event`. +* **v1.8 — OTel `gen_ai.*` semconv** (cross-adapter, audit Tier-1 item 6). + Add parallel `gen_ai.system="openai"`, `gen_ai.request.model=…`, + `gen_ai.usage.input_tokens=…` attributes alongside the existing flat + fields. +* **No-date-set — Re-ranker support.** Cohere `rerank` and Voyage `rerank` + are reasonable candidates. Decision pending product input on whether + re-rankers should be modeled as `tool.call` or as a new + `retrieval.rerank` event. +* **No-date-set — Per-chunk retrieval correlation.** Requires upstream + framework adapter cooperation (LangChain / LlamaIndex / a future + retrieval-orchestration adapter); not solvable inside this adapter + alone. + +--- + +*End of spec.*