diff --git a/.claude/skills b/.claude/skills new file mode 120000 index 00000000..3e73f3a3 --- /dev/null +++ b/.claude/skills @@ -0,0 +1 @@ +../.github/skills \ No newline at end of file diff --git a/.github/skills/migrate-from-openinference/SKILL.md b/.github/skills/migrate-from-openinference/SKILL.md new file mode 100644 index 00000000..f409ef7d --- /dev/null +++ b/.github/skills/migrate-from-openinference/SKILL.md @@ -0,0 +1,599 @@ +--- +name: migrate-from-openinference +description: Migrate an openinference-instrumentation-* package from https://github.com/open-telemetry/donation-openinference into this repo. Creates a new package, or — when a package for the library already exists in the repo — augments it with the coverage OpenInference adds on top. Use when a user asks to migrate or port a package from OpenInference. +--- + +# Migrate an OpenInference `instrumentation-*` package + +Migrate an `openinference-instrumentation-` package from +https://github.com/open-telemetry/donation-openinference into this +repo. The result emits OTel GenAI semantic conventions through +`opentelemetry-util-genai`. + +Two modes, decided by the "Before you start" gate below: + +- **Greenfield migration** — no package for the library exists yet. Create a + **new implementation** under `instrumentation/`. The default, and what the + bulk of this skill describes. +- **Augment an existing package** — the repo already ships + `opentelemetry-instrumentation-genai-`. Don't re-create it; inventory + what it covers, diff against OpenInference, and add **only the missing + parts**. See [Augment mode](#augment-mode-the-package-already-exists). + +For a greenfield migration the major work items are: rewriting the patcher to +method-level (step 5), mapping every request/response shape into OTel +`InputMessage`/`OutputMessage` parts (step 6), and migrating the unit-test +corpus while filtering openinference-framework plumbing tests out (step 7). + +## Inputs + +User specifies the source, e.g. `openinference-instrumentation-crewai`. + +- **Source**: It should point to one of the folders in + `https://github.com/open-telemetry/donation-openinference/tree/main/python/instrumentation/`. + Fetch a fresh shallow clone if you don't already have one locally: + ```sh + git clone --depth=1 https://github.com/open-telemetry/donation-openinference.git /tmp/openinference + ``` + and use `/tmp/openinference/python/instrumentation//` as the + source path in step 1. + +User may also provide the **target package name**. If not provided: derive it from the source name: +- drop the leading `openinference-instrumentation-`. Remaining part should match the instrumented library name as it appears on PyPI. If it's not the case, flag it. +- The target package name should be `opentelemetry-instrumentation-genai-` where `` is the instrumented library name (e.g. `openai`, `anthropic`, `bedrock`). For example: + - `openinference-instrumentation-openai` → `opentelemetry-instrumentation-genai-openai` + - `openinference-instrumentation-anthropic` → `opentelemetry-instrumentation-genai-anthropic` + Confirm the chosen name with the user. + +## Before you start: is there already a package for this library? + +Once the target name is settled, check whether the repo already ships it: + +```sh +ls instrumentation/opentelemetry-instrumentation-genai- 2>/dev/null +``` + +- **Exists →** **augment mode**: don't scaffold a new package. OpenInference + is now a *second* reference to mine for coverage the existing package + lacks. Jump to [Augment mode](#augment-mode-the-package-already-exists). +- **Doesn't exist →** greenfield migration; continue below. + +If a near-name sibling (a `-agents` / `-client` suffix) might instrument a +*different* surface of the same vendor, confirm the target name with the user +before deciding. + +## Before you start: check for native OTel instrumentation + +AI SDKs increasingly ship their **own** OpenTelemetry GenAI instrumentation. +When they do, migrating the OpenInference package is redundant. So +before writing any code, determine whether the instrumented library is +self-instrumenting. + +```sh +# 1. Does the SDK depend on the OTel API / semconv? +pip show | grep -i requires # or read its pyproject / METADATA +# a dependency on opentelemetry-api or opentelemetry-semantic-conventions +# is the tell. +# 2. Does its source actually emit GenAI spans? +python -c "import , os; print(os.path.dirname(.__file__))" +rg -l "opentelemetry|semconv\._incubating\.attributes\.gen_ai" / +``` + +A dependency on `opentelemetry-api` (or `-semantic-conventions`) **plus** +`gen_ai_attributes` usage in the SDK source means the library is +self-instrumented. Confirm empirically: set a **global** `TracerProvider` +(native hooks often activate only when a real, non-proxy provider is +configured), make one call, and inspect the emitted spans' instrumentation +scope. + +**If the library is self-instrumented, do NOT migrate the OpenInference +package.** Pivot the work: + +1. **Ignore the OpenInference instrumentation entirely** — the vendor owns + the spans; there is nothing to re-implement, and no `src/` instrumentor / + patcher to write. +2. **Write conformance tests against the native instrumentation.** Follow + step 8 / the `write-conformance-tests` skill, but each scenario's `run()` + configures providers and enables the **native** tracer (e.g. sets a global + `TracerProvider`) instead of calling a `*Instrumentor` — then runs the + emitted telemetry through weaver live-check. +3. **Identify gaps / inconsistencies** between the native output and the + GenAI semconv: missing operations, wrong operation name, legacy/duplicate + attributes, no metrics, no content-capture controls, no util-genai + content modes / completion-hook / upload support, etc. Record each as an + `expected_violation` or a documented skip, same as a normal migration. +4. **Write `MIGRATION_REPORT.md`** stating the library is self-instrumented, + the conformance results, and the gap list — that report is the + deliverable. **Stop and surface the finding to the user.** Do not build a + competing package unless they explicitly decide to (e.g. to suppress + native instrumentation and layer util-genai features on top). + +Only when the library has **no** native OTel instrumentation do you continue +with the migration flow below. + +## Reference material + +- **OTel GenAI spans**: — authoritative attribute names, spans, logs, and metrics definitions. +- **OpenInference → OTel attribute mapping** (Arize-maintained): . Use as a quick lookup for what an OpenInference attribute *roughly* corresponds to in OTel; when the mapping disagrees with the official semconv, **the official semconv wins**. +- **Message JSON schemas**: + - input messages: + - output messages: + - system instructions: + - tool definitions: + - retrieval documents: + +- **Code for above models**: . + +## Non-negotiable rules + +The repo-wide rules in [AGENTS.md](../../../AGENTS.md) already apply +(telemetry through `opentelemetry-util-genai` public surface only, no +`type: ignore`, semconv enums over string literals, re-raise caught +exceptions). The rules below are the ones the migration is most likely to +violate: + +1. **Zero OpenInference dependencies.** No `openinference-instrumentation`, + no `openinference-semantic-conventions`, no `openinference-*` anywhere + in the migrated package's `src/` or `tests/`. Verify with + `rg openinference instrumentation/` — output must be empty. +2. **Public util-genai surface only.** Beyond the AGENTS.md rule, the migrated package + must not import any `opentelemetry.util.genai._*` module — the allowed + modules are enumerated in step 4. +3. **Ignore all other OpenInference instrumentations during the migration.** The only + instrumentation code to read is the OpenInference package being migrated + plus `opentelemetry-util-genai`. Build + from first principles: original OpenInference code + util-genai public API + + official semconv spec. +4. **Never work around gaps.** If util-genai or the GenAI semconv is + missing something, flag it and fail the test intentionally +5. **Do not make OTel API calls.** **Exception:** + semconv attributes that exist in the registry but have no named property + on `InferenceInvocation` (e.g. `gen_ai.usage.cache_creation.input_tokens`, + `gen_ai.usage.cache_read.input_tokens`) may be set via + `invocation.attributes[KEY] = …` — that's still going through the + util-genai extension point. Import the key from + `opentelemetry.semconv._incubating.attributes.gen_ai_attributes`. Do + not invent attribute names that aren't in the semconv. +6. **Reuse VCR cassettes.** Reuse cassettes from the OpenInference + tests when possible. +7. **Conformance tests must never be silently skipped.** + When instrumentation can't be made conformant due to missing + information, gap in semantic conventions, or in util-genai, still + write the scenario and let it fail. + + Ask user to decide if they want to mark the scenario as skipped with a reason, or + add an `expected_violation` in the scenario that covers the missing piece. + All these must be documented in `MIGRATION_REPORT.md` as well, with links to the skipped scenario and the expected violation. + +8. **Do not modify weaver policies.** + +## Augment mode: the package already exists + +The package already has a working, conformant implementation; OpenInference +is just another reference to mine for missing coverage. The job is a tight, +delta-only PR that closes specific gaps — **not** a rewrite. + +All [Non-negotiable rules](#non-negotiable-rules) apply to every line you +add, plus two specific to this mode: + +- **Don't rewrite or "improve" existing code.** Leave the existing patcher, + wrappers, and tests alone unless OpenInference reveals a concrete bug — and + then it's a *separate* PR. No opportunistic refactors. +- **Match the existing package's conventions.** New wrappers, helpers, and + scenarios follow the patterns already there (helper names, wrapper + structure, conftest/fixture wiring, scenario shape). Don't add a second way + to do something the package already does. + +### A. Inventory what's already there + +Map the existing package before reading OpenInference (read, don't guess): +patched methods (`wrap_function_wrapper` / `unwrap` in `_instrument`), +request/response shapes its wrappers map, the unit-test matrix per method +(`rg -c '^\s*(async )?def test_' instrumentation//tests/`), +conformance scenarios under `tests/conformance/`, and cassettes. + +### B. Diff against OpenInference + +Run the OpenInference analysis as a greenfield migration would (the reading behind +steps 5–6): every method it patches, every shape it parses. Subtract +inventory A. The remainder is the work-list: + +- Methods OpenInference patches that the package doesn't → new wrappers (step 5). +- Shape branches OpenInference handles that the wrappers drop → extend the + mapping (step 6). +- Scenarios OpenInference covers that the package lacks → new tests / + conformance (steps 7–8). + +Coverage both already have → skip. Coverage the package has but OpenInference +lacks → leave it; not a regression. + +### C. Add the delta + +Apply steps **5–10 to the new parts only**: + +- **Steps 1–3 (scaffold/rename/pyproject) skipped** — touch `pyproject.toml` + only for a genuinely new entry point or dependency range, `README.rst` only + for a new user-visible capability. +- **Step 4** applies to anything you copy in; nothing to excise from existing + code. +- **Steps 5–9** extend the existing wrappers / test utils / conformance + runner in place, not parallel ones. +- **Step 10** is done; revisit `tox.ini` / pyright only for a new test factor + or requirements file. + +### D. Report and review + +Write `MIGRATION_REPORT.md` via the `review-migration` skill as usual — it +detects augment mode. + +## Migration flow + +> The numbered steps below are written for a **greenfield migration**. In +> [augment mode](#augment-mode-the-package-already-exists) skip steps 1–3, +> and scope steps 5–10 to the delta from the inventory/diff (sections A–C +> above). + +### 1. Create the target package + +Because the patcher, wrappers, and tests are all rewritten (steps 5–7), a +`cp -R` of the OpenInference tree mostly creates files you immediately +delete or overwrite. **Prefer scaffolding fresh** from the nearest existing +package (e.g. `opentelemetry-instrumentation-genai-anthropic`) and copy over from +OpenInference **only** what you actually reuse: + +- `LICENSE` (Apache-2.0). +- Reusable cassettes (step 9), if any. + +```sh +mkdir -p instrumentation//src/opentelemetry/instrumentation/genai/ +mkdir -p instrumentation//tests/conformance instrumentation//tests/cassettes +cp /LICENSE instrumentation//LICENSE +``` + +Do **not** carry over `examples/`, OpenInference's `README.md`, or its +`CHANGELOG.md` (per-package changelogs are towncrier-generated at release +time). + +(If you do `cp -R` instead, clean it up afterwards: +`rm -rf .pytest_cache .tox .venv venv .vscode .DS_Store .claude .ruff_cache CHANGELOG.md` +and `find . -name __pycache__ -type d -exec rm -rf {} +`.) + +### 2. Rename the Python module + +OpenInference packages live at `src/openinference/instrumentation//`. +Move that tree under the OTel GenAI namespace, update path according to the target package name: + +```sh +mkdir -p src/opentelemetry/instrumentation/genai +mv src/openinference/instrumentation/ src/opentelemetry/instrumentation/genai/ +rm -rf src/openinference +``` + +Update every import. Verify zero `openinference` references remain in +`src/`, `tests/`, README. The instrumentor class typically renames from +`Instrumentor` (kept as-is — same name is fine). + +### 3. Update `pyproject.toml`, `version.py`, and `README.rst` + +- `[project] name` → new package name. +- `[project.entry-points.opentelemetry_instrumentor]` → un-prefixed lib name + pointing at the new module path + (` = "opentelemetry.instrumentation.genai.:Instrumentor"`). +- Hatch version path, project URLs, classifiers → new repo paths. +- **Strip every `openinference-*` dependency.** OpenInference packages typically depend on + `openinference-instrumentation`, `openinference-semantic-conventions`, and + sometimes `opentelemetry-instrumentation` for the `BaseInstrumentor` mixin + — keep only the last one. Replace with `opentelemetry-instrumentation` (for + `BaseInstrumentor`) and the underlying SDK (`openai`, `anthropic`, …) at + the same range OpenInference was using. +- `__version__` in `version.py` should equal the value in + `util/opentelemetry-util-genai/src/opentelemetry/util/genai/version.py` — all + workspace packages share one version. Verify: + + ```sh + diff <(grep ^__version__ instrumentation//src/opentelemetry/instrumentation/genai//version.py) \ + <(grep ^__version__ util/opentelemetry-util-genai/src/opentelemetry/util/genai/version.py) + ``` + +- Hatchling builds **require a `README.md` or `README.rst`**. Rewrite it to + point at the new repo URLs and module path; drop OpenInference links, OpenInference badges, + `using_attributes(...)` examples, `OpenInferenceTracer` / `TraceConfig` + configuration, and any "OpenInference semconv" links. Include a usage snippet + importing from `opentelemetry.instrumentation.genai.`, a pointer to + `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`, and a pointer to + `tests/conformance/` (no `examples/`). + +### 4. Drop OpenInference plumbing + +OpenInference ships a framework that's incompatible with this repo's util-genai model +— excise it before touching the patcher. + +```sh +rg 'openinference|OpenInferenceTracer|TraceConfig|using_attributes|using_session|using_user|using_metadata|using_tags|SpanAttributes\.|OpenInferenceMimeTypeValues|OpenInferenceSpanKindValues|safe_json_dumps' src/ tests/ +``` + +Drop every match. The mappings: + +- **`OpenInferenceTracer` / `TraceConfig`** — replaced by `TelemetryHandler` from + `opentelemetry.util.genai.handler`. Instrumentation code does not + construct or pass tracers. +- **`using_attributes(session_id=…, user_id=…, …)` / `using_session` / + `using_metadata` / `using_tags`** — there is **no OTel GenAI equivalent** + context-propagation API. Drop the calls and the tests that exercise them + (those go in the test "skip with reason" bucket — see step 7). +- **`OpenInferenceSpanKindValues` / `OpenInferenceMimeTypeValues`** — drop; + span kind is set by util-genai based on the invocation type. +- **`SpanAttributes.LLM_*` / `SpanAttributes.INPUT_*`** — flat-string + OpenInference-semconv attributes. Replaced by typed `InputMessage` / `OutputMessage` + payloads serialized into `gen_ai.input.messages` / `gen_ai.output.messages` + by util-genai. The conversion is in step 6. +- **`safe_json_dumps`** — drop; util-genai serializes message payloads. + +**Any import of `opentelemetry.util.genai._` from instrumentation +`src/` is a violation.** Public surface only: + +- `opentelemetry.util.genai.handler` — `TelemetryHandler` +- `opentelemetry.util.genai.invocation` — `InferenceInvocation`, + `EmbeddingInvocation`, `ToolInvocation`, `WorkflowInvocation`, + `AgentInvocation`, `Error`, `GenAIInvocation` +- `opentelemetry.util.genai.types` — `InputMessage`, `OutputMessage`, + `Text`, `ToolCallRequest`, `ToolCallResponse`, `Reasoning`, + `ServerToolCall`, `ServerToolCallResponse`, `GenericPart`, `Blob`, + `File`, `Uri`, `Modality` +- `opentelemetry.util.genai.completion_hook` +- `opentelemetry.util.genai.environment_variables` + +```sh +rg 'from opentelemetry\.util\.genai\._' instrumentation//src/ +``` + +Output must be empty. + +### 5. Rewrite patching: transport → API method level + +This is the largest behavioral change. OpenInference typically patches at +the HTTP-transport layer (`OpenAI.request`, `AsyncOpenAI.request`, +`HTTPClient._send_request`, etc.) and dispatches by `cast_to` response type. +**That pattern does not survive the migration.** util-genai's +`InferenceInvocation` model needs the request kwargs (`model`, `messages`, +`tools`, `stream`, …) at call time, which only the API methods see. + +Replace every transport-level wrapper with method-level +`wrap_function_wrapper` calls — one per public API endpoint — using +**positional args only** (newer wrapt versions reject keyword args): + +```python +from wrapt import wrap_function_wrapper +from opentelemetry.instrumentation.utils import unwrap + +wrap_function_wrapper( + "openai.resources.chat.completions", # module (positional) + "Completions.create", # name (positional) + chat_completions_wrapper, # wrapper (positional) +) +``` + +For uninstrument, use `opentelemetry.instrumentation.utils.unwrap` (matching +positional module + name). + +**Patch ALL API methods OpenInference instruments.** Walk OpenInference's +`_instrumentor.py` / `instrumentor.py` plus the dispatch table in the +transport accumulator and enumerate every endpoint OpenInference emits a span for — +including ones with only generic attribute extraction (assistants, threads, +files, fine-tuning, vector stores, batches, uploads, moderations, …). Each +becomes one `wrap_function_wrapper` call. Dropping coverage for an endpoint +because it's "legacy" or "rarely used" is a regression — OTel GenAI semconv +applies to every inference and embedding API regardless of vintage. If a +specific API has no util-genai invocation type yet, that's a gap for the +review report (see step 11), not a reason to drop the patch. + +### 6. Map every request and response shape into OTel GenAI types + +For each wrapped method, walk OpenInference's input-parsing branch by branch and +ensure each branch has a corresponding mapping in the new wrapper. Same +for output parsing. A wrapper that handles `str` input but not `list` +input (when the original SDK accepts both) is incomplete and must not +ship. + +Mapping cheat sheet — OpenInference source shape on the left, OTel construct on the +right (all types from `opentelemetry.util.genai.types` unless noted): + +| Source request item | OTel construct | +|---|---| +| User / assistant / system text message | `Input/OutputMessage(role=…, parts=[Text(content=…)])` | +| Assistant message containing a tool/function call | `Message(role="assistant", parts=[ToolCallRequest(name=…, id=…, arguments=…)])` | +| Tool/function result message | `Message(role="tool", parts=[ToolCallResponse(id=…, response=…)])` | +| Reasoning / thinking item | `Message(role="assistant", parts=[Reasoning(content=…)])` | +| Server-side tool call (web_search, file_search, code_interpreter, …) | `Message(parts=[ServerToolCall(name=…, server_tool_call=…, id=…)])` | +| Server-side tool call result | `Message(parts=[ServerToolCallResponse(server_tool_call_response=…, id=…)])` | +| Inline image / audio / video bytes | `Blob(mime_type=…, modality="image"\|"audio"\|"video", content=b"…")` | +| External media URL | `Uri(mime_type=…, modality=…, uri="…")` | +| File reference (e.g. OpenAI `file_id`) | `File(mime_type=…, modality=…, file_id="file-…")` | +| Provider-specific item with no semconv mapping | `GenericPart(value=…)` — never silently drop. Flag those in the review report. | + +Output messages mirror the input mapping — `OutputMessage` serializes with +a `parts` array (not `content`); each part has a `type` field. When +asserting on `gen_ai.output.messages`, parse the JSON and check +`msg["parts"]`. + +`Modality` is `Literal["image", "video", "audio"]`. `error.type` and span +status come from `invocation.fail(exc)` — do not emit a separate span +exception event. + +### 7. Restructure tests + +```text +tests/ + cassettes/.yaml + conformance/ + inference.py / embedding.py / ... # see step 8 + conftest.py + test_.py # unit tests; refactor onto helpers + requirements.{oldest,latest}.txt +``` + +**Categorize OpenInference tests before migrating.** List every test function in the +OpenInference package and bucket each one: + +- ✅ **Migrate** — exercises a patched API method. Rewrite assertions + from flat OpenInference semconv attributes (`SpanAttributes.LLM_INPUT_MESSAGES_…`) + to OTel constructs: assert on `span.attributes[GenAIAttributes.GEN_AI_…]` + (semconv constants), and parse `gen_ai.input.messages` / `gen_ai.output.messages` + JSON to check the `parts` arrays. +- ✅ **Migrate (rewrite)** — unit test for an OpenInference-internal helper + (attribute extractor, message parser, etc.) where the helper is gone but + the **parsing scenario** still applies. Rewrite as an integration test + that feeds the same response shape through VCR and asserts the + resulting OTel telemetry. This is the bucket most likely to be + mis-categorized — anything covering tool-call objects, refusals, + reasoning items, multi-content messages, token-usage breakdowns, image + inputs, raw-response variants (e.g. `with_raw_response`) is parsing + domain logic and **must** be migrated. The wrapper has to do the + parsing; the test has to verify it. +- ❌ **Skip with reason** — exercises OpenInference framework plumbing with no OTel + equivalent: `using_attributes()` context propagation, `TraceConfig` + masking, `OpenInferenceTracer` behavior, OpenInference flat-attribute naming format, + `OpenInferenceSpanKindValues` checks. Document the reason in a comment + in the test file (or in `MIGRATION_REPORT.md` later — see step 11). + Do **not** skip a test because the API it exercises is "legacy" or + "new" — semconv applies to all inference APIs. + +Decision rule for the ✅ rewrite vs ❌ skip split: ask whether the test +covers a *response shape* from the instrumented library, or *OpenInference framework +behavior*. A test that constructs a tool-call object with `arguments` / +`call_id` / `name` and verifies it's extracted into attributes covers a +response shape — migrate it. A test that checks +`using_attributes(session_id=…)` propagates into span attributes covers OpenInference +framework behavior — skip it. + +**Sanity check before committing step 7.** Count source vs migrated tests: + +```sh +rg -c '^\s*(async )?def test_' /tests/ +rg -c '^\s*(async )?def test_' instrumentation//tests/ +``` + +A migration that drops from 80 tests to 5 is a regression — go back to the +"migrate" and "migrate (rewrite)" buckets and finish them. + +**Replace conftest boilerplate.** OpenInference conftests duplicate the +exporter/provider/VCR plumbing that lives here in +`opentelemetry-test-util-genai`. Don't copy OpenInference's — mirror an +existing package's conftest (e.g. +`instrumentation/opentelemetry-instrumentation-genai-openai/tests/conftest.py`), +which registers the shared fixtures as plugins: + +```python +pytest_plugins = [ + "opentelemetry.test_util_genai.fixtures", + "opentelemetry.test_util_genai.vcr", +] +``` + +The lib-specific conftest then adds only: `vcr_config` (per-package +`filter_headers` and `before_record_response`), an `environment` autouse +for the lib's API-key env var, library-client fixtures (e.g. +`openai_client` / `async_openai_client`), and the `instrument_*` fixtures +(`instrument_no_content`, `instrument_with_content`, `instrument_event_only`) +built on the shared `instrument` context manager from +`opentelemetry.test_util_genai.instrumentor` (see [AGENTS.md](../../../AGENTS.md) Tests). + +**Assertions — no shared helper module.** There is no +`opentelemetry.test_util_genai.assertions`. Assert directly on +`span.attributes[GenAIAttributes.GEN_AI_…]` using the semconv constants +from `opentelemetry.semconv._incubating.attributes.gen_ai_attributes`, and +on metric/log records from the in-memory exporters. Factor repeated checks +into a per-package `tests/test_utils.py` (existing packages have helpers +like `assert_all_attributes`, `assert_completion_attributes`, +`assert_messages_attribute`, plus weather-tool fixtures). OpenInference +helpers (`_check_llm_attributes`, etc.) map onto these — rewrite them on +top of OTel semconv constants and parsed `gen_ai.*.messages` JSON, and keep +tiny constants (`DEFAULT_MODEL`, sample prompts) inline or in +`tests/test_utils.py`. + +**Required unit-test coverage per wrapped method.** Apply the repo test +matrix (sync/async × happy/error, plus streaming × happy/error where the +method streams — see [AGENTS.md](../../../AGENTS.md) Tests section) +to **every** method patched in step 5. For the migration these are blockers for +the migration PR, not follow-up. The error variants must verify the +original exception is re-raised, `error.type` is recorded, and span status +is ERROR. + +**`tests/requirements.{latest,oldest}.txt`** — OpenInference typically has its own +pin file; keep only the third-party version pins (`openai==`, +`anthropic==`, …). +Add current `util/opentelemetry-util-genai` and `instrumentation/opentelemetry-instrumentation-genai-` to the latest; for the oldest, use the oldest version of `opentelemetry-util-genai` that works (there is none released yet, but check). + +### 8. Conformance scenarios + +Author conformance scenarios using the **`write-conformance-tests`** skill — +it's the generic procedure (scenario modules, the `test_conformance.py` +runner, declared gaps, lib-specific assertions, weaver policies) and applies +to any instrumentation. Migration-specific notes on top of that skill: + +- Drop OpenInference's `examples/` tree — its end-to-end demos are replaced + by conformance scenarios, not migrated. +- For an operation blocked by a util-genai/semconv gap, point the + `expected_violations` / `xfail` `reason=` at the gap row in + `MIGRATION_REPORT.md`. + +### 9. Cassettes (or a transport proxy) + +- Copy cassettes from OpenInference's `tests/cassettes/` (or wherever the OpenInference package + parks them) into the migrated package's `tests/cassettes/`. Reuse names so existing + unit tests keep loading them. +- Reuse existing cassettes for conformance scenarios when they are applicable. +- **AI-generated cassettes.** For a cassette OpenInference lacks and you + can't record (no provider access), you may synthesize one from the + provider's API reference via AI. Start it with a + `# TODO: this is generated by AI, re-record` comment, mention it in the PR, + and open a follow-up issue to re-record it against the real provider in CI. + +**Transport proxy instead of cassettes.** If the OpenInference unit tests mock +HTTP (e.g. `respx`, `httpx.MockTransport`) rather than replay recorded +cassettes, you may do the same in the migrated package's **unit** tests — build the SDK +client with an `httpx.MockTransport` (or equivalent) returning canned +responses instead of `@pytest.mark.vcr`. When you go this route: + +- Reuse the OpenInference cassettes' recorded **response bodies** (incl. the + streaming SSE payloads) as the canned responses, so fidelity is preserved. +- Still register the shared VCR plugins in `conftest.py` (the shared + `fixture_vcr` is autouse) and keep `vcr_config`. +- **Use the same mechanism in conformance scenarios.** Conformance does not + require VCR — a scenario can build its client with the same transport mock + and ignore the injected `vcr` (see the `write-conformance-tests` skill). + Pick one mechanism (cassettes *or* transport mock) and use it consistently + across the whole package. +- **Mention the choice in `MIGRATION_REPORT.md`** (the `review-migration` skill + flags missing cassettes; note that the package mocks the transport by + design so the absence is not a gap). + +### 10. Workspace integration + +Wire the new package into the workspace, `tox.ini`, and pyright per the +**Adding a package to the workspace** section of [AGENTS.md](../../../AGENTS.md) +— it applies to any new package, not just migrations. Migration-specific note on top: + +- **Leave the package out of `[tool.pyright] include`.** A migration over untyped + `wrapt` boundaries (`wrapped, instance, args, kwargs`) and vendor SDK members + produces hundreds of strict-mode errors, so don't add it to `include` until + typing lands — track that as a follow-up. + +### 11. Local checks, review, and PR + +Run the pre-PR checks from the **Commands** section of +[AGENTS.md](../../../AGENTS.md) — `tox -e precommit`, `tox -e typecheck`, and +the package's `-{oldest,latest}` (and `-conformance`) test envs. + +Open the PR with the `migration:openinference` label. Run the +`review-migration` skill locally to generate `MIGRATION_REPORT.md`; iterate +until §4 (test coverage) is clean. The review skill compares the migrated package +against OpenInference (or any upstreams you name), so coverage gaps +surface in one report. + +## See also + +- [AGENTS.md](../../../AGENTS.md) — general repo rules that already apply to the migration. +- `util/opentelemetry-util-genai/AGENTS.md` — util-genai usage rules. +- `.github/skills/write-conformance-tests/SKILL.md` — generic conformance-scenario authoring (step 8). +- `.github/skills/review-migration/SKILL.md` — sister review skill (writes `MIGRATION_REPORT.md`). diff --git a/.github/skills/review-migration/SKILL.md b/.github/skills/review-migration/SKILL.md new file mode 100644 index 00000000..fcac4d7a --- /dev/null +++ b/.github/skills/review-migration/SKILL.md @@ -0,0 +1,316 @@ +--- +name: review-migration +description: Review a migrated or augmented instrumentation-genai package by comparing it against any known external upstream implementations of the same instrumentation (OpenInference, vendor-specific). Handles both greenfield migrations and augment-mode PRs that add coverage to a pre-existing package (checking the added parts, their consistency with existing code, and old-vs-new coexistence). Writes MIGRATION_REPORT.md in the migrated package root. +--- + +# Review a migrated instrumentation-genai package + +Compare the migrated package in `instrumentation//` against +every known upstream implementation of the same instrumentation, and +write a `MIGRATION_REPORT.md` in the migrated package root listing only +what's missing or wrong. + +## Inputs and upstreams + +User supplies the **migrated package name**, e.g. +`opentelemetry-instrumentation-genai-openai`. Strip +`opentelemetry-instrumentation-genai-` to get ``; that drives the +upstream lookup. + +**Upstreams come from the user, with a default fallback:** + +- **If the user names one or more upstreams** (a repo, a directory path, a + vendor's own SDK-side instrumentation, a Logfire / Pydantic AI package, + …), compare against exactly those — one column per upstream. Use them + verbatim; don't second-guess or substitute. +- **If the user names none**, default to OpenInference: search + `https://github.com/open-telemetry/donation-openinference` under + `python/instrumentation/` for a package matching `` (typically + `openinference-instrumentation-`, but confirm by listing the + directory — the name may differ). Fetch a shallow clone if you don't + already have one locally: + + ```sh + git clone --depth=1 https://github.com/open-telemetry/donation-openinference.git /tmp/openinference + ls /tmp/openinference/python/instrumentation/ | rg + ``` + +If there are no upstreams to compare against (no user-named upstream resolves and no +OpenInference match), this isn't a migration — bail with a one-line note. + +## Greenfield vs augment mode + +Detect the mode before reviewing — it changes what counts as a problem: + +- **Greenfield migration** — the migration *created* the package; everything in + `instrumentation//` is the migrated package. +- **Augment mode** — the package **already existed** and the migration only + *adds* coverage mined from the upstream. The diff modifies a pre-existing + package. + +Tell them apart from the PR diff / git history / unstaged changes: if the +package's `src/` and tests predate the migration commits, it's augment mode. + +In augment mode the deliverable shifts to: the **added** coverage's +completeness vs the upstream, its **consistency** with the pre-existing code, +and **old-vs-new coexistence** concerns. Do **not** flag pre-existing +coverage the upstream also has as missing. Per-section adjustments are inline +below. + +## Rules + +- **Deliverable is `instrumentation//MIGRATION_REPORT.md`.** Write + there, not stdout. The file is gitignored — regenerate freely. +- **Report problems, not work.** No "✅ matches", no "all tests migrated", + no recap of structure. The PR diff already shows what's there; the + report is for what's missing or wrong. +- **Don't justify empty findings.** When a section/table has no problems, + emit `_none_` (or skip the section, where the rule says so) and stop. + Do not list which greps you ran, which env vars you searched for, or + why you concluded clean. Clean means clean. +- **Tables only when there are ≥1 problem rows.** Empty section: `_none_`. +- **Read, don't guess.** Every claim from actual code or command output. + Don't infer from package names or READMEs. Before listing something as + missing, grep for it in the migrated package's `src/` and `tests/`. +- **Snapshot of the PR head**, not aspirational state. + +## Report structure + +First line: + +```markdown +# Migration review: + +Mode: greenfield migration | augment existing package + +Compared against: +- OpenInference: `openinference-instrumentation-` (add link) +- : ``, `` (omit the line if no other upstream + is in scope) +``` + +State the mode on the second line. In augment mode, one sentence after it +naming what the PR adds (the gaps it closes) orients the reviewer. + +Sections render in order. Each is bounded to its problems. + +### 1. Instrumented API surface + +Table of every API method any source patches, marking methods the +migrated package does **not** patch. **Be exhaustive.** Enumerate every +endpoint upstream emits a span for, even if attribute extraction +is generic. Do not collapse rows. + +For each upstream, walk the actual instrumentation code, not docs: + +- **Method-level patching** (the shape of this migration, and of any + method-level upstream): read every `wrap_function_wrapper(...)` call in + `_instrument()`. Each call = one row. +- **Transport-level patching** (typical of OpenInference, e.g. + `openai.OpenAI.request` / `AsyncOpenAI.request`): enumerate every + `cast_to` response type the accumulator/dispatch table handles **and** + every other endpoint that flows through the wrapped method (assistants, + threads, files, fine-tuning, images, audio, vector_stores, batches, + uploads, moderations, …). Spans with only generic attribute extraction + still count — list them, and note "generic span only" in the Notes. +- **Anything else** (vendor SDK hooks, monkey-patched class methods, + decorator-based instrumentation): walk the actual entry points the + upstream registers and list one row per emitted span site. + +One column per upstream that exists, plus `This package`: + +| API method | OpenInference | This package | Notes | +|---|---|---|---| +| `openai.resources.chat.completions.Completions.create` | ✅ | ✅ | | +| `openai.resources.responses.Responses.create` | ✅ | ❌ | | +| `openai.resources.beta.assistants.Assistants.create` | ✅ (generic span only) | ❌ | | + +- ✅ = patched. ❌ = at least one upstream patches it, this migration doesn't. + — = not patched (and that's expected — upstream doesn't patch it either). +- Sort `❌` rows to the top. +- For `❌` rows, the **Notes** cell must name the GenAI semconv operation + the method maps to (`chat`, `embeddings`, `text_completion`, + `invoke_agent`, `invoke_workflow`, `execute_tool`, `realtime`, …) — or + "no semconv operation defined yet" if the spec hasn't caught up. + +Do not render a separate `**Gaps:**` bullet list below the table — the +table itself is the gap list. + +**Augment mode.** Split `This package` into `Existed` and `Added this PR`: + +| API method | OpenInference | Existed | Added this PR | Notes | +|---|---|---|---|---| +| `…chat.completions.Completions.create` | ✅ | ✅ | — | already covered | +| `…responses.Responses.create` | ✅ | ❌ | ✅ | added by this PR | +| `…batches.Batches.create` | ✅ | ❌ | ❌ | still missing — `chat` | + +`❌` in both columns (upstream patches it, the package still doesn't) is the +gap; sort it to the top. + +### 2. Gaps and open issues + +Genuine **tooling/util gaps** — things the migration couldn't do because +`opentelemetry-util-genai` and/or the GenAI semconv doesn't yet support them +(missing util-genai factory, attribute not in the registry yet). +Reference failing/skipped tests if any. + +Do NOT list "OTel semconv doesn't cover X" as a gap when X is just an +inference API variant — semconv applies to all inference APIs. + +| Gap | File / test | Upstream issue | Notes | +|---|---|---|---| + +### 3. Significant behavioral changes + +Material differences vs upstream — equivalences (different code shape, +same emitted telemetry) are not listed. Look for: patching strategy +(transport-level vs method-level); error recording (exception event vs +`error.type` + status); token counting (cache details folded vs +separate); message format (flat OpenInference attributes vs JSON `parts` array); +streaming wrapper shape; attributes set unconditionally vs gated on +`is_recording()` / `should_capture_content()`; anything emitted by +upstream that this migration deliberately drops without a one-line rationale. + +| Aspect | Upstream | This package | Notes | +|---|---|---|---| + +### 3b. Consistency and old-vs-new coexistence (augment mode only) + +**Greenfield migration: skip.** Here the risk isn't "is it conformant" but "do +the additions fit the package they landed in." Compare the **added** code +against the **pre-existing** code (not the upstream); flag only real +problems: + +- **Divergent patterns** — added wrappers/helpers/fixtures re-implementing + something the package already has a convention for. +- **Duplicated scaffolding** — added tests not reusing the package's + `tests/test_utils.py` helpers or conformance runner. +- **Telemetry contradictions** — an added method emitting an operation name, + attribute shape, span-kind, or content-capture gating that disagrees with + what pre-existing methods emit for the analogous case. +- **Coexistence hazards** — added code that alters pre-existing paths (shared + module-level state, `_instrument()` ordering, a dependency range the old + code wasn't tested against), or a `pyproject`/`tox` edit affecting existing + tests. + +| Concern | Pre-existing | Added | Notes | +|---|---|---|---| + +`_none_` if the additions are consistent and isolated. + +### 4. Test coverage + +Three checklists. Render only **missing** cells; if every checklist is +clean, render `_Test coverage complete._` and skip the subsections. + +Gaps in this section are **blockers for the migration PR**, not +follow-up work — address them before merge. Do not list §4 items as +follow-up issues in §5. + +**Augment mode.** The matrix applies to methods **added this PR** (§1 +`Added this PR` = ✅) — those block merge. A pre-existing method missing a +variant is a §5 follow-up, not a §4 blocker. + +Also flag here: **🟡 missing-cassette** — any scenario or unit test that +references a cassette not committed under `tests/cassettes/`. And: +**unreferenced cassettes** — one-line count of `tests/cassettes/*.yaml` +files no test/scenario opens. Skip if 0. + +For each unreferenced cassette, walk git history before naming a cause: + +1. Check whether the same cassette is also unreferenced in the upstream + it came from (for each unreferenced file, search the upstream's + `tests/` for any reference). If yes, it's an **inherited upstream + orphan** — say so plainly; the migration is not responsible. +2. If the cassette IS referenced upstream but not in the migrated package, the migration + dropped a test. That is **not** a §4b cosmetic note — it's a §4a + missing variant (or a missing scenario, depending on what the test + covered). Add it to §4a's table with the upstream test name in Notes, + and do not rationalize ("tests were renamed/removed during the migration" + without naming the commit is speculation, not analysis). + +Never write "appear to be" / "likely" / "safe to delete" without +evidence. Either you walked the history and know, or you didn't — in +which case say "history not checked" so the reviewer does it. + +#### 4a. Unit-test matrix per wrapped method + +For each method in §1 row where `This package` = ✅: + +| Variant | Required when | +|---|---| +| **sync × happy** | always | +| **sync × error** | always — verify the original exception is re-raised unmodified, `error.type` is recorded, span status is ERROR | +| **async × happy** | the lib exposes an async counterpart | +| **async × error** | the lib exposes an async counterpart | +| **streaming × happy** | the method accepts `stream=True` or returns a stream wrapper | +| **streaming × error** | flag if streaming is supported but no error path is exercised at all or error is not recorded on telemetry | +| **async streaming × happy** | the method accepts `stream=True` or returns an async stream wrapper | +| **async streaming × error** | flag if streaming is supported but no error path is exercised at all or error is not recorded on telemetry | + +Identify variants by reading the migrated package's `src/` (`is_streaming(kwargs)`, +async `def`, `Stream` / `AsyncStream` wrappers). + +| Wrapped method | Missing variants | Notes | +|---|---|---| + +#### 4b. Conformance scenarios + +For each distinct GenAI semconv operation the migrated package emits (`chat`, +`embeddings`, `execute_tool`, `invoke_agent`, `invoke_workflow`, +`create_agent`, …) there should be at least one happy-path scenario file under +`tests/conformance/.py` driven by `run_conformance(...)`. More +scenarios per operation are fine but never required. + +| Operation | Scenario file | Status | +|---|---|---| + +Mention if conformance scenario is skipped, there are expected_violations, +or `uv run tox -e py312-test-instrumentation-genai-` fails. + +#### 4c. Docstring / README coverage + +| Asset | Required content | Status | +|---|---|---| +| `README.rst` | install snippet, usage snippet importing from `opentelemetry.instrumentation.genai.`, pointer to `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT` and the util-genai README, pointer to `tests/conformance/` (no `examples/`) | | +| `src/opentelemetry/instrumentation/genai//__init__.py` module docstring | usage example using the new module path; configuration section listing the env vars users actually need | | + +Render only rows that are missing/wrong/stale (e.g. README still imports +from `openinference.instrumentation.`, install command points +at the `openinference-instrumentation-` package name, links to an +OpenInference URL). + +**Refactor misses.** One bullet *only if* unit tests re-implement generic +semconv shape checks inline instead of factoring them into the migrated package's +`tests/test_utils.py` helpers (e.g. `assert_all_attributes`, +`assert_completion_attributes`, `assert_messages_attribute`). There is no +shared `opentelemetry.test_util_genai.assertions` module — assert directly +on `span.attributes[GenAIAttributes.GEN_AI_…]` using the semconv constants. + +### 5. Follow-up work + +What goes in this PR vs. a follow-up: + +- **In this PR**: §4 test coverage gaps. These block merge. In augment mode, + scoped to the methods added this PR, plus any §3b consistency problem in the + added code (it ships here). +- **Follow-up**: §1 ❌ rows (new instrumented methods), §2 util-genai / + semconv gaps, §3 behavioral parity items, and — augment mode — pre-existing + coverage gaps from §4 — each a **separate** PR, one logical change apiece. + +Suggest an issue title per follow-up item, grouped by type (API surface / +util-genai gaps / behavioral parity); name the upstream that covers it and +the semconv operation where relevant. **Do not file the issues** — listing +them is the deliverable; filing AI-generated issues is against the +contributor policy, so the human author decides which to open. If there +are none, render `_No follow-up issues recommended._` + +## See also + +- `.github/skills/migrate-from-openinference/SKILL.md` — the migration skill; it + runs this review at its final step to produce `MIGRATION_REPORT.md`. +- `.github/skills/write-conformance-tests/SKILL.md` — authoring the + conformance scenarios this report checks in §4b. +- `.github/instructions/instrumentation.instructions.md` — the copilot + PR-review rules for `instrumentation/**`; generic instrumentation + violations are flagged there and not repeated in this report. \ No newline at end of file diff --git a/.github/skills/write-conformance-tests/SKILL.md b/.github/skills/write-conformance-tests/SKILL.md new file mode 100644 index 00000000..c1c8a6cd --- /dev/null +++ b/.github/skills/write-conformance-tests/SKILL.md @@ -0,0 +1,362 @@ +--- +name: write-conformance-tests +description: Author GenAI conformance-test scenarios for an OpenTelemetry instrumentation package — Scenario subclasses under tests/conformance/, the test_conformance.py runner, declared gaps, lib-specific assertions, and weaver live-check policies. Use when adding or updating conformance tests for any instrumentation, whether greenfield or ported. +--- + +# Write GenAI conformance tests + +Conformance tests validate that an instrumentation package emits telemetry +matching the [GenAI semantic conventions](https://github.com/open-telemetry/semantic-conventions-genai) +via Weaver live-check. They apply to **any** instrumentation package — +greenfield or ported — and don't depend on how the package was built. + +This skill covers authoring the `tests/conformance/.py` modules and +the `tests/test_conformance.py` runner. For the always-on rules that hold even +without this skill loaded, see the **Conformance tests** section of +[AGENTS.md](../../../AGENTS.md). + +## One scenario per operation + +Put one scenario per emitted semconv operation under `tests/conformance/`. +Write a scenario for **every** semconv operation the library emits, even one +currently blocked by a util-genai or semconv gap. Skipping the scenario hides +the gap; writing it records the gap (see [Declared gaps](#declared-gaps)). +**Never** drop a scenario file because it would fail today. + +## Recommended scenarios + +Cover the scenarios below that apply to the library. Skip a row only when the +library genuinely can't perform that operation (e.g. an inference-only +client has no `embeddings` scenario). + +**LLM client instrumentations:** + +| Scenario | File | Covers | +|---|---|---| +| Inference | `inference.py` | A `chat` operation. | +| Tool calling | `tool_calling.py` | A `chat` turn where the model returns tool calls and a follow-up turn feeds tool results back. Asserts tool calls and tool results are present on input/output **messages**. weaver will validate the format. Do **not** expect `execute_tool` spans unless the client library itself instruments tool execution — most don't; tool execution is the caller's code. | +| Multimodal content | `multimodal.py` | A `chat` turn carrying the **non-text parts** the client accepts (inline image/audio bytes, media URLs, file refs, …), asserting each round-trips onto the messages. Cover only the part types the library emits — see [Message-part coverage](#message-part-coverage). | +| Reasoning | `reasoning.py` | A `chat` turn against a reasoning model where the response carries reasoning/thinking content, asserting a `reasoning` part lands on an output message (and `gen_ai.usage.output_tokens` / reasoning-token attributes if the library records them). Only when the client surfaces reasoning content — see [Message-part coverage](#message-part-coverage). | +| Embeddings | `embedding.py` | An `embeddings` operation. | + +**Agent / orchestration instrumentations:** + +| Scenario | File | Covers | +|---|---|---| +| Agent invocation with tooling | `invoke_agent.py` | An `invoke_agent` run that calls at least one tool — expects `invoke_agent` plus the nested `execute_tool` / `chat` spans the framework emits. | +| Multi-agent orchestration | `multi_agent.py` | One agent handing off to / invoking another — expects nested `invoke_agent` spans under the orchestrator. | +| Workflows | `invoke_workflow.py` | An `invoke_workflow` run wrapping the agent/tool spans it drives. | + +## Message-part coverage + +Weaver validates a part's *shape*, not *which* part types a scenario +exercised — a text-only scenario leaves the package's image/audio/file/tool +mapping unverified. So exercise **every non-text part type the library can +emit** and assert it landed on a message. Cover only what the package +instruments: walk its wrappers (the step-6 mapping for a port) for which +`opentelemetry.util.genai.types` parts they produce. + +| Part `type` | util-genai type | Emitted when the library accepts… | +|---|---|---| +| `text` | `Text` | plain text (always) | +| `tool_call` / `tool_call_response` | `ToolCallRequest` / `ToolCallResponse` | function/tool calling — covered by `tool_calling.py` | +| `server_tool_call` / `server_tool_call_response` | `ServerToolCall` / `ServerToolCallResponse` | vendor server-side tools (web_search, code_interpreter, …) | +| `reasoning` | `Reasoning` | reasoning / thinking items | +| `blob` | `Blob` | inline image/audio/video **bytes** (`modality` distinguishes them) | +| `uri` | `Uri` | an external media **URL** (`modality`) | +| `file` | `File` | a **file reference** / id (`modality`) | +| `generic` | `GenericPart` | a provider item with no semconv mapping — flag, don't drop | + +Group by shared turn/cassette — typically one `multimodal.py` for the +image/audio/file/url inputs the client accepts, `tool_calling.py` for tool +parts, and `reasoning.py` for `reasoning` parts (a reasoning model emits +those on output messages, not input). `type` alone gives `blob` / `uri` / +`file`; to tell image from +audio from video, read the part's `modality` with a `_part_fields` helper +returning `(type, modality)` tuples (defined alongside `_part_types` in +[Lib-specific assertions](#lib-specific-assertions)): + +```python + def validate(self, report: LiveCheckReport) -> None: + super().validate(report) + chat_spans = [ + entry["span"] for entry in report["samples"] + if "span" in entry + and _attr(entry["span"], "gen_ai.operation.name") == "chat" + ] + input_parts = { + (t, m) for span in chat_spans + for t, m in _part_fields(_attr(span, "gen_ai.input.messages")) + } + # e.g. an inline image + an audio URL were sent + assert ("blob", "image") in input_parts, f"no image blob, saw {input_parts}" + assert ("uri", "audio") in input_parts, f"no audio uri, saw {input_parts}" +``` + +If a part type the library accepts can't round-trip yet (a util-genai/semconv +gap), still write the scenario and record it as a +[declared gap](#declared-gaps) — never silently omit the part. + +## Scenario modules + +Each scenario module defines a subclass of `Scenario` from +`opentelemetry.test_util_genai.conformance`. It sets the `expected_spans` / +`expected_metrics` ClassVars and implements +`run(self, *, tracer_provider, meter_provider, logger_provider, vcr)`. +Drive instrumentation through the shared `instrument` context manager (not +`instr.instrument()` / `trace.set_tracer_provider`). The runner injects an +already-configured `vcr`, so a cassette-based scenario just calls +`vcr.use_cassette(...)`: + +```python +# tests/conformance/inference.py +from typing import Any + +from opentelemetry.instrumentation.genai. import Instrumentor +from opentelemetry.sdk._logs import LoggerProvider +from opentelemetry.sdk.metrics import MeterProvider +from opentelemetry.sdk.trace import TracerProvider +from opentelemetry.test_util_genai.conformance import Scenario +from opentelemetry.test_util_genai.instrumentor import instrument + + +class InferenceScenario(Scenario): + expected_spans = ("chat",) + expected_metrics = ( + "gen_ai.client.operation.duration", + "gen_ai.client.token.usage", + ) + + def run( + self, + *, + tracer_provider: TracerProvider, + meter_provider: MeterProvider, + logger_provider: LoggerProvider, + vcr: Any, + ) -> None: + with instrument( + Instrumentor(), + tracer_provider=tracer_provider, + logger_provider=logger_provider, + meter_provider=meter_provider, + content_capture="SPAN_ONLY", + ): + with vcr.use_cassette("inference.yaml"): + ... # call the patched API +``` + +One operation per scenario. No env vars, no logging config. + +**VCR cassettes are not required — a transport mock works too.** Mock HTTP +however the package's **unit** tests already do, and use the **same pattern +across every scenario in that package** (don't mix cassettes and transport +mocks within one lib). If the package mocks the transport (e.g. +`httpx.MockTransport`, `respx`) instead of replaying cassettes, build the +client with that transport inside `run()` and ignore the injected `vcr`: + +```python + def run(self, *, tracer_provider, meter_provider, logger_provider, vcr) -> None: + with instrument( + Instrumentor(), + tracer_provider=tracer_provider, + logger_provider=logger_provider, + meter_provider=meter_provider, + content_capture="SPAN_ONLY", + ): + client = (transport=httpx.MockTransport(_handler)) # canned response + client.(...) # call the patched API +``` + +`vcr` stays in the signature either way (the runner always passes it). + +## Declared gaps + +**Declared gaps** go in the `expected_violations` ClassVar (a tuple of +`ExpectedViolation`), not `xfail`. `run_conformance` fails on *undeclared* +weaver violations and on declared violations weaver no longer reports — so +a known util-genai/semconv gap is recorded as an `expected_violation` that +fails loudly the moment it's fixed. + +When the gap is too big to capture as individual `expected_violations` — the +whole operation can't run yet — skip the entire scenario instead, via +`pytest.mark.xfail` / `skip` on the parametrize entry in +`test_conformance.py`. Don't invent a one-off `reason=` string: **ask the +user to file a tracking bug first**, then use that issue as the skip +`reason=` (e.g. `reason="blocked by open-telemetry/...#1234"`) so the skip +is traceable and gets revisited when the bug is fixed. **Never** drop the +scenario file itself — a skipped scenario still records that the operation +exists; a deleted one hides it. + +## Lib-specific assertions + +**Lib-specific assertions** go in a `validate(self, report)` override on the +scenario (call `super().validate(report)` first) — there is no +`_local_assertions.py` / `LocalAssertions` hook. Common lib-specific shapes: + +- **Vendor server-tool payloads** (`code_interpreter`, `web_search`, …). +- **Vendor-specific finish reasons** outside semconv's enum (`stop`, + `length`, `content_filter`, `tool_call`, `error`). +- **Provider-specific `error.type`** — exception class names from the + underlying SDK. + +`validate` receives the weaver `LiveCheckReport`. Read span samples from +`report["samples"]` (each `entry["span"]["attributes"]` is a list of +`{"name", "value"}`) and seen metrics from +`report["statistics"]["seen_registry_metrics"]`. Always call +`super().validate(report)` first so the `expected_spans` / `expected_metrics` +checks still run: + +```python +# tests/conformance/tool_calling.py +from __future__ import annotations + +import json +from typing import Any + +from opentelemetry.test.weaver_live_check import LiveCheckReport +from opentelemetry.test_util_genai.conformance import Scenario + + +class ToolCallingScenario(Scenario): + expected_spans = ("chat",) + expected_metrics = ("gen_ai.client.operation.duration",) + + def run(self, *, tracer_provider, meter_provider, logger_provider, vcr) -> None: + ... # drive a tool-calling turn — see "Scenario modules" above + + def validate(self, report: LiveCheckReport) -> None: + super().validate(report) # keep the expected_spans / _metrics checks + + # Lib-specific: weaver validates the *shape* of each message part, but + # not that a tool call actually round-tripped. Assert the model's + # tool_call landed on an output message and the tool result was fed + # back on an input message (across the two chat turns). + chat_spans = [ + entry["span"] + for entry in report["samples"] + if "span" in entry + and _attr(entry["span"], "gen_ai.operation.name") == "chat" + ] + assert chat_spans, "no chat span emitted" + + output_part_types = { + t for span in chat_spans + for t in _part_types(_attr(span, "gen_ai.output.messages")) + } + input_part_types = { + t for span in chat_spans + for t in _part_types(_attr(span, "gen_ai.input.messages")) + } + assert "tool_call" in output_part_types, ( + f"expected a tool_call part on an output message, saw {output_part_types}" + ) + assert "tool_call_response" in input_part_types, ( + f"expected a tool_call_response part on an input message, saw {input_part_types}" + ) + + +def _attr(span: dict[str, Any], name: str) -> Any: + for attr in span["attributes"]: + if attr["name"] == name: + return attr["value"] + return None + + +def _part_types(messages_json: str | None) -> list[str]: + # gen_ai.{input,output}.messages is a JSON string of + # [{"role": ..., "parts": [{"type": ..., ...}]}]. + messages = json.loads(messages_json) if messages_json else [] + return [part["type"] for message in messages for part in message["parts"]] + + +def _part_fields(messages_json: str | None) -> list[tuple[str, str | None]]: + # Like _part_types, but keeps modality so image/audio/video are + # distinguishable on blob/uri/file parts (None for parts without one). + messages = json.loads(messages_json) if messages_json else [] + return [ + (part["type"], part.get("modality")) + for message in messages + for part in message["parts"] + ] +``` + +## The test_conformance.py runner + +`tests/test_conformance.py` guards collection with +`pytest.importorskip("opentelemetry.test.weaver_live_check")`, parametrizes +the scenario *instances*, and calls `run_conformance(scenario, vcr=vcr, +weaver=weaver_live_check)` — it builds its own providers from the weaver +OTLP endpoint, so don't pass providers/exporters: + +```python +import pytest + +pytest.importorskip("opentelemetry.test.weaver_live_check") + +from opentelemetry.test.weaver_live_check import WeaverLiveCheck # noqa: E402 +from opentelemetry.test_util_genai.conformance import ( # noqa: E402 + Scenario, + run_conformance, +) + +from .conformance.embedding import EmbeddingScenario +from .conformance.inference import InferenceScenario + + +@pytest.mark.parametrize( + "scenario", + [InferenceScenario(), EmbeddingScenario()], + ids=lambda s: type(s).__name__, +) +def test_conformance( + scenario: Scenario, vcr, weaver_live_check: WeaverLiveCheck +) -> None: + run_conformance(scenario, vcr=vcr, weaver=weaver_live_check) +``` + +Do not write a separate `test_weaver_live.py`; weaver is already wired +through `run_conformance`. The `weaver_live_check` fixture skips the test +(rather than passing without the gate) only when it can't start (unsupported +platform / network) — never wrap it in a try/except, which would silently +disable the gate. + +## Weaver policies + +`weaver_live_check` enforces `policies/genai_content_validation.rego` +(content-attribute JSON shape) and `policies/genai_span_validation.rego` +(span name format, per-op expected attributes) — read these policy files +when authoring scenarios; they're authoritative. + +`weaver_live_check` downloads the pinned weaver binary on first use (cached +under `~/.cache/otel-conformance/weaver/`). + +## Recorded HTTP (cassettes or transport mock) + +A scenario replays one HTTP interaction per operation. Use whichever +mechanism the package's unit tests use, consistently across all of its +scenarios: + +- **VCR cassette** — `vcr.use_cassette(".yaml")`, one committed + cassette per operation under `tests/cassettes/`. +- **Transport mock** — build the SDK client with an `httpx.MockTransport` + (or `respx`) returning a canned response; no cassette file needed. + +Pick the one the lib already follows and don't mix the +two within a package. + +**AI-generated cassettes.** Lacking provider access, you may synthesize a +cassette from the provider's API reference via AI. Start the cassette with a +`# TODO: this is generated by AI, re-record` comment, mention it in the PR, +and open a follow-up issue to re-record it against the real provider in CI. + +## Running + +```sh +uv run tox -e py312-test-instrumentation-genai--conformance +``` + +The `*-conformance` tox envs target `tests/test_conformance.py` directly; the +regular `*-{oldest,latest}` envs `--ignore` it so they don't need the +OTLP/gRPC exporter or `weaver_live_check`. + diff --git a/.gitignore b/.gitignore index fa7bfddc..67ff44c6 100644 --- a/.gitignore +++ b/.gitignore @@ -73,3 +73,6 @@ policies/_schemas.rego # Conformance test weaver live-check reports weaver_reports/ + +# Migration review reports generated by the review-migration skill +MIGRATION_REPORT.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 337764c2..9c00f9dc 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -128,6 +128,22 @@ uv run tox -e changelog-preview If your change doesn't need an entry (pure docs/tooling), add the `Skip Changelog` label to the PR. +## Skills + +This repo ships skills (under `.github/skills/`) that automate the heavy, +repeatable contribution flows. Trigger them deliberately when +you start one of these tasks: + +- **`migrate-from-openinference`** — migrate an `openinference-instrumentation-*` + package into this repo as an OTel GenAI package, or augment an existing + package with the coverage OpenInference adds on top. +- **`review-migration`** — review a ported or augmented package against its + upstream implementation and write `MIGRATION_REPORT.md`. +- **`write-conformance-tests`** — author conformance scenarios and the + `test_conformance.py` runner for an instrumentation package. + +Please contribute back anything you learn while using the skills that could help improve them! + ## Keep PRs small One logical change per PR. Don't bundle unrelated fixes, refactors, or