feat(cost): per-modality token breakdown in Python SDK (F3 Track C)#596

Open
john-weiler wants to merge 6 commits into main from feat/f3-multimodal-cost-sdk-python

Conversation

@john-weiler

@john-weiler john-weiler commented Jun 1, 2026

Contributor

User description

Summary

  • Extends LlmMetrics with num_image_input_tokens, num_audio_input_tokens, num_audio_output_tokens (optional, forward-compat safe)
  • Threads through logger.add_llm_span() → base_handler → span_params
  • LangChain handler: _extract_gemini_modality_breakdown() checks three surfaces (priority order):
    1. message.usage_metadata.input_token_details / output_token_details (langchain-google-genai ≥ 2.x with LangChain Core UsageMetadata)
    2. message.response_metadata.prompt_tokens_details / candidates_tokens_details (raw Gemini API list shape)
    3. message.response_metadata["usage_metadata"] nested dict — defensive fallback for providers that nest usage under response_metadata (added in response to review comment; covers forward-compat if upstream langchain-google-genai adds modality info there)
  • Sync + async handlers both updated; also handles a ChatGeneration or raw AIMessage interchangeably
  • galileo-adk: _extract_usage_metadata() walks native SDK prompt_tokens_details/candidates_tokens_details (handles .modality as enum or string); span_manager.end_llm() carries 3 new fields through to end_node()
  • crewai, openai_agents, otel: untouched — OpenAI-compat path drops breakdown (per RFC)
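
As a rough illustration of the three-surface priority order described above, here is a minimal sketch. It is not the PR's `_extract_gemini_modality_breakdown()`; the function names, the `"image"` key, and the assumption that Surface 3 has the same dict shape as Surface 1 are hypothetical. It also applies the guard from the later fix commit: a details dict counts as a breakdown only when an audio or image key is present.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple

# (image_in, audio_in, audio_out); all None means "no modality breakdown".
Breakdown = Tuple[Optional[int], Optional[int], Optional[int]]


def from_detail_dicts(inp: dict, out: dict) -> Optional[Breakdown]:
    """Surfaces 1 and 3 (assumed shape): {"audio": 8, "cache_read": 0}."""
    # cache_read or reasoning alone says nothing about audio/image usage.
    if not ({"audio", "image"} & (set(inp) | set(out))):
        return None
    return (inp.get("image", 0), inp.get("audio", 0), out.get("audio", 0))


def from_detail_lists(prompt: list, candidates: list) -> Optional[Breakdown]:
    """Surface 2 (assumed shape): [{"modality": "AUDIO", "token_count": 8}]."""
    if not prompt and not candidates:
        return None

    def total(entries: list, modality: str) -> int:
        return sum(e.get("token_count", 0) for e in entries
                   if e.get("modality") == modality)

    return (total(prompt, "IMAGE"), total(prompt, "AUDIO"),
            total(candidates, "AUDIO"))


def extract_modality_breakdown(message: Any) -> Breakdown:
    usage = getattr(message, "usage_metadata", None) or {}
    meta = getattr(message, "response_metadata", None) or {}
    nested = meta.get("usage_metadata") or {}
    in_priority_order = (
        from_detail_dicts(usage.get("input_token_details") or {},
                          usage.get("output_token_details") or {}),
        from_detail_lists(meta.get("prompt_tokens_details") or [],
                          meta.get("candidates_tokens_details") or []),
        from_detail_dicts(nested.get("input_token_details") or {},
                          nested.get("output_token_details") or {}),
    )
    # First surface that actually carries modality data wins.
    return next((r for r in in_priority_order if r is not None),
                (None, None, None))


# Stand-ins for AIMessage objects (only the two attributes the sketch reads).
raw_list_msg = SimpleNamespace(
    usage_metadata={"input_token_details": {"cache_read": 0}},
    response_metadata={"prompt_tokens_details": [
        {"modality": "TEXT", "token_count": 10},
        {"modality": "AUDIO", "token_count": 8},
    ]},
)
cache_only_msg = SimpleNamespace(
    usage_metadata={"input_token_details": {"cache_read": 5}},
    response_metadata={},
)
both_surfaces_msg = SimpleNamespace(
    usage_metadata={"input_token_details": {"audio": 3}},
    response_metadata={"usage_metadata": {"input_token_details": {"audio": 99}}},
)
```

In this sketch `raw_list_msg` falls through to Surface 2, `cache_only_msg` yields no breakdown at all, and `both_surfaces_msg` shows Surface 1 taking precedence over Surface 3.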

Known limitation — langchain-google-genai 4.2.4

Discovered during local E2E testing: langchain-google-genai 4.2.4 (latest) does not surface Gemini's per-modality breakdown on the AIMessage. The wrapper at chat_models.py:1263 hardcodes:

input_token_details={"cache_read": cache_read_tokens}

…reading prompt_token_count and cached_content_token_count from the Gemini response but dropping prompt_tokens_details (which contains the modality breakdown). Verified end-to-end:

Path | prompt_tokens_details returned to consumer
--- | ---
google-genai SDK directly | [(TEXT, 10), (AUDIO, 8)]
langchain-google-genai → AIMessage.usage_metadata.input_token_details | {"cache_read": 0} (modality stripped)
langchain-google-genai → AIMessage.response_metadata | only finish_reason, model_name, safety_ratings, model_provider

Until langchain-google-genai is patched upstream, the LangChain Gemini path will not produce per-modality cost data. The extractor in this PR is correct and will start producing data automatically once the upstream surfaces it. Per-modality cost continues to work for:

  • galileo-adk (uses google-genai SDK directly — verified breakdown reaches the extractor)
  • Direct logger.add_llm_span(audio_input_tokens=…, …) calls
  • Vertex AI judge spans (covered in runners PR #2480)

Path forward: file an upstream PR against langchain-google-genai to thread prompt_tokens_details through input_token_details. Estimated ~10 lines.
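
To make the shape of that upstream change concrete, here is a hypothetical sketch of threading the modality list into `input_token_details`. This is not the actual langchain-google-genai code: the function name, the `(modality, count)` tuple input, the lowercase keys, and the choice to skip TEXT are all assumptions for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def build_input_token_details(
    cache_read_tokens: int,
    prompt_tokens_details: Optional[Iterable[Tuple[str, int]]],
) -> Dict[str, int]:
    """Merge per-modality prompt counts into the details dict (hypothetical)."""
    details: Dict[str, int] = {"cache_read": cache_read_tokens}
    for modality, count in prompt_tokens_details or []:
        key = str(modality).lower()
        if key == "text":
            # Assumption: text stays in the flat total and gets no detail key.
            continue
        details[key] = details.get(key, 0) + count
    return details


patched = build_input_token_details(0, [("TEXT", 10), ("AUDIO", 8)])
unchanged = build_input_token_details(0, None)
```

With the values from the table above, `patched` carries an `"audio"` entry alongside `"cache_read"`, while `unchanged` matches what the wrapper emits today.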

Test plan

  • 4 existing TestParseLlmResult cases (Surface 1, Surface 2, text-only, precedence)
  • 2 new TestParseLlmResult cases (Surface 3 nested-usage, Surface 1 wins over Surface 3)
  • 6 new TestExtractUsageMetadata ADK cases covering audio+image, text-only, empty candidates, enum-vs-string modality
  • All existing tests pass
  • Local E2E validated end-to-end via google-genai direct + logger.add_llm_span(audio_input_tokens=…)

Depends on runners feat/f3-multimodal-cost-foundation (orbit) + runners PR #2480.

🤖 Generated with Claude Code


Generated description

Below is a concise technical summary of the changes proposed in this PR:
Propagate per-modality token counts through GalileoObserver/TraceBuilder, the LangChain result parser, and span handling so LlmMetrics and logged traces receive Gemini-native image/audio breakdowns. Convert OpenAI multimodal content parts into LoggedMessage blocks to keep image/audio data and tool linkage intact in downstream ingestion.

Topic | Details
Other | Other files
Modified files (1)
  • galileo-adk/tests/test_observer.py
Latest Contributors (2)
User | Commit | Date
jaweiler@splunk.com | test: add coverage for... | June 23, 2026
jweiler@galileo.ai | fixes | June 08, 2026
OpenAI multimodal | Convert OpenAI list-based multimodal content into TextContentBlock/DataContentBlock sequences and emit LoggedMessage instances so the ingestion layer shows the actual image/audio parts while preserving tool-call metadata; cover tests/test_openai_extractors.py to guard the new parsing behavior.
Modified files (2)
  • src/galileo/openai/extractors.py
  • tests/test_openai_extractors.py
Latest Contributors (2)
User | Commit | Date
jaweiler@splunk.com | test: add coverage for... | June 23, 2026
jweiler@galileo.ai | fixes | June 08, 2026
Modal metrics flow | Connect the per-modality token fields from GalileoObserver through TraceBuilder, SpanManager, GalileoLogger, GalileoBaseHandler, and the LangChain callbacks so LlmMetrics and end-node spans reflect Gemini image/audio usage while the LangChain parser (LLMEndResult) sources those counts from the three surfaces of its Gemini messages and returns deterministic zeros when detail lists are present; cover tests/test_observer.py and tests/test_langchain.py to validate the new modality extraction logic.
Modified files (10)
  • galileo-adk/src/galileo_adk/observer.py
  • galileo-adk/src/galileo_adk/span_manager.py
  • galileo-adk/src/galileo_adk/trace_builder.py
  • src/galileo/handlers/base_handler.py
  • src/galileo/handlers/langchain/async_handler.py
  • src/galileo/handlers/langchain/handler.py
  • src/galileo/handlers/langchain/utils.py
  • src/galileo/logger/logger.py
  • src/galileo/resources/models/llm_metrics.py
  • tests/test_langchain.py
Latest Contributors (2)
User | Commit | Date
jaweiler@splunk.com | feat: add image_output... | June 23, 2026
jweiler@galileo.ai | feat(cost): per-modali... | June 01, 2026

Extends LlmMetrics with optional num_image_input_tokens / num_audio_input_tokens /
num_audio_output_tokens fields and threads them through the full ingestion path:
LlmMetrics model → logger.add_llm_span() → base_handler → span_params.

Extraction is implemented in two integrations:

LangChain (primary path):
- _extract_gemini_modality_breakdown() checks two surfaces:
  1. message.usage_metadata.input_token_details / output_token_details
     (langchain-google-genai >= 2.x with LangChain Core UsageMetadata)
  2. message.response_metadata.prompt_tokens_details / candidates_tokens_details
     (raw Gemini API ModalityTokenCount list forwarded by the adapter)
- Both sync and async handlers updated.

galileo-adk (native Gemini SDK path):
- _extract_usage_metadata() extended to walk prompt_tokens_details /
  candidates_tokens_details (ModalityTokenCount list with .modality enum
  or string) and bucket into image/audio counts.
- span_manager.end_llm() carries the 3 new fields through to end_node().

All other integrations (crewai, openai_agents, otel) unchanged — they go
through the OpenAI-compat path which drops modality breakdown (per RFC).

Tests: 4 new TestParseLlmResult cases + 6 new TestExtractUsageMetadata
ADK cases; all existing tests pass.

Part of F3 multimodal token/cost support.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 1, 2026


Codecov Report

❌ Patch coverage is 92.54658% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.99%. Comparing base (f9feed6) to head (697487c).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/galileo/handlers/langchain/utils.py 93.97% 5 Missing ⚠️
src/galileo/openai/extractors.py 87.17% 5 Missing ⚠️
galileo-adk/src/galileo_adk/observer.py 94.87% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #596      +/-   ##
==========================================
+ Coverage   83.31%   83.99%   +0.67%     
==========================================
  Files         125      135      +10     
  Lines       10659    11942    +1283     
==========================================
+ Hits         8881    10031    +1150     
- Misses       1778     1911     +133     
Flag Coverage Δ
galileo-adk 88.98% <94.87%> (?)


Comment on lines 522 to +533

# Per-modality breakdown — only present on native Gemini SDK responses.
prompt_details = getattr(usage, "prompt_tokens_details", None)
candidates_details = getattr(usage, "candidates_tokens_details", None)
if prompt_details or candidates_details:
    image_in = 0
    audio_in = 0
    audio_out = 0
    has_prompt = bool(prompt_details)
    has_candidates = bool(candidates_details)
    for entry in prompt_details or []:
        modality_attr = getattr(entry, "modality", None)

Contributor

_extract_usage_metadata duplicates the Gemini modality breakdown logic in _extract_gemini_modality_breakdown (LangChain utils). Should we extract a shared helper so both paths use the same logic?
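
One possible shape for such a shared helper, sketched under assumptions: the `Modality` enum below is a stand-in for the SDK's enum, the entries are assumed to expose `.modality` and `.token_count`, and the helper names are hypothetical. The point is only that enum and string modalities normalize to the same bucket before summing.

```python
from enum import Enum
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Optional


class Modality(Enum):  # hypothetical stand-in for the SDK's modality enum
    TEXT = "TEXT"
    IMAGE = "IMAGE"
    AUDIO = "AUDIO"


def modality_name(entry: Any) -> str:
    """Return an upper-case modality name for enum or string values."""
    raw = getattr(entry, "modality", None)
    if raw is None:
        return ""
    if isinstance(raw, Enum):
        return raw.name.upper()
    # Strings may arrive as "AUDIO", "audio", or "Modality.AUDIO".
    return str(raw).rsplit(".", 1)[-1].upper()


def sum_by_modality(details: Optional[Iterable[Any]]) -> Dict[str, int]:
    """Total token counts per modality; an absent list gives an empty dict."""
    totals: Dict[str, int] = {}
    for entry in details or []:
        name = modality_name(entry)
        if not name:
            continue
        count = getattr(entry, "token_count", None) or 0
        totals[name] = totals.get(name, 0) + count
    return totals


mixed_entries = [
    SimpleNamespace(modality=Modality.AUDIO, token_count=8),
    SimpleNamespace(modality="audio", token_count=2),
    SimpleNamespace(modality="TEXT", token_count=10),
]
totals = sum_by_modality(mixed_entries)
```

Both extractors could then read `totals.get("IMAGE")` and `totals.get("AUDIO")` and decide separately how to represent a missing bucket.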


Comment thread tests/test_langchain.py
Comment thread src/galileo/handlers/langchain/utils.py
Comment thread src/galileo/handlers/langchain/utils.py Outdated
Comment on lines +30 to +34
# Per-modality breakdown — None means "counted as text in the flat totals".
# Only populated for providers that return modality-level token counts (e.g. Gemini native).
num_image_input_tokens: None | Unset | int = UNSET
num_audio_input_tokens: None | Unset | int = UNSET
num_audio_output_tokens: None | Unset | int = UNSET
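
The three-state typing above (`UNSET`, `None`, or an int) can be illustrated with a small stand-alone sketch. The `Unset` class and `MetricsSketch` dataclass here are hypothetical stand-ins, not the generated `LlmMetrics`; the sketch assumes the usual generated-client behavior in which unset fields are omitted from the serialized dict while an explicit `None` is kept.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Union


class Unset:
    """Sentinel type meaning 'the caller never supplied this field'."""

    def __bool__(self) -> bool:
        return False

    def __repr__(self) -> str:
        return "UNSET"


UNSET = Unset()
OptionalCount = Union[None, Unset, int]


@dataclass
class MetricsSketch:  # hypothetical stand-in, not the real LlmMetrics
    num_input_tokens: OptionalCount = UNSET
    num_image_input_tokens: OptionalCount = UNSET
    num_audio_input_tokens: OptionalCount = UNSET
    num_audio_output_tokens: OptionalCount = UNSET

    def to_dict(self) -> Dict[str, Any]:
        # Omit fields left at UNSET; keep explicit None and ints.
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if not isinstance(getattr(self, f.name), Unset)
        }


audio_only = MetricsSketch(num_input_tokens=18, num_audio_input_tokens=8)
explicit_none = MetricsSketch(num_image_input_tokens=None)
```

Here `audio_only.to_dict()` contains only the two supplied keys, so older consumers never see the new fields unless a provider actually reports them.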

Contributor

src/galileo/resources/ is regenerated from openapi.yaml, but LlmMetrics in the spec doesn't include the new modality fields, so the next regen will drop them and break src/galileo/logger/logger.py:1247-1255; should we move this into the OpenAPI source/generator input instead of patching the generated file?


Prompt for AI Agents
Before applying, verify this suggestion against the current code. In
src/galileo/resources/models/llm_metrics.py around lines 30-34 and in the additions to
`LlmMetrics.to_dict()` and `LlmMetrics.from_dict()` (new fields
`num_image_input_tokens`, `num_audio_input_tokens`, `num_audio_output_tokens`), don’t
keep these changes in the generated client boundary. Instead, update the OpenAPI source
schema that generates `LlmMetrics` (likely openapi.yaml) to add these three properties
(type int, nullable/optional using the same conventions as the other token fields) so
regeneration won’t delete them. After updating the OpenAPI schema, remove the manual
edits from the generated llm_metrics.py and regenerate via
scripts/auto-generate-api-client.sh, then run the failing logger tests
(src/galileo/logger/logger.py around 1247-1255) to ensure there are no unexpected-kwarg
errors.

Contributor Author

We will regenerate once API changes are in ...


One clarification on the premise: regeneration won't actually break logger.py:1247-1255. That code constructs galileo_core.schemas.logging.span.LlmMetrics (a pydantic model with model_config extra='allow'), not the edited attrs class in src/galileo/resources/models/llm_metrics.py. I verified the modality kwargs are accepted and serialized via model_dump() purely through extra='allow' — so the write/ingest path does not depend on this generated-file edit at all. The edit only affects the read-side models (LlmSpan, extended_llm_span_record, partial_extended_llm_span_record) that call LlmMetrics.from_dict(). So the regen concern is real for the read path, but the generated-file change is functionally unnecessary for emitting the new fields.

Comment thread galileo-adk/tests/test_observer.py
Comment thread src/galileo/openai/extractors.py
Comment thread src/galileo/openai/extractors.py Outdated
@fercor-cisco

Contributor

/astra review

@galileo-astra galileo-astra Bot left a comment


⚠️ This review was generated by an AI agent (Astra) and may contain mistakes. Please verify all suggestions independently.

Verdict: request_changes — The core per-modality token threading is sound, but the bundled OpenAI multimodal-message path introduces a tool-linkage regression and ships with effectively no test coverage.

General Comments

  • 🟡 minor (design): The new multimodal path makes convert_to_galileo_message return a LoggedMessage whose content is a list of TextContentBlock/DataContentBlock. These messages are then placed into LlmSpan.input/output, whose declared type (galileo_core Message.content) is str | list[ContentPart]. When the span is serialized via model_dump(), pydantic emits a PydanticSerializationUnexpectedValue warning ("Expected str") for every multimodal message. I verified the block data does survive serialization, so this is not data loss, but it means a UserWarning is logged on every multimodal OpenAI call — noisy for a telemetry path that is meant to observe without interfering. Consider whether galileo_core needs a content-block-aware span type, or suppress/handle the warning at the boundary.
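
For readers unfamiliar with the path under discussion, the following is a simplified sketch of converting OpenAI list-style content parts into blocks while keeping tool linkage. It is not the code in src/galileo/openai/extractors.py: `TextBlock`, `DataBlock`, `parts_to_blocks`, and `convert_message` are hypothetical names, and the fallback for unrecognised part types (stringify rather than drop) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union


@dataclass
class TextBlock:  # hypothetical stand-in for TextContentBlock
    text: str


@dataclass
class DataBlock:  # hypothetical stand-in for DataContentBlock
    modality: str
    url: Optional[str] = None
    base64: Optional[str] = None


def parts_to_blocks(parts: List[Dict[str, Any]]) -> List[Union[TextBlock, DataBlock]]:
    blocks: List[Union[TextBlock, DataBlock]] = []
    for part in parts:
        kind = part.get("type")
        if kind == "text":
            blocks.append(TextBlock(part.get("text", "")))
        elif kind == "image_url":
            url = (part.get("image_url") or {}).get("url", "")
            if url.startswith("data:"):
                # Data URI: keep only the payload after the first comma.
                blocks.append(DataBlock("image", base64=url.split(",", 1)[-1]))
            else:
                blocks.append(DataBlock("image", url=url))
        elif kind == "input_audio":
            data = (part.get("input_audio") or {}).get("data")
            blocks.append(DataBlock("audio", base64=data))
        else:
            # Assumption: unrecognised parts are kept as text, not dropped.
            blocks.append(TextBlock(str(part)))
    return blocks


def convert_message(msg: Dict[str, Any]) -> Dict[str, Any]:
    content = msg.get("content")
    if isinstance(content, list):
        content = parts_to_blocks(content)
    # Tool linkage is forwarded regardless of the content shape.
    return {
        "role": msg.get("role"),
        "content": content,
        "tool_calls": msg.get("tool_calls"),
        "tool_call_id": msg.get("tool_call_id"),
    }


tool_msg = convert_message({
    "role": "tool",
    "tool_call_id": "call_1",
    "content": [
        {"type": "text", "text": "done"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    ],
})
```

The `tool_msg` example is the case the verdict calls out: a tool-role message with array content, where `tool_call_id` must survive the conversion.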

Follow-ups

Suggested follow-up work that could be tracked as Shortcut stories:

  • galileo-adk/src/galileo_adk/observer.py:503-557: _extract_usage_metadata (ADK) and _extract_gemini_modality_breakdown (LangChain utils) implement the same Gemini per-modality walk with subtly different null-vs-zero semantics: ADK only sets image_input_tokens/audio_input_tokens when prompt_tokens_details is present and audio_output_tokens when candidates_tokens_details is present (otherwise the key is absent → None downstream), while the LangChain path always returns a 0-filled 3-tuple once any detail data exists. Consider extracting a shared helper so both paths agree on the null/zero contract (matches reviewer comment r3337251851).
  • src/galileo/openai/extractors.py:221-267: _openai_content_parts_to_blocks re-implements type→content-block conversion that already exists in galileo/utils/serialization._convert_langchain_content_block. Consider reusing/sharing that helper to avoid drift (matches reviewer comment r3373853343).
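
The null-versus-zero difference in the first follow-up is easier to see side by side. The sketch below paraphrases the two contracts as described in that bullet; it is not either extractor's real code, the function names are hypothetical, and the `(modality, count)` tuple input is a simplification.

```python
from typing import Dict, List, Optional, Tuple

Details = Optional[List[Tuple[str, int]]]


def _sum(details: List[Tuple[str, int]], modality: str) -> int:
    return sum(count for name, count in details if name == modality)


def per_side_contract(prompt: Details, candidates: Details) -> Dict[str, int]:
    """ADK-style as described: a key exists only if its side reported details."""
    result: Dict[str, int] = {}
    if prompt:
        result["image_input_tokens"] = _sum(prompt, "IMAGE")
        result["audio_input_tokens"] = _sum(prompt, "AUDIO")
    if candidates:
        result["audio_output_tokens"] = _sum(candidates, "AUDIO")
    return result


def zero_filled_contract(prompt: Details, candidates: Details):
    """LangChain-style as described: zeros everywhere once any detail exists."""
    if not prompt and not candidates:
        return (None, None, None)
    return (
        _sum(prompt or [], "IMAGE"),
        _sum(prompt or [], "AUDIO"),
        _sum(candidates or [], "AUDIO"),
    )


prompt_only = [("TEXT", 10), ("AUDIO", 8)]
adk_style = per_side_contract(prompt_only, None)
langchain_style = zero_filled_contract(prompt_only, None)
```

For the same prompt-only response, `adk_style` has no `audio_output_tokens` key (so it becomes None downstream), while `langchain_style` reports an explicit 0 in that position. A shared helper would need to pick one of these contracts.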

Comment thread src/galileo/openai/extractors.py Outdated
Comment thread src/galileo/handlers/langchain/utils.py
Comment thread src/galileo/openai/extractors.py

@fercor-cisco fercor-cisco left a comment

Contributor

Please check the Astra feedback and test failures.

Comment thread galileo-adk/src/galileo_adk/observer.py
@john-weiler

Contributor Author

john-weiler and others added 4 commits June 23, 2026 10:59
…linkage, provider scope

- galileo-adk/trace_builder.py: add image_input_tokens/audio_input_tokens/audio_output_tokens
  to TraceBuilder.add_llm_span() so the ADK path doesn't raise TypeError when GalileoBaseHandler
  calls through with these new kwargs; thread them into LlmMetrics
- src/galileo/handlers/langchain/utils.py: fix Surface 1 has_detail_data guard to only activate
  when audio/image keys are actually present — previously any non-empty input_token_details
  (e.g. Anthropic cache_read, OpenAI reasoning) incorrectly returned (0,0,0) instead of
  (None,None,None) for providers that have no modality breakdown
- src/galileo/openai/extractors.py: forward tool_calls and tool_call_id into LoggedMessage
  when content is a multimodal list; previously tool-call linkage was silently dropped for
  tool-role messages with array content

Co-Authored-By: Claude <noreply@anthropic.com>
Mirrors the existing image_input_tokens/audio_* fields — threads
image_output_tokens from LlmMetrics through LlmMetrics.to_dict/from_dict,
GalileoLogger.add_llm_span/log_single_llm_call, GalileoBaseHandler.end_node,
both LangChain handlers, _extract_gemini_modality_breakdown (Surface 1/2/3 output
paths + candidates_tokens_details IMAGE loop), LLMEndResult, SpanManager.end_llm,
TraceBuilder.add_llm_span, and GalileoObserver._extract_usage_metadata /
on_llm_end so the ADK and LangChain paths can report image generation output tokens.

Co-Authored-By: Claude <noreply@anthropic.com>
…modal tool linkage

- test_langchain.py: 3 new TestParseLlmResult cases — image_output_tokens from
  surface 1 output_token_details, image_output_tokens from surface 2
  candidates_tokens_details, and non-Gemini provider with only cache_read in
  input_token_details returns None not zero for all modality fields
- test_observer.py: image_output_tokens from candidates_tokens_details IMAGE entry
- test_openai_extractors.py: full coverage of _openai_content_parts_to_blocks
  (text, image data URI, image plain URL, input_audio, unrecognised type, mixed)
  and convert_to_galileo_message tool linkage (tool_call_id and tool_calls both
  preserved when content is a list of parts)

Co-Authored-By: Claude <noreply@anthropic.com>
…rces/

Adds num_image_input_tokens, num_audio_input_tokens, num_audio_output_tokens,
num_image_output_tokens to LlmMetrics in openapi.yaml so the generated
src/galileo/resources/models/llm_metrics.py includes them natively — the
previously hand-edited fields will now survive re-generation instead of being
dropped on the next regen cycle.

Co-Authored-By: Claude <noreply@anthropic.com>

3 participants