
Conversation

@cdoern (Collaborator) commented Jan 5, 2026

What does this PR do?

Add comprehensive OTEL metrics tracking for inference requests with automatic export to OTLP collectors.

Metrics implemented:

  • llama_stack.inference.requests_total (counter)
  • llama_stack.inference.request_duration_seconds (histogram)
  • llama_stack.inference.concurrent_requests (up-down counter)
  • llama_stack.inference.inference_duration_seconds (histogram)
  • llama_stack.inference.time_to_first_token_seconds (histogram)

Key components:

  • Create metrics module with 5 OTEL instruments
  • Integrate metrics into chat/completions inference routers
  • Add OTLP HTTP exporter with auto-configuration via env vars
  • Implement integration tests with OTLP test collector
  • Fix test infrastructure to support metrics export in server mode

All metrics include attributes for model, provider, endpoint_type,
stream, and status for flexible filtering and aggregation.
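
For reviewers skimming without the diff, a rough sketch of how these five instruments and the OTLP HTTP export could be wired up with the OpenTelemetry Python SDK follows; the `setup_metrics` helper and the module layout here are illustrative assumptions, not the exact code in this PR:

```python
# Illustrative sketch, not the PR's actual module: create the five instruments
# and export them over OTLP/HTTP with the OpenTelemetry Python SDK.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

INFERENCE_PREFIX = "llama_stack.inference"


def setup_metrics() -> dict:
    # OTLPMetricExporter reads OTEL_EXPORTER_OTLP_ENDPOINT /
    # OTEL_EXPORTER_OTLP_METRICS_ENDPOINT when no endpoint is passed,
    # which is what the env-var auto-configuration relies on.
    reader = PeriodicExportingMetricReader(OTLPMetricExporter())
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
    meter = metrics.get_meter(INFERENCE_PREFIX)

    return {
        "requests_total": meter.create_counter(
            f"{INFERENCE_PREFIX}.requests_total", unit="1"
        ),
        "request_duration": meter.create_histogram(
            f"{INFERENCE_PREFIX}.request_duration_seconds", unit="s"
        ),
        "concurrent_requests": meter.create_up_down_counter(
            f"{INFERENCE_PREFIX}.concurrent_requests", unit="1"
        ),
        "inference_duration": meter.create_histogram(
            f"{INFERENCE_PREFIX}.inference_duration_seconds", unit="s"
        ),
        "time_to_first_token": meter.create_histogram(
            f"{INFERENCE_PREFIX}.time_to_first_token_seconds", unit="s"
        ),
    }


# In a router, a completed request might then be recorded as:
#   instruments["requests_total"].add(
#       1,
#       attributes={"model": model_id, "provider": provider_id,
#                   "endpoint_type": "chat_completion", "stream": False,
#                   "status": "success"},
#   )
```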

Test Plan

The new telemetry integration tests should pass.

@meta-cla bot added the CLA Signed label Jan 5, 2026
@cdoern (Collaborator, Author) commented Jan 5, 2026

@iamemilio PTAL. I realize some of these might be auto-captured while others are not, but I'm opening this up for wider review.



# Request-level metrics
REQUESTS_TOTAL = f"{INFERENCE_PREFIX}.requests_total"
REQUEST_DURATION = f"{INFERENCE_PREFIX}.request_duration_seconds"
CONCURRENT_REQUESTS = f"{INFERENCE_PREFIX}.concurrent_requests"
@iamemilio (Contributor) commented Jan 5, 2026


There is a good chance the FastAPI instrumentation captures this as: https://opentelemetry.io/docs/specs/semconv/http/http-metrics/#metric-httpserveractive_requests
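
For reference, wiring that up is roughly the following. This is a sketch that assumes the stack server exposes a FastAPI `app` and that opentelemetry-instrumentation-fastapi is installed; whether the active-requests metric is actually reported depends on the instrumentation version and on a meter provider being configured:

```python
# Sketch only: assumes a FastAPI app and the opentelemetry-instrumentation-fastapi
# package. With a metrics pipeline configured, the ASGI/FastAPI instrumentation
# can report http.server.active_requests per the HTTP semantic conventions,
# which would overlap with a custom concurrent_requests up-down counter.
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)
```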


# Token-level metrics
INFERENCE_DURATION = f"{INFERENCE_PREFIX}.inference_duration_seconds"
TIME_TO_FIRST_TOKEN = f"{INFERENCE_PREFIX}.time_to_first_token_seconds"
@iamemilio (Contributor) commented Jan 5, 2026


This one should be captured by inference providers: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/#metric-gen_aiservertime_to_first_token. I don't think it's possible to capture this in llama stack since it doesn't generate tokens.
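
To make the distinction concrete: what the router can observe is the delay until the first streamed chunk comes back from the provider, which is only an approximation of provider-side time-to-first-token. A hypothetical sketch (the wrapper name and arguments are illustrative, not code from this PR):

```python
import time


async def stream_with_ttft(provider_stream, ttft_histogram, attributes):
    # Illustrative only: records time-to-first-chunk as seen by the router,
    # not true time-to-first-token, which only the inference provider knows.
    start = time.perf_counter()
    first = True
    async for chunk in provider_stream:
        if first:
            ttft_histogram.record(time.perf_counter() - start, attributes=attributes)
            first = False
        yield chunk
```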

CONCURRENT_REQUESTS = f"{INFERENCE_PREFIX}.concurrent_requests"

# Token-level metrics
INFERENCE_DURATION = f"{INFERENCE_PREFIX}.inference_duration_seconds"

I think you should check the Grafana and Prometheus metrics in llama stack. It's both a client and a server, so there is a chance this is already being captured as https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/#metric-gen_aiclientoperationduration

@iamemilio (Contributor) left a comment


I think that before moving forward with these, we need to be aware of what data is already being captured, whether llama stack is the right place to capture some of this data, and whether we can stretch the data we already have to fit this. For example, histograms of requests per second can be summarized in Grafana (and other dashboarding tools) to give you a sum of requests, which is the total request count.
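
As a starting point for that audit, here is a rough sketch of how one could list the metric names already being emitted. Assumptions: a Python process with the OpenTelemetry SDK installed; the InMemoryMetricReader is used only for inspection, not for the production OTLP pipeline:

```python
# Illustrative sketch: enumerate metric names already emitted in-process,
# to see whether names like http.server.active_requests or
# gen_ai.client.operation.duration are already covered before adding new ones.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader

reader = InMemoryMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

# ... exercise a few chat/completions requests against the stack here ...

existing = set()
data = reader.get_metrics_data()
if data is not None:
    for resource_metrics in data.resource_metrics:
        for scope_metrics in resource_metrics.scope_metrics:
            for metric in scope_metrics.metrics:
                existing.add(metric.name)
print(sorted(existing))
```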
