feat: implement OpenTelemetry metrics for inference operations #4441
base: main
Conversation
Add comprehensive OTEL metrics tracking for inference requests with automatic export to OTLP collectors.

Metrics implemented:
- llama_stack.inference.requests_total (counter)
- llama_stack.inference.request_duration_seconds (histogram)
- llama_stack.inference.concurrent_requests (up-down counter)
- llama_stack.inference.inference_duration_seconds (histogram)
- llama_stack.inference.time_to_first_token_seconds (histogram)

Key components:
- Create a metrics module with 5 OTEL instruments
- Integrate metrics into the chat/completions inference routers
- Add an OTLP HTTP exporter with auto-configuration via env vars
- Implement integration tests with an OTLP test collector
- Fix test infrastructure to support metrics export in server mode

All metrics include attributes for model, provider, endpoint_type, stream, and status for flexible filtering and aggregation.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
@iamemilio PTAL. I realize some of these might be auto-captured while others are not, but I'm opening this up for wider review.
# Request-level metrics
REQUESTS_TOTAL = f"{INFERENCE_PREFIX}.requests_total"
REQUEST_DURATION = f"{INFERENCE_PREFIX}.request_duration_seconds"
This one seems redundant; it should be handled by https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/#metric-gen_aiserverrequestduration
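For context, a minimal sketch of recording request duration under that semconv name instead of a custom llama_stack.* metric, assuming the OpenTelemetry Python API (the attribute values and model name here are illustrative, not from this PR):

from opentelemetry import metrics

meter = metrics.get_meter("llama_stack.inference")

# Semconv histogram gen_ai.server.request.duration, recorded in seconds.
gen_ai_request_duration = meter.create_histogram("gen_ai.server.request.duration", unit="s")
gen_ai_request_duration.record(
    0.42,
    {"gen_ai.operation.name": "chat", "gen_ai.request.model": "llama-3.1-8b"},
)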
# Request-level metrics
REQUESTS_TOTAL = f"{INFERENCE_PREFIX}.requests_total"
REQUEST_DURATION = f"{INFERENCE_PREFIX}.request_duration_seconds"
CONCURRENT_REQUESTS = f"{INFERENCE_PREFIX}.concurrent_requests"
There is a good chance the FastAPI instrumentation captures this as: https://opentelemetry.io/docs/specs/semconv/http/http-metrics/#metric-httpserveractive_requests
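For reference, enabling that auto-instrumentation is roughly the following sketch (the app object is hypothetical; this is not code from this PR):

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
# Instruments request handling; recent versions also record http.server.* metrics,
# including an active-requests up-down counter, without any custom instrument.
FastAPIInstrumentor.instrument_app(app)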
# Token-level metrics
INFERENCE_DURATION = f"{INFERENCE_PREFIX}.inference_duration_seconds"
TIME_TO_FIRST_TOKEN = f"{INFERENCE_PREFIX}.time_to_first_token_seconds"
This one should be captured by inference providers: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/#metric-gen_aiservertime_to_first_token. I don't think it's possible to capture this in llama stack since it doesn't generate tokens.
CONCURRENT_REQUESTS = f"{INFERENCE_PREFIX}.concurrent_requests"
# Token-level metrics
INFERENCE_DURATION = f"{INFERENCE_PREFIX}.inference_duration_seconds"
I think you should check the Grafana and Prometheus metrics already in llama stack. It's both client and server, so there is a chance it's already capturing this as https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/#metric-gen_aiclientoperationduration
iamemilio left a comment
I think that before moving forward with these, we need to be aware of what data is already being captured, whether llama stack is the right place to capture some of this data, and whether we can stretch the data we already have to fit this. For example, request-duration histograms can be summarized in Grafana (and other dashboarding tools) to give you a sum of requests, which is total requests.
What does this PR do?
Add comprehensive OTEL metrics tracking for inference requests with automatic export to OTLP collectors.
Metrics implemented:
- llama_stack.inference.requests_total (counter)
- llama_stack.inference.request_duration_seconds (histogram)
- llama_stack.inference.concurrent_requests (up-down counter)
- llama_stack.inference.inference_duration_seconds (histogram)
- llama_stack.inference.time_to_first_token_seconds (histogram)

Key components:
- Create a metrics module with 5 OTEL instruments
- Integrate metrics into the chat/completions inference routers
- Add an OTLP HTTP exporter with auto-configuration via env vars
- Implement integration tests with an OTLP test collector
- Fix test infrastructure to support metrics export in server mode

All metrics include attributes for model, provider, endpoint_type, stream, and status for flexible filtering and aggregation.
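For illustration, a minimal sketch of how a metrics module like the one described above might define the five instruments with the OpenTelemetry Python API; the meter name, variable names, and attribute values are assumptions, not the PR's actual code:

from opentelemetry import metrics

INFERENCE_PREFIX = "llama_stack.inference"
meter = metrics.get_meter(INFERENCE_PREFIX)

# Counter: one increment per inference request.
requests_total = meter.create_counter(f"{INFERENCE_PREFIX}.requests_total", unit="1")

# Histograms: end-to-end request latency, provider inference latency, time to first token.
request_duration = meter.create_histogram(f"{INFERENCE_PREFIX}.request_duration_seconds", unit="s")
inference_duration = meter.create_histogram(f"{INFERENCE_PREFIX}.inference_duration_seconds", unit="s")
time_to_first_token = meter.create_histogram(f"{INFERENCE_PREFIX}.time_to_first_token_seconds", unit="s")

# Up-down counter: requests currently in flight.
concurrent_requests = meter.create_up_down_counter(f"{INFERENCE_PREFIX}.concurrent_requests", unit="1")

# Example attribute set shared by all instruments (values are illustrative).
attrs = {
    "model": "llama-3.1-8b",
    "provider": "vllm",
    "endpoint_type": "chat_completion",
    "stream": False,
    "status": "success",
}
requests_total.add(1, attrs)
request_duration.record(0.42, attrs)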
Test Plan
New telemetry integration tests should pass.
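As a rough sketch of how such a test could assert on recorded metrics without a real OTLP collector, using the SDK's in-memory reader (names and assertions here are illustrative, not the PR's test code):

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader

# Wire a meter provider to an in-memory reader so recorded metrics can be inspected directly.
reader = InMemoryMetricReader()
provider = MeterProvider(metric_readers=[reader])
meter = provider.get_meter("llama_stack.inference.test")

requests_total = meter.create_counter("llama_stack.inference.requests_total")
requests_total.add(1, {"model": "test-model", "status": "success"})

# Collect and verify that at least one metric was recorded.
data = reader.get_metrics_data()
assert data.resource_metrics, "expected at least one exported metric"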