feat: implement OpenTelemetry metrics for inference operations #4441
base: main
Conversation
Add comprehensive OTEL metrics tracking for inference requests with automatic export to OTLP collectors.

Metrics implemented:
- llama_stack.inference.requests_total (counter)
- llama_stack.inference.request_duration_seconds (histogram)
- llama_stack.inference.concurrent_requests (up-down counter)
- llama_stack.inference.inference_duration_seconds (histogram)
- llama_stack.inference.time_to_first_token_seconds (histogram)

Key components:
- Create a metrics module with 5 OTEL instruments
- Integrate metrics into the chat/completions inference routers
- Add an OTLP HTTP exporter with auto-configuration via env vars
- Implement integration tests with an OTLP test collector
- Fix test infrastructure to support metrics export in server mode

All metrics include attributes for model, provider, endpoint_type, stream, and status for flexible filtering and aggregation.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
@iamemilio PTAL. I realize some of these might be auto-captured while others are not, but I'm opening this up for wider review.
# Request-level metrics
REQUESTS_TOTAL = f"{INFERENCE_PREFIX}.requests_total"
REQUEST_DURATION = f"{INFERENCE_PREFIX}.request_duration_seconds"
This one seems redundant; it should be handled by https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/#metric-gen_aiserverrequestduration
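For context, a minimal sketch of recording request duration under that semconv name instead of a custom llama_stack.* metric, assuming the OpenTelemetry Python API (the attribute values and model name here are illustrative, not from this PR):

from opentelemetry import metrics

meter = metrics.get_meter("llama_stack.inference")

# Semconv histogram gen_ai.server.request.duration, recorded in seconds.
gen_ai_request_duration = meter.create_histogram("gen_ai.server.request.duration", unit="s")
gen_ai_request_duration.record(
    0.42,
    {"gen_ai.operation.name": "chat", "gen_ai.request.model": "llama-3.1-8b"},
)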
# Request-level metrics
REQUESTS_TOTAL = f"{INFERENCE_PREFIX}.requests_total"
REQUEST_DURATION = f"{INFERENCE_PREFIX}.request_duration_seconds"
CONCURRENT_REQUESTS = f"{INFERENCE_PREFIX}.concurrent_requests"
There is a good chance the FastAPI instrumentation captures this as: https://opentelemetry.io/docs/specs/semconv/http/http-metrics/#metric-httpserveractive_requests
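For reference, enabling that auto-instrumentation is roughly the following sketch (the app object is hypothetical; this is not code from this PR):

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
# Instruments request handling; recent versions also record http.server.* metrics,
# including an active-requests up-down counter, without any custom instrument.
FastAPIInstrumentor.instrument_app(app)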
# Token-level metrics
INFERENCE_DURATION = f"{INFERENCE_PREFIX}.inference_duration_seconds"
TIME_TO_FIRST_TOKEN = f"{INFERENCE_PREFIX}.time_to_first_token_seconds"
This one should be captured by inference providers: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/#metric-gen_aiservertime_to_first_token. I don't think it's possible to capture this in llama stack since it doesn't generate tokens.
CONCURRENT_REQUESTS = f"{INFERENCE_PREFIX}.concurrent_requests"
# Token-level metrics
INFERENCE_DURATION = f"{INFERENCE_PREFIX}.inference_duration_seconds"
I think you should check the Grafana and Prometheus metrics already in llama stack. It's both client and server, so there is a chance it's already capturing this as https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/#metric-gen_aiclientoperationduration
iamemilio left a comment
I think that before moving forward with these, we need to be aware of what data is already being captured, whether llama stack is the right place to capture some of this data, and whether we can stretch the data we already have to fit this. For example, request-duration histograms can be summarized in Grafana (and other dashboarding tools) to give you a sum of requests, which is total requests.
What does this PR do?
Add comprehensive OTEL metrics tracking for inference requests with automatic export to OTLP collectors.
Metrics implemented:
- llama_stack.inference.requests_total (counter)
- llama_stack.inference.request_duration_seconds (histogram)
- llama_stack.inference.concurrent_requests (up-down counter)
- llama_stack.inference.inference_duration_seconds (histogram)
- llama_stack.inference.time_to_first_token_seconds (histogram)

Key components:
- Create a metrics module with 5 OTEL instruments
- Integrate metrics into the chat/completions inference routers
- Add an OTLP HTTP exporter with auto-configuration via env vars
- Implement integration tests with an OTLP test collector
- Fix test infrastructure to support metrics export in server mode

All metrics include attributes for model, provider, endpoint_type, stream, and status for flexible filtering and aggregation.
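For illustration, a minimal sketch of how a metrics module like the one described above might define the five instruments with the OpenTelemetry Python API; the meter name, variable names, and attribute values are assumptions, not the PR's actual code:

from opentelemetry import metrics

INFERENCE_PREFIX = "llama_stack.inference"
meter = metrics.get_meter(INFERENCE_PREFIX)

# Counter: one increment per inference request.
requests_total = meter.create_counter(f"{INFERENCE_PREFIX}.requests_total", unit="1")

# Histograms: end-to-end request latency, provider inference latency, time to first token.
request_duration = meter.create_histogram(f"{INFERENCE_PREFIX}.request_duration_seconds", unit="s")
inference_duration = meter.create_histogram(f"{INFERENCE_PREFIX}.inference_duration_seconds", unit="s")
time_to_first_token = meter.create_histogram(f"{INFERENCE_PREFIX}.time_to_first_token_seconds", unit="s")

# Up-down counter: requests currently in flight.
concurrent_requests = meter.create_up_down_counter(f"{INFERENCE_PREFIX}.concurrent_requests", unit="1")

# Example attribute set shared by all instruments (values are illustrative).
attrs = {
    "model": "llama-3.1-8b",
    "provider": "vllm",
    "endpoint_type": "chat_completion",
    "stream": False,
    "status": "success",
}
requests_total.add(1, attrs)
request_duration.record(0.42, attrs)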
Test Plan
New telemetry integration tests should pass.
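As a rough sketch of how such a test could assert on recorded metrics without a real OTLP collector, using the SDK's in-memory reader (names and assertions here are illustrative, not the PR's test code):

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader

# Wire a meter provider to an in-memory reader so recorded metrics can be inspected directly.
reader = InMemoryMetricReader()
provider = MeterProvider(metric_readers=[reader])
meter = provider.get_meter("llama_stack.inference.test")

requests_total = meter.create_counter("llama_stack.inference.requests_total")
requests_total.add(1, {"model": "test-model", "status": "success"})

# Collect and verify that at least one metric was recorded.
data = reader.get_metrics_data()
assert data.resource_metrics, "expected at least one exported metric"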