The trace_evaluations resource on the Stratix client allows you to run judges against traces and retrieve scored results. This is how you assess the quality of your trace data using the evaluation criteria defined in your judges.
A trace evaluation runs a specific judge against a specific trace, producing a scored result with reasoning and step-by-step analysis. You can estimate costs before running evaluations and retrieve detailed results afterwards.
Using the synchronous client:

    from layerlens import Stratix

    client = Stratix()

    # Estimate cost before running
    estimate = client.trace_evaluations.estimate_cost(
        trace_ids=["trace-1", "trace-2"],
        judge_id="judge-123",
    )
    print(f"Estimated cost: ${estimate.estimated_cost:.4f}")

    # Run a judge on a trace
    evaluation = client.trace_evaluations.create(
        trace_id="trace-1",
        judge_id="judge-123",
    )

    # Wait for completion and get results
    result = client.trace_evaluations.wait_for_completion(evaluation.id)
    if result:
        print(f"Score: {result.score}, Passed: {result.passed}")
        print(f"Reasoning: {result.reasoning}")

Using the asynchronous client:

    import asyncio

    from layerlens import AsyncStratix

    async def main():
        client = AsyncStratix()
        evaluation = await client.trace_evaluations.create(
            trace_id="trace-1",
            judge_id="judge-123",
        )
        result = await client.trace_evaluations.wait_for_completion(evaluation.id)
        if result:
            print(f"Score: {result.score}, Passed: {result.passed}")

    if __name__ == "__main__":
        asyncio.run(main())

Both the `Stratix` (synchronous) and `AsyncStratix` (asynchronous) clients support the following methods.
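Because the asynchronous client exposes the same methods, evaluations for several traces can be awaited concurrently. The following is a minimal sketch rather than an SDK feature: `evaluate_traces` is a hypothetical helper, and `_DemoClient` is a stand-in that only mimics the two documented calls so the snippet runs without network access. With the real SDK you would presumably pass an `AsyncStratix()` instance instead.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, Dict, List, Optional


async def evaluate_traces(client: Any, trace_ids: List[str], judge_id: str) -> Dict[str, Optional[Any]]:
    """Run one judge on several traces concurrently (hypothetical helper)."""

    async def run_one(trace_id: str) -> Optional[Any]:
        evaluation = await client.trace_evaluations.create(trace_id=trace_id, judge_id=judge_id)
        if evaluation is None:  # create() is documented to return None on failure
            return None
        return await client.trace_evaluations.wait_for_completion(evaluation.id)

    # gather() preserves input order, so results line up with trace_ids
    results = await asyncio.gather(*(run_one(t) for t in trace_ids))
    return dict(zip(trace_ids, results))


class _DemoTraceEvaluations:
    """Offline stand-in for the trace_evaluations resource (assumption, not the SDK)."""

    async def create(self, trace_id: str, judge_id: str) -> Any:
        return SimpleNamespace(id=f"eval-{trace_id}")

    async def wait_for_completion(self, id: str) -> Any:
        return SimpleNamespace(score=1.0, passed=True, reasoning=f"demo result for {id}")


class _DemoClient:
    trace_evaluations = _DemoTraceEvaluations()


demo_results = asyncio.run(evaluate_traces(_DemoClient(), ["trace-1", "trace-2"], "judge-123"))
for trace_id, result in demo_results.items():
    print(trace_id, result.passed if result else "no result")
```

Note that a `TimeoutError` raised by any single `wait_for_completion` call propagates out of `asyncio.gather` in this sketch; pass `return_exceptions=True` to `gather` if you would rather collect failures per trace.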
create(trace_id, judge_id, timeout=None)
Runs a judge against a trace, creating a new trace evaluation.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `trace_id` | `str` | Yes | ID of the trace to evaluate |
| `judge_id` | `str` | Yes | ID of the judge to run |
| `timeout` | `float \| httpx.Timeout \| None` | No | Override request timeout |
Returns a TraceEvaluation object if successful, None otherwise.
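Since `create` returns `None` on failure, a batch submission has to check each return value. A minimal sketch, assuming `client` is any object exposing `trace_evaluations.create` as documented above; `submit_batch` and the `_Demo*` stand-ins are hypothetical and exist only so the snippet runs offline.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


def submit_batch(client: Any, trace_ids: List[str], judge_id: str) -> Tuple[List[Any], List[str]]:
    """Create one evaluation per trace; return (created evaluations, failed trace IDs)."""
    created: List[Any] = []
    failed: List[str] = []
    for trace_id in trace_ids:
        evaluation = client.trace_evaluations.create(trace_id=trace_id, judge_id=judge_id)
        if evaluation is None:
            failed.append(trace_id)  # keep the ID so the caller can retry or report it
        else:
            created.append(evaluation)
    return created, failed


class _DemoTraceEvaluations:
    """Offline stand-in: fails for one specific trace ID (assumption, not the SDK)."""

    def create(self, trace_id: str, judge_id: str) -> Any:
        if trace_id == "trace-missing":
            return None
        return SimpleNamespace(id=f"eval-{trace_id}", trace_id=trace_id, judge_id=judge_id, status="pending")


class _DemoClient:
    trace_evaluations = _DemoTraceEvaluations()


created, failed = submit_batch(_DemoClient(), ["trace-1", "trace-missing", "trace-2"], "judge-123")
print([e.id for e in created], failed)
```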
Example:

    evaluation = client.trace_evaluations.create(
        trace_id="trace-abc",
        judge_id="judge-xyz",
    )
    if evaluation:
        print(f"Evaluation {evaluation.id}: {evaluation.status}")

Retrieves a trace evaluation by its unique identifier.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `id` | `str` | Yes | The unique trace evaluation ID |
| `timeout` | `float \| httpx.Timeout \| None` | No | Override request timeout |
Returns a TraceEvaluation object if found, None otherwise.
get_many(page=None, page_size=None, judge_id=None, trace_id=None, outcome=None, time_range=None, search=None, sort_by=None, sort_order=None, timeout=None)
Retrieves multiple trace evaluations with filtering and pagination.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `page` | `int \| None` | No | Page number (1-based, defaults to 1) |
| `page_size` | `int \| None` | No | Number of evaluations per page (default: 20, max: 100) |
| `judge_id` | `str \| None` | No | Filter by judge |
| `trace_id` | `str \| None` | No | Filter by trace |
| `outcome` | `str \| None` | No | Filter by outcome (e.g., `"pass"`, `"fail"`) |
| `time_range` | `str \| None` | No | Filter by time range (e.g., `"7d"`, `"30d"`) |
| `search` | `str \| None` | No | Search term to filter evaluations |
| `sort_by` | `str \| None` | No | Field to sort by (e.g., `"created_at"`) |
| `sort_order` | `str \| None` | No | Sort direction: `"asc"` or `"desc"` |
| `timeout` | `float \| httpx.Timeout \| None` | No | Override request timeout |
Returns a TraceEvaluationsResponse object containing:
- `trace_evaluations`: List of `TraceEvaluation` objects
- `count`: Number of evaluations in this page
- `total`: Total number of matching evaluations
Returns None if the request fails.
Example:

    # Get the first page of evaluations (default page size)
    response = client.trace_evaluations.get_many()
    if response:
        print(f"Total: {response.total}")

    # Filter by judge and outcome, newest first
    response = client.trace_evaluations.get_many(
        judge_id="judge-123",
        outcome="pass",
        sort_by="created_at",
        sort_order="desc",
    )

get_results(id, timeout=None)
Retrieves the detailed results of a completed trace evaluation, including scores, reasoning, and step-by-step analysis.
Returns None if results are not yet available (evaluation still pending or in progress).
| Parameter | Type | Required | Description |
|---|---|---|---|
| `id` | `str` | Yes | The unique trace evaluation ID |
| `timeout` | `float \| httpx.Timeout \| None` | No | Override request timeout |
Returns a TraceEvaluationResultsResponse object with the evaluation result fields (score, passed, reasoning, etc.).
Returns None if the evaluation has not completed yet or if the request fails.
Example:

    result = client.trace_evaluations.get_results("eval-123")
    if result:
        print(f"Score: {result.score}")
        print(f"Passed: {result.passed}")
        print(f"Reasoning: {result.reasoning}")
        for step in result.steps:
            print(f"  Tool: {step.tool}, Result: {step.result}")

wait_for_completion(id, interval_seconds=3, timeout_seconds=300)
Polls the evaluation status until it reaches a terminal state (success or failure), then returns the results. This is the recommended way to wait for trace evaluation results.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `id` | `str` | Yes | | The unique trace evaluation ID |
| `interval_seconds` | `int` | No | `3` | Seconds between status polls |
| `timeout_seconds` | `int \| None` | No | `300` | Maximum wait time in seconds. `None` waits indefinitely |
Returns a TraceEvaluationResultsResponse object if the evaluation completes successfully.
Returns None if the evaluation failed or no results are available.
Raises TimeoutError if timeout_seconds is exceeded.
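The polling behaviour described above can be pictured roughly as follows. This is an illustrative sketch, not the SDK's implementation: `poll_until_terminal`, `fetch_status`, and `fetch_results` are hypothetical names, and the two callables stand in for whatever status and result lookups the client performs internally.

```python
import time
from typing import Any, Callable, Optional


def poll_until_terminal(
    fetch_status: Callable[[], str],
    fetch_results: Callable[[], Any],
    interval_seconds: float = 3,
    timeout_seconds: Optional[float] = 300,
) -> Optional[Any]:
    """Poll until the status is "success" or "failure", or raise TimeoutError."""
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status == "success":
            return fetch_results()
        if status == "failure":
            return None  # a failed evaluation has no results to return
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"evaluation still {status!r} after {timeout_seconds}s")
        time.sleep(interval_seconds)


# Demo: the status moves through the documented states before succeeding.
_statuses = iter(["pending", "in_progress", "success"])
demo_result = poll_until_terminal(
    fetch_status=lambda: next(_statuses),
    fetch_results=lambda: {"score": 0.9, "passed": True},
    interval_seconds=0,  # no real waiting in the demo
)
print(demo_result)
```

Using a monotonic clock for the deadline keeps the timeout unaffected by system clock adjustments while the loop is running.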
Example:

    evaluation = client.trace_evaluations.create(
        trace_id="trace-abc",
        judge_id="judge-xyz",
    )

    # Wait up to 5 minutes (the default) for results
    result = client.trace_evaluations.wait_for_completion(evaluation.id)
    if result:
        print(f"Score: {result.score}, Passed: {result.passed}")
        print(f"Reasoning: {result.reasoning}")

    # Custom timeout and polling interval
    result = client.trace_evaluations.wait_for_completion(
        evaluation.id,
        interval_seconds=5,
        timeout_seconds=600,
    )

estimate_cost(trace_ids, judge_id, timeout=None)
Estimates the cost of running a judge against a set of traces before executing any evaluations.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `trace_ids` | `List[str]` | Yes | List of trace IDs to evaluate |
| `judge_id` | `str` | Yes | ID of the judge to run |
| `timeout` | `float \| httpx.Timeout \| None` | No | Override request timeout |
Returns a CostEstimateResponse object if successful, None otherwise.
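One common use of the estimate is as a gate before launching a batch. The sketch below assumes a caller-defined budget in dollars; `within_budget` is a hypothetical helper, and the `_Demo*` classes are offline stand-ins with a made-up flat price per trace, not real pricing.

```python
from types import SimpleNamespace
from typing import Any, List


def within_budget(client: Any, trace_ids: List[str], judge_id: str, max_cost: float) -> bool:
    """Return True only if an estimate is available and does not exceed max_cost."""
    estimate = client.trace_evaluations.estimate_cost(trace_ids=trace_ids, judge_id=judge_id)
    if estimate is None:
        return False  # no estimate available: treat the batch as not approved
    return estimate.estimated_cost <= max_cost


class _DemoTraceEvaluations:
    """Offline stand-in charging an invented 0.002 per trace (assumption, not the SDK)."""

    def estimate_cost(self, trace_ids: List[str], judge_id: str) -> Any:
        return SimpleNamespace(estimated_cost=0.002 * len(trace_ids), trace_count=len(trace_ids))


class _DemoClient:
    trace_evaluations = _DemoTraceEvaluations()


traces = ["trace-1", "trace-2", "trace-3"]
approved = within_budget(_DemoClient(), traces, "judge-123", max_cost=0.01)
rejected = within_budget(_DemoClient(), traces, "judge-123", max_cost=0.005)
print(approved, rejected)
```

Treating a missing estimate as "not approved" is a conservative choice for this sketch; a caller could equally decide to retry or to proceed.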
Example:

    estimate = client.trace_evaluations.estimate_cost(
        trace_ids=["trace-1", "trace-2", "trace-3"],
        judge_id="judge-123",
    )
    if estimate:
        print(f"Estimated cost: ${estimate.estimated_cost:.4f}")
        print(f"Traces: {estimate.trace_count}")
        print(f"Model: {estimate.model}")

Properties of a `TraceEvaluation` object:

| Property | Type | Description |
|---|---|---|
| `id` | `str` | Unique evaluation identifier |
| `trace_id` | `str` | ID of the evaluated trace |
| `judge_id` | `str` | ID of the judge used |
| `status` | `TraceEvaluationStatus` | Current status of the evaluation |
| `judge_snapshot` | `JudgeSnapshot \| None` | Snapshot of judge config at run time |
| `created_at` | `str \| None` | ISO 8601 creation timestamp |
| `started_at` | `str \| None` | When evaluation started |
| `finished_at` | `str \| None` | When evaluation finished |
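The timestamp fields can be turned into a run duration. This sketch assumes `started_at` and `finished_at` use the same ISO 8601 string format as `created_at`, which the table above does not state explicitly; `run_duration_seconds` is a hypothetical helper and the demo object is a plain stand-in.

```python
from datetime import datetime
from types import SimpleNamespace
from typing import Any, Optional


def _parse_iso(timestamp: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string, tolerating a trailing "Z" for UTC."""
    if timestamp is None:
        return None
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 onwards
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def run_duration_seconds(evaluation: Any) -> Optional[float]:
    """Seconds between started_at and finished_at, or None if either is missing."""
    started = _parse_iso(evaluation.started_at)
    finished = _parse_iso(evaluation.finished_at)
    if started is None or finished is None:
        return None
    return (finished - started).total_seconds()


demo_evaluation = SimpleNamespace(
    started_at="2024-05-01T12:00:00Z",
    finished_at="2024-05-01T12:00:07.500000Z",
)
duration = run_duration_seconds(demo_evaluation)
print(duration)  # 7.5
```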
`TraceEvaluationStatus` takes one of the following values:

| Status | Description |
|---|---|
| `"pending"` | Evaluation queued but not started |
| `"in_progress"` | Evaluation currently running |
| `"success"` | Evaluation completed successfully |
| `"failure"` | Evaluation failed |
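When application code branches on these values, a small local enum can guard against typos in the status strings. The class below is a hypothetical convenience defined in your own code, not the SDK's `TraceEvaluationStatus` type, whose concrete definition is not shown here; it only mirrors the four strings listed above.

```python
from enum import Enum


class EvaluationStatus(str, Enum):
    """Local mirror of the documented status strings (hypothetical, not from the SDK)."""

    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    SUCCESS = "success"
    FAILURE = "failure"

    @property
    def is_terminal(self) -> bool:
        """True once the evaluation can no longer change state."""
        return self in (EvaluationStatus.SUCCESS, EvaluationStatus.FAILURE)


# Constructing from a raw string raises ValueError for anything unexpected.
running = EvaluationStatus("in_progress")
failed = EvaluationStatus("failure")
print(running.is_terminal, failed.is_terminal)
```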
Results returned by `get_results` and `wait_for_completion` expose the following properties:

| Property | Type | Description |
|---|---|---|
| `id` | `str` | Unique result identifier |
| `trace_evaluation_id` | `str` | Parent evaluation ID |
| `trace_id` | `str` | ID of the evaluated trace |
| `judge_id` | `str` | ID of the judge used |
| `score` | `float \| None` | Numerical score |
| `passed` | `bool \| None` | Whether the trace passed the evaluation |
| `reasoning` | `str \| None` | Overall reasoning for the score |
| `steps` | `List[TraceEvaluationStep]` | Step-by-step reasoning |
| `model` | `str \| None` | Model used for evaluation |
| `turns` | `int \| None` | Number of turns in evaluation |
| `latency_ms` | `int \| None` | Evaluation latency in milliseconds |
| `prompt_tokens` | `int \| None` | Number of prompt tokens used |
| `completion_tokens` | `int \| None` | Number of completion tokens used |
| `total_cost` | `float \| None` | Total cost of the evaluation |
| `created_at` | `str \| None` | ISO 8601 creation timestamp |
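Several of these fields may be `None`, so any aggregation over a batch of results needs to skip missing values. A minimal sketch of such a summary, using only the property names listed above; `summarize_results` is a hypothetical helper and the demo objects are plain stand-ins with invented numbers.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def summarize_results(results: Iterable[Any]) -> Dict[str, Any]:
    """Aggregate pass rate, mean score, and total cost, ignoring None fields."""
    available = [r for r in results if r is not None]
    scores = [r.score for r in available if r.score is not None]
    verdicts = [r.passed for r in available if r.passed is not None]
    return {
        "evaluated": len(available),
        "pass_rate": sum(verdicts) / len(verdicts) if verdicts else None,
        "mean_score": sum(scores) / len(scores) if scores else None,
        "total_cost": sum(r.total_cost or 0.0 for r in available),
    }


demo_batch = [
    SimpleNamespace(score=1.0, passed=True, total_cost=0.25),
    SimpleNamespace(score=0.5, passed=False, total_cost=0.5),
    SimpleNamespace(score=None, passed=None, total_cost=None),
    None,  # e.g. an evaluation that failed and returned no result
]
summary = summarize_results(demo_batch)
print(summary)
```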
Properties of a `CostEstimateResponse` object:

| Property | Type | Description |
|---|---|---|
| `estimated_cost` | `float` | Estimated cost in dollars |
| `input_tokens` | `int` | Estimated input tokens |
| `output_tokens` | `int` | Estimated output tokens |
| `model` | `str` | Model that would be used |
| `trace_count` | `int` | Number of traces to evaluate |
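An estimate for a small sample can be extrapolated to a larger batch as a rough planning figure. The sketch below assumes cost scales linearly with the number of traces, which is only an approximation since traces differ in size; `project_cost` is a hypothetical helper and the demo estimate uses invented numbers.

```python
from types import SimpleNamespace
from typing import Any


def project_cost(estimate: Any, planned_trace_count: int) -> float:
    """Scale a sample estimate to a planned batch size (linear assumption)."""
    if estimate.trace_count <= 0:
        raise ValueError("estimate covers no traces; cannot compute a per-trace cost")
    per_trace = estimate.estimated_cost / estimate.trace_count
    return per_trace * planned_trace_count


demo_estimate = SimpleNamespace(estimated_cost=0.5, trace_count=4)
projected = project_cost(demo_estimate, planned_trace_count=1000)
print(f"Projected cost: ${projected:.2f}")  # Projected cost: $125.00
```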