4 changes: 2 additions & 2 deletions docs/README.md
@@ -163,8 +163,8 @@ trace_eval = client.trace_evaluations.create(
     judge_id=judge.id,
 )

-# Get results
-results = client.trace_evaluations.get_results(trace_eval.id)
+# Wait for completion and get results
+result = client.trace_evaluations.wait_for_completion(trace_eval.id)
 ```

 ### Custom models
30 changes: 24 additions & 6 deletions docs/api-reference/client.md
@@ -33,13 +33,14 @@ client = AsyncStratix(api_key="your_api_key")

 ## Constructor Parameters

-### `Stratix(api_key, base_url, timeout)` and `AsyncStratix(api_key, base_url, timeout)`
+### `Stratix(api_key, base_url, timeout, max_retries)` and `AsyncStratix(api_key, base_url, timeout, max_retries)`

-| Parameter | Type | Required | Default | Description |
-| ---------- | -------------------------------- | -------- | ------------- | ----------------------------- |
-| `api_key` | `str \| None` | Yes\* | `None` | Your LayerLens Stratix API key |
-| `base_url` | `str \| httpx.URL \| None` | No | Stratix API URL | Custom API base URL |
-| `timeout` | `float \| httpx.Timeout \| None` | No | 10 minutes | Request timeout configuration |
+| Parameter | Type | Required | Default | Description |
+| ------------- | -------------------------------- | -------- | ------------- | ----------------------------- |
+| `api_key` | `str \| None` | Yes\* | `None` | Your LayerLens Stratix API key |
+| `base_url` | `str \| httpx.URL \| None` | No | Stratix API URL | Custom API base URL |
+| `timeout` | `float \| httpx.Timeout \| None` | No | 10 minutes | Request timeout configuration |
+| `max_retries` | `int` | No | `2` | Maximum number of retries on retryable errors (429, 500, 502, 503, 504) |

 \*Required unless set via environment variables

@@ -81,6 +82,23 @@ from layerlens import Stratix
 client = Stratix(timeout=30.0)
 ```

+### Retry Configuration
+
+The client automatically retries requests that fail with retryable status codes (429 Too Many Requests, 500, 502, 503, 504) using exponential backoff. If the server sends a `Retry-After` header, the client respects it.
+
+```python
+from layerlens import Stratix
+
+# Default: 2 retries
+client = Stratix()
+
+# More retries for batch-heavy workloads
+client = Stratix(max_retries=5)
+
+# Disable retries entirely
+client = Stratix(max_retries=0)
+```
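
For readers unfamiliar with the retry behavior described in this hunk, here is a rough, illustrative sketch of a retry loop with exponential backoff that honors a server-supplied `Retry-After` value. Only the status-code list, the default of 2 retries, and the `Retry-After` behavior come from the diff. The helper names (`backoff_delay`, `send_with_retries`), the base delay, the cap, and the jitter are assumptions made for the example and are not taken from the SDK.

```python
import random
import time
from typing import Callable, Optional, Tuple

# Status codes the docs list as retryable.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 0.5,   # assumed base delay in seconds
    cap: float = 8.0,    # assumed upper bound in seconds
) -> float:
    """Return the wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # A server-provided Retry-After value takes precedence.
        return max(0.0, retry_after)
    delay = min(cap, base * (2 ** attempt))
    # Assumed jitter: scale the delay by a random factor in [0.5, 1.0].
    return delay * random.uniform(0.5, 1.0)


def send_with_retries(
    send: Callable[[], Tuple[int, Optional[float]]],
    max_retries: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call `send` until it returns a non-retryable status or retries run out.

    `send` is a hypothetical callable returning (status_code, retry_after_seconds).
    Returns (final_status, attempts_made).
    """
    attempts = 0
    while True:
        status, retry_after = send()
        attempts += 1
        if status not in RETRYABLE_STATUS or attempts > max_retries:
            return status, attempts
        sleep(backoff_delay(attempts - 1, retry_after))
```

With `max_retries=2` the request is attempted at most three times in total, and `max_retries=0` means a single attempt with no retry, which matches the "disable retries entirely" case in the example.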

 ### Per-Request Timeout Override

 ```python
2 changes: 1 addition & 1 deletion docs/api-reference/judges.md
@@ -67,7 +67,7 @@ Creates a new judge with the specified evaluation criteria.
 | ----------------- | -------------------------------- | -------- | -------------------------------------------- |
 | `name` | `str` | Yes | Display name for the judge |
 | `evaluation_goal` | `str` | Yes | Description of what the judge should evaluate |
-| `model_id` | `str \| None` | Yes* | ID of the LLM model to use (required by API)|
+| `model_id` | `str \| None` | No | ID of the LLM model to use. If omitted, the server uses a default model |
 | `timeout` | `float \| httpx.Timeout \| None` | No | Override request timeout |

 #### Returns
73 changes: 57 additions & 16 deletions docs/api-reference/trace-evaluations.md
@@ -26,9 +26,9 @@ evaluation = client.trace_evaluations.create(
     judge_id="judge-123",
 )

-# Get results
-results = client.trace_evaluations.get_results(evaluation.id)
-for result in results.results:
+# Wait for completion and get results
+result = client.trace_evaluations.wait_for_completion(evaluation.id)
+if result:
     print(f"Score: {result.score}, Passed: {result.passed}")
     print(f"Reasoning: {result.reasoning}")
 ```
@@ -47,8 +47,8 @@ async def main():
         judge_id="judge-123",
     )

-    results = await client.trace_evaluations.get_results(evaluation.id)
-    for result in results.results:
+    result = await client.trace_evaluations.wait_for_completion(evaluation.id)
+    if result:
         print(f"Score: {result.score}, Passed: {result.passed}")

 if __name__ == "__main__":
@@ -149,6 +149,8 @@ response = client.trace_evaluations.get_many(

 Retrieves the detailed results of a completed trace evaluation, including scores, reasoning, and step-by-step analysis.

+Returns `None` if results are not yet available (evaluation still pending or in progress).
+
 #### Parameters

 | Parameter | Type | Required | Description |
@@ -158,23 +160,62 @@ Retrieves the detailed results of a completed trace evaluation, including scores

 #### Returns

-Returns a `TraceEvaluationResultsResponse` object containing:
+Returns a `TraceEvaluationResultsResponse` object with the evaluation result fields (score, passed, reasoning, etc.).

-- `results`: List of `TraceEvaluationResult` objects
+Returns `None` if the evaluation has not completed yet or if the request fails.

-Returns `None` if the request fails.
+#### Example
+
+```python
+result = client.trace_evaluations.get_results("eval-123")
+if result:
+    print(f"Score: {result.score}")
+    print(f"Passed: {result.passed}")
+    print(f"Reasoning: {result.reasoning}")
+    for step in result.steps:
+        print(f" Tool: {step.tool}, Result: {step.result}")
+```
+
+### `wait_for_completion(id, interval_seconds=3, timeout_seconds=300)`
+
+Polls the evaluation status until it reaches a terminal state (success or failure), then returns the results. This is the recommended way to wait for trace evaluation results.
+
+#### Parameters
+
+| Parameter | Type | Required | Default | Description |
+| ------------------ | -------------- | -------- | ------- | ------------------------------------------------ |
+| `id` | `str` | Yes | | The unique trace evaluation ID |
+| `interval_seconds` | `int` | No | `3` | Seconds between status polls |
+| `timeout_seconds` | `int \| None` | No | `300` | Maximum wait time. `None` waits indefinitely |
+
+#### Returns
+
+Returns a `TraceEvaluationResultsResponse` object if the evaluation completes successfully.
+
+Returns `None` if the evaluation failed or no results are available.
+
+Raises `TimeoutError` if `timeout_seconds` is exceeded.

 #### Example

 ```python
-results_response = client.trace_evaluations.get_results("eval-123")
-if results_response:
-    for result in results_response.results:
-        print(f"Score: {result.score}")
-        print(f"Passed: {result.passed}")
-        print(f"Reasoning: {result.reasoning}")
-        for step in result.steps:
-            print(f" Step {step.step}: {step.reasoning}")
+evaluation = client.trace_evaluations.create(
+    trace_id="trace-abc",
+    judge_id="judge-xyz",
+)
+
+# Wait up to 5 minutes for results
+result = client.trace_evaluations.wait_for_completion(evaluation.id)
+if result:
+    print(f"Score: {result.score}, Passed: {result.passed}")
+    print(f"Reasoning: {result.reasoning}")
+
+# Custom timeout and polling interval
+result = client.trace_evaluations.wait_for_completion(
+    evaluation.id,
+    interval_seconds=5,
+    timeout_seconds=600,
+)
 ```
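
To make the documented contract concrete (poll until a terminal state, return results on success, `None` on failure, raise `TimeoutError` when the deadline passes), here is a minimal standalone sketch of such a polling helper. It is not the SDK's implementation: `poll_until_terminal`, its `get_status` and `get_results` callables, and the injectable `sleep` and `clock` parameters are hypothetical names introduced for illustration. The `"success"` and `"failure"` status strings are the ones used by the manual polling loops this PR removes.

```python
import time
from typing import Any, Callable, Optional


def poll_until_terminal(
    get_status: Callable[[], str],
    get_results: Callable[[], Any],
    interval_seconds: float = 3,
    timeout_seconds: Optional[float] = 300,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Any]:
    """Poll `get_status` until it reports "success" or "failure".

    Returns `get_results()` on success and None on failure.
    Raises TimeoutError once `timeout_seconds` has elapsed;
    `timeout_seconds=None` waits indefinitely.
    """
    deadline = None if timeout_seconds is None else clock() + timeout_seconds
    while True:
        status = get_status()
        if status == "success":
            return get_results()
        if status == "failure":
            return None
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(
                f"evaluation not finished after {timeout_seconds} seconds"
            )
        sleep(interval_seconds)
```

Callers that would rather degrade gracefully than propagate the exception can wrap the call in `try`/`except TimeoutError` and fall back to checking again later.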

 ### `estimate_cost(trace_ids, judge_id, timeout=None)`
63 changes: 22 additions & 41 deletions docs/examples/judges-and-traces.md
@@ -103,14 +103,10 @@ from layerlens import Stratix

 client = Stratix()

-# Fetch a model and create a judge
-models = client.models.get(type="public", name="gpt-4o")
-model = models[0]
-
+# Create a judge (no model_id → server uses default model)
 judge = client.judges.create(
     name=f"Trace Eval Demo Judge {int(time.time())}",
     evaluation_goal="Evaluate whether the response is accurate, complete, and well-structured",
-    model_id=model.id,
 )
 print(f"Created judge {judge.id}: {judge.name}")

@@ -133,28 +129,16 @@ evaluation = client.trace_evaluations.create(
 )
 print(f"Created evaluation {evaluation.id}, status: {evaluation.status}")

-# --- Wait for evaluation to complete
-for _ in range(30):
-    evaluation = client.trace_evaluations.get(evaluation.id)
-    print(f"Evaluation status: {evaluation.status}")
-    if evaluation.status.value in ("success", "failure"):
-        break
-    time.sleep(2)
-
-# --- Get evaluation results
-try:
-    results_response = client.trace_evaluations.get_results(evaluation.id)
-    if results_response and results_response.results:
-        for result in results_response.results:
-            print(f" Score: {result.score}, Passed: {result.passed}")
-            print(f" Reasoning: {result.reasoning}")
-            if result.steps:
-                for step in result.steps:
-                    print(f" Step {step.step}: {step.reasoning}")
-    else:
-        print(" No results returned")
-except Exception:
-    print(" No results yet (evaluation may still be in progress)")
+# --- Wait for completion and get results
+result = client.trace_evaluations.wait_for_completion(evaluation.id)
+if result:
+    print(f" Score: {result.score}, Passed: {result.passed}")
+    print(f" Reasoning: {result.reasoning}")
+    if result.steps:
+        for step in result.steps:
+            print(f" Tool: {step.tool}, Result: {step.result[:80]}")
+else:
+    print(" No results returned (evaluation may have failed)")

 # --- List all trace evaluations
 response = client.trace_evaluations.get_many()
@@ -287,20 +271,17 @@ async def main():
         if evaluation:
             print(f" Evaluation {evaluation.id}: {evaluation.status}")

-    # --- Wait and fetch results
-    await asyncio.sleep(10)
-    for evaluation in evaluations:
-        if not evaluation:
-            continue
-        try:
-            results_response = await client.trace_evaluations.get_results(evaluation.id)
-            if results_response and results_response.results:
-                for result in results_response.results:
-                    print(f" Score: {result.score}, Passed: {result.passed}")
-            else:
-                print(f" Evaluation {evaluation.id}: no results yet")
-        except Exception:
-            print(f" Evaluation {evaluation.id}: results not available yet")
+    # --- Wait for results concurrently
+    result_tasks = [
+        client.trace_evaluations.wait_for_completion(e.id)
+        for e in evaluations if e
+    ]
+    results = await asyncio.gather(*result_tasks)
+    for result in results:
+        if result:
+            print(f" Score: {result.score}, Passed: {result.passed}")
+        else:
+            print(f" No results (evaluation may have failed)")

     await client.judges.delete(judge.id)

12 changes: 2 additions & 10 deletions docs/getting-started/quickstart.md
@@ -34,7 +34,6 @@ print(f"Accuracy: {result.accuracy}")
 ## Create a Judge and Evaluate Traces

 ```python
-import time
 from layerlens import Stratix

 client = Stratix()
@@ -55,15 +54,8 @@ trace_eval = client.trace_evaluations.create(
     judge_id=judge.id,
 )

-# Poll until complete
-while True:
-    evaluation = client.trace_evaluations.get(trace_eval.id)
-    if evaluation.status.value in ("success", "failure"):
-        break
-    time.sleep(2)
-
-# Get results
-result = client.trace_evaluations.get_results(trace_eval.id)
+# Wait for completion and get results
+result = client.trace_evaluations.wait_for_completion(trace_eval.id)
 if result:
     print(f"Score: {result.score}, Passed: {result.passed}")
     print(f"Reasoning: {result.reasoning}")
43 changes: 11 additions & 32 deletions examples/trace_evaluations.py
@@ -7,27 +7,17 @@
 # Construct sync client (API key from env or inline)
 client = Stratix()

-# --- Fetch a model to use for judge creation
-models = client.models.get(type="public", name="gpt-4o")
-if not models:
-    print("No models found, exiting")
-    exit(1)
-model = models[0]
-print(f"Using model: {model.name} ({model.id})")
-
-# --- Create a judge to use for evaluations
+# --- Create a judge (no model_id → server uses default model)
 judge = client.judges.create(
     name=f"Trace Eval Demo Judge {int(time.time())}",
     evaluation_goal="Evaluate whether the response is accurate, complete, and well-structured",
-    model_id=model.id,
 )
 print(f"Created judge {judge.id}: {judge.name}")

 # --- Get existing traces to evaluate
 traces_response = client.traces.get_many(page_size=3)
 if not traces_response or len(traces_response.traces) == 0:
     print("No traces found. Upload some traces first using traces.py")
-    # Clean up the judge
     client.judges.delete(judge.id)
     exit(1)

@@ -48,27 +38,16 @@
 )
 print(f"Created evaluation {evaluation.id}, status: {evaluation.status}")

-# --- Wait for evaluation to complete (poll every 2 seconds, up to 60s)
-for _ in range(30):
-    evaluation = client.trace_evaluations.get(evaluation.id)
-    print(f"Evaluation status: {evaluation.status}")
-    if evaluation.status.value in ("success", "failure"):
-        break
-    time.sleep(2)
-
-# --- Get evaluation results (may 404 if still in progress)
-try:
-    result = client.trace_evaluations.get_results(evaluation.id)
-    if result:
-        print(f" Score: {result.score}, Passed: {result.passed}")
-        print(f" Reasoning: {result.reasoning}")
-        if result.steps:
-            for step in result.steps:
-                print(f" Tool: {step.tool}, Result: {step.result[:80]}")
-    else:
-        print(" No results returned")
-except Exception:
-    print(" No results yet (evaluation may still be in progress)")
+# --- Wait for completion and get results in one call
+result = client.trace_evaluations.wait_for_completion(evaluation.id)
+if result:
+    print(f" Score: {result.score}, Passed: {result.passed}")
+    print(f" Reasoning: {result.reasoning}")
+    if result.steps:
+        for step in result.steps:
+            print(f" Tool: {step.tool}, Result: {step.result[:80]}")
+else:
+    print(" No results returned (evaluation may have failed)")

 # --- List all trace evaluations
 response = client.trace_evaluations.get_many()