diff --git a/.github/workflows/sync-docs.yaml b/.github/workflows/sync-docs.yaml new file mode 100644 index 0000000..95e7848 --- /dev/null +++ b/.github/workflows/sync-docs.yaml @@ -0,0 +1,18 @@ +name: Sync Docs to GitBook + +on: + push: + branches: [main] + paths: ["docs/**"] + +jobs: + sync: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - name: GitBook Sync + uses: gitbook/gitbook-sync@v1 + with: + gitbook-token: ${{ secrets.GITBOOK_TOKEN }} + gitbook-space: ${{ secrets.GITBOOK_SPACE_ID }} + source-dir: docs/ diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..51f04e1 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,59 @@ +# Atlas Python SDK Documentation + +Welcome to the official documentation for the Atlas Python SDK. This library provides convenient access to the LayerLens Atlas REST API from any Python 3.8+ application. + +## What is Atlas? + +Atlas is LayerLens's evaluation platform that allows you to benchmark AI models against various datasets and metrics. The Python SDK provides a synchronous HTTP client powered by [httpx](https://github.com/encode/httpx) and [Pydantic](https://pydantic.dev/) models for type-safe API interactions. 
+ +## Key Features + +- **Simple Authentication**: Secure API key-based authentication +- **Type Safety**: Full Pydantic model support for all API responses +- **Comprehensive Error Handling**: Detailed exception hierarchy for different error scenarios +- **Configurable Timeouts**: Fine-grained timeout control for different operations +- **Environment Variable Support**: Easy configuration through environment variables +- **Python 3.8+ Compatibility**: Works with modern Python versions + +## Quick Start + +```python +import os +from atlas import Atlas + +# Initialize the client +client = Atlas( + api_key=os.environ.get("LAYERLENS_ATLAS_API_KEY"), + organization_id=os.environ.get("LAYERLENS_ATLAS_ORG_ID"), + project_id=os.environ.get("LAYERLENS_ATLAS_PROJECT_ID"), +) + +# Create an evaluation +evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu" +) + +# Get results +if evaluation: + results = client.results.get(evaluation_id=evaluation.id) + print(f"Evaluation completed with {len(results)} results") +``` + +## Navigation + +- **[Getting Started](getting-started/)** - Installation, setup, and your first API call +- **[API Reference](api-reference/)** - Complete documentation of all available methods +- **[Code Examples](examples/)** - Practical examples for common use cases +- **[Troubleshooting](troubleshooting/)** - Solutions to common issues +- **[Security](security/)** - Best practices for secure API usage + +## Support + +- **LayerLens Support**: Contact support through your LayerLens dashboard +- **Documentation**: Visit [docs.layerlens.com](https://docs.layerlens.com) for additional resources +- **API Status**: Check the [LayerLens status page](https://status.layerlens.com) for service updates + +## License + +This SDK is released under the MIT License. 
\ No newline at end of file diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md new file mode 100644 index 0000000..40e99b3 --- /dev/null +++ b/docs/SUMMARY.md @@ -0,0 +1,33 @@ +# Table of Contents + +* [Introduction](README.md) + +## Getting Started +* [Installation](getting-started/installation.md) +* [Authentication & Configuration](getting-started/authentication.md) +* [Quick Start Guide](getting-started/quickstart.md) + +## API Reference +* [Client Configuration](api-reference/client.md) +* [Evaluations](api-reference/evaluations.md) +* [Results](api-reference/results.md) +* [Models & Benchmarks](api-reference/models-benchmarks.md) +* [Error Handling](api-reference/errors.md) + +## Code Examples +* [Creating Evaluations](examples/creating-evaluations.md) +* [Retrieving Results](examples/retrieving-results.md) +* [Working with Timeouts](examples/timeouts.md) +* [Advanced Usage Patterns](examples/advanced-usage.md) + +## Troubleshooting +* [Common Issues](troubleshooting/common-issues.md) +* [Authentication Problems](troubleshooting/authentication.md) +* [Error Codes Reference](troubleshooting/error-codes.md) + +## Security Best Practices +* [API Key Management](security/api-key-management.md) +* [Environment Variables](security/environment-variables.md) +* [Rate Limiting](security/rate-limiting.md) +* [Data Privacy](security/data-privacy.md) + diff --git a/docs/api-reference/client.md b/docs/api-reference/client.md new file mode 100644 index 0000000..a2720bb --- /dev/null +++ b/docs/api-reference/client.md @@ -0,0 +1,277 @@ +# Client Configuration + +The `Atlas` class is the main entry point for interacting with the LayerLens Atlas API. This page covers client initialization, configuration options, and advanced usage patterns. 
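Because the client can read its credentials from environment variables (the names are listed under Environment Variable Configuration), a small pre-flight check can report missing settings before any request is made. The following is a minimal sketch, not part of the SDK: `REQUIRED_ENV_VARS` and `missing_atlas_env_vars` are hypothetical names, and the check assumes that an empty value should count as missing.

```python
import os

# Variable names as listed in this documentation.
# LAYERLENS_ATLAS_BASE_URL is optional, so it is not checked here.
REQUIRED_ENV_VARS = (
    "LAYERLENS_ATLAS_API_KEY",
    "LAYERLENS_ATLAS_ORG_ID",
    "LAYERLENS_ATLAS_PROJECT_ID",
)


def missing_atlas_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; the SDK does not provide it.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_atlas_env_vars()
    if missing:
        print("Missing configuration: " + ", ".join(missing))
    else:
        print("All required Atlas environment variables are set")
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.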
+ +## Basic Usage + +```python +from atlas import Atlas + +# Using environment variables (recommended) +client = Atlas() + +# Explicit configuration +client = Atlas( + api_key="your_api_key", + organization_id="your_org_id", + project_id="your_project_id" +) +``` + +## Constructor Parameters + +### `Atlas(api_key, organization_id, project_id, base_url, timeout)` + +| Parameter | Type | Required | Default | Description | +|-----------|------|----------|---------|-------------| +| `api_key` | `str \| None` | Yes* | `None` | Your LayerLens Atlas API key | +| `organization_id` | `str \| None` | Yes* | `None` | Your organization identifier | +| `project_id` | `str \| None` | Yes* | `None` | The project you want to work with | +| `base_url` | `str \| httpx.URL \| None` | No | Atlas API URL | Custom API base URL | +| `timeout` | `float \| httpx.Timeout \| None` | No | 10 minutes | Request timeout configuration | + +*Required unless set via environment variables + +## Environment Variable Configuration + +The client automatically loads configuration from these environment variables: + +```bash +LAYERLENS_ATLAS_API_KEY="your_api_key_here" +LAYERLENS_ATLAS_ORG_ID="your_org_id_here" +LAYERLENS_ATLAS_PROJECT_ID="your_project_id_here" +LAYERLENS_ATLAS_BASE_URL="https://custom-endpoint.com/api/v1" # Optional +``` + +## Timeout Configuration + +### Simple Timeout + +```python +from atlas import Atlas + +# 30-second timeout for all requests +client = Atlas(timeout=30.0) +``` + +### Advanced Timeout Configuration + +```python +import httpx +from atlas import Atlas + +client = Atlas( + timeout=httpx.Timeout( + connect=5.0, # Connection timeout: 5 seconds + read=60.0, # Read timeout: 60 seconds + write=30.0, # Write timeout: 30 seconds + pool=10.0 # Connection pool timeout: 10 seconds + ) +) +``` + +### Per-Request Timeout Override + +```python +client = Atlas() + +# Override timeout for a specific request +evaluation = client.with_options(timeout=120.0).evaluations.create( + 
model="gpt-4", + benchmark="mmlu" +) +``` + +## Client Methods + +### `copy(**kwargs)` + +Create a new client instance with modified configuration: + +```python +# Base client +client = Atlas(api_key="key1", organization_id="org1") + +# Create a copy with different project +project_client = client.copy(project_id="different_project") + +# Create a copy with different timeout +slow_client = client.copy(timeout=300.0) # 5 minutes +``` + +### `with_options(**kwargs)` + +Temporarily override client options for a single request chain: + +```python +client = Atlas() + +# Use different timeout for this request only +evaluation = client.with_options(timeout=60.0).evaluations.create( + model="gpt-4", + benchmark="mmlu" +) + +# Back to original timeout for subsequent requests +results = client.results.get(evaluation_id=evaluation.id) +``` + +## Resource Access + +The client provides access to different API resources through properties: + +```python +client = Atlas() + +# Access evaluations resource +client.evaluations.create(model="gpt-4", benchmark="mmlu") + +# Access results resource +client.results.get(evaluation_id="eval_123") +``` + +Available resources: +- `client.evaluations` - Create and manage evaluations +- `client.results` - Retrieve evaluation results +- More resources coming soon... 
+ +## Error Handling + +The client raises specific exceptions for different error conditions: + +```python +import atlas +from atlas import Atlas + +client = Atlas() + +try: + evaluation = client.evaluations.create(model="invalid", benchmark="invalid") +except atlas.AuthenticationError: + # 401 - Invalid API key + print("Authentication failed") +except atlas.PermissionDeniedError: + # 403 - Valid API key, insufficient permissions + print("Permission denied") +except atlas.NotFoundError: + # 404 - Resource not found + print("Model or benchmark not found") +except atlas.RateLimitError: + # 429 - Too many requests + print("Rate limit exceeded") +except atlas.InternalServerError: + # 500+ - Server error + print("Server error occurred") +except atlas.APITimeoutError: + # Request timeout (subclass of APIConnectionError, so it must be caught first) + print("Request timed out") +except atlas.APIConnectionError: + # Network/connection issues + print("Connection failed") +``` + +## Authentication Headers + +The client automatically handles authentication by adding the required headers: + +```python +# The client adds this header to all requests: +# x-api-key: your_api_key_value +``` + +You don't need to manually handle authentication headers. + +## Base URL Configuration + +### Default Base URL +The client uses the default LayerLens Atlas API endpoint unless overridden. + +### Custom Base URL +For enterprise or self-hosted deployments: + +```python +from atlas import Atlas + +client = Atlas( + base_url="https://your-atlas-instance.com/api/v1" +) + +# Or via environment variable +# LAYERLENS_ATLAS_BASE_URL="https://your-atlas-instance.com/api/v1" +client = Atlas() # Will use custom base URL from environment +``` + +## Best Practices + +### 1. Use Environment Variables +```python +# ✅ Good - secure and flexible +client = Atlas() + +# ❌ Bad - hardcoded credentials +client = Atlas(api_key="hardcoded_key") +``` + +### 2. 
Configure Appropriate Timeouts +```python +# ✅ Good - reasonable timeout for evaluation creation +client = Atlas(timeout=120.0) # 2 minutes + +# ❌ Bad - too short for long-running operations +client = Atlas(timeout=5.0) # 5 seconds might be too short +``` + +### 3. Handle Errors Gracefully +```python +# ✅ Good - specific error handling +try: + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.RateLimitError: + time.sleep(60) # Wait before retrying + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.APIError as e: + logger.error(f"API error: {e}") + raise +``` + +### 4. Reuse Client Instances +```python +# ✅ Good - reuse the same client +client = Atlas() +eval1 = client.evaluations.create(model="gpt-4", benchmark="mmlu") +eval2 = client.evaluations.create(model="claude-3", benchmark="hellaswag") + +# ❌ Bad - creating new clients unnecessarily +client1 = Atlas() +eval1 = client1.evaluations.create(model="gpt-4", benchmark="mmlu") +client2 = Atlas() # Unnecessary +eval2 = client2.evaluations.create(model="claude-3", benchmark="hellaswag") +``` + +## Thread Safety + +The Atlas client is thread-safe and can be shared across multiple threads: + +```python +import threading +from atlas import Atlas + +client = Atlas() + +def create_evaluation(model_name): + evaluation = client.evaluations.create( + model=model_name, + benchmark="mmlu" + ) + print(f"Created evaluation for {model_name}: {evaluation.id}") + +# Safe to use the same client across threads +threads = [] +for model in ["gpt-4", "claude-3", "llama-2"]: + thread = threading.Thread(target=create_evaluation, args=(model,)) + threads.append(thread) + thread.start() + +for thread in threads: + thread.join() +``` \ No newline at end of file diff --git a/docs/api-reference/errors.md b/docs/api-reference/errors.md new file mode 100644 index 0000000..ae172d4 --- /dev/null +++ b/docs/api-reference/errors.md @@ -0,0 +1,614 @@ +# Error Handling + +The 
Atlas Python SDK provides a comprehensive exception hierarchy to help you handle different error conditions gracefully. This guide covers all available exception types and best practices for error handling. + +## Exception Hierarchy + +All Atlas exceptions inherit from the base `AtlasError` class: + +``` +AtlasError +├── APIError +│ ├── APIConnectionError +│ │ └── APITimeoutError +│ ├── APIResponseValidationError +│ └── APIStatusError +│ ├── BadRequestError (400) +│ ├── AuthenticationError (401) +│ ├── PermissionDeniedError (403) +│ ├── NotFoundError (404) +│ ├── ConflictError (409) +│ ├── UnprocessableEntityError (422) +│ ├── RateLimitError (429) +│ └── InternalServerError (500+) +``` + +## Exception Types + +### Base Exceptions + +#### `AtlasError` +Base exception for all Atlas-related errors. + +```python +import atlas + +try: + client = atlas.Atlas() + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.AtlasError as e: + print(f"Atlas error occurred: {e}") +``` + +#### `APIError` +Base exception for all API-related errors. Contains additional context about the request. + +**Properties:** +- `message`: Error message +- `request`: The HTTP request that caused the error +- `body`: Response body (if available) + +```python +import atlas + +try: + client = atlas.Atlas() + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.APIError as e: + print(f"API error: {e.message}") + print(f"Request URL: {e.request.url}") + print(f"Response body: {e.body}") +``` + +### Connection Errors + +#### `APIConnectionError` +Raised when the client cannot connect to the API server. 
+ +**Common causes:** +- Network connectivity issues +- DNS resolution problems +- Server is down +- Firewall blocking requests + +```python +import atlas + +try: + client = atlas.Atlas() + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.APIConnectionError as e: + print("Connection failed - check your network connection") + print(f"Error details: {e}") +``` + +#### `APITimeoutError` +Raised when a request times out. + +```python +import atlas + +try: + client = atlas.Atlas(timeout=0.2) # Very short timeout + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.APITimeoutError: + print("Request timed out - try increasing timeout or check network") +``` + +### HTTP Status Errors + +All HTTP status errors inherit from `APIStatusError` and include additional properties: + +**Properties:** +- `status_code`: HTTP status code +- `response`: Full HTTP response object +- `request_id`: Request ID for tracking (if provided by server) + +#### `BadRequestError` (400) +Request was malformed or contained invalid parameters. + +```python +import atlas + +try: + client = atlas.Atlas() + # Invalid parameters + evaluation = client.evaluations.create(model="", benchmark="") +except atlas.BadRequestError as e: + print(f"Bad request: {e}") + print(f"Status code: {e.status_code}") +``` + +#### `AuthenticationError` (401) +API key is missing, invalid, or expired. + +```python +import atlas + +try: + client = atlas.Atlas(api_key="invalid_key") + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.AuthenticationError: + print("Authentication failed - check your API key") + print("Make sure LAYERLENS_ATLAS_API_KEY is set correctly") +``` + +#### `PermissionDeniedError` (403) +Valid API key but insufficient permissions for the requested operation. 
+ +```python +import atlas + +try: + client = atlas.Atlas() + evaluation = client.evaluations.create(model="restricted-model", benchmark="mmlu") +except atlas.PermissionDeniedError: + print("Permission denied - check your organization/project access") + print("Contact your administrator for access to this resource") +``` + +#### `NotFoundError` (404) +Requested resource (model, benchmark, evaluation) does not exist. + +```python +import atlas + +try: + client = atlas.Atlas() + evaluation = client.evaluations.create(model="nonexistent-model", benchmark="mmlu") +except atlas.NotFoundError: + print("Model or benchmark not found") + print("Check available models and benchmarks in the Atlas dashboard") +``` + +#### `ConflictError` (409) +Request conflicts with current resource state. + +```python +import atlas + +try: + client = atlas.Atlas() + # Some operation that conflicts with current state + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.ConflictError: + print("Request conflicts with current state") +``` + +#### `UnprocessableEntityError` (422) +The request is well-formed, but the server cannot process it. + +```python +import atlas + +try: + client = atlas.Atlas() + evaluation = client.evaluations.create(model="gpt-4", benchmark="invalid-benchmark") +except atlas.UnprocessableEntityError as e: + print(f"Cannot process request: {e}") + print("Request is well-formed but the operation cannot be completed") +``` + +#### `RateLimitError` (429) +Too many requests sent in a given time period. 
+ +```python +import atlas +import time + +try: + client = atlas.Atlas() + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.RateLimitError as e: + print("Rate limit exceeded") + # Extract retry-after header if available + retry_after = e.response.headers.get('retry-after') + if retry_after: + print(f"Retry after {retry_after} seconds") + time.sleep(int(retry_after)) + else: + print("Waiting 60 seconds before retry...") + time.sleep(60) +``` + +#### `InternalServerError` (500+) +Server-side error occurred. + +```python +import atlas + +try: + client = atlas.Atlas() + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.InternalServerError as e: + print(f"Server error: {e.status_code}") + print("This is a server-side issue - try again later") + print(f"Request ID: {e.request_id}") # For support tickets +``` + +## Best Practices + +### 1. Handle Specific Exceptions + +```python +import atlas +import time +from atlas import Atlas + +def robust_create_evaluation(model: str, benchmark: str, max_retries: int = 3): + client = Atlas() + + for attempt in range(max_retries): + try: + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + return evaluation + + except atlas.AuthenticationError: + print("❌ Authentication failed - check your API key") + break # Don't retry auth errors + + except atlas.PermissionDeniedError: + print("❌ Permission denied - contact your administrator") + break # Don't retry permission errors + + except atlas.NotFoundError: + print(f"❌ Model '{model}' or benchmark '{benchmark}' not found") + break # Don't retry not found errors + + except atlas.RateLimitError as e: + retry_after = e.response.headers.get('retry-after', 60) + print(f"⏳ Rate limited - waiting {retry_after} seconds...") + time.sleep(int(retry_after)) + continue # Retry after waiting + + except atlas.InternalServerError: + if attempt < max_retries - 1: + wait_time = 2 ** attempt # Exponential backoff 
+ print(f"🔄 Server error - retrying in {wait_time}s (attempt {attempt + 1})") + time.sleep(wait_time) + continue + else: + print("❌ Server error - max retries exceeded") + break + + except atlas.APIConnectionError: + if attempt < max_retries - 1: + wait_time = 2 ** attempt + print(f"🔄 Connection error - retrying in {wait_time}s (attempt {attempt + 1})") + time.sleep(wait_time) + continue + else: + print("❌ Connection failed - check your network") + break + + except atlas.APIError as e: + print(f"❌ Unexpected API error: {e}") + break + + return None +``` + +### 2. Graceful Degradation + +```python +import atlas +from atlas import Atlas + +def get_evaluation_results_with_fallback(evaluation_id: str): + client = Atlas() + + try: + results = client.results.get(evaluation_id=evaluation_id) + + if results: + return {"success": True, "data": results, "message": "Results retrieved successfully"} + else: + return {"success": False, "data": None, "message": "No results found"} + + except atlas.NotFoundError: + return {"success": False, "data": None, "message": "Evaluation not found"} + + except atlas.AuthenticationError: + return {"success": False, "data": None, "message": "Authentication required"} + + except atlas.APIConnectionError: + return {"success": False, "data": None, "message": "Service temporarily unavailable"} + + except atlas.APIError as e: + return {"success": False, "data": None, "message": f"Service error: {e}"} + +# Usage +result = get_evaluation_results_with_fallback("eval_123") +if result["success"]: + process_results(result["data"]) +else: + print(f"Could not get results: {result['message']}") +``` + +### 3. 
Logging and Monitoring + +```python +import logging +import atlas +from atlas import Atlas + +# Configure logging +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +def monitored_api_call(): + client = Atlas() + + try: + logger.info("Creating evaluation...") + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") + + if evaluation: + logger.info(f"Evaluation created successfully: {evaluation.id}") + return evaluation + else: + logger.warning("Evaluation creation returned None") + return None + + except atlas.RateLimitError as e: + logger.warning(f"Rate limited - request ID: {e.request_id}") + raise + + except atlas.AuthenticationError: + logger.error("Authentication failed - check API key configuration") + raise + + except atlas.APIConnectionError: + logger.error("Network connection failed") + raise + + except atlas.InternalServerError as e: + logger.error(f"Server error: {e.status_code} - request ID: {e.request_id}") + raise + + except atlas.APIError as e: + logger.error(f"Unexpected API error: {e} - request ID: {getattr(e, 'request_id', 'N/A')}") + raise +``` + +### 4. 
Context Managers for Resource Management + +```python +import atlas +from contextlib import contextmanager +from atlas import Atlas + +@contextmanager +def atlas_client(): + """Context manager for Atlas client with error handling""" + client = None + try: + client = Atlas() + yield client + except atlas.AuthenticationError: + print("Authentication failed") + raise + except atlas.APIConnectionError: + print("Connection failed") + raise + finally: + # Cleanup if needed + pass + +# Usage +try: + with atlas_client() as client: + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") + results = client.results.get(evaluation_id=evaluation.id) +except atlas.AtlasError: + print("Atlas operation failed") +``` + +## Error Response Details + +### Status Error Properties + +```python +import atlas +from atlas import Atlas + +try: + client = Atlas() + evaluation = client.evaluations.create(model="invalid", benchmark="invalid") +except atlas.APIStatusError as e: + print(f"Status Code: {e.status_code}") + print(f"Request ID: {e.request_id}") + print(f"Response Headers: {dict(e.response.headers)}") + print(f"Response Body: {e.body}") + print(f"Request URL: {e.request.url}") + print(f"Request Method: {e.request.method}") +``` + +### Extracting Useful Information + +```python +import atlas +from atlas import Atlas + +def extract_error_info(error: atlas.APIError): + info = { + "type": type(error).__name__, + "message": str(error), + "request_url": error.request.url if hasattr(error, 'request') else None, + "request_method": error.request.method if hasattr(error, 'request') else None, + } + + if hasattr(error, 'status_code'): + info["status_code"] = error.status_code + + if hasattr(error, 'request_id'): + info["request_id"] = error.request_id + + if hasattr(error, 'response'): + info["response_headers"] = dict(error.response.headers) + + return info + +# Usage +try: + client = Atlas() + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except 
atlas.APIError as e: + error_info = extract_error_info(e) + print(f"Error details: {error_info}") +``` + +## Testing Error Handling + +```python +import pytest +import atlas +from unittest.mock import Mock, patch +from atlas import Atlas + +def test_authentication_error_handling(): + """Test that authentication errors are handled properly""" + with patch('atlas.Atlas') as mock_atlas: + mock_atlas.side_effect = atlas.AuthenticationError( + "Invalid API key", + request=Mock(), + response=Mock() + ) + + with pytest.raises(atlas.AuthenticationError): + client = atlas.Atlas() # Look up via the module so the patch applies + client.evaluations.create(model="gpt-4", benchmark="mmlu") + +def test_rate_limit_retry(): + """Test that rate limit errors trigger appropriate retry logic""" + # Your retry logic test here + pass +``` + +## Common Error Scenarios + +### Invalid Configuration + +```python +# Missing API key +try: + client = Atlas(api_key=None) +except atlas.AtlasError as e: + print(f"Configuration error: {e}") + +# Invalid organization/project +try: + client = Atlas(organization_id="invalid", project_id="invalid") + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.PermissionDeniedError: + print("Invalid organization or project ID") +``` + +### Network Issues + +```python +# Connection timeout +try: + client = Atlas(timeout=0.1) # Very short timeout + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.APITimeoutError: + print("Request timed out") + +# Network connectivity +try: + # Simulate network issues + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.APIConnectionError: + print("Network connectivity issue") +``` + +## Error Recovery Strategies + +### Exponential Backoff + +```python +import time +import random +import atlas +from atlas import Atlas + +def exponential_backoff_retry(func, max_retries=3, base_delay=1): + """Retry function with exponential 
backoff""" + for attempt in range(max_retries): + try: + 
return func() + except (atlas.InternalServerError, atlas.APIConnectionError) as e: + if attempt == max_retries - 1: + raise + + delay = base_delay * (2 ** attempt) + random.uniform(0, 1) + print(f"Attempt {attempt + 1} failed, retrying in {delay:.2f}s...") + time.sleep(delay) + +# Usage +def create_evaluation(): + client = Atlas() + return client.evaluations.create(model="gpt-4", benchmark="mmlu") + +evaluation = exponential_backoff_retry(create_evaluation) +``` + +### Circuit Breaker Pattern + +```python +import time +from enum import Enum +from atlas import Atlas +import atlas + +class CircuitState(Enum): + CLOSED = "closed" + OPEN = "open" + HALF_OPEN = "half_open" + +class CircuitBreaker: + def __init__(self, failure_threshold=5, timeout=60): + self.failure_threshold = failure_threshold + self.timeout = timeout + self.failure_count = 0 + self.last_failure_time = None + self.state = CircuitState.CLOSED + + def call(self, func, *args, **kwargs): + if self.state == CircuitState.OPEN: + if time.time() - self.last_failure_time < self.timeout: + raise atlas.APIConnectionError(message="Circuit breaker is OPEN") + else: + self.state = CircuitState.HALF_OPEN + + try: + result = func(*args, **kwargs) + self.on_success() + return result + except (atlas.InternalServerError, atlas.APIConnectionError) as e: + self.on_failure() + raise + + def on_success(self): + self.failure_count = 0 + self.state = CircuitState.CLOSED + + def on_failure(self): + self.failure_count += 1 + self.last_failure_time = time.time() + if self.failure_count >= self.failure_threshold: + self.state = CircuitState.OPEN + +# Usage +breaker = CircuitBreaker() +client = Atlas() + +try: + evaluation = breaker.call( + client.evaluations.create, + model="gpt-4", + benchmark="mmlu" + ) +except atlas.APIError as e: + print(f"Circuit breaker prevented call or operation failed: {e}") +``` diff --git a/docs/api-reference/evaluations.md b/docs/api-reference/evaluations.md new file mode 100644 index 0000000..328c29e 
--- /dev/null +++ b/docs/api-reference/evaluations.md @@ -0,0 +1,284 @@ +# Evaluations + +The `evaluations` resource allows you to create and manage AI model evaluations against various benchmarks. This is the core functionality of the Atlas platform. + +## Overview + +An evaluation runs a specified model against a benchmark dataset and returns comprehensive metrics including accuracy, readability, toxicity, and ethics scores. + +## Methods + +### `create(model, benchmark, timeout=None)` + +Creates a new evaluation for the specified model and benchmark. + +#### Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `model` | `str` | Yes | The model identifier to evaluate | +| `benchmark` | `str` | Yes | The benchmark dataset identifier | +| `timeout` | `float \| httpx.Timeout \| None` | No | Override request timeout | + +#### Returns + +Returns an `Evaluation` object if successful, `None` if the evaluation could not be created. + +#### Example + +```python +from atlas import Atlas + +client = Atlas() + +# Create a basic evaluation +evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu" +) + +if evaluation: + print(f"Evaluation created: {evaluation.id}") + print(f"Status: {evaluation.status}") +else: + print("Failed to create evaluation") +``` + +#### With Custom Timeout + +```python +# Create evaluation with custom timeout (5 minutes) +evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu", + timeout=300.0 +) +``` + +## Response Object + +The `create` method returns an `Evaluation` object with the following properties: + +### Core Properties + +| Property | Type | Description | +|----------|------|-------------| +| `id` | `str` | Unique evaluation identifier | +| `status` | `str` | Current evaluation status | +| `status_description` | `str` | Detailed status description | +| `submitted_at` | `int` | Unix timestamp when evaluation was submitted | +| `finished_at` | `int` 
| Unix timestamp when evaluation finished | + +### Model Information + +| Property | Type | Description | +|----------|------|-------------| +| `model_id` | `str` | Model identifier used in the request | +| `model_name` | `str` | Human-readable model name | +| `model_key` | `str` | Internal model key | +| `model_company` | `str` | Company that created the model | + +### Benchmark Information + +| Property | Type | Description | +|----------|------|-------------| +| `dataset_id` | `str` | Benchmark identifier used in the request | +| `dataset_name` | `str` | Human-readable benchmark name | + +### Performance Metrics + +These properties are available once the evaluation is completed: + +| Property | Type | Description | +|----------|------|-------------| +| `accuracy` | `float` | Overall accuracy score (0.0 to 1.0) | +| `readability_score` | `float` | Readability assessment score | +| `toxicity_score` | `float` | Toxicity assessment score | +| `ethics_score` | `float` | Ethics assessment score | +| `average_duration` | `int` | Average response time in milliseconds | + +## Evaluation Status + +The `status` field can have the following values: + +| Status | Description | +|--------|-------------| +| `"pending"` | Evaluation queued but not yet started | +| `"running"` | Evaluation currently in progress | +| `"completed"` | Evaluation finished successfully | +| `"failed"` | Evaluation failed due to an error | +| `"cancelled"` | Evaluation was cancelled by user | + +## Complete Example + +```python +import time +from atlas import Atlas +import atlas + +def create_and_monitor_evaluation(): + client = Atlas() + + try: + # Create evaluation + evaluation = client.evaluations.create( + model="gpt-3.5-turbo", + benchmark="mmlu" + ) + + if not evaluation: + print("❌ Failed to create evaluation") + return None + + print(f"✅ Evaluation created: {evaluation.id}") + print(f"📊 Model: {evaluation.model_name} ({evaluation.model_company})") + print(f"📋 Benchmark: 
{evaluation.dataset_name}") + print(f"⏰ Submitted at: {evaluation.submitted_at}") + print(f"🔄 Status: {evaluation.status}") + + # Note: In practice, you'd use webhooks or polling to check status + # This is just for demonstration + if evaluation.status == "completed": + print(f"\n📈 Results:") + print(f" Accuracy: {evaluation.accuracy:.2%}") + print(f" Readability: {evaluation.readability_score:.2f}") + print(f" Toxicity: {evaluation.toxicity_score:.2f}") + print(f" Ethics: {evaluation.ethics_score:.2f}") + print(f" Avg Duration: {evaluation.average_duration}ms") + + return evaluation + + except atlas.AuthenticationError: + print("❌ Authentication failed - check your API key") + except atlas.PermissionDeniedError: + print("❌ Permission denied - check your organization/project access") + except atlas.NotFoundError: + print("❌ Model or benchmark not found") + except atlas.RateLimitError: + print("❌ Rate limit exceeded - please wait and try again") + except atlas.APIConnectionError as e: + print(f"❌ Connection error: {e}") + except atlas.APIError as e: + print(f"❌ API error: {e}") + + return None + +if __name__ == "__main__": + evaluation = create_and_monitor_evaluation() +``` + +## Available Models + +Common model identifiers include: + +- `"gpt-4"` - OpenAI GPT-4 +- `"gpt-3.5-turbo"` - OpenAI GPT-3.5 Turbo +- `"claude-3-opus"` - Anthropic Claude 3 Opus +- `"claude-3-sonnet"` - Anthropic Claude 3 Sonnet +- `"llama-2-70b"` - Meta Llama 2 70B +- `"mistral-7b"` - Mistral 7B + +> **Note**: Available models may vary based on your organization's access. Check the LayerLens Atlas dashboard for the complete list of available models. 
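Since the set of models depends on your organization's access, it can help to compare the identifiers you plan to use against a list you have confirmed in the dashboard before submitting anything. The sketch below is illustrative only: `KNOWN_MODELS` simply copies the common identifiers listed above, and `split_by_availability` is a hypothetical helper, not an SDK function.

```python
# Identifiers copied from the "common model identifiers" list above.
# Your organization's actual list may differ; confirm it in the dashboard.
KNOWN_MODELS = frozenset({
    "gpt-4",
    "gpt-3.5-turbo",
    "claude-3-opus",
    "claude-3-sonnet",
    "llama-2-70b",
    "mistral-7b",
})


def split_by_availability(requested, available=KNOWN_MODELS):
    """Split identifiers into (recognized, unrecognized), preserving order.

    Hypothetical helper for illustration; it performs no API calls.
    """
    recognized = [m for m in requested if m in available]
    unrecognized = [m for m in requested if m not in available]
    return recognized, unrecognized


recognized, unrecognized = split_by_availability(["gpt-4", "my-custom-model"])
print("Will evaluate:", recognized)
print("Check the dashboard for:", unrecognized)
```

A local check like this only catches typos early; the API remains the source of truth, so `NotFoundError` should still be handled when creating an evaluation.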
+ +## Available Benchmarks + +Common benchmark identifiers include: + +- `"mmlu"` - Massive Multitask Language Understanding +- `"hellaswag"` - HellaSwag commonsense reasoning +- `"arc-challenge"` - AI2 Reasoning Challenge +- `"truthfulqa"` - TruthfulQA +- `"winogrande"` - WinoGrande +- `"gsm8k"` - Grade School Math 8K + +> **Note**: Available benchmarks may vary based on your organization's access. Check the LayerLens Atlas dashboard for the complete list of available benchmarks. + +## Error Handling + +### Common Errors + +```python +import atlas +from atlas import Atlas + +client = Atlas() + +try: + evaluation = client.evaluations.create( + model="nonexistent-model", + benchmark="mmlu" + ) +except atlas.NotFoundError: + print("Model 'nonexistent-model' not found") +except atlas.BadRequestError: + print("Invalid request parameters") +except atlas.UnprocessableEntityError: + print("Request parameters are valid but cannot be processed") +``` + +### Timeout Handling + +```python +import atlas +from atlas import Atlas + +client = Atlas() + +try: + evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu", + timeout=30.0 # 30 seconds + ) +except atlas.APITimeoutError: + print("Request timed out - try increasing timeout or check network") +``` + +## Best Practices + +### 1. Check Return Values +```python +# ✅ Good - always check if evaluation was created +evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +if evaluation: + print(f"Success: {evaluation.id}") +else: + print("Failed to create evaluation") + +# ❌ Bad - assuming success +evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +print(f"Success: {evaluation.id}") # Could raise AttributeError +``` + +### 2. 
Handle Long-Running Operations +```python +# ✅ Good - appropriate timeout for evaluation creation +evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu", + timeout=120.0 # 2 minutes +) + +# ❌ Bad - timeout too short +evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu", + timeout=5.0 # Likely to timeout +) +``` + +### 3. Store Evaluation IDs +```python +# ✅ Good - store evaluation ID for later retrieval +evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +if evaluation: + # Store this ID in your database/system + evaluation_id = evaluation.id + print(f"Store this ID: {evaluation_id}") +``` + +## Next Steps + +- Learn how to [retrieve results](results.md) for your evaluations +- Explore [code examples](../examples/creating-evaluations.md) for common patterns +- Understand [error handling](errors.md) for robust applications \ No newline at end of file diff --git a/docs/api-reference/models-benchmarks.md b/docs/api-reference/models-benchmarks.md new file mode 100644 index 0000000..ccefdb1 --- /dev/null +++ b/docs/api-reference/models-benchmarks.md @@ -0,0 +1,323 @@ +# Models & Benchmarks + +This page provides reference information about available models and benchmarks in the Atlas platform, along with guidance on selecting appropriate combinations for your evaluations. + +## Overview + +Atlas evaluations require two key components: +- **Model**: The AI model you want to evaluate +- **Benchmark**: The dataset/test suite to evaluate the model against + +The availability of models and benchmarks depends on your organization's access level and the specific Atlas deployment you're using. 
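Because each evaluation pairs exactly one model with one benchmark, comparing several of each means enumerating every pair. The sketch below is a minimal illustration, assuming the identifiers shown on this page are available to your organization. `evaluation_pairs` is a hypothetical helper, not part of the SDK, and it makes no API calls.

```python
from itertools import product
from typing import Iterable, List, Tuple


def evaluation_pairs(models: Iterable[str], benchmarks: Iterable[str]) -> List[Tuple[str, str]]:
    """Return every (model, benchmark) pair, skipping empty IDs and duplicates.

    Hypothetical helper for illustration only; not part of the Atlas SDK.
    """
    # dict.fromkeys de-duplicates while preserving the caller's order
    unique_models = [m for m in dict.fromkeys(models) if m]
    unique_benchmarks = [b for b in dict.fromkeys(benchmarks) if b]
    return list(product(unique_models, unique_benchmarks))


pairs = evaluation_pairs(["gpt-4", "claude-3-opus", "gpt-4"], ["mmlu", "hellaswag"])
for model, benchmark in pairs:
    # Each pair would become one client.evaluations.create(model=..., benchmark=...) call
    print(f"{model} + {benchmark}")
```

Each pair maps to one `client.evaluations.create(model=..., benchmark=...)` call, so the number of evaluations grows as models × benchmarks.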
+ +## Models + +### Model Identification + +Models are identified by string IDs that you pass to the `evaluations.create()` method: + +```python +from atlas import Atlas + +client = Atlas() + +# Using model ID +evaluation = client.evaluations.create( + model="gpt-4", # Model ID + benchmark="mmlu" +) +``` + +### Model Information + +When you create an evaluation, the response includes detailed model information: + +```python +evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") + +if evaluation: + print(f"Model ID: {evaluation.model_id}") # "gpt-4" + print(f"Model Name: {evaluation.model_name}") # "GPT-4" + print(f"Model Key: {evaluation.model_key}") # Internal key + print(f"Model Company: {evaluation.model_company}") # "OpenAI" +``` + +## Benchmarks + +### Benchmark Identification + +Benchmarks are identified by string IDs representing different evaluation datasets: + +```python +from atlas import Atlas + +client = Atlas() + +evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu" # Benchmark ID +) +``` + +### Benchmark Information + +Evaluation responses include benchmark details: + +```python +evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") + +if evaluation: + print(f"Dataset ID: {evaluation.dataset_id}") # "mmlu" + print(f"Dataset Name: {evaluation.dataset_name}") # "MMLU" +``` + +### Performance Expectations + +Different model-benchmark combinations yield different types of insights: + +#### General Intelligence Assessment +```python +# Broad capability assessment +models = ["gpt-4", "claude-3-opus", "llama-2-70b"] +benchmark = "mmlu" + +for model in models: + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + # Compare general intelligence across models +``` + +#### Specialized Task Performance +```python +# Code generation comparison +models = ["gpt-4", "code-llama-34b", "claude-3-sonnet"] +benchmark = "humaneval" + +for model in models: + evaluation = 
client.evaluations.create(model=model, benchmark=benchmark) + # Compare coding abilities +``` + +## Discovery and Validation + +### Finding Available Models and Benchmarks + +#### Check the Atlas Dashboard +The most reliable way to find available models and benchmarks: + +1. Log into your Atlas dashboard +2. Navigate to the evaluation creation page +3. View dropdown lists of available models and benchmarks +4. Note the exact IDs for use in your code + +#### Programmatic Discovery + +While the SDK doesn't currently provide discovery endpoints, you can validate model/benchmark existence: + +```python +import atlas +from atlas import Atlas + +def validate_model_benchmark(model_id: str, benchmark_id: str) -> bool: + """Test if a model/benchmark combination is available""" + client = Atlas() + + try: + evaluation = client.evaluations.create( + model=model_id, + benchmark=benchmark_id + ) + + if evaluation: + print(f"✅ Valid: {model_id} + {benchmark_id}") + return True + else: + print(f"❌ Invalid: {model_id} + {benchmark_id}") + return False + + except atlas.NotFoundError: + print(f"❌ Not found: {model_id} or {benchmark_id}") + return False + except atlas.PermissionDeniedError: + print(f"❌ No access: {model_id} or {benchmark_id}") + return False + except atlas.APIError as e: + print(f"❌ Error: {e}") + return False + +# Test combinations +combinations = [ + ("gpt-4", "mmlu"), + ("claude-3-opus", "hellaswag"), + ("llama-2-70b", "arc-challenge"), + ("nonexistent-model", "mmlu"), # Should fail +] + +for model, benchmark in combinations: + validate_model_benchmark(model, benchmark) +``` + +### Batch Validation + +```python +def batch_validate_combinations(model_benchmark_pairs): + """Validate multiple model/benchmark combinations""" + client = Atlas() + results = {} + + for model, benchmark in model_benchmark_pairs: + try: + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + results[(model, benchmark)] = { + "valid": evaluation is not None, + 
"evaluation_id": evaluation.id if evaluation else None, + "model_name": evaluation.model_name if evaluation else None, + "dataset_name": evaluation.dataset_name if evaluation else None, + } + except atlas.APIError as e: + results[(model, benchmark)] = { + "valid": False, + "error": str(e), + "error_type": type(e).__name__ + } + + return results + +# Example usage +combinations = [ + ("gpt-4", "mmlu"), + ("claude-3-sonnet", "hellaswag"), + ("llama-2-70b", "gsm8k"), +] + +results = batch_validate_combinations(combinations) +for (model, benchmark), result in results.items(): + status = "✅" if result["valid"] else "❌" + print(f"{status} {model} + {benchmark}: {result}") +``` + +### Validate Before Production Use + +```python +def safe_create_evaluation(model: str, benchmark: str): + """Create evaluation with validation and error handling""" + client = Atlas() + + # Validate combination first + if not validate_model_benchmark(model, benchmark): + return None + + try: + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + + if evaluation: + print(f"✅ Evaluation created successfully:") + print(f" ID: {evaluation.id}") + print(f" Model: {evaluation.model_name} ({evaluation.model_company})") + print(f" Benchmark: {evaluation.dataset_name}") + return evaluation + else: + print(f"❌ Failed to create evaluation") + return None + + except atlas.APIError as e: + print(f"❌ API error: {e}") + return None + +# Usage +evaluation = safe_create_evaluation("gpt-4", "mmlu") +``` + +### 4. 
Document Model and Benchmark Choices + +```python +# Document your evaluation strategy +EVALUATION_CONFIGS = { + "general_intelligence": { + "models": ["gpt-4", "claude-3-opus", "gemini-pro"], + "benchmarks": ["mmlu", "arc-challenge", "hellaswag"], + "description": "Broad cognitive ability assessment" + }, + "code_generation": { + "models": ["gpt-4", "code-llama-34b", "claude-3-sonnet"], + "benchmarks": ["humaneval", "mbpp", "apps"], + "description": "Programming and code generation capabilities" + }, + "mathematical_reasoning": { + "models": ["gpt-4", "claude-3-opus", "minerva-62b"], + "benchmarks": ["gsm8k", "math", "minerva-math"], + "description": "Mathematical problem-solving abilities" + } +} + +def run_evaluation_suite(suite_name: str): + """Run a predefined evaluation suite""" + if suite_name not in EVALUATION_CONFIGS: + print(f"Unknown suite: {suite_name}") + return + + config = EVALUATION_CONFIGS[suite_name] + print(f"Running {suite_name}: {config['description']}") + + client = Atlas() + evaluations = [] + + for model in config["models"]: + for benchmark in config["benchmarks"]: + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + if evaluation: + evaluations.append(evaluation) + print(f"✅ {model} + {benchmark}: {evaluation.id}") + + return evaluations + +# Run comprehensive evaluation +evaluations = run_evaluation_suite("general_intelligence") +``` + +## Troubleshooting + +### Model or Benchmark Not Found + +```python +try: + evaluation = client.evaluations.create( + model="nonexistent-model", + benchmark="mmlu" + ) +except atlas.NotFoundError: + print("Model or benchmark not found. Check:") + print("1. Spelling of model/benchmark ID") + print("2. Available options in Atlas dashboard") + print("3. 
Your organization's access permissions") +``` + +### Permission Issues + +```python +try: + evaluation = client.evaluations.create( + model="restricted-model", + benchmark="private-benchmark" + ) +except atlas.PermissionDeniedError: + print("Access denied. Possible causes:") + print("1. Model requires higher permission level") + print("2. Benchmark is not available to your organization") + print("3. Project doesn't have access to these resources") +``` + +### Validation Errors + +```python +try: + evaluation = client.evaluations.create( + model="", # Empty string + benchmark="mmlu" + ) +except atlas.BadRequestError: + print("Invalid request parameters:") + print("- Model and benchmark IDs cannot be empty") + print("- IDs must be valid strings") +``` + +For more information about available models and benchmarks, consult your Atlas dashboard or contact your LayerLens administrator. \ No newline at end of file diff --git a/docs/api-reference/results.md b/docs/api-reference/results.md new file mode 100644 index 0000000..54d6196 --- /dev/null +++ b/docs/api-reference/results.md @@ -0,0 +1,384 @@ +# Results + +The `results` resource allows you to retrieve detailed results from completed evaluations. This provides granular insight into how your model performed on individual test cases. + +## Overview + +Results contain detailed information about each test case in an evaluation, including the prompt, model response, expected answer, scoring metrics, and performance data. + +## Methods + +### `get(evaluation_id, timeout=None)` + +Retrieves detailed results for a specific evaluation. 
+ +#### Parameters + +| Parameter | Type | Required | Description | +|-----------|------|----------|-------------| +| `evaluation_id` | `str` | Yes | The evaluation identifier to get results for | +| `timeout` | `float \| httpx.Timeout \| None` | No | Override request timeout | + +#### Returns + +Returns a list of `Result` objects if successful, `None` if no results are found or the evaluation doesn't exist. + +#### Example + +```python +from atlas import Atlas + +client = Atlas() + +# Get results for a specific evaluation +results = client.results.get(evaluation_id="eval_12345") + +if results: + print(f"Retrieved {len(results)} results") + for i, result in enumerate(results[:3]): # Show first 3 + print(f"\nResult {i+1}:") + print(f" Subset: {result.subset}") + print(f" Score: {result.score}") + print(f" Duration: {result.duration}") +else: + print("No results found or evaluation doesn't exist") +``` + +#### With Custom Timeout + +```python +# Get results with custom timeout (2 minutes) +results = client.results.get( + evaluation_id="eval_12345", + timeout=120.0 +) +``` + +## Result Object + +Each `Result` object contains the following properties: + +### Core Properties + +| Property | Type | Description | +|----------|------|-------------| +| `subset` | `str` | The benchmark subset or category this test case belongs to | +| `prompt` | `str` | The input prompt given to the model | +| `result` | `str` | The model's response/output | +| `truth` | `str` | The expected or correct answer | +| `score` | `float` | Individual score for this test case (typically 0.0 to 1.0) | +| `duration` | `timedelta` | Time taken for the model to respond | +| `metrics` | `Dict[str, float]` | Additional metrics specific to this test case | + +### Understanding Properties + +- **`subset`**: Groups related test cases (e.g., "elementary_mathematics", "world_history") +- **`prompt`**: The exact input sent to the model +- **`result`**: The model's actual response +- **`truth`**: The ground 
truth or expected answer for comparison +- **`score`**: Individual test case score, usually binary (0.0 or 1.0) for correctness +- **`duration`**: Response latency as a Python `timedelta` object +- **`metrics`**: Additional scoring metrics that may be benchmark-specific + +## Complete Example + +```python +import atlas +from atlas import Atlas +from datetime import timedelta + +def analyze_evaluation_results(evaluation_id: str): + client = Atlas() + + try: + # Get results + results = client.results.get(evaluation_id=evaluation_id) + + if not results: + print(f"❌ No results found for evaluation {evaluation_id}") + return + + print(f"📊 Analysis for evaluation {evaluation_id}") + print(f"📈 Total test cases: {len(results)}") + + # Calculate overall statistics + total_score = sum(result.score for result in results) + avg_score = total_score / len(results) + correct_answers = sum(1 for result in results if result.score > 0.5) + accuracy = correct_answers / len(results) + + # Calculate timing statistics + durations = [result.duration for result in results] + avg_duration = sum(durations, timedelta()) / len(durations) + min_duration = min(durations) + max_duration = max(durations) + + print(f"\n🎯 Performance Metrics:") + print(f" Average Score: {avg_score:.3f}") + print(f" Accuracy: {accuracy:.1%} ({correct_answers}/{len(results)})") + print(f" Average Duration: {avg_duration}") + print(f" Min Duration: {min_duration}") + print(f" Max Duration: {max_duration}") + + # Group by subset + subset_stats = {} + for result in results: + if result.subset not in subset_stats: + subset_stats[result.subset] = {"scores": [], "count": 0} + subset_stats[result.subset]["scores"].append(result.score) + subset_stats[result.subset]["count"] += 1 + + print(f"\n📋 Performance by Subset:") + for subset, stats in subset_stats.items(): + subset_avg = sum(stats["scores"]) / len(stats["scores"]) + subset_acc = sum(1 for s in stats["scores"] if s > 0.5) / len(stats["scores"]) + print(f" {subset}: 
{subset_acc:.1%} accuracy ({subset_avg:.3f} avg score, {stats['count']} cases)") + + # Show some example results + print(f"\n🔍 Sample Results:") + for i, result in enumerate(results[:3]): + status = "✅ Correct" if result.score > 0.5 else "❌ Incorrect" + print(f"\n Example {i+1} [{result.subset}] - {status}") + print(f" Prompt: {result.prompt[:100]}...") + print(f" Model Answer: {result.result[:100]}...") + print(f" Expected: {result.truth[:100]}...") + print(f" Score: {result.score}, Duration: {result.duration}") + + if result.metrics: + print(f" Additional Metrics: {result.metrics}") + + return results + + except atlas.NotFoundError: + print(f"❌ Evaluation {evaluation_id} not found") + except atlas.AuthenticationError: + print("❌ Authentication failed - check your API key") + except atlas.APIConnectionError as e: + print(f"❌ Connection error: {e}") + except atlas.APIError as e: + print(f"❌ API error: {e}") + + return None + +if __name__ == "__main__": + # Example usage + evaluation_id = "eval_12345" # Replace with actual evaluation ID + results = analyze_evaluation_results(evaluation_id) +``` + +## Working with Large Result Sets + +For evaluations with many test cases, consider processing results in batches: + +```python +from atlas import Atlas + +def process_results_efficiently(evaluation_id: str): + client = Atlas() + + results = client.results.get(evaluation_id=evaluation_id) + if not results: + return + + print(f"Processing {len(results)} results...") + + # Process in chunks to avoid memory issues with very large result sets + chunk_size = 100 + for i in range(0, len(results), chunk_size): + chunk = results[i:i+chunk_size] + + print(f"Processing results {i+1}-{min(i+chunk_size, len(results))}...") + + # Process this chunk + for result in chunk: + # Your processing logic here + pass +``` + +## Filtering and Analysis + +### Filter by Subset + +```python +def analyze_subset_performance(results, target_subset): + subset_results = [r for r in results if r.subset 
== target_subset] + + if not subset_results: + print(f"No results found for subset '{target_subset}'") + return + + accuracy = sum(1 for r in subset_results if r.score > 0.5) / len(subset_results) + # Average in seconds; summing timedelta objects with sum()'s default start of 0 raises TypeError + avg_duration = sum(r.duration.total_seconds() for r in subset_results) / len(subset_results) + + print(f"Subset '{target_subset}' Performance:") + print(f" Test cases: {len(subset_results)}") + print(f" Accuracy: {accuracy:.1%}") + print(f" Average duration: {avg_duration:.2f}s") + +# Usage +results = client.results.get(evaluation_id="eval_12345") +if results: + analyze_subset_performance(results, "elementary_mathematics") +``` + +### Find Difficult Cases + +```python +def find_difficult_cases(results, score_threshold=0.3): + """Find test cases where the model struggled""" + difficult_cases = [r for r in results if r.score < score_threshold] + + print(f"Found {len(difficult_cases)} difficult cases (score < {score_threshold})") + + for case in difficult_cases[:5]: # Show first 5 + print(f"\nDifficult Case [{case.subset}]:") + print(f" Prompt: {case.prompt[:100]}...") + print(f" Model: {case.result[:50]}...") + print(f" Expected: {case.truth[:50]}...") + print(f" Score: {case.score}") + +# Usage +results = client.results.get(evaluation_id="eval_12345") +if results: + find_difficult_cases(results) +``` + +## Error Handling + +### Common Errors + +```python +import atlas +from atlas import Atlas + +client = Atlas() + +try: + results = client.results.get(evaluation_id="nonexistent_eval") +except atlas.NotFoundError: + print("Evaluation not found or no results available") +except atlas.AuthenticationError: + print("Authentication failed") +except atlas.PermissionDeniedError: + print("No permission to access this evaluation") +``` + +### Handling Empty Results + +```python +def safe_get_results(client, evaluation_id): + """Safely get results with proper error handling""" + try: + results = client.results.get(evaluation_id=evaluation_id) + + if results is None: + print(f"No results found for 
evaluation {evaluation_id}") + print("This could mean:") + print("- Evaluation doesn't exist") + print("- Evaluation not yet completed") + print("- No permission to access results") + return [] + + if len(results) == 0: + print(f"Evaluation {evaluation_id} has no test cases") + return [] + + return results + + except atlas.APIError as e: + print(f"Error retrieving results: {e}") + return [] +``` + +## Performance Considerations + +### Large Result Sets +Results can contain thousands of individual test cases. Consider: + +```python +# ✅ Good - check result size first +results = client.results.get(evaluation_id="eval_12345") +if results: + print(f"Retrieved {len(results)} results") + if len(results) > 1000: + print("Large result set - consider processing in chunks") + +# ❌ Bad - not considering memory usage +results = client.results.get(evaluation_id="eval_12345") +# Process all results in memory without considering size +``` + +### Caching Results +For repeated analysis, consider caching results: + +```python +import pickle +from pathlib import Path + +def get_cached_results(client, evaluation_id, cache_dir="cache"): + cache_path = Path(cache_dir) / f"{evaluation_id}_results.pkl" + + if cache_path.exists(): + print("Loading cached results...") + with open(cache_path, 'rb') as f: + return pickle.load(f) + + print("Fetching fresh results...") + results = client.results.get(evaluation_id=evaluation_id) + + if results: + # parents=True so a nested cache_dir is created as well + cache_path.parent.mkdir(parents=True, exist_ok=True) + with open(cache_path, 'wb') as f: + pickle.dump(results, f) + + return results +``` + +## Best Practices + +### 1. Always Check for Results +```python +# ✅ Good - check if results exist +results = client.results.get(evaluation_id="eval_12345") +if results: + print(f"Found {len(results)} results") +else: + print("No results available") + +# ❌ Bad - assume results exist +results = client.results.get(evaluation_id="eval_12345") +print(f"Found {len(results)} results") # Raises TypeError if results is None +``` + +### 2. 
Handle Large Result Sets Appropriately +```python +# ✅ Good - process in chunks for large sets +if len(results) > 1000: + for i in range(0, len(results), 100): + chunk = results[i:i+100] + process_chunk(chunk) + +# ❌ Bad - process everything in memory +for result in results: # Could be thousands of results + expensive_processing(result) +``` + +### 3. Use Meaningful Analysis +```python +# ✅ Good - extract meaningful insights +subset_performance = {} +for result in results: + if result.subset not in subset_performance: + subset_performance[result.subset] = [] + subset_performance[result.subset].append(result.score) + +# ❌ Bad - just print raw data +for result in results: + print(result.score) # Not very useful +``` + +## Next Steps + +- Learn about [error handling](errors.md) for robust applications +- Explore [code examples](../examples/retrieving-results.md) for common analysis patterns +- Check out [troubleshooting](../troubleshooting/) for common issues \ No newline at end of file diff --git a/docs/examples/advanced-usage.md b/docs/examples/advanced-usage.md new file mode 100644 index 0000000..6db48c5 --- /dev/null +++ b/docs/examples/advanced-usage.md @@ -0,0 +1,461 @@ +# Advanced Usage Patterns + +This guide covers practical advanced techniques for using the Atlas Python SDK in production environments. + +## Environment Variables Setup + +The Atlas SDK reads your credentials from environment variables. You can set them up however you prefer: + +```python +import os +from atlas import Atlas + +# Option 1: Load from system environment variables +client = Atlas() # Automatically uses LAYERLENS_ATLAS_API_KEY, etc. 
+ +# Option 2: Using python-dotenv (if you prefer .env files) +from dotenv import load_dotenv +load_dotenv() # Loads from .env file +client = Atlas() +``` + +Required environment variables: +- `LAYERLENS_ATLAS_API_KEY` - Your Atlas API key +- `LAYERLENS_ATLAS_ORG_ID` - Your organization ID +- `LAYERLENS_ATLAS_PROJECT_ID` - Your project ID + +## Batch Processing + +### Running Multiple Evaluations + +```python +import time +from atlas import Atlas +import atlas + +def run_evaluation_batch(models, benchmarks): + """Run evaluations for multiple model-benchmark combinations""" + client = Atlas() + + results = {'successful': [], 'failed': []} + + for model in models: + for benchmark in benchmarks: + print(f"Creating evaluation: {model} on {benchmark}") + + try: + evaluation = client.evaluations.create( + model=model, + benchmark=benchmark + ) + + if evaluation: + results['successful'].append({ + 'model': model, + 'benchmark': benchmark, + 'evaluation_id': evaluation.id + }) + print(f"✅ Created: {evaluation.id}") + else: + results['failed'].append({ + 'model': model, + 'benchmark': benchmark, + 'error': 'No evaluation returned' + }) + + except atlas.RateLimitError: + print("Rate limited, waiting 60 seconds...") + time.sleep(60) + + except atlas.APIError as e: + print(f"❌ Failed: {e}") + results['failed'].append({ + 'model': model, + 'benchmark': benchmark, + 'error': str(e) + }) + + time.sleep(2) + + return results + +# Usage +models = ["gpt-4", "claude-3-opus"] +benchmarks = ["mmlu", "hellaswag"] + +batch_results = run_evaluation_batch(models, benchmarks) +print(f"✅ Successful: {len(batch_results['successful'])}") +print(f"❌ Failed: {len(batch_results['failed'])}") +``` + +## Error Handling Patterns + +### Robust Error Handling + +```python +import time +from atlas import Atlas +import atlas + +def create_evaluation_with_retries(model, benchmark, max_retries=3): + """Create evaluation with automatic retries""" + client = Atlas() + + for attempt in range(max_retries): + 
try: + evaluation = client.evaluations.create( + model=model, + benchmark=benchmark + ) + + if evaluation: + print(f"✅ Success on attempt {attempt + 1}") + return evaluation + + except atlas.RateLimitError as e: + print(f"Rate limited on attempt {attempt + 1}") + if attempt < max_retries - 1: + # Check if server provided retry-after header + retry_after = getattr(e.response, 'headers', {}).get('retry-after', 60) + wait_time = int(retry_after) + print(f"Waiting {wait_time} seconds...") + time.sleep(wait_time) + else: + raise + + except atlas.NotFoundError: + print(f"❌ Model '{model}' or benchmark '{benchmark}' not found") + return None + + except atlas.AuthenticationError: + print("❌ Authentication failed - check your API key") + raise + + except atlas.APIError as e: + print(f"❌ API error on attempt {attempt + 1}: {e}") + if attempt < max_retries - 1: + time.sleep(2 ** attempt) # Exponential backoff + else: + raise + + return None + +# Usage +evaluation = create_evaluation_with_retries("gpt-4", "mmlu") +``` + +## Result Processing + +### Processing Large Result Sets + +```python +import atlas # Needed for the atlas.APIError handler below +from atlas import Atlas +from typing import Dict + +def analyze_evaluation_results(evaluation_id: str) -> Dict: + """Analyze results from an evaluation""" + client = Atlas() + + try: + results = client.results.get(evaluation_id=evaluation_id) + + if not results: + return {"error": "No results found"} + + # Basic analysis + analysis = { + "total_results": len(results), + "subsets": {}, + "overall_accuracy": 0, + "avg_duration": 0 + } + + total_score = 0 + total_duration = 0 + + for result in results: + # Track by subset + if result.subset not in analysis["subsets"]: + analysis["subsets"][result.subset] = { + "count": 0, + "total_score": 0, + "accuracy": 0 + } + + analysis["subsets"][result.subset]["count"] += 1 + analysis["subsets"][result.subset]["total_score"] += result.score + + total_score += result.score + total_duration += result.duration.total_seconds() + + # 
Calculate averages + analysis["overall_accuracy"] = total_score / len(results) + analysis["avg_duration"] = total_duration / len(results) + + # Calculate subset accuracies + for subset_data in analysis["subsets"].values(): + subset_data["accuracy"] = subset_data["total_score"] / subset_data["count"] + + return analysis + + except atlas.APIError as e: + return {"error": str(e)} + +# Usage +analysis = analyze_evaluation_results("eval_123") +if "error" not in analysis: + print(f"📊 Analysis Results:") + print(f" Total results: {analysis['total_results']}") + print(f" Overall accuracy: {analysis['overall_accuracy']:.2%}") + print(f" Average duration: {analysis['avg_duration']:.2f}s") + + print(f" By subset:") + for subset, data in analysis['subsets'].items(): + print(f" {subset}: {data['accuracy']:.2%} ({data['count']} results)") +``` + +## Production Timeouts + +### Different Timeout Strategies + +```python +from atlas import Atlas + +# Different timeout configurations for different use cases + +# Development: Fail fast +dev_client = Atlas(timeout=30.0) # 30 seconds + +# Production: More patient +prod_client = Atlas(timeout=600.0) # 10 minutes + +# Long-running batch jobs: Very patient +batch_client = Atlas(timeout=1800.0) # 30 minutes + +def adaptive_timeout_client(operation_type="default"): + """Get client with timeout appropriate for operation""" + timeouts = { + "quick": 30.0, # For testing connectivity + "default": 300.0, # For normal operations + "batch": 1800.0, # For batch processing + "patient": 3600.0 # For very long evaluations + } + + timeout = timeouts.get(operation_type, timeouts["default"]) + return Atlas(timeout=timeout) + +# Usage +quick_client = adaptive_timeout_client("quick") +batch_client = adaptive_timeout_client("batch") +``` + +## Logging and Monitoring + +### Simple Logging Setup + +```python +import logging +import time +from atlas import Atlas +import atlas + +# Set up logging +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s 
- %(name)s - %(levelname)s - %(message)s' +) +logger = logging.getLogger('atlas-client') + +def create_evaluation_with_logging(model, benchmark): + """Create evaluation with comprehensive logging""" + client = Atlas() + + logger.info(f"Creating evaluation: {model} on {benchmark}") + start_time = time.time() + + try: + evaluation = client.evaluations.create( + model=model, + benchmark=benchmark + ) + + duration = time.time() - start_time + + if evaluation: + logger.info( + f"Evaluation created successfully: {evaluation.id} " + f"(duration: {duration:.2f}s)" + ) + return evaluation + else: + logger.warning( + f"No evaluation returned for {model}+{benchmark} " + f"(duration: {duration:.2f}s)" + ) + return None + + except atlas.APIError as e: + duration = time.time() - start_time + logger.error( + f"Failed to create evaluation {model}+{benchmark}: {e} " + f"(duration: {duration:.2f}s)" + ) + raise + +# Usage +evaluation = create_evaluation_with_logging("gpt-4", "mmlu") +``` + +## Health Checks + +### Simple Health Check + +```python +from atlas import Atlas +import atlas + +def check_atlas_health(): + """Simple health check for Atlas service""" + try: + client = Atlas(timeout=10.0) # Short timeout for health check + + # Try to create a test evaluation (expected to fail, but tests connectivity) + try: + client.evaluations.create( + model="__health_check__", + benchmark="__health_check__" + ) + except atlas.NotFoundError: + # Expected - health check resources don't exist + return {"status": "healthy", "message": "API is reachable"} + except atlas.BadRequestError: + # Also expected - invalid parameters + return {"status": "healthy", "message": "API is reachable"} + else: + # The call returned without raising, so the API responded; + # without this branch the function would return None + return {"status": "healthy", "message": "API is reachable"} + + except atlas.AuthenticationError: + return { + "status": "unhealthy", + "error": "Authentication failed - check API key" + } + except atlas.APIConnectionError: + return { + "status": "unhealthy", + "error": "Cannot connect to Atlas API" + } + except atlas.APITimeoutError: + return { + "status": "unhealthy", + 
"error": "Health check timed out" + } + except Exception as e: + return { + "status": "unhealthy", + "error": f"Unexpected error: {e}" + } + +# Usage +health = check_atlas_health() +if health["status"] == "healthy": + print("✅ Atlas service is healthy") +else: + print(f"❌ Atlas service is unhealthy: {health['error']}") +``` + +## Integration Patterns + +### Using with Flask/FastAPI + +```python +from flask import Flask, jsonify, request +from atlas import Atlas +import atlas + +app = Flask(__name__) + +# Initialize Atlas client once +atlas_client = Atlas() + +@app.route('/health') +def health_check(): + """Health check endpoint""" + health = check_atlas_health() # From example above + status_code = 200 if health["status"] == "healthy" else 503 + return jsonify(health), status_code + +@app.route('/evaluations', methods=['POST']) +def create_evaluation(): + """Create evaluation endpoint""" + try: + data = request.get_json() + model = data.get('model') + benchmark = data.get('benchmark') + + if not model or not benchmark: + return jsonify({ + "error": "Missing required fields: model, benchmark" + }), 400 + + evaluation = atlas_client.evaluations.create( + model=model, + benchmark=benchmark + ) + + if evaluation: + return jsonify({ + "success": True, + "evaluation_id": evaluation.id, + "status": evaluation.status + }) + else: + return jsonify({ + "success": False, + "error": "Failed to create evaluation" + }), 500 + + except atlas.NotFoundError: + return jsonify({ + "success": False, + "error": "Model or benchmark not found" + }), 404 + + except atlas.APIError as e: + return jsonify({ + "success": False, + "error": str(e) + }), 500 + +@app.route('/evaluations//results') +def get_results(evaluation_id): + """Get evaluation results endpoint""" + try: + results = atlas_client.results.get(evaluation_id=evaluation_id) + + if results: + return jsonify({ + "success": True, + "result_count": len(results), + "results": [ + { + "subset": r.subset, + "score": r.score, + 
"duration_seconds": r.duration.total_seconds() + } + for r in results + ] + }) + else: + return jsonify({ + "success": False, + "error": "No results found" + }), 404 + + except atlas.APIError as e: + return jsonify({ + "success": False, + "error": str(e) + }), 500 + +if __name__ == '__main__': + app.run(debug=True) +``` diff --git a/docs/examples/creating-evaluations.md b/docs/examples/creating-evaluations.md new file mode 100644 index 0000000..37da6bf --- /dev/null +++ b/docs/examples/creating-evaluations.md @@ -0,0 +1,799 @@ +# Creating Evaluations + +This guide provides practical examples for creating evaluations with the Atlas Python SDK. + +## Basic Evaluation Creation + +### Simple Evaluation + +The most straightforward way to create an evaluation: + +```python +from atlas import Atlas + +# Initialize client +client = Atlas() + +# Create evaluation +evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu" +) + +if evaluation: + print(f"✅ Evaluation created: {evaluation.id}") + print(f" Model: {evaluation.model_name}") + print(f" Benchmark: {evaluation.dataset_name}") + print(f" Status: {evaluation.status}") +else: + print("❌ Failed to create evaluation") +``` + +### With Explicit Configuration + +Using explicit client configuration instead of environment variables: + +```python +from atlas import Atlas + +# Explicit configuration +client = Atlas( + api_key="your_api_key_here", + organization_id="your_org_id", + project_id="your_project_id" +) + +evaluation = client.evaluations.create( + model="claude-3-opus", + benchmark="hellaswag" +) + +if evaluation: + print(f"Evaluation ID: {evaluation.id}") + print(f"Submitted at: {evaluation.submitted_at}") +``` + +## Batch Evaluation Creation + +### Multiple Models on Same Benchmark + +Compare multiple models against the same benchmark: + +```python +from atlas import Atlas +import time + +def compare_models_on_benchmark(models: list, benchmark: str): + """Create evaluations for multiple models on 
the same benchmark""" + client = Atlas() + evaluations = [] + + print(f"🔄 Creating evaluations for {len(models)} models on {benchmark}") + + for model in models: + try: + evaluation = client.evaluations.create( + model=model, + benchmark=benchmark + ) + + if evaluation: + evaluations.append({ + "model": model, + "evaluation_id": evaluation.id, + "model_name": evaluation.model_name, + "status": evaluation.status + }) + print(f"✅ {model}: {evaluation.id}") + else: + print(f"❌ Failed to create evaluation for {model}") + + except Exception as e: + print(f"❌ Error creating evaluation for {model}: {e}") + + # Brief pause between requests to avoid rate limits + time.sleep(0.5) + + return evaluations + +# Usage +models_to_compare = [ + "gpt-4", + "gpt-3.5-turbo", + "claude-3-opus", + "claude-3-sonnet", + "llama-2-70b" +] + +evaluations = compare_models_on_benchmark(models_to_compare, "mmlu") + +# Print summary +print(f"\n📊 Created {len(evaluations)} evaluations:") +for eval_info in evaluations: + print(f" {eval_info['model_name']}: {eval_info['evaluation_id']}") +``` + +### Single Model on Multiple Benchmarks + +Evaluate one model across multiple benchmarks: + +```python +from atlas import Atlas +import time + +def evaluate_model_on_benchmarks(model: str, benchmarks: list): + """Evaluate a single model across multiple benchmarks""" + client = Atlas() + evaluations = [] + + print(f"🔄 Evaluating {model} on {len(benchmarks)} benchmarks") + + for benchmark in benchmarks: + try: + evaluation = client.evaluations.create( + model=model, + benchmark=benchmark + ) + + if evaluation: + evaluations.append({ + "benchmark": benchmark, + "evaluation_id": evaluation.id, + "dataset_name": evaluation.dataset_name, + "status": evaluation.status + }) + print(f"✅ {benchmark}: {evaluation.id}") + else: + print(f"❌ Failed to create evaluation for {benchmark}") + + except Exception as e: + print(f"❌ Error evaluating on {benchmark}: {e}") + + time.sleep(0.5) + + return evaluations + +# Usage 
+benchmarks_to_test = [ + "mmlu", + "hellaswag", + "arc-challenge", + "truthfulqa", + "gsm8k" +] + +evaluations = evaluate_model_on_benchmarks("gpt-4", benchmarks_to_test) + +print(f"\n📊 Created {len(evaluations)} evaluations for GPT-4:") +for eval_info in evaluations: + print(f" {eval_info['dataset_name']}: {eval_info['evaluation_id']}") +``` + +### Full Matrix Evaluation + +Create evaluations for all model-benchmark combinations: + +```python +from atlas import Atlas +import time +import itertools + +def create_evaluation_matrix(models: list, benchmarks: list, delay: float = 1.0): + """Create evaluations for all model-benchmark combinations""" + client = Atlas() + results = {} + total_combinations = len(models) * len(benchmarks) + + print(f"🔄 Creating {total_combinations} evaluations...") + + for i, (model, benchmark) in enumerate(itertools.product(models, benchmarks), 1): + print(f"\n[{i}/{total_combinations}] {model} + {benchmark}") + + try: + evaluation = client.evaluations.create( + model=model, + benchmark=benchmark + ) + + if evaluation: + if model not in results: + results[model] = {} + + results[model][benchmark] = { + "evaluation_id": evaluation.id, + "model_name": evaluation.model_name, + "dataset_name": evaluation.dataset_name, + "status": evaluation.status, + "success": True + } + print(f"✅ Success: {evaluation.id}") + else: + print(f"❌ Failed: No evaluation created") + + except Exception as e: + print(f"❌ Error: {e}") + if model not in results: + results[model] = {} + results[model][benchmark] = { + "error": str(e), + "success": False + } + + # Rate limiting + if i < total_combinations: + time.sleep(delay) + + return results + +# Usage +test_models = ["gpt-4", "claude-3-opus", "llama-2-70b"] +test_benchmarks = ["mmlu", "hellaswag", "arc-challenge"] + +matrix_results = create_evaluation_matrix(test_models, test_benchmarks, delay=2.0) + +# Print summary table +print(f"\n📊 Evaluation Matrix Results:") +print("Model".ljust(15), end="") +for benchmark in 
test_benchmarks: + print(benchmark.ljust(15), end="") +print() + +for model in test_models: + print(model.ljust(15), end="") + for benchmark in test_benchmarks: + if model in matrix_results and benchmark in matrix_results[model]: + result = matrix_results[model][benchmark] + status = "✅" if result["success"] else "❌" + print(status.ljust(15), end="") + else: + print("❓".ljust(15), end="") + print() +``` + +## Error Handling and Resilience + +### Robust Evaluation Creation with Retries + +```python +import atlas +from atlas import Atlas +import time +import random + +def create_evaluation_with_retry( + model: str, + benchmark: str, + max_retries: int = 3, + base_delay: float = 1.0 +): + """Create evaluation with exponential backoff retry logic""" + client = Atlas() + + for attempt in range(max_retries): + try: + print(f"🔄 Attempt {attempt + 1}/{max_retries}: Creating evaluation...") + + evaluation = client.evaluations.create( + model=model, + benchmark=benchmark, + timeout=120.0 # 2-minute timeout + ) + + if evaluation: + print(f"✅ Success on attempt {attempt + 1}: {evaluation.id}") + return evaluation + else: + print(f"❌ Evaluation creation returned None on attempt {attempt + 1}") + + except atlas.RateLimitError as e: + retry_after = e.response.headers.get('retry-after', base_delay * (2 ** attempt)) + print(f"⏳ Rate limited, waiting {retry_after}s...") + time.sleep(float(retry_after)) + continue + + except atlas.InternalServerError: + if attempt < max_retries - 1: + delay = base_delay * (2 ** attempt) + random.uniform(0, 1) + print(f"🔄 Server error, retrying in {delay:.1f}s...") + time.sleep(delay) + continue + else: + print("❌ Server error - max retries exceeded") + break + + except atlas.APIConnectionError: + if attempt < max_retries - 1: + delay = base_delay * (2 ** attempt) + print(f"🔄 Connection error, retrying in {delay:.1f}s...") + time.sleep(delay) + continue + else: + print("❌ Connection failed - max retries exceeded") + break + + except 
atlas.AuthenticationError: + print("❌ Authentication failed - check your API key") + break + + except atlas.NotFoundError: + print(f"❌ Model '{model}' or benchmark '{benchmark}' not found") + break + + except atlas.PermissionDeniedError: + print("❌ Permission denied - check your access rights") + break + + except atlas.APIError as e: + print(f"❌ API error: {e}") + break + + return None + +# Usage +evaluation = create_evaluation_with_retry( + model="gpt-4", + benchmark="mmlu", + max_retries=3 +) + +if evaluation: + print(f"Final result: {evaluation.id}") +else: + print("Failed to create evaluation after all attempts") +``` + +### Validation Before Creation + +```python +import atlas +from atlas import Atlas + +def validate_and_create_evaluation(model: str, benchmark: str): + """Validate model and benchmark before creating evaluation""" + client = Atlas() + + # Pre-validation checks + if not model or not model.strip(): + print("❌ Model cannot be empty") + return None + + if not benchmark or not benchmark.strip(): + print("❌ Benchmark cannot be empty") + return None + + print(f"🔍 Validating {model} + {benchmark}...") + + try: + # Attempt to create the evaluation + evaluation = client.evaluations.create( + model=model.strip(), + benchmark=benchmark.strip() + ) + + if evaluation: + print(f"✅ Validation successful!") + print(f" Evaluation ID: {evaluation.id}") + print(f" Model: {evaluation.model_name} ({evaluation.model_company})") + print(f" Benchmark: {evaluation.dataset_name}") + print(f" Status: {evaluation.status}") + return evaluation + else: + print("❌ Validation failed: No evaluation returned") + return None + + except atlas.NotFoundError: + print(f"❌ Validation failed: Model '{model}' or benchmark '{benchmark}' not found") + print("💡 Suggestions:") + print(" • Check spelling of model and benchmark IDs") + print(" • Verify available options in Atlas dashboard") + print(" • Ensure your organization has access to these resources") + return None + + except 
atlas.AuthenticationError: + print("❌ Authentication failed") + print("💡 Check your API key configuration") + return None + + except atlas.PermissionDeniedError: + print("❌ Permission denied") + print("💡 Contact your administrator for access") + return None + + except atlas.APIError as e: + print(f"❌ Validation failed: {e}") + return None + +# Usage with validation +test_combinations = [ + ("gpt-4", "mmlu"), + ("claude-3-opus", "hellaswag"), + ("nonexistent-model", "mmlu"), # This should fail + ("gpt-4", "nonexistent-benchmark"), # This should fail +] + +for model, benchmark in test_combinations: + print(f"\n{'='*50}") + evaluation = validate_and_create_evaluation(model, benchmark) + + if evaluation: + print(f"Ready to monitor evaluation: {evaluation.id}") +``` + +## Custom Timeout Configurations + +### Different Timeouts for Different Operations + +```python +from atlas import Atlas +import atlas +import httpx + +def create_evaluations_with_custom_timeouts(): + """Demonstrate different timeout configurations""" + + # Quick timeout for testing connectivity + quick_client = Atlas(timeout=30.0) # 30 seconds + + # Standard timeout for regular evaluations + standard_client = Atlas(timeout=300.0) # 5 minutes + + # Long timeout for complex evaluations + patient_client = Atlas( + timeout=httpx.Timeout( + connect=10.0, # 10s to connect + read=1800.0, # 30min to read response + write=60.0, # 1min to send request + pool=30.0 # 30s for connection pool + ) + ) + + # Test connectivity with quick client + print("🔍 Testing connectivity...") + try: + test_eval = quick_client.evaluations.create( + model="gpt-3.5-turbo", # Faster model for testing + benchmark="arc-easy" # Smaller benchmark for testing + ) + print("✅ Connectivity test passed") + except atlas.APITimeoutError: + print("❌ Quick connectivity test failed - network issues?") + return + except atlas.APIError as e: + print(f"❌ API error during connectivity test: {e}") + return + + # Create standard evaluation + print("\n🔄 Creating 
standard evaluation...") + try: + standard_eval = standard_client.evaluations.create( + model="gpt-4", + benchmark="mmlu" + ) + if standard_eval: + print(f"✅ Standard evaluation created: {standard_eval.id}") + except atlas.APITimeoutError: + print("❌ Standard evaluation timed out") + + # Create complex evaluation with patient timeout + print("\n🔄 Creating complex evaluation...") + try: + complex_eval = patient_client.evaluations.create( + model="gpt-4", + benchmark="math" # Complex benchmark + ) + if complex_eval: + print(f"✅ Complex evaluation created: {complex_eval.id}") + except atlas.APITimeoutError: + print("❌ Complex evaluation timed out even with extended timeout") + +# Run the example +create_evaluations_with_custom_timeouts() +``` + +### Per-Request Timeout Override + +```python +from atlas import Atlas +import atlas + +def create_evaluation_with_override_timeout(): + """Override timeout for specific requests""" + client = Atlas(timeout=60.0) # Default 1-minute timeout + + evaluations = [] + + # Quick evaluation with short timeout + print("🔄 Quick evaluation (30s timeout)...") + try: + quick_eval = client.with_options(timeout=30.0).evaluations.create( + model="gpt-3.5-turbo", + benchmark="arc-easy" + ) + if quick_eval: + evaluations.append(("Quick", quick_eval)) + print(f"✅ Quick: {quick_eval.id}") + except atlas.APITimeoutError: + print("❌ Quick evaluation timed out") + + # Standard evaluation (uses default timeout) + print("\n🔄 Standard evaluation (default 60s timeout)...") + try: + standard_eval = client.evaluations.create( + model="gpt-4", + benchmark="mmlu" + ) + if standard_eval: + evaluations.append(("Standard", standard_eval)) + print(f"✅ Standard: {standard_eval.id}") + except atlas.APITimeoutError: + print("❌ Standard evaluation timed out") + + # Long evaluation with extended timeout + print("\n🔄 Long evaluation (5min timeout)...") + try: + long_eval = client.with_options(timeout=300.0).evaluations.create( + model="gpt-4", + benchmark="math" + ) + if 
long_eval: + evaluations.append(("Long", long_eval)) + print(f"✅ Long: {long_eval.id}") + except atlas.APITimeoutError: + print("❌ Long evaluation timed out") + + return evaluations + +evaluations = create_evaluation_with_override_timeout() +print(f"\n📊 Created {len(evaluations)} evaluations total") +``` + +## Monitoring and Logging + +### Evaluation Creation with Logging + +```python +import logging +from datetime import datetime +from atlas import Atlas +import atlas + +# Configure logging +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(levelname)s - %(message)s', + handlers=[ + logging.FileHandler('atlas_evaluations.log'), + logging.StreamHandler() + ] +) +logger = logging.getLogger(__name__) + +def create_evaluation_with_logging(model: str, benchmark: str, context: dict = None): + """Create evaluation with comprehensive logging""" + client = Atlas() + context = context or {} + + logger.info(f"Starting evaluation creation: {model} + {benchmark}") + logger.info(f"Context: {context}") + + start_time = datetime.now() + + try: + evaluation = client.evaluations.create( + model=model, + benchmark=benchmark + ) + + end_time = datetime.now() + duration = (end_time - start_time).total_seconds() + + if evaluation: + logger.info(f"✅ Evaluation created successfully in {duration:.2f}s") + logger.info(f" ID: {evaluation.id}") + logger.info(f" Model: {evaluation.model_name} ({evaluation.model_company})") + logger.info(f" Benchmark: {evaluation.dataset_name}") + logger.info(f" Status: {evaluation.status}") + logger.info(f" Submitted at: {evaluation.submitted_at}") + + return { + "success": True, + "evaluation": evaluation, + "duration": duration, + "timestamp": start_time.isoformat() + } + else: + logger.error(f"❌ Evaluation creation failed - returned None") + return { + "success": False, + "error": "No evaluation returned", + "duration": duration, + "timestamp": start_time.isoformat() + } + + except atlas.RateLimitError as e: + logger.warning(f"⏳ Rate 
limited - request ID: {getattr(e, 'request_id', 'N/A')}") + return {"success": False, "error": "rate_limited", "retry_after": e.response.headers.get('retry-after')} + + except atlas.AuthenticationError: + logger.error("❌ Authentication failed - check API key") + return {"success": False, "error": "authentication_failed"} + + except atlas.NotFoundError: + logger.error(f"❌ Model '{model}' or benchmark '{benchmark}' not found") + return {"success": False, "error": "not_found", "model": model, "benchmark": benchmark} + + except atlas.APIError as e: + logger.error(f"❌ API error: {e}") + return {"success": False, "error": str(e), "error_type": type(e).__name__} + + except Exception as e: + logger.error(f"❌ Unexpected error: {e}") + return {"success": False, "error": f"unexpected: {e}"} + +# Usage +evaluation_configs = [ + {"model": "gpt-4", "benchmark": "mmlu", "context": {"purpose": "baseline_test"}}, + {"model": "claude-3-opus", "benchmark": "hellaswag", "context": {"purpose": "reasoning_comparison"}}, + {"model": "llama-2-70b", "benchmark": "gsm8k", "context": {"purpose": "math_evaluation"}}, +] + +results = [] +for config in evaluation_configs: + result = create_evaluation_with_logging(**config) + results.append(result) + + if not result["success"]: + logger.error(f"Failed to create evaluation: {config}") + +# Summary +successful = [r for r in results if r["success"]] +failed = [r for r in results if not r["success"]] + +logger.info(f"📊 Summary: {len(successful)} successful, {len(failed)} failed") +for result in successful: + logger.info(f" ✅ {result['evaluation'].id} ({result['duration']:.2f}s)") +for result in failed: + logger.info(f" ❌ {result.get('error', 'unknown_error')}") +``` + +## Advanced Patterns + +### Evaluation Factory Pattern + +```python +from atlas import Atlas +from abc import ABC, abstractmethod +from typing import List, Dict, Any +import atlas + +class EvaluationStrategy(ABC): + """Abstract base class for evaluation strategies""" + + 
@abstractmethod + def get_model_benchmark_pairs(self) -> List[tuple]: + pass + + @abstractmethod + def get_description(self) -> str: + pass + +class GeneralIntelligenceStrategy(EvaluationStrategy): + """Strategy for general intelligence assessment""" + + def get_model_benchmark_pairs(self) -> List[tuple]: + models = ["gpt-4", "claude-3-opus", "llama-2-70b"] + benchmarks = ["mmlu", "arc-challenge", "hellaswag"] + return [(m, b) for m in models for b in benchmarks] + + def get_description(self) -> str: + return "General intelligence assessment across major benchmarks" + +class CodeGenerationStrategy(EvaluationStrategy): + """Strategy for code generation assessment""" + + def get_model_benchmark_pairs(self) -> List[tuple]: + models = ["gpt-4", "code-llama-34b", "claude-3-sonnet"] + benchmarks = ["humaneval", "mbpp"] + return [(m, b) for m in models for b in benchmarks] + + def get_description(self) -> str: + return "Code generation capability assessment" + +class MathReasoningStrategy(EvaluationStrategy): + """Strategy for mathematical reasoning assessment""" + + def get_model_benchmark_pairs(self) -> List[tuple]: + models = ["gpt-4", "claude-3-opus", "minerva-62b"] + benchmarks = ["gsm8k", "math"] + return [(m, b) for m in models for b in benchmarks] + + def get_description(self) -> str: + return "Mathematical reasoning and problem-solving assessment" + +class EvaluationFactory: + """Factory for creating evaluations based on strategies""" + + def __init__(self): + self.client = Atlas() + + def execute_strategy(self, strategy: EvaluationStrategy) -> Dict[str, Any]: + """Execute an evaluation strategy""" + pairs = strategy.get_model_benchmark_pairs() + description = strategy.get_description() + + print(f"🔄 Executing strategy: {description}") + print(f"📊 Creating {len(pairs)} evaluations...") + + results = { + "strategy": description, + "evaluations": [], + "errors": [], + "summary": {"total": len(pairs), "successful": 0, "failed": 0} + } + + for model, benchmark in 
pairs: + try: + evaluation = self.client.evaluations.create( + model=model, + benchmark=benchmark + ) + + if evaluation: + results["evaluations"].append({ + "model": model, + "benchmark": benchmark, + "evaluation_id": evaluation.id, + "model_name": evaluation.model_name, + "dataset_name": evaluation.dataset_name, + "status": evaluation.status + }) + results["summary"]["successful"] += 1 + print(f"✅ {model} + {benchmark}: {evaluation.id}") + else: + results["errors"].append({ + "model": model, + "benchmark": benchmark, + "error": "No evaluation returned" + }) + results["summary"]["failed"] += 1 + print(f"❌ {model} + {benchmark}: Failed") + + except atlas.APIError as e: + results["errors"].append({ + "model": model, + "benchmark": benchmark, + "error": str(e), + "error_type": type(e).__name__ + }) + results["summary"]["failed"] += 1 + print(f"❌ {model} + {benchmark}: {e}") + + return results + +# Usage +factory = EvaluationFactory() + +# Run different strategies +strategies = [ + GeneralIntelligenceStrategy(), + CodeGenerationStrategy(), + MathReasoningStrategy() +] + +all_results = [] +for strategy in strategies: + result = factory.execute_strategy(strategy) + all_results.append(result) + + print(f"\n📈 Strategy Results: {result['strategy']}") + print(f" Successful: {result['summary']['successful']}") + print(f" Failed: {result['summary']['failed']}") + print() + +# Overall summary +total_evaluations = sum(r["summary"]["successful"] for r in all_results) +total_errors = sum(r["summary"]["failed"] for r in all_results) + +print(f"🎯 Overall Summary:") +print(f" Total evaluations created: {total_evaluations}") +print(f" Total errors: {total_errors}") +print(f" Success rate: {total_evaluations/(total_evaluations+total_errors)*100:.1f}%") +``` diff --git a/docs/examples/retrieving-results.md b/docs/examples/retrieving-results.md new file mode 100644 index 0000000..9b4261a --- /dev/null +++ b/docs/examples/retrieving-results.md @@ -0,0 +1,828 @@ +# Retrieving Results + 
+This guide provides practical examples for retrieving and analyzing evaluation results with the Atlas Python SDK. + +## Basic Result Retrieval + +### Simple Result Fetching + +```python +from atlas import Atlas + +# Initialize client +client = Atlas() + +# Get results for a specific evaluation +evaluation_id = "eval_12345" # Replace with your evaluation ID +results = client.results.get(evaluation_id=evaluation_id) + +if results: + print(f"📊 Retrieved {len(results)} results") + + # Show first few results + for i, result in enumerate(results[:3]): + print(f"\nResult {i+1}:") + print(f" Subset: {result.subset}") + print(f" Prompt: {result.prompt[:100]}...") + print(f" Model Response: {result.result[:100]}...") + print(f" Expected: {result.truth}") + print(f" Score: {result.score}") + print(f" Duration: {result.duration}") +else: + print("❌ No results found") +``` + +### Complete Evaluation Workflow + +```python +from atlas import Atlas +import time + +def complete_evaluation_workflow(model: str, benchmark: str): + """Complete workflow: create evaluation and retrieve results""" + client = Atlas() + + # Step 1: Create evaluation + print(f"🔄 Creating evaluation: {model} + {benchmark}") + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + + if not evaluation: + print("❌ Failed to create evaluation") + return None + + print(f"✅ Evaluation created: {evaluation.id}") + print(f" Status: {evaluation.status}") + + # Step 2: Wait for completion (simplified polling) + # In production, use webhooks instead of polling + print("⏳ Waiting for evaluation to complete...") + + # Note: This is a simplified example. In practice, you'd: + # 1. Use webhooks for real-time updates + # 2. Store evaluation ID and check periodically + # 3. 
Handle various status states properly + + if evaluation.status == "completed": + print("🎉 Evaluation completed!") + + # Step 3: Retrieve results + results = client.results.get(evaluation_id=evaluation.id) + + if results: + print(f"📊 Retrieved {len(results)} detailed results") + + # Basic analysis + correct_answers = sum(1 for r in results if r.score > 0.5) + accuracy = correct_answers / len(results) + # Assumes r.duration is a datetime.timedelta (as in the timing analysis below); average it in seconds + avg_duration = sum(r.duration.total_seconds() for r in results) / len(results) + + print(f"📈 Quick Analysis:") + print(f" Accuracy: {accuracy:.1%} ({correct_answers}/{len(results)})") + print(f" Average Duration: {avg_duration:.2f}s") + + return results + else: + print("❌ No results available") + else: + print(f"⏰ Evaluation status: {evaluation.status}") + print(" Check back later for results") + + return None + +# Usage +results = complete_evaluation_workflow("gpt-4", "mmlu") +``` + +## Result Analysis Patterns + +### Performance Analysis + +```python +from atlas import Atlas +from collections import defaultdict, Counter +import statistics +from datetime import timedelta + +def analyze_evaluation_performance(evaluation_id: str): + """Comprehensive performance analysis of evaluation results""" + client = Atlas() + + results = client.results.get(evaluation_id=evaluation_id) + if not results: + print(f"❌ No results found for evaluation {evaluation_id}") + return None + + print(f"📊 Performance Analysis for {evaluation_id}") + print(f"{'='*60}") + + # Overall statistics + total_cases = len(results) + correct_answers = sum(1 for r in results if r.score > 0.5) + total_score = sum(r.score for r in results) + + accuracy = correct_answers / total_cases + avg_score = total_score / total_cases + + print(f"\n🎯 Overall Performance:") + print(f" Total test cases: {total_cases:,}") + print(f" Correct answers: {correct_answers:,}") + print(f" Accuracy: {accuracy:.1%}") + print(f" Average score: {avg_score:.3f}") + + # Timing analysis + durations = [r.duration for r in results] + avg_duration = 
sum(durations, timedelta()) / len(durations) + min_duration = min(durations) + max_duration = max(durations) + median_duration = statistics.median(durations) + + print(f"\n⏱️ Timing Analysis:") + print(f" Average duration: {avg_duration}") + print(f" Median duration: {median_duration}") + print(f" Min duration: {min_duration}") + print(f" Max duration: {max_duration}") + + # Score distribution + score_ranges = { + "Perfect (1.0)": 0, + "High (0.8-0.99)": 0, + "Medium (0.5-0.79)": 0, + "Low (0.1-0.49)": 0, + "Zero (0.0)": 0 + } + + for result in results: + score = result.score + if score == 1.0: + score_ranges["Perfect (1.0)"] += 1 + elif 0.8 <= score < 1.0: + score_ranges["High (0.8-0.99)"] += 1 + elif 0.5 <= score < 0.8: + score_ranges["Medium (0.5-0.79)"] += 1 + elif 0.1 <= score < 0.5: + score_ranges["Low (0.1-0.49)"] += 1 + else: + score_ranges["Zero (0.0)"] += 1 + + print(f"\n📈 Score Distribution:") + for range_name, count in score_ranges.items(): + percentage = count / total_cases * 100 + print(f" {range_name}: {count:,} ({percentage:.1f}%)") + + # Subset analysis + subset_stats = defaultdict(lambda: {"scores": [], "durations": []}) + + for result in results: + subset_stats[result.subset]["scores"].append(result.score) + subset_stats[result.subset]["durations"].append(result.duration) + + print(f"\n📋 Performance by Subset:") + print(f"{'Subset':<25} {'Cases':<8} {'Accuracy':<10} {'Avg Score':<10} {'Avg Duration':<12}") + print("-" * 75) + + for subset, data in sorted(subset_stats.items()): + case_count = len(data["scores"]) + subset_accuracy = sum(1 for s in data["scores"] if s > 0.5) / case_count + subset_avg_score = sum(data["scores"]) / case_count + subset_avg_duration = sum(data["durations"], timedelta()) / case_count + + print(f"{subset:<25} {case_count:<8} {subset_accuracy:<10.1%} {subset_avg_score:<10.3f} {str(subset_avg_duration):<12}") + + return { + "total_cases": total_cases, + "accuracy": accuracy, + "avg_score": avg_score, + "avg_duration": 
avg_duration, + "score_distribution": score_ranges, + "subset_stats": dict(subset_stats) + } + +# Usage +analysis = analyze_evaluation_performance("eval_12345") +``` + +### Comparative Analysis + +```python +from atlas import Atlas +from datetime import timedelta +from typing import List, Dict + +def compare_evaluation_results(evaluation_ids: List[str], labels: List[str] = None): + """Compare results across multiple evaluations""" + client = Atlas() + + # Fall back to generated labels when none are given or the counts do not match + if not labels or len(labels) != len(evaluation_ids): + labels = [f"Eval {i+1}" for i in range(len(evaluation_ids))] + + print(f"📊 Comparing {len(evaluation_ids)} evaluations") + print(f"{'='*80}") + + # Collect results for all evaluations + all_results = {} + for eval_id, label in zip(evaluation_ids, labels): + results = client.results.get(evaluation_id=eval_id) + if results: + all_results[label] = results + print(f"✅ Loaded {len(results)} results for {label}") + else: + print(f"❌ No results found for {label} ({eval_id})") + + if not all_results: + print("❌ No results to compare") + return + + # Only compare evaluations that actually returned results + labels = list(all_results) + + print(f"\n📈 Comparative Analysis:") + print(f"{'Metric':<20} " + " ".join(f"{label:<15}" for label in labels)) + print("-" * (20 + 15 * len(labels))) + + # Compare key metrics + metrics = {} + for label, results in all_results.items(): + total_cases = len(results) + correct_answers = sum(1 for r in results if r.score > 0.5) + accuracy = correct_answers / total_cases + avg_score = sum(r.score for r in results) / total_cases + # Durations are timedeltas, so sum() needs a timedelta start value + avg_duration = sum((r.duration for r in results), timedelta()) / total_cases + + metrics[label] = { + "total_cases": total_cases, + "accuracy": accuracy, + "avg_score": avg_score, + "avg_duration": avg_duration + } + + # Print comparison table + print(f"{'Total Cases':<20} " + " ".join(f"{metrics[label]['total_cases']:<15,}" for label in labels)) + print(f"{'Accuracy':<20} " + " ".join(f"{metrics[label]['accuracy']:<15.1%}" for label in labels)) + print(f"{'Average 
Score':<20} " + " ".join(f"{metrics[label]['avg_score']:<15.3f}" for label in labels)) + print(f"{'Average Duration':<20} " + " ".join(f"{str(metrics[label]['avg_duration']):<15}" for label in labels)) + + # Find best performing evaluation + best_accuracy = max(metrics.values(), key=lambda x: x["accuracy"]) + best_speed = min(metrics.values(), key=lambda x: x["avg_duration"]) + + best_accuracy_label = next(label for label, data in metrics.items() if data == best_accuracy) + best_speed_label = next(label for label, data in metrics.items() if data == best_speed) + + print(f"\n🏆 Winners:") + print(f" Best Accuracy: {best_accuracy_label} ({best_accuracy['accuracy']:.1%})") + print(f" Fastest: {best_speed_label} ({best_speed['avg_duration']})") + + # Subset-level comparison (if results have same subsets) + if len(all_results) >= 2: + first_subsets = set(r.subset for r in list(all_results.values())[0]) + common_subsets = first_subsets + + for results in list(all_results.values())[1:]: + result_subsets = set(r.subset for r in results) + common_subsets = common_subsets.intersection(result_subsets) + + if common_subsets: + print(f"\n📋 Subset Comparison ({len(common_subsets)} common subsets):") + print(f"{'Subset':<25} " + " ".join(f"{label + ' Acc':<12}" for label in labels)) + print("-" * (25 + 12 * len(labels))) + + for subset in sorted(common_subsets): + subset_accuracies = [] + for label, results in all_results.items(): + subset_results = [r for r in results if r.subset == subset] + if subset_results: + subset_accuracy = sum(1 for r in subset_results if r.score > 0.5) / len(subset_results) + subset_accuracies.append(f"{subset_accuracy:.1%}") + else: + subset_accuracies.append("N/A") + + print(f"{subset:<25} " + " ".join(f"{acc:<12}" for acc in subset_accuracies)) + + return metrics + +# Usage - compare GPT-4 vs Claude-3 on MMLU +evaluation_ids = ["eval_gpt4_mmlu", "eval_claude3_mmlu", "eval_llama2_mmlu"] +labels = ["GPT-4", "Claude-3", "Llama-2"] + +comparison = 
compare_evaluation_results(evaluation_ids, labels) +``` + +### Error Analysis + +```python +from atlas import Atlas + +def analyze_failures(evaluation_id: str, error_threshold: float = 0.3): + """Analyze cases where the model performed poorly""" + client = Atlas() + + results = client.results.get(evaluation_id=evaluation_id) + if not results: + print(f"❌ No results found for evaluation {evaluation_id}") + return None + + # Find poor-performing cases + poor_results = [r for r in results if r.score < error_threshold] + good_results = [r for r in results if r.score >= error_threshold] + + print(f"🔍 Error Analysis for {evaluation_id}") + print(f"{'='*60}") + print(f"Total cases: {len(results)}") + print(f"Poor performance (< {error_threshold}): {len(poor_results)} ({len(poor_results)/len(results):.1%})") + print(f"Good performance (>= {error_threshold}): {len(good_results)} ({len(good_results)/len(results):.1%})") + + if not poor_results: + print("🎉 No poor-performing cases found!") + return {"poor_results": [], "analysis": "No errors to analyze"} + + # Analyze failure patterns by subset + failure_by_subset = {} + for result in poor_results: + if result.subset not in failure_by_subset: + failure_by_subset[result.subset] = [] + failure_by_subset[result.subset].append(result) + + print(f"\n❌ Failure Distribution by Subset:") + for subset, failures in sorted(failure_by_subset.items(), key=lambda x: len(x[1]), reverse=True): + total_in_subset = len([r for r in results if r.subset == subset]) + failure_rate = len(failures) / total_in_subset + print(f" {subset}: {len(failures)}/{total_in_subset} failures ({failure_rate:.1%})") + + # Show worst-performing examples + worst_results = sorted(poor_results, key=lambda x: x.score)[:5] + + print(f"\n🔍 Worst Performing Examples:") + for i, result in enumerate(worst_results, 1): + print(f"\n Example {i} [Score: {result.score:.3f}]") + print(f" Subset: {result.subset}") + print(f" Prompt: {result.prompt[:200]}...") + print(f" Model 
Answer: {result.result[:100]}...") + print(f" Expected: {result.truth[:100]}...") + print(f" Duration: {result.duration}") + + if result.metrics: + print(f" Additional Metrics: {result.metrics}") + + # Common failure patterns + print(f"\n🔍 Common Patterns in Failures:") + + # Analyze prompt lengths (good_results may be empty, so guard the division) + poor_prompt_lengths = [len(r.prompt) for r in poor_results] + good_prompt_lengths = [len(r.prompt) for r in good_results] + + avg_poor_prompt_len = sum(poor_prompt_lengths) / len(poor_prompt_lengths) + avg_good_prompt_len = sum(good_prompt_lengths) / len(good_prompt_lengths) if good_prompt_lengths else 0.0 + + print(f" Average prompt length in failures: {avg_poor_prompt_len:.0f} chars") + print(f" Average prompt length in successes: {avg_good_prompt_len:.0f} chars") + + # Analyze response lengths + poor_response_lengths = [len(r.result) for r in poor_results] + good_response_lengths = [len(r.result) for r in good_results] + + avg_poor_response_len = sum(poor_response_lengths) / len(poor_response_lengths) + avg_good_response_len = sum(good_response_lengths) / len(good_response_lengths) if good_response_lengths else 0.0 + + print(f" Average response length in failures: {avg_poor_response_len:.0f} chars") + print(f" Average response length in successes: {avg_good_response_len:.0f} chars") + + # Analyze durations (timedelta values: seed sum() with the first item, since its default start of 0 cannot be added to a timedelta) + poor_durations = [r.duration for r in poor_results] + good_durations = [r.duration for r in good_results] + + avg_poor_duration = sum(poor_durations[1:], poor_durations[0]) / len(poor_durations) + avg_good_duration = sum(good_durations[1:], good_durations[0]) / len(good_durations) if good_durations else None + + print(f" Average duration for failures: {avg_poor_duration}") + print(f" Average duration for successes: {avg_good_duration}") + + return { + "poor_results": poor_results, + "failure_by_subset": failure_by_subset, + "worst_examples": worst_results, + "patterns": { + "avg_poor_prompt_len": avg_poor_prompt_len, + "avg_good_prompt_len": avg_good_prompt_len, + "avg_poor_response_len": avg_poor_response_len, + "avg_good_response_len": avg_good_response_len, + "avg_poor_duration": avg_poor_duration, + "avg_good_duration": avg_good_duration + } + } + +# Usage +error_analysis 
= analyze_failures("eval_12345", error_threshold=0.5) +``` + +## Advanced Result Processing + +### Batch Processing Large Result Sets + +```python +from atlas import Atlas +from typing import Iterator, List +import time + +def process_results_in_batches(evaluation_id: str, batch_size: int = 100, processor_func=None): + """Process large result sets in manageable batches""" + client = Atlas() + + results = client.results.get(evaluation_id=evaluation_id) + if not results: + print(f"❌ No results found for evaluation {evaluation_id}") + return None + + total_results = len(results) + print(f"📊 Processing {total_results:,} results in batches of {batch_size}") + + if not processor_func: + # Default processor: just count scores + def processor_func(batch): + return { + "count": len(batch), + "avg_score": sum(r.score for r in batch) / len(batch), + "correct": sum(1 for r in batch if r.score > 0.5) + } + + batch_results = [] + + for i in range(0, total_results, batch_size): + batch = results[i:i + batch_size] + batch_num = i // batch_size + 1 + total_batches = (total_results + batch_size - 1) // batch_size + + print(f"🔄 Processing batch {batch_num}/{total_batches} ({len(batch)} items)") + + start_time = time.time() + batch_result = processor_func(batch) + end_time = time.time() + + batch_result.update({ + "batch_num": batch_num, + "processing_time": end_time - start_time, + "items_processed": len(batch) + }) + + batch_results.append(batch_result) + + print(f" ✅ Completed in {batch_result['processing_time']:.2f}s") + + # Small delay to prevent overwhelming the system + if batch_num < total_batches: + time.sleep(0.1) + + # Aggregate results + total_processing_time = sum(br["processing_time"] for br in batch_results) + total_correct = sum(br.get("correct", 0) for br in batch_results) + overall_accuracy = total_correct / total_results + + print(f"\n📈 Batch Processing Summary:") + print(f" Total batches: {len(batch_results)}") + print(f" Total processing time: 
{total_processing_time:.2f}s") + print(f" Average time per batch: {total_processing_time/len(batch_results):.2f}s") + print(f" Overall accuracy: {overall_accuracy:.1%}") + + return { + "batch_results": batch_results, + "summary": { + "total_items": total_results, + "total_batches": len(batch_results), + "total_processing_time": total_processing_time, + "overall_accuracy": overall_accuracy + } + } + +# Custom processor for subset analysis +def subset_analyzer(batch): + """Custom processor that analyzes subsets in a batch""" + subset_stats = {} + + for result in batch: + if result.subset not in subset_stats: + subset_stats[result.subset] = {"count": 0, "total_score": 0, "correct": 0} + + subset_stats[result.subset]["count"] += 1 + subset_stats[result.subset]["total_score"] += result.score + if result.score > 0.5: + subset_stats[result.subset]["correct"] += 1 + + return { + "subset_stats": subset_stats, + "unique_subsets": len(subset_stats), + "correct": sum(stats["correct"] for stats in subset_stats.values()) # Batch total, read by the summary to compute overall accuracy + } + +# Usage +batch_results = process_results_in_batches( + evaluation_id="eval_12345", + batch_size=50, + processor_func=subset_analyzer +) +``` + +### Result Caching and Persistence + +```python +import json +import pickle +from pathlib import Path +from datetime import datetime +from atlas import Atlas +import atlas + +class ResultsCache: + """Cache evaluation results to avoid repeated API calls""" + + def __init__(self, cache_dir: str = "results_cache"): + self.cache_dir = Path(cache_dir) + self.cache_dir.mkdir(parents=True, exist_ok=True) # Also create missing parent directories + + def _get_cache_path(self, evaluation_id: str, format: str = "json") -> Path: + """Get cache file path for an evaluation""" + return self.cache_dir / f"{evaluation_id}_results.{format}" + + def _get_metadata_path(self, evaluation_id: str) -> Path: + """Get metadata file path for an evaluation""" + return self.cache_dir / f"{evaluation_id}_metadata.json" + + def is_cached(self, evaluation_id: str) -> bool: + """Check if results are already cached""" + return self._get_cache_path(evaluation_id).exists() 
+ + def save_results(self, evaluation_id: str, results: list, metadata: dict = None): + """Save results to cache""" + try: + # Save as JSON (human-readable) + json_path = self._get_cache_path(evaluation_id, "json") + with open(json_path, 'w') as f: + # Convert results to serializable format + serializable_results = [] + for result in results: + result_dict = { + "subset": result.subset, + "prompt": result.prompt, + "result": result.result, + "truth": result.truth, + "score": result.score, + "duration": str(result.duration), # Convert timedelta to string + "metrics": result.metrics + } + serializable_results.append(result_dict) + + json.dump(serializable_results, f, indent=2, ensure_ascii=False) + + # Save as pickle (preserves exact object types) + pickle_path = self._get_cache_path(evaluation_id, "pkl") + with open(pickle_path, 'wb') as f: + pickle.dump(results, f) + + # Save metadata + if not metadata: + metadata = {} + + metadata.update({ + "evaluation_id": evaluation_id, + "cached_at": datetime.now().isoformat(), + "result_count": len(results), + "cache_format": "both" + }) + + metadata_path = self._get_metadata_path(evaluation_id) + with open(metadata_path, 'w') as f: + json.dump(metadata, f, indent=2) + + print(f"💾 Cached {len(results)} results for {evaluation_id}") + + except Exception as e: + print(f"❌ Error caching results: {e}") + + def load_results(self, evaluation_id: str, format: str = "pickle"): + """Load results from cache""" + try: + if format == "pickle": + cache_path = self._get_cache_path(evaluation_id, "pkl") + with open(cache_path, 'rb') as f: + results = pickle.load(f) + else: + cache_path = self._get_cache_path(evaluation_id, "json") + with open(cache_path, 'r') as f: + results = json.load(f) + + print(f"💾 Loaded {len(results)} results from cache for {evaluation_id}") + return results + + except Exception as e: + print(f"❌ Error loading cached results: {e}") + return None + + def get_metadata(self, evaluation_id: str): + """Get cached 
metadata""" + try: + metadata_path = self._get_metadata_path(evaluation_id) + with open(metadata_path, 'r') as f: + return json.load(f) + except Exception as e: + print(f"❌ Error loading metadata: {e}") + return None + +def get_results_with_cache(evaluation_id: str, cache: ResultsCache = None, force_refresh: bool = False): + """Get results with automatic caching""" + if not cache: + cache = ResultsCache() + + # Check cache first (unless force refresh) + if not force_refresh and cache.is_cached(evaluation_id): + print(f"📂 Loading results from cache...") + cached_results = cache.load_results(evaluation_id) + + if cached_results: + metadata = cache.get_metadata(evaluation_id) + if metadata: + cached_at = metadata.get("cached_at", "unknown") + print(f"📅 Cached at: {cached_at}") + return cached_results + + # Fetch from API + print(f"🌐 Fetching fresh results from API...") + client = Atlas() + + try: + results = client.results.get(evaluation_id=evaluation_id) + + if results: + # Cache the results + cache.save_results(evaluation_id, results) + return results + else: + print(f"❌ No results found for evaluation {evaluation_id}") + return None + + except atlas.APIError as e: + print(f"❌ Error fetching results: {e}") + + # Try to return cached results as fallback + if cache.is_cached(evaluation_id): + print(f"🔄 Falling back to cached results...") + return cache.load_results(evaluation_id) + + return None + +# Usage examples +cache = ResultsCache("./my_results_cache") + +# First call - fetches from API and caches +results1 = get_results_with_cache("eval_12345", cache) + +# Second call - loads from cache +results2 = get_results_with_cache("eval_12345", cache) + +# Force refresh from API +results3 = get_results_with_cache("eval_12345", cache, force_refresh=True) + +# Batch cache multiple evaluations +evaluation_ids = ["eval_001", "eval_002", "eval_003"] + +for eval_id in evaluation_ids: + results = get_results_with_cache(eval_id, cache) + if results: + print(f"✅ {eval_id}: 
{len(results)} results cached") + +print(f"\n📁 Cache contents:") +for cache_file in cache.cache_dir.glob("*.json"): + if cache_file.name.endswith("_metadata.json"): + continue + evaluation_id = cache_file.stem.replace("_results", "") + metadata = cache.get_metadata(evaluation_id) + if metadata: + count = metadata.get("result_count", "unknown") + cached_at = metadata.get("cached_at", "unknown") + print(f" {evaluation_id}: {count} results (cached: {cached_at})") +``` + +### Export and Reporting + +```python +import csv +from pathlib import Path +from datetime import datetime +from atlas import Atlas + +def export_results_to_csv(evaluation_id: str, output_path: str = None): + """Export evaluation results to CSV format""" + client = Atlas() + + results = client.results.get(evaluation_id=evaluation_id) + if not results: + print(f"❌ No results found for evaluation {evaluation_id}") + return None + + if not output_path: + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + output_path = f"results_{evaluation_id}_{timestamp}.csv" + + try: + with open(output_path, 'w', newline='', encoding='utf-8') as csvfile: + fieldnames = [ + 'subset', 'prompt', 'model_response', 'expected_answer', + 'score', 'duration_ms', 'prompt_length', 'response_length' + ] + + # Add metric columns if they exist + if results and results[0].metrics: + metric_keys = list(results[0].metrics.keys()) + fieldnames.extend([f"metric_{key}" for key in metric_keys]) + + writer = csv.DictWriter(csvfile, fieldnames=fieldnames) + writer.writeheader() + + for result in results: + row = { + 'subset': result.subset, + 'prompt': result.prompt, + 'model_response': result.result, + 'expected_answer': result.truth, + 'score': result.score, + 'duration_ms': int(result.duration.total_seconds() * 1000), + 'prompt_length': len(result.prompt), + 'response_length': len(result.result) + } + + # Add metrics if present + if result.metrics: + for key, value in result.metrics.items(): + row[f"metric_{key}"] = value + + 
writer.writerow(row) + + print(f"📄 Exported {len(results)} results to {output_path}") + return output_path + + except Exception as e: + print(f"❌ Error exporting to CSV: {e}") + return None + +def generate_summary_report(evaluation_ids: list, output_path: str = None): + """Generate a summary report comparing multiple evaluations""" + client = Atlas() + + if not output_path: + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + output_path = f"evaluation_summary_{timestamp}.txt" + + # UTF-8 is set explicitly because the report contains non-ASCII characters + with open(output_path, 'w', encoding='utf-8') as f: + f.write("ATLAS EVALUATION SUMMARY REPORT\n") + f.write("=" * 50 + "\n") + f.write(f"Generated: {datetime.now().isoformat()}\n") + f.write(f"Evaluations analyzed: {len(evaluation_ids)}\n\n") + + for i, eval_id in enumerate(evaluation_ids, 1): + f.write(f"EVALUATION {i}: {eval_id}\n") + f.write("-" * 30 + "\n") + + results = client.results.get(evaluation_id=eval_id) + + if not results: + f.write("❌ No results found\n\n") + continue + + # Calculate statistics + total_cases = len(results) + correct_answers = sum(1 for r in results if r.score > 0.5) + accuracy = correct_answers / total_cases + avg_score = sum(r.score for r in results) / total_cases + # Durations are timedeltas: seed sum() with the first item, since its default start of 0 cannot be added to a timedelta + durations = [r.duration for r in results] + avg_duration = sum(durations[1:], durations[0]) / total_cases + + # Write statistics + f.write(f"Total test cases: {total_cases:,}\n") + f.write(f"Correct answers: {correct_answers:,}\n") + f.write(f"Accuracy: {accuracy:.1%}\n") + f.write(f"Average score: {avg_score:.3f}\n") + f.write(f"Average duration: {avg_duration}\n") + + # Subset breakdown + subset_stats = {} + for result in results: + if result.subset not in subset_stats: + subset_stats[result.subset] = [] + subset_stats[result.subset].append(result.score) + + f.write(f"\nSubset Performance:\n") + for subset, scores in sorted(subset_stats.items()): + subset_accuracy = sum(1 for s in scores if s > 0.5) / len(scores) + subset_avg = sum(scores) / len(scores) + f.write(f" {subset}: {subset_accuracy:.1%} accuracy, {subset_avg:.3f} avg score ({len(scores)} 
cases)\n") + + f.write("\n") + + f.write("END OF REPORT\n") + + print(f"📊 Summary report generated: {output_path}") + return output_path + +# Usage examples + +# Export single evaluation to CSV +csv_path = export_results_to_csv("eval_12345") + +# Generate summary report for multiple evaluations +evaluation_list = ["eval_gpt4_mmlu", "eval_claude3_mmlu", "eval_llama2_mmlu"] +report_path = generate_summary_report(evaluation_list) + +print(f"Files generated:") +print(f" CSV Export: {csv_path}") +print(f" Summary Report: {report_path}") +``` diff --git a/docs/examples/timeouts.md b/docs/examples/timeouts.md new file mode 100644 index 0000000..c635dc0 --- /dev/null +++ b/docs/examples/timeouts.md @@ -0,0 +1,715 @@ +# Working with Timeouts + +This guide provides practical examples for configuring and handling timeouts effectively with the Atlas Python SDK. + +## Understanding Timeouts + +Timeouts in the Atlas SDK control how long to wait for API responses. Different operations may require different timeout configurations based on their expected duration and criticality. 
+ +## Basic Timeout Configuration + +### Simple Timeout + +```python +from atlas import Atlas + +# Set a 2-minute timeout for all requests +client = Atlas(timeout=120.0) + +# Create evaluation with 2-minute timeout +evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu" +) +``` + +### Default Timeout Behavior + +```python +from atlas import Atlas + +# Uses default timeout (10 minutes) +client = Atlas() + +print(f"Default timeout: {client.timeout}") # Should show 10 minutes in seconds +``` + +## Advanced Timeout Configuration + +### Granular Timeout Control + +```python +import httpx +from atlas import Atlas + +# Configure different timeouts for different operations +client = Atlas( + timeout=httpx.Timeout( + connect=10.0, # 10 seconds to establish connection + read=300.0, # 5 minutes to read response + write=30.0, # 30 seconds to send request + pool=60.0 # 1 minute for connection pool operations + ) +) + +evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu" +) +``` + +### Per-Request Timeout Override + +```python +from atlas import Atlas + +# Client with default 1-minute timeout +client = Atlas(timeout=60.0) + +# Override timeout for specific operations +try: + # Quick operation with short timeout + quick_eval = client.with_options(timeout=30.0).evaluations.create( + model="gpt-3.5-turbo", + benchmark="arc-easy" + ) + + # Long operation with extended timeout + complex_eval = client.with_options(timeout=600.0).evaluations.create( + model="gpt-4", + benchmark="math" # Complex benchmark + ) + + # Results retrieval with medium timeout + results = client.with_options(timeout=120.0).results.get( + evaluation_id=quick_eval.id + ) + +except Exception as e: + print(f"Operation failed: {e}") +``` + +## Timeout Strategies by Use Case + +### Development and Testing + +```python +from atlas import Atlas +import atlas + +def development_client(): + """Client optimized for development with shorter timeouts""" + return Atlas( + 
timeout=30.0 # 30 seconds - fail fast during development + ) + +def test_api_connectivity(): + """Quick connectivity test with very short timeout""" + client = development_client() + + try: + # Use simple, fast operation to test connectivity + evaluation = client.with_options(timeout=10.0).evaluations.create( + model="gpt-3.5-turbo", # Usually faster + benchmark="arc-easy" # Smaller benchmark + ) + + if evaluation: + print("✅ API connectivity confirmed") + return True + else: + print("❌ API returned no evaluation") + return False + + except atlas.APITimeoutError: + print("❌ API timeout - connectivity issues or server overload") + return False + except atlas.APIConnectionError: + print("❌ Connection failed - check network") + return False + except atlas.APIError as e: + print(f"❌ API error: {e}") + return False + +# Usage +if test_api_connectivity(): + print("Proceeding with full evaluation...") +else: + print("Fix connectivity issues before continuing") +``` + +### Production Workloads + +```python +import httpx +from atlas import Atlas +import atlas + +def production_client(): + """Client optimized for production workloads""" + return Atlas( + timeout=httpx.Timeout( + connect=30.0, # 30s to connect (allows for network delays) + read=1800.0, # 30 minutes for complex evaluations + write=60.0, # 1 minute to send large requests + pool=120.0 # 2 minutes for connection pool + ) + ) + +def robust_evaluation_creation(model: str, benchmark: str, max_retries: int = 3): + """Production-ready evaluation creation with timeout handling""" + client = production_client() + + for attempt in range(max_retries): + try: + print(f"🔄 Attempt {attempt + 1}/{max_retries}: Creating evaluation...") + + evaluation = client.evaluations.create( + model=model, + benchmark=benchmark + ) + + if evaluation: + print(f"✅ Success: {evaluation.id}") + return evaluation + else: + print("❌ No evaluation returned") + + except atlas.APITimeoutError: + print(f"⏰ Timeout on attempt {attempt + 1}") + if 
attempt < max_retries - 1: + # Increase timeout for retry + retry_timeout = 1800.0 + (attempt * 600.0) # Add 10 minutes per retry + print(f"🔄 Retrying with extended timeout: {retry_timeout/60:.0f} minutes") + + try: + evaluation = client.with_options(timeout=retry_timeout).evaluations.create( + model=model, + benchmark=benchmark + ) + if evaluation: + print(f"✅ Success on retry: {evaluation.id}") + return evaluation + except atlas.APITimeoutError: + print(f"⏰ Extended timeout also failed") + continue + else: + print("❌ All timeout retry attempts failed") + + except atlas.APIError as e: + print(f"❌ API error: {e}") + break # Don't retry API errors + + return None + +# Usage +evaluation = robust_evaluation_creation("gpt-4", "mmlu") +``` + +### Batch Operations + +```python +from atlas import Atlas +import atlas +import time + +def batch_evaluations_with_adaptive_timeout(model_benchmark_pairs: list): + """Create multiple evaluations with adaptive timeout strategy""" + client = Atlas(timeout=120.0) # Start with 2-minute timeout + + results = [] + consecutive_timeouts = 0 + current_timeout = 120.0 + + for i, (model, benchmark) in enumerate(model_benchmark_pairs, 1): + print(f"\n[{i}/{len(model_benchmark_pairs)}] {model} + {benchmark}") + print(f"Current timeout: {current_timeout/60:.1f} minutes") + + try: + evaluation = client.with_options(timeout=current_timeout).evaluations.create( + model=model, + benchmark=benchmark + ) + + if evaluation: + results.append({ + "model": model, + "benchmark": benchmark, + "evaluation_id": evaluation.id, + "success": True, + "timeout_used": current_timeout + }) + print(f"✅ Success: {evaluation.id}") + + # Reset timeout on success + consecutive_timeouts = 0 + current_timeout = max(120.0, current_timeout * 0.9) # Slightly reduce timeout + else: + results.append({ + "model": model, + "benchmark": benchmark, + "success": False, + "error": "no_evaluation_returned" + }) + + except atlas.APITimeoutError: + print(f"⏰ Timeout after 
{current_timeout/60:.1f} minutes") + consecutive_timeouts += 1 + + results.append({ + "model": model, + "benchmark": benchmark, + "success": False, + "error": "timeout", + "timeout_used": current_timeout + }) + + # Increase timeout after consecutive timeouts + if consecutive_timeouts >= 2: + current_timeout = min(3600.0, current_timeout * 1.5) # Max 1 hour + print(f"🔄 Increased timeout to {current_timeout/60:.1f} minutes") + consecutive_timeouts = 0 # Reset counter after adjustment + + except atlas.APIError as e: + print(f"❌ API error: {e}") + results.append({ + "model": model, + "benchmark": benchmark, + "success": False, + "error": str(e) + }) + + # Brief pause between requests + time.sleep(1.0) + + # Summary + successful = [r for r in results if r["success"]] + timeouts = [r for r in results if r.get("error") == "timeout"] + + print(f"\n📊 Batch Summary:") + print(f" Total requests: {len(results)}") + print(f" Successful: {len(successful)}") + print(f" Timeouts: {len(timeouts)}") + print(f" Other errors: {len(results) - len(successful) - len(timeouts)}") + + return results + +# Usage +pairs = [ + ("gpt-4", "mmlu"), + ("claude-3-opus", "hellaswag"), + ("llama-2-70b", "arc-challenge"), + ("gpt-3.5-turbo", "gsm8k"), +] + +batch_results = batch_evaluations_with_adaptive_timeout(pairs) +``` + +## Error Handling and Recovery + +### Timeout-Specific Error Handling + +```python +import atlas +from atlas import Atlas +import time + +def handle_timeout_gracefully(operation_func, *args, **kwargs): + """Generic timeout handler for any Atlas operation""" + max_retries = 3 + base_timeout = 60.0 + + for attempt in range(max_retries): + # Calculate timeout for this attempt + attempt_timeout = base_timeout * (2 ** attempt) # Exponential increase + + print(f"🔄 Attempt {attempt + 1}/{max_retries} (timeout: {attempt_timeout/60:.1f}min)") + + try: + result = operation_func(timeout=attempt_timeout, *args, **kwargs) + print(f"✅ Operation succeeded on attempt {attempt + 1}") + return 
result + + except atlas.APITimeoutError: + print(f"⏰ Timeout on attempt {attempt + 1}") + + if attempt == max_retries - 1: + print("❌ All retry attempts exhausted") + raise + else: + wait_time = 5 * (attempt + 1) # Progressive wait + print(f"⏳ Waiting {wait_time}s before retry...") + time.sleep(wait_time) + + except atlas.APIError as e: + print(f"❌ Non-timeout error: {e}") + raise # Don't retry non-timeout errors + +def create_evaluation_with_timeout_handling(model: str, benchmark: str): + """Wrapper function for evaluation creation""" + def operation_func(timeout, *args, **kwargs): + client = Atlas(timeout=timeout) + return client.evaluations.create(model=model, benchmark=benchmark) + + return handle_timeout_gracefully(operation_func) + +def get_results_with_timeout_handling(evaluation_id: str): + """Wrapper function for results retrieval""" + def operation_func(timeout, *args, **kwargs): + client = Atlas(timeout=timeout) + return client.results.get(evaluation_id=evaluation_id) + + return handle_timeout_gracefully(operation_func) + +# Usage +try: + evaluation = create_evaluation_with_timeout_handling("gpt-4", "mmlu") + if evaluation: + results = get_results_with_timeout_handling(evaluation.id) + print(f"📊 Retrieved {len(results) if results else 0} results") + +except atlas.APITimeoutError: + print("❌ Operation failed due to persistent timeouts") +except atlas.APIError as e: + print(f"❌ Operation failed: {e}") +``` + +### Circuit Breaker Pattern + +```python +import time +from enum import Enum +import atlas +from atlas import Atlas + +class CircuitState(Enum): + CLOSED = "closed" # Normal operation + OPEN = "open" # Failing, don't try + HALF_OPEN = "half_open" # Testing if recovered + +class TimeoutCircuitBreaker: + """Circuit breaker specifically for timeout management""" + + def __init__(self, + failure_threshold: int = 5, + timeout_threshold: float = 300.0, # 5 minutes + recovery_timeout: int = 60): # 1 minute + self.failure_threshold = failure_threshold + 
self.timeout_threshold = timeout_threshold + self.recovery_timeout = recovery_timeout + + self.failure_count = 0 + self.last_failure_time = None + self.state = CircuitState.CLOSED + self.current_timeout = 120.0 # Start with 2 minutes + + def call(self, func, *args, **kwargs): + """Execute function with circuit breaker protection""" + if self.state == CircuitState.OPEN: + if (time.time() - self.last_failure_time) < self.recovery_timeout: + raise atlas.APIConnectionError( + message="Circuit breaker is OPEN - too many recent timeouts" + ) + else: + self.state = CircuitState.HALF_OPEN + print("🔄 Circuit breaker transitioning to HALF_OPEN") + + try: + # Use adaptive timeout + if 'timeout' not in kwargs: + kwargs['timeout'] = self.current_timeout + + print(f"🔄 Calling function with {self.current_timeout/60:.1f}min timeout") + result = func(*args, **kwargs) + + # Success - reset circuit breaker + self.on_success() + return result + + except atlas.APITimeoutError as e: + self.on_timeout_failure() + raise + except atlas.APIError as e: + # Non-timeout API errors don't affect circuit state + raise + + def on_success(self): + """Handle successful operation""" + print("✅ Circuit breaker: Operation succeeded") + self.failure_count = 0 + self.state = CircuitState.CLOSED + + # Gradually reduce timeout on success + self.current_timeout = max(60.0, self.current_timeout * 0.95) + + def on_timeout_failure(self): + """Handle timeout failure""" + self.failure_count += 1 + self.last_failure_time = time.time() + + print(f"⏰ Circuit breaker: Timeout failure {self.failure_count}/{self.failure_threshold}") + + # Increase timeout for next attempt + self.current_timeout = min(self.timeout_threshold, self.current_timeout * 1.5) + + if self.failure_count >= self.failure_threshold: + self.state = CircuitState.OPEN + print("🔴 Circuit breaker: OPEN - too many consecutive timeouts") + +# Usage with circuit breaker +def protected_atlas_operations(): + """Example of using circuit breaker with Atlas 
operations""" + breaker = TimeoutCircuitBreaker( + failure_threshold=3, + timeout_threshold=600.0, # Max 10 minutes + recovery_timeout=120 # 2 minute recovery time + ) + + def create_evaluation_protected(model: str, benchmark: str): + def operation(timeout): + client = Atlas(timeout=timeout) + return client.evaluations.create(model=model, benchmark=benchmark) + return breaker.call(operation) + + def get_results_protected(evaluation_id: str): + def operation(timeout): + client = Atlas(timeout=timeout) + return client.results.get(evaluation_id=evaluation_id) + return breaker.call(operation) + + # Test with multiple operations + operations = [ + ("gpt-4", "mmlu"), + ("claude-3-opus", "hellaswag"), + ("llama-2-70b", "gsm8k"), + ] + + successful_evaluations = [] + + for model, benchmark in operations: + try: + print(f"\n🔄 Creating evaluation: {model} + {benchmark}") + evaluation = create_evaluation_protected(model, benchmark) + + if evaluation: + successful_evaluations.append(evaluation) + print(f"✅ Success: {evaluation.id}") + + # Try to get results + print(f"🔄 Getting results for {evaluation.id}") + results = get_results_protected(evaluation.id) + + if results: + print(f"📊 Retrieved {len(results)} results") + + except atlas.APIConnectionError as e: + if "Circuit breaker is OPEN" in str(e): + print("🔴 Circuit breaker prevented operation") + print(f"⏳ Waiting {breaker.recovery_timeout}s for recovery...") + time.sleep(breaker.recovery_timeout) + else: + print(f"❌ Connection error: {e}") + + except atlas.APITimeoutError: + print("⏰ Timeout occurred - circuit breaker updated") + + except atlas.APIError as e: + print(f"❌ API error: {e}") + + print(f"\n📈 Final Results:") + print(f" Circuit state: {breaker.state.value}") + print(f" Current timeout: {breaker.current_timeout/60:.1f} minutes") + print(f" Successful evaluations: {len(successful_evaluations)}") + + return successful_evaluations + +# Run protected operations +results = protected_atlas_operations() +``` + +## 
Monitoring and Metrics + +### Timeout Performance Tracking + +```python +import time +from dataclasses import dataclass +from typing import List, Optional +from atlas import Atlas +import atlas + +@dataclass +class TimeoutMetrics: + operation: str + model: str + benchmark: str + timeout_set: float + actual_duration: float + success: bool + error_type: Optional[str] = None + timestamp: float = None + + def __post_init__(self): + if self.timestamp is None: + self.timestamp = time.time() + +class TimeoutMonitor: + """Monitor and analyze timeout patterns""" + + def __init__(self): + self.metrics: List[TimeoutMetrics] = [] + + def record_operation(self, operation: str, model: str, benchmark: str, + timeout_set: float, start_time: float, success: bool, + error_type: str = None): + """Record an operation's timeout metrics""" + actual_duration = time.time() - start_time + + metric = TimeoutMetrics( + operation=operation, + model=model, + benchmark=benchmark, + timeout_set=timeout_set, + actual_duration=actual_duration, + success=success, + error_type=error_type + ) + + self.metrics.append(metric) + + print(f"📊 Recorded: {operation} took {actual_duration:.1f}s (timeout: {timeout_set:.1f}s)") + + def get_timeout_efficiency(self) -> dict: + """Analyze timeout efficiency""" + if not self.metrics: + return {} + + successful_ops = [m for m in self.metrics if m.success] + timeout_ops = [m for m in self.metrics if m.error_type == "timeout"] + + analysis = { + "total_operations": len(self.metrics), + "successful_operations": len(successful_ops), + "timeout_operations": len(timeout_ops), + "success_rate": len(successful_ops) / len(self.metrics), + "timeout_rate": len(timeout_ops) / len(self.metrics), + } + + if successful_ops: + avg_success_duration = sum(m.actual_duration for m in successful_ops) / len(successful_ops) + avg_success_timeout = sum(m.timeout_set for m in successful_ops) / len(successful_ops) + + analysis.update({ + "avg_success_duration": avg_success_duration, + 
"avg_success_timeout_set": avg_success_timeout, + "timeout_efficiency": avg_success_duration / avg_success_timeout if avg_success_timeout > 0 else 0 + }) + + return analysis + + def suggest_optimal_timeouts(self) -> dict: + """Suggest optimal timeouts based on historical data""" + if not self.metrics: + return {"message": "No data available"} + + # Group by operation type + by_operation = {} + for metric in self.metrics: + if metric.success: # Only use successful operations + key = (metric.operation, metric.model, metric.benchmark) + if key not in by_operation: + by_operation[key] = [] + by_operation[key].append(metric.actual_duration) + + suggestions = {} + for (operation, model, benchmark), durations in by_operation.items(): + # Suggest timeout as 95th percentile + 50% buffer + durations.sort() + p95_index = int(len(durations) * 0.95) + p95_duration = durations[p95_index] if p95_index < len(durations) else durations[-1] + suggested_timeout = p95_duration * 1.5 # 50% buffer + + suggestions[f"{operation}_{model}_{benchmark}"] = { + "suggested_timeout": suggested_timeout, + "based_on_operations": len(durations), + "p95_actual_duration": p95_duration + } + + return suggestions + +def monitored_atlas_operations(): + """Example of Atlas operations with timeout monitoring""" + monitor = TimeoutMonitor() + client = Atlas() + + test_operations = [ + ("gpt-3.5-turbo", "arc-easy", 60.0), # Should be fast + ("gpt-4", "mmlu", 180.0), # Medium complexity + ("claude-3-opus", "math", 600.0), # Complex, longer timeout + ] + + for model, benchmark, timeout in test_operations: + print(f"\n🔄 Testing {model} + {benchmark} (timeout: {timeout/60:.1f}min)") + + # Evaluation creation + start_time = time.time() + try: + evaluation = client.with_options(timeout=timeout).evaluations.create( + model=model, + benchmark=benchmark + ) + + if evaluation: + monitor.record_operation("create_evaluation", model, benchmark, + timeout, start_time, True) + + # Results retrieval + start_time = 
time.time() + try: + results = client.with_options(timeout=timeout).results.get( + evaluation_id=evaluation.id + ) + + success = results is not None + monitor.record_operation("get_results", model, benchmark, + timeout, start_time, success, + None if success else "no_results") + + except atlas.APITimeoutError: + monitor.record_operation("get_results", model, benchmark, + timeout, start_time, False, "timeout") + except atlas.APIError as e: + monitor.record_operation("get_results", model, benchmark, + timeout, start_time, False, str(e)) + else: + monitor.record_operation("create_evaluation", model, benchmark, + timeout, start_time, False, "no_evaluation") + + except atlas.APITimeoutError: + monitor.record_operation("create_evaluation", model, benchmark, + timeout, start_time, False, "timeout") + except atlas.APIError as e: + monitor.record_operation("create_evaluation", model, benchmark, + timeout, start_time, False, str(e)) + + # Analyze results + print(f"\n📊 Timeout Analysis:") + efficiency = monitor.get_timeout_efficiency() + + for key, value in efficiency.items(): + if isinstance(value, float): + print(f" {key}: {value:.2f}") + else: + print(f" {key}: {value}") + + print(f"\n💡 Timeout Suggestions:") + suggestions = monitor.suggest_optimal_timeouts() + for operation, suggestion in suggestions.items(): + print(f" {operation}:") + print(f" Suggested timeout: {suggestion['suggested_timeout']:.0f}s") + print(f" Based on {suggestion['based_on_operations']} successful operations") + print(f" 95th percentile duration: {suggestion['p95_actual_duration']:.1f}s") + +# Run monitoring example +monitored_atlas_operations() +``` diff --git a/docs/getting-started/authentication.md b/docs/getting-started/authentication.md new file mode 100644 index 0000000..5658bb4 --- /dev/null +++ b/docs/getting-started/authentication.md @@ -0,0 +1,169 @@ +# Authentication & Configuration + +The Atlas Python SDK uses API key authentication to securely access the LayerLens Atlas API. 
This guide covers how to set up authentication and configure your client. + +## Required Credentials + +You need three pieces of information to use the Atlas SDK: + +1. **API Key** - Your secret API key for authentication +2. **Organization ID** - Your organization identifier +3. **Project ID** - The project you want to work with + +## Getting Your Credentials + +1. **Log in to LayerLens Atlas**: Visit the LayerLens Atlas dashboard +2. **Navigate to Settings**: Go to your account or organization settings +3. **Generate API Key**: Create a new API key if you don't have one +4. **Copy IDs**: Note your Organization ID and Project ID from the dashboard + +## Environment Variables (Recommended) + +The most secure way to configure authentication is using environment variables: + +### Setting Environment Variables + +**Linux/macOS:** +```bash +export LAYERLENS_ATLAS_API_KEY="your_api_key_here" +export LAYERLENS_ATLAS_ORG_ID="your_org_id_here" +export LAYERLENS_ATLAS_PROJECT_ID="your_project_id_here" +``` + +**Windows (Command Prompt):** +```cmd +set LAYERLENS_ATLAS_API_KEY=your_api_key_here +set LAYERLENS_ATLAS_ORG_ID=your_org_id_here +set LAYERLENS_ATLAS_PROJECT_ID=your_project_id_here +``` + +**Windows (PowerShell):** +```powershell +$env:LAYERLENS_ATLAS_API_KEY="your_api_key_here" +$env:LAYERLENS_ATLAS_ORG_ID="your_org_id_here" +$env:LAYERLENS_ATLAS_PROJECT_ID="your_project_id_here" +``` + +### Using a `.env` File + +Create a `.env` file in your project root: + +```bash +LAYERLENS_ATLAS_API_KEY=your_api_key_here +LAYERLENS_ATLAS_ORG_ID=your_org_id_here +LAYERLENS_ATLAS_PROJECT_ID=your_project_id_here +``` + +Then load it in your Python code: + +```python +from dotenv import load_dotenv +import os +from atlas import Atlas + +# Load environment variables from .env file +load_dotenv() + +# Client will automatically use environment variables +client = Atlas() +``` + +> **⚠️ Security Note**: Never commit `.env` files to version control. Add `.env` to your `.gitignore` file. 
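Note that `load_dotenv` comes from the third-party `python-dotenv` package, which is installed separately from the SDK. Where that dependency is not available, the `KEY=VALUE` format shown above can be read with the standard library alone. The following is a minimal sketch, not part of the Atlas SDK: `parse_env_lines` is a hypothetical helper, and it deliberately ignores `export` prefixes, multi-line values, and variable interpolation.

```python
from typing import Dict, Iterable


def parse_env_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        # Ignore blanks, comments, and anything that is not an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# Hypothetical .env content, using the placeholder values from this guide
sample = [
    "# local development credentials",
    "LAYERLENS_ATLAS_API_KEY=your_api_key_here",
    'LAYERLENS_ATLAS_ORG_ID="your_org_id_here"',
    "",
]
parsed = parse_env_lines(sample)
print(sorted(parsed))
```

To apply the parsed values, something like `os.environ.update(parsed)` before constructing the client should work, assuming the client reads its credentials from the environment as described above.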
+ +## Client Configuration + +### Automatic Configuration + +When environment variables are set, the client configures itself automatically: + +```python +from atlas import Atlas + +# Uses environment variables automatically +client = Atlas() +``` + +### Explicit Configuration + +You can also pass credentials directly to the client: + +```python +from atlas import Atlas + +client = Atlas( + api_key="your_api_key_here", + organization_id="your_org_id_here", + project_id="your_project_id_here" +) +``` + +### Mixed Configuration + +You can mix environment variables with explicit parameters: + +```python +import os +from atlas import Atlas + +client = Atlas( + api_key=os.environ.get("LAYERLENS_ATLAS_API_KEY"), + organization_id="override_org_id", # Explicit value used instead of the environment variable + project_id=os.environ.get("LAYERLENS_ATLAS_PROJECT_ID") +) +``` + +## Advanced Configuration + +### Timeout Configuration + +Configure request timeouts: + +```python +from atlas import Atlas +import httpx + +# Simple timeout (10 seconds) +client = Atlas(timeout=10.0) + +# Advanced timeout configuration +client = Atlas( + timeout=httpx.Timeout( + connect=5.0, # Connection timeout + read=30.0, # Read timeout + write=10.0, # Write timeout + pool=2.0 # Pool timeout + ) +) +``` + +## Validation + +The SDK will validate your configuration on first use: + +```python +import atlas # Needed for the exception classes referenced below +from atlas import Atlas + +try: + client = Atlas() + # Test the connection + evaluation = client.evaluations.create(model="test", benchmark="test") +except atlas.AuthenticationError: + print("Invalid API key or authentication failed") +except atlas.PermissionDeniedError: + print("Valid API key but insufficient permissions") +except atlas.AtlasError as e: + print(f"Configuration error: {e}") +``` + +## Security Best Practices + +1. **Never hardcode credentials** in your source code +2. **Use environment variables** or secure credential management systems +3. **Rotate API keys regularly** for enhanced security +4. 
**Use different API keys** for different environments (dev, staging, prod) +5. **Monitor API key usage** in the LayerLens dashboard +6. **Revoke unused keys** immediately + +## Next Steps + +Once authentication is configured, proceed to the [Quick Start Guide](quickstart.md) to make your first API call. \ No newline at end of file diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md new file mode 100644 index 0000000..37e75e5 --- /dev/null +++ b/docs/getting-started/installation.md @@ -0,0 +1,55 @@ +# Installation + +The Atlas Python SDK supports Python 3.8 and above. You can install it using pip or your preferred Python package manager. + +## Install from PyPI + +```bash +pip install atlas +``` + +## Verify Installation + +After installation, verify that the SDK is working correctly: + +```python +import atlas +print(atlas.__version__) +``` + +This should print the version number of the installed SDK. + +## System Requirements + +- **Python**: 3.8 or higher +- **Operating Systems**: Windows, macOS, Linux +- **Dependencies**: The SDK automatically installs required dependencies: + - `httpx` - HTTP client library + - `pydantic` - Data validation and serialization + - `typing-extensions` - Enhanced type hints for older Python versions + +## Virtual Environment (Recommended) + +We strongly recommend using a virtual environment to avoid dependency conflicts: + +```bash +# Create virtual environment +python -m venv atlas-env + +# Activate it (Linux/macOS) +source atlas-env/bin/activate + +# Activate it (Windows) +atlas-env\Scripts\activate + +# Install the SDK +pip install atlas +``` + +## Upgrading + +To upgrade to the latest version: + +```bash +pip install --upgrade atlas +``` diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md new file mode 100644 index 0000000..e47aaf6 --- /dev/null +++ b/docs/getting-started/quickstart.md @@ -0,0 +1,197 @@ +# Quick Start Guide + +This guide will help you make 
your first API call with the Atlas Python SDK. We'll walk through creating an evaluation and retrieving results. + +## Prerequisites + +Before you begin, ensure you have: + +1. ✅ [Installed the Atlas SDK](installation.md) +2. ✅ [Configured authentication](authentication.md) with your API key, organization ID, and project ID +3. ✅ Access to LayerLens Atlas platform + +## Your First Evaluation + +Let's create a simple evaluation to test your setup: + +```python +import os +from atlas import Atlas + +# Initialize the client (uses environment variables) +client = Atlas( + api_key=os.environ.get("LAYERLENS_ATLAS_API_KEY"), + organization_id=os.environ.get("LAYERLENS_ATLAS_ORG_ID"), + project_id=os.environ.get("LAYERLENS_ATLAS_PROJECT_ID") +) + +# Create an evaluation +evaluation = client.evaluations.create( + model="gpt-3.5-turbo", # Replace with your model ID + benchmark="mmlu" # Replace with your benchmark ID +) + +if evaluation: + print(f"✅ Evaluation created successfully!") + print(f" ID: {evaluation.id}") + print(f" Status: {evaluation.status}") + print(f" Model: {evaluation.model_name}") + print(f" Benchmark: {evaluation.dataset_name}") +else: + print("❌ Failed to create evaluation") +``` + +## Understanding the Response + +A successful evaluation creation returns an `Evaluation` object with the following key properties: + +```python +evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") + +print(f"Evaluation ID: {evaluation.id}") +print(f"Status: {evaluation.status}") +print(f"Status Description: {evaluation.status_description}") +print(f"Submitted At: {evaluation.submitted_at}") +print(f"Model: {evaluation.model_name} ({evaluation.model_company})") +print(f"Dataset: {evaluation.dataset_name}") + +# Available when evaluation is completed +if evaluation.status == "completed": + print(f"Accuracy: {evaluation.accuracy}") + print(f"Readability Score: {evaluation.readability_score}") + print(f"Toxicity Score: {evaluation.toxicity_score}") + 
print(f"Ethics Score: {evaluation.ethics_score}") +``` + +## Retrieving Results + +Once your evaluation is complete, you can retrieve detailed results: + +```python +# Wait for evaluation to complete, then get results +if evaluation and evaluation.status == "completed": + results = client.results.get(evaluation_id=evaluation.id) + + if results: + print(f"📊 Retrieved {len(results)} results") + + # Examine the first result + first_result = results[0] + print(f"\nFirst Result:") + print(f" Subset: {first_result.subset}") + print(f" Prompt: {first_result.prompt[:100]}...") # First 100 chars + print(f" Model Output: {first_result.result[:100]}...") + print(f" Expected Answer: {first_result.truth}") + print(f" Score: {first_result.score}") + print(f" Duration: {first_result.duration}") + print(f" Metrics: {first_result.metrics}") +``` + +## Complete Example + +Here's a complete example that creates an evaluation and waits for results: + +```python +import os +import time +from atlas import Atlas + +def main(): + # Initialize client + client = Atlas() + + print("🚀 Creating evaluation...") + + try: + # Create evaluation + evaluation = client.evaluations.create( + model="gpt-3.5-turbo", + benchmark="mmlu" + ) + + if not evaluation: + print("❌ Failed to create evaluation") + return + + print(f"✅ Evaluation created: {evaluation.id}") + print(f" Status: {evaluation.status}") + + # Poll for completion (in a real app, use webhooks instead) + print("\n⏳ Waiting for evaluation to complete...") + + while evaluation.status not in ["completed", "failed", "cancelled"]: + time.sleep(30) # Wait 30 seconds + + # In practice, you'd re-fetch the evaluation status + # This is a simplified example + print(f" Status: {evaluation.status}") + + if evaluation.status == "completed": + print(f"🎉 Evaluation completed!") + print(f" Accuracy: {evaluation.accuracy:.2%}") + + # Get detailed results + results = client.results.get(evaluation_id=evaluation.id) + print(f"📊 Retrieved {len(results) if 
results else 0} detailed results") + + else: + print(f"❌ Evaluation ended with status: {evaluation.status}") + + except Exception as e: + print(f"❌ Error: {e}") + +if __name__ == "__main__": + main() +``` + +## Error Handling + +Wrap your API calls in `try`/`except` blocks, catching the most specific exceptions first: + +```python +import atlas +from atlas import Atlas + +client = Atlas() + +try: + evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu" + ) +except atlas.AuthenticationError: + print("❌ Authentication failed. Check your API key.") +except atlas.PermissionDeniedError: + print("❌ Permission denied. Check your organization/project access.") +except atlas.RateLimitError: + print("❌ Rate limit exceeded. Please wait and try again.") +except atlas.APIConnectionError as e: + print(f"❌ Connection error: {e}") +except atlas.APIStatusError as e: + print(f"❌ API error: {e.status_code} - {e}") +except Exception as e: + print(f"❌ Unexpected error: {e}") +``` + +## Available Models and Benchmarks + +To see available models and benchmarks, you can: + +1. **Check the LayerLens Atlas dashboard** for the most up-to-date list +2. **Contact support** for specific model or benchmark IDs + +## What's Next? + +Now that you've successfully made your first API call: + +1. **[Explore the API Reference](../api-reference/)** - Learn about all available methods +2. **[Check out Code Examples](../examples/)** - See practical usage patterns +3. **[Review Error Handling](../api-reference/errors.md)** - Handle edge cases gracefully +4. **[Security Best Practices](../security/)** - Secure your API usage + +## Need Help? 
+ +- **Documentation**: Browse the complete [API Reference](../api-reference/) +- **Examples**: Check out more [Code Examples](../examples/) +- **Support**: Contact LayerLens support through your dashboard for technical assistance +- **Status**: Check [status.layerlens.com](https://status.layerlens.com) for service updates \ No newline at end of file diff --git a/docs/security/api-key-management.md b/docs/security/api-key-management.md new file mode 100644 index 0000000..ca89c19 --- /dev/null +++ b/docs/security/api-key-management.md @@ -0,0 +1,275 @@ +# API Key Management + +This guide covers best practices for securely managing your Atlas API keys throughout the development lifecycle. + +## API Key Security Fundamentals + +### What Makes API Keys Sensitive + +API keys are sensitive credentials that provide access to your Atlas organization and projects. They should be treated with the same level of security as passwords or other authentication tokens. + +**Risks of compromised API keys**: +- Unauthorized access to your evaluations and data +- Unintended usage charges on your account +- Potential data breaches or intellectual property theft +- Abuse of your API quotas and rate limits + +### API Key Best Practices + +1. **Never hardcode API keys in source code** +2. **Use environment variables or secure credential stores** +3. **Rotate keys regularly** +4. **Use different keys for different environments** +5. **Monitor key usage and access patterns** +6. 
**Revoke unused or compromised keys immediately** + +## Secure API Key Storage + +### Environment Variables (Recommended) + +**✅ Good - Using environment variables**: +```python +import os +from atlas import Atlas + +# Secure: Load from environment variables +client = Atlas( + api_key=os.getenv('LAYERLENS_ATLAS_API_KEY'), + organization_id=os.getenv('LAYERLENS_ATLAS_ORG_ID'), + project_id=os.getenv('LAYERLENS_ATLAS_PROJECT_ID') +) +``` +### Setting Environment Variables Securely + +**Linux/macOS**: +```bash +# Add to your shell profile (.bashrc, .zshrc, etc.) +export LAYERLENS_ATLAS_API_KEY="sk-your-key-here" +export LAYERLENS_ATLAS_ORG_ID="org-your-org-here" +export LAYERLENS_ATLAS_PROJECT_ID="proj-your-project-here" + +# Reload your shell configuration +source ~/.bashrc # or ~/.zshrc +``` + +**Windows**: +```cmd +# Command Prompt (persistent) +setx LAYERLENS_ATLAS_API_KEY "sk-your-key-here" +setx LAYERLENS_ATLAS_ORG_ID "org-your-org-here" +setx LAYERLENS_ATLAS_PROJECT_ID "proj-your-project-here" + +# PowerShell (session-only) +$env:LAYERLENS_ATLAS_API_KEY="sk-your-key-here" +$env:LAYERLENS_ATLAS_ORG_ID="org-your-org-here" +$env:LAYERLENS_ATLAS_PROJECT_ID="proj-your-project-here" +``` + +### Using .env Files + +**Create a .env file** (never commit this to version control): +```bash +# .env +LAYERLENS_ATLAS_API_KEY=sk-your-key-here +LAYERLENS_ATLAS_ORG_ID=org-your-org-here +LAYERLENS_ATLAS_PROJECT_ID=proj-your-project-here +``` + +**Load .env file in Python**: +```python +from dotenv import load_dotenv +import os + +# Load environment variables from .env file +load_dotenv() + +from atlas import Atlas + +# Now environment variables are available +client = Atlas() +``` + +**Important**: Add `.env` to your `.gitignore` file: +```bash +# .gitignore +.env +.env.local +.env.*.local +*.env +``` + +### Advanced Credential Management + +#### Using External Secret Managers + +**AWS Secrets Manager**: +```python +import boto3 +import json +from atlas import Atlas + +def 
get_atlas_credentials_from_aws(): + """Retrieve Atlas credentials from AWS Secrets Manager""" + session = boto3.session.Session() + client = session.client('secretsmanager', region_name='us-east-1') + + try: + response = client.get_secret_value(SecretId='layerlens/atlas/credentials') + secrets = json.loads(response['SecretString']) + + return { + 'api_key': secrets['api_key'], + 'organization_id': secrets['organization_id'], + 'project_id': secrets['project_id'] + } + except Exception as e: + print(f"Error retrieving secrets: {e}") + return None + +# Usage +credentials = get_atlas_credentials_from_aws() +if credentials: + client = Atlas(**credentials) +``` +## Environment-Specific Key Management + +### Separating Development and Production Keys + +**Use different API keys for different environments**: + +```python +import os +from atlas import Atlas + +def get_atlas_client(): + """Get Atlas client based on environment""" + environment = os.getenv('ATLAS_ENV', 'development') + + if environment == 'development': + return Atlas( + api_key=os.getenv('DEV_ATLAS_API_KEY'), + organization_id=os.getenv('DEV_ATLAS_ORG_ID'), + project_id=os.getenv('DEV_ATLAS_PROJECT_ID'), + base_url=os.getenv('DEV_ATLAS_BASE_URL') # Dev server if applicable + ) + elif environment == 'staging': + return Atlas( + api_key=os.getenv('STAGING_ATLAS_API_KEY'), + organization_id=os.getenv('STAGING_ATLAS_ORG_ID'), + project_id=os.getenv('STAGING_ATLAS_PROJECT_ID') + ) + elif environment == 'production': + return Atlas( + api_key=os.getenv('PROD_ATLAS_API_KEY'), + organization_id=os.getenv('PROD_ATLAS_ORG_ID'), + project_id=os.getenv('PROD_ATLAS_PROJECT_ID') + ) + else: + raise ValueError(f"Unknown environment: {environment}") + +# Usage +client = get_atlas_client() +``` + +**Environment-specific .env files**: +```bash +# .env.development +DEV_ATLAS_API_KEY=sk-dev-key-here +DEV_ATLAS_ORG_ID=dev-org-id +DEV_ATLAS_PROJECT_ID=dev-project-id +DEV_ATLAS_BASE_URL=https://dev-api.layerlens.com + +# 
.env.production +PROD_ATLAS_API_KEY=sk-prod-key-here +PROD_ATLAS_ORG_ID=prod-org-id +PROD_ATLAS_PROJECT_ID=prod-project-id +``` + +### Container and Deployment Security + +**Docker Secrets**: +```yaml +# docker-compose.yml +version: '3.8' + +services: + atlas-app: + image: your-app:latest + secrets: + - atlas_api_key + - atlas_org_id + - atlas_project_id + environment: + - LAYERLENS_ATLAS_API_KEY_FILE=/run/secrets/atlas_api_key + - LAYERLENS_ATLAS_ORG_ID_FILE=/run/secrets/atlas_org_id + - LAYERLENS_ATLAS_PROJECT_ID_FILE=/run/secrets/atlas_project_id + +secrets: + atlas_api_key: + file: ./secrets/atlas_api_key.txt + atlas_org_id: + file: ./secrets/atlas_org_id.txt + atlas_project_id: + file: ./secrets/atlas_project_id.txt +``` + +**Reading Docker secrets in Python**: +```python +import os +from atlas import Atlas + +def read_docker_secret(secret_name): + """Read secret from Docker secrets file""" + secret_file = f"/run/secrets/{secret_name}" + try: + with open(secret_file, 'r') as f: + return f.read().strip() + except FileNotFoundError: + return None + +def get_atlas_client_from_docker_secrets(): + """Initialize Atlas client using Docker secrets""" + # Try Docker secrets first, fall back to environment variables + api_key = (read_docker_secret('atlas_api_key') or + os.getenv('LAYERLENS_ATLAS_API_KEY')) + + org_id = (read_docker_secret('atlas_org_id') or + os.getenv('LAYERLENS_ATLAS_ORG_ID')) + + project_id = (read_docker_secret('atlas_project_id') or + os.getenv('LAYERLENS_ATLAS_PROJECT_ID')) + + if not all([api_key, org_id, project_id]): + raise ValueError("Missing required Atlas credentials") + + return Atlas( + api_key=api_key, + organization_id=org_id, + project_id=project_id + ) + +# Usage +client = get_atlas_client_from_docker_secrets() +``` + + +## Security Checklist + +### Development Security Checklist + +- [ ] ✅ API keys stored in environment variables, not hardcoded +- [ ] ✅ `.env` files added to `.gitignore` +- [ ] ✅ Different API keys for development, 
staging, and production +- [ ] ✅ API key validation implemented before deployment +- [ ] ✅ Error handling doesn't expose API keys in logs +- [ ] ✅ Code review process includes credential security checks + +### Production Security Checklist + +- [ ] ✅ API keys stored in secure credential management system +- [ ] ✅ Key rotation schedule established and automated +- [ ] ✅ API usage monitoring and alerting configured +- [ ] ✅ Audit logging enabled for all API operations +- [ ] ✅ Network security controls (firewalls, VPNs) in place +- [ ] ✅ Least privilege access principles applied +- [ ] ✅ Incident response plan includes credential compromise scenarios diff --git a/docs/security/data-privacy.md b/docs/security/data-privacy.md new file mode 100644 index 0000000..cf2953b --- /dev/null +++ b/docs/security/data-privacy.md @@ -0,0 +1,81 @@ +# Data Privacy + +This guide covers data privacy considerations and best practices when using the Atlas Python SDK to ensure compliance with privacy regulations and protect sensitive information. + +## Overview + +When using the Atlas Python SDK, you may be handling sensitive data including: + +- **AI model outputs** and evaluation results +- **Prompt data** used in evaluations +- **API credentials** and authentication tokens +- **Organizational information** and project data +- **Usage patterns** and performance metrics + +Proper data privacy practices are essential for regulatory compliance and maintaining user trust. 
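Because API credentials are among the sensitive data listed above, one common safeguard is to scrub key-like strings before they reach logs. The following is a minimal sketch, assuming keys follow the `sk-` prefix used by the placeholder keys in these docs (the real key format may differ). `redact_secrets` and `RedactingFilter` are hypothetical helper names and are not part of the Atlas SDK.

```python
import logging
import re

# Assumption: keys look like "sk-" followed by letters, digits, "-" or "_",
# mirroring the placeholder keys in these docs. Adjust to the real format.
_KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{4,}")


def redact_secrets(text: str) -> str:
    """Replace anything that looks like an API key with a fixed marker."""
    return _KEY_PATTERN.sub("[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Logging filter that redacts key-like strings from each record."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the final message first so keys passed as arguments are caught too
        record.msg = redact_secrets(record.getMessage())
        record.args = ()
        return True  # Keep the record; only its text is changed


logger = logging.getLogger("atlas_example")
logger.addFilter(RedactingFilter())
print(redact_secrets("request failed for key sk-abc123XYZ"))
```

The filter is attached to a single logger here; attaching it to the relevant handlers instead would cover records from other loggers that share those handlers.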
+ +## Data Classification + +### Understanding Your Data Types + +**Public Data** ✅ (No privacy concerns): +- Model names and identifiers +- Benchmark names and types +- General evaluation statistics +- Documentation and configuration + +**Internal Data** ⚠️ (Moderate privacy): +- Evaluation results and scores +- Performance metrics +- Usage analytics +- System logs (without sensitive content) + +**Confidential Data** 🔒 (High privacy): +- API keys and credentials +- Custom prompts and datasets +- Proprietary model outputs +- Personal identifiable information (PII) + +**Restricted Data** 🚫 (Maximum privacy): +- Personal data under GDPR/CCPA +- Financial or healthcare information +- Trade secrets and intellectual property +- Customer data requiring special handling + +### Data Classification Example + +```python +from enum import Enum +from dataclasses import dataclass +from typing import Optional, List + +class DataClassification(Enum): + PUBLIC = "public" + INTERNAL = "internal" + CONFIDENTIAL = "confidential" + RESTRICTED = "restricted" + +@dataclass +class EvaluationDataMap: + """Map Atlas data types to privacy classifications""" + + model_name: DataClassification = DataClassification.PUBLIC + benchmark_name: DataClassification = DataClassification.PUBLIC + evaluation_scores: DataClassification = DataClassification.INTERNAL + model_outputs: DataClassification = DataClassification.CONFIDENTIAL + api_credentials: DataClassification = DataClassification.RESTRICTED + custom_prompts: DataClassification = DataClassification.CONFIDENTIAL + +def classify_atlas_data(): + """Example data classification for Atlas SDK usage""" + data_map = EvaluationDataMap() + + print("🔍 Atlas Data Classification:") + for field_name, field_value in data_map.__dict__.items(): + privacy_level = field_value.value + print(f" {field_name}: {privacy_level.upper()}") + + return data_map + +classify_atlas_data() +``` diff --git a/docs/security/environment-variables.md 
b/docs/security/environment-variables.md new file mode 100644 index 0000000..437864e --- /dev/null +++ b/docs/security/environment-variables.md @@ -0,0 +1,223 @@ +# Environment Variables + +This guide covers secure practices for managing environment variables when using the Atlas Python SDK. + +## Overview + +Environment variables provide a secure way to configure your Atlas SDK without hardcoding sensitive credentials in your source code. This approach separates configuration from code and enables different configurations for different environments. + +## Required Environment Variables + +The Atlas SDK uses these primary environment variables: + +| Variable | Description | Required | Example | +|----------|-------------|----------|---------| +| `LAYERLENS_ATLAS_API_KEY` | Your Atlas API key | Yes | `sk-abc123...` | +| `LAYERLENS_ATLAS_ORG_ID` | Organization identifier | Yes | `org-abc123` | +| `LAYERLENS_ATLAS_PROJECT_ID` | Project identifier | Yes | `proj-xyz789` | + +## Setting Environment Variables + +### Development Environment + +**Linux/macOS (Bash/Zsh)**: +```bash +# Set for current session +export LAYERLENS_ATLAS_API_KEY="sk-your-key-here" +export LAYERLENS_ATLAS_ORG_ID="org-your-org-here" +export LAYERLENS_ATLAS_PROJECT_ID="proj-your-project-here" + +# Add to shell profile for persistence (.bashrc, .zshrc, etc.) 
+echo 'export LAYERLENS_ATLAS_API_KEY="sk-your-key-here"' >> ~/.bashrc +echo 'export LAYERLENS_ATLAS_ORG_ID="org-your-org-here"' >> ~/.bashrc +echo 'export LAYERLENS_ATLAS_PROJECT_ID="proj-your-project-here"' >> ~/.bashrc + +# Reload shell configuration +source ~/.bashrc +``` + +**Windows Command Prompt**: +```cmd +REM Set for current session +set LAYERLENS_ATLAS_API_KEY=sk-your-key-here +set LAYERLENS_ATLAS_ORG_ID=org-your-org-here +set LAYERLENS_ATLAS_PROJECT_ID=proj-your-project-here + +REM Set permanently for the current user (takes effect in new sessions) +setx LAYERLENS_ATLAS_API_KEY "sk-your-key-here" +setx LAYERLENS_ATLAS_ORG_ID "org-your-org-here" +setx LAYERLENS_ATLAS_PROJECT_ID "proj-your-project-here" +``` + +**Windows PowerShell**: +```powershell +# Set for current session +$env:LAYERLENS_ATLAS_API_KEY="sk-your-key-here" +$env:LAYERLENS_ATLAS_ORG_ID="org-your-org-here" +$env:LAYERLENS_ATLAS_PROJECT_ID="proj-your-project-here" + +# Set permanently for current user +[Environment]::SetEnvironmentVariable("LAYERLENS_ATLAS_API_KEY", "sk-your-key-here", "User") +[Environment]::SetEnvironmentVariable("LAYERLENS_ATLAS_ORG_ID", "org-your-org-here", "User") +[Environment]::SetEnvironmentVariable("LAYERLENS_ATLAS_PROJECT_ID", "proj-your-project-here", "User") +``` + +### Verification + +**Check whether the variables are set correctly**: +```python +import os + +def verify_atlas_environment(): + """Verify Atlas environment variables are configured""" + required_vars = { + 'LAYERLENS_ATLAS_API_KEY': 'API Key', + 'LAYERLENS_ATLAS_ORG_ID': 'Organization ID', + 'LAYERLENS_ATLAS_PROJECT_ID': 'Project ID' + } + + print("🔍 Atlas Environment Variable Check") + print("=" * 40) + + all_set = True + for var_name, description in required_vars.items(): + value = os.getenv(var_name) + + if value: + # Don't print the full value for security + masked_value = f"{value[:8]}..." 
if len(value) > 8 else "***" + print(f"✅ {description}: {masked_value}") + else: + print(f"❌ {description}: Not set") + all_set = False + + + if all_set: + print(f"\n🎉 All required variables are set!") + else: + print(f"\n⚠️ Some required variables are missing") + + return all_set + +# Run verification +verify_atlas_environment() +``` + +## Using .env Files + +### Creating .env Files + +**.env file for development**: +```bash +# .env +LAYERLENS_ATLAS_API_KEY=sk-development-key-here +LAYERLENS_ATLAS_ORG_ID=org-dev-12345 +LAYERLENS_ATLAS_PROJECT_ID=proj-dev-67890 + +# Optional: Set environment name +ATLAS_ENV=development +``` + +**Loading .env files in Python**: +```python +from dotenv import load_dotenv +import os + +# Load .env file from current directory +load_dotenv() + +# Or load specific .env file +load_dotenv('.env.development') + +# Or load from specific path +load_dotenv('/path/to/your/.env') + +# Verify variables are loaded +from atlas import Atlas + +try: + client = Atlas() # Will use environment variables + print("✅ Atlas client initialized successfully") +except Exception as e: + print(f"❌ Failed to initialize client: {e}") +``` + +### Environment-Specific .env Files + +**Create separate files for each environment**: + +**.env.development**: +```bash +LAYERLENS_ATLAS_API_KEY=sk-dev-key-here +LAYERLENS_ATLAS_ORG_ID=org-dev-12345 +LAYERLENS_ATLAS_PROJECT_ID=proj-dev-67890 +``` + +**.env.staging**: +```bash +LAYERLENS_ATLAS_API_KEY=sk-staging-key-here +LAYERLENS_ATLAS_ORG_ID=org-staging-12345 +LAYERLENS_ATLAS_PROJECT_ID=proj-staging-67890 +``` + +**.env.production**: +```bash +LAYERLENS_ATLAS_API_KEY=sk-prod-key-here +LAYERLENS_ATLAS_ORG_ID=org-prod-12345 +LAYERLENS_ATLAS_PROJECT_ID=proj-prod-67890 +``` + +**Load environment-specific configuration**: +```python +import os +from dotenv import load_dotenv +from atlas import Atlas + +def load_environment_config(): + """Load environment-specific configuration""" + # Determine environment + env = 
os.getenv('ATLAS_ENV', 'development') + + # Load base .env file first + load_dotenv('.env') + + # Override with environment-specific file + env_file = f'.env.{env}' + if os.path.exists(env_file): + load_dotenv(env_file, override=True) + print(f"📄 Loaded configuration from {env_file}") + else: + print(f"⚠️ Environment file {env_file} not found, using base configuration") + + return env + +def get_atlas_client(): + """Get Atlas client with environment-specific configuration""" + env = load_environment_config() + + # Create client with loaded environment variables + client = Atlas() + + # Log configuration (without sensitive data) + print(f"🌍 Environment: {env}") + print(f"🔗 Base URL: {client.base_url}") + print(f"⏱️ Timeout: {client.timeout}s") + + return client + +# Usage +client = get_atlas_client() +``` + +## Security Best Practices + +### Environment Variable Security Checklist + +- [ ] ✅ No sensitive values hardcoded in source code +- [ ] ✅ .env files added to .gitignore +- [ ] ✅ Different credentials for each environment (dev/staging/prod) +- [ ] ✅ Environment variables validated before use +- [ ] ✅ Production secrets managed through secure systems (not .env files) +- [ ] ✅ Regular rotation of API keys +- [ ] ✅ Monitoring for credential exposure in logs +- [ ] ✅ Team members trained on secure credential handling diff --git a/docs/security/rate-limiting.md b/docs/security/rate-limiting.md new file mode 100644 index 0000000..1cfac05 --- /dev/null +++ b/docs/security/rate-limiting.md @@ -0,0 +1,570 @@ +# Rate Limiting + +This guide covers how to handle rate limiting when using the Atlas Python SDK, including best practices for avoiding rate limits and properly handling rate limit errors. 
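One detail worth handling early: in HTTP, the `Retry-After` header can carry either a number of seconds or an HTTP date, so converting it with a plain `int()` call can fail. The following is a minimal, standard-library sketch of a tolerant parser. `parse_retry_after` is a hypothetical helper, not part of the Atlas SDK, and whether the Atlas API ever sends the date form is not confirmed here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds from a Retry-After value, or None if unusable."""
    if not value:
        return None
    value = value.strip()
    # Form 1: delay in seconds, e.g. "120"
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        # Treat a naive result as UTC so the subtraction below is valid
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no waiting is required
    return max(0.0, (target - now).total_seconds())


print(parse_retry_after("120"))
```

A `None` result is the signal to fall back to a locally computed delay, such as exponential backoff with jitter.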
+ +## Identifying Rate Limit Errors + +### Rate Limit HTTP Response + +When you exceed rate limits, the API returns a `429 Too Many Requests` status: + +```python +import atlas +from atlas import Atlas + +try: + client = Atlas() + + # Making too many requests quickly + for i in range(100): + evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu" + ) + +except atlas.RateLimitError as e: + print(f"Rate limited: {e}") + print(f"Status code: {e.status_code}") # 429 + print(f"Response headers: {dict(e.response.headers)}") +``` + +### Rate Limit Headers + +The API response includes helpful headers: + +```python +import atlas +from atlas import Atlas + +def inspect_rate_limit_headers(error): + """Inspect rate limit headers from error response""" + headers = error.response.headers + + # Common rate limit headers + rate_limit_info = { + 'retry_after': headers.get('retry-after'), + 'x_ratelimit_limit': headers.get('x-ratelimit-limit'), + 'x_ratelimit_remaining': headers.get('x-ratelimit-remaining'), + 'x_ratelimit_reset': headers.get('x-ratelimit-reset'), + } + + print("Rate limit information:") + for key, value in rate_limit_info.items(): + if value: + print(f" {key}: {value}") + +try: + client = Atlas() + # ... 
make request that triggers rate limit + +except atlas.RateLimitError as e: + inspect_rate_limit_headers(e) +``` + +## Handling Rate Limits + +### Basic Retry with Backoff + +```python +import time +import random +import atlas +from atlas import Atlas + +def create_evaluation_with_retry(model: str, benchmark: str, max_retries: int = 3): + """Create evaluation with rate limit retry logic""" + client = Atlas() + + for attempt in range(max_retries): + try: + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + + if evaluation: + print(f"✅ Success on attempt {attempt + 1}") + return evaluation + + except atlas.RateLimitError as e: + print(f"⏳ Rate limited on attempt {attempt + 1}") + + # Check if server provided retry-after header + retry_after = e.response.headers.get('retry-after') + + if retry_after: + wait_time = int(retry_after) + print(f" Server requests waiting {wait_time} seconds") + else: + # Exponential backoff with jitter + base_wait = 2 ** attempt + jitter = random.uniform(0, 1) + wait_time = base_wait + jitter + print(f" Using exponential backoff: {wait_time:.1f} seconds") + + if attempt < max_retries - 1: + time.sleep(wait_time) + else: + print(f"❌ Exhausted all {max_retries} retry attempts") + raise + + except atlas.APIError as e: + print(f"❌ Non-rate-limit error: {e}") + raise + + return None + +# Usage +evaluation = create_evaluation_with_retry("gpt-4", "mmlu") +``` + +### Advanced Retry Strategies + +#### Exponential Backoff with Jitter + +```python +import time +import random +import atlas +from atlas import Atlas + +class ExponentialBackoffRetry: + """Implement exponential backoff with jitter for rate limit handling""" + + def __init__(self, max_retries=5, base_delay=1.0, max_delay=60.0): + self.max_retries = max_retries + self.base_delay = base_delay + self.max_delay = max_delay + + def calculate_delay(self, attempt: int, retry_after: str = None) -> float: + """Calculate delay before next retry""" + + # If server provided 
retry-after, use that + if retry_after: + try: + return float(retry_after) + except (ValueError, TypeError): + pass + + # Exponential backoff: 2^attempt * base_delay + delay = self.base_delay * (2 ** attempt) + + # Add jitter to prevent thundering herd + jitter = delay * 0.1 * random.uniform(-1, 1) + delay += jitter + + # Cap at maximum delay + return min(delay, self.max_delay) + + def retry_operation(self, operation_func, *args, **kwargs): + """Retry operation with exponential backoff""" + + for attempt in range(self.max_retries): + try: + return operation_func(*args, **kwargs) + + except atlas.RateLimitError as e: + if attempt == self.max_retries - 1: + # Last attempt - re-raise the error + raise + + retry_after = e.response.headers.get('retry-after') + delay = self.calculate_delay(attempt, retry_after) + + print(f"⏳ Rate limited (attempt {attempt + 1}/{self.max_retries})") + print(f" Waiting {delay:.1f} seconds before retry...") + + time.sleep(delay) + continue + + except atlas.APIError as e: + # Don't retry other API errors + print(f"❌ Non-retryable error: {e}") + raise + +# Usage +backoff = ExponentialBackoffRetry(max_retries=5, base_delay=2.0, max_delay=120.0) + +def create_evaluation(): + client = Atlas() + return client.evaluations.create(model="gpt-4", benchmark="mmlu") + +evaluation = backoff.retry_operation(create_evaluation) +``` + + +## Proactive Rate Limit Management + +### Request Throttling + +```python +import time +from threading import Lock +from datetime import datetime, timedelta +import atlas +from atlas import Atlas + +class ThrottledAtlasClient: + """Atlas client with built-in request throttling""" + + def __init__(self, requests_per_minute=30, **client_kwargs): + self.client = Atlas(**client_kwargs) + self.requests_per_minute = requests_per_minute + self.min_interval = 60.0 / requests_per_minute # seconds between requests + self.last_request_time = None + self.lock = Lock() + + def _wait_for_next_request(self): + """Wait if necessary to 
maintain rate limit""" + with self.lock: + if self.last_request_time: + elapsed = time.time() - self.last_request_time + if elapsed < self.min_interval: + wait_time = self.min_interval - elapsed + print(f"⏳ Throttling: waiting {wait_time:.1f}s") + time.sleep(wait_time) + + self.last_request_time = time.time() + + def create_evaluation(self, *args, **kwargs): + """Create evaluation with throttling""" + self._wait_for_next_request() + return self.client.evaluations.create(*args, **kwargs) + + def get_results(self, *args, **kwargs): + """Get results with throttling""" + self._wait_for_next_request() + return self.client.results.get(*args, **kwargs) + +# Usage +throttled_client = ThrottledAtlasClient(requests_per_minute=20) + +# These requests will be automatically throttled +evaluations = [] +for i in range(10): + evaluation = throttled_client.create_evaluation( + model="gpt-4", + benchmark="mmlu" + ) + evaluations.append(evaluation) +``` + +### Batch Request Management + +```python +import time +from typing import List, Tuple, Callable, Any +from concurrent.futures import ThreadPoolExecutor, as_completed +import atlas +from atlas import Atlas + +class BatchRequestManager: + """Manage batch requests with rate limiting""" + + def __init__(self, requests_per_minute=30, max_concurrent=5): + self.requests_per_minute = requests_per_minute + self.max_concurrent = max_concurrent + self.request_interval = 60.0 / requests_per_minute + + def execute_batch(self, operations: List[Tuple[Callable, tuple, dict]], + handle_rate_limits=True) -> List[Any]: + """Execute a batch of operations with rate limiting""" + + results = [] + + if self.max_concurrent == 1 or not handle_rate_limits: + # Sequential execution + for i, (func, args, kwargs) in enumerate(operations): + if i > 0 and handle_rate_limits: + time.sleep(self.request_interval) + + try: + result = func(*args, **kwargs) + results.append({"success": True, "result": result, "index": i}) + except Exception as e: + 
results.append({"success": False, "error": e, "index": i}) + else: + # Concurrent execution with rate limiting + with ThreadPoolExecutor(max_workers=self.max_concurrent) as executor: + future_to_index = {} + + for i, (func, args, kwargs) in enumerate(operations): + if i > 0 and handle_rate_limits: + # Stagger request submissions + time.sleep(self.request_interval / self.max_concurrent) + + future = executor.submit(self._execute_with_retry, func, args, kwargs) + future_to_index[future] = i + + # Collect results + for future in as_completed(future_to_index): + index = future_to_index[future] + try: + result = future.result() + results.append({"success": True, "result": result, "index": index}) + except Exception as e: + results.append({"success": False, "error": e, "index": index}) + + # Sort results by original order + results.sort(key=lambda x: x["index"]) + return results + + def _execute_with_retry(self, func, args, kwargs, max_retries=3): + """Execute operation with retry on rate limit""" + for attempt in range(max_retries): + try: + return func(*args, **kwargs) + except atlas.RateLimitError as e: + if attempt == max_retries - 1: + raise + + retry_after = e.response.headers.get('retry-after', 60) + wait_time = int(retry_after) + time.sleep(wait_time) + +# Usage +client = Atlas() +batch_manager = BatchRequestManager(requests_per_minute=20, max_concurrent=3) + +# Prepare batch operations +operations = [] +models = ["gpt-4", "claude-3-opus", "gpt-3.5-turbo"] * 5 + +for model in models: + operation = ( + client.evaluations.create, # function + (), # args + {"model": model, "benchmark": "mmlu"} # kwargs + ) + operations.append(operation) + +# Execute batch +print(f"📦 Executing batch of {len(operations)} operations...") +results = batch_manager.execute_batch(operations) + +# Process results +successful = [r for r in results if r["success"]] +failed = [r for r in results if not r["success"]] + +print(f"✅ Successful: {len(successful)}") +print(f"❌ Failed: 
{len(failed)}") + +for result in failed: + print(f" Failed operation {result['index']}: {result['error']}") +``` + +## Monitoring Rate Limits + +### Rate Limit Usage Tracking + +```python +import time +from collections import defaultdict, deque +from datetime import datetime, timedelta +from typing import Dict, List +import atlas +from atlas import Atlas + +class RateLimitMonitor: + """Monitor and track rate limit usage""" + + def __init__(self, window_minutes=60): + self.window_minutes = window_minutes + self.request_times = deque() + self.rate_limit_events = [] + self.operation_counts = defaultdict(int) + self.error_counts = defaultdict(int) + + def record_request(self, operation: str): + """Record a successful request""" + now = datetime.now() + self.request_times.append(now) + self.operation_counts[operation] += 1 + self._cleanup_old_data(now) + + def record_rate_limit(self, operation: str, retry_after: int = None): + """Record a rate limit event""" + event = { + 'timestamp': datetime.now(), + 'operation': operation, + 'retry_after': retry_after + } + self.rate_limit_events.append(event) + self.error_counts['rate_limit'] += 1 + + def _cleanup_old_data(self, current_time: datetime): + """Remove data outside monitoring window""" + cutoff = current_time - timedelta(minutes=self.window_minutes) + + # Clean request times + while self.request_times and self.request_times[0] < cutoff: + self.request_times.popleft() + + # Clean rate limit events + self.rate_limit_events = [ + event for event in self.rate_limit_events + if event['timestamp'] > cutoff + ] + + def get_current_rate(self) -> float: + """Get current requests per minute""" + self._cleanup_old_data(datetime.now()) + + if not self.request_times: + return 0.0 + + # Calculate rate over actual time window + time_span = (datetime.now() - self.request_times[0]).total_seconds() / 60 + return len(self.request_times) / max(time_span, 1) + + def get_statistics(self) -> Dict: + """Get comprehensive rate limit 
statistics""" + self._cleanup_old_data(datetime.now()) + + recent_rate_limits = len(self.rate_limit_events) + total_requests = len(self.request_times) + + return { + 'current_rate_per_minute': self.get_current_rate(), + 'total_requests_in_window': total_requests, + 'rate_limit_events': recent_rate_limits, + 'rate_limit_percentage': (recent_rate_limits / max(total_requests, 1)) * 100, + 'operation_breakdown': dict(self.operation_counts), + 'last_rate_limit': max([e['timestamp'] for e in self.rate_limit_events], + default=None) + } + + def should_slow_down(self, threshold_percentage=5) -> bool: + """Check if we should slow down requests based on rate limits""" + stats = self.get_statistics() + return stats['rate_limit_percentage'] > threshold_percentage + +class MonitoredAtlasClient: + """Atlas client with rate limit monitoring""" + + def __init__(self, **client_kwargs): + self.client = Atlas(**client_kwargs) + self.monitor = RateLimitMonitor() + + def create_evaluation(self, *args, **kwargs): + """Create evaluation with monitoring""" + try: + result = self.client.evaluations.create(*args, **kwargs) + self.monitor.record_request('create_evaluation') + + # Adaptive slowdown + if self.monitor.should_slow_down(): + print("⚠️ High rate limit percentage detected, slowing down...") + time.sleep(2) + + return result + + except atlas.RateLimitError as e: + retry_after = e.response.headers.get('retry-after') + self.monitor.record_rate_limit('create_evaluation', retry_after) + raise + + def get_results(self, *args, **kwargs): + """Get results with monitoring""" + try: + result = self.client.results.get(*args, **kwargs) + self.monitor.record_request('get_results') + return result + + except atlas.RateLimitError as e: + retry_after = e.response.headers.get('retry-after') + self.monitor.record_rate_limit('get_results', retry_after) + raise + + def print_statistics(self): + """Print current rate limit statistics""" + stats = self.monitor.get_statistics() + + print("📊 Rate Limit 
Statistics (last hour):") + print(f" Current rate: {stats['current_rate_per_minute']:.1f} requests/min") + print(f" Total requests: {stats['total_requests_in_window']}") + print(f" Rate limit events: {stats['rate_limit_events']}") + print(f" Rate limit percentage: {stats['rate_limit_percentage']:.1f}%") + + if stats['operation_breakdown']: + print(" Operations:") + for op, count in stats['operation_breakdown'].items(): + print(f" {op}: {count}") + + if stats['last_rate_limit']: + print(f" Last rate limit: {stats['last_rate_limit']}") + +# Usage +monitored_client = MonitoredAtlasClient() + +# Make requests and monitor +for i in range(20): + try: + evaluation = monitored_client.create_evaluation( + model="gpt-4", + benchmark="mmlu" + ) + print(f"✅ Evaluation {i+1} created") + + if i % 5 == 0: # Print stats every 5 requests + monitored_client.print_statistics() + + except atlas.RateLimitError: + print(f"⏳ Rate limited on request {i+1}") + time.sleep(30) # Wait before continuing + +# Final statistics +monitored_client.print_statistics() +``` + +## Best Practices Summary + +### 1. Implement Proper Retry Logic +```python +# ✅ Good: Exponential backoff with jitter +def robust_request(operation_func, max_retries=3): + for attempt in range(max_retries): + try: + return operation_func() + except atlas.RateLimitError as e: + if attempt == max_retries - 1: + raise + + # Use server-suggested wait time if available + retry_after = e.response.headers.get('retry-after', 2 ** attempt) + wait_time = int(retry_after) + random.uniform(0, 1) + time.sleep(wait_time) +``` + +### 2. Respect Server Headers +```python +# ✅ Good: Check retry-after header +except atlas.RateLimitError as e: + retry_after = e.response.headers.get('retry-after') + if retry_after: + time.sleep(int(retry_after)) +``` + +### 3. Monitor Your Usage +```python +# ✅ Good: Track your rate limit usage +monitor = RateLimitMonitor() +# ... use monitor to adjust request patterns +``` + +### 4. 
Use Appropriate Request Rates +```python +# ✅ Good: Conservative request rate +throttled_client = ThrottledAtlasClient(requests_per_minute=20) + +# ❌ Bad: Aggressive request rate +# aggressive_client = ThrottledAtlasClient(requests_per_minute=1000) +``` + +### 5. Handle Rate Limits Gracefully +```python +# ✅ Good: Graceful handling +try: + result = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.RateLimitError: + # Log the event, wait, and potentially retry + logger.warning("Rate limit hit, backing off") + time.sleep(60) +``` diff --git a/docs/troubleshooting/authentication.md b/docs/troubleshooting/authentication.md new file mode 100644 index 0000000..fbf3ee6 --- /dev/null +++ b/docs/troubleshooting/authentication.md @@ -0,0 +1,186 @@ +# Authentication Problems + +This guide covers authentication-related issues and their solutions when using the Atlas Python SDK. + +## Understanding Atlas Authentication + +The Atlas SDK uses API key-based authentication with three required components: + +1. **API Key**: Your secret authentication token +2. **Organization ID**: Your organization identifier +3. **Project ID**: The specific project you're working with + +## Common Authentication Errors + +### Invalid or Missing API Key + +**Error**: `AuthenticationError: Invalid API key` + +**Symptoms**: +- 401 Unauthorized responses +- "Invalid API key" error messages +- Authentication fails immediately + + +### Missing Required Configuration + +**Error**: `AtlasError: The api_key client option must be set either by passing api_key to the client or by setting the LAYERLENS_ATLAS_API_KEY environment variable` + +**Solutions**: + +1. **Check all required environment variables**: + ```bash + # Linux/macOS + echo $LAYERLENS_ATLAS_API_KEY + echo $LAYERLENS_ATLAS_ORG_ID + echo $LAYERLENS_ATLAS_PROJECT_ID + + # Windows + echo %LAYERLENS_ATLAS_API_KEY% + echo %LAYERLENS_ATLAS_ORG_ID% + echo %LAYERLENS_ATLAS_PROJECT_ID% + ``` + +2. 
**Set environment variables properly**: + ```bash + # Linux/macOS - in your shell profile (.bashrc, .zshrc, etc.) + export LAYERLENS_ATLAS_API_KEY="sk-..." + export LAYERLENS_ATLAS_ORG_ID="org-..." + export LAYERLENS_ATLAS_PROJECT_ID="proj-..." + + # Windows - persistently + setx LAYERLENS_ATLAS_API_KEY "sk-..." + setx LAYERLENS_ATLAS_ORG_ID "org-..." + setx LAYERLENS_ATLAS_PROJECT_ID "proj-..." + ``` + +3. **Use .env file**: + ```bash + # Create .env file in your project root + LAYERLENS_ATLAS_API_KEY=sk-your-key-here + LAYERLENS_ATLAS_ORG_ID=org-your-org-here + LAYERLENS_ATLAS_PROJECT_ID=proj-your-project-here + ``` + + ```python + # Load .env file in your Python code + from dotenv import load_dotenv + import os + + load_dotenv() + + from atlas import Atlas + client = Atlas() + ``` + +### Permission Denied Errors + +**Error**: `PermissionDeniedError: 403 Forbidden` + +**Symptoms**: +- Valid API key but still get 403 errors +- Can authenticate but cannot create evaluations +- Access denied to specific models or benchmarks + +**Diagnosis**: +```python +import atlas +from atlas import Atlas + +def diagnose_permissions(): + client = Atlas() + + print("🔍 Permission Diagnosis:") + + # Test basic access + try: + # This should fail with specific error types + evaluation = client.evaluations.create( + model="test-model", + benchmark="test-benchmark" + ) + except atlas.AuthenticationError: + print(" ❌ Authentication failed - invalid API key") + return + except atlas.PermissionDeniedError: + print(" ❌ Permission denied - valid key, insufficient permissions") + except atlas.NotFoundError: + print(" ✅ Authentication works (model/benchmark not found is normal)") + except Exception as e: + print(f" ❓ Unexpected error: {e}") + + # Test with common models/benchmarks + test_combinations = [ + ("gpt-3.5-turbo", "mmlu"), + ("gpt-4", "hellaswag"), + ("claude-3-sonnet", "arc-challenge") + ] + + print("\n Testing access to specific resources:") + + for model, benchmark in 
test_combinations: + try: + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + if evaluation: + print(f" ✅ {model} + {benchmark}: Access granted") + except atlas.PermissionDeniedError: + print(f" ❌ {model} + {benchmark}: Permission denied") + except atlas.NotFoundError: + print(f" ⚠️ {model} + {benchmark}: Resource not found") + except Exception as e: + print(f" ❓ {model} + {benchmark}: {e}") + +diagnose_permissions() +``` + +### Organization/Project Access Issues + +**Problem**: Valid API key but wrong organization or project + +**Symptoms**: +- Authentication succeeds +- Cannot access expected models or benchmarks +- Permission errors for resources you should have access to + +**Diagnosis**: +```python +import os +from atlas import Atlas +import atlas + +def verify_org_project_access(): + # Test with different org/project combinations + api_key = os.getenv('LAYERLENS_ATLAS_API_KEY') + + if not api_key: + print("❌ No API key found") + return + + # Test current configuration + current_org = os.getenv('LAYERLENS_ATLAS_ORG_ID') + current_project = os.getenv('LAYERLENS_ATLAS_PROJECT_ID') + + print(f"Testing current configuration:") + print(f" Organization: {current_org}") + print(f" Project: {current_project}") + + try: + client = Atlas( + api_key=api_key, + organization_id=current_org, + project_id=current_project + ) + + evaluation = client.evaluations.create(model="test", benchmark="test") + + except atlas.AuthenticationError: + print(" ❌ Authentication failed") + except atlas.PermissionDeniedError: + print(" ❌ Permission denied - check org/project IDs") + except atlas.NotFoundError: + print(" ✅ Access granted (test model not found is expected)") + except Exception as e: + print(f" ❓ Error: {e}") + +verify_org_project_access() +``` \ No newline at end of file diff --git a/docs/troubleshooting/common-issues.md b/docs/troubleshooting/common-issues.md new file mode 100644 index 0000000..1df848e --- /dev/null +++ 
b/docs/troubleshooting/common-issues.md @@ -0,0 +1,112 @@ +# Common Issues + +This guide covers the most frequently encountered issues when using the Atlas Python SDK and provides step-by-step solutions. + +## Installation Issues + +### Package Not Found + +**Problem**: `pip install atlas` fails with "No matching distribution found" + +**Solutions**: + +1. **Check Python version compatibility**: + ```bash + python --version + # Atlas requires Python 3.8+ + ``` + +2. **Update pip and try again**: + ```bash + python -m pip install --upgrade pip + pip install atlas + ``` + +3. **Use Python 3 explicitly**: + ```bash + python3 -m pip install atlas + ``` + +## Configuration Issues + +### Missing Environment Variables + +**Problem**: `AtlasError: The api_key client option must be set` + +**Diagnosis**: +```python +import os +print(f"API Key: {os.getenv('LAYERLENS_ATLAS_API_KEY', 'NOT SET')}") +print(f"Org ID: {os.getenv('LAYERLENS_ATLAS_ORG_ID', 'NOT SET')}") +print(f"Project ID: {os.getenv('LAYERLENS_ATLAS_PROJECT_ID', 'NOT SET')}") +``` + +**Solutions**: + +1. **Set environment variables**: + ```bash + # Linux/macOS + export LAYERLENS_ATLAS_API_KEY="your_api_key_here" + export LAYERLENS_ATLAS_ORG_ID="your_org_id_here" + export LAYERLENS_ATLAS_PROJECT_ID="your_project_id_here" + + # Windows + set LAYERLENS_ATLAS_API_KEY=your_api_key_here + set LAYERLENS_ATLAS_ORG_ID=your_org_id_here + set LAYERLENS_ATLAS_PROJECT_ID=your_project_id_here + ``` + +2. **Use .env file**: + ```bash + # Create .env file + LAYERLENS_ATLAS_API_KEY=your_api_key_here + LAYERLENS_ATLAS_ORG_ID=your_org_id_here + LAYERLENS_ATLAS_PROJECT_ID=your_project_id_here + ``` + + ```python + from dotenv import load_dotenv + load_dotenv() + + from atlas import Atlas + client = Atlas() + ``` + +3. 
**Pass explicitly to client**: + ```python + from atlas import Atlas + + client = Atlas( + api_key="your_api_key_here", + organization_id="your_org_id_here", + project_id="your_project_id_here" + ) + ``` + +### Where to Get Help + +1. **LayerLens Support**: Contact support through your LayerLens dashboard for technical issues +2. **Documentation**: Check the [complete documentation](../README.md) +3. **Community**: Join LayerLens community channels for discussions + +### Creating a Good Bug Report + +Include this information when reporting issues: + +1. **Environment details** (from debug info above) +2. **Complete error message** with stack trace +3. **Minimal reproducible example**: + ```python + from atlas import Atlas + + client = Atlas() + + # Minimal code that demonstrates the problem + evaluation = client.evaluations.create( + model="gpt-4", + benchmark="mmlu" + ) + ``` +4. **Expected vs actual behavior** +5. **Steps to reproduce** +6. **Workarounds attempted** diff --git a/docs/troubleshooting/error-codes.md b/docs/troubleshooting/error-codes.md new file mode 100644 index 0000000..227b16f --- /dev/null +++ b/docs/troubleshooting/error-codes.md @@ -0,0 +1,689 @@ +# Error Codes Reference + +This reference guide provides detailed information about all error codes and exceptions in the Atlas Python SDK. 
+ +## Exception Hierarchy + +``` +AtlasError (Base exception) +├── APIError (Base for API-related errors) +│ ├── APIConnectionError (Network/connection issues) +│ │ └── APITimeoutError (Request timeouts) +│ ├── APIResponseValidationError (Invalid response format) +│ └── APIStatusError (HTTP status errors) +│ ├── BadRequestError (400) +│ ├── AuthenticationError (401) +│ ├── PermissionDeniedError (403) +│ ├── NotFoundError (404) +│ ├── ConflictError (409) +│ ├── UnprocessableEntityError (422) +│ ├── RateLimitError (429) +│ └── InternalServerError (500+) +``` + +## HTTP Status Code Errors + +### 400 - Bad Request (`BadRequestError`) + +**When it occurs**: +- Invalid request parameters +- Missing required fields +- Malformed request data + +**Common causes**: +```python +# Empty or invalid parameters +client.evaluations.create(model="", benchmark="") # Empty strings +client.evaluations.create(model=None, benchmark="mmlu") # None values + +# Invalid parameter types +client.evaluations.create(model=123, benchmark="mmlu") # Wrong type +``` + +**Example error**: +```python +import atlas +from atlas import Atlas + +try: + client = Atlas() + evaluation = client.evaluations.create(model="", benchmark="mmlu") +except atlas.BadRequestError as e: + print(f"Bad request: {e}") + print(f"Status code: {e.status_code}") # 400 + print(f"Response body: {e.body}") +``` + +**Solutions**: +1. **Validate parameters before making requests**: + ```python + def validate_evaluation_params(model, benchmark): + if not model or not isinstance(model, str): + raise ValueError("Model must be a non-empty string") + if not benchmark or not isinstance(benchmark, str): + raise ValueError("Benchmark must be a non-empty string") + return True + + if validate_evaluation_params(model, benchmark): + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + ``` + +2. 
**Check parameter format requirements**: + ```python + # Ensure parameters meet expected format + model = model.strip() if model else "" + benchmark = benchmark.strip() if benchmark else "" + + if len(model) < 2 or len(benchmark) < 2: + raise ValueError("Model and benchmark names must be at least 2 characters") + ``` + +### 401 - Unauthorized (`AuthenticationError`) + +**When it occurs**: +- Missing API key +- Invalid or expired API key +- API key format issues + +**Common causes**: +```python +# Missing API key +client = Atlas(api_key=None) + +# Invalid API key format +client = Atlas(api_key="invalid-key") + +# Expired API key (need to regenerate) +client = Atlas(api_key="sk-old-expired-key") +``` + +**Example error**: +```python +import atlas +from atlas import Atlas + +try: + client = Atlas(api_key="invalid-key") + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.AuthenticationError as e: + print(f"Authentication failed: {e}") + print(f"Status code: {e.status_code}") # 401 + print(f"Request ID: {e.request_id}") +``` + +**Solutions**: +1. **Verify API key configuration**: + ```python + import os + + api_key = os.getenv('LAYERLENS_ATLAS_API_KEY') + if not api_key: + print("❌ API key not found in environment variables") + elif len(api_key) < 10: + print("⚠️ API key seems too short") + else: + print("✅ API key found and looks valid") + ``` + +2. **Regenerate API key**: + - Log into Atlas dashboard + - Go to Settings > API Keys + - Generate new API key + - Update environment variables + +3. 
**Test authentication separately**: + ```python + def test_authentication(api_key): + try: + client = Atlas(api_key=api_key) + # Try minimal operation to test auth + client.evaluations.create(model="test", benchmark="test") + # The call succeeded, so the key was accepted + return True, "Authentication successful" + except atlas.AuthenticationError: + return False, "Invalid API key" + except atlas.NotFoundError: + return True, "Authentication successful (test resources not found is expected)" + except Exception as e: + return False, f"Unexpected error: {e}" + + is_valid, message = test_authentication(your_api_key) + print(f"Authentication test: {message}") + ``` + +### 403 - Forbidden (`PermissionDeniedError`) + +**When it occurs**: +- Valid API key but insufficient permissions +- No access to specific models or benchmarks +- Organization/project access issues + +**Example error**: +```python +import atlas +from atlas import Atlas + +try: + client = Atlas() + evaluation = client.evaluations.create(model="restricted-model", benchmark="mmlu") +except atlas.PermissionDeniedError as e: + print(f"Permission denied: {e}") + print(f"Status code: {e.status_code}") # 403 + print(f"Response body: {e.body}") +``` + +**Solutions**: +1. **Check organization and project IDs**: + ```python + import os + + print(f"Organization ID: {os.getenv('LAYERLENS_ATLAS_ORG_ID')}") + print(f"Project ID: {os.getenv('LAYERLENS_ATLAS_PROJECT_ID')}") + + # Verify these match your Atlas dashboard settings + ``` + +2. 
**Test access to different resources**: + ```python + def test_resource_access(models, benchmarks): + client = Atlas() + access_matrix = {} + + for model in models: + access_matrix[model] = {} + for benchmark in benchmarks: + try: + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + access_matrix[model][benchmark] = "✅ Access granted" + except atlas.PermissionDeniedError: + access_matrix[model][benchmark] = "❌ Permission denied" + except atlas.NotFoundError: + access_matrix[model][benchmark] = "❓ Resource not found" + except Exception as e: + access_matrix[model][benchmark] = f"❓ {type(e).__name__}" + + return access_matrix + + # Test common resources + models = ["gpt-3.5-turbo", "gpt-4", "claude-3-sonnet"] + benchmarks = ["mmlu", "hellaswag", "arc-easy"] + + access = test_resource_access(models, benchmarks) + ``` + +3. **Contact administrator for access**: + - Request access to specific models or benchmarks + - Verify project membership + - Check organization-level permissions + +### 404 - Not Found (`NotFoundError`) + +**When it occurs**: +- Model ID doesn't exist +- Benchmark ID doesn't exist +- Evaluation ID not found (for results) +- Resource doesn't exist in your organization + +**Example error**: +```python +import atlas +from atlas import Atlas + +try: + client = Atlas() + evaluation = client.evaluations.create(model="nonexistent-model", benchmark="mmlu") +except atlas.NotFoundError as e: + print(f"Resource not found: {e}") + print(f"Status code: {e.status_code}") # 404 +``` + +**Solutions**: +1. 
**Verify resource names**: + ```python + def find_available_models(): + """Try common model names to find available ones""" + client = Atlas() + + common_models = [ + "gpt-4", "gpt-3.5-turbo", "gpt-4-turbo", + "claude-3-opus", "claude-3-sonnet", "claude-3-haiku", + "llama-2-70b", "llama-2-13b", "mistral-7b" + ] + + available_models = [] + + for model in common_models: + try: + # Test with common benchmark + evaluation = client.evaluations.create(model=model, benchmark="mmlu") + if evaluation: + available_models.append(model) + except atlas.NotFoundError: + # Model or benchmark not found + continue + except atlas.PermissionDeniedError: + # Model exists but no permission + available_models.append(f"{model} (no permission)") + except Exception: + # Other errors - model might exist + available_models.append(f"{model} (unknown status)") + + return available_models + + available = find_available_models() + print(f"Available models: {available}") + ``` + +2. **Check spelling and case sensitivity**: + ```python + # Common mistakes + correct_names = { + "GPT-4": "gpt-4", + "GPT4": "gpt-4", + "MMLU": "mmlu", + "HellaSwag": "hellaswag", + "arc_challenge": "arc-challenge" # Underscore vs hyphen + } + ``` + +3. **Use exact names from Atlas dashboard**: + - Log into Atlas dashboard + - Check available models and benchmarks + - Copy exact names (case-sensitive) + +### 409 - Conflict (`ConflictError`) + +**When it occurs**: +- Resource already exists +- Conflicting operation in progress +- State conflict (e.g., trying to modify completed evaluation) + +**Example error**: +```python +import atlas +from atlas import Atlas + +try: + client = Atlas() + # Some operation that conflicts with current state + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.ConflictError as e: + print(f"Conflict error: {e}") + print(f"Status code: {e.status_code}") # 409 +``` + +**Solutions**: +1. **Check current resource state** +2. 
**Wait for ongoing operations to complete** +3. **Use different resource identifiers** + +### 422 - Unprocessable Entity (`UnprocessableEntityError`) + +**When it occurs**: +- Valid request format but business logic prevents processing +- Parameter combinations that don't make sense +- Resource constraints exceeded + +**Example error**: +```python +import atlas +from atlas import Atlas + +try: + client = Atlas() + evaluation = client.evaluations.create(model="gpt-4", benchmark="invalid-benchmark") +except atlas.UnprocessableEntityError as e: + print(f"Unprocessable entity: {e}") + print(f"Status code: {e.status_code}") # 422 + print(f"Response details: {e.body}") +``` + +**Solutions**: +1. **Check business logic constraints** +2. **Verify parameter combinations are valid** +3. **Review API documentation for limitations** + +### 429 - Rate Limited (`RateLimitError`) + +**When it occurs**: +- Too many requests in short time period +- API rate limits exceeded +- Organization-level quotas reached + +**Example error**: +```python +import atlas +from atlas import Atlas + +try: + client = Atlas() + + # Making too many requests quickly + for i in range(100): + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") + +except atlas.RateLimitError as e: + print(f"Rate limited: {e}") + print(f"Status code: {e.status_code}") # 429 + print(f"Retry after: {e.response.headers.get('retry-after', 'not specified')}") +``` + +**Solutions**: +1. **Implement retry with backoff**: + ```python + import time + import atlas + from atlas import Atlas + + def create_evaluation_with_rate_limit_handling(model, benchmark, max_retries=3): + client = Atlas() + + for attempt in range(max_retries): + try: + return client.evaluations.create(model=model, benchmark=benchmark) + + except atlas.RateLimitError as e: + retry_after = e.response.headers.get('retry-after') + + if retry_after: + wait_time = int(retry_after) + print(f"Rate limited. 
Waiting {wait_time}s as requested...") + else: + wait_time = (2 ** attempt) * 60 # Exponential backoff + print(f"Rate limited. Waiting {wait_time}s...") + + if attempt < max_retries - 1: + time.sleep(wait_time) + else: + raise # Re-raise on final attempt + + return None + + evaluation = create_evaluation_with_rate_limit_handling("gpt-4", "mmlu") + ``` + +2. **Add delays between requests**: + ```python + import time + from atlas import Atlas + + client = Atlas() + evaluations = [] + models = ["gpt-4", "claude-3-opus", "llama-2-70b"] + + for model in models: + evaluation = client.evaluations.create(model=model, benchmark="mmlu") + evaluations.append(evaluation) + + # Wait between requests to avoid rate limits + time.sleep(2) # 2-second delay + ``` + +3. **Monitor rate limit headers**: + ```python + def monitor_rate_limits(client): + """Monitor rate limit status""" + # This would require SDK modification to expose headers + # Check with LayerLens documentation for rate limit details + pass + ``` + +### 500+ - Server Errors (`InternalServerError`) + +**When it occurs**: +- Atlas API server errors +- Temporary service unavailability +- Infrastructure issues + +**Example error**: +```python +import atlas +from atlas import Atlas + +try: + client = Atlas() + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.InternalServerError as e: + print(f"Server error: {e}") + print(f"Status code: {e.status_code}") # 500, 502, 503, etc. + print(f"Request ID: {e.request_id}") # Include in support requests +``` + +**Solutions**: +1. 
**Implement retry logic**: + ```python + import random + import time + import atlas + from atlas import Atlas + + def create_evaluation_with_server_error_handling(model, benchmark): + client = Atlas() + max_retries = 3 + base_delay = 5 # seconds + + for attempt in range(max_retries): + try: + return client.evaluations.create(model=model, benchmark=benchmark) + + except atlas.InternalServerError as e: + print(f"Server error on attempt {attempt + 1}: {e}") + + if attempt < max_retries - 1: + # Exponential backoff with jitter + delay = base_delay * (2 ** attempt) + random.uniform(0, 2) + print(f"Retrying in {delay:.1f}s...") + time.sleep(delay) + else: + print(f"All {max_retries} attempts failed. Request ID: {e.request_id}") + raise + + return None + ``` + +2. **Check service status**: + - Visit LayerLens status page + - Check for ongoing incidents + - Monitor Atlas service announcements + +3. **Report persistent issues**: + - Include request ID from error + - Provide timestamp and error details + - Contact LayerLens support + +## Connection Errors + +### `APIConnectionError` + +**When it occurs**: +- Network connectivity issues +- DNS resolution failures +- Firewall blocking requests +- Proxy configuration problems + +**Example**: +```python +import atlas +from atlas import Atlas + +try: + client = Atlas(timeout=10.0) + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.APIConnectionError as e: + print(f"Connection error: {e}") + print(f"Request URL: {e.request.url}") +``` + +**Solutions**: +1. **Test basic connectivity**: + ```bash + ping api.layerlens.com + curl -I https://api.layerlens.com + ``` + +2. **Check proxy/firewall settings** +3. 
**Verify DNS resolution** + +### `APITimeoutError` + +**When it occurs**: +- Request takes longer than configured timeout +- Network latency issues +- Server processing delays + +**Example**: +```python +import atlas +from atlas import Atlas + +try: + client = Atlas(timeout=30.0) # 30-second timeout + evaluation = client.evaluations.create(model="gpt-4", benchmark="mmlu") +except atlas.APITimeoutError as e: + print(f"Request timed out: {e}") +``` + +**Solutions**: +1. **Increase timeout**: + ```python + client = Atlas(timeout=600.0) # 10 minutes + ``` + +2. **Use appropriate timeouts for operation type**: + ```python + # Quick operations + quick_client = Atlas(timeout=60.0) + + # Long-running evaluations + patient_client = Atlas(timeout=1800.0) # 30 minutes + ``` + +## Error Handling Best Practices + +### Comprehensive Error Handling + +```python +import atlas +from atlas import Atlas +import time +import logging + +logging.basicConfig(level=logging.INFO) +logger = logging.getLogger(__name__) + +def robust_create_evaluation(model: str, benchmark: str): + """Create evaluation with comprehensive error handling""" + client = Atlas() + + try: + evaluation = client.evaluations.create(model=model, benchmark=benchmark) + + if evaluation: + logger.info(f"✅ Evaluation created: {evaluation.id}") + return evaluation + else: + logger.warning("⚠️ Evaluation creation returned None") + return None + + except atlas.BadRequestError as e: + logger.error(f"❌ Bad request - check parameters: {e}") + logger.error(f" Model: '{model}', Benchmark: '{benchmark}'") + return None + + except atlas.AuthenticationError as e: + logger.error(f"❌ Authentication failed: {e}") + logger.error(" Check API key configuration") + return None + + except atlas.PermissionDeniedError as e: + logger.error(f"❌ Permission denied: {e}") + logger.error(f" No access to model '{model}' or benchmark '{benchmark}'") + return None + + except atlas.NotFoundError as e: + logger.error(f"❌ Resource not found: {e}") + 
logger.error(f" Model '{model}' or benchmark '{benchmark}' doesn't exist") + return None + + except atlas.RateLimitError as e: + retry_after = e.response.headers.get('retry-after', 60) + logger.warning(f"⏳ Rate limited - retry after {retry_after}s") + return None # Could implement retry logic here + + except atlas.InternalServerError as e: + logger.error(f"❌ Server error: {e}") + logger.error(f" Request ID: {e.request_id} (include in support requests)") + return None + + except atlas.APITimeoutError as e: + logger.error(f"⏰ Request timed out: {e}") + logger.error(" Consider increasing timeout or checking network") + return None + + except atlas.APIConnectionError as e: + logger.error(f"🔌 Connection error: {e}") + logger.error(" Check network connectivity and proxy settings") + return None + + except atlas.APIError as e: + logger.error(f"❌ Unexpected API error: {e}") + logger.error(f" Type: {type(e).__name__}") + return None + + except Exception as e: + logger.error(f"❌ Unexpected error: {e}") + logger.error(f" Type: {type(e).__name__}") + return None + +# Usage +evaluation = robust_create_evaluation("gpt-4", "mmlu") +``` + +### Error Recovery Patterns + +```python +import atlas +from atlas import Atlas +import time +import random + +class AtlasErrorRecovery: + """Implement various error recovery patterns""" + + def __init__(self, client: Atlas): + self.client = client + + def exponential_backoff_retry(self, operation, max_retries=3, base_delay=1): + """Retry with exponential backoff""" + for attempt in range(max_retries): + try: + return operation() + except (atlas.InternalServerError, atlas.APIConnectionError, atlas.APITimeoutError) as e: + if attempt == max_retries - 1: + raise # Last attempt - re-raise the error + + delay = base_delay * (2 ** attempt) + random.uniform(0, 1) + print(f"Attempt {attempt + 1} failed: {e}") + print(f"Retrying in {delay:.1f}s...") + time.sleep(delay) + + def circuit_breaker(self, operation, failure_threshold=5, recovery_time=60): + 
"""Implement circuit breaker pattern""" + # This would be a more complex implementation + # See advanced-usage.md for full implementation + pass + + def fallback_strategy(self, primary_operation, fallback_operation): + """Try primary operation, fall back to alternative""" + try: + return primary_operation() + except atlas.APIError as e: + print(f"Primary operation failed: {e}") + print("Trying fallback...") + return fallback_operation() + +# Usage +client = Atlas() +recovery = AtlasErrorRecovery(client) + +def create_evaluation(): + return client.evaluations.create(model="gpt-4", benchmark="mmlu") + +# Retry with exponential backoff +evaluation = recovery.exponential_backoff_retry(create_evaluation) +```
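
The `circuit_breaker` method above is left as a stub. As a rough illustration of the pattern, here is a minimal, SDK-independent sketch. The `CircuitBreaker` and `CircuitOpenError` names, the `retry_on` parameter, and the injectable `clock` are all hypothetical and not part of the Atlas SDK. After `failure_threshold` consecutive failures the breaker rejects calls for `recovery_time` seconds, then lets a single trial call through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical helper for illustration; not part of the Atlas SDK."""

    def __init__(self, failure_threshold=5, recovery_time=60.0,
                 retry_on=(Exception,), clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time
        self.retry_on = retry_on      # exception types that count as failures
        self.clock = clock            # injectable so tests can control time
        self.failures = 0
        self.opened_at = None         # None means the circuit is closed

    def call(self, operation):
        """Run operation() unless the circuit is currently open."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_time:
                raise CircuitOpenError("circuit open; skipping call")
            # Recovery window has elapsed: allow one trial call (half-open).
        try:
            result = operation()
        except self.retry_on:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

With the SDK, one would presumably construct the breaker with `retry_on=(atlas.InternalServerError, atlas.APIConnectionError, atlas.APITimeoutError)` and pass a function such as `create_evaluation` from the usage example to `call()`, so that client errors like a 400 or 404 do not trip the circuit.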