Commit 5be0bdc

Evaluation (#405)
* first stab at setting up langfuse evaluation
* bringing traces to sync
* cleanup
* getting it up and running with sync
* using utils
* cleanups
* added logs
* code refactoring
* cleanups
* moving to separate files
* using pydantic types
* Remove project_id dependency from evaluation endpoints
  - Remove project_id parameter from /evaluate endpoint
  - Update get_provider_credential calls to not require project_id
  - Credentials now retrieved via API key authentication
  - Clean up logging configuration and imports
  - Fix linting errors and update type annotations
* using hardcoded
* adding endpoint for uploading dataset
* added steps for starting evaluation using batchAPI
* added testcase
* using celery beat and evaluation batch
* first stab at running evaluation
* cleaning up traces in langfuse
* cleanup unnecessary code
* syncing with master changes
* moving to batch table
* checking out AWS
* cleanup migration
* added support for cosine similarity score
* first stab at pushing cosine to langfuse
* cleanup logs
* optimizing similarity
* added evaluation dataset
* update endpoints
* updated testcases
* using single migration file
* code cleanups
* few more cleanups and tests
* added support for sanitizing dataset name
* fix import issues in testcases
* fixing imports
* minor cleanups for evaluation
* passing project id as well
* updated testcases and error codes
* using util for file uploads
* optimizing cosine similarities
* added support for duplication factor limit
* cleanup for dataset id in evaluation
* file validations
* refactoring file structure
* Evaluation: Add cron job endpoint and script for periodic evaluation processing (#428)
* add cron job endpoint and script for periodic evaluation processing
* add cron job endpoint and update invocation script for periodic evaluation processing
* refactor: remove redundant dependency from evaluation cron job
* minor fixes
* cleanup cruds
* removed celery beat
* cleanup evaluation run update and context runs
* cleanup logs
* using response id
* type checking for clean code
* cleaner documentation
* added indexes
* removing unnecessary asyncs
* using get_langfuse_client instead
* update migration head
* refactoring and cleanups
* cleanup cron
* moving to env for cron
* formatting code
* updated endpoints

---------

Co-authored-by: Kartikeya Pophali <kartikeyapophali@gmail.com>
1 parent: 04103ad

43 files changed: 7,418 additions & 31 deletions


.env.example

Lines changed: 6 additions & 0 deletions
```diff
@@ -23,6 +23,12 @@ FIRST_SUPERUSER=superuser@example.com
 FIRST_SUPERUSER_PASSWORD=changethis
 EMAIL_TEST_USER="test@example.com"
 
+# API Base URL for cron scripts (defaults to http://localhost:8000 if not set)
+API_BASE_URL=http://localhost:8000
+
+# Cron interval in minutes (defaults to 5 minutes if not set)
+CRON_INTERVAL_MINUTES=5
+
 # Postgres
 POSTGRES_SERVER=localhost
 POSTGRES_PORT=5432
```
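The cron invocation script reads these values at startup. Below is a minimal sketch of how they might be consumed, assuming a `requests`-based script and a hypothetical processing route; the repo's actual script and endpoint path may differ:

```python
import os

import requests  # assumed HTTP client; the repo's actual script may differ

# Defaults mirror the comments in .env.example above.
API_BASE_URL = os.getenv("API_BASE_URL", "http://localhost:8000")
CRON_INTERVAL_MINUTES = int(os.getenv("CRON_INTERVAL_MINUTES", "5"))


def trigger_periodic_processing() -> None:
    """Call the (hypothetical) cron endpoint that advances pending evaluations."""
    resp = requests.post(f"{API_BASE_URL}/api/v1/evaluations/process", timeout=30)
    resp.raise_for_status()


if __name__ == "__main__":
    # CRON_INTERVAL_MINUTES is typically consumed by the crontab entry that
    # schedules this script (e.g. "*/5 * * * *"), not by the script body itself.
    trigger_periodic_processing()
```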
Lines changed: 249 additions & 0 deletions
```python
"""create_evaluation_run_table, batch_job_table, and evaluation_dataset_table

Revision ID: 6fe772038a5a
Revises: 219033c644de
Create Date: 2025-11-05 22:47:18.266070

"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
import sqlmodel.sql.sqltypes


# revision identifiers, used by Alembic.
revision = "6fe772038a5a"
down_revision = "219033c644de"
branch_labels = None
depends_on = None


def upgrade():
    # Create batch_job table first (as evaluation_run will reference it)
    op.create_table(
        "batch_job",
        sa.Column("id", sa.Integer(), nullable=False),
        sa.Column(
            "provider",
            sa.String(),
            nullable=False,
            comment="LLM provider name (e.g., 'openai', 'anthropic')",
        ),
        sa.Column(
            "job_type",
            sa.String(),
            nullable=False,
            comment="Type of batch job (e.g., 'evaluation', 'classification', 'embedding')",
        ),
        sa.Column(
            "config",
            postgresql.JSONB(astext_type=sa.Text()),
            nullable=False,
            server_default=sa.text("'{}'::jsonb"),
            comment="Complete batch configuration",
        ),
        sa.Column(
            "provider_batch_id",
            sa.String(),
            nullable=True,
            comment="Provider's batch job ID",
        ),
        sa.Column(
            "provider_file_id",
            sa.String(),
            nullable=True,
            comment="Provider's input file ID",
        ),
        sa.Column(
            "provider_output_file_id",
            sa.String(),
            nullable=True,
            comment="Provider's output file ID",
        ),
        sa.Column(
            "provider_status",
            sa.String(),
            nullable=True,
            comment="Provider-specific status (e.g., OpenAI: validating, in_progress, completed, failed)",
        ),
        sa.Column(
            "raw_output_url",
            sa.String(),
            nullable=True,
            comment="S3 URL of raw batch output file",
        ),
        sa.Column(
            "total_items",
            sa.Integer(),
            nullable=False,
            server_default=sa.text("0"),
            comment="Total number of items in the batch",
        ),
        sa.Column(
            "error_message",
            sa.Text(),
            nullable=True,
            comment="Error message if batch failed",
        ),
        sa.Column("organization_id", sa.Integer(), nullable=False),
        sa.Column("project_id", sa.Integer(), nullable=False),
        sa.Column("inserted_at", sa.DateTime(), nullable=False),
        sa.Column("updated_at", sa.DateTime(), nullable=False),
        sa.ForeignKeyConstraint(
            ["organization_id"], ["organization.id"], ondelete="CASCADE"
        ),
        sa.ForeignKeyConstraint(["project_id"], ["project.id"], ondelete="CASCADE"),
        sa.PrimaryKeyConstraint("id"),
    )
    op.create_index(
        op.f("ix_batch_job_job_type"), "batch_job", ["job_type"], unique=False
    )
    op.create_index(
        op.f("ix_batch_job_organization_id"),
        "batch_job",
        ["organization_id"],
        unique=False,
    )
    op.create_index(
        op.f("ix_batch_job_project_id"), "batch_job", ["project_id"], unique=False
    )
    op.create_index(
        "idx_batch_job_status_org",
        "batch_job",
        ["provider_status", "organization_id"],
        unique=False,
    )
    op.create_index(
        "idx_batch_job_status_project",
        "batch_job",
        ["provider_status", "project_id"],
        unique=False,
    )

    # Create evaluation_dataset table
    op.create_table(
        "evaluation_dataset",
        sa.Column("id", sa.Integer(), nullable=False),
        sa.Column("name", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
        sa.Column("description", sqlmodel.sql.sqltypes.AutoString(), nullable=True),
        sa.Column(
            "dataset_metadata",
            postgresql.JSONB(astext_type=sa.Text()),
            nullable=False,
            server_default=sa.text("'{}'::jsonb"),
        ),
        sa.Column(
            "object_store_url", sqlmodel.sql.sqltypes.AutoString(), nullable=True
        ),
        sa.Column(
            "langfuse_dataset_id",
            sqlmodel.sql.sqltypes.AutoString(),
            nullable=True,
        ),
        sa.Column("organization_id", sa.Integer(), nullable=False),
        sa.Column("project_id", sa.Integer(), nullable=False),
        sa.Column("inserted_at", sa.DateTime(), nullable=False),
        sa.Column("updated_at", sa.DateTime(), nullable=False),
        sa.ForeignKeyConstraint(
            ["organization_id"], ["organization.id"], ondelete="CASCADE"
        ),
        sa.ForeignKeyConstraint(["project_id"], ["project.id"], ondelete="CASCADE"),
        sa.PrimaryKeyConstraint("id"),
        sa.UniqueConstraint(
            "name",
            "organization_id",
            "project_id",
            name="uq_evaluation_dataset_name_org_project",
        ),
    )
    op.create_index(
        op.f("ix_evaluation_dataset_name"),
        "evaluation_dataset",
        ["name"],
        unique=False,
    )

    # Create evaluation_run table with all columns and foreign key references
    op.create_table(
        "evaluation_run",
        sa.Column("run_name", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
        sa.Column("dataset_name", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
        sa.Column("config", sa.JSON(), nullable=False),
        sa.Column("batch_job_id", sa.Integer(), nullable=True),
        sa.Column(
            "embedding_batch_job_id",
            sa.Integer(),
            nullable=True,
            comment="Reference to the batch_job for embedding-based similarity scoring",
        ),
        sa.Column("dataset_id", sa.Integer(), nullable=False),
        sa.Column("status", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
        sa.Column(
            "object_store_url", sqlmodel.sql.sqltypes.AutoString(), nullable=True
        ),
        sa.Column("total_items", sa.Integer(), nullable=False),
        sa.Column("score", sa.JSON(), nullable=True),
        sa.Column("error_message", sa.Text(), nullable=True),
        sa.Column("organization_id", sa.Integer(), nullable=False),
        sa.Column("project_id", sa.Integer(), nullable=False),
        sa.Column("id", sa.Integer(), nullable=False),
        sa.Column("inserted_at", sa.DateTime(), nullable=False),
        sa.Column("updated_at", sa.DateTime(), nullable=False),
        sa.ForeignKeyConstraint(
            ["batch_job_id"],
            ["batch_job.id"],
            ondelete="SET NULL",
        ),
        sa.ForeignKeyConstraint(
            ["embedding_batch_job_id"],
            ["batch_job.id"],
            name="fk_evaluation_run_embedding_batch_job_id",
            ondelete="SET NULL",
        ),
        sa.ForeignKeyConstraint(
            ["dataset_id"],
            ["evaluation_dataset.id"],
            name="fk_evaluation_run_dataset_id",
            ondelete="CASCADE",
        ),
        sa.ForeignKeyConstraint(
            ["organization_id"], ["organization.id"], ondelete="CASCADE"
        ),
        sa.ForeignKeyConstraint(["project_id"], ["project.id"], ondelete="CASCADE"),
        sa.PrimaryKeyConstraint("id"),
    )
    op.create_index(
        op.f("ix_evaluation_run_run_name"), "evaluation_run", ["run_name"], unique=False
    )
    op.create_index(
        "idx_eval_run_status_org",
        "evaluation_run",
        ["status", "organization_id"],
        unique=False,
    )
    op.create_index(
        "idx_eval_run_status_project",
        "evaluation_run",
        ["status", "project_id"],
        unique=False,
    )


def downgrade():
    # Drop evaluation_run table first (has foreign keys to batch_job and evaluation_dataset)
    op.drop_index("idx_eval_run_status_project", table_name="evaluation_run")
    op.drop_index("idx_eval_run_status_org", table_name="evaluation_run")
    op.drop_index(op.f("ix_evaluation_run_run_name"), table_name="evaluation_run")
    op.drop_table("evaluation_run")

    # Drop evaluation_dataset table
    op.drop_index(op.f("ix_evaluation_dataset_name"), table_name="evaluation_dataset")
    op.drop_table("evaluation_dataset")

    # Drop batch_job table
    op.drop_index("idx_batch_job_status_project", table_name="batch_job")
    op.drop_index("idx_batch_job_status_org", table_name="batch_job")
    op.drop_index(op.f("ix_batch_job_project_id"), table_name="batch_job")
    op.drop_index(op.f("ix_batch_job_organization_id"), table_name="batch_job")
    op.drop_table("batch_job")
```

backend/app/api/deps.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -70,7 +70,7 @@ def get_current_user(
     if not user:
         raise HTTPException(status_code=404, detail="User not found")
     if not user.is_active:
-        raise HTTPException(status_code=400, detail="Inactive user")
+        raise HTTPException(status_code=403, detail="Inactive user")
 
     return user  # Return only User object
 
```
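A hedged regression-test sketch for this status-code change; the route, app import, and fixture are assumptions based on a typical FastAPI layout, not the repo's actual test suite:

```python
from fastapi.testclient import TestClient

from app.main import app  # assumed application entry point


def test_inactive_user_is_forbidden(inactive_user_token_headers: dict) -> None:
    # Hypothetical fixture: yields auth headers for a user with is_active=False.
    client = TestClient(app)
    resp = client.get("/api/v1/users/me", headers=inactive_user_token_headers)
    assert resp.status_code == 403  # was 400 before this commit
    assert resp.json()["detail"] == "Inactive user"
```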

Lines changed: 80 additions & 0 deletions
Start an evaluation using the OpenAI Batch API.

This endpoint:

1. Fetches the dataset from the database and validates that it has a Langfuse dataset ID
2. Creates an EvaluationRun record in the database
3. Fetches dataset items from Langfuse
4. Builds JSONL for batch processing (config is used as-is)
5. Creates a batch job via the generic batch infrastructure
6. Returns the evaluation run details with batch_job_id

The batch is processed asynchronously by Celery Beat (every 60s). Use GET /evaluations/{evaluation_id} to check progress.

## Request Body

- **dataset_id** (required): ID of the evaluation dataset (from /evaluations/datasets)
- **experiment_name** (required): Name for this evaluation experiment/run
- **config** (optional): Configuration dict used as-is in JSONL generation. Can include any OpenAI Responses API parameters, such as:
  - model: str (e.g., "gpt-4o", "gpt-5")
  - instructions: str
  - tools: list (e.g., [{"type": "file_search", "vector_store_ids": [...]}])
  - reasoning: dict (e.g., {"effort": "low"})
  - text: dict (e.g., {"verbosity": "low"})
  - temperature: float
  - include: list (e.g., ["file_search_call.results"])
  - Note: "input" is added automatically from the dataset
- **assistant_id** (optional): Assistant ID to fetch configuration from. If provided, the configuration is fetched from the assistant in the database, and config can be passed as an empty dict {}.

## Example with config

```json
{
  "dataset_id": 123,
  "experiment_name": "test_run",
  "config": {
    "model": "gpt-4.1",
    "instructions": "You are a helpful FAQ assistant.",
    "tools": [
      {
        "type": "file_search",
        "vector_store_ids": ["vs_12345"],
        "max_num_results": 3
      }
    ],
    "include": ["file_search_call.results"]
  }
}
```

## Example with assistant_id

```json
{
  "dataset_id": 123,
  "experiment_name": "test_run",
  "config": {},
  "assistant_id": "asst_xyz"
}
```

## Returns

EvaluationRunPublic with batch details and status:

- id: Evaluation run ID
- run_name: Name of the evaluation run
- dataset_name: Name of the dataset used
- dataset_id: ID of the dataset used
- config: Configuration used for the evaluation
- batch_job_id: ID of the batch job processing this evaluation
- status: Current status (pending, running, completed, failed)
- total_items: Total number of items being evaluated
- completed_items: Number of items completed so far
- results: Evaluation results (when completed)
- error_message: Error message if failed

## Error Responses

- **404**: Dataset or assistant not found or not accessible
- **400**: Missing required credentials (OpenAI or Langfuse), dataset missing a Langfuse ID, or config missing required fields
- **500**: Failed to configure API clients or start the batch evaluation
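For reference, a minimal client-side sketch that starts a run and polls it; the route prefix and auth header are assumptions, not the documented contract:

```python
import requests

BASE = "http://localhost:8000/api/v1"  # assumed route prefix
HEADERS = {"X-API-KEY": "<project-api-key>"}  # auth scheme assumed

payload = {
    "dataset_id": 123,
    "experiment_name": "test_run",
    "config": {"model": "gpt-4.1", "instructions": "You are a helpful FAQ assistant."},
}

resp = requests.post(
    f"{BASE}/evaluations/evaluate", json=payload, headers=HEADERS, timeout=30
)
resp.raise_for_status()
run = resp.json()

# The batch is processed asynchronously; poll the run for progress.
status = requests.get(f"{BASE}/evaluations/{run['id']}", headers=HEADERS, timeout=30)
print(status.json()["status"])  # pending | running | completed | failed
```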
Lines changed: 18 additions & 0 deletions
Delete a dataset by ID.

This removes the dataset record from the database. The CSV file in the object store (if it exists) is kept for audit purposes, but the dataset is no longer available for creating new evaluations.

## Path Parameters

- **dataset_id**: ID of the dataset to delete

## Returns

Success message with deleted dataset details:

- message: Confirmation message
- dataset_id: ID of the deleted dataset

## Error Responses

- **404**: Dataset not found or not accessible to your organization/project
- **400**: Dataset cannot be deleted (e.g., it has active evaluation runs)
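A matching client sketch (same assumed prefix and auth as the example above):

```python
import requests

resp = requests.delete(
    "http://localhost:8000/api/v1/evaluations/datasets/123",  # path assumed
    headers={"X-API-KEY": "<project-api-key>"},  # auth scheme assumed
    timeout=30,
)
if resp.status_code == 400:
    # e.g., the dataset still has active evaluation runs
    print(resp.json()["detail"])
else:
    resp.raise_for_status()
    print(resp.json()["message"])
```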
Lines changed: 22 additions & 0 deletions
Get details of a specific dataset by ID.

Retrieves comprehensive information about a dataset, including its metadata, object store URL, and Langfuse integration details.

## Path Parameters

- **dataset_id**: ID of the dataset to retrieve

## Returns

DatasetUploadResponse with dataset details:

- dataset_id: Unique identifier for the dataset
- dataset_name: Name of the dataset (sanitized)
- total_items: Total number of items, including duplication
- original_items: Number of original items before duplication
- duplication_factor: Factor by which items were duplicated
- langfuse_dataset_id: ID of the dataset in Langfuse
- object_store_url: URL of the CSV file in object storage

## Error Responses

- **404**: Dataset not found or not accessible to your organization/project
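And retrieving the same record (assumed prefix and auth as above):

```python
import requests

resp = requests.get(
    "http://localhost:8000/api/v1/evaluations/datasets/123",  # path assumed
    headers={"X-API-KEY": "<project-api-key>"},
    timeout=30,
)
resp.raise_for_status()
ds = resp.json()
print(ds["dataset_name"], ds["total_items"], ds["langfuse_dataset_id"])
```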
