Describe the bug
The score column of the evaluation_run table stores JSON in two different schemas:
- Newer schema: {"traces": [{"scores": [{"name": "...", "value": ...}]}]}
- Older legacy schema: {"cosine_similarity": {"avg": ..., "per_item_scores": [...]}}
This causes frontend rendering issues. In addition, the older schema contains only cosine similarity scores and no judge scores,
while the newer schema contains both.
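To find out which rows are affected, something like the following could classify a decoded score payload. This is a minimal sketch, not code from the repository: the function name detect_score_schema and the "unknown" fallback are hypothetical, and it assumes the only distinguishing keys are the ones shown in this issue ("traces" for the newer schema, "cosine_similarity" with "per_item_scores" for the older one).

```python
from typing import Any


def detect_score_schema(score: Any) -> str:
    """Classify a decoded score payload as 'traces', 'legacy', or 'unknown'.

    Hypothetical helper: the key names come from the two examples in this
    issue; anything else is reported as 'unknown' rather than guessed at.
    """
    if not isinstance(score, dict):
        return "unknown"
    # Newer schema: top-level "traces" list.
    if isinstance(score.get("traces"), list):
        return "traces"
    # Older schema: top-level "cosine_similarity" object with per-item scores.
    legacy = score.get("cosine_similarity")
    if isinstance(legacy, dict) and "per_item_scores" in legacy:
        return "legacy"
    return "unknown"
```

Running this over the decoded score values of existing rows would give a count per schema, which may help confirm whether the older shape is still being written or only exists in old rows.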
To Reproduce
Occurs sporadically. It is likely related to the order in which cosine similarity scores and judge scores are fetched and merged.
Expected behavior
Single consistent schema for the score column across all evaluation runs.
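One possible way to get there is to normalize the older payloads into the newer shape on read or in a one-off migration. The sketch below is only an illustration under stated assumptions: normalize_score is a hypothetical name, the "trace_id" key on each trace is an assumption (the newer example in this issue does not show it), and the aggregate fields (avg, std, total_pairs) are not carried over because the newer example does not show where they would live.

```python
from typing import Any, Dict


def normalize_score(score: Dict[str, Any]) -> Dict[str, Any]:
    """Return the payload in the newer {"traces": [...]} shape.

    Hypothetical sketch: payloads already in the newer shape are returned
    unchanged; legacy payloads are converted per item.
    """
    if isinstance(score.get("traces"), list):
        return score

    legacy = score.get("cosine_similarity")
    if not isinstance(legacy, dict):
        raise ValueError("unrecognized score payload")

    traces = []
    for item in legacy.get("per_item_scores", []):
        traces.append(
            {
                # Assumption: traces in the newer schema can carry a trace_id.
                "trace_id": item.get("trace_id"),
                "scores": [
                    {
                        "name": "cosine_similarity",
                        "value": item.get("cosine_similarity"),
                    }
                ],
            }
        )
    # Note: avg, std, and total_pairs are dropped in this sketch.
    return {"traces": traces}
```

This only unifies the shape. It does not add the missing judge scores to older rows, so those would still need to be fetched and merged separately.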
Additional context
Older schema example:
{
  "cosine_similarity": {
    "avg": 0.6414,
    "std": 0.0800,
    "total_pairs": 24,
    "per_item_scores": [
      {
        "trace_id": "9b80f66b-...",
        "cosine_similarity": 0.7379
      }
    ]
  }
}
Newer schema example:
{
  "traces": [
    {
      "scores": [
        {
          "name": "SNEHA correctness",
          "value": ...
        }
      ]
    }
  ]
}
Screenshots
