Is your feature request related to a problem? Please describe.
Teams using Kaapi to generate responses with AI need a repeatable, comparable, and auditable way to evaluate LLM answer quality against a golden Q&A set. Today this process is manual and scattered across tools, making it hard to:
(a) run consistent experiments (prompt/model/temperature/vector store),
(b) quantify flakiness, and
(c) view per-item traces alongside aggregate scores. We need an automated pipeline that runs on a golden dataset, generates answers in batch, and computes scores that show how response quality changes with configuration.
Describe the solution you'd like
Build an end-to-end evaluation flow in Kaapi that:
- Accepts a golden CSV of questions and answers, duplicating each question N=5 times to measure flakiness
- Uses an /evaluate flow that reads a config (assistant settings) and starts the evaluation
- Consolidates results (question, generated_output, ground_truth), archives them to S3, generates a score for each question–answer pair, and persists it in Kaapi's DB (a minimal sketch of this pipeline follows the list)
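As a rough illustration of the intended flow (not the final design), here is a minimal Python sketch. The config fields, function names (`generate_answer`, `score_answer`), and CSV column names (`question`, `ground_truth`) are assumptions for illustration, not Kaapi's actual schema or API.

```python
import csv
from pathlib import Path

# Hypothetical assistant config (prompt/model/temperature/vector store).
# Field names are placeholders, not Kaapi's real settings schema.
CONFIG = {
    "prompt": "Answer using only the provided context.",
    "model": "gpt-4o",
    "temperature": 0.2,
    "vector_store_id": "vs_golden_docs",
}

N_REPEATS = 5  # duplicate each question to measure flakiness


def generate_answer(question: str, config: dict) -> str:
    """Placeholder for the batch-generation step (the real assistant call is not shown)."""
    return f"[generated answer for: {question}]"


def score_answer(generated: str, ground_truth: str) -> float:
    """Placeholder scorer; a real pipeline would use an LLM judge or a similarity metric."""
    return float(generated.strip().lower() == ground_truth.strip().lower())


def run_evaluation(golden_csv: Path, results_csv: Path, config: dict) -> None:
    """Read the golden set, repeat each question N times, generate and score answers,
    and consolidate results into one CSV (S3 archival and DB persistence would follow)."""
    rows = []
    with golden_csv.open(newline="") as f:
        for item in csv.DictReader(f):  # expects columns: question, ground_truth
            for run in range(N_REPEATS):
                generated = generate_answer(item["question"], config)
                rows.append({
                    "question": item["question"],
                    "run": run,
                    "generated_output": generated,
                    "ground_truth": item["ground_truth"],
                    "score": score_answer(generated, item["ground_truth"]),
                })

    if rows:
        with results_csv.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)


if __name__ == "__main__":
    run_evaluation(Path("golden_qa.csv"), Path("eval_results.csv"), CONFIG)
```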
Reference
Solution Doc