diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..b8e32cf
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,144 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Commands
+
+```bash
+npm run dev # Start dev server at http://localhost:3000
+npm run lint # Run ESLint
+npm run build # Production build
+npm run format # Format TS/TSX files with Prettier
+```
+
+## Architecture Overview
+
+### Tech Stack
+
+- **Framework**: Next.js 16.0.7 (App Router)
+- **React 19.2.0** + **TypeScript** (strict mode)
+- **Routing**: Next.js App Router (react-router-dom is listed as a dependency but currently unused)
+- **Styling**: Tailwind CSS 4.x + custom color system and styles defined in `/app/globals.css` for cases not supported by Tailwind
+- **Data Fetching**: Native Fetch API + SWR 2.3.6 (used selectively where required)
+- **Date/Time**: date-fns 4.1.0, date-fns-tz 3.2.0
+- **ESLint**: Used for maintaining code quality, enforcing consistent coding standards, and catching potential issues during development and build time
+
+### Directory Structure
+
+| Path | Purpose |
+| ------------------------------ | --------------------------------------------------------------------------------------------- |
+| `app/(auth)/` | Authentication-related routes (e.g., invite, verify flows) |
+| `app/(main)/` | Main application routes (dashboard-level features like datasets, evaluations, settings, etc.) |
+| `app/api/` | API route handlers (Next.js route handlers acting as the backend-for-frontend, or BFF, layer) |
+| `app/components/` | App-scoped components used within routes/pages |
+| `app/components/icons/` | Hand-authored React icon components |
+| `app/hooks/` | Custom React hooks specific to app features |
+| `app/lib/` | Core shared logic and utilities across the application |
+| `app/lib/context/` | React context providers (global state handling) |
+| `app/lib/store/` | State management logic (custom/global store) |
+| `app/lib/types/` | TypeScript type definitions (shared across modules) |
+| `app/lib/utils/` | Domain-specific utility modules (e.g., evaluation, guardrails) |
+| `app/lib/data/` | Static data and validators (e.g., guardrails validators) |
+| `app/lib/apiClient.ts` | Centralized API client for forwarding requests to the backend |
+| `app/lib/authCookie.ts` | Authentication cookie utilities (get/set/remove tokens) |
+| `app/lib/configFetchers.ts` | API fetchers related to configuration modules |
+| `app/lib/constants.ts` | Global constants used across the app |
+| `app/lib/guardrailsClient.ts` | Client-side API helpers for guardrails features |
+| `app/lib/models.ts` | Data models/interfaces for structured data handling |
+| `app/lib/navConfig.ts` | Navigation configuration (sidebar/menu structure) |
+| `app/lib/promptEditorUtils.ts` | Utility functions for prompt editor logic |
+| `app/lib/utils.ts` | General utility/helper functions |
+| `public/favicon.ico` | Application favicon |
+
+## Import Aliases
+
+[tsconfig.json](./tsconfig.json) sets paths: `{ "@/*": ["./*"] }`, so imports are resolved from the project root using the `@/` prefix. Use:
+
+```ts
+import { apiClient } from '@/app/lib/apiClient';
+import { Providers } from '@/app/components/providers';
+import { APP_NAME } from '@/app/lib/constants';
+```
+
+SVGs follow Next.js defaults: import them as static assets for `next/image`, or reference them from `/public`.
+
+## Routing & Role-Based Access
+
+Routing uses the **Next.js App Router** exclusively. Routes are organized via route groups:
+
+- `app/(auth)/` - unauthenticated flows (`/invite`, `/verify`)
+- `app/(main)/` - authenticated app surface (`/evaluations`, `/datasets`, `/configurations`, `/guardrails`, `/knowledge-base`, `/settings`, etc.)
+
+Role gating lives in `middleware.ts`, which reads a `kaapi_role` cookie that takes one of two values:
+
+- `user` - standard authenticated user
+- `superuser` - admin; required for `/settings/*`
+
+The cookie is issued server-side by [authCookie.ts](app/lib/authCookie.ts) after login or verification, based on `user.is_superuser`. Middleware classifies each request into one of:
+
+- `PUBLIC_ROUTES` - open to everyone (`/evaluations`, `/invite`, `/verify`, `/coming-soon/*`)
+- `GUEST_ONLY_ROUTES` - unauthenticated only (`/keystore`); authenticated users are redirected to `/evaluations`
+- `/settings/*` - superuser only
+- Everything else - any authenticated user
+
+There is no dynamic/custom role system; only the two static roles above.
+
+## Toast Notifications
+
+Toasts are managed via a React Context provider ([Toast.tsx](app/components/Toast.tsx)), mounted once in [Providers.tsx](app/components/providers/Providers.tsx). Consume them from any client component:
+
+```tsx
+import { useToast } from '@/app/components/Toast';
+// or the re-export: import { useToast } from '@/app/hooks/useToast';
+
+function MyComponent() {
+ const toast = useToast();
+
+ toast.success('Saved successfully'); // success toast
+ toast.error('Something went wrong'); // error toast
+ toast.warning('Heads up'); // warning toast
+ toast.info('FYI'); // info toast
+
+ // Optional: override the default 5000ms auto-dismiss
+ toast.success('Saved', 3000);
+
+ // Low-level API (type + duration)
+ toast.addToast('Custom message', 'success', 4000);
+}
+```
+
+## Authentication [AuthContext.tsx](app/lib/context/AuthContext.tsx)
+
+There is no `AuthService` class. Auth state is owned by a React Context provider (`AuthProvider`) mounted in [Providers.tsx](app/components/providers/Providers.tsx), and consumed via the `useAuth()` hook:
+
+```tsx
+import { useAuth } from '@/app/lib/context/AuthContext';
+
+function MyComponent() {
+ const {
+ isAuthenticated, isHydrated,
+ session, currentUser, googleProfile,
+ apiKeys, activeKey, addKey, removeKey, setKeys,
+ loginWithToken, logout,
+ } = useAuth();
+}
+```
+
+## App Context [AppContext.tsx](app/lib/context/AppContext.tsx)
+
+Sidebar state is managed via `AppProvider`, consumed with `useApp()`:
+
+```ts
+import { useApp } from '@/app/lib/context/AppContext';
+
+const { sidebarCollapsed, setSidebarCollapsed, toggleSidebar } = useApp();
+```
+
+## API Client & Error Handling
+
+The BFF layer uses [apiClient.ts](app/lib/apiClient.ts), which forwards requests from Next.js route handlers to the backend at `BACKEND_URL` (defaults to `http://localhost:8000`). Key patterns:
+
+- **Server-side (route handlers)**: Use `apiClient(request, endpoint, options)` — it relays `X-API-KEY` and `Cookie` headers automatically and returns `{ status, data, headers }`.
+- **Client-side**: Use `clientFetch(endpoint, options)` — handles token refresh on 401, dispatches `AUTH_EXPIRED_EVENT` when refresh fails, and throws with a message extracted from `error`, `message`, or `detail` fields in the response body.
+- **Error extraction**: `extractErrorMessage(body, fallback)` reads `body.error || body.message || body.detail` — follow this pattern when adding new API routes.
+- **Auth expiry**: On 401 with failed refresh, a `CustomEvent(AUTH_EXPIRED_EVENT)` is dispatched on `window`, which `AuthContext` listens to for automatic logout.
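
Note on the error-extraction convention documented in CLAUDE.md above: the `error` / `message` / `detail` lookup can be sketched roughly as below. This is only an illustration under stated assumptions. The real `extractErrorMessage` in `app/lib/apiClient.ts` is not part of this diff, so the name `extractErrorMessageSketch`, the `unknown` body type, and the string-only filtering are all hypothetical choices, not the repository's implementation.

```typescript
// Hypothetical sketch of the `error || message || detail` convention.
// Assumption: the response body is already-parsed JSON of unknown shape.
function extractErrorMessageSketch(body: unknown, fallback: string): string {
  // Non-object bodies (null, strings, numbers) carry no named fields.
  if (typeof body !== "object" || body === null) return fallback;

  const record = body as Record<string, unknown>;

  // Check the fields in the documented priority order and return the
  // first one that holds a non-empty string.
  for (const key of ["error", "message", "detail"]) {
    const value = record[key];
    if (typeof value === "string" && value.length > 0) return value;
  }

  return fallback;
}
```

A new route handler following this pattern would pass the parsed backend response plus a route-specific fallback such as `"Request failed"`. Unlike a bare `||` chain, this sketch skips non-string values, so a structured `detail` object falls through to the fallback instead of being returned.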
diff --git a/app/(main)/configurations/page.tsx b/app/(main)/configurations/page.tsx
index 2a68c20..a36ab48 100644
--- a/app/(main)/configurations/page.tsx
+++ b/app/(main)/configurations/page.tsx
@@ -13,7 +13,7 @@ import { colors } from "@/app/lib/colors";
import { usePaginatedList, useInfiniteScroll } from "@/app/hooks";
import ConfigCard from "@/app/components/ConfigCard";
import Loader, { LoaderBox } from "@/app/components/Loader";
-import { EvalJob } from "@/app/components/types";
+import { EvalJob } from "@/app/lib/types/evaluation";
import {
ConfigPublic,
ConfigVersionItems,
diff --git a/app/(main)/evaluations/[id]/page.tsx b/app/(main)/evaluations/[id]/page.tsx
index d2e583f..517c927 100644
--- a/app/(main)/evaluations/[id]/page.tsx
+++ b/app/(main)/evaluations/[id]/page.tsx
@@ -10,19 +10,21 @@ import { useRouter, useParams } from "next/navigation";
import { apiFetch } from "@/app/lib/apiClient";
import { useAuth } from "@/app/lib/context/AuthContext";
import { useApp } from "@/app/lib/context/AppContext";
-import {
+import type {
EvalJob,
AssistantConfig,
+ GroupedTraceItem,
+} from "@/app/lib/types/evaluation";
+import {
hasSummaryScores,
isNewScoreObjectV2,
getScoreObject,
normalizeToIndividualScores,
- GroupedTraceItem,
isGroupedFormat,
-} from "@/app/components/types";
+} from "@/app/lib/utils/evaluation";
import ConfigModal from "@/app/components/ConfigModal";
import Sidebar from "@/app/components/Sidebar";
-import DetailedResultsTable from "@/app/components/DetailedResultsTable";
+import DetailedResultsTable from "@/app/components/evaluations/DetailedResultsTable";
import { colors } from "@/app/lib/colors";
import { useToast } from "@/app/components/Toast";
import Loader from "@/app/components/Loader";
@@ -126,7 +128,6 @@ export default function EvaluationReport() {
if (isAuthenticated && jobId) fetchJobDetails();
}, [isAuthenticated, jobId, fetchJobDetails]);
- // Export grouped format CSV
const exportGroupedCSV = (traces: GroupedTraceItem[]) => {
if (!job) return;
try {
@@ -391,9 +392,9 @@ export default function EvaluationReport() {
>
-
+
- {/* Actions */}
-
+
setExportFormat("row")}
- className="inline-flex items-center gap-1.5 px-3 py-1.5 rounded-md text-xs font-medium transition-all cursor-pointer"
- style={{
- backgroundColor:
- exportFormat === "row"
- ? colors.bg.primary
- : "transparent",
- color:
- exportFormat === "row"
- ? colors.text.primary
- : colors.text.primary,
- boxShadow:
- exportFormat === "row"
- ? "0 1px 2px rgba(0,0,0,0.08)"
- : "none",
- border:
- exportFormat === "row"
- ? `1px solid ${colors.border}`
- : "1px solid transparent",
- }}
- onMouseEnter={(e) => {
- if (exportFormat !== "row") {
- e.currentTarget.style.backgroundColor =
- "rgba(0,0,0,0.04)";
- e.currentTarget.style.boxShadow =
- "0 0 0 1px rgba(0,0,0,0.06)";
- }
- }}
- onMouseLeave={(e) => {
- if (exportFormat !== "row") {
- e.currentTarget.style.backgroundColor = "transparent";
- e.currentTarget.style.boxShadow = "none";
- }
- }}
+ data-selected={exportFormat === "row"}
+ className="inline-flex items-center gap-1.5 px-3 py-1.5 rounded-md text-xs font-medium transition-all cursor-pointer border border-transparent text-text-primary hover:bg-black/4 hover:shadow-[0_0_0_1px_rgba(0,0,0,0.06)] data-[selected=true]:bg-bg-primary data-[selected=true]:border-border data-[selected=true]:shadow-[0_1px_2px_rgba(0,0,0,0.08)] data-[selected=true]:hover:bg-bg-primary data-[selected=true]:hover:shadow-[0_1px_2px_rgba(0,0,0,0.08)]"
>
-
+
Individual Rows
setExportFormat("grouped")}
- className="inline-flex items-center gap-1.5 px-3 py-1.5 rounded-md text-xs font-medium transition-all cursor-pointer"
- style={{
- backgroundColor:
- exportFormat === "grouped"
- ? colors.bg.primary
- : "transparent",
- color:
- exportFormat === "grouped"
- ? colors.text.primary
- : colors.text.primary,
- boxShadow:
- exportFormat === "grouped"
- ? "0 1px 2px rgba(0,0,0,0.08)"
- : "none",
- border:
- exportFormat === "grouped"
- ? `1px solid ${colors.border}`
- : "1px solid transparent",
- }}
- onMouseEnter={(e) => {
- if (exportFormat !== "grouped") {
- e.currentTarget.style.backgroundColor =
- "rgba(0,0,0,0.04)";
- e.currentTarget.style.boxShadow =
- "0 0 0 1px rgba(0,0,0,0.06)";
- }
- }}
- onMouseLeave={(e) => {
- if (exportFormat !== "grouped") {
- e.currentTarget.style.backgroundColor = "transparent";
- e.currentTarget.style.boxShadow = "none";
- }
- }}
+ data-selected={exportFormat === "grouped"}
+ className="inline-flex items-center gap-1.5 px-3 py-1.5 rounded-md text-xs font-medium transition-all cursor-pointer border border-transparent text-text-primary hover:bg-black/4 hover:shadow-[0_0_0_1px_rgba(0,0,0,0.06)] data-[selected=true]:bg-bg-primary data-[selected=true]:border-border data-[selected=true]:shadow-[0_1px_2px_rgba(0,0,0,0.08)] data-[selected=true]:hover:bg-bg-primary data-[selected=true]:hover:shadow-[0_1px_2px_rgba(0,0,0,0.08)]"
>
-
+
Group by Questions
setIsConfigModalOpen(true)}
- className="px-3 py-1.5 rounded-md text-xs font-medium border"
- style={{
- backgroundColor: "transparent",
- borderColor: colors.border,
- color: colors.text.primary,
- }}
+ className="px-3 py-1.5 rounded-md text-xs font-medium border bg-transparent border-border text-text-primary"
>
View Config
diff --git a/app/(main)/evaluations/page.tsx b/app/(main)/evaluations/page.tsx
index d7900f3..13ca97c 100644
--- a/app/(main)/evaluations/page.tsx
+++ b/app/(main)/evaluations/page.tsx
@@ -49,12 +49,8 @@ function SimplifiedEvalContent() {
const [duplicationFactor, setDuplicationFactor] = useState("1");
const [uploadedFile, setUploadedFile] = useState
(null);
const [isUploading, setIsUploading] = useState(false);
-
- // Stored datasets
const [storedDatasets, setStoredDatasets] = useState([]);
const [isDatasetsLoading, setIsDatasetsLoading] = useState(false);
-
- // Evaluation config state
const [selectedDatasetId, setSelectedDatasetId] = useState(() => {
return searchParams.get("dataset") || "";
});
@@ -235,6 +231,10 @@ function SimplifiedEvalContent() {
});
setIsEvaluating(false);
+ setExperimentName("");
+ setSelectedDatasetId("");
+ setSelectedConfigId("");
+ setSelectedConfigVersion(0);
toast.success(`Evaluation created!`);
return true;
} catch (error: unknown) {
diff --git a/app/components/CodeBlock.tsx b/app/components/CodeBlock.tsx
new file mode 100644
index 0000000..e76d9a3
--- /dev/null
+++ b/app/components/CodeBlock.tsx
@@ -0,0 +1,13 @@
+import type { ReactNode } from "react";
+
+interface CodeBlockProps {
+ children: ReactNode;
+}
+
+export default function CodeBlock({ children }: CodeBlockProps) {
+ return (
+
+ {children}
+
+ );
+}
diff --git a/app/components/ConfigModal.tsx b/app/components/ConfigModal.tsx
index 817b24f..0f3f412 100644
--- a/app/components/ConfigModal.tsx
+++ b/app/components/ConfigModal.tsx
@@ -7,7 +7,11 @@
import React, { useState, useEffect } from "react";
import { colors } from "@/app/lib/colors";
-import { EvalJob, AssistantConfig } from "./types";
+import CopyableCodeBlock from "@/app/components/CopyableCodeBlock";
+import CodeBlock from "@/app/components/CodeBlock";
+import Tag from "@/app/components/Tag";
+import { CloseIcon } from "@/app/components/icons";
+import { EvalJob, AssistantConfig } from "@/app/lib/types/evaluation";
import { useAuth } from "@/app/lib/context/AuthContext";
import { apiFetch } from "@/app/lib/apiClient";
import {
@@ -35,6 +39,24 @@ interface ConfigVersionInfo {
knowledge_base_ids?: string[];
}
+const ConfigField = ({
+ label,
+ children,
+}: {
+ label: string;
+ children: React.ReactNode;
+}) => (
+
+
+ {label}
+
+ {children}
+
+);
+
export default function ConfigModal({
isOpen,
onClose,
@@ -80,15 +102,14 @@ export default function ConfigModal({
const params: CompletionParams =
blob?.completion?.params || ({} as CompletionParams);
- // Extract knowledge base IDs from multiple sources
const knowledgeBaseIds: string[] = [];
- // 1. Check direct params.knowledge_base_ids
+ // Check direct params.knowledge_base_ids
if (Array.isArray(params.knowledge_base_ids)) {
knowledgeBaseIds.push(...params.knowledge_base_ids);
}
- // 2. Check tools array for knowledge_base_ids
+ // Check tools array for knowledge_base_ids
if (params.tools) {
const toolKbIds = params.tools
.filter(
@@ -100,7 +121,6 @@ export default function ConfigModal({
knowledgeBaseIds.push(...toolKbIds);
}
- // Remove duplicates
const uniqueKbIds = [...new Set(knowledgeBaseIds)];
setConfigVersionInfo({
@@ -128,51 +148,9 @@ export default function ConfigModal({
if (!isOpen) return null;
- const ConfigField = ({
- label,
- children,
- }: {
- label: string;
- children: React.ReactNode;
- }) => (
-
-
- {label}
-
- {children}
-
- );
-
- const CodeBlock = ({ children }: { children: React.ReactNode }) => (
-
- {children}
-
- );
-
- const Tag = ({ children }: { children: React.ReactNode }) => (
-
- {children}
-
- );
-
return (
e.stopPropagation()}
>
- {/* Header */}
- {/* Content */}
{isLoadingConfig ? (
@@ -295,9 +259,11 @@ export default function ConfigModal({
{configVersionInfo?.knowledge_base_ids &&
configVersionInfo.knowledge_base_ids.length > 0 && (
-
+
{configVersionInfo.knowledge_base_ids.join("\n")}
-
+
)}
@@ -305,11 +271,17 @@ export default function ConfigModal({
assistantConfig?.instructions ||
job.config?.instructions) && (
-
+
{configVersionInfo?.instructions ||
assistantConfig?.instructions ||
job.config?.instructions}
-
+
)}
diff --git a/app/components/CopyableCodeBlock.tsx b/app/components/CopyableCodeBlock.tsx
new file mode 100644
index 0000000..7578033
--- /dev/null
+++ b/app/components/CopyableCodeBlock.tsx
@@ -0,0 +1,49 @@
+"use client";
+
+import React, { useState, useCallback } from "react";
+import { useToast } from "@/app/hooks/useToast";
+import { CheckIcon, CopyIcon } from "@/app/components/icons";
+
+interface CopyableCodeBlockProps {
+ children: React.ReactNode;
+ copyText: string;
+}
+
+export default function CopyableCodeBlock({
+ children,
+ copyText,
+}: CopyableCodeBlockProps) {
+ const toast = useToast();
+ const [copied, setCopied] = useState(false);
+
+ const handleCopy = useCallback(async () => {
+ try {
+ await navigator.clipboard.writeText(copyText);
+ setCopied(true);
+ toast.success("Copied to clipboard");
+ setTimeout(() => setCopied(false), 2000);
+ } catch {
+ toast.error("Failed to copy");
+ }
+ }, [copyText, toast]);
+
+ return (
+
+
+ {children}
+
+
+ {copied ? (
+
+ ) : (
+
+ )}
+
+
+ );
+}
diff --git a/app/components/DetailedResultsTable.tsx b/app/components/DetailedResultsTable.tsx
deleted file mode 100644
index c4b9de0..0000000
--- a/app/components/DetailedResultsTable.tsx
+++ /dev/null
@@ -1,639 +0,0 @@
-/**
- * DetailedResultsTable.tsx - Table view for evaluation results
- *
- * Displays Q&A pairs with scores in a tabular format
- * Supports both row format (individual traces) and grouped format (multiple answers per question)
- */
-
-import React, { useState, useEffect } from "react";
-import {
- TraceScore,
- getScoreObject,
- normalizeToIndividualScores,
- hasSummaryScores,
- isNewScoreObjectV2,
- isGroupedFormat,
- GroupedTraceItem,
- EvalJob,
-} from "@/app/components/types";
-
-// Helper function to format score value with color
-const formatScoreValue = (score: TraceScore | undefined) => {
- if (!score) return { value: "N/A", color: "#737373", bg: "transparent" };
-
- if (score.data_type === "CATEGORICAL") {
- const catValue = String(score.value);
- let color = "#171717";
- let bg = "#fafafa";
-
- if (catValue === "CORRECT") {
- color = "#15803d";
- bg = "#dcfce7";
- } else if (catValue === "PARTIAL") {
- color = "#92400e";
- bg = "#fef3c7";
- } else if (catValue === "INCORRECT") {
- color = "#dc2626";
- bg = "#fee2e2";
- }
-
- return { value: catValue, color, bg };
- }
-
- // NUMERIC
- const numValue = Number(score.value);
- const formattedValue = numValue.toFixed(2);
- let color = "#171717";
- let bg = "transparent";
-
- // Color based on value
- if (numValue >= 0.7) {
- color = "#15803d";
- bg = "#dcfce7";
- } else if (numValue >= 0.5) {
- color = "#92400e";
- bg = "#fef3c7";
- } else {
- color = "#dc2626";
- bg = "#fee2e2";
- }
-
- return { value: formattedValue, color, bg };
-};
-
-interface DetailedResultsTableProps {
- job: EvalJob;
-}
-
-export default function DetailedResultsTable({
- job,
-}: DetailedResultsTableProps) {
- const [openCommentId, setOpenCommentId] = useState
(null);
- const [commentPos, setCommentPos] = useState({ top: 0, left: 0 });
-
- useEffect(() => {
- if (!openCommentId) return;
- const handleScroll = () => setOpenCommentId(null);
- window.addEventListener("scroll", handleScroll, true);
- return () => {
- window.removeEventListener("scroll", handleScroll, true);
- };
- }, [openCommentId]);
-
- const scoreObject = getScoreObject(job);
-
- // 1. First check: Does it have summary_scores at all?
- if (!scoreObject || !hasSummaryScores(scoreObject)) {
- return (
-
-
- No detailed results available or using legacy format
-
-
- );
- }
-
- // 2. Second check: Does it have traces? (NewScoreObjectV2)
- if (isNewScoreObjectV2(scoreObject)) {
- // Check if grouped format
- if (isGroupedFormat(scoreObject.traces)) {
- return (
-
- );
- }
- // Otherwise show row format
- }
-
- // 3. Try to normalize to IndividualScore format
- // This handles NewScoreObjectV2 (with traces)
- const individual_scores = normalizeToIndividualScores(scoreObject);
-
- // 4. If no individual scores available (e.g., BasicScoreObject with only summary_scores)
- if (!individual_scores || individual_scores.length === 0) {
- return (
-
-
- No individual scores available. Only summary metrics are available for
- this evaluation.
-
-
- );
- }
-
- // Get all unique score names from the first item
- const scoreNames =
- individual_scores[0]?.trace_scores?.map((s) => s.name) || [];
-
- // Helper function to get score value by name
- const getScoreByName = (
- scores: TraceScore[],
- name: string,
- ): TraceScore | undefined => {
- if (!scores || !Array.isArray(scores)) return undefined;
- return scores.find((s) => s?.name === name);
- };
-
- return (
-
- {/* Table Container */}
-
-
- {/* Table Header */}
-
-
-
-
- Question
-
-
- Ground Truth
-
-
- Answer
-
- {scoreNames.map((scoreName) => (
-
- {scoreName}
-
- ))}
-
-
-
- {/* Table Body */}
-
- {individual_scores.map((item, index) => {
- const question = item.input?.question || "N/A";
- const answer = item.output?.answer || "N/A";
- const groundTruth = item.metadata?.ground_truth || "N/A";
-
- return (
- {
- const row = e.currentTarget;
- row.style.backgroundColor = "#fafafa";
- }}
- onMouseLeave={(e) => {
- const row = e.currentTarget;
- row.style.backgroundColor = "#ffffff";
- }}
- >
-
- {index + 1}
-
-
- {/* Question */}
-
-
- {question}
-
-
-
- {/* Ground Truth */}
-
-
- {groundTruth}
-
-
-
- {/* Answer */}
-
-
- {answer}
-
-
-
- {/* Score Columns */}
- {scoreNames.map((scoreName) => {
- const score = getScoreByName(item.trace_scores, scoreName);
- const { value, color, bg } = formatScoreValue(score);
-
- return (
-
-
-
- {value}
-
- {score?.comment && (
- <>
-
{
- const rect =
- e.currentTarget.getBoundingClientRect();
- const tooltipWidth = 300;
- const centerX = rect.left + rect.width / 2;
- const clampedLeft = Math.min(
- Math.max(centerX - tooltipWidth / 2, 8),
- window.innerWidth - tooltipWidth - 8,
- );
- setCommentPos({
- top: rect.top - 8,
- left: clampedLeft,
- });
- setOpenCommentId(`${index}-${scoreName}`);
- }}
- onMouseLeave={() => setOpenCommentId(null)}
- >
- i
-
- {openCommentId === `${index}-${scoreName}` && (
-
- {score.comment}
-
- )}
- >
- )}
-
-
- );
- })}
-
- );
- })}
-
-
-
-
- );
-}
-
-function GroupedResultsTable({ traces }: { traces: GroupedTraceItem[] }) {
- const [openCommentId, setOpenCommentId] = useState(null);
- const [commentPos, setCommentPos] = useState({ top: 0, left: 0 });
-
- useEffect(() => {
- if (!openCommentId) return;
- const handleScroll = () => setOpenCommentId(null);
- window.addEventListener("scroll", handleScroll, true);
- return () => {
- window.removeEventListener("scroll", handleScroll, true);
- };
- }, [openCommentId]);
-
- if (!traces || traces.length === 0) {
- return (
-
-
- No grouped results available
-
-
- );
- }
-
- // Get max answers count
- const maxAnswers = Math.max(...traces.map((t) => t.llm_answers.length));
-
- // Fixed column widths (in pixels) for predictable layout
- const COLUMN_WIDTHS = {
- qId: 60,
- question: 200,
- groundTruth: 200,
- answer: 250,
- };
-
- // Calculate minimum table width based on number of answers
- // This ensures horizontal scroll activates at the right point
- const fixedColumnsWidth =
- COLUMN_WIDTHS.qId + COLUMN_WIDTHS.question + COLUMN_WIDTHS.groundTruth;
- const tableMinWidth = fixedColumnsWidth + maxAnswers * COLUMN_WIDTHS.answer;
-
- return (
-
- {/* Table Container - overflow-x-auto enables horizontal scroll when table exceeds viewport */}
-
-
- {/* Table Header - matching row format styling */}
-
-
-
- Q.ID
-
-
- Question
-
-
- Ground Truth
-
- {Array.from({ length: maxAnswers }, (_, i) => (
-
- Answer {i + 1}
-
- ))}
-
-
-
- {/* Table Body */}
-
- {traces.map((group, index) => (
-
- {/* Text row */}
-
- {/* Question ID */}
-
- {group.question_id}
-
-
- {/* Question */}
-
-
- {group.question}
-
-
-
- {/* Ground Truth */}
-
-
- {group.ground_truth_answer}
-
-
-
- {/* Answer text only */}
- {Array.from({ length: maxAnswers }, (_, answerIndex) => {
- const answer = group.llm_answers[answerIndex];
- return (
-
- {answer ? (
-
- {answer}
-
- ) : (
- -
- )}
-
- );
- })}
-
- {/* Scores row */}
-
- {/* Empty cells for Q.ID, Question, Ground Truth */}
-
-
-
-
- {/* Score cells */}
- {Array.from({ length: maxAnswers }, (_, answerIndex) => {
- const answerScores: TraceScore[] =
- group.scores?.[answerIndex] || [];
- const answer = group.llm_answers[answerIndex];
-
- return (
-
- {answer && answerScores.length > 0 ? (
-
- {answerScores.map(
- (score: TraceScore, scoreIdx: number) => {
- if (!score) return null;
- const { value, color, bg } =
- formatScoreValue(score);
- return (
-
-
- {score.name}:
-
-
-
- {value}
-
- {score?.comment &&
- (() => {
- const commentId = `g${index}-a${answerIndex}-s${scoreIdx}`;
- return (
- <>
-
{
- const rect =
- e.currentTarget.getBoundingClientRect();
- const tooltipWidth = 300;
- const centerX =
- rect.left + rect.width / 2;
- const clampedLeft = Math.min(
- Math.max(
- centerX -
- tooltipWidth / 2,
- 8,
- ),
- window.innerWidth -
- tooltipWidth -
- 8,
- );
- setCommentPos({
- top: rect.top - 8,
- left: clampedLeft,
- });
- setOpenCommentId(commentId);
- }}
- onMouseLeave={() =>
- setOpenCommentId(null)
- }
- >
- i
-
- {openCommentId === commentId && (
-
- {score.comment}
-
- )}
- >
- );
- })()}
-
-
- );
- },
- )}
-
- ) : null}
-
- );
- })}
-
-
- ))}
-
-
-
-
- );
-}
diff --git a/app/components/InfoTooltip.tsx b/app/components/InfoTooltip.tsx
index d070496..902841d 100644
--- a/app/components/InfoTooltip.tsx
+++ b/app/components/InfoTooltip.tsx
@@ -11,15 +11,16 @@ export default function InfoTooltip({ text }: InfoTooltipProps) {
i
{text}
+
);
diff --git a/app/components/StatusBadge.tsx b/app/components/StatusBadge.tsx
index 48df1e5..12b3704 100644
--- a/app/components/StatusBadge.tsx
+++ b/app/components/StatusBadge.tsx
@@ -13,20 +13,14 @@ interface StatusBadgeProps {
}
export default function StatusBadge({ status, size = "sm" }: StatusBadgeProps) {
- const colors = getStatusColor(status);
+ const statusColor = getStatusColor(status);
const sizeClasses =
size === "md" ? "px-3 py-1.5 text-sm" : "px-2 py-1 text-xs";
return (
{status.toUpperCase()}
diff --git a/app/components/Tag.tsx b/app/components/Tag.tsx
new file mode 100644
index 0000000..6932e29
--- /dev/null
+++ b/app/components/Tag.tsx
@@ -0,0 +1,13 @@
+import type { ReactNode } from "react";
+
+interface TagProps {
+ children: ReactNode;
+}
+
+export default function Tag({ children }: TagProps) {
+ return (
+
+ {children}
+
+ );
+}
diff --git a/app/components/Toast.tsx b/app/components/Toast.tsx
index 2951ae8..bf3217b 100644
--- a/app/components/Toast.tsx
+++ b/app/components/Toast.tsx
@@ -88,7 +88,7 @@ function ToastContainer({
removeToast: (id: string) => void;
}) {
return (
-
+
{toasts.map((toast) => (
void }) {
setExiting(true)}
- className="shrink-0 self-start p-2 opacity-50 hover:opacity-100 transition-opacity"
+ onClick={onClose}
+ className="shrink-0 self-start p-2 opacity-50 hover:opacity-100 transition-opacity cursor-pointer"
>
diff --git a/app/components/evaluations/DetailedResultsTable.tsx b/app/components/evaluations/DetailedResultsTable.tsx
new file mode 100644
index 0000000..9d50ebd
--- /dev/null
+++ b/app/components/evaluations/DetailedResultsTable.tsx
@@ -0,0 +1,235 @@
+/**
+ * DetailedResultsTable.tsx - Table view for evaluation results
+ *
+ * Displays Q&A pairs with scores in a tabular format
+ * Supports both row format (individual traces) and grouped format (multiple answers per question)
+ */
+
+import { useState, useEffect } from "react";
+import type { GroupedTraceItem, EvalJob } from "@/app/lib/types/evaluation";
+import {
+ getScoreObject,
+ normalizeToIndividualScores,
+ hasSummaryScores,
+ isNewScoreObjectV2,
+ isGroupedFormat,
+} from "@/app/lib/utils/evaluation";
+import { formatScoreValue, getScoreByName } from "@/app/lib/utils";
+import GroupedResultsTable from "@/app/components/evaluations/GroupedResultsTable";
+
+interface DetailedResultsTableProps {
+ job: EvalJob;
+}
+
+export default function DetailedResultsTable({
+ job,
+}: DetailedResultsTableProps) {
+ const [openCommentId, setOpenCommentId] = useState
(null);
+ const [commentPos, setCommentPos] = useState({ top: 0, left: 0 });
+
+ useEffect(() => {
+ if (!openCommentId) return;
+ const handleScroll = () => setOpenCommentId(null);
+ window.addEventListener("scroll", handleScroll, true);
+ return () => {
+ window.removeEventListener("scroll", handleScroll, true);
+ };
+ }, [openCommentId]);
+
+ const scoreObject = getScoreObject(job);
+
+ if (!scoreObject || !hasSummaryScores(scoreObject)) {
+ return (
+
+
+ No detailed results available or using legacy format
+
+
+ );
+ }
+
+ if (isNewScoreObjectV2(scoreObject)) {
+ if (isGroupedFormat(scoreObject.traces)) {
+ return (
+
+ );
+ }
+ }
+
+ const individual_scores = normalizeToIndividualScores(scoreObject);
+
+ if (!individual_scores || individual_scores.length === 0) {
+ return (
+
+
+ No individual scores available. Only summary metrics are available for
+ this evaluation.
+
+
+ );
+ }
+
+ // Get all unique score names from the first item
+ const scoreNames =
+ individual_scores[0]?.trace_scores?.map((s) => s.name) || [];
+
+ const COLUMN_WIDTHS = {
+ index: 50,
+ question: 250,
+ groundTruth: 250,
+ answer: 250,
+ score: 160,
+ };
+ const tableMinWidth =
+ COLUMN_WIDTHS.index +
+ COLUMN_WIDTHS.question +
+ COLUMN_WIDTHS.groundTruth +
+ COLUMN_WIDTHS.answer +
+ scoreNames.length * COLUMN_WIDTHS.score;
+
+ return (
+
+
+
+
+
+
+
+ Question
+
+
+ Ground Truth
+
+
+ Answer
+
+ {scoreNames.map((scoreName) => (
+
+ {scoreName}
+
+ ))}
+
+
+
+
+ {individual_scores.map((item, index) => {
+ const question = item.input?.question || "N/A";
+ const answer = item.output?.answer || "N/A";
+ const groundTruth = item.metadata?.ground_truth || "N/A";
+
+ return (
+
+
+ {index + 1}
+
+
+
+
+ {question}
+
+
+
+
+
+ {groundTruth}
+
+
+
+
+
+ {answer}
+
+
+
+ {scoreNames.map((scoreName) => {
+ const score = getScoreByName(item.trace_scores, scoreName);
+ const { value, color, bg } = formatScoreValue(score);
+
+ return (
+
+
+
+ {value}
+
+ {score?.comment && (
+ <>
+
{
+ const rect =
+ e.currentTarget.getBoundingClientRect();
+ const tooltipWidth = 300;
+ const centerX = rect.left + rect.width / 2;
+ const clampedLeft = Math.min(
+ Math.max(centerX - tooltipWidth / 2, 8),
+ window.innerWidth - tooltipWidth - 8,
+ );
+ setCommentPos({
+ top: rect.top - 8,
+ left: clampedLeft,
+ });
+ setOpenCommentId(`${index}-${scoreName}`);
+ }}
+ onMouseLeave={() => setOpenCommentId(null)}
+ >
+ i
+
+ {openCommentId === `${index}-${scoreName}` && (
+
+ {score.comment}
+
+ )}
+ >
+ )}
+
+
+ );
+ })}
+
+ );
+ })}
+
+
+
+
+ );
+}
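
Note on the comment tooltip in `DetailedResultsTable`: the `onMouseEnter` handler centers a 300px tooltip over the hovered icon and clamps it to the viewport with an 8px margin. The same arithmetic can be sketched as a pure function, which is easier to unit test. The helper name `clampTooltipPosition` and the `AnchorRect` type are hypothetical and do not exist in this diff; only the formula is taken from the handler.

```typescript
// Hypothetical helper; the component currently inlines this arithmetic.
interface AnchorRect {
  left: number;
  top: number;
  width: number;
}

function clampTooltipPosition(
  rect: AnchorRect,
  viewportWidth: number,
  tooltipWidth = 300, // matches the hard-coded width in the handler
  margin = 8, // matches the hard-coded 8px offset in the handler
): { top: number; left: number } {
  // Horizontal center of the hovered element.
  const centerX = rect.left + rect.width / 2;

  // Center the tooltip on the element, then keep it at least `margin`
  // from the left edge and at most `margin` short of the right edge.
  const left = Math.min(
    Math.max(centerX - tooltipWidth / 2, margin),
    viewportWidth - tooltipWidth - margin,
  );

  // The tooltip is placed just above the element.
  return { top: rect.top - margin, left };
}
```

In the component this would be called as `clampTooltipPosition(e.currentTarget.getBoundingClientRect(), window.innerWidth)`. One property of the formula as written: when the viewport is narrower than `tooltipWidth + 2 * margin`, the right-edge bound wins and `left` drops below `margin`.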
diff --git a/app/components/evaluations/EvalDatasetDescription.tsx b/app/components/evaluations/EvalDatasetDescription.tsx
index 4579101..b0b99a0 100644
--- a/app/components/evaluations/EvalDatasetDescription.tsx
+++ b/app/components/evaluations/EvalDatasetDescription.tsx
@@ -15,7 +15,7 @@ export default function EvalDatasetDescription({
return (
diff --git a/app/components/evaluations/EvalRunCard.tsx b/app/components/evaluations/EvalRunCard.tsx
index 65990d1..64dba5b 100644
--- a/app/components/evaluations/EvalRunCard.tsx
+++ b/app/components/evaluations/EvalRunCard.tsx
@@ -2,18 +2,14 @@
import { useState } from "react";
import { useRouter } from "next/navigation";
-import { colors } from "@/app/lib/colors";
-import {
- EvalJob,
- AssistantConfig,
- getScoreObject,
-} from "@/app/components/types";
-import { getStatusColor, formatCostUSD } from "@/app/components/utils";
-import { timeAgo } from "@/app/lib/utils";
-import ConfigModal from "@/app/components/ConfigModal";
-import ScoreDisplay from "@/app/components/ScoreDisplay";
+import type { EvalJob, AssistantConfig } from "@/app/lib/types/evaluation";
+import { getScoreObject } from "@/app/lib/utils/evaluation";
+import { getStatusColor } from "@/app/components/utils";
+import { timeAgo, formatCostUSD } from "@/app/lib/utils";
+import { ConfigModal, InfoTooltip } from "@/app/components";
+import ScoreDisplay from "@/app/components/evaluations/ScoreDisplay";
import CostIcon from "@/app/components/icons/evaluations/CostIcon";
-import InfoTooltip from "@/app/components/InfoTooltip";
+import DatabaseIcon from "@/app/components/icons/evaluations/DatabaseIcon";
export interface EvalRunCardProps {
job: EvalJob;
@@ -33,91 +29,55 @@ export default function EvalRunCard({
return (
- {/* Row 1: Run Name (left) | Status (right) */}
-
+
{job.run_name}
{job.inserted_at && (
-
+
{timeAgo(job.inserted_at)}
)}
{/* Error message (if failed) */}
{job.error_message && (
-
{job.status}
- {/* Row 2: Scores */}
{scoreObj && (
)}
- {/* Row 3: Dataset + Config + Cost (left) | Actions (right) */}
-
+
{job.dataset_name && (
-
-
-
+
{job.dataset_name}
)}
{job.assistant_id && assistantConfig?.name && (
-
+
{assistantConfig.name}
)}
{job.cost?.total_cost_usd != null && (
-
+
{formatCostUSD(job.cost.total_cost_usd)}
)}
-
+
setIsConfigModalOpen(true)}
- className="px-3 py-1.5 rounded-lg text-xs font-medium border"
- style={{
- backgroundColor: "transparent",
- borderColor: colors.border,
- color: colors.text.primary,
- }}
+ className="px-3 py-1.5 rounded-lg text-xs font-medium border border-border bg-transparent text-text-primary"
>
View Config
router.push(`/evaluations/${job.id}`)}
disabled={!isCompleted}
- className="px-3 py-1.5 rounded-lg text-xs font-medium border cursor-pointer disabled:cursor-not-allowed"
- style={{
- backgroundColor: "transparent",
- borderColor: colors.border,
- color: isCompleted
- ? colors.text.primary
- : colors.text.secondary,
- opacity: isCompleted ? 1 : 0.5,
- }}
+ className={`px-3 py-1.5 rounded-lg text-xs font-medium border border-border bg-transparent cursor-pointer disabled:cursor-not-allowed ${
+ isCompleted
+ ? "text-text-primary opacity-100"
+ : "text-text-secondary opacity-50"
+ }`}
>
View Results
diff --git a/app/components/evaluations/EvaluationsTab.tsx b/app/components/evaluations/EvaluationsTab.tsx
index 3fce988..d57ea05 100644
--- a/app/components/evaluations/EvaluationsTab.tsx
+++ b/app/components/evaluations/EvaluationsTab.tsx
@@ -4,12 +4,13 @@ import { useState, useEffect, useCallback } from "react";
import { apiFetch } from "@/app/lib/apiClient";
import { colors } from "@/app/lib/colors";
import { Dataset } from "@/app/lib/types/dataset";
-import { EvalJob, AssistantConfig } from "@/app/components/types";
+import { EvalJob, AssistantConfig } from "@/app/lib/types/evaluation";
import ConfigSelector from "@/app/components/ConfigSelector";
import Loader from "@/app/components/Loader";
import EvalRunCard from "./EvalRunCard";
import EvalDatasetDescription from "./EvalDatasetDescription";
import { useAuth } from "@/app/lib/context/AuthContext";
+import { RefreshIcon } from "@/app/components/icons";
type Tab = "datasets" | "evaluations";
@@ -390,23 +391,12 @@ export default function EvaluationsTab({
-
-
-
+
@@ -418,14 +408,12 @@ export default function EvaluationsTab({
boxShadow: "0 1px 3px rgba(0, 0, 0, 0.04)",
}}
>
- {/* Loading */}
{isLoading && evalJobs.length === 0 && (
)}
- {/* Error */}
{error && (
)}
- {/* Empty State */}
{!isLoading && evalJobs.length === 0 && !error && (
)}
- {/* Runs List */}
{evalJobs.length > 0 &&
(() => {
const filteredJobs =
diff --git a/app/components/evaluations/GroupedResultsTable.tsx b/app/components/evaluations/GroupedResultsTable.tsx
new file mode 100644
index 0000000..1943d22
--- /dev/null
+++ b/app/components/evaluations/GroupedResultsTable.tsx
@@ -0,0 +1,261 @@
+/**
+ * GroupedResultsTable.tsx - Grouped view for evaluation results
+ *
+ * Displays multiple LLM answers per question in a grouped table format
+ */
+
+import { useState, useEffect, Fragment } from "react";
+import { TraceScore, GroupedTraceItem } from "@/app/lib/types/evaluation";
+import { formatScoreValue } from "@/app/lib/utils";
+
+export default function GroupedResultsTable({
+ traces,
+}: {
+ traces: GroupedTraceItem[];
+}) {
+ const [openCommentId, setOpenCommentId] = useState<string | null>(null);
+ const [commentPos, setCommentPos] = useState({ top: 0, left: 0 });
+
+ useEffect(() => {
+ if (!openCommentId) return;
+ const handleScroll = () => setOpenCommentId(null);
+ window.addEventListener("scroll", handleScroll, true);
+ return () => {
+ window.removeEventListener("scroll", handleScroll, true);
+ };
+ }, [openCommentId]);
+
+ if (!traces || traces.length === 0) {
+ return (
+
+
+ No grouped results available
+
+ );
+ }
+
+ // Get max answers count
+ const maxAnswers = Math.max(...traces.map((t) => t.llm_answers.length));
+
+ // Fixed column widths (in pixels) for predictable layout
+ const COLUMN_WIDTHS = {
+ qId: 60,
+ question: 200,
+ groundTruth: 200,
+ answer: 250,
+ };
+
+ // Calculate minimum table width based on number of answers
+ // This ensures horizontal scroll activates at the right point
+ const fixedColumnsWidth =
+ COLUMN_WIDTHS.qId + COLUMN_WIDTHS.question + COLUMN_WIDTHS.groundTruth;
+ const tableMinWidth = fixedColumnsWidth + maxAnswers * COLUMN_WIDTHS.answer;
+
+ return (
+
+
+
+
+
+
+ Q.ID
+
+
+ Question
+
+
+ Ground Truth
+
+ {Array.from({ length: maxAnswers }, (_, i) => (
+
+ Answer {i + 1}
+
+ ))}
+
+
+
+
+ {traces.map((group, index) => (
+
+
+
+ {group.question_id}
+
+
+
+
+ {group.question}
+
+
+
+
+
+ {group.ground_truth_answer}
+
+
+
+ {/* Answer */}
+ {Array.from({ length: maxAnswers }, (_, answerIndex) => {
+ const answer = group.llm_answers[answerIndex];
+ return (
+
+ {answer ? (
+
+ {answer}
+
+ ) : (
+ -
+ )}
+
+ );
+ })}
+
+
+
+
+
+
+ {Array.from({ length: maxAnswers }, (_, answerIndex) => {
+ const answerScores: TraceScore[] =
+ group.scores?.[answerIndex] || [];
+ const answer = group.llm_answers[answerIndex];
+
+ return (
+
+ {answer && answerScores.length > 0 ? (
+
+ {answerScores.map(
+ (score: TraceScore, scoreIdx: number) => {
+ if (!score) return null;
+ const { value, color, bg } =
+ formatScoreValue(score);
+ return (
+
+
+ {score.name}:
+
+
+
+ {value}
+
+ {score?.comment &&
+ (() => {
+ const commentId = `g${index}-a${answerIndex}-s${scoreIdx}`;
+ return (
+ <>
+
{
+ const rect =
+ e.currentTarget.getBoundingClientRect();
+ const tooltipWidth = 300;
+ const centerX =
+ rect.left + rect.width / 2;
+ const clampedLeft = Math.min(
+ Math.max(
+ centerX -
+ tooltipWidth / 2,
+ 8,
+ ),
+ window.innerWidth -
+ tooltipWidth -
+ 8,
+ );
+ setCommentPos({
+ top: rect.top - 8,
+ left: clampedLeft,
+ });
+ setOpenCommentId(commentId);
+ }}
+ onMouseLeave={() =>
+ setOpenCommentId(null)
+ }
+ >
+ i
+
+ {openCommentId === commentId && (
+
+ {score.comment}
+
+ )}
+ </>
+ );
+ })()}
+
+
+ );
+ },
+ )}
+
+ ) : null}
+
+ );
+ })}
+
+
+ ))}
+
+
+
+
+ );
+}
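Review note: the tooltip-positioning arithmetic in `GroupedResultsTable.tsx` matches the arithmetic in the other results table earlier in this diff (300 px tooltip width, 8 px viewport margin). The sketch below shows one way that calculation could be shared. It assumes a hypothetical helper named `clampTooltipLeft` with hypothetical parameter names; nothing of the kind exists in this change.

```typescript
// Hypothetical helper: neither the name nor the signature appears in this diff.
const TOOLTIP_WIDTH = 300; // width both tables hard-code as `tooltipWidth`
const VIEWPORT_MARGIN = 8; // margin both tables hard-code as the literal 8

function clampTooltipLeft(
  anchorLeft: number, // rect.left of the hovered element
  anchorWidth: number, // rect.width of the hovered element
  viewportWidth: number, // window.innerWidth at hover time
  tooltipWidth: number = TOOLTIP_WIDTH,
  margin: number = VIEWPORT_MARGIN,
): number {
  // Centre the tooltip over the anchor, then keep it inside the viewport.
  const centerX = anchorLeft + anchorWidth / 2;
  return Math.min(
    Math.max(centerX - tooltipWidth / 2, margin),
    viewportWidth - tooltipWidth - margin,
  );
}
```

Inside a mouse-enter handler this would be called as `clampTooltipLeft(rect.left, rect.width, window.innerWidth)`, with `rect` taken from `e.currentTarget.getBoundingClientRect()` as both components already do.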
diff --git a/app/components/ScoreDisplay.tsx b/app/components/evaluations/ScoreDisplay.tsx
similarity index 94%
rename from app/components/ScoreDisplay.tsx
rename to app/components/evaluations/ScoreDisplay.tsx
index 2f8b1db..68efa33 100644
--- a/app/components/ScoreDisplay.tsx
+++ b/app/components/evaluations/ScoreDisplay.tsx
@@ -5,7 +5,8 @@
"use client";
-import { ScoreObject, hasSummaryScores } from "./types";
+import type { ScoreObject } from "@/app/lib/types/evaluation";
+import { hasSummaryScores } from "@/app/lib/utils/evaluation";
interface ScoreDisplayProps {
score: ScoreObject | null;
@@ -16,7 +17,6 @@ export default function ScoreDisplay({
score,
errorMessage,
}: ScoreDisplayProps) {
- // No score available
if (!score) {
return (
@@ -42,7 +42,6 @@ export default function ScoreDisplay({
);
}
- // Separate numeric and categorical scores
const numericScores = summaryScores.filter(
(s) => s.data_type === "NUMERIC",
);
@@ -83,7 +82,6 @@ export default function ScoreDisplay({
);
}
- // Fallback for unsupported format
return (
Score:
diff --git a/app/components/icons/common/CopyIcon.tsx b/app/components/icons/common/CopyIcon.tsx
new file mode 100644
index 0000000..ac3b372
--- /dev/null
+++ b/app/components/icons/common/CopyIcon.tsx
@@ -0,0 +1,20 @@
+interface IconProps {
+ className?: string;
+ style?: React.CSSProperties;
+}
+
+export default function CopyIcon({ className, style }: IconProps) {
+ return (
+
+
+
+
+ );
+}
diff --git a/app/components/icons/common/RefreshIcon.tsx b/app/components/icons/common/RefreshIcon.tsx
index fedb9e2..e244959 100644
--- a/app/components/icons/common/RefreshIcon.tsx
+++ b/app/components/icons/common/RefreshIcon.tsx
@@ -13,11 +13,13 @@ export default function RefreshIcon({ className, style }: IconProps) {
strokeWidth={2}
style={style}
>
-
+
+
+
);
}
diff --git a/app/components/icons/index.tsx b/app/components/icons/index.tsx
index e46af15..450a0ed 100644
--- a/app/components/icons/index.tsx
+++ b/app/components/icons/index.tsx
@@ -2,6 +2,7 @@
export { default as ArrowLeftIcon } from "./common/ArrowLeftIcon";
export { default as ChevronDownIcon } from "./common/ChevronDownIcon";
export { default as CheckIcon } from "./common/CheckIcon";
+export { default as CopyIcon } from "./common/CopyIcon";
export { default as EyeIcon } from "./common/EyeIcon";
export { default as EyeOffIcon } from "./common/EyeOffIcon";
export { default as RefreshIcon } from "./common/RefreshIcon";
diff --git a/app/components/index.ts b/app/components/index.ts
index 318a498..9f5fbc4 100644
--- a/app/components/index.ts
+++ b/app/components/index.ts
@@ -1,5 +1,10 @@
export { default as Button } from "./Button";
+export { default as CodeBlock } from "./CodeBlock";
+export { default as ConfigModal } from "./ConfigModal";
+export { default as CopyableCodeBlock } from "./CopyableCodeBlock";
export { default as Field } from "./Field";
+export { default as InfoTooltip } from "./InfoTooltip";
export { default as Modal } from "./Modal";
export { default as PageHeader } from "./PageHeader";
export { default as Sidebar } from "./Sidebar";
+export { default as Tag } from "./Tag";
diff --git a/app/components/speech-to-text/EvaluationsTab.tsx b/app/components/speech-to-text/EvaluationsTab.tsx
index 81dbebc..119e955 100644
--- a/app/components/speech-to-text/EvaluationsTab.tsx
+++ b/app/components/speech-to-text/EvaluationsTab.tsx
@@ -10,7 +10,8 @@ import Loader, { LoaderBox } from "@/app/components/Loader";
import StatusBadge from "@/app/components/StatusBadge";
import { computeWordDiff } from "./TranscriptionDiffViewer";
import { getStatusColor } from "@/app/components/utils";
-import AudioPlayerFromUrl from "./AudioPlayerFromUrl";
+import AudioPlayerFromUrl from "@/app/components/speech-to-text/AudioPlayerFromUrl";
+import { RefreshIcon } from "@/app/components/icons";
export interface EvaluationsTabProps {
leftPanelWidth: number;
@@ -442,22 +443,11 @@ export default function EvaluationsTab({
-
-
-
+
)}
@@ -1213,27 +1203,18 @@ export default function EvaluationsTab({
return (
{/* Row 1: Run Name + Status */}
-
+
{run.run_name}
{/* Error message */}
{run.error_message && (
-
+
{run.error_message}
)}
diff --git a/app/components/text-to-speech/EvaluationsTab.tsx b/app/components/text-to-speech/EvaluationsTab.tsx
index b4a1cce..0caa46f 100644
--- a/app/components/text-to-speech/EvaluationsTab.tsx
+++ b/app/components/text-to-speech/EvaluationsTab.tsx
@@ -15,6 +15,7 @@ import { useAuth } from "@/app/lib/context/AuthContext";
import { apiFetch } from "@/app/lib/apiClient";
import Loader, { LoaderBox } from "@/app/components/Loader";
import { getStatusColor } from "@/app/components/utils";
+import { RefreshIcon } from "@/app/components/icons";
import AudioPlayerFromUrl from "./AudioPlayerFromUrl";
import { useToast } from "@/app/components/Toast";
@@ -442,22 +443,11 @@ export default function EvaluationsTab({
-
-
-
+ />
)}
@@ -1134,39 +1124,24 @@ export default function EvaluationsTab({
return (
{/* Row 1: Run Name + Status */}
-
+
{run.run_name}
{/* Error message */}
{run.error_message && (
-
{run.status}
diff --git a/app/components/types.ts b/app/components/types.ts
deleted file mode 100644
index b2bbc73..0000000
--- a/app/components/types.ts
+++ /dev/null
@@ -1,234 +0,0 @@
-/**
- * Shared TypeScript types for evaluation components
- */
-
-export interface TraceScore {
- name: string;
- value: number | string;
- data_type: "NUMERIC" | "CATEGORICAL";
- comment?: string;
-}
-
-// New trace format (from evaluation-sample-3.json)
-export interface TraceItem {
- trace_id: string;
- question: string;
- llm_answer: string;
- ground_truth_answer: string;
- scores: TraceScore[];
-}
-
-export interface GroupedTraceItem {
- question_id: number;
- question: string;
- ground_truth_answer: string;
- llm_answers: string[];
- trace_ids: string[];
- scores: TraceScore[][];
-}
-
-// Legacy individual score format (nested structure)
-export interface IndividualScore {
- trace_id: string;
- input?: {
- question: string;
- };
- output?: {
- answer: string;
- };
- metadata?: {
- ground_truth?: string;
- item_id?: string;
- response_id?: string;
- };
- trace_scores: TraceScore[];
-}
-
-export interface SummaryScore {
- name: string;
- avg?: number;
- std?: number;
- total_pairs: number;
- data_type: "NUMERIC" | "CATEGORICAL";
- distribution?: Record<string, number>; // For categorical data
-}
-
-// New score object with traces array
-export interface NewScoreObjectV2 {
- summary_scores: SummaryScore[];
- traces: TraceItem[] | GroupedTraceItem[];
-}
-
-// Legacy score structure (for backward compatibility)
-export interface PerItemScore {
- trace_id: string;
- cosine_similarity: number;
-}
-
-export interface CosineSimilarity {
- avg: number;
- std: number;
- total_pairs: number;
- per_item_scores: PerItemScore[];
-}
-
-export interface LegacyScoreObject {
- cosine_similarity: CosineSimilarity;
-}
-
-// Basic score object with only summary scores (no individual scores or traces)
-export interface BasicScoreObject {
- summary_scores: SummaryScore[];
-}
-
-// Union type to support both old and new structures
-export type ScoreObject =
- | NewScoreObjectV2
- | BasicScoreObject
- | LegacyScoreObject;
-
-export interface AssistantConfig {
- name: string;
- model: string;
- knowledge_base_ids: string[];
- project_id: number;
- organization_id: number;
- updated_at: string;
- deleted_at: string | null;
- instructions: string;
- assistant_id: string;
- temperature: number;
- max_num_results: number;
- id: number;
- inserted_at: string;
- is_deleted: boolean;
-}
-
-export interface EvalCostEntry {
- model: string;
- cost_usd: number;
- input_tokens?: number;
- output_tokens?: number;
- prompt_tokens?: number;
- total_tokens: number;
-}
-
-export interface EvalCost {
- response?: EvalCostEntry;
- embedding?: EvalCostEntry;
- total_cost_usd: number;
-}
-
-export interface EvalJob {
- id: number;
- run_name: string;
- dataset_name: string;
- dataset_id: number;
- batch_job_id: number;
- embedding_batch_job_id: number | null;
- status: string;
- object_store_url: string | null;
- total_items: number;
- score?: ScoreObject | null;
- scores?: ScoreObject | null; // Alternative field name
- error_message: string | null;
- config?: {
- model?: string;
- instructions?: string;
- tools?: unknown[];
- include?: string[];
- temperature?: number;
- };
- config_id?: string;
- config_version?: number;
- model?: string;
- assistant_id?: string;
- organization_id: number;
- project_id: number;
- cost?: EvalCost | null;
- inserted_at: string;
- updated_at: string;
-}
-
-// Type guard functions
-
-// Shared guard: Check if score has summary_scores and intelligently narrow to NewScoreObjectV2 or BasicScoreObject
-// Priority: If it has traces → NewScoreObjectV2, otherwise → BasicScoreObject
-export function hasSummaryScores(
- score: ScoreObject | null | undefined,
-): score is NewScoreObjectV2 | BasicScoreObject {
- if (!score) return false;
- if (!("summary_scores" in score)) return false;
-
- // Prioritize traces format if available
- if ("traces" in score) {
- return true;
- }
-
- // Otherwise, it's BasicScoreObject (summary_scores only, no traces, no individual_scores)
- return true;
-}
-
-export function isNewScoreObjectV2(
- score: ScoreObject | null | undefined,
-): score is NewScoreObjectV2 {
- if (!score) return false;
- return "summary_scores" in score && "traces" in score;
-}
-
-export function isBasicScoreObject(
- score: ScoreObject | null | undefined,
-): score is BasicScoreObject {
- if (!score) return false;
- return "summary_scores" in score && !("traces" in score);
-}
-
-export function isLegacyScoreObject(
- score: ScoreObject | null | undefined,
-): score is LegacyScoreObject {
- if (!score) return false;
- return "cosine_similarity" in score;
-}
-
-// Helper to get score object from job
-export function getScoreObject(job: EvalJob): ScoreObject | null {
- return job.scores || job.score || null;
-}
-
-export function isGroupedFormat(
- traces: TraceItem[] | GroupedTraceItem[],
-): traces is GroupedTraceItem[] {
- if (!traces || traces.length === 0) return false;
- return "llm_answers" in traces[0] && Array.isArray(traces[0].llm_answers);
-}
-
-// Normalize traces to IndividualScore format for table display
-export function normalizeToIndividualScores(
- score: ScoreObject | null | undefined,
-): IndividualScore[] {
- if (!score) return [];
-
- if (isNewScoreObjectV2(score)) {
- // Convert TraceItem[] to IndividualScore[] for table display
- // Note: Grouped traces should be detected earlier and handled separately
- return score.traces.map((trace: TraceItem | GroupedTraceItem) => {
- // Handle regular TraceItem format
- if ("llm_answer" in trace) {
- return {
- trace_id: trace.trace_id,
- input: { question: trace.question },
- output: { answer: trace.llm_answer },
- metadata: { ground_truth: trace.ground_truth_answer },
- trace_scores: trace.scores,
- };
- }
- // Should not reach here if grouped format is handled properly
- return {
- trace_id: "",
- trace_scores: [],
- };
- });
- }
-
- return [];
-}
diff --git a/app/components/utils.ts b/app/components/utils.ts
index f1386e6..90912ae 100644
--- a/app/components/utils.ts
+++ b/app/components/utils.ts
@@ -27,9 +27,11 @@ export const formatDate = (dateString?: string): string => {
};
/**
- * Returns color scheme based on job/evaluation status
+ * Returns Tailwind class names based on job/evaluation status.
+ * The colour tokens are defined in globals.css as @theme inline vars.
+ *
* @param status - Status string (completed, processing, failed, etc.)
- * @returns Object with bg, border, and text HSL color values
+ * @returns Object with bg, border, and text Tailwind class names
*/
export const getStatusColor = (
status: string,
@@ -38,50 +40,35 @@ export const getStatusColor = (
case "completed":
case "success":
return {
- bg: "hsl(134, 61%, 95%)",
- border: "hsl(134, 61%, 70%)",
- text: "hsl(134, 61%, 25%)",
+ bg: "bg-status-success-bg",
+ border: "border-status-success-border",
+ text: "text-status-success-text",
};
case "processing":
case "pending":
case "queued":
case "running":
return {
- bg: "hsl(46, 100%, 95%)",
- border: "hsl(46, 100%, 80%)",
- text: "hsl(46, 100%, 25%)",
+ bg: "bg-status-warning-bg",
+ border: "border-status-warning-border",
+ text: "text-status-warning-text",
};
case "failed":
case "error":
return {
- bg: "hsl(8, 86%, 95%)",
- border: "hsl(8, 86%, 80%)",
- text: "hsl(8, 86%, 40%)",
+ bg: "bg-status-error-bg",
+ border: "border-status-error-border",
+ text: "text-status-error-text",
};
default:
return {
- bg: "hsl(0, 0%, 100%)",
- border: "hsl(0, 0%, 85%)",
- text: "hsl(330, 3%, 49%)",
+ bg: "bg-status-default-bg",
+ border: "border-status-default-border",
+ text: "text-status-default-text",
};
}
};
-/**
- * Formats a USD cost value for display
- * @param cost - Cost in USD
- * @returns Formatted cost string (e.g., "$0.0013", "$1.25")
- */
-export const formatCostUSD = (cost: number): string => {
- if (!Number.isFinite(cost)) {
- return "N/A";
- }
- if (cost < 0.01) {
- return `$${cost.toFixed(4)}`;
- }
- return `$${cost.toFixed(2)}`;
-};
-
/**
* Calculates dynamic thresholds for color coding based on score distribution
* @param scores - Array of similarity scores
diff --git a/app/globals.css b/app/globals.css
index 1d67ebc..05165ab 100644
--- a/app/globals.css
+++ b/app/globals.css
@@ -49,6 +49,34 @@
--color-status-warning: #f59e0b;
}
+/* Status badge colors — success */
+@theme inline {
+ --color-status-success-bg: hsl(134, 61%, 95%);
+ --color-status-success-border: hsl(134, 61%, 70%);
+ --color-status-success-text: hsl(134, 61%, 25%);
+}
+
+/* Status badge colors — warning */
+@theme inline {
+ --color-status-warning-bg: hsl(46, 100%, 95%);
+ --color-status-warning-border: hsl(46, 100%, 80%);
+ --color-status-warning-text: hsl(46, 100%, 25%);
+}
+
+/* Status badge colors — error */
+@theme inline {
+ --color-status-error-bg: hsl(8, 86%, 95%);
+ --color-status-error-border: hsl(8, 86%, 80%);
+ --color-status-error-text: hsl(8, 86%, 40%);
+}
+
+/* Status badge colors — default */
+@theme inline {
+ --color-status-default-bg: hsl(0, 0%, 100%);
+ --color-status-default-border: hsl(0, 0%, 85%);
+ --color-status-default-text: hsl(330, 3%, 49%);
+}
+
@media (prefers-color-scheme: dark) {
:root {
--background: #000000;
diff --git a/app/lib/types/evaluation.ts b/app/lib/types/evaluation.ts
new file mode 100644
index 0000000..8b01a15
--- /dev/null
+++ b/app/lib/types/evaluation.ts
@@ -0,0 +1,141 @@
+export interface TraceScore {
+ name: string;
+ value: number | string;
+ data_type: "NUMERIC" | "CATEGORICAL";
+ comment?: string;
+}
+
+export interface TraceItem {
+ trace_id: string;
+ question: string;
+ llm_answer: string;
+ ground_truth_answer: string;
+ scores: TraceScore[];
+}
+
+export interface GroupedTraceItem {
+ question_id: number;
+ question: string;
+ ground_truth_answer: string;
+ llm_answers: string[];
+ trace_ids: string[];
+ scores: TraceScore[][];
+}
+
+export interface IndividualScore {
+ trace_id: string;
+ input?: {
+ question: string;
+ };
+ output?: {
+ answer: string;
+ };
+ metadata?: {
+ ground_truth?: string;
+ item_id?: string;
+ response_id?: string;
+ };
+ trace_scores: TraceScore[];
+}
+
+export interface SummaryScore {
+ name: string;
+ avg?: number;
+ std?: number;
+ total_pairs: number;
+ data_type: "NUMERIC" | "CATEGORICAL";
+ distribution?: Record<string, number>; // For categorical data
+}
+
+export interface NewScoreObjectV2 {
+ summary_scores: SummaryScore[];
+ traces: TraceItem[] | GroupedTraceItem[];
+}
+
+export interface PerItemScore {
+ trace_id: string;
+ cosine_similarity: number;
+}
+
+export interface CosineSimilarity {
+ avg: number;
+ std: number;
+ total_pairs: number;
+ per_item_scores: PerItemScore[];
+}
+
+export interface LegacyScoreObject {
+ cosine_similarity: CosineSimilarity;
+}
+
+export interface BasicScoreObject {
+ summary_scores: SummaryScore[];
+}
+
+export type ScoreObject =
+ | NewScoreObjectV2
+ | BasicScoreObject
+ | LegacyScoreObject;
+
+export interface AssistantConfig {
+ name: string;
+ model: string;
+ knowledge_base_ids: string[];
+ project_id: number;
+ organization_id: number;
+ updated_at: string;
+ deleted_at: string | null;
+ instructions: string;
+ assistant_id: string;
+ temperature: number;
+ max_num_results: number;
+ id: number;
+ inserted_at: string;
+ is_deleted: boolean;
+}
+
+export interface EvalCostEntry {
+ model: string;
+ cost_usd: number;
+ input_tokens?: number;
+ output_tokens?: number;
+ prompt_tokens?: number;
+ total_tokens: number;
+}
+
+export interface EvalCost {
+ response?: EvalCostEntry;
+ embedding?: EvalCostEntry;
+ total_cost_usd: number;
+}
+
+export interface EvalJob {
+ id: number;
+ run_name: string;
+ dataset_name: string;
+ dataset_id: number;
+ batch_job_id: number;
+ embedding_batch_job_id: number | null;
+ status: string;
+ object_store_url: string | null;
+ total_items: number;
+ score?: ScoreObject | null;
+ scores?: ScoreObject | null; // Alternative field name
+ error_message: string | null;
+ config?: {
+ model?: string;
+ instructions?: string;
+ tools?: unknown[];
+ include?: string[];
+ temperature?: number;
+ };
+ config_id?: string;
+ config_version?: number;
+ model?: string;
+ assistant_id?: string;
+ organization_id: number;
+ project_id: number;
+ cost?: EvalCost | null;
+ inserted_at: string;
+ updated_at: string;
+}
diff --git a/app/lib/utils.ts b/app/lib/utils.ts
index 27c28c5..1ac04eb 100644
--- a/app/lib/utils.ts
+++ b/app/lib/utils.ts
@@ -10,6 +10,7 @@ import {
import { SavedConfig, ConfigGroup } from "./types/configs";
import { isGpt5Model } from "@/app/lib/models";
import { STORAGE_KEYS } from "@/app/lib/constants";
+import { TraceScore } from "@/app/lib/types/evaluation";
export function timeAgo(dateStr: string): string {
const date =
@@ -193,3 +194,67 @@ export const sanitizeCSVCell = (
}
return `"${sanitized}"`;
};
+
+export const formatScoreValue = (score: TraceScore | undefined) => {
+ if (!score) return { value: "N/A", color: "#737373", bg: "transparent" };
+
+ if (score.data_type === "CATEGORICAL") {
+ const catValue = String(score.value);
+ let color = "#171717";
+ let bg = "#fafafa";
+
+ if (catValue === "CORRECT") {
+ color = "#15803d";
+ bg = "#dcfce7";
+ } else if (catValue === "PARTIAL") {
+ color = "#92400e";
+ bg = "#fef3c7";
+ } else if (catValue === "INCORRECT") {
+ color = "#dc2626";
+ bg = "#fee2e2";
+ }
+
+ return { value: catValue, color, bg };
+ }
+
+ const numValue = Number(score.value);
+ const formattedValue = numValue.toFixed(2);
+ let color = "#171717";
+ let bg = "transparent";
+
+ if (numValue >= 0.7) {
+ color = "#15803d";
+ bg = "#dcfce7";
+ } else if (numValue >= 0.5) {
+ color = "#92400e";
+ bg = "#fef3c7";
+ } else {
+ color = "#dc2626";
+ bg = "#fee2e2";
+ }
+
+ return { value: formattedValue, color, bg };
+};
+
+export const getScoreByName = (
+ scores: TraceScore[],
+ name: string,
+): TraceScore | undefined => {
+ if (!scores || !Array.isArray(scores)) return undefined;
+ return scores.find((s) => s?.name === name);
+};
+
+/**
+ * Formats a USD cost value for display
+ * @param cost - Cost in USD
+ * @returns Formatted cost string (e.g., "$0.0013", "$1.25")
+ */
+export const formatCostUSD = (cost: number): string => {
+ if (!Number.isFinite(cost)) {
+ return "N/A";
+ }
+ if (cost < 0.01) {
+ return `$${cost.toFixed(4)}`;
+ }
+ return `$${cost.toFixed(2)}`;
+};
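Review note: in the numeric branch of `formatScoreValue`, a value that `Number()` cannot parse becomes `NaN`, fails both `>=` comparisons, and therefore gets the red styling with the text `"NaN"`. The sketch below restates the 0.7 / 0.5 thresholds as a small classifier and adds an explicit non-finite guard. The `scoreBand` name, the `ScoreBand` type, and the `"unknown"` band are assumptions made for illustration and are not part of this diff.

```typescript
// Hypothetical classifier mirroring the thresholds used by formatScoreValue.
type ScoreBand = "high" | "medium" | "low" | "unknown";

function scoreBand(raw: number | string): ScoreBand {
  const n = Number(raw);
  // Assumption: unparseable values get their own band. The code in the diff
  // has no such guard and treats them like low scores.
  if (!Number.isFinite(n)) return "unknown";
  if (n >= 0.7) return "high"; // rendered green in the diff
  if (n >= 0.5) return "medium"; // rendered amber in the diff
  return "low"; // rendered red in the diff
}
```

Keeping the band separate from the colour values would also let the hex literals move into theme tokens later, in the same way `getStatusColor` now returns class names.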
diff --git a/app/lib/utils/evaluation.ts b/app/lib/utils/evaluation.ts
new file mode 100644
index 0000000..441fe18
--- /dev/null
+++ b/app/lib/utils/evaluation.ts
@@ -0,0 +1,53 @@
+import type {
+ EvalJob,
+ GroupedTraceItem,
+ IndividualScore,
+ NewScoreObjectV2,
+ BasicScoreObject,
+ ScoreObject,
+ TraceItem,
+} from "@/app/lib/types/evaluation";
+
+export function hasSummaryScores(
+ score: ScoreObject | null | undefined,
+): score is NewScoreObjectV2 | BasicScoreObject {
+ if (!score) return false;
+ return "summary_scores" in score;
+}
+
+export function isNewScoreObjectV2(
+ score: ScoreObject | null | undefined,
+): score is NewScoreObjectV2 {
+ if (!score) return false;
+ return "summary_scores" in score && "traces" in score;
+}
+
+export function getScoreObject(job: EvalJob): ScoreObject | null {
+ return job.scores || job.score || null;
+}
+
+export function isGroupedFormat(
+ traces: TraceItem[] | GroupedTraceItem[],
+): traces is GroupedTraceItem[] {
+ if (!traces || traces.length === 0) return false;
+ return "llm_answers" in traces[0] && Array.isArray(traces[0].llm_answers);
+}
+
+export function normalizeToIndividualScores(
+ score: ScoreObject | null | undefined,
+): IndividualScore[] {
+ if (!score || !isNewScoreObjectV2(score)) return [];
+
+ return score.traces.map((trace: TraceItem | GroupedTraceItem) => {
+ if ("llm_answer" in trace) {
+ return {
+ trace_id: trace.trace_id,
+ input: { question: trace.question },
+ output: { answer: trace.llm_answer },
+ metadata: { ground_truth: trace.ground_truth_answer },
+ trace_scores: trace.scores,
+ };
+ }
+ return { trace_id: "", trace_scores: [] };
+ });
+}
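Review note: `normalizeToIndividualScores` returns a placeholder `{ trace_id: "", trace_scores: [] }` for grouped traces, so callers are expected to branch on `isGroupedFormat` first. The sketch below shows that branching with trimmed copies of the two trace types (the `scores` fields are omitted for brevity). The `countAnswers` function is hypothetical and exists only to demonstrate the narrowing.

```typescript
// Trimmed copies of the types in app/lib/types/evaluation.ts (scores omitted).
interface TraceItem {
  trace_id: string;
  question: string;
  llm_answer: string;
  ground_truth_answer: string;
}

interface GroupedTraceItem {
  question_id: number;
  question: string;
  ground_truth_answer: string;
  llm_answers: string[];
  trace_ids: string[];
}

// Same check as the guard in the diff, written against a local variable.
function isGroupedFormat(
  traces: TraceItem[] | GroupedTraceItem[],
): traces is GroupedTraceItem[] {
  if (!traces || traces.length === 0) return false;
  const first = traces[0];
  return "llm_answers" in first && Array.isArray(first.llm_answers);
}

// Hypothetical caller: counts answers regardless of which format arrives.
function countAnswers(traces: TraceItem[] | GroupedTraceItem[]): number {
  if (isGroupedFormat(traces)) {
    // Narrowed to GroupedTraceItem[]: each group can hold several answers.
    return traces.reduce((sum, group) => sum + group.llm_answers.length, 0);
  }
  // Flat format (or an empty array): one answer per trace.
  return traces.length;
}
```

Because the guard only inspects the first element, it assumes the array is homogeneous. A mixed array would be classified by whatever sits at index 0.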
diff --git a/app/page.tsx b/app/page.tsx
index 50724ca..060f549 100644
--- a/app/page.tsx
+++ b/app/page.tsx
@@ -7,7 +7,6 @@ import { RefreshIcon } from "@/app/components/icons";
export default function Home() {
const router = useRouter();
- // Auto-redirect to evaluations page
useEffect(() => {
router.push("/evaluations");
}, [router]);
diff --git a/instructions/CLAUDE.md b/instructions/CLAUDE.md
deleted file mode 100644
index dbe1573..0000000
--- a/instructions/CLAUDE.md
+++ /dev/null
@@ -1,327 +0,0 @@
-# CLAUDE.md
-
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-
-## Project Overview
-
-Kaapi Konsole is a Next.js 16 application by Tech4Dev for LLM development and evaluation. It provides:
-
-- LLM response evaluation against QnA datasets
-- Git-like version control for prompt templates
-- Configuration management with A/B testing
-- Dataset and API key management
-
-The application has evolved from a simple evaluation tool into a full-featured LLM development platform.
-
-## Technology Stack
-
-- **Framework**: Next.js 16.0.7 (App Router)
-- **React**: 19.2.0 (with hooks-based state management)
-- **Routing**: Next.js App Router + React Router DOM 7.9.5 (dual system)
-- **Styling**: Tailwind CSS 4.x + centralized color system in `/app/lib/colors.ts`
-- **TypeScript**: 5.x (strict mode disabled)
-- **Data Fetching**: SWR 2.3.6 (not widely used yet)
-- **Date/Time**: date-fns 4.1.0, date-fns-tz 3.2.0
-
-## Development Commands
-
-```bash
-# Start development server (http://localhost:3000)
-npm run dev
-
-# Build for production
-npm run build
-
-# Start production server
-npm start
-
-# Run linter
-npm run lint
-```
-
-## Application Architecture
-
-### Route Structure
-
-```
-/ → Redirects to /evaluations
-/evaluations → Main eval interface (upload & results)
-/evaluations/[id] → Detailed evaluation report
-/datasets → Dataset upload and management
-/keystore → API key management (localStorage-based)
-/configurations/prompt-editor → Git-like prompt version control
-/test-evaluation → Mock data testing page
-```
-
-**Coming Soon Routes** (placeholders):
-
-- `/model-testing`, `/speech-to-text`, `/text-to-speech`, `/guardrails`, `/redteaming`
-
-### Component Organization
-
-**Shared Components** (`/app/components/`):
-
-- `Sidebar.tsx` - Main navigation (240px collapsible)
-- `TabNavigation.tsx` - Reusable tab switcher
-- `ConfigModal.tsx` - Modal for viewing evaluation configs
-- `DetailedResultsTable.tsx` - Evaluation traces table
-- `ScoreDisplay.tsx`, `StatusBadge.tsx` - Display primitives
-- `types.ts` - Shared TypeScript interfaces
-- `utils.ts` - Date formatting, color utilities
-
-**Prompt Editor Components** (`/app/components/prompt-editor/`):
-
-- `Header.tsx` - Top nav with branch controls
-- `EditorView.tsx` - WYSIWYG prompt editor
-- `DiffView.tsx` - Side-by-side diff visualization
-- `HistorySidebar.tsx` - Commit history tree
-- `ConfigDrawer.tsx` - Right-side configuration drawer
-- `CurrentConfigTab.tsx`, `HistoryTab.tsx`, `ABTestTab.tsx` - Drawer tabs
-- `BranchModal.tsx`, `MergeModal.tsx` - Dialogs
-
-### State Management Pattern
-
-**No global state library** - uses React `useState` exclusively:
-
-- Component-level state with props drilling
-- LocalStorage for persistence (API keys, sidebar state)
-- No Context API or Redux/Zustand
-
-**LocalStorage Keys:**
-
-- `kaapi_api_keys` - API key storage
-- `sidebar-expanded-menus` - Sidebar expansion state
-
-### API Integration Architecture
-
-**Proxy Pattern**: All backend calls route through Next.js API handlers in `/app/api/`:
-
-```
-GET/POST /api/evaluations → List/create eval jobs
-GET /api/evaluations/[id] → Get job details
-GET/POST /api/evaluations/datasets → List/upload datasets
-GET /api/evaluations/datasets/[dataset_id]
-GET /api/assistant/[assistant_id] → Fetch assistant config
-```
-
-**Backend URL**: Configured via `BACKEND_URL` (default: `http://localhost:8000`)
-
-**Authentication**: Custom header `X-API-KEY` passed from client
-
-**Mock Data System**: Toggle via `USE_MOCK_DATA` flag in API routes. Mock files in `/public/mock-data/`.
-
-### Type System
-
-**Complex Type Hierarchies** in `/app/components/types.ts`:
-
-**Evaluation Types:**
-
-- `EvalJob` - Main evaluation job entity
-- `ScoreObject` - Union type supporting 3 formats:
-  - `NewScoreObjectV2` (with `traces[]` array)
-  - `NewScoreObject` (with `individual_scores[]`)
-  - `LegacyScoreObject` (old cosine similarity format)
-- `TraceItem` - Individual Q&A evaluation trace
-- `SummaryScore` - Aggregate metrics (NUMERIC/CATEGORICAL)
-
-**Type Guards**: `isNewScoreObjectV2()`, `isLegacyScoreObject()` for runtime type checking
-
-**Prompt Editor Types** in `/app/configurations/prompt-editor/types.ts`:
-
-- `Commit` - Git-like commit with branch/parent relationships
-- `Config` - LLM configuration blob with versioning
-- `Tool` - Vector store tool definition
-- `Variant` - A/B test variant configuration
-- `DiffLine` - Myers diff algorithm output
-
-### Styling System
-
-**Current Design**: Vercel-style minimalist black/white theme
-
-**Color Management**:
-
-- All colors defined in `/app/lib/colors.ts` as TypeScript object
-- Synchronized with CSS variables in `globals.css`
-- Dark mode support via `prefers-color-scheme` media query
-- See `COLOR_SCHEME.md` for quick preset options
-
-**Styling Approach**:
-
-1. Tailwind CSS for layout and spacing
-2. Inline styles for colors (referencing `colors` object)
-3. Hover states managed via React event handlers
-4. No custom Tailwind classes or extended theme
-
-**Color Palette**:
-
-```typescript
-bg: { primary: '#ffffff', secondary: '#fafafa' }
-text: { primary: '#171717', secondary: '#737373' }
-border: '#e5e5e5'
-accent: { primary: '#171717', hover: '#404040' }
-status: { success: '#16a34a', error: '#dc2626', warning: '#f59e0b' }
-```
-
-## Key Features
-
-### 1. LLM Evaluation Pipeline
-
-**Workflow**:
-
-1. Upload CSV with `question,answer` columns
-2. Configure experiment (model, instructions, vector stores)
-3. Backend creates evaluation job
-4. Job status polled every 10 seconds
-5. Results displayed with detailed metrics
-
-**Evaluation Modes**:
-
-- Config-based: Specify model, instructions, tools
-- Assistant-based: Use pre-configured assistant ID
-
-**Metrics Display**:
-
-- Summary scores (avg ± std for numeric, distribution for categorical)
-- Per-item traces with expandable Q&A pairs
-- Color-coded scores with dynamic thresholds
-- CSV export functionality
-
-### 2. Git-like Prompt Version Control
-
-**Core Concepts** (see `/configurations/prompt-editor/page.tsx`):
-
-- **Commits**: Versioned prompt snapshots with author/message/timestamp
-- **Branches**: Parallel development streams (e.g., main, experiment-v2)
-- **Diffs**: Myers algorithm for side-by-side change visualization
-- **Merges**: Branch integration with duplicate commit detection
-
-**Implementation Details**:
-
-- All commits stored in-memory (no backend persistence yet)
-- `createBranch()` preserves uncommitted changes when branching from HEAD
-- `switchBranch()` loads latest commit from target branch
-- `commitVersion()` creates new commit on current branch
-- `mergeBranch()` prevents duplicate merges
-
-**IMPORTANT**: When creating a new branch from current HEAD (not a specific historical commit), uncommitted changes in the editor must persist. This matches git behavior.
-
-### 3. Configuration Management & A/B Testing
-
-**Config Structure**:
-
-```typescript
-interface Config {
-  id: string;
-  name: string;
-  version: number; // Auto-incremented per name
-  config_blob: {
-    completion: {
-      provider: "openai" | "anthropic" | "google";
-      params: { model: string; instructions: string; temperature: number; tools: Tool[] };
-    };
-  };
-}
-```
-
-**Features**:
-
-- Multi-version configs (auto-incremented)
-- "Use Current Prompt" syncs from editor
-- History tab shows all saved configs
-- A/B testing with 2-4 variants
-- Simulated test runs (1.5s delay, random scores)
-
-See `CONFIG_AB.md` for complete feature specification.
-
-## Key Implementation Patterns
-
-### TypeScript Configuration
-
-- Path alias `@/*` maps to project root
-- Strict mode disabled (`strict: false`)
-- JSX uses `react-jsx` transform
-- Module resolution: `bundler`
-
-### Date/Time Handling
-
-- IST (Indian Standard Time) used throughout
-- Timezone offsets manually added to UTC dates
-- Format: `date-fns` with `date-fns-tz`
-
-### Component Patterns
-
-1. **Client-Side Components**: Most pages use `"use client"` for hooks and browser APIs
-2. **Props Drilling**: Deep component trees pass 10+ props (no Context API)
-3. **Inline Validation**: Error handling with alerts (no toast library)
-4. **Loading States**: Skeleton loaders with Tailwind pulse animation
-
-### Data Fetching
-
-- Direct `fetch()` calls (no axios/react-query)
-- SWR installed but minimally used
-- Polling intervals for job status (10s)
-- Mock data toggle for development
-
-## File Path Conventions
-
-- Use `@/` prefix for imports: `import Component from '@/app/components/Component'`
-- All application code in `/app/` (App Router structure)
-- Shared components: `/app/components/`
-- Feature components: `/app/components/[feature]/`
-- API routes: `/app/api/`
-- Utilities: `/app/lib/`
-
-## Development Workflow Guidelines
-
-1. **Styling**: Use centralized colors from `/app/lib/colors.ts`, not hardcoded hex values
-2. **State**: Keep state in component hierarchy, not global stores
-3. **Types**: Use shared types from `/app/components/types.ts` for evaluations
-4. **Colors**: Reference `colors` object for inline styles, Tailwind for layout
-5. **API Calls**: Route through `/app/api/` handlers, not direct backend calls
-6. **Date Formatting**: Use `formatDateTime()` from `/app/components/utils.ts`
-
-## Backend Integration
-
-**Environment Variables**:
-
-```bash
-BACKEND_URL=http://localhost:8000 # Backend API base URL
-```
-
-**Authentication**:
-
-- API keys stored in localStorage
-- Passed via `X-API-KEY` header
-- No JWT/OAuth implementation
-
-**Dataset Upload**:
-
-- CSV format: `question,expected_answer` columns
-- Duplication factor supported (1-10)
-- Backend handles file processing
-
-## Technical Debt & Known Patterns
-
-1. **Dual Routing**: Next.js App Router + React Router DOM coexist (avoid confusion)
-2. **Props Drilling**: Consider Context API for deeply nested props
-3. **Magic Strings**: Status values, localStorage keys hardcoded
-4. **Mixed Styling**: Tailwind + inline styles + CSS modules (prefer consistency)
-5. **No Testing**: No test files exist (add tests for critical paths)
-6. **Large Files**: Some components exceed 1000 lines (consider splitting)
-7. **Type Safety**: Strict mode disabled (many `any` types exist)
-
-## Important Notes
-
-1. **React 19**: Uses bleeding-edge React version (expect occasional breaking changes)
-2. **LocalStorage**: API keys stored client-side (not production-ready for sensitive data)
-3. **Mock Data**: Production code includes mock system (toggle via flags)
-4. **IST Timezone**: All timestamps assume Indian Standard Time
-5. **No Testing**: No test infrastructure exists yet
-6. **Component Location**: Check both `/app/components/` and feature folders for components
-
-## Documentation Files
-
-- `/CLAUDE.md` - This file (architectural guidance)
-- `/COLOR_SCHEME.md` - Quick color preset guide
-- `/CONFIG_AB.md` - A/B testing feature specification
-- `/README.md` - Standard Next.js boilerplate
diff --git a/instructions/COLOR_SCHEME.md b/instructions/COLOR_SCHEME.md
deleted file mode 100644
index 616d9d2..0000000
--- a/instructions/COLOR_SCHEME.md
+++ /dev/null
@@ -1,65 +0,0 @@
-# Color Scheme Configuration
-
-This app uses a centralized color configuration for easy experimentation.
-
-## Configuration File
-
-Edit `/app/lib/colors.ts` to change the entire app's color scheme.
-
-## Current Colors
-
-```typescript
-{
-  bg: {
-    primary: '#ffffff', // Main background (white)
-    secondary: '#fafafa', // Secondary background (light gray)
-  },
-  text: {
-    primary: '#171717', // Main text (near black)
-    secondary: '#737373', // Muted text (gray)
-  },
-  border: '#e5e5e5', // All borders
-  accent: {
-    primary: '#0070f3', // Primary buttons, links, active states (Vercel blue)
-    hover: '#0761d1', // Hover state for accent
-  },
-  status: {
-    success: '#16a34a', // Success states (green)
-    error: '#dc2626', // Error states (red)
-    warning: '#f59e0b', // Warning states (orange)
-  }
-}
-```
-
-## Quick Color Scheme Presets
-
-### Vercel Style (Current)
-
-- Accent: `#0070f3` (blue)
-
-### Linear Style
-
-- Accent: `#5E6AD2` (purple-blue)
-- Update `colors.accent.primary` to `#5E6AD2`
-- Update `colors.accent.hover` to `#4F5CC0`
-
-### GitHub Style
-
-- Accent: `#2DA44E` (green)
-- Update `colors.accent.primary` to `#2DA44E`
-- Update `colors.accent.hover` to `#238636`
-
-### Minimal Black
-
-- Accent: `#171717` (black)
-- Update `colors.accent.primary` to `#171717`
-- Update `colors.accent.hover` to `#404040`
-
-## How to Change
-
-1. Open `/app/lib/colors.ts`
-2. Modify the color values
-3. Save the file
-4. Refresh your browser
-
-That's it! All components use these centralized values.
diff --git a/instructions/CONFIG_AB.md b/instructions/CONFIG_AB.md
deleted file mode 100644
index 8b16150..0000000
--- a/instructions/CONFIG_AB.md
+++ /dev/null
@@ -1,277 +0,0 @@
-I need to implement a configuration drawer and A/B testing feature for a prompt version control system. Here's what needs to be built:
-
-## Context
-
-We have a React-based version control system for prompt templates (similar to Git). Users can commit prompts, create branches, view diffs, and merge. Now we need to add configuration management and A/B testing.
-
-## Requirements
-
-### 1. Configuration Drawer (Right Side, 420px width)
-
-**Trigger:** Floating Action Button (FAB) - "⚙️" icon, bottom-right corner, 56x56px circle, blue background
-
-**Drawer Structure:**
-
-- Slides in from right when FAB clicked
-- 3 tabs: "Current" | "History" | "A/B Test"
-- Close button (X) top-right
-- Box shadow for depth
-
-### 2. Current Config Tab
-
-**Fields (top to bottom):**
-
-1. **Config Name Selector**
-   - Dropdown to select existing configs (shows: "Name (vX)")
-   - "+ New" button next to it
-   - If "+ New" is clicked, show a text input for the new config name
-
-2. **Provider Dropdown**
-   - Options: OpenAI, Anthropic, Google
-   - Default: openai
-
-3. **Model Dropdown**
-   - Options: gpt-4o-mini, gpt-4o, gpt-4-turbo, gpt-3.5-turbo
-   - Default: gpt-4o-mini
-
-4. **Instructions Section**
-   - Label: "Instructions"
-   - Button: "Use Current Prompt" (copies from main editor)
-   - Textarea: multiline, monospace font, 120px min-height
-
-5. **Temperature Slider**
-   - Label: "Temperature: {value}"
-   - Range: 0 to 1, step 0.1
-   - Labels below: "Focused (0)" | "Balanced (0.5)" | "Creative (1)"
-
-6. **Tools Section**
-   - Label: "Tools" with "+ Add Tool" button
-   - Each tool shows:
-     - Type: File Search (hardcoded for now)
-     - Input: Vector Store ID
-     - Input: Max Results (number)
-     - Remove button
-
-7. **Commit Message**
-   - Optional text input
-   - Placeholder: "Describe this configuration..."
-
-8. **Save Button**
-   - Full width, green (#2da44e)
-   - Text: "Save Configuration"
-
-**Data Structure for Config:**
-
-```javascript
-{
-  id: 'cfg1',
-  name: 'Main Config',
-  version: 1,
-  timestamp: Date.now(),
-  config_blob: {
-    completion: {
-      provider: 'openai',
-      params: {
-        model: 'gpt-4o-mini',
-        instructions: '...',
-        temperature: 0.7,
-        tools: [
-          {
-            type: 'file_search',
-            knowledge_base_ids: ['vs_abc123'],
-            max_num_results: 20
-          }
-        ]
-      }
-    }
-  },
-  commitMessage: 'Optional message'
-}
-```
-
-### 3. History Tab
-
-**Display:**
-
-- List of all saved configs (reverse chronological)
-- Each card shows:
-  - Config name (vX)
-  - Model • temp: X
-  - Timestamp (formatted like "2h ago", "3d ago")
-  - Commit message (if exists, italicized)
-- Click card to load that config into Current tab
-- Active config highlighted
-
-### 4. A/B Test Tab
-
-**Variant Configuration:**
-
-- Show 2 variants by default (A and B)
-- Each variant card contains:
-  - Header: "Variant A/B/C/D"
-  - Config dropdown: Select from saved configs
-  - Prompt dropdown: Select from commit history (show: "#ID: message (branch)")
-  - Preview box (readonly): Shows model, temp, first line of prompt
-- "+ Add Variant" button (max 4 variants)
-
-**Test Input Section:**
-
-- Label: "Test Input"
-- Textarea for test prompt
-
-**Run Test Button:**
-
-- Full width, green
-- Text: "▶ Run Test"
-- Disabled if no test input
-
-**Results Section (appears after running):**
-
-- Card for each variant showing:
-  - Variant name
-  - Score (0.00-1.00 format)
-  - Config name • Commit message
-  - Latency in ms
-- Highlight best performer with "🏆 Best: Variant X" in green box
-
-**Test Simulation:**
-
-```javascript
-// For PoC, simulate API call:
-await new Promise((resolve) => setTimeout(resolve, 1500));
-const score = 0.7 + Math.random() * 0.25;
-const latency = 200 + Math.random() * 400;
-```
-
-### 5. State Management
-
-**New State Variables Needed:**
-
-```javascript
-// Drawer
-const [drawerOpen, setDrawerOpen] = useState(false);
-const [drawerTab, setDrawerTab] = useState("config");
-
-// Configs
-const [configs, setConfigs] = useState([]);
-const [selectedConfigId, setSelectedConfigId] = useState("");
-const [configName, setConfigName] = useState("");
-const [provider, setProvider] = useState("openai");
-const [model, setModel] = useState("gpt-4o-mini");
-const [instructions, setInstructions] = useState("");
-const [temperature, setTemperature] = useState(0.7);
-const [tools, setTools] = useState([]);
-const [configCommitMsg, setConfigCommitMsg] = useState("");
-
-// A/B Testing
-const [variants, setVariants] = useState([
-  { id: "A", configId: "", commitId: "", name: "Variant A" },
-  { id: "B", configId: "", commitId: "", name: "Variant B" },
-]);
-const [testInput, setTestInput] = useState("");
-const [testResults, setTestResults] = useState(null);
-const [isRunningTest, setIsRunningTest] = useState(false);
-```
-
-### 6. Key Functions to Implement
-
-```javascript
-// Save new config version
-const saveConfig = () => {
-  // Validate config name exists
-  // Create new config object with incremented version
-  // Add to configs array
-  // Show success alert
-};
-
-// Load existing config
-const loadConfig = (configId) => {
-  // Find config by ID
-  // Populate all form fields
-  // Set as selected config
-};
-
-// Add/remove/update tools
-const addTool = () => {
-  /* Add empty tool */
-};
-const removeTool = (index) => {
-  /* Remove by index */
-};
-const updateTool = (index, field, value) => {
-  /* Update specific field */
-};
-
-// Run A/B test
-const runABTest = async () => {
-  // Validate test input exists
-  // Set loading state
-  // Simulate API calls (1.5s delay)
-  // Generate mock scores and latencies
-  // Display results
-};
-
-// Manage variants
-const addVariant = () => {
-  /* Max 4 variants */
-};
-const updateVariant = (index, field, value) => {
-  /* Update variant config */
-};
-```
-
-### 7. UI/UX Details
-
-**Colors:**
-
-Use the current black-and-white (B/W) color scheme so the drawer stays consistent with the existing design system.
-
-**Spacing:**
-
-- Drawer padding: 20px
-- Section spacing: 16px bottom margin
-- Input padding: 8px
-- Label font: 12px, weight 600
-
-**Interactions:**
-
-- FAB hover: scale(1.1) transform
-- Drawer animation: slide in from right (can use conditional render for MVP)
-- Close drawer on: X button click, overlay click (optional)
-
-### 8. Integration Points
-
-**With Existing System:**
-
-- Access `currentContent` from main editor for "Use Current Prompt"
-- Access `commits` array for A/B test prompt selection
-- Add "▶ Run A/B Test" button in header (opens drawer to A/B tab)
-
-### 9. Starting Point
-
-If you have the existing version control code, add:
-
-1. FAB button positioned fixed bottom-right
-2. Conditional render of drawer when `drawerOpen === true`
-3. Tab switching logic
-4. Form fields with controlled inputs
-5. A/B test variant management
-
-The drawer should NOT affect the existing version control tree, editor, or diff views. It's purely additive.
-
-## File Structure
-
-- Single React component (or can split into sub-components)
-- Keep all state in parent component for MVP
-- No external dependencies beyond React
-
-## Success Criteria
-
-✅ FAB opens/closes drawer
-✅ Can create and save configs with all fields
-✅ Can load previous configs from history
-✅ Can set up 2-4 A/B test variants
-✅ Can run test and see simulated results
-✅ Results show winner clearly
-✅ "Use Current Prompt" syncs editor content
-✅ UI is clean and uncluttered
diff --git a/instructions/CONFIG_API.md b/instructions/CONFIG_API.md
deleted file mode 100644
index c9ade1a..0000000
--- a/instructions/CONFIG_API.md
+++ /dev/null
@@ -1,215 +0,0 @@
-# Config Management API Integration Instructions
-
-## Overview
-
-Integrate the Config Management APIs into an existing Next.js UI. The API manages LLM configurations with version control (similar to git commits for config changes).
-
-## Base URL & Auth
-
-- Base: `/api/v1/configs`
-- Auth: Bearer token via `Authorization` header OR API key via `X-API-KEY` header
-
----
-
-## API Endpoints
-
-### 1. Configs (Parent Entity)
-
-#### List Configs
-
-```
-GET /api/v1/configs/
-Query: skip (default 0), limit (default 100, max 100)
-Response: { success: boolean, data: ConfigPublic[], error?: string }
-```
-
-#### Create Config
-
-```
-POST /api/v1/configs/
-Body: ConfigCreate
-Response 201: { success: boolean, data: ConfigWithVersion }
-```
-
-#### Get Config
-
-```
-GET /api/v1/configs/{config_id}
-Response: { success: boolean, data: ConfigPublic }
-```
-
-#### Update Config (metadata only)
-
-```
-PATCH /api/v1/configs/{config_id}
-Body: ConfigUpdate
-Response: { success: boolean, data: ConfigPublic }
-```
-
-#### Delete Config
-
-```
-DELETE /api/v1/configs/{config_id}
-Response: { success: boolean, data: { message: string } }
-```
-
-### 2. Config Versions (Child Entity)
-
-#### List Versions
-
-```
-GET /api/v1/configs/{config_id}/versions
-Query: skip, limit
-Response: { success: boolean, data: ConfigVersionItems[] }
-```
-
-#### Create Version
-
-```
-POST /api/v1/configs/{config_id}/versions
-Body: ConfigVersionCreate
-Response 201: { success: boolean, data: ConfigVersionPublic }
-```
-
-#### Get Specific Version
-
-```
-GET /api/v1/configs/{config_id}/versions/{version_number}
-Response: { success: boolean, data: ConfigVersionPublic }
-```
-
-#### Delete Version
-
-```
-DELETE /api/v1/configs/{config_id}/versions/{version_number}
-Response: { success: boolean, data: { message: string } }
-```
-
----
-
-## TypeScript Types
-
-```typescript
-// Request Types
-interface ConfigCreate {
-  name: string; // 1-128 chars, unique per project
-  description?: string | null; // max 512 chars
-  config_blob: ConfigBlob;
-  commit_message?: string | null; // max 512 chars
-}
-
-interface ConfigUpdate {
-  name?: string | null; // 1-128 chars
-  description?: string | null; // max 512 chars
-}
-
-interface ConfigVersionCreate {
-  config_blob: ConfigBlob;
-  commit_message?: string | null; // max 512 chars
-}
-
-interface ConfigBlob {
-  completion: CompletionConfig;
-}
-
-interface CompletionConfig {
-  provider: "openai"; // currently only "openai"
-  params: Record<string, any>; // provider-specific params (model, temperature, etc.)
-}
-
-// Response Types
-interface ConfigPublic {
-  id: string; // UUID
-  name: string;
-  description: string | null;
-  project_id: number;
-  inserted_at: string; // ISO datetime
-  updated_at: string; // ISO datetime
-}
-
-interface ConfigWithVersion extends ConfigPublic {
-  version: ConfigVersionPublic;
-}
-
-interface ConfigVersionPublic {
-  id: string; // UUID
-  config_id: string; // UUID
-  version: number; // starts at 1, auto-increments
-  config_blob: Record<string, any>;
-  commit_message: string | null;
-  inserted_at: string;
-  updated_at: string;
-}
-
-interface ConfigVersionItems {
-  id: string; // UUID
-  config_id: string; // UUID
-  version: number;
-  commit_message: string | null;
-  inserted_at: string;
-  updated_at: string;
-  // Note: config_blob excluded for list performance
-}
-
-interface APIResponse<T> {
-  success: boolean;
-  data: T | null;
-  error?: string | null;
-  metadata?: Record<string, any> | null;
-}
-```
-
----
-
-## Example config_blob
-
-```json
-{
-  "completion": {
-    "provider": "openai",
-    "params": {
-      "model": "gpt-4o-mini",
-      "instructions": "You are a helpful assistant...",
-      "temperature": 1,
-      "tools": [
-        {
-          "type": "file_search",
-          "knowledge_base_ids": ["vs_692d71f3f5708191b1c46525f3c1e196"],
-          "max_num_results": 20
-        }
-      ]
-    }
-  }
-}
-```
-
----
-
-## UI Implementation Notes
-
-1. **Config List View**: Display name, description, updated_at. Click to view versions.
-
-2. **Config Create Form**:
-   - name (required, unique)
-   - description (optional)
-   - config_blob JSON editor or structured form
-   - commit_message (optional, for initial version)
-
-3. **Version History View**:
-   - Show versions in descending order (newest first)
-   - Display version number, commit_message, timestamps
-   - Click version to view full config_blob
-
-4. **Create New Version**:
-   - Load current version's config_blob as starting point
-   - Allow editing config_blob
-   - Add commit_message to describe changes
-   - Auto-increments version number
-
-5. **Diff View** (optional enhancement):
-   - Compare config_blob between versions
-   - Highlight changes
-
-6. **Error Handling**:
-   - 422: Validation errors (check response.error)
-   - Duplicate name error when creating config
diff --git a/instructions/TESTING_MOCK_DATA.md b/instructions/TESTING_MOCK_DATA.md
deleted file mode 100644
index f62f758..0000000
--- a/instructions/TESTING_MOCK_DATA.md
+++ /dev/null
@@ -1,222 +0,0 @@
-# Testing with Mock Evaluation Data
-
-This guide explains how to test the new evaluation report UI with mock data.
-
-## Quick Start
-
-### Option 1: Using the Test Page (Easiest)
-
-1. Start the development server:
-
-   ```bash
-   npm run dev
-   ```
-
-2. Navigate to: **http://localhost:3000/test-evaluation**
-
-3. Click on either evaluation card to view the mock data
-
-### Option 2: Direct URL Access
-
-Navigate directly to the evaluation detail pages:
-
-- **Evaluation #43 (Hindi)**: http://localhost:3000/evaluations/43
-- **Evaluation #44 (English)**: http://localhost:3000/evaluations/44
-
-## Mock Data Files
-
-Located in `/public/mock-data/`:
-
-### `evaluation-sample-1.json` (ID: 43)
-
-- **Language**: Hindi
-- **Items**: 4 Q&A pairs
-- **Scores**:
-  - cosine_similarity (NUMERIC)
-  - SNEHA correctness (NUMERIC)
-  - llm_judge_relevance (NUMERIC)
-  - response_category (CATEGORICAL)
-- **Features**: Mix of CORRECT, PARTIAL, and INCORRECT responses
-
-### `evaluation-sample-2.json` (ID: 44)
-
-- **Language**: English
-- **Items**: 3 Q&A pairs
-- **Scores**: Same as above
-- **Features**: Higher average scores, includes assistant config
-- **Special**: 2 CORRECT, 1 PARTIAL (no INCORRECT)
-
-## What to Test
-
-### 1. Table View
-
-- ✅ Question, Answer, Ground Truth columns display properly
-- ✅ All score columns appear dynamically
-- ✅ Long text truncates with expand/collapse (details/summary)
-- ✅ Score values are color-coded (green/yellow/red)
-- ✅ Comments appear below scores
-- ✅ No trace IDs visible (as requested)
-- ✅ Row hover effects work
-
-### 2. Metrics Overview
-
-- ✅ All NUMERIC metrics show avg ± std
-- ✅ CATEGORICAL metrics show distribution
-- ✅ Responsive grid layout
-- ✅ Proper formatting (3 decimal places for scores)
-
-### 3. CSV Export
-
-- ✅ Click "Export CSV" button
-- ✅ File downloads with all columns
-- ✅ Q&A pairs and scores included
-- ✅ Proper CSV escaping
-
-### 4. Navigation
-
-- ✅ Back button returns to /evaluations?tab=results
-- ✅ View Config button opens modal
-- ✅ Sidebar navigation works
-
-### 5. Assistant Info
-
-- ✅ Evaluation #44 shows assistant badge
-- ✅ Evaluation #43 shows no assistant
-
-## Switching Between Mock and Real Data
-
-### Enable Mock Data (Default)
-
-In `/app/api/evaluations/[id]/route.ts`:
-
-```typescript
-const USE_MOCK_DATA = true;
-```
-
-### Disable Mock Data (Use Real Backend)
-
-```typescript
-const USE_MOCK_DATA = false;
-```
-
-**Note**: After changing this, restart your dev server.
-
-## ID Mapping
-
-The mock API maps IDs to files:
-
-- **ID 43, 1, or any other number** → `evaluation-sample-1.json`
-- **ID 44 or 2** → `evaluation-sample-2.json`
-
-You can modify this mapping in `/app/api/evaluations/[id]/route.ts`
-
-## Adding More Mock Data
-
-1. Create a new JSON file in `/public/mock-data/`
-2. Follow the structure in existing samples
-3. Update the ID mapping in the API route:
-
-```typescript
-let mockFileName = "evaluation-sample-1.json";
-if (id === "44" || id === "2") {
-  mockFileName = "evaluation-sample-2.json";
-} else if (id === "45") {
-  mockFileName = "your-new-file.json"; // Add your mapping
-}
-```
-
-## Expected Response Structure
-
-The mock data follows this structure:
-
-```json
-{
-  "id": 43,
-  "run_name": "...",
-  "dataset_name": "...",
-  "status": "completed",
-  "total_items": 4,
-  "scores": {
-    "summary_scores": [
-      {
-        "name": "cosine_similarity",
-        "avg": 0.453,
-        "std": 0.06,
-        "total_pairs": 4,
-        "data_type": "NUMERIC"
-      },
-      {
-        "name": "response_category",
-        "distribution": { "CORRECT": 1, "PARTIAL": 2, "INCORRECT": 1 },
-        "total_pairs": 4,
-        "data_type": "CATEGORICAL"
-      }
-    ],
-    "individual_scores": [
-      {
-        "trace_id": "...",
-        "input": { "question": "..." },
-        "output": { "answer": "..." },
-        "metadata": { "ground_truth": "..." },
-        "trace_scores": [
-          {
-            "name": "cosine_similarity",
-            "value": 0.452,
-            "data_type": "NUMERIC"
-          },
-          {
-            "name": "response_category",
-            "value": "INCORRECT",
-            "data_type": "CATEGORICAL"
-          }
-        ]
-      }
-    ]
-  }
-}
-```
-
-## Troubleshooting
-
-### Mock data not loading
-
-- Check console for `[MOCK MODE]` logs
-- Verify files exist in `/public/mock-data/`
-- Ensure `USE_MOCK_DATA = true`
-
-### Table not showing
-
-- Check browser console for errors
-- Verify `scores.individual_scores` exists in JSON
-- Check that all required fields are present
-
-### Scores not color-coded
-
-- Verify `data_type` is set correctly
-- Check that NUMERIC values are numbers, not strings
-- Ensure CATEGORICAL values match expected values
-
-## Production Deployment
-
-**IMPORTANT**: Before deploying to production:
-
-1. Set `USE_MOCK_DATA = false` in `/app/api/evaluations/[id]/route.ts`
-2. Delete or hide `/app/test-evaluation/page.tsx` (optional)
-3. Test with real backend to ensure everything works
-
-## Next Steps
-
-After testing with mock data and confirming the UI works:
-
-1. Update the backend API to return the new structure
-2. Set `USE_MOCK_DATA = false`
-3. Test with real evaluation data
-4. Deploy to production
-
----
-
-**Need Help?** Check the implementation files:
-
-- Type definitions: `/app/components/types.ts`
-- Table component: `/app/components/DetailedResultsTable.tsx`
-- Detail page: `/app/evaluations/[id]/page.tsx`
diff --git a/instructions/VERCEL_DESIGN_SYSTEM.md b/instructions/VERCEL_DESIGN_SYSTEM.md
deleted file mode 100644
index b1c1ba9..0000000
--- a/instructions/VERCEL_DESIGN_SYSTEM.md
+++ /dev/null
@@ -1,708 +0,0 @@
-# Vercel/shadcn Design System Aesthetics
-
-A comprehensive guide to reproducing the minimalist, modern design aesthetic inspired by Vercel and shadcn/ui.
-
-## Philosophy
-
-**Minimalism First**: Every element serves a purpose. No decorative flourishes, no unnecessary effects. The design is invisible until it needs to be visible.
-
-**Subtle Interactions**: Transitions are quick (0.15-0.2s) and purposeful. Hover states provide immediate feedback without being distracting.
-
-**Hierarchy Through Restraint**: Visual hierarchy comes from careful use of weight, spacing, and subtle color variations—not bold colors or heavy effects.
-
----
-
-## Color Palette
-
-### Core Colors
-
-**Light Mode**
-
-```
-Backgrounds:
-- Primary: #ffffff (pure white)
-- Secondary: #fafafa (barely-there gray)
-
-Text:
-- Primary: #171717 (near-black, not pure black)
-- Secondary: #737373 (muted gray for less important text)
-
-Borders:
-- Standard: #e5e5e5 (very light gray, barely visible)
-
-Accent:
-- Primary: #171717 (same as text primary—unified system)
-- Hover: #404040 (slightly lighter on hover)
-```
-
-**Dark Mode**
-
-```
-Backgrounds:
-- Primary: #000000 (pure black)
-- Secondary: #0a0a0a (barely-there lighter)
-
-Text:
-- Primary: #ededed (off-white)
-- Secondary: #a1a1a1 (muted gray)
-
-Borders:
-- Standard: #262626 (subtle dark gray)
-```
-
-### Semantic Colors
-
-Used sparingly for status and feedback:
-
-```
-Success: #16a34a (green-600)
-Error: #dc2626 (red-600)
-Warning: #f59e0b (amber-500)
-```
-
-### Color Usage Rules
-
-1. **Never use pure black (#000) for text** in light mode—use #171717 instead
-2. **Borders should be barely visible**—#e5e5e5 is the standard
-3. **Background variations are subtle**—primary (#fff) vs secondary (#fafafa)
-4. **Accent colors match text colors**—creates unified, cohesive system
-5. **Status colors only appear when needed**—success/error states
-
----
-
-## Typography
-
-### Font Stack
-
-- **Sans-serif**: System font stack or Geist Sans (Vercel's font)
-- **Monospace**: Geist Mono for code
-
-### Text Sizing
-
-```
-Extra Small: 10px (badges, labels)
-Small: 12px (secondary UI, submenus)
-Base: 14px (primary UI, body text)
-Medium: 16px (headings, emphasized text)
-Large: 20px+ (page titles, hero text)
-```
-
-### Font Weights
-
-```
-Regular: 400 (default text)
-Medium: 500 (interactive elements, subheadings)
-Semibold: 600 (active states, emphasis)
-```
-
-### Typography Rules
-
-1. **Use font weight for hierarchy**, not size differences
-2. **Active/selected states use weight 500-600**
-3. **Secondary text uses lighter weight AND color**
-4. **Letter spacing**: -0.01em for headings (tight tracking)
-5. **Line height**: Tight for UI (1.2-1.4), comfortable for body (1.5-1.6)
-
----
-
-## Spacing System
-
-### Scale (based on 4px grid)
-
-```
-0.5 → 2px (tight gaps)
-1 → 4px (minimal spacing)
-1.5 → 6px (small gaps)
-2 → 8px (standard small)
-2.5 → 10px (compact spacing)
-3 → 12px (standard medium)
-4 → 16px (comfortable spacing)
-5 → 20px (generous spacing)
-6 → 24px (section spacing)
-```
-
-### Padding Patterns
-
-```
-Buttons: px-3 py-2 (12px × 8px)
-Inputs: px-3 py-2 (12px × 8px)
-Cards: p-4 to p-6 (16px-24px)
-Containers: px-6 py-6 (24px all sides)
-Sections: py-8 to py-12 (32px-48px vertical)
-```
-
-### Margin Patterns
-
-```
-Between elements: 8-12px (space-y-2 to space-y-3)
-Between sections: 24-32px (my-6 to my-8)
-Page margins: 24px minimum (px-6)
-```
-
----
-
-## Components
-
-### Buttons
-
-**Primary Button**
-
-```
-Background: #171717
-Text: #ffffff
-Padding: 12px 16px
-Border: none
-Radius: 6px
-Font: 14px, weight 500
-Transition: all 0.2s ease
-
-Hover:
-- Background: #404040
-- No scale/shadow effects
-
-Disabled:
-- Background: #e5e5e5
-- Text: #a1a1a1
-- Cursor: not-allowed
-```
-
-**Secondary Button**
-
-```
-Background: transparent
-Text: #171717
-Border: 1px solid #e5e5e5
-Padding: 12px 16px
-Radius: 6px
-Font: 14px, weight 500
-
-Hover:
-- Background: #fafafa
-- Border: #d4d4d4
-```
-
-**Ghost Button**
-
-```
-Background: transparent
-Text: #737373
-Border: none
-Padding: 8px 12px
-
-Hover:
-- Text: #171717
-- Background: #fafafa
-```
-
-### Input Fields
-
-```
-Background: #ffffff
-Border: 1px solid #e5e5e5
-Padding: 12px
-Radius: 6px
-Font: 14px
-Text: #171717
-
-Focus:
-- Border: #171717
-- No glow/shadow
-- Outline: none (use border instead)
-
-Placeholder:
-- Color: #a1a1a1
-- Font style: normal (not italic)
-```
-
-### Cards
-
-```
-Background: #ffffff
-Border: 1px solid #e5e5e5
-Radius: 8px
-Padding: 16-24px
-Shadow: none (or very subtle: 0 1px 2px rgba(0,0,0,0.05))
-
-Hover (if interactive):
-- Border: #d4d4d4
-- No shadow increase
-```
-
-### Navigation Items
-
-**Sidebar Item**
-
-```
-Default:
-- Background: transparent
-- Text: #737373
-- Font weight: 400-500
-- Padding: 8px 12px
-- Radius: 6px
-
-Hover:
-- Background: #ffffff (or primary bg)
-- Text: #171717
-
-Active:
-- Background: #ffffff
-- Text: #171717
-- Font weight: 600
-- Border: 1px solid #e5e5e5
-```
-
-**Tab Navigation**
-
-```
-Default:
-- Border bottom: 2px transparent
-- Text: #737373
-- Font weight: 400
-- Padding: 12px 16px
-
-Active:
-- Border bottom: 2px #171717
-- Text: #171717
-- Font weight: 500
-```
-
-### Badges/Pills
-
-```
-Background: #fafafa
-Text: #171717
-Padding: 4px 8px
-Radius: 4px (fully rounded: 9999px)
-Font: 11-12px
-Font weight: 500
-
-Status Variants:
-- Success: bg #dcfce7, text #15803d
-- Error: bg #fee2e2, text #dc2626
-- Warning: bg #fef3c7, text #92400e
-```
-
-### Modals/Dialogs
-
-```
-Backdrop:
-- Background: rgba(0, 0, 0, 0.4)
-- Animation: fade in 0.2s
-
-Container:
-- Background: #ffffff
-- Border: 1px solid #e5e5e5
-- Radius: 12px
-- Padding: 24px
-- Max width: 500px
-- Shadow: 0 4px 12px rgba(0, 0, 0, 0.1)
-- Animation: fade + scale (0.95 → 1.0) 0.3s
-
-Close button:
-- Position: top-right
-- Size: 32px
-- Icon: X mark
-- Color: #737373
-- Hover: #171717
-```
-
-### Tables
-
-```
-Container:
-- Border: 1px solid #e5e5e5
-- Radius: 8px
-- Overflow: hidden
-
-Header:
-- Background: #fafafa
-- Text: #171717
-- Font weight: 600
-- Padding: 12px 16px
-- Border bottom: 1px solid #e5e5e5
-
-Row:
-- Background: #ffffff
-- Border bottom: 1px solid #e5e5e5
-- Padding: 12px 16px
-
-Row Hover:
-- Background: #fafafa
-
-Last row:
-- No border bottom
-```
-
----
-
-## Layout Patterns
-
-### Sidebar Navigation
-
-```
-Width: 240px
-Background: #fafafa
-Border: 1px solid #e5e5e5 (right)
-Height: 100vh
-Flex: column
-
-Collapse:
-- Width: 0px
-- Overflow: hidden
-- Transition: 0.3s ease
-```
-
-### Page Container
-
-```
-Max width: 1280px (or 100% for full-width)
-Padding: 24px
-Margin: 0 auto
-```
-
-### Content Sections
-
-```
-Background: #ffffff
-Border: 1px solid #e5e5e5
-Radius: 8px
-Padding: 24px
-Margin: 16px 0
-```
-
----
-
-## Animation & Transitions
-
-### Timing Functions
-
-```
-Standard: ease-in-out
-Quick: ease (for micro-interactions)
-Entry: ease-out
-Exit: ease-in
-```
-
-### Duration Scale
-
-```
-Instant: 50ms (color changes)
-Quick: 150ms (hover states, text color)
-Standard: 200ms (backgrounds, borders)
-Medium: 300ms (modals, drawers)
-Slow: 500ms (layout changes)
-```
-
-### Common Animations
-
-**Fade In**
-
-```css
-@keyframes fadeIn {
- from {
- opacity: 0;
- transform: translateY(-4px);
- }
- to {
- opacity: 1;
- transform: translateY(0);
- }
-}
-/* Duration: 0.2s */
-```
-
-**Modal Entry**
-
-```css
-@keyframes modalSlideUp {
- from {
- opacity: 0;
- transform: translateY(20px) scale(0.95);
- }
- to {
- opacity: 1;
- transform: translateY(0) scale(1);
- }
-}
-/* Duration: 0.3s */
-```
-
-**Page Transition**
-
-```css
-@keyframes pageIn {
- from {
- opacity: 0;
- transform: translateY(8px);
- }
- to {
- opacity: 1;
- transform: translateY(0);
- }
-}
-/* Duration: 0.3s */
-```
-
-### Animation Rules
-
-1. **Hover transitions are 150-200ms**—fast enough to feel instant
-2. **No easing curves longer than cubic-bezier**—keep it simple
-3. **Entrance animations are subtle**—4-8px movement max
-4. **Never animate on exit unless closing**—just fade out
-5. **No bounce, elastic, or attention-seeking effects**
-
----
-
-## Interaction Patterns
-
-### Hover States
-
-**General Rules**
-
-- Background lightens slightly (#fafafa)
-- Text darkens to primary color (#171717)
-- Border darkens one shade
-- No scale/transform effects
-- Transition: 150ms
-
-### Focus States
-
-**Keyboard Navigation**
-
-- Use border color change, not glow
-- Border: 2px solid #171717
-- No box-shadow outline
-- Visible and clear
-
-### Active/Pressed States
-
-**On Click**
-
-- Slightly darker background
-- No scale down
-- 100ms transition (faster than hover)
-
-### Loading States
-
-**Skeleton Loaders**
-
-```
-Background: #fafafa
-Animation: pulse (opacity 1 → 0.5 → 1)
-Duration: 2s infinite
-Border: same as element would have
-Radius: match final element
-```
-
-**Spinners**
-
-```
-Size: 16-24px
-Color: #171717
-Animation: spin 1s linear infinite
-Line width: 2px
-```
-
----
-
-## Iconography
-
-### Icon Style
-
-- **Outline style** (not filled)
-- **2px stroke width**
-- **24px default size** (scale down to 16px for compact UI)
-- **Rounded line caps and joins**
-- **Match text color** of surrounding context
-
-### Icon Spacing
-
-- **Gap from text**: 8-10px (0.5rem to 0.625rem)
-- **Icon-only buttons**: 32px × 32px touch target minimum
-
----
-
-## Shadows (Use Sparingly)
-
-```
-None: (default—no shadow)
-Subtle: 0 1px 2px rgba(0, 0, 0, 0.05)
-Light: 0 1px 3px rgba(0, 0, 0, 0.1)
-Medium: 0 4px 6px rgba(0, 0, 0, 0.1)
-Heavy: 0 10px 15px rgba(0, 0, 0, 0.1)
-```
-
-**When to Use Shadows**
-
-- Modals/dialogs: medium
-- Dropdown menus: light
-- Cards: none or subtle
-- Buttons: never
-- Popovers: light
-
----
-
-## Border Radius Scale
-
-```
-Small: 4px (badges, pills)
-Default: 6px (buttons, inputs)
-Medium: 8px (cards, containers)
-Large: 12px (modals, large panels)
-Full: 9999px (circular buttons, pills)
-```
-
----
-
-## Responsive Breakpoints
-
-```
-Mobile: < 640px
-Tablet: 640px - 1024px
-Desktop: 1024px+
-Wide: 1280px+
-```
-
-### Mobile Adaptations
-
-- Reduce padding: 16px instead of 24px
-- Collapse sidebar to overlay/drawer
-- Stack horizontal layouts vertically
-- Reduce font sizes slightly (13px base instead of 14px)
-- Increase touch targets to 44px minimum
-
----
-
-## Dark Mode Considerations
-
-### Automatic Switching
-
-```css
-@media (prefers-color-scheme: dark) {
- /* Apply dark theme */
-}
-```
-
-### Dark Mode Colors
-
-**Backgrounds**
-
-- Pure black (#000) for drama
-- Slightly lighter (#0a0a0a) for panels
-- Very subtle borders (#262626)
-
-**Text**
-
-- Off-white (#ededed) not pure white
-- Gray (#a1a1a1) for secondary
-
-**Borders**
-
-- Much darker but still subtle (#262626)
-
-**Key Difference**: Dark mode has higher contrast between elements to maintain readability.
-
----
-
-## Common Mistakes to Avoid
-
-1. ❌ **Heavy drop shadows**—use subtle borders instead
-2. ❌ **Bold accent colors**—keep it monochrome with rare color use
-3. ❌ **Complex gradients**—solid colors only
-4. ❌ **Slow animations**—keep everything under 300ms
-5. ❌ **Scale/transform on hover**—just color/background changes
-6. ❌ **Too much border radius**—8px is usually the max
-7. ❌ **Pure black text**—use #171717 in light mode
-8. ❌ **Thick borders**—1px is standard, 2px for focus only
-9. ❌ **Colorful UI elements**—status colors only when needed
-10. ❌ **Overly tight spacing**—respect the 4px grid
-
----
-
-## Design Checklist
-
-When implementing a new component, ensure:
-
-- [ ] Uses colors from centralized palette
-- [ ] Border is 1px solid #e5e5e5 (or transparent)
-- [ ] Border radius is 6-8px
-- [ ] Padding follows 4px grid
-- [ ] Font size is 14px (or 12px for compact)
-- [ ] Font weight is 400-600 range
-- [ ] Hover transition is 150-200ms
-- [ ] No drop shadows (except modals)
-- [ ] Text color is #171717 or #737373
-- [ ] Background is #ffffff or #fafafa
-- [ ] Icons are 16-24px outline style
-- [ ] Touch targets are 32px+ for interactive elements
-- [ ] Animation is subtle and quick
-- [ ] Responsive on mobile (16px padding minimum)
-
----
-
-## Implementation Notes
-
-### CSS Variables Approach
-
-```css
-:root {
- --bg-primary: #ffffff;
- --bg-secondary: #fafafa;
- --text-primary: #171717;
- --text-secondary: #737373;
- --border: #e5e5e5;
- --radius: 8px;
- --transition: 0.2s ease;
-}
-```
-
-### Tailwind CSS Approach
-
-```javascript
-// tailwind.config.js
-theme: {
- colors: {
- bg: { primary: '#ffffff', secondary: '#fafafa' },
- text: { primary: '#171717', secondary: '#737373' },
- border: '#e5e5e5',
- },
- borderRadius: {
- DEFAULT: '6px',
- lg: '8px',
- xl: '12px',
- },
- transitionDuration: {
- DEFAULT: '200ms',
- fast: '150ms',
- }
-}
-```
-
----
-
-## Inspiration Sources
-
-- **Vercel Dashboard**: vercel.com/dashboard
-- **shadcn/ui**: ui.shadcn.com
-- **Linear**: linear.app
-- **GitHub**: github.com (2023+ design)
-- **Raycast**: raycast.com
-
----
-
-## Summary
-
-The Vercel/shadcn aesthetic is defined by:
-
-1. **Extreme minimalism**—every pixel has purpose
-2. **Near-monochrome palette**—black, white, grays
-3. **Subtle borders and backgrounds**—barely visible until needed
-4. **Quick, purposeful transitions**—150-200ms standard
-5. **Typography-driven hierarchy**—weight and spacing over color
-6. **No decorative effects**—no shadows, gradients, or transforms
-7. **System fonts**—fast loading, native feel
-8. **Generous whitespace**—let content breathe
-9. **Status colors used sparingly**—only when semantically needed
-10. **Dark mode as first-class**—not an afterthought
-
-This creates interfaces that feel fast, professional, and get out of the user's way.
diff --git a/public/mock-data/evaluation-sample-1.json b/public/mock-data/evaluation-sample-1.json
deleted file mode 100644
index 75dc123..0000000
--- a/public/mock-data/evaluation-sample-1.json
+++ /dev/null
@@ -1,211 +0,0 @@
-{
- "id": 43,
- "run_name": "Hindi FAQ Evaluation - Run 1",
- "dataset_name": "hindi_policy_qa_5_rows",
- "config": {
- "model": "gpt-4",
- "instructions": "You are a helpful FAQ assistant for policy questions.",
- "temperature": 0.7
- },
- "assistant_id": null,
- "dataset_id": 50,
- "batch_job_id": 71,
- "embedding_batch_job_id": 72,
- "status": "completed",
- "object_store_url": "s3://ai-platform-documents-staging/evaluations/43",
- "total_items": 4,
- "scores": {
- "summary_scores": [
- {
- "name": "cosine_similarity",
- "avg": 0.45267303673682135,
- "std": 0.06016189626290471,
- "total_pairs": 4,
- "data_type": "NUMERIC"
- },
- {
- "name": "SNEHA correctness",
- "avg": 0.25,
- "std": 0.4330127018922193,
- "total_pairs": 4,
- "data_type": "NUMERIC"
- },
- {
- "name": "llm_judge_relevance",
- "avg": 0.75,
- "std": 0.25,
- "total_pairs": 4,
- "data_type": "NUMERIC"
- },
- {
- "name": "response_category",
- "distribution": {
- "CORRECT": 1,
- "PARTIAL": 2,
- "INCORRECT": 1
- },
- "total_pairs": 4,
- "data_type": "CATEGORICAL"
- }
- ],
- "individual_scores": [
- {
- "trace_id": "97ec280e-883e-450e-a3e9-6cd8cfc0741a",
- "input": {
- "question": "सीएलएफ में उपसमिति के कार्य की समीक्षा कौन करता है?"
- },
- "output": {
- "answer": "सीएलएफ (CLF) में उपसमिति के कार्य की समीक्षा मुख्य रूप से सीएलएफ की प्रबंधन समिति या संबंधित अधिकृत प्रतिनिधि द्वारा की जाती है।"
- },
- "metadata": {
- "ground_truth": "सीएलएफ (Cluster Level Federation) में उपसमिति के कार्य की समीक्षा कार्यकारिणी समिति (Executive Committee) करती है।",
- "item_id": "f444e132-6a25-403e-b0fa-9fd0c5452248",
- "response_id": "resp_0a2a9f840105b5ca006923fa5855f8819680d7c280c082011f"
- },
- "trace_scores": [
- {
- "name": "cosine_similarity",
- "value": 0.45159810574988546,
- "data_type": "NUMERIC"
- },
- {
- "name": "SNEHA correctness",
- "value": 0,
- "data_type": "NUMERIC",
- "comment": "The generation incorrectly claims there is no information about the specific committee"
- },
- {
- "name": "llm_judge_relevance",
- "value": 0.5,
- "data_type": "NUMERIC",
- "comment": "Response is somewhat relevant but misses key details"
- },
- {
- "name": "response_category",
- "value": "INCORRECT",
- "data_type": "CATEGORICAL"
- }
- ]
- },
- {
- "trace_id": "f1833206-e9b6-49ac-864c-90dea4076fa9",
- "input": {
- "question": "ग्राम संगठन की बैठक में कितने सदस्यों की उपस्थिति आवश्यक है?"
- },
- "output": {
- "answer": "ग्राम संगठन की बैठक में कम से कम 50% सदस्यों की उपस्थिति आवश्यक है। यह गणपूर्ति सुनिश्चित करती है कि निर्णय लोकतांत्रिक और प्रतिनिधि हों।"
- },
- "metadata": {
- "ground_truth": "ग्राम संगठन की बैठक में कम से कम 50% सदस्यों की उपस्थिति आवश्यक है।",
- "item_id": "a234b567-8901-234c-567d-890123456789",
- "response_id": "resp_1b3c5d7e9f11a13b15c17d19e21f23a25b27c29d31e33f35a37b"
- },
- "trace_scores": [
- {
- "name": "cosine_similarity",
- "value": 0.4654791998066623,
- "data_type": "NUMERIC"
- },
- {
- "name": "SNEHA correctness",
- "value": 1,
- "data_type": "NUMERIC",
- "comment": "Response correctly identifies all key points"
- },
- {
- "name": "llm_judge_relevance",
- "value": 1,
- "data_type": "NUMERIC",
- "comment": "Highly relevant and accurate response"
- },
- {
- "name": "response_category",
- "value": "CORRECT",
- "data_type": "CATEGORICAL"
- }
- ]
- },
- {
- "trace_id": "b4a85769-96e2-402b-ab79-c1338e85c8b1",
- "input": {
- "question": "स्वयं सहायता समूह में कितनी महिलाएं होनी चाहिए?"
- },
- "output": {
- "answer": "स्वयं सहायता समूह में आमतौर पर 10-20 महिलाएं होनी चाहिए। यह आदर्श संख्या समूह की प्रभावी कार्यप्रणाली के लिए उपयुक्त मानी जाती है।"
- },
- "metadata": {
- "ground_truth": "स्वयं सहायता समूह में 10-20 महिलाएं होनी चाहिए। यह आदर्श संख्या प्रभावी संचालन और प्रबंधन के लिए उपयुक्त है। **Policy Source:** SHG Formation Guidelines",
- "item_id": "c345d678-9012-345e-678f-901234567890",
- "response_id": "resp_2c4d6e8f10a12b14c16d18e20f22a24b26c28d30e32f34a36c"
- },
- "trace_scores": [
- {
- "name": "cosine_similarity",
- "value": 0.5311989216166836,
- "data_type": "NUMERIC"
- },
- {
- "name": "SNEHA correctness",
- "value": 0,
- "data_type": "NUMERIC",
- "comment": "Missing policy source and reference"
- },
- {
- "name": "llm_judge_relevance",
- "value": 0.75,
- "data_type": "NUMERIC",
- "comment": "Good coverage but incomplete"
- },
- {
- "name": "response_category",
- "value": "PARTIAL",
- "data_type": "CATEGORICAL"
- }
- ]
- },
- {
- "trace_id": "51c505d8-efe0-454a-a339-f42a13fe94b8",
- "input": {
- "question": "समुदाय संसाधन व्यक्ति (CRP) का मुख्य कार्य क्या है?"
- },
- "output": {
- "answer": "समुदाय संसाधन व्यक्ति (CRP) का मुख्य कार्य समूहों को प्रशिक्षण देना और उन्हें तकनीकी सहायता प्रदान करना है।"
- },
- "metadata": {
- "ground_truth": "समुदाय संसाधन व्यक्ति (CRP) का मुख्य कार्य SHG सदस्यों को प्रशिक्षण, क्षमता निर्माण, और सामुदायिक संगठन में सहायता प्रदान करना है। वे फील्ड-स्तरीय सहायता और मेंटरशिप भी प्रदान करते हैं।",
- "item_id": "d456e789-0123-456f-789g-012345678901",
- "response_id": "resp_3d5e7f9g11a13b15c17d19e21f23a25b27c29d31e33f35a37d"
- },
- "trace_scores": [
- {
- "name": "cosine_similarity",
- "value": 0.36241591977405424,
- "data_type": "NUMERIC"
- },
- {
- "name": "SNEHA correctness",
- "value": 0,
- "data_type": "NUMERIC",
- "comment": "Factually incomplete - misses key responsibilities"
- },
- {
- "name": "llm_judge_relevance",
- "value": 0.5,
- "data_type": "NUMERIC",
- "comment": "Tangentially related but misses main point"
- },
- {
- "name": "response_category",
- "value": "PARTIAL",
- "data_type": "CATEGORICAL"
- }
- ]
- }
- ]
- },
- "error_message": null,
- "organization_id": 1,
- "project_id": 1,
- "inserted_at": "2025-11-17T11:07:44.609916",
- "updated_at": "2025-11-17T11:18:44.235194"
-}
diff --git a/public/mock-data/evaluation-sample-2.json b/public/mock-data/evaluation-sample-2.json
deleted file mode 100644
index bff4d0f..0000000
--- a/public/mock-data/evaluation-sample-2.json
+++ /dev/null
@@ -1,173 +0,0 @@
-{
- "id": 44,
- "run_name": "English FAQ Evaluation - Test Run",
- "dataset_name": "english_policy_qa_3_rows",
- "config": {
- "model": "gpt-4-turbo",
- "instructions": "You are a helpful assistant answering policy-related questions.",
- "temperature": 0.3
- },
- "assistant_id": "asst_abc123xyz",
- "dataset_id": 51,
- "batch_job_id": 73,
- "embedding_batch_job_id": 74,
- "status": "completed",
- "object_store_url": "s3://ai-platform-documents-staging/evaluations/44",
- "total_items": 3,
- "scores": {
- "summary_scores": [
- {
- "name": "cosine_similarity",
- "avg": 0.782,
- "std": 0.123,
- "total_pairs": 3,
- "data_type": "NUMERIC"
- },
- {
- "name": "SNEHA correctness",
- "avg": 0.667,
- "std": 0.471,
- "total_pairs": 3,
- "data_type": "NUMERIC"
- },
- {
- "name": "llm_judge_relevance",
- "avg": 0.833,
- "std": 0.236,
- "total_pairs": 3,
- "data_type": "NUMERIC"
- },
- {
- "name": "response_category",
- "distribution": {
- "CORRECT": 2,
- "PARTIAL": 1,
- "INCORRECT": 0
- },
- "total_pairs": 3,
- "data_type": "CATEGORICAL"
- }
- ],
- "individual_scores": [
- {
- "trace_id": "aaa11111-1111-1111-1111-111111111111",
- "input": {
- "question": "What is the minimum attendance required for a Village Organization meeting?"
- },
- "output": {
- "answer": "The minimum attendance required for a Village Organization meeting is 50% of the total members. This quorum ensures that decisions are representative and democratic."
- },
- "metadata": {
- "ground_truth": "A minimum of 50% of members must be present for a Village Organization meeting to proceed with decision-making. This is specified in the Community Operational Manual.",
- "item_id": "item-001",
- "response_id": "resp-eng-001"
- },
- "trace_scores": [
- {
- "name": "cosine_similarity",
- "value": 0.89,
- "data_type": "NUMERIC"
- },
- {
- "name": "SNEHA correctness",
- "value": 1,
- "data_type": "NUMERIC",
- "comment": "Accurate and complete response"
- },
- {
- "name": "llm_judge_relevance",
- "value": 1,
- "data_type": "NUMERIC",
- "comment": "Fully relevant with additional context"
- },
- {
- "name": "response_category",
- "value": "CORRECT",
- "data_type": "CATEGORICAL"
- }
- ]
- },
- {
- "trace_id": "bbb22222-2222-2222-2222-222222222222",
- "input": {
- "question": "How many women should be in a Self Help Group?"
- },
- "output": {
- "answer": "A Self Help Group should typically have between 10 to 20 women members."
- },
- "metadata": {
- "ground_truth": "A Self Help Group (SHG) should have 10-20 women members for effective operation and management.",
- "item_id": "item-002",
- "response_id": "resp-eng-002"
- },
- "trace_scores": [
- {
- "name": "cosine_similarity",
- "value": 0.85,
- "data_type": "NUMERIC"
- },
- {
- "name": "SNEHA correctness",
- "value": 1,
- "data_type": "NUMERIC",
- "comment": "Correct range provided"
- },
- {
- "name": "llm_judge_relevance",
- "value": 0.9,
- "data_type": "NUMERIC",
- "comment": "Directly answers the question"
- },
- {
- "name": "response_category",
- "value": "CORRECT",
- "data_type": "CATEGORICAL"
- }
- ]
- },
- {
- "trace_id": "ccc33333-3333-3333-3333-333333333333",
- "input": {
- "question": "What are the responsibilities of a Community Resource Person?"
- },
- "output": {
- "answer": "A Community Resource Person (CRP) is responsible for providing training and technical support to groups."
- },
- "metadata": {
- "ground_truth": "A Community Resource Person (CRP) provides training, capacity building, field-level support, mentorship, and assists in community organization activities for SHG members.",
- "item_id": "item-003",
- "response_id": "resp-eng-003"
- },
- "trace_scores": [
- {
- "name": "cosine_similarity",
- "value": 0.601,
- "data_type": "NUMERIC"
- },
- {
- "name": "SNEHA correctness",
- "value": 0,
- "data_type": "NUMERIC",
- "comment": "Incomplete - missing key responsibilities like mentorship and capacity building"
- },
- {
- "name": "llm_judge_relevance",
- "value": 0.6,
- "data_type": "NUMERIC",
- "comment": "Partially relevant but lacks detail"
- },
- {
- "name": "response_category",
- "value": "PARTIAL",
- "data_type": "CATEGORICAL"
- }
- ]
- }
- ]
- },
- "error_message": null,
- "organization_id": 1,
- "project_id": 1,
- "inserted_at": "2025-11-18T09:30:15.123456",
- "updated_at": "2025-11-18T09:42:30.654321"
-}