Conversation
add aggregate summary per evaluator
e821c05 to 02f71fa
Claude finished @yangm2's task in 1m 42s

Code Review
Overall this is a clean, focused improvement to a dev-only script. The warning suppression is well-documented (upstream issue link). Two minor points: the scoring guidelines are still partially unspecified for the citation evaluator (the PR's stated goal is to uniquify guidelines per evaluator), and there is an empty-results edge case. No security issues. No test concerns (dev tooling, as noted in the PR).
```
{SCORING_GUIDELINES}
- Score 1.0 if response is legally correct and complete
- Score 0.5 if response has minor inaccuracies or omissions
- Score 0.0 if response is legally incorrect or significantly incomplete
```
The scoring guidelines in CITATION_PROMPT still say "legally correct and complete" — that's the same language used for LEGAL_CORRECTNESS_PROMPT_TEMPLATE. Since this PR's stated goal is to uniquify guidelines per evaluator, the citation evaluator should probably use citation-specific language, something like:
```suggestion
- Score 1.0 if all citations use proper HTML anchor tags, valid ORS numbers, and link to approved domains
- Score 0.5 if citations are present but have formatting or completeness issues
- Score 0.0 if citations are missing, malformed, or link to unapproved sources
```
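To illustrate how per-evaluator guidelines keep the prompts distinct, here is a minimal sketch of substituting citation-specific guidelines into a prompt template. The constant names and template text are hypothetical, not the repo's actual code; only the guideline wording comes from the suggestion above.

```python
# Hypothetical names; only the guideline text mirrors the suggestion.
CITATION_SCORING_GUIDELINES = (
    "- Score 1.0 if all citations use proper HTML anchor tags, valid ORS numbers, "
    "and link to approved domains\n"
    "- Score 0.5 if citations are present but have formatting or completeness issues\n"
    "- Score 0.0 if citations are missing, malformed, or link to unapproved sources"
)

CITATION_PROMPT_TEMPLATE = """Evaluate the citations in the response.

Scoring guidelines:
{SCORING_GUIDELINES}
"""

# Each evaluator formats the shared template with its own guidelines,
# so no two evaluators end up scoring against the same rubric.
prompt = CITATION_PROMPT_TEMPLATE.format(SCORING_GUIDELINES=CITATION_SCORING_GUIDELINES)
print(prompt)
```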
```python
print("\n=== Evaluation Results ===")
...
# Print aggregate summary.
print("\n=== Aggregate Summary ===")
```
Minor ordering nit: "Evaluation Results" is printed as a section header but it only contains the aggregate summary and the experiment name. Consider either moving this header below the aggregate block, or dropping the redundant outer header now that the aggregate has its own:
```suggestion
# Print aggregate summary.
print("\n=== Aggregate Summary ===")
```
```python
# Print aggregate summary.
print("\n=== Aggregate Summary ===")
df = results.to_pandas()
```
If `results` contains zero examples (e.g. the dataset is empty or the run fails silently), `df[score_cols].mean()` will return NaN values, and `* 100` still produces NaN. This won't crash, but the output will read something like "legal correctness NaN%". You could add a check:
```python
if df.empty:
    print("No results to summarize.")
elif score_cols:
    ...
```

```
"types-flask-cors>=6.0.0.20250809",
"google-cloud-discoveryengine>=0.15.0",
"langchain-google-vertexai>=3.2.2",
"pandas>=3.0.1",
```
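The empty-results guard suggested above can be sketched end to end. This is an illustrative standalone version, not the script's actual code: the `feedback.` column prefix and the percentage formatting are assumptions about how LangSmith scores land in the DataFrame.

```python
import pandas as pd

def print_aggregate_summary(df: pd.DataFrame) -> None:
    """Print the mean score per evaluator, guarding the empty case.

    The "feedback." prefix is an assumed column-naming convention.
    """
    score_cols = [c for c in df.columns if c.startswith("feedback.")]
    print("\n=== Aggregate Summary ===")
    if df.empty:
        # Without this guard, .mean() on zero rows yields NaN and the
        # output would read e.g. "legal_correctness: nan%".
        print("No results to summarize.")
        return
    for col in score_cols:
        name = col.removeprefix("feedback.")
        print(f"{name}: {df[col].mean() * 100:.1f}%")

# Empty frame: the guard fires instead of printing NaN percentages.
print_aggregate_summary(pd.DataFrame())

# Two example runs: means come out as 75.0% and 50.0%.
df = pd.DataFrame({
    "feedback.legal_correctness": [1.0, 0.5],
    "feedback.citation": [0.5, 0.5],
})
print_aggregate_summary(df)
```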
pandas is a fairly heavyweight dependency just to compute column means and reformat an index. The `results.to_pandas()` API is provided by LangSmith, so pandas may already be a transitive dependency; worth checking with `uv tree | grep pandas`. If it's already pulled in, declaring it explicitly is fine. If it's genuinely new, consider whether numpy (likely already present) or a plain dict comprehension over `results` would suffice.
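For reference, the dict-comprehension alternative mentioned above might look like the following. The row shape (one dict of evaluator name to score per example) is an assumption about how `results` could be iterated, not the actual LangSmith result type.

```python
# Assumed shape: each row is {evaluator_name: score}; purely illustrative.
def aggregate(rows: list[dict[str, float]]) -> dict[str, float]:
    """Mean score per evaluator, with no pandas dependency."""
    if not rows:
        # Mirrors the empty-results guard: no rows, no summary.
        return {}
    keys = rows[0].keys()
    return {k: sum(r[k] for r in rows) / len(rows) for k in keys}

rows = [
    {"legal_correctness": 1.0, "citation": 0.5},
    {"legal_correctness": 0.5, "citation": 0.5},
]
print(aggregate(rows))  # {'legal_correctness': 0.75, 'citation': 0.5}
```

Whether this is worth it depends on the uv tree check: if pandas is already transitively installed, `to_pandas()` is the simpler path.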
TruMichael-jpg left a comment

Agreed, these criteria better suit the citation checker. I will try to run some experiments with it this week.
What type of PR is this? (check all applicable)
Description
`run_langsmith_evaluation.py` ...

Related Tickets & Documents
QA Instructions, Screenshots, Recordings
Please replace this line with instructions on how to test your changes, a note on the devices and browsers this has been tested on, as well as any relevant images for UI changes.
Added/updated tests?
Documentation
Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?