
More run evaluator fixes#297

Merged
yangm2 merged 4 commits into codeforpdx:main from yangm2:more-run-evaluator-fixes
Mar 6, 2026

Conversation


@yangm2 yangm2 commented Mar 4, 2026

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Infrastructure
  • Maintenance

Description

  • Suppress noisy warnings when running run_langsmith_evaluation.py
  • Add Aggregate Summary to run_langsmith_evaluation.py ...
    === Evaluation Results ===
    
    === Aggregate Summary ===
    legal correctness    62.5%
    appropriate tone    100.0%
    
  • Uniquify and adjust scoring guidelines per evaluator
  • Also capture VERTEX_AI_DATASTORE external dependency in experiment metadata
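The Aggregate Summary described above could be computed along these lines. This is a hedged sketch, not the script's actual code: the `feedback.`-prefixed column names, the `print_aggregate_summary` helper, and the sample scores are all assumptions chosen to reproduce the percentages quoted in the output sample.

```python
import pandas as pd

def print_aggregate_summary(df: pd.DataFrame) -> None:
    """Print the mean of each feedback score column as a percentage."""
    # Assumed convention: evaluator scores live in "feedback.*" columns.
    score_cols = [c for c in df.columns if c.startswith("feedback.")]
    print("\n=== Aggregate Summary ===")
    for col, pct in (df[score_cols].mean() * 100).items():
        name = col.removeprefix("feedback.").replace("_", " ")
        print(f"{name:<20}{pct:>6.1f}%")

# Sample scores picked so the output matches the PR body's example.
df = pd.DataFrame({
    "feedback.legal_correctness": [1.0, 0.5, 0.5, 0.5],
    "feedback.appropriate_tone": [1.0, 1.0, 1.0, 1.0],
})
print_aggregate_summary(df)
```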

Related Tickets & Documents

  • Related Issue #
  • Closes #

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note on the devices and browsers this has been tested on, as well as any relevant images for UI changes.

Added/updated tests?

  • Yes
  • No, and this is why: development tools, not production
  • I need help with writing tests

Documentation

  • If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?

@yangm2 yangm2 self-assigned this Mar 4, 2026
@yangm2 yangm2 added labels on Mar 4, 2026: enhancement (New feature or request), backend (Bot implementation and other backend concerns), dependencies (Pull requests that update a dependency file), python:uv (Pull requests that update python:uv code)
@yangm2 yangm2 force-pushed the more-run-evaluator-fixes branch from e821c05 to 02f71fa Compare March 4, 2026 23:55

github-actions bot commented Mar 4, 2026

Claude finished @yangm2's task in 1m 42s


Code Review

  • Gather context and understand the PR
  • Read changed files
  • Review code quality, security, performance, testing, documentation
  • Post final review

Overall this is a clean, focused improvement to a dev-only script. The warning suppression is well-documented (upstream issue link), the continuous=True additions improve score fidelity, and the aggregate summary is a helpful usability win. A few things worth addressing:

Scoring guidelines still partially unspecified for citation evaluator — the PR's stated goal is to uniquify guidelines per evaluator, but CITATION_PROMPT's scoring guidelines still say "legally correct and complete" (same as the legal correctness evaluator). The <Reminders> block correctly says "evaluate citation accuracy" but the scoring criteria don't reflect that. See inline comment.

pandas as a new dev dependency: pandas>=3.0.1 is heavy for computing column averages on a results dataframe. Since results.to_pandas() is the LangSmith-provided API, pandas may already be a transitive dependency. If so, declaring it explicitly is fine. If it's genuinely new, a plain loop over the results or numpy (if already present) would be lighter. See inline comment.

Empty results edge case: df[score_cols].mean() * 100 silently produces NaN if the dataset is empty. Worth an if df.empty guard. See inline comment.

Minor: "=== Evaluation Results ===" is now a redundant outer header — the aggregate section has its own header, and the only other output is the experiment name line. Consider dropping it or restructuring slightly. See inline comment.

No security issues. No test concerns (dev tooling, as noted in PR). The client=ls_client and RAG Data Store metadata additions are solid improvements.

{SCORING_GUIDELINES}
- Score 1.0 if response is legally correct and complete
- Score 0.5 if response has minor inaccuracies or omissions
- Score 0.0 if response is legally incorrect or significantly incomplete

The scoring guidelines in CITATION_PROMPT still say "legally correct and complete" — that's the same language used for LEGAL_CORRECTNESS_PROMPT_TEMPLATE. Since this PR's stated goal is to uniquify guidelines per evaluator, the citation evaluator should probably use citation-specific language, something like:

Suggested change — replace:
  - Score 0.0 if response is legally incorrect or significantly incomplete
with:
  - Score 1.0 if all citations use proper HTML anchor tags, valid ORS numbers, and link to approved domains
  - Score 0.5 if citations are present but have formatting or completeness issues
  - Score 0.0 if citations are missing, malformed, or link to unapproved sources

print("\n=== Evaluation Results ===")

# Print aggregate summary.
print("\n=== Aggregate Summary ===")

Minor ordering nit: "Evaluation Results" is printed as a section header but it only contains the aggregate summary and the experiment name. Consider either moving this header below the aggregate block, or dropping the redundant outer header now that the aggregate has its own:

Suggested change — replace:
  print("\n=== Evaluation Results ===")

  # Print aggregate summary.
  print("\n=== Aggregate Summary ===")
with:
  # Print aggregate summary.
  print("\n=== Aggregate Summary ===")


# Print aggregate summary.
print("\n=== Aggregate Summary ===")
df = results.to_pandas()

If results contains zero examples (e.g. the dataset is empty or the run fails silently), df[score_cols].mean() will return NaN values, and * 100 still produces NaN. This won't crash but the output will say something like legal correctness NaN%. You could add a check:

if df.empty:
    print("No results to summarize.")
elif score_cols:
    ...
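The NaN behavior this comment describes is easy to reproduce. A tiny demonstration follows; the column name is a made-up example, not taken from the script:

```python
import math
import pandas as pd

# An empty float column: mean() over zero rows is NaN, and NaN * 100
# is still NaN, so the summary would render something like "NaN%".
df = pd.DataFrame({"feedback.legal_correctness": pd.Series(dtype=float)})
pct = df["feedback.legal_correctness"].mean() * 100
print(math.isnan(pct))  # True
```

This is why guarding on df.empty before computing the means, as suggested above, avoids the confusing output.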

"types-flask-cors>=6.0.0.20250809",
"google-cloud-discoveryengine>=0.15.0",
"langchain-google-vertexai>=3.2.2",
"pandas>=3.0.1",

pandas is a fairly heavyweight dependency just to compute column means and reformat an index. The results.to_pandas() API is provided by LangSmith, so it may already be a transitive dependency — worth checking with uv tree | grep pandas. If it's already pulled in, this is fine to declare explicitly. If it's genuinely new, consider whether numpy (likely already present) or a plain dict comprehension over results would suffice.
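The plain dict-comprehension alternative mentioned here might look roughly like the sketch below. The row shape (one dict of evaluator name to score per example) is an assumption about what iterating the results would yield, not LangSmith's actual API:

```python
from collections import defaultdict

def aggregate_scores(rows: list[dict[str, float]]) -> dict[str, float]:
    """Return the mean score per evaluator key, as a percentage."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for row in rows:
        for key, score in row.items():
            totals[key] += score
            counts[key] += 1
    return {k: 100.0 * totals[k] / counts[k] for k in totals}

# Hypothetical per-example scores matching the PR body's percentages.
rows = [
    {"legal correctness": 1.0, "appropriate tone": 1.0},
    {"legal correctness": 0.5, "appropriate tone": 1.0},
    {"legal correctness": 0.5, "appropriate tone": 1.0},
    {"legal correctness": 0.5, "appropriate tone": 1.0},
]
print(aggregate_scores(rows))
# {'legal correctness': 62.5, 'appropriate tone': 100.0}
```

This stays within the standard library, so no new dependency would be needed.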

@yangm2 yangm2 marked this pull request as ready for review March 5, 2026 00:08

@TruMichael-jpg TruMichael-jpg left a comment


Agreed, these criteria better suit the citation checker. I will try to run some experiments with it this week.

@yangm2 yangm2 merged commit 56cb81b into codeforpdx:main Mar 6, 2026
6 checks passed
@yangm2 yangm2 deleted the more-run-evaluator-fixes branch March 6, 2026 21:00