
More run evaluator fixes#297

Merged
yangm2 merged 4 commits into codeforpdx:main from yangm2:more-run-evaluator-fixes
Mar 6, 2026

Conversation


@yangm2 yangm2 commented Mar 4, 2026

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Infrastructure
  • Maintenance

Description

  • Suppress noisy warnings when running run_langsmith_evaluation.py
  • Add Aggregate Summary to run_langsmith_evaluation.py ...
    === Evaluation Results ===
    
    === Aggregate Summary ===
    legal correctness    62.5%
    appropriate tone    100.0%
    
  • Uniquify and adjust scoring guidelines per evaluator
  • Also capture VERTEX_AI_DATASTORE external dependency in experiment metadata
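The Aggregate Summary described above could be computed along these lines. This is a hedged sketch, not the script's actual code: the `feedback.`-prefixed column names, the `print_aggregate_summary` helper, and the sample scores are all assumptions chosen to reproduce the percentages quoted in the output sample.

```python
import pandas as pd

def print_aggregate_summary(df: pd.DataFrame) -> None:
    """Print the mean of each feedback score column as a percentage."""
    # Assumed convention: evaluator scores live in "feedback.*" columns.
    score_cols = [c for c in df.columns if c.startswith("feedback.")]
    print("\n=== Aggregate Summary ===")
    for col, pct in (df[score_cols].mean() * 100).items():
        name = col.removeprefix("feedback.").replace("_", " ")
        print(f"{name:<20}{pct:>6.1f}%")

# Sample scores picked so the output matches the PR body's example.
df = pd.DataFrame({
    "feedback.legal_correctness": [1.0, 0.5, 0.5, 0.5],
    "feedback.appropriate_tone": [1.0, 1.0, 1.0, 1.0],
})
print_aggregate_summary(df)
```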

Related Tickets & Documents

  • Related Issue #
  • Closes #

QA Instructions, Screenshots, Recordings

Please replace this line with instructions on how to test your changes, a note on the devices and browsers this has been tested on, as well as any relevant images for UI changes.

Added/updated tests?

  • Yes
  • No, and this is why: development tools, not production
  • I need help with writing tests

Documentation

  • If this PR changes the system architecture, Architecture.md has been updated

[optional] Are there any post deployment tasks we need to perform?

@yangm2 yangm2 self-assigned this Mar 4, 2026
@yangm2 yangm2 added labels on Mar 4, 2026: enhancement (New feature or request), backend (Bot implementation and other backend concerns), dependencies (Pull requests that update a dependency file), python:uv (Pull requests that update python:uv code)
@yangm2 yangm2 force-pushed the more-run-evaluator-fixes branch from e821c05 to 02f71fa Compare March 4, 2026 23:55

github-actions bot commented Mar 4, 2026

Claude finished @yangm2's task in 1m 42s


Code Review

  • Gather context and understand the PR
  • Read changed files
  • Review code quality, security, performance, testing, documentation
  • Post final review

Overall this is a clean, focused improvement to a dev-only script. The warning suppression is well-documented (upstream issue link), the continuous=True additions improve score fidelity, and the aggregate summary is a helpful usability win. A few things worth addressing:

Scoring guidelines still partially unspecified for citation evaluator — the PR's stated goal is to uniquify guidelines per evaluator, but CITATION_PROMPT's scoring guidelines still say "legally correct and complete" (same as the legal correctness evaluator). The <Reminders> block correctly says "evaluate citation accuracy" but the scoring criteria don't reflect that. See inline comment.

pandas as a new dev dependency: pandas>=3.0.1 is heavy for computing column averages on a results dataframe. Since results.to_pandas() is the LangSmith-provided API, pandas may already be a transitive dependency. If so, declaring it explicitly is fine. If it's genuinely new, a plain loop over the results or numpy (if already present) would be lighter. See inline comment.

Empty results edge case: df[score_cols].mean() * 100 silently produces NaN if the dataset is empty. Worth an if df.empty guard. See inline comment.

Minor: "=== Evaluation Results ===" is now a redundant outer header — the aggregate section has its own header, and the only other output is the experiment name line. Consider dropping it or restructuring slightly. See inline comment.

No security issues. No test concerns (dev tooling, as noted in PR). The client=ls_client and RAG Data Store metadata additions are solid improvements.

{SCORING_GUIDELINES}
- Score 1.0 if response is legally correct and complete
- Score 0.5 if response has minor inaccuracies or omissions
- Score 0.0 if response is legally incorrect or significantly incomplete

The scoring guidelines in CITATION_PROMPT still say "legally correct and complete" — that's the same language used for LEGAL_CORRECTNESS_PROMPT_TEMPLATE. Since this PR's stated goal is to uniquify guidelines per evaluator, the citation evaluator should probably use citation-specific language, something like:

Suggested change — replace:
  - Score 0.0 if response is legally incorrect or significantly incomplete
with:
  - Score 1.0 if all citations use proper HTML anchor tags, valid ORS numbers, and link to approved domains
  - Score 0.5 if citations are present but have formatting or completeness issues
  - Score 0.0 if citations are missing, malformed, or link to unapproved sources

print("\n=== Evaluation Results ===")

# Print aggregate summary.
print("\n=== Aggregate Summary ===")

Minor ordering nit: "Evaluation Results" is printed as a section header but it only contains the aggregate summary and the experiment name. Consider either moving this header below the aggregate block, or dropping the redundant outer header now that the aggregate has its own:

Suggested change — replace:
  print("\n=== Evaluation Results ===")

  # Print aggregate summary.
  print("\n=== Aggregate Summary ===")
with:
  # Print aggregate summary.
  print("\n=== Aggregate Summary ===")


# Print aggregate summary.
print("\n=== Aggregate Summary ===")
df = results.to_pandas()

If results contains zero examples (e.g. the dataset is empty or the run fails silently), df[score_cols].mean() will return NaN values, and * 100 still produces NaN. This won't crash but the output will say something like legal correctness NaN%. You could add a check:

if df.empty:
    print("No results to summarize.")
elif score_cols:
    ...
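The NaN behavior this comment describes is easy to reproduce. A tiny demonstration follows; the column name is a made-up example, not taken from the script:

```python
import math
import pandas as pd

# An empty float column: mean() over zero rows is NaN, and NaN * 100
# is still NaN, so the summary would render something like "NaN%".
df = pd.DataFrame({"feedback.legal_correctness": pd.Series(dtype=float)})
pct = df["feedback.legal_correctness"].mean() * 100
print(math.isnan(pct))  # True
```

This is why guarding on df.empty before computing the means, as suggested above, avoids the confusing output.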

"types-flask-cors>=6.0.0.20250809",
"google-cloud-discoveryengine>=0.15.0",
"langchain-google-vertexai>=3.2.2",
"pandas>=3.0.1",

pandas is a fairly heavyweight dependency just to compute column means and reformat an index. The results.to_pandas() API is provided by LangSmith, so it may already be a transitive dependency — worth checking with uv tree | grep pandas. If it's already pulled in, this is fine to declare explicitly. If it's genuinely new, consider whether numpy (likely already present) or a plain dict comprehension over results would suffice.
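The plain dict-comprehension alternative mentioned here might look roughly like the sketch below. The row shape (one dict of evaluator name to score per example) is an assumption about what iterating the results would yield, not LangSmith's actual API:

```python
from collections import defaultdict

def aggregate_scores(rows: list[dict[str, float]]) -> dict[str, float]:
    """Return the mean score per evaluator key, as a percentage."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for row in rows:
        for key, score in row.items():
            totals[key] += score
            counts[key] += 1
    return {k: 100.0 * totals[k] / counts[k] for k in totals}

# Hypothetical per-example scores matching the PR body's percentages.
rows = [
    {"legal correctness": 1.0, "appropriate tone": 1.0},
    {"legal correctness": 0.5, "appropriate tone": 1.0},
    {"legal correctness": 0.5, "appropriate tone": 1.0},
    {"legal correctness": 0.5, "appropriate tone": 1.0},
]
print(aggregate_scores(rows))
# {'legal correctness': 62.5, 'appropriate tone': 100.0}
```

This stays within the standard library, so no new dependency would be needed.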

@yangm2 yangm2 marked this pull request as ready for review March 5, 2026 00:08

@TruMichael-jpg TruMichael-jpg left a comment


Agreed, these criteria better suit the citation checker. I will try to run some experiments with it this week.

@yangm2 yangm2 merged commit 56cb81b into codeforpdx:main Mar 6, 2026
6 checks passed
@yangm2 yangm2 deleted the more-run-evaluator-fixes branch March 6, 2026 21:00