LENS is a role-aware multi-agent grading pipeline for clinical summaries. The same summary is scored in parallel by three role-specific agents:
- Physician
- Triage Nurse
- Bedside Nurse
Each role scores the summary across 8 rubric dimensions on a 1-5 scale. The system then computes a role-level weighted overall score, a cross-role overall score, and an Orchestrator Disagreement view that shows how far the three role scores differ on each dimension.
- Parallel scoring by three role-specific agents
- Shared 8-dimension LENS rubric
- Two scoring modes:
  - `llm`: OpenAI model-based scoring
  - `heuristic`: local baseline scoring without API calls
- Per-role weighted overall scoring based on questionnaire-derived role priors
- Orchestrator validation, disagreement mapping, and score aggregation
- Human-readable and JSON outputs
- Input a clinical summary.
- Load rubric definitions and role configurations.
- Run the three role agents in parallel.
- Validate each role scorecard.
- Build an `Orchestrator Disagreement` map for all 8 dimensions.
- Aggregate the role outputs into:
  - per-role scores
  - per-role overall scores
  - final overall score across roles
- `config/lens_rubric.json`: defines the 8 rubric dimensions and evaluation focus.
- `config/roles.json`: defines the three role agents, persona metadata, and `w_prior` weights.
- `grading_pipeline/cli.py`: command-line entrypoint and human-readable output formatting.
- `grading_pipeline/orchestrator.py`: runs the multi-agent pipeline, validation, disagreement mapping, and aggregation.
- `grading_pipeline/llm_scoring.py`: LLM-based scoring logic.
- `grading_pipeline/scoring.py`: heuristic baseline scoring and score utilities.
- `grading_pipeline/openai_client.py`: minimal OpenAI Responses API client.
- `tests/`: input-validation and orchestrator tests.
- Python 3.10+
- OpenAI API key for `llm` mode
This project uses the Python standard library only. There is no dependency installation step at the moment.
If you want to run the LLM pipeline, you must use your own OpenAI API key.
Create a file named `.env` in the project root.
Project root:
- same folder as `README.md`
- same folder as `config/`
- same folder as `grading_pipeline/`
Expected file location:

```
LENS Project/.env
```

Add the following line to `.env`:

```
OPENAI_API_KEY=your_openai_api_key_here
```

Optional override:

```
OPENAI_BASE_URL=https://api.openai.com/v1/responses
```

You can use `.env.example` as the template:

```
cp .env.example .env
```

Important notes:
- `.env` is already ignored by git and should not be committed.
- If you run with `--engine heuristic`, no API key is required.
- The code reads `OPENAI_API_KEY` from `.env` first, then falls back to your shell environment.
Run with the default LLM mode:

```
python -m grading_pipeline --summary "Your summary here"
```

Run with the heuristic baseline:

```
python -m grading_pipeline --engine heuristic --summary "Your summary here"
```

Use a summary file:

```
python -m grading_pipeline --summary-file path/to/summary.txt
```

Output JSON instead of the human-readable report:

```
python -m grading_pipeline --summary "Your summary here" --format json --pretty
```

Select a specific model:

```
python -m grading_pipeline --model gpt-4o-mini --summary "Your summary here"
```

Adjust the disagreement threshold:

```
python -m grading_pipeline --gap-threshold 0.5 --summary "Your summary here"
```

The CLI validates summary input before the scoring pipeline runs.
The summary must:
- be provided through `--summary` or `--summary-file`
- not be empty
- not be whitespace only
- be at least `30` characters after trimming whitespace
If the summary is invalid, the CLI exits with a non-zero code and no scoring call is made.
The human-readable output includes:
- role-by-role scores for all 8 dimensions
- a weighted `Overall` score for each role
- an `Orchestrator Disagreement` view showing score gaps per dimension
- a final `Overall Score` across all three roles
Example output shape:
```
----------------------------------------
Role-Aware Multi-Agent Grading Pipeline:
----------------------------------------
Physician:
Factual Accuracy: 5.0
Relevant Chronic Problem Coverage: 4.0
...
Overall: 4.12
----------------------------------------
Triage Nurse:
...
----------------------------------------
Bedside Nurse:
...
----------------------------------------
----------------------------------------
Orchestrator Disagreement:
----------------------------------------
Factual Accuracy: 1.0
Relevant Chronic Problem Coverage: 0.0
...
----------------------------------------
Overall Score: 4.0
```
Each role has its own prior weights in `config/roles.json`.

Role-level overall score:

```
Role Overall = weighted average of the 8 dimension scores
```

Cross-role overall score:

```
Overall Score = average of the 3 role overall scores
```

Disagreement per dimension:

```
Gap = highest agent score - lowest agent score
```
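Taken together, the three formulas can be sketched as below. The weights and scores in the test values are illustrative, not the actual priors from `config/roles.json`:

```python
def role_overall(scores: dict[str, float], weights: dict[str, float]) -> float:
    # Weighted average of one role's dimension scores, normalized by
    # the total weight so priors need not sum to 1.
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight

def cross_role_overall(role_overalls: list[float]) -> float:
    # Plain (unweighted) average of the per-role overall scores.
    return sum(role_overalls) / len(role_overalls)

def disagreement_gap(dimension_scores: list[float]) -> float:
    # Gap = highest agent score - lowest agent score for one dimension.
    return max(dimension_scores) - min(dimension_scores)
```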
Run the test suite:
```
pytest -q
```

Current tests cover:
- CLI summary input validation
- disagreement-map correctness
- validation and repair behavior
- conditional adjudication behavior
- weighted aggregation behavior
The current implementation includes:
- parallel three-role scoring
- role-aware weighting
- strict input validation
- orchestrator disagreement reporting
- weighted final score aggregation
- human-readable report formatting for demo and presentation use
