Validator: Answer relevance custom LLM judge #109
Conversation
📝 Walkthrough

This PR introduces the Answer Relevance Custom LLM Validator with multi-tenant prompt management.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🧹 Nitpick comments (3)
backend/app/alembic/versions/008_add_answer_relevance_prompt.py (1)
35-49: ⚡ Quick win: Add a composite index for the tenant-scoped list pattern.

Lines 35-49 only add single-column indexes. For queries filtered by `organization_id` + `project_id` and ordered by `created_at, id`, a composite index will scale better.

Suggested migration change
```diff
 op.create_index(
     "idx_answer_relevance_prompt_is_active",
     "answer_relevance_prompt",
     ["is_active"],
 )
+op.create_index(
+    "idx_answer_relevance_prompt_tenant_created_id",
+    "answer_relevance_prompt",
+    ["organization_id", "project_id", "created_at", "id"],
+)
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/alembic/versions/008_add_answer_relevance_prompt.py` around lines 35 - 49: The migration currently only creates single-column indexes via op.create_index for "idx_answer_relevance_prompt_org", "idx_answer_relevance_prompt_project", and "idx_answer_relevance_prompt_is_active" on the answer_relevance_prompt table; add a composite index for the tenant-scoped list pattern to support queries filtered by organization_id + project_id and ordered by created_at, id by creating a new composite index (e.g. name it "idx_answer_relevance_prompt_org_project_created_at_id") on columns ["organization_id","project_id","created_at","id"]; also ensure the corresponding downgrade drops that composite index (and keep or remove the single-column org/project indexes as desired) so the migration is reversible.

backend/app/api/routes/guardrails.py (1)
133-142: ⚡ Quick win: Avoid the DB lookup when `prompt_template` is already provided inline.

Currently, `custom_prompt_id` triggers a fetch unconditionally. Guarding on a missing `prompt_template` would reduce unnecessary I/O and avoid overriding explicit runtime templates.

Proposed patch
```diff
 elif isinstance(validator, AnswerRelevanceCustomLLMSafetyValidatorConfig):
-    if validator.custom_prompt_id is not None:
+    if (
+        validator.custom_prompt_id is not None
+        and not validator.prompt_template
+    ):
         prompt_config = answer_relevance_prompt_crud.get(
             session=session,
             id=validator.custom_prompt_id,
             organization_id=payload.organization_id,
             project_id=payload.project_id,
         )
         validator.prompt_template = prompt_config.prompt_template
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/api/routes/guardrails.py` around lines 133 - 142: The code unconditionally looks up a DB prompt when validator.custom_prompt_id is set and then overwrites validator.prompt_template; change the logic in the AnswerRelevanceCustomLLMSafetyValidatorConfig branch to only call answer_relevance_prompt_crud.get (using session, payload.organization_id, payload.project_id) when validator.custom_prompt_id is present AND validator.prompt_template is missing/empty, so any inline-provided validator.prompt_template is preserved and unnecessary I/O is avoided; after the conditional fetch assign validator.prompt_template only from the retrieved prompt_config.

backend/app/tests/validators/test_answer_relevance_custom_llm.py (1)
113-155: 💤 Low value. Optional: add non-dict JSON / non-string-field cases.

Consider adding tests for inputs like `validator._validate("123")`, `validator._validate("null")`, and `{"query": 1, "answer": "x"}` so the parsing edge cases (raised on the validator file) stay covered going forward.
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app/tests/validators/test_answer_relevance_custom_llm.py` around lines 113 - 155, Add tests to cover non-dict JSON and non-string field cases so parsing edge cases remain covered: extend backend/app/tests/validators/test_answer_relevance_custom_llm.py with new test functions that call validator._validate on JSON primitives (e.g., "123", "null") and on a JSON object with non-string field types (e.g., {"query": 1, "answer": "x"}), and assert they return FailResult (using isinstance(result, FailResult)) and include appropriate error messages where relevant; reference validator._validate and existing test patterns (e.g., test_fails_with_non_json_input, test_fails_with_missing_query_key) to mirror structure and assertions.
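The suggested edge cases can be sketched as plain asserts against a stand-in validator. `validate`, `FailResult`, and `PassResult` below are illustrative placeholders for the project's actual guardrails classes, assuming the defensive parsing proposed elsewhere in this review:

```python
import json

# Placeholder result types, standing in for the real guardrails classes.
class FailResult:
    def __init__(self, error_message):
        self.error_message = error_message

class PassResult:
    pass

def validate(value):
    # Stand-in for the validator's _validate with the suggested guards.
    try:
        data = json.loads(value)
    except (json.JSONDecodeError, TypeError):
        return FailResult("Input must be a JSON string with 'query' and 'answer' fields.")
    if not isinstance(data, dict):
        return FailResult("Input must be a JSON object with 'query' and 'answer' fields.")
    query = data.get("query", "")
    answer = data.get("answer", "")
    if not isinstance(query, str) or not isinstance(answer, str):
        return FailResult("'query' and 'answer' must be strings.")
    if not query.strip() or not answer.strip():
        return FailResult("Both 'query' and 'answer' fields must be non-empty.")
    return PassResult()

# The edge cases the reviewer wants covered:
assert isinstance(validate("123"), FailResult)                          # non-dict JSON
assert isinstance(validate("null"), FailResult)                         # non-dict JSON
assert isinstance(validate('{"query": 1, "answer": "x"}'), FailResult)  # non-string field
assert isinstance(validate('{"query": "q", "answer": "a"}'), PassResult)
print("all edge cases fail closed")
```

Mirroring the existing test patterns (`test_fails_with_non_json_input`, `test_fails_with_missing_query_key`), these would become individual test functions in the real suite.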
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/app/api/docs/answer_relevance_prompts/create_prompt.md`:
- Around line 19-25: The two fenced code blocks in create_prompt.md (the blocks
beginning with the lines "Query: {query} ... Answer only YES or NO." and the one
starting "You are evaluating a maternal health assistant.") need explicit
language identifiers to satisfy markdownlint MD040; update both opening fences
from ``` to ```text so each block reads ```text and leave the block contents
unchanged.
In `@backend/app/api/docs/guardrails/run_guardrails.md`:
- Line 11: Update the docs for the answer_relevance_custom_llm operation to
explicitly state the precedence and mutual-exclusivity behavior when both
custom_prompt_id and prompt_template are provided: specify whether they are
mutually exclusive (reject requests containing both) or define a deterministic
precedence rule (e.g., "custom_prompt_id takes precedence over prompt_template
if both are set"), and show a short example of the accepted input JSON
{"query":"...", "answer":"..."} with the chosen behavior. Ensure the text
mentions the parameter names custom_prompt_id and prompt_template and that
OPENAI_API_KEY is required.
In `@backend/app/core/validators/answer_relevance_custom_llm.py`:
- Around line 44-57: In _validate (in answer_relevance_custom_llm.py) guard
against non-dict JSON and non-string fields by first verifying the result of
json.loads(value) is a dict and returning FailResult if not, then extract query
and answer and ensure both are instances of str before calling .strip(); if
either is missing or not a string (or empty after strip) return FailResult with
the existing error messages. This prevents AttributeError from .get/.strip on
non-dict or non-str values while preserving the current
ValidationResult/FailResult flow.
In `@backend/app/core/validators/README.md`:
- Around line 519-525: The fenced code block containing the prompt that starts
with "Query: {query}" and ends with "Answer only YES or NO." should be annotated
with a language to satisfy markdownlint MD040; update the opening fence from ```
to ```text for that block (the block that contains the lines "Query: {query}"
and "Answer: {answer}") so the README.md stays lint-clean and consistent with
other fenced blocks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 04d6d867-db71-48df-ab0c-a3c4b234f1bc
📒 Files selected for processing (24)
- backend/app/alembic/versions/008_add_answer_relevance_prompt.py
- backend/app/api/API_USAGE.md
- backend/app/api/docs/answer_relevance_prompts/create_prompt.md
- backend/app/api/docs/answer_relevance_prompts/delete_prompt.md
- backend/app/api/docs/answer_relevance_prompts/get_prompt.md
- backend/app/api/docs/answer_relevance_prompts/list_prompts.md
- backend/app/api/docs/answer_relevance_prompts/update_prompt.md
- backend/app/api/docs/guardrails/run_guardrails.md
- backend/app/api/main.py
- backend/app/api/routes/answer_relevance_prompts.py
- backend/app/api/routes/guardrails.py
- backend/app/core/enum.py
- backend/app/core/validators/README.md
- backend/app/core/validators/answer_relevance_custom_llm.py
- backend/app/core/validators/config/answer_relevance_custom_llm_safety_validator_config.py
- backend/app/crud/answer_relevance_prompt.py
- backend/app/models/config/answer_relevance_prompt.py
- backend/app/schemas/answer_relevance_prompt.py
- backend/app/schemas/guardrail_config.py
- backend/app/tests/test_answer_relevance_prompts_api.py
- backend/app/tests/test_answer_relevance_prompts_api_integration.py
- backend/app/tests/test_llm_validators.py
- backend/app/tests/test_validate_with_guard.py
- backend/app/tests/validators/test_answer_relevance_custom_llm.py
Reviewed snippet (create_prompt.md):

````markdown
```
Query: {query}
Answer: {answer}

Does the answer fully satisfy the query and constraints?
Answer only YES or NO.
```
````
Add language identifiers to fenced code blocks to satisfy markdownlint.
Both fenced blocks should declare a language (e.g., text) to clear MD040 warnings.
Proposed patch
````diff
-```
+```text
 Query: {query}
 Answer: {answer}

 Does the answer fully satisfy the query and constraints?
 Answer only YES or NO.
@@
-```
+```text
 You are evaluating a maternal health assistant.

 Query: {query}
 Answer: {answer}

 Does the answer directly address the maternal health query with accurate information?
 Answer only YES or NO.
````
Also applies to: 30-37
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 19-19: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
Reviewed snippet (run_guardrails.md):

````markdown
- For `ban_list`, `ban_list_id` can be resolved to `banned_words` from tenant ban list configs.
- For `topic_relevance`, `topic_relevance_config_id` is required and is resolved to `configuration` + `prompt_schema_version` from tenant topic relevance configs. Requires `OPENAI_API_KEY` to be configured; returns a validation failure with an explicit error if missing.
- For `llm_critic`, `OPENAI_API_KEY` must be configured; returns `success=false` with an explicit error if missing.
- For `answer_relevance_custom_llm`, `input` must be a JSON string `{"query": "...", "answer": "..."}`. Pass `custom_prompt_id` to use a tenant-stored prompt template, or `prompt_template` inline. Requires `OPENAI_API_KEY`.
````
Clarify precedence/mutual exclusivity for custom_prompt_id vs prompt_template.
Line 11 says “or”, but doesn’t define behavior if clients send both. Please document whether they are mutually exclusive or which one wins.
Suggested doc tweak
```diff
-- For `answer_relevance_custom_llm`, `input` must be a JSON string `{"query": "...", "answer": "..."}`. Pass `custom_prompt_id` to use a tenant-stored prompt template, or `prompt_template` inline. Requires `OPENAI_API_KEY`.
+- For `answer_relevance_custom_llm`, `input` must be a JSON string `{"query": "...", "answer": "..."}`. Use `custom_prompt_id` for a tenant-stored prompt template or `prompt_template` inline, and document the behavior when both are provided (mutually exclusive vs precedence). Requires `OPENAI_API_KEY`.
```
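If the docs settle on the behavior the guardrails.py nitpick proposes (an inline template wins, the stored prompt is fetched only as a fallback), the resolution rule can be sketched as follows. `resolve_prompt_template` and the in-memory store are hypothetical stand-ins for the real CRUD lookup, not the project's actual API:

```python
# Hypothetical sketch of the "inline prompt_template wins" rule; the dict
# stands in for answer_relevance_prompt_crud.get(...).
STORED_PROMPTS = {
    "prompt-123": "Query: {query}\nAnswer: {answer}\nAnswer only YES or NO.",
}

def resolve_prompt_template(custom_prompt_id, prompt_template):
    # Fetch the stored prompt only when no inline template was supplied,
    # mirroring the guard proposed for guardrails.py.
    if custom_prompt_id is not None and not prompt_template:
        return STORED_PROMPTS[custom_prompt_id]
    return prompt_template

inline_wins = resolve_prompt_template("prompt-123", "Inline: {query} / {answer}")
stored_used = resolve_prompt_template("prompt-123", None)
print(inline_wins)  # both provided: the inline template is kept
print(stored_used)  # only an id provided: the stored template is used
```

Whichever rule is chosen, stating it explicitly in run_guardrails.md removes the ambiguity the review flags.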
```python
def _validate(self, value: str, metadata: dict = None) -> ValidationResult:
    try:
        data = json.loads(value)
        query = data.get("query", "")
        answer = data.get("answer", "")
    except (json.JSONDecodeError, TypeError):
        return FailResult(
            error_message="Input must be a JSON string with 'query' and 'answer' fields."
        )

    if not query.strip() or not answer.strip():
        return FailResult(
            error_message="Both 'query' and 'answer' fields must be non-empty."
        )
```
Guard against non-dict JSON and non-string field values.
json.loads(value) may return a non-dict (e.g., 123, null, [...], "str"), and even on a dict, query/answer may be non-string (null, numbers). In both cases the subsequent .get(...) / .strip() calls raise an unhandled AttributeError, bypassing the intended FailResult and propagating as an exception.
🛡️ Suggested defensive parsing
```diff
-    try:
-        data = json.loads(value)
-        query = data.get("query", "")
-        answer = data.get("answer", "")
-    except (json.JSONDecodeError, TypeError):
-        return FailResult(
-            error_message="Input must be a JSON string with 'query' and 'answer' fields."
-        )
-
-    if not query.strip() or not answer.strip():
+    try:
+        data = json.loads(value)
+    except (json.JSONDecodeError, TypeError):
+        return FailResult(
+            error_message="Input must be a JSON string with 'query' and 'answer' fields."
+        )
+
+    if not isinstance(data, dict):
+        return FailResult(
+            error_message="Input must be a JSON object with 'query' and 'answer' fields."
+        )
+
+    query = data.get("query", "")
+    answer = data.get("answer", "")
+    if not isinstance(query, str) or not isinstance(answer, str):
+        return FailResult(
+            error_message="'query' and 'answer' must be strings."
+        )
+
+    if not query.strip() or not answer.strip():
         return FailResult(
             error_message="Both 'query' and 'answer' fields must be non-empty."
         )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _validate(self, value: str, metadata: dict = None) -> ValidationResult:
    try:
        data = json.loads(value)
    except (json.JSONDecodeError, TypeError):
        return FailResult(
            error_message="Input must be a JSON string with 'query' and 'answer' fields."
        )

    if not isinstance(data, dict):
        return FailResult(
            error_message="Input must be a JSON object with 'query' and 'answer' fields."
        )

    query = data.get("query", "")
    answer = data.get("answer", "")
    if not isinstance(query, str) or not isinstance(answer, str):
        return FailResult(
            error_message="'query' and 'answer' must be strings."
        )

    if not query.strip() or not answer.strip():
        return FailResult(
            error_message="Both 'query' and 'answer' fields must be non-empty."
        )
```
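To see why the guards matter, here is a short repro of the failure mode, assuming the original parsing shown in the reviewed code above:

```python
import json

# json.loads happily returns non-dict values, and dict fields may be
# non-strings, so the original .get()/.strip() calls raise AttributeError
# instead of returning a FailResult.
def naive_parse(value):
    data = json.loads(value)        # may be an int, None, a list, ...
    query = data.get("query", "")   # AttributeError if data is not a dict
    return query.strip()            # AttributeError if query is not a str

outcomes = []
for bad in ['123', 'null', '{"query": 1, "answer": "x"}']:
    try:
        naive_parse(bad)
        outcomes.append("ok")
    except AttributeError:
        outcomes.append("AttributeError")
print(outcomes)  # all three inputs escape the intended FailResult path
```

All three inputs are valid JSON, so the `json.JSONDecodeError` handler never fires and the exception propagates out of the validator.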
Reviewed snippet (README.md):

````markdown
```
Query: {query}
Answer: {answer}

Does the answer fully satisfy the query and constraints?
Answer only YES or NO.
```
````
Add a language to the fenced default prompt block (markdownlint MD040).
This keeps docs lint-clean and consistent with the rest of the file.
Proposed patch
````diff
-```
+```text
 Query: {query}
 Answer: {answer}

 Does the answer fully satisfy the query and constraints?
 Answer only YES or NO.
 ```
````
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 519-519: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
Summary
Target issue is #PLEASE_TYPE_ISSUE_NUMBER
Explain the motivation for making this change. What existing problem does the pull request solve?
- `answer_relevance_custom_llm` evaluates whether an LLM's answer is relevant to a user query, using an LLM as judge (YES/NO).
- Prompt management (`/guardrails/answer_relevance_prompts`): full CRUD endpoints (multi-tenant, X-API-KEY auth) for NGOs to store, version, and manage domain-specific evaluation prompts. Prompts are validated at write time to enforce both `{query}` and `{answer}` placeholders. Reference a stored prompt at runtime via `custom_prompt_id` in the validator config.

Added files:
- app/models/config/answer_relevance_prompt.py — SQLModel table `answer_relevance_prompt` scoped to organization_id + project_id
- app/crud/answer_relevance_prompt.py — standard CRUD
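The write-time placeholder check described above can be sketched like this; `validate_placeholders` is an illustrative name, not the project's actual helper:

```python
# Minimal sketch: a stored prompt template must contain both {query} and
# {answer} placeholders before it is accepted at write time.
REQUIRED_PLACEHOLDERS = ("{query}", "{answer}")

def validate_placeholders(prompt_template: str) -> list:
    """Return the placeholders missing from the template (empty list = valid)."""
    return [p for p in REQUIRED_PLACEHOLDERS if p not in prompt_template]

ok = validate_placeholders("Query: {query}\nAnswer: {answer}\nAnswer only YES or NO.")
bad = validate_placeholders("Answer: {answer} only")
print(ok, bad)  # a missing-placeholder list would map to a 422 in the API
```

In the real endpoint the non-empty case would be rejected with a validation error rather than stored.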
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.
Please add here if any other information is required for the reviewer.