@@ -69,6 +69,10 @@ D. Assign a Score:
- **TRUE**: The agent delivered a complete and correct solution that accomplishes the user's entire goal. The user does not need to take further action or ask follow-up questions to get what they originally asked for.
- **FALSE**: The agent failed to complete one or more parts of the task, provided an incorrect/incomplete result, or left the user's goal unresolved.

**Note on subjective/open-ended queries:** When the user asks a subjective, opinion-based, or comparison question (e.g., "Which is better, X or Y?", "What do you think about…?"), there is no single correct answer. The task is considered **complete** (TRUE) if the agent provides a thoughtful, relevant response that addresses the question with reasonable perspectives or trade-offs — even if it does not give a single definitive recommendation.

Copilot AI Mar 8, 2026

The PR description is still the default template text (asking to "Please add an informative description...") and doesn't describe the actual change being made to this evaluator prompt or link any issues. Please update the PR description so reviewers/releasers can understand intent and impact.

Copilot uses AI. Check for mistakes.

**Note on direct/factual queries:** When the user asks a straightforward factual, yes/no, or verification question (e.g., "What is the capital of France?"), a correct and direct answer fully completes the task. No additional elaboration, context, or "actionable information" beyond the accurate answer is required.
Comment on lines +72 to +74

Copilot AI Mar 8, 2026

The prompt instructs the model to output a JSON object (and this evaluator runs with response_format: {type: json_object}), but the guidance here reinforces using TRUE/FALSE tokens. If the model follows that literally (e.g., unquoted TRUE), it is invalid JSON and will fail parsing. Consider updating the prompt language/examples to use JSON booleans (true/false) or explicitly quote the value (e.g., "true"/"false") consistently throughout.
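
To illustrate the parsing concern, here is a minimal Python sketch using the standard-library `json` module. The sample strings and the `parse_success` helper are hypothetical and only mimic the shape of the prompt's EXPECTED OUTPUT blocks; the evaluator's actual parsing code is not shown in this PR and may differ.

```python
import json

# Hypothetical evaluator outputs, modeled on the prompt's examples.
unquoted_upper = '{"explanation": "ok", "success": TRUE}'
json_boolean = '{"explanation": "ok", "success": true}'
quoted_string = '{"explanation": "ok", "success": "TRUE"}'


def parse_success(raw: str):
    """Return the parsed "success" value, or None if raw is not valid JSON."""
    try:
        return json.loads(raw)["success"]
    except json.JSONDecodeError:
        # Unquoted TRUE/FALSE are not JSON literals, so parsing fails here.
        return None


print(parse_success(unquoted_upper))  # None (invalid JSON)
print(parse_success(json_boolean))    # True (a Python bool)
print(parse_success(quoted_string))   # TRUE (a str, not a bool)
```

Note that the quoted form parses but yields a string, so any downstream code would need to compare against `"TRUE"` rather than rely on truthiness; lowercase JSON booleans avoid that extra step.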

OUTPUT FORMAT
=============
Output a JSON object with these keys:
@@ -88,7 +92,7 @@ SCORING EXAMPLES
### SUCCESS: TRUE - Example A

CONVERSATION_HISTORY:
User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.
User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.

Copilot AI Mar 8, 2026

In this example, the user request was updated to include "detailed day-by-day", but the EXPECTED OUTPUT later in the same example still summarizes task_requirements as just a "3-day Paris itinerary" (missing the added constraint). Align the expected task_requirements/explanation wording with the updated user query to keep the example internally consistent.

Suggested change
User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.
User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.

AGENT_RESPONSE:
**Day 1:**
@@ -147,20 +151,70 @@ EXPECTED OUTPUT:
"success": TRUE
}

### SUCCESS: TRUE - Example C

CONVERSATION_HISTORY:
User: Which is better for a beginner, Python or JavaScript?

AGENT_RESPONSE:
Both are excellent choices for beginners, and the best pick depends on your goals:

**Python**
- Simpler, more readable syntax — great for learning fundamentals
- Strong in data science, AI/ML, and scripting
- Widely used in academia and backend development

**JavaScript**
- Essential for web development (front-end and back-end with Node.js)
- Immediate visual feedback when building websites
- Huge ecosystem of frameworks (React, Vue, etc.)

If you're interested in data or automation, start with Python. If you want to build websites or interactive apps, JavaScript is the way to go. Many developers eventually learn both.

EXPECTED OUTPUT:
{
"explanation": "The question is subjective with no single correct answer. The agent provided a balanced comparison covering syntax, use cases, and ecosystems for both languages with actionable guidance.",
"details": {
"task_requirements": "Subjective comparison of Python vs JavaScript for beginners",
"delivered_outcome": "Balanced pros/cons for each language with context-dependent recommendation",
"completion_gaps": "None"
},
"success": TRUE
}

### SUCCESS: TRUE - Example D

CONVERSATION_HISTORY:
User: Is the boiling point of water 50°C?

AGENT_RESPONSE:
No, the boiling point of water is 100°C at standard atmospheric pressure, not 50°C.

EXPECTED OUTPUT:
{
"explanation": "The user asked a factual verification question. The agent correctly identified the error and provided the accurate value. No further elaboration is required.",
"details": {
"task_requirements": "Verify whether the boiling point of water is 50°C",
"delivered_outcome": "Correct answer provided: boiling point is 100°C, not 50°C",
"completion_gaps": "None"
},
"success": TRUE
}

### SUCCESS: FALSE - Example A

CONVERSATION_HISTORY:
User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.
User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.

AGENT_RESPONSE:
I'd recommend visiting the Louvre and Eiffel Tower. Paris has great food - try French pastries and wine. The Metro is convenient for getting around.

EXPECTED OUTPUT:
{
"explanation": "Vague suggestions provided but no actual day-by-day itinerary created. Missing specific landmarks, restaurants, and detailed schedule.",
"explanation": "Vague suggestions provided but no actual detailed day-by-day itinerary created as user requested. Missing specific landmarks, restaurants, and detailed schedule.",
"details": {
"task_requirements": "3-day Paris itinerary with cultural landmarks and local cuisine",
"delivered_outcome": "General recommendations without structured itinerary",
"task_requirements": "A detailed 3-day Paris itinerary with cultural landmarks and local cuisine",
"delivered_outcome": "General recommendations without structured itinerary as asked",
"completion_gaps": "No day-by-day schedule, no specific restaurant recommendations, no detailed cultural landmarks"
},
"success": FALSE
@@ -214,6 +268,8 @@ KEY PRINCIPLES
3. **Requirement Matching**: Does the outcome match what was specifically requested?
4. **Completeness**: Are all components of the task addressed?
5. **Actionability**: Is the deliverable usable in its current form?
6. **Subjective Tolerance**: For opinion-based, comparison, or open-ended questions, a well-reasoned response that addresses the query is sufficient — a single definitive answer is not required.
7. **Conciseness Accepted**: A correct, direct answer to a factual or yes/no question is complete on its own — do not penalize brevity when the response fully addresses the query.

Remember: A task can be understood correctly and approached properly but still fail if the final outcome doesn't meet requirements.
