Update task completion prompty to handle subjective and factual questions #45575
salma-elshafey wants to merge 1 commit into main
Conversation
Pull request overview
Updates the Task Completion evaluator prompt to better score completion for subjective/open-ended questions and for direct factual/verification questions, expanding the rubric and examples to reflect the intended scoring behavior.
Changes:
- Added scoring notes clarifying how to judge subjective/comparison queries vs direct factual/yes/no queries.
- Expanded scoring examples to include a subjective comparison and a factual verification case.
- Added two new “Key Principles” to avoid penalizing balanced subjective answers and concise factual answers.
```diff
  - **TRUE**: The agent delivered a complete and correct solution that accomplishes the user's entire goal. The user does not need to take further action or ask follow-up questions to get what they originally asked for.
  - **FALSE**: The agent failed to complete one or more parts of the task, provided an incorrect/incomplete result, or left the user's goal unresolved.

+ **Note on subjective/open-ended queries:** When the user asks a subjective, opinion-based, or comparison question (e.g., "Which is better, X or Y?", "What do you think about…?"), there is no single correct answer. The task is considered **complete** (TRUE) if the agent provides a thoughtful, relevant response that addresses the question with reasonable perspectives or trade-offs — even if it does not give a single definitive recommendation.
```
The PR description is still the default template text (asking to "Please add an informative description...") and doesn't describe the actual change being made to this evaluator prompt or link any issues. Please update the PR description so reviewers/releasers can understand intent and impact.
```diff
+ **Note on subjective/open-ended queries:** When the user asks a subjective, opinion-based, or comparison question (e.g., "Which is better, X or Y?", "What do you think about…?"), there is no single correct answer. The task is considered **complete** (TRUE) if the agent provides a thoughtful, relevant response that addresses the question with reasonable perspectives or trade-offs — even if it does not give a single definitive recommendation.

+ **Note on direct/factual queries:** When the user asks a straightforward factual, yes/no, or verification question (e.g., "What is the capital of France?"), a correct and direct answer fully completes the task. No additional elaboration, context, or "actionable information" beyond the accurate answer is required.
```
The prompt instructs the model to output a JSON object (and this evaluator runs with `response_format: {type: json_object}`), but the guidance here reinforces using TRUE/FALSE tokens. If the model follows that literally (e.g., an unquoted TRUE), the output is invalid JSON and will fail parsing. Consider updating the prompt language/examples to use JSON booleans (`true`/`false`) or to explicitly quote the value (e.g., `"true"`/`"false"`) consistently throughout.
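A minimal sketch of why the token choice matters. The field name `task_complete` here is hypothetical, chosen only for illustration; the point is that Python's standard `json` parser accepts the lowercase JSON literal `true` but rejects a bare uppercase `TRUE`:

```python
import json

# Lowercase `true` is a valid JSON literal, so this parses cleanly.
ok = json.loads('{"task_complete": true}')
print(ok["task_complete"])  # True

# Uppercase TRUE is not a JSON literal; parsing raises JSONDecodeError,
# which is how a literal-minded model following the TRUE/FALSE rubric
# wording would break the evaluator's parsing step.
try:
    json.loads('{"task_complete": TRUE}')
except json.JSONDecodeError as exc:
    print("parse error:", exc.msg)
```

Quoting the value (`"true"`) would also parse, but then the evaluator would receive a string rather than a boolean, so consistent lowercase booleans are the simpler fix.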
```diff
  CONVERSATION_HISTORY:
- User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.
+ User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.
```
In this example, the user request was updated to include "detailed day-by-day", but the EXPECTED OUTPUT later in the same example still summarizes task_requirements as just a "3-day Paris itinerary" (missing the added constraint). Align the expected task_requirements/explanation wording with the updated user query to keep the example internally consistent.
Suggested change:
```diff
- User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.
+ User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.
```
Description
Please add an informative description that covers the changes made by the pull request and link all relevant issues.
If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines