
Update task completion prompty to handle subjective and factual questions #45575

Open

salma-elshafey wants to merge 1 commit into main from selshafey/fix_task_completion_eval

Conversation

@salma-elshafey
Contributor

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings March 8, 2026 12:48
@salma-elshafey salma-elshafey requested a review from a team as a code owner March 8, 2026 12:48
@github-actions github-actions bot added the "Evaluation" label (Issues related to the client library for Azure AI Evaluation) Mar 8, 2026

Copilot AI left a comment


Pull request overview

Updates the Task Completion evaluator prompt to better score completion for subjective/open-ended questions and for direct factual/verification questions, expanding the rubric and examples to reflect the intended scoring behavior.

Changes:

  • Added scoring notes clarifying how to judge subjective/comparison queries vs direct factual/yes/no queries.
  • Expanded scoring examples to include a subjective comparison and a factual verification case.
  • Added two new “Key Principles” to avoid penalizing balanced subjective answers and concise factual answers.


- **TRUE**: The agent delivered a complete and correct solution that accomplishes the user's entire goal. The user does not need to take further action or ask follow-up questions to get what they originally asked for.
- **FALSE**: The agent failed to complete one or more parts of the task, provided an incorrect/incomplete result, or left the user's goal unresolved.

**Note on subjective/open-ended queries:** When the user asks a subjective, opinion-based, or comparison question (e.g., "Which is better, X or Y?", "What do you think about…?"), there is no single correct answer. The task is considered **complete** (TRUE) if the agent provides a thoughtful, relevant response that addresses the question with reasonable perspectives or trade-offs — even if it does not give a single definitive recommendation.
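The rubric above can be illustrated with the two new query types it covers. This is a minimal sketch in Python; the field names (`task_completed`, `explanation`) are illustrative assumptions for this sketch, not the evaluator's actual output schema:

```python
import json

# Illustrative verdicts implied by the updated rubric. Field names here
# are assumptions, not the evaluator's real schema.
cases = [
    {
        "query": "Which is better, X or Y?",  # subjective comparison
        "response": "X is faster; Y is simpler. It depends on your priorities.",
        # A balanced trade-off answer counts as complete, even without
        # a single definitive recommendation.
        "verdict": {"task_completed": True,
                    "explanation": "Addresses the comparison with trade-offs."},
    },
    {
        "query": "What is the capital of France?",  # direct factual query
        "response": "Paris.",
        # A correct, direct answer fully completes the task;
        # no elaboration or extra context is required.
        "verdict": {"task_completed": True,
                    "explanation": "Accurate direct answer."},
    },
]

for case in cases:
    print(json.dumps(case["verdict"]))
```

Both cases score TRUE under the expanded rubric, where the pre-change prompt might have penalized the balanced comparison or the terse factual answer.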

Copilot AI Mar 8, 2026


The PR description is still the default template text (asking to "Please add an informative description...") and doesn't describe the actual change being made to this evaluator prompt or link any issues. Please update the PR description so reviewers/releasers can understand intent and impact.

Comment on lines +72 to +74
**Note on subjective/open-ended queries:** When the user asks a subjective, opinion-based, or comparison question (e.g., "Which is better, X or Y?", "What do you think about…?"), there is no single correct answer. The task is considered **complete** (TRUE) if the agent provides a thoughtful, relevant response that addresses the question with reasonable perspectives or trade-offs — even if it does not give a single definitive recommendation.

**Note on direct/factual queries:** When the user asks a straightforward factual, yes/no, or verification question (e.g., "What is the capital of France?"), a correct and direct answer fully completes the task. No additional elaboration, context, or "actionable information" beyond the accurate answer is required.

Copilot AI Mar 8, 2026


The prompt instructs the model to output a JSON object (and this evaluator runs with response_format: {type: json_object}), but the guidance here reinforces using TRUE/FALSE tokens. If the model follows that literally (e.g., unquoted TRUE), it is invalid JSON and will fail parsing. Consider updating the prompt language/examples to use JSON booleans (true/false) or explicitly quote the value (e.g., "true"/"false") consistently throughout.
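The failure mode described in this comment can be reproduced directly with the standard library. A minimal sketch, assuming the key name `task_completed` (illustrative, not the evaluator's actual schema):

```python
import json

# A bare TRUE token, as the rubric's wording might encourage the model
# to emit, is not valid JSON, so it fails under
# response_format {"type": "json_object"}:
try:
    json.loads('{"task_completed": TRUE}')
except json.JSONDecodeError as err:
    print("parse failed:", err.msg)  # → parse failed: Expecting value

# A JSON boolean (or a consistently quoted string) parses cleanly:
verdict = json.loads('{"task_completed": true}')
print(verdict["task_completed"])  # → True
```

This is why the comment suggests aligning the prompt's examples on JSON booleans (`true`/`false`) rather than bare TRUE/FALSE tokens.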


CONVERSATION_HISTORY:
User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.
User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.

Copilot AI Mar 8, 2026


In this example, the user request was updated to include "detailed day-by-day", but the EXPECTED OUTPUT later in the same example still summarizes task_requirements as just a "3-day Paris itinerary" (missing the added constraint). Align the expected task_requirements/explanation wording with the updated user query to keep the example internally consistent.

Suggested change
User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.
User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.
