From 5867b91ba39fc37f7f1cb8ca85f463422eb305df Mon Sep 17 00:00:00 2001
From: salma-elshafey <selshafey@microsoft.com>
Date: Sun, 8 Mar 2026 14:47:07 +0200
Subject: [PATCH] Update task completion prompty to handle subjective and
 factual questions

---
 .../_task_completion/task_completion.prompty  | 66 +++++++++++++++++--
 1 file changed, 61 insertions(+), 5 deletions(-)

diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty
index 8af03918e3c3..e53ff1a9c458 100644
--- a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty
+++ b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty
@@ -69,6 +69,10 @@ D. Assign a Score:
    - **TRUE**: The agent delivered a complete and correct solution that accomplishes the user's entire goal. The user does not need to take further action or ask follow-up questions to get what they originally asked for.
    - **FALSE**: The agent failed to complete one or more parts of the task, provided an incorrect/incomplete result, or left the user's goal unresolved.
 
+   **Note on subjective/open-ended queries:** When the user asks a subjective, opinion-based, or comparison question (e.g., "Which is better, X or Y?", "What do you think about…?"), there is no single correct answer. The task is considered **complete** (TRUE) if the agent provides a thoughtful, relevant response that addresses the question with reasonable perspectives or trade-offs — even if it does not give a single definitive recommendation.
+
+   **Note on direct/factual queries:** When the user asks a straightforward factual, yes/no, or verification question (e.g., "What is the capital of France?"), a correct and direct answer fully completes the task. No additional elaboration, context, or "actionable information" beyond the accurate answer is required.
+
 OUTPUT FORMAT
 =============
 Output a JSON object with these keys:
@@ -88,7 +92,7 @@ SCORING EXAMPLES
 ### SUCCESS: TRUE - Example A
 
 CONVERSATION_HISTORY:
-User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.
+User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.
 
 AGENT_RESPONSE:
 **Day 1:**
@@ -147,20 +151,70 @@ EXPECTED OUTPUT:
   "success": TRUE
 }
 
+### SUCCESS: TRUE - Example C
+
+CONVERSATION_HISTORY:
+User: Which is better for a beginner, Python or JavaScript?
+
+AGENT_RESPONSE:
+Both are excellent choices for beginners, and the best pick depends on your goals:
+
+**Python**
+- Simpler, more readable syntax — great for learning fundamentals
+- Strong in data science, AI/ML, and scripting
+- Widely used in academia and backend development
+
+**JavaScript**
+- Essential for web development (front-end and back-end with Node.js)
+- Immediate visual feedback when building websites
+- Huge ecosystem of frameworks (React, Vue, etc.)
+
+If you're interested in data or automation, start with Python. If you want to build websites or interactive apps, JavaScript is the way to go. Many developers eventually learn both.
+
+EXPECTED OUTPUT:
+{
+  "explanation": "The question is subjective with no single correct answer. The agent provided a balanced comparison covering syntax, use cases, and ecosystems for both languages with actionable guidance.",
+  "details": {
+    "task_requirements": "Subjective comparison of Python vs JavaScript for beginners",
+    "delivered_outcome": "Balanced pros/cons for each language with context-dependent recommendation",
+    "completion_gaps": "None"
+  },
+  "success": TRUE
+}
+
+### SUCCESS: TRUE - Example D
+
+CONVERSATION_HISTORY:
+User: Is the boiling point of water 50°C?
+
+AGENT_RESPONSE:
+No, the boiling point of water is 100°C at standard atmospheric pressure, not 50°C.
+
+EXPECTED OUTPUT:
+{
+  "explanation": "The user asked a factual verification question. The agent correctly identified the error and provided the accurate value. No further elaboration is required.",
+  "details": {
+    "task_requirements": "Verify whether the boiling point of water is 50°C",
+    "delivered_outcome": "Correct answer provided: boiling point is 100°C, not 50°C",
+    "completion_gaps": "None"
+  },
+  "success": TRUE
+}
+
 ### SUCCESS: FALSE - Example A
 
 CONVERSATION_HISTORY:
-User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.
+User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.
 
 AGENT_RESPONSE:
 I'd recommend visiting the Louvre and Eiffel Tower. Paris has great food - try French pastries and wine. The Metro is convenient for getting around.
 
 EXPECTED OUTPUT:
 {
-  "explanation": "Vague suggestions provided but no actual day-by-day itinerary created. Missing specific landmarks, restaurants, and detailed schedule.",
+  "explanation": "Vague suggestions provided, but no detailed day-by-day itinerary was created as the user requested. Missing specific landmarks, restaurants, and detailed schedule.",
   "details": {
-    "task_requirements": "3-day Paris itinerary with cultural landmarks and local cuisine",
-    "delivered_outcome": "General recommendations without structured itinerary",
+    "task_requirements": "A detailed 3-day Paris itinerary with cultural landmarks and local cuisine",
+    "delivered_outcome": "General recommendations without the structured itinerary the user asked for",
     "completion_gaps": "No day-by-day schedule, no specific restaurant recommendations, no detailed cultural landmarks"
   },
   "success": FALSE
@@ -214,6 +268,8 @@ KEY PRINCIPLES
 3. **Requirement Matching**: Does the outcome match what was specifically requested?
 4. **Completeness**: Are all components of the task addressed?
 5. **Actionability**: Is the deliverable usable in its current form?
+6. **Subjective Tolerance**: For opinion-based, comparison, or open-ended questions, a well-reasoned response that addresses the query is sufficient — a single definitive answer is not required.
+7. **Conciseness Accepted**: A correct, direct answer to a factual or yes/no question is complete on its own — do not penalize brevity when the response fully addresses the query.
 
 Remember: A task can be understood correctly and approached properly but still fail if the final outcome doesn't meet requirements.