From 5867b91ba39fc37f7f1cb8ca85f463422eb305df Mon Sep 17 00:00:00 2001
From: salma-elshafey
Date: Sun, 8 Mar 2026 14:47:07 +0200
Subject: [PATCH] Update task completion prompty to handle subjective and factual questions

---
 .../_task_completion/task_completion.prompty | 66 +++++++++++++++++--
 1 file changed, 61 insertions(+), 5 deletions(-)

diff --git a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty
index 8af03918e3c3..e53ff1a9c458 100644
--- a/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty
+++ b/sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_task_completion/task_completion.prompty
@@ -69,6 +69,10 @@ D. Assign a Score:
    - **TRUE**: The agent delivered a complete and correct solution that accomplishes the user's entire goal. The user does not need to take further action or ask follow-up questions to get what they originally asked for.
    - **FALSE**: The agent failed to complete one or more parts of the task, provided an incorrect/incomplete result, or left the user's goal unresolved.
 
+   **Note on subjective/open-ended queries:** When the user asks a subjective, opinion-based, or comparison question (e.g., "Which is better, X or Y?", "What do you think about…?"), there is no single correct answer. The task is considered **complete** (TRUE) if the agent provides a thoughtful, relevant response that addresses the question with reasonable perspectives or trade-offs — even if it does not give a single definitive recommendation.
+
+   **Note on direct/factual queries:** When the user asks a straightforward factual, yes/no, or verification question (e.g., "What is the capital of France?"), a correct and direct answer fully completes the task. No additional elaboration, context, or "actionable information" beyond the accurate answer is required.
+
 OUTPUT FORMAT
 =============
 Output a JSON object with these keys:
@@ -88,7 +92,7 @@ SCORING EXAMPLES
 ### SUCCESS: TRUE - Example A
 
 CONVERSATION_HISTORY:
-User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.
+User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.
 
 AGENT_RESPONSE:
 **Day 1:**
@@ -147,20 +151,70 @@ EXPECTED OUTPUT:
   "success": TRUE
 }
 
+### SUCCESS: TRUE - Example C
+
+CONVERSATION_HISTORY:
+User: Which is better for a beginner, Python or JavaScript?
+
+AGENT_RESPONSE:
+Both are excellent choices for beginners, and the best pick depends on your goals:
+
+**Python**
+- Simpler, more readable syntax — great for learning fundamentals
+- Strong in data science, AI/ML, and scripting
+- Widely used in academia and backend development
+
+**JavaScript**
+- Essential for web development (front-end and back-end with Node.js)
+- Immediate visual feedback when building websites
+- Huge ecosystem of frameworks (React, Vue, etc.)
+
+If you're interested in data or automation, start with Python. If you want to build websites or interactive apps, JavaScript is the way to go. Many developers eventually learn both.
+
+EXPECTED OUTPUT:
+{
+  "explanation": "The question is subjective with no single correct answer. The agent provided a balanced comparison covering syntax, use cases, and ecosystems for both languages with actionable guidance.",
+  "details": {
+    "task_requirements": "Subjective comparison of Python vs JavaScript for beginners",
+    "delivered_outcome": "Balanced pros/cons for each language with context-dependent recommendation",
+    "completion_gaps": "None"
+  },
+  "success": TRUE
+}
+
+### SUCCESS: TRUE - Example D
+
+CONVERSATION_HISTORY:
+User: Is the boiling point of water 50°C?
+
+AGENT_RESPONSE:
+No, the boiling point of water is 100°C at standard atmospheric pressure, not 50°C.
+
+EXPECTED OUTPUT:
+{
+  "explanation": "The user asked a factual verification question. The agent correctly identified the error and provided the accurate value. No further elaboration is required.",
+  "details": {
+    "task_requirements": "Verify whether the boiling point of water is 50°C",
+    "delivered_outcome": "Correct answer provided: boiling point is 100°C, not 50°C",
+    "completion_gaps": "None"
+  },
+  "success": TRUE
+}
+
 ### SUCCESS: FALSE - Example A
 
 CONVERSATION_HISTORY:
-User: Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.
+User: Plan a detailed day-by-day 3-day itinerary for Paris with cultural landmarks and local cuisine.
 
 AGENT_RESPONSE:
 I'd recommend visiting the Louvre and Eiffel Tower. Paris has great food - try French pastries and wine. The Metro is convenient for getting around.
 
 EXPECTED OUTPUT:
 {
-  "explanation": "Vague suggestions provided but no actual day-by-day itinerary created. Missing specific landmarks, restaurants, and detailed schedule.",
+  "explanation": "Vague suggestions provided but no actual detailed day-by-day itinerary created as user requested. Missing specific landmarks, restaurants, and detailed schedule.",
   "details": {
-    "task_requirements": "3-day Paris itinerary with cultural landmarks and local cuisine",
-    "delivered_outcome": "General recommendations without structured itinerary",
+    "task_requirements": "A detailed 3-day Paris itinerary with cultural landmarks and local cuisine",
+    "delivered_outcome": "General recommendations without structured itinerary as asked",
     "completion_gaps": "No day-by-day schedule, no specific restaurant recommendations, no detailed cultural landmarks"
   },
   "success": FALSE
@@ -214,6 +268,8 @@ KEY PRINCIPLES
 3. **Requirement Matching**: Does the outcome match what was specifically requested?
 4. **Completeness**: Are all components of the task addressed?
 5. **Actionability**: Is the deliverable usable in its current form?
+6. **Subjective Tolerance**: For opinion-based, comparison, or open-ended questions, a well-reasoned response that addresses the query is sufficient — a single definitive answer is not required.
+7. **Conciseness Accepted**: A correct, direct answer to a factual or yes/no question is complete on its own — do not penalize brevity when the response fully addresses the query.
 
 Remember: A task can be understood correctly and approached properly but still fail if the final outcome doesn't meet requirements.
 