Python: Foundry Evals integration for Python #4750
alliscode wants to merge 35 commits into microsoft:main
Conversation
Force-pushed from a0edd5f to fe9e621.
Five resolved review threads (outdated) on python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py.
Force-pushed from 15d8640 to aad92ac.
Python Test Coverage Report • Python Unit Test Overview (coverage table omitted).
Force-pushed from 901ea59 to d52c85e.
Merged and refactored eval module per Eduard's PR review:
- Merge _eval.py + _local_eval.py into single _evaluation.py
- Convert EvalItem from dataclass to regular class
- Rename to_dict() to to_eval_data()
- Convert _AgentEvalData to TypedDict
- Simplify check system: unified async pattern with isawaitable
- Parallelize checks and evaluators with asyncio.gather
- Add all/any mode to tool_called_check
- Fix bool(passed) truthy bug in _coerce_result
- Remove deprecated function_evaluator/async_function_evaluator aliases
- Remove _MinimalAgent, tighten evaluate_agent signature
- Set self.name in __init__ (LocalEvaluator, FoundryEvals)
- Limit FoundryEvals to AsyncOpenAI only
- Type project_client as AIProjectClient
- Remove NotImplementedError continuous eval code
- Add evaluation samples in 02-agents/ and 03-workflows/
- Update all imports and tests (167 passing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
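For reference, the unified check pattern described in this commit (isawaitable handling plus asyncio.gather fan-out) looks roughly like the sketch below; the helper names are illustrative and not the actual _evaluation.py internals.

```python
import asyncio
from collections.abc import Callable, Sequence
from inspect import isawaitable
from typing import Any


async def run_checks(checks: Sequence[Callable[[Any], Any]], item: Any) -> list[Any]:
    """Run sync and async checks uniformly and in parallel (illustrative only)."""

    async def run_one(check: Callable[[Any], Any]) -> Any:
        result = check(item)
        # Unified pattern from the commit: await only when a check returns an awaitable.
        return await result if isawaitable(result) else result

    return await asyncio.gather(*(run_one(check) for check in checks))
```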
Use cast(list[Any], x) with type: ignore[redundant-cast] comments to satisfy both mypy (which considers casting Any redundant) and pyright strict mode (which needs explicit casts to narrow Unknown types). Also fix evaluator decorator check_name type annotation to be explicitly str, resolving mypy str|Any|None mismatch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…attr
- Apply pyupgrade: Sequence from collections.abc, remove forward-ref quotes
- Add @overload signatures to evaluator() for proper @evaluator usage
- Fix evaluate_workflow sample to use WorkflowBuilder(start_executor=) API
- Fix _workflow.py executor.reset() to use getattr pattern for pyright
- Remove unused EvalResults forward-ref string in default_factory lambda

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The test_configure_otel_providers_with_env_file_and_vs_code_port test triggers gRPC OTLP exporter creation, but the grpc dependency is optional and not installed by default. Add skipif decorator matching the pattern used by all other gRPC exporter tests in the same file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move module docstrings before imports (after copyright header)
- Add -> None return type to all main() and helper functions
- Fix line-too-long in multiturn sample conversation data
- Add Workflow import for typed return in all_patterns_sample

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nings
- Simplify _ensure_async_result to direct await (async-only clients)
- Replace get_event_loop() with get_running_loop()
- Narrow _fetch_output_items exception handling to specific types
- Add warning log when _filter_tool_evaluators falls back to defaults
- Add DeprecationWarning to options alias in Agent.__init__
- Add DeprecationWarning to evaluate_response()
- Rename raw key to _raw_arguments in convert_message fallback
- Fix evaluate_agent_sample.py: replace evals.select() with FoundryEvals()
- Fix evaluate_multiturn_sample.py: use Message/Content/FunctionTool types
- Fix evaluate_workflow_sample.py: replace evals.select() with FoundryEvals()
- Update test mocks to use AsyncMock for awaited API calls

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add num_repetitions=2 positive test verifying 2×items and 4 agent calls
- Add _poll_eval_run tests: timeout, failed, and canceled paths
- Add evaluate_traces tests: validation error, response_ids path, trace_ids path
- Add evaluate_foundry_target happy-path test with target/query verification

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wrap implicit string concatenation in parens in evaluate_multiturn_sample.py
- Apply ruff formatter to 6 other files with minor formatting drift

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nch) Reverts changes to _agents.py, _agent_executor.py, and _workflow.py back to upstream/main. These fixes are now in a separate PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Code fixes:
- Fix _normalize_queries inverted condition (single query now replicates to match expected_count)
- Fix substring match bug: 'end' in 'backend' matched; use exact set lookup for executor ID filtering
- Fix used_available_tools sample: tool_definitions→tools param, use FunctionTool attribute access instead of dict .get()
- Add None-check in _resolve_openai_client for misconfigured project
- Add Returns section to evaluate_workflow docstring
- Cache inspect.signature in @evaluator wrapper (avoid per-item reflection)

Architecture:
- Extract _evaluate_via_responses as module-level helper; evaluate_traces now calls it directly instead of creating a FoundryEvals instance
- Move Foundry-specific typed-content conversion out of core to_eval_data; core now returns plain role/content dicts, FoundryEvals applies AgentEvalConverter in _evaluate_via_dataset

Tests:
- evaluate_response() deprecation warning emission and delegation
- num_repetitions > 1 with expected_output and expected_tool_calls
- Mock output_items.list in test_evaluate_calls_evals_api
- Update to_eval_data assertions for plain-dict format
- Unknown param error now raised at @evaluator decoration time

Skipped (separate PR): executor reset loop, xfail removal, options alias

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Revert test_full_conversation.py to upstream/main (the session preservation test was incorrectly changed to assert clearing)
- Fix pyright reportUnnecessaryComparison on get_openai_client() None check by adding ignore comment
- Fix pyright reportPrivateUsage: add public EvalItem.split_messages() method and use it in FoundryEvals._evaluate_via_dataset instead of accessing private _split_conversation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add try/except guard for non-numeric score in _coerce_result
- Add poll_interval minimum bound (0.1s) to prevent tight loops
- Add runtime async client check in _resolve_openai_client
- Remove _ensure_async_result wrapper (10 call sites → direct await)
- Better error message when queries provided without agent
- Import-time asserts for evaluator set consistency
- Remove 28 redundant @pytest.mark.asyncio decorators
- Add doc note about _raw_arguments sensitive data
- Tests: tool_called_check mode=any, _normalize_queries branches, _extract_result_counts paths, _extract_per_evaluator, bare check via evaluate_agent, output_items assertion, modulo wrapping, async client check, queries-without-agent error

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Code fixes:
- Move import-time RuntimeError checks to unit tests (avoids breaking imports for all users on developer set-drift mistake)
- _filter_tool_evaluators now raises ValueError when all evaluators require tools but no items have tools (was silently substituting)
- Add poll_interval upper bound (60s) to prevent single-iteration sleep
- Log exc_info=True in _fetch_output_items for debugging API changes
- Fix evaluate() docstring: remove claim about Responses API optimization
- Validate target dict has 'type' key in evaluate_foundry_target
- Document to_eval_data() limitation: non-text content is omitted

Tests:
- TestEvaluatorSetConsistency: verify _AGENT/_TOOL subsets of _BUILTIN
- TestEvaluateTracesAgentId: agent_id-only path with lookback_hours
- TestFilterToolEvaluatorsRaises: ValueError on all-tool no-items
- TestEvaluateFoundryTargetValidation: target without 'type' key
- Assert items==[] on failed/canceled poll results
- Mock output_items.list in response_ids test for full flow
- TestAllPassedSubResults: result_counts=None + sub_results delegation and parent failures override sub_results
- TestBuildOverallItemEmpty: empty workflow outputs returns None

Skipped r5-07 (_raw_arguments length hint): marginal debugging value, could leak content size information.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…s=...) The referenced function doesn't exist; the correct API is evaluate_traces(response_ids=...) from the azure-ai package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
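For illustration, a call to the helper this commit points to might look like the sketch below; evaluate_traces and response_ids are named in this PR, while the project_client keyword, the endpoint handling, and the placeholder IDs are assumptions.

```python
from agent_framework_azure_ai import evaluate_traces
from azure.ai.projects.aio import AIProjectClient
from azure.identity.aio import AzureCliCredential


async def evaluate_existing_responses(endpoint: str) -> None:
    async with AzureCliCredential() as credential:
        project_client = AIProjectClient(endpoint=endpoint, credential=credential)
        results = await evaluate_traces(
            project_client=project_client,     # keyword assumed; a later commit passes an async AIProjectClient
            response_ids=["resp_example_id"],  # placeholder IDs
            evaluators=["relevance"],          # evaluator shorthand names from this PR
        )
        print(results)
```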
- Remove to_eval_data() from EvalItem (dead code after r4-05 JSONL refactor)
- Migrate 15 tests from to_eval_data() to split_messages()
- Update sample to use split_messages() + Message properties
- Remove unimplemented Responses API optimization docstring claim
- Update split_messages() docstring to not reference removed method

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The method was never called — evaluate() uses _evaluate_via_dataset, and evaluate_traces() calls _evaluate_via_responses_impl directly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from dc01030 to 8a2c237.
…format
- Remove import of non-existent _foundry_memory_provider module (incorrectly kept during rebase conflict resolution)
- Apply ruff formatter to test_local_eval.py and get-started samples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 36c21b1 to f439746.
The upstream provider-leading client refactor (microsoft#4818) made client= a required parameter on Agent(). Update the three getting-started eval samples to use FoundryChatClient with FOUNDRY_PROJECT_ENDPOINT, matching the standard pattern from 01-get-started samples. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from a74c9d1 to 8d8b6e8.
Replace ~80 lines of manual OpenAI evals API code (create_eval, run_eval, manual polling, raw JSONL params) with FoundryEvals:
- evaluate_groundedness() uses FoundryEvals.evaluate() with EvalItem
- Remove create_openai_client(), create_eval(), run_eval() functions
- Remove openai SDK type imports (DataSourceConfigCustom, etc.)
- run_self_reflection_batch creates FoundryEvals instance once, reuses it for all iterations across all prompts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Migrate all foundry_evals samples from AzureOpenAIResponsesClient to FoundryChatClient
- Update env var from AZURE_AI_PROJECT_ENDPOINT to FOUNDRY_PROJECT_ENDPOINT
- Use AzureCliCredential consistently across all samples
- Fix README.md: correct function names (evaluate_dataset -> FoundryEvals.evaluate, evaluate_responses -> evaluate_traces)
- Update self_reflection .env.example and README.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR adds a provider-agnostic evaluation framework to the Python Agent Framework, with both local (no-API) evaluators and an Azure AI Foundry-backed provider, plus end-to-end samples that demonstrate agent and workflow evaluation patterns.
Changes:
- Introduces core evaluation types and orchestration (EvalItem, EvalResults, evaluate_agent(), evaluate_workflow()) plus local checks (LocalEvaluator, @evaluator); see the sketch after this list.
- Adds Azure AI Foundry provider integration (FoundryEvals) and trace/target evaluation helpers.
- Adds/updates evaluation samples (Foundry evals patterns + self-reflection groundedness) and expands test coverage for local evaluation.
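A minimal sketch of the local-evaluation surface summarized above: the import locations follow the file table below, but the parameter names (queries=, checks=) and the shape of the check function are assumptions rather than the reviewed API.

```python
import asyncio

from agent_framework import evaluate_agent, evaluator  # re-exported from the package root per this PR


@evaluator
def contains_four(response: str) -> bool:
    """Hypothetical local check; the framework inspects the check's parameter names."""
    return "4" in response


async def main() -> None:
    agent = ...  # an agent_framework Agent; construction is omitted in this sketch
    results = await evaluate_agent(
        agent,
        queries=["What is 2 + 2?"],  # parameter names here are assumptions
        checks=[contains_four],
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```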
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 12 comments.
Summary per file:
| File | Description |
|---|---|
| python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py | Migrates groundedness scoring to FoundryEvals and updates batch runner. |
| python/samples/05-end-to-end/evaluation/self_reflection/README.md | Updates self-reflection sample documentation for Foundry Evals usage and env vars. |
| python/samples/05-end-to-end/evaluation/self_reflection/.env.example | Updates env var example to FOUNDRY_PROJECT_ENDPOINT. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py | New sample: evaluate multi-agent workflows with Foundry evaluators. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py | New sample: evaluate existing responses / traces via Foundry. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_multiturn_sample.py | New sample: demonstrate conversation split strategies for eval. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py | New sample: mix LocalEvaluator with Foundry evaluators in one call. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_all_patterns_sample.py | New “kitchen sink” sample covering all evaluation patterns. |
| python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py | New sample: evaluate_agent patterns + direct FoundryEvals.evaluate(). |
| python/samples/05-end-to-end/evaluation/foundry_evals/README.md | New README describing Foundry eval samples and entry points. |
| python/samples/05-end-to-end/evaluation/foundry_evals/.env.example | New env example for Foundry eval samples. |
| python/samples/03-workflows/evaluation/evaluate_workflow.py | New workflow evaluation sample using local checks. |
| python/samples/02-agents/evaluation/evaluate_with_expected.py | New sample demonstrating expected outputs/tool call expectations. |
| python/samples/02-agents/evaluation/evaluate_agent.py | New sample demonstrating basic local evaluation for agents. |
| python/packages/core/tests/core/test_observability.py | Adjusts OTLP exporter-related test skipping. |
| python/packages/core/tests/core/test_local_eval.py | Adds a comprehensive test suite for local eval framework behaviors. |
| python/packages/core/agent_framework/_evaluation.py | Adds the provider-agnostic evaluation framework implementation. |
| python/packages/core/agent_framework/__init__.py | Re-exports evaluation APIs/types from the package root. |
| python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py | Adds the Foundry-backed FoundryEvals provider + trace/target helpers. |
| python/packages/azure-ai/agent_framework_azure_ai/__init__.py | Exposes FoundryEvals and helper functions from the azure-ai package. |
Resolved review threads on:
- python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py (outdated)
- python/samples/05-end-to-end/evaluation/self_reflection/self_reflection.py (two threads)
- python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py (two threads)
- python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_mixed_sample.py
- python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py
- python/samples/05-end-to-end/evaluation/foundry_evals/evaluate_workflow_sample.py
…nto af-foundry-evals-python
…jectClient AIProjectClient from azure.ai.projects.aio requires an async credential. Switch all foundry_evals samples from azure.identity.AzureCliCredential to azure.identity.aio.AzureCliCredential. Also pass project_client to FoundryChatClient instead of duplicating endpoint+credential. Close credential in self_reflection sample to avoid resource leak. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from d266ee2 to 997a379.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolved review thread (outdated) on python/packages/azure-ai/agent_framework_azure_ai/_foundry_evals.py.
- Convert all Example:: / Typical usage:: code blocks to .. code-block:: python format matching codebase convention (both _evaluation.py and _foundry_evals.py)
- Add async pagination in _fetch_output_items via async for (handles large result sets)
- Replace hasattr(__aenter__) with isinstance(client, AsyncOpenAI) in _resolve_openai_client
- Move AsyncOpenAI import from TYPE_CHECKING to runtime (needed for isinstance)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix tests: use MagicMock(spec=AsyncOpenAI) for project_client mocks (isinstance check now requires proper type, not duck-typing)
- Fix tests: replace mock_page.__iter__ with _AsyncPage helper for async for
- Fix evaluate_response: auto-extract queries from response messages when query is not provided (previously always raised ValueError)
- Add debug logging when skipping internal _-prefixed executor IDs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    Raises:
        ValueError: If the name is not recognized.
    """
    if name.startswith("builtin."):
nit: should we also check if the name is actually one of the builtins?
Copilot on behalf of alliscode: Good call. We considered strict validation but decided to pass through any builtin.* name as-is since Foundry may add new evaluators that aren't in our local mapping yet. Only unqualified shorthand names (relevance, coherence, etc.) are validated against the known set. Added a comment explaining this rationale — see bb5f546.
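The pass-through rule described here reads roughly like the following sketch; the helper and constant names are assumptions, and the real logic lives in _resolve_evaluator in _foundry_evals.py.

```python
_KNOWN_SHORTHANDS = {"relevance", "coherence", "tool_call_accuracy"}  # illustrative subset only


def resolve_evaluator(name: str) -> str:
    if name.startswith("builtin."):
        # Pass builtin.* names through unvalidated so newly added Foundry
        # evaluators keep working without a framework update.
        return name
    if name not in _KNOWN_SHORTHANDS:
        raise ValueError(f"Unknown evaluator: {name!r}")
    return f"builtin.{name}"
```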
    qualified = _resolve_evaluator(name)
    short = name if not name.startswith("builtin.") else name.split(".")[-1]

    entry: dict[str, Any] = {
nit: is there a model defined somewhere with all the parameters documented and typed? The dictionary is hard to reason about and creates potential maintenance overhead in the future.
Copilot on behalf of alliscode: The testing_criteria dict structure is dictated by the OpenAI SDK's evals.create() API — it accepts dict[str, Any] with no typed model yet. We could introduce a TypedDict for internal representation, but it would just wrap the same dict. For now, added an inline comment with a link to the API reference docs. If the SDK adds typed models in the future, we'll adopt them. See bb5f546.
    if qualified in _AGENT_EVALUATORS:
        # Agent evaluators: query/response as conversation arrays
        mapping: dict[str, str] = {
            "query": "{{item.query_messages}}",
{{item.query_messages}}
Are these placeholders?
Copilot on behalf of alliscode: Yes, these are Mustache-style template placeholders used by the OpenAI evals API data_mapping feature. They reference fields in the JSONL data items (e.g. {{item.query_messages}} maps to the query_messages field in each item). Added an inline comment explaining this — see bb5f546.
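To make the placeholder mechanics concrete, here is an illustrative pairing of a JSONL item with a data_mapping entry; the query_messages field name comes from the snippet above, while the response_messages field and the overall item shape are assumptions.

```python
# One uploaded JSONL item (illustrative shape only).
jsonl_item = {
    "query_messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "response_messages": [{"role": "assistant", "content": "4"}],  # field name assumed
}

# Mustache-style placeholders in the data_mapping resolve against each item's fields.
data_mapping = {
    "query": "{{item.query_messages}}",        # -> jsonl_item["query_messages"]
    "response": "{{item.response_messages}}",  # assumed; mirrors the query mapping
}
```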
    if remaining <= 0:
        return EvalResults(provider=provider, eval_id=eval_id, run_id=run_id, status="timeout")
    logger.debug("Eval run %s status: %s (%.0fs remaining)", run_id, run.status, remaining)
    await asyncio.sleep(min(max(poll_interval, 1.0), remaining, 60.0))
If remaining is less than the poll interval, should we still go for another round? Also, there is the 60 second implicit upper bound on the poll interval, which may throw people off.
Copilot on behalf of alliscode: Yes — if remaining < poll_interval, the min(poll_interval, remaining, 60.0) clamp ensures we still sleep for just the remaining time and do one final check. Added a comment documenting the 60s upper bound rationale: it prevents a single long sleep from consuming the entire timeout, allowing the loop to check status more frequently. See bb5f546.
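A small worked example of that clamp, restating only the arithmetic of the line shown above:

```python
def clamp_sleep(poll_interval: float, remaining: float) -> float:
    # Mirrors the expression in the snippet: min(max(poll_interval, 1.0), remaining, 60.0)
    return min(max(poll_interval, 1.0), remaining, 60.0)


assert clamp_sleep(5.0, 180.0) == 5.0     # normal case: sleep one poll interval
assert clamp_sleep(5.0, 2.0) == 2.0       # near timeout: sleep only the remaining time, then re-check once
assert clamp_sleep(300.0, 180.0) == 60.0  # oversized interval: the 60s cap keeps status checks frequent
assert clamp_sleep(0.01, 180.0) == 1.0    # tiny interval: the 1.0s floor prevents a tight loop
```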
    await asyncio.sleep(min(max(poll_interval, 1.0), remaining, 60.0))


def _extract_result_counts(run: Any) -> dict[str, int] | None:
nit: is it possible to narrow down the type of run?
Copilot on behalf of alliscode: Updated — removed the | Any union entirely. This is a brand new feature so no backward compat needed. The type is now just RunRetrieveResponse. See d05d032.
def _extract_result_counts(run: Any) -> dict[str, int] | None:
    """Safely extract result_counts from an eval run object."""
    counts = getattr(run, "result_counts", None)
If we can tighten the type, we can get rid of a lot of the getattr in the code.
Copilot on behalf of alliscode: Agreed — done. Replaced most getattr calls with direct field access using the typed RunRetrieveResponse and PerTestingCriteriaResult types from the OpenAI SDK. Still using getattr for result_counts since the return value is a Pydantic model that needs extraction, but the individual fields (errored/failed/passed/total) are now accessed by known name. See bb5f546.
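A rough sketch of the typed accessor described in this reply, assuming the openai SDK's RunRetrieveResponse exposes a result_counts model with errored/failed/passed/total fields as discussed:

```python
from openai.types.evals import RunRetrieveResponse  # import path assumed for the installed SDK version


def extract_result_counts(run: RunRetrieveResponse) -> dict[str, int] | None:
    """Illustrative only: typed run object, getattr kept for the optional counts model."""
    counts = getattr(run, "result_counts", None)
    if counts is None:
        return None
    return {
        "errored": counts.errored,
        "failed": counts.failed,
        "passed": counts.passed,
        "total": counts.total,
    }
```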
            Provide this or *openai_client*.
        openai_client: An ``AsyncOpenAI`` client with evals API.
        model_deployment: Model deployment name for the evaluator LLM judge.
        evaluators: Evaluator names (e.g. ``["relevance", "tool_call_accuracy"]``).
Question: are all the Foundry evaluators supported by a bare openai client?
Copilot on behalf of alliscode: Great point — updated. FoundryEvals now accepts client: FoundryChatClient instead of openai_client: AsyncOpenAI. Since the builtin.* evaluators require a Foundry endpoint, the type should reflect that. The project_client fallback is retained for the standalone functions (evaluate_traces, evaluate_foundry_target). All samples updated. See 669d67c.
        poll_interval: float = 5.0,
        timeout: float = 180.0,
    ):
        self.name = "Microsoft Foundry"
Does this name matter if the user supplies an openai client?
Copilot on behalf of alliscode: self.name is used in EvalResults.provider to identify which evaluator produced the results (e.g. in multi-evaluator workflows). It's a display/identification name, not functionally significant to the API. It defaults to Microsoft Foundry because the evaluator uses Foundry's built-in evaluators. A user-supplied OpenAI client still talks to a Foundry endpoint for evals, so the name is accurate regardless.
    [],  # no tools expected for simple math
    [],
Why are there two empty lists?
Copilot on behalf of alliscode: They were placeholder empty lists for the two queries. Since no tools are expected, it's cleaner to just omit expected_tool_calls entirely (it's optional). Removed in bb5f546.
- T1: Add comment explaining builtin.* pass-through in _resolve_evaluator
- T2: Add comment referencing OpenAI evals API for testing_criteria dict
- T3: Document Mustache-style {{item.*}} template placeholders
- T4: Document poll loop 60s sleep upper bound rationale
- T5: Narrow run type to RunRetrieveResponse, use typed field access
instead of vars()/getattr dance in _extract_result_counts and
_extract_per_evaluator; use run.error and run.report_url directly
- T6: Clarify openai_client docstring re: Azure Foundry endpoint
- T8: Remove misleading empty expected_tool_calls from sample
- Update tests to match real SDK PerTestingCriteriaResult shape
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RunRetrieveResponse is the correct type — no backward compat needed for a brand new feature. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
FoundryEvals now takes client: FoundryChatClient as its primary parameter instead of openai_client: AsyncOpenAI. The builtin.* evaluators require a Foundry endpoint, so the type should reflect that.
- FoundryEvals.__init__: client: FoundryChatClient replaces openai_client
- evaluate_traces / evaluate_foundry_target: same change
- _resolve_openai_client: extracts .client from FoundryChatClient
- project_client fallback retained for standalone functions
- All samples updated to construct FoundryChatClient and pass as client=
- Tests updated (openai_client= → client=)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
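A hedged sketch of the resulting construction pattern: the class names and the client= parameter follow this commit message, while the FoundryChatClient constructor arguments, the deployment name, and the evaluate() signature are assumptions.

```python
from agent_framework import EvalItem
from agent_framework_azure_ai import FoundryChatClient, FoundryEvals
from azure.ai.projects.aio import AIProjectClient
from azure.identity.aio import AzureCliCredential


async def run_foundry_eval(endpoint: str, items: list[EvalItem]) -> None:
    async with AzureCliCredential() as credential:
        project_client = AIProjectClient(endpoint=endpoint, credential=credential)
        chat_client = FoundryChatClient(project_client=project_client)  # keyword assumed, per an earlier commit
        evals = FoundryEvals(
            client=chat_client,               # primary parameter per this commit
            model_deployment="gpt-4o",        # placeholder judge deployment name
            evaluators=["relevance", "tool_call_accuracy"],
        )
        results = await evals.evaluate(items)  # evaluate() call shape assumed
        print(results)
```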
Add evaluation framework with local and Foundry-hosted evaluator support:
Contribution Checklist