[Evaluation] Additional red team e2e tests #45579
slister1001 wants to merge 4 commits into Azure:main
Conversation
…r handling

- Bug 1 - Status tracking: `_determine_run_status` now treats leftover `pending` and `running` entries as `failed` instead of `in_progress`. By the time this method runs the scan is finished, so stale `pending` entries (from skipped risk categories or Foundry execution failures) indicate failure, not ongoing work.
- Bug 2 - Cache key mismatch: `_execute_attacks_with_foundry` now uses `get_attack_objective_from_risk_category()` to build the cache lookup key, matching the caching logic in `_get_attack_objectives`. Previously, `ungrounded_attributes` objectives were cached under `isa` but looked up under `ungrounded_attributes`, so they were silently skipped.
- Bug 3 - Evaluation error handling: `RAIServiceScorer` now detects when the RAI evaluation service returns an error response (`properties.outcome == 'error'`, e.g. `ServiceInvocationException`) and raises `RuntimeError`. PyRIT then treats the score as UNDETERMINED instead of using the erroneous `passed=False` to incorrectly mark the attack as successful, which was inflating ASR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
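The status-tracking fix (Bug 1) can be sketched as a small reducer. This is a hedged illustration, not the SDK's actual `_determine_run_status` implementation; the function name, status strings, and return values here are assumptions for the sake of the example.

```python
def determine_run_status(entry_statuses):
    """Reduce per-attack statuses to one run status after scan completion.

    Hypothetical sketch: by the time this runs the scan is over, so any
    entry still marked 'pending' or 'running' (e.g. a skipped risk
    category or a Foundry execution failure) is terminal and counts as
    'failed', not as ongoing work.
    """
    # Normalize stale in-flight statuses to failures.
    normalized = [
        "failed" if status in ("pending", "running") else status
        for status in entry_statuses
    ]
    if any(status == "failed" for status in normalized):
        return "completed_with_failures"
    return "completed"
```

Under the old behavior, the leftover `pending` entries would have kept the run reported as `in_progress` forever; normalizing them first makes the final status deterministic.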
Add 10 new E2E tests to `test_red_team_foundry.py` covering:

- `AzureOpenAIModelConfiguration` target via the Foundry path
- Agent-flagged callback targets (`is_agent_target=True`)
- Agent targets with tool context injection (XPIA)
- New `v2025_11_15` risk categories (ProtectedMaterial, CodeVulnerability, TaskAdherence)
- Agent-only risk category (SensitiveDataLeakage) with an agent target
- Error path: agent-only risk category rejected for model targets
- MultiTurn and Crescendo attack strategies via the Foundry path

Also adds a shared `_validate_attack_details()` helper for consistent assertion of attack detail structure across all tests (conversation structure, role alternation, `risk_category`, `attack_technique`).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
`devtools_testutils.config.PROXY_URL` changed from a function to a plain string in a recent update. Remove the `()` call to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds additional end-to-end coverage for RedTeam “Foundry” execution and aligns a few red team internals/outputs with expected contracts and error semantics.
Changes:
- Expanded `test_red_team_foundry.py` with new Foundry e2e scenarios (model-config targets, agent targets, new risk categories, multi-turn strategies, and contract error paths).
- Fixed Foundry baseline objective cache lookup keying to use the same risk-category→objective mapping as the generator path.
- Treated leftover `pending`/`running` statuses as terminal failures when producing the final run status, and surfaced RAI evaluation service "error outcome" as undetermined instead of attack success.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `sdk/evaluation/azure-ai-evaluation/tests/e2etests/test_red_team_foundry.py` | Adds substantial Foundry red team e2e coverage across targets, strategies, and risk categories. |
| `sdk/evaluation/azure-ai-evaluation/tests/conftest.py` | Updates OpenAI/test-proxy routing configuration used by recordings/playback. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py` | Adjusts final run-level status determination semantics after scan completion. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team.py` | Fixes baseline objective cache key mismatch in the Foundry execution path. |
| `sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py` | Detects evaluation-service error outcomes and raises so PyRIT marks results as undetermined. |
| `sdk/evaluation/azure-ai-evaluation/CHANGELOG.md` | Documents the bug fixes included in this PR. |
| """Route requests from the openai package to the test proxy.""" | ||
| config = TestProxyConfig( | ||
| recording_id=get_recording_id(), recording_mode="record" if is_live() else "playback", proxy_url=PROXY_URL() | ||
| recording_id=get_recording_id(), recording_mode="record" if is_live() else "playback", proxy_url=PROXY_URL |
`PROXY_URL` is imported from `devtools_testutils.config` as a function (see `eng/tools/azure-sdk-tools/devtools_testutils/config.py:18-20`). Passing `proxy_url=PROXY_URL` will pass the function object instead of the expected URL string, which will break proxy routing. Call it (`PROXY_URL()`) or import/define a string constant instead.
Suggested change:

```diff
-        recording_id=get_recording_id(), recording_mode="record" if is_live() else "playback", proxy_url=PROXY_URL
+        recording_id=get_recording_id(),
+        recording_mode="record" if is_live() else "playback",
+        proxy_url=PROXY_URL(),
```
- Fixed red team scan status stuck at `in_progress` in results.json despite the scan completing, by treating leftover `pending` entries as `failed`.
- Fixed `ungrounded_attributes` risk category being silently skipped due to a cache key mismatch (`isa` vs `ungrounded_attributes`) in the Foundry execution path.
- Fixed RAI evaluation service errors (`ServiceInvocationException`) incorrectly inflating attack success rate by treating error responses as undetermined instead of attack success.
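The scorer fix in the last changelog entry can be sketched as follows. This is a hedged illustration only: `parse_rai_score` and the exact response shape (`properties.outcome`, `properties.passed`) are hypothetical stand-ins for the real `RAIServiceScorer` logic, but the key move is the same — raise on an error outcome instead of reading `passed=False` as a successful attack.

```python
def parse_rai_score(response: dict) -> bool:
    """Hypothetical sketch: convert an RAI evaluation response to pass/fail.

    If the service itself errored (e.g. ServiceInvocationException), the
    response carries outcome == 'error' and no trustworthy verdict.
    Raising here lets the scoring framework record the result as
    UNDETERMINED rather than counting the attack as successful.
    """
    props = response.get("properties", {})
    if props.get("outcome") == "error":
        raise RuntimeError(
            f"RAI evaluation service error: {props.get('message', 'unknown error')}"
        )
    # Only trust the verdict when the service actually produced one.
    return bool(props.get("passed", False))
```

Without the raise, a service failure would silently yield `passed=False`, which downstream logic interpreted as "the attack got through", inflating the attack success rate (ASR).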
The PR description is currently a placeholder and doesn’t explain the changes or link relevant issues. Please replace it with an informative description so reviewers and release tooling can understand intent and scope.
Description
Please add an informative description that covers the changes made by the pull request and link all relevant issues.
If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines