[Evaluation] Additional red team e2e tests #45579

Open
slister1001 wants to merge 4 commits into Azure:main from slister1001:foundry-e2e-tests

Conversation

@slister1001
Member

Description

Please add an informative description that covers the changes made by the pull request and link all relevant issues.

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

slister1001 and others added 4 commits March 4, 2026 16:05
…r handling

Bug 1 - Status tracking: _determine_run_status now treats 'pending' and
'running' entries as 'failed' instead of 'in_progress'. By the time this
method runs the scan is finished, so leftover 'pending' entries (from
skipped risk categories or Foundry execution failures) indicate failure,
not ongoing work.
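
A minimal sketch of the status rule described in Bug 1. The real `_determine_run_status` signature and data shapes are not shown on this page, so the function name `determine_run_status`, the flat list of status strings, and the `completed` status are assumptions made for illustration.

```python
from typing import Iterable

# Assumption: once the scan has finished, entries still in these states never
# ran to completion, so they are counted as failures rather than ongoing work.
LEFTOVER_STATUSES = frozenset({"pending", "running"})


def determine_run_status(entry_statuses: Iterable[str]) -> str:
    """Return a run-level status for a scan that has already finished.

    Hypothetical helper: 'completed' is an assumed name for the success state.
    """
    for status in entry_statuses:
        normalized = status.lower()
        if normalized == "failed" or normalized in LEFTOVER_STATUSES:
            return "failed"
    return "completed"
```

With this rule a finished scan can no longer report `in_progress`: any leftover `pending` or `running` entry resolves the run to `failed`.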

Bug 2 - Cache key mismatch: _execute_attacks_with_foundry now uses
get_attack_objective_from_risk_category() to build the cache lookup key,
matching the caching logic in _get_attack_objectives. Previously,
ungrounded_attributes objectives were cached under 'isa' but looked up
under 'ungrounded_attributes', causing them to be silently skipped.
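
The mismatch in Bug 2 can be illustrated with a small sketch in which both the write path and the read path derive the cache key through the same mapping function. `objective_key_for` is a hypothetical stand-in for `get_attack_objective_from_risk_category()`; the only mapping taken from the commit message is `ungrounded_attributes` to `isa`, and the real cache key may carry more components than shown here.

```python
from typing import Dict, List, Optional


def objective_key_for(risk_category: str) -> str:
    """Hypothetical stand-in for get_attack_objective_from_risk_category()."""
    # Only this one mapping is stated in the commit message.
    if risk_category == "ungrounded_attributes":
        return "isa"
    return risk_category


# Assumption: a plain dict keyed by objective name stands in for the real cache.
objective_cache: Dict[str, List[str]] = {}


def store_objectives(risk_category: str, objectives: List[str]) -> None:
    """Write path: cache objectives under the mapped key."""
    objective_cache[objective_key_for(risk_category)] = objectives


def load_objectives(risk_category: str) -> Optional[List[str]]:
    """Read path: look up with the same mapped key, so the entry is found."""
    return objective_cache.get(objective_key_for(risk_category))
```

Looking up the raw category name instead of the mapped key returns nothing, which is the silent skip the commit describes.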

Bug 3 - Evaluation error handling: RAIServiceScorer now detects when the
RAI evaluation service returns an error response (properties.outcome ==
'error', e.g. ServiceInvocationException) and raises RuntimeError. This
causes PyRIT to treat the score as UNDETERMINED instead of using the
erroneous passed=False to incorrectly mark the attack as successful,
which was inflating ASR.
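
A hedged sketch of the check described in Bug 3, assuming the service response is available as a nested mapping with a `properties.outcome` field as the commit message states. The helper name `raise_on_error_outcome` is hypothetical, and how PyRIT converts the exception into an UNDETERMINED score is not shown here.

```python
from typing import Any, Mapping


def raise_on_error_outcome(response: Mapping[str, Any]) -> None:
    """Raise RuntimeError when the evaluation response reports an error outcome.

    Hypothetical helper: the dict-shaped response is an assumption; only the
    'properties.outcome == "error"' condition comes from the commit message.
    """
    properties = response.get("properties") or {}
    if properties.get("outcome") == "error":
        # Raising here keeps an error response from being read as a verdict.
        raise RuntimeError(
            f"RAI evaluation service returned an error outcome: {dict(properties)!r}"
        )
```

Calling this before the pass/fail fields are read means an error response is never interpreted as `passed=False`.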

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 10 new E2E tests to test_red_team_foundry.py covering:
- AzureOpenAIModelConfiguration target via Foundry path
- Agent-flagged callback targets (is_agent_target=True)
- Agent targets with tool context injection (XPIA)
- New v2025_11_15 risk categories (ProtectedMaterial, CodeVulnerability, TaskAdherence)
- Agent-only risk category (SensitiveDataLeakage) with agent target
- Error path: agent-only risk category rejected for model targets
- MultiTurn and Crescendo attack strategies via Foundry path
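
The error path in the list above (an agent-only risk category rejected for a model target) could be checked along these lines. This is only a sketch: the function name, the string representation of categories, and the choice of `ValueError` are assumptions, and the only agent-only category named in the commit message is SensitiveDataLeakage.

```python
from typing import Iterable

# Assumption: categories are compared by name; only this one is named as
# agent-only in the commit message.
AGENT_ONLY_RISK_CATEGORIES = frozenset({"SensitiveDataLeakage"})


def check_risk_categories(risk_categories: Iterable[str], is_agent_target: bool) -> None:
    """Reject agent-only risk categories when the target is not an agent.

    Hypothetical helper; the exception type raised by the SDK is not shown
    in this PR page, so ValueError is used purely for illustration.
    """
    if is_agent_target:
        return
    rejected = sorted(set(risk_categories) & AGENT_ONLY_RISK_CATEGORIES)
    if rejected:
        raise ValueError(
            f"Risk categories {rejected} require an agent target (is_agent_target=True)."
        )
```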

Also adds _validate_attack_details() shared helper for consistent
assertion of attack detail structure across all tests (conversation
structure, role alternation, risk_category, attack_technique).
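
A shared assertion helper of the kind described above might look roughly like the following. The actual `_validate_attack_details()` body is not shown in this PR page, so the dictionary keys (`conversation`, `role`, `risk_category`, `attack_technique`) and the exact checks are assumptions.

```python
from typing import Any, Mapping, Sequence


def _require(condition: bool, message: str) -> None:
    """Raise AssertionError explicitly so the check survives 'python -O'."""
    if not condition:
        raise AssertionError(message)


def validate_attack_details(attack_details: Sequence[Mapping[str, Any]]) -> None:
    """Check the assumed structure of each attack detail entry."""
    _require(len(attack_details) > 0, "expected at least one attack detail")
    for index, detail in enumerate(attack_details):
        _require(bool(detail.get("risk_category")), f"[{index}] missing risk_category")
        _require(bool(detail.get("attack_technique")), f"[{index}] missing attack_technique")
        conversation = detail.get("conversation") or []
        _require(len(conversation) > 0, f"[{index}] conversation is empty")
        roles = [turn.get("role") for turn in conversation]
        _require(all(roles), f"[{index}] every turn needs a role")
        # Role alternation: no two consecutive turns share the same role.
        for previous, current in zip(roles, roles[1:]):
            _require(previous != current, f"[{index}] roles do not alternate: {roles}")
```

Centralizing these checks keeps each test body focused on the scenario it sets up rather than on result structure.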

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
devtools_testutils.config.PROXY_URL changed from a function to a
plain string in a recent update. Remove the () call to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 8, 2026 18:25
@slister1001 slister1001 requested a review from a team as a code owner March 8, 2026 18:25
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Mar 8, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds additional end-to-end coverage for RedTeam “Foundry” execution and aligns a few red team internals/outputs with expected contracts and error semantics.

Changes:

  • Expanded test_red_team_foundry.py with new Foundry e2e scenarios (model-config targets, agent targets, new risk categories, multi-turn strategies, and contract error paths).
  • Fixed Foundry baseline objective cache lookup keying to use the same risk-category→objective mapping as the generator path.
  • Treated leftover pending/running statuses as terminal failures when producing final run status, and surfaced RAI evaluation service “error outcome” as undetermined instead of attack success.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| sdk/evaluation/azure-ai-evaluation/tests/e2etests/test_red_team_foundry.py | Adds substantial Foundry red team e2e coverage across targets, strategies, and risk categories. |
| sdk/evaluation/azure-ai-evaluation/tests/conftest.py | Updates OpenAI/test-proxy routing configuration used by recordings/playback. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_result_processor.py | Adjusts final run-level status determination semantics after scan completion. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_red_team.py | Fixes baseline objective cache key mismatch in Foundry execution path. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/red_team/_foundry/_rai_scorer.py | Detects evaluation-service error outcomes and raises so PyRIT marks results as undetermined. |
| sdk/evaluation/azure-ai-evaluation/CHANGELOG.md | Documents the bug fixes included in this PR. |

  """Route requests from the openai package to the test proxy."""
  config = TestProxyConfig(
-     recording_id=get_recording_id(), recording_mode="record" if is_live() else "playback", proxy_url=PROXY_URL()
+     recording_id=get_recording_id(), recording_mode="record" if is_live() else "playback", proxy_url=PROXY_URL

Copilot AI Mar 8, 2026

PROXY_URL is imported from devtools_testutils.config as a function (see eng/tools/azure-sdk-tools/devtools_testutils/config.py:18-20). Passing proxy_url=PROXY_URL will pass the function object instead of the expected URL string, which will break proxy routing. Call it (PROXY_URL()) or import/define a string constant instead.

Suggested change
- recording_id=get_recording_id(), recording_mode="record" if is_live() else "playback", proxy_url=PROXY_URL
+ recording_id=get_recording_id(),
+ recording_mode="record" if is_live() else "playback",
+ proxy_url=PROXY_URL(),
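
The commit message and this review comment disagree on whether `PROXY_URL` is a function or a plain string in the version of `devtools_testutils` being used. One way to stay correct under either definition is a small adapter like the sketch below; `resolve_proxy_url` is a hypothetical helper, not part of `devtools_testutils`.

```python
from typing import Callable, Union


def resolve_proxy_url(proxy_url: Union[str, Callable[[], str]]) -> str:
    """Return the proxy URL whether it is given as a string or a zero-arg callable.

    Hypothetical helper for illustration only.
    """
    return proxy_url() if callable(proxy_url) else proxy_url
```

Under that assumption the call site would pass `proxy_url=resolve_proxy_url(PROXY_URL)`, though checking the actual definition in `config.py` and matching it directly is the simpler fix.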

Comment on lines +7 to +9
- Fixed red team scan status stuck at `in_progress` in results.json despite the scan completing, by treating leftover `pending` entries as `failed`.
- Fixed `ungrounded_attributes` risk category being silently skipped due to a cache key mismatch (`isa` vs `ungrounded_attributes`) in the Foundry execution path.
- Fixed RAI evaluation service errors (`ServiceInvocationException`) incorrectly inflating attack success rate by treating error responses as undetermined instead of attack success.

Copilot AI Mar 8, 2026

The PR description is currently a placeholder and doesn’t explain the changes or link relevant issues. Please replace it with an informative description so reviewers and release tooling can understand intent and scope.

Labels

Evaluation Issues related to the client library for Azure AI Evaluation

2 participants