
[Evaluation] Fix AOAI evaluation to preserve list values instead of stringifying them #45574

Merged
aprilk-ms merged 6 commits into main from fix/aoai-preserve-list-values on Mar 10, 2026
Conversation

@aprilk-ms (Member) commented Mar 8, 2026

Fixes

1. Preserve list values instead of stringifying them in AOAI evaluation

Problem: The _convert_value helper in _get_data_source converted list values to strings via str(), turning [] into the string '[]'. The AOAI API rejected these with:

Error validating file against schema: '[]' is not of type 'array'

Fix: Moved list from the stringify branch to the pass-through branch alongside dict, since both are structured JSON types that should be preserved as native objects for proper JSON serialization.

2. Infer correct schema types for list/dict columns

Problem: Schema generation defaulted all columns/leaves to "string" type in both flat and nested modes, causing schema-data mismatches when data contained lists or dicts.

Fix:

  • Flat mode: Samples rows (skipping None, NaN, pd.NA, NaT, etc.) to emit the correct JSON Schema type (array, object, or string).
  • Nested mode: Added leaf_type_map parameter to _build_schema_tree_from_paths so per-leaf types can be inferred from the DataFrame, overriding the default force_leaf_type="string" for paths whose data contains lists or dicts.

3. Fix pass_threshold propagation and zero-threshold logging

Problem: Three related issues with threshold handling:

  • _build_internal_log_attributes used a truthiness check (if event_data.get("threshold")) which silently dropped threshold=0.
  • _pass_threshold from evaluator config was not propagated into testing criteria metadata.
  • PythonGrader/code evaluators that don't emit a threshold column had no way to get one injected from config.

Fix:

  • Use is not None check so threshold=0 is correctly logged.
  • Propagate _pass_threshold from evaluator_config into testing_criteria_metadata.
  • Inject pass_threshold into metric results when the evaluator doesn't emit one, using _is_none_or_nan to also handle NaN threshold values, without overwriting valid evaluator-provided thresholds.

Tests

  • test_aoai_data_source.py (33 tests pass): Added tests for list/dict value preservation, flat schema type inference, None/NaN/pd.NA null-skipping, nested schema leaf type inference, and leaf_type_map in _build_schema_tree_from_paths; updated 3 existing assertions; added integration test for schema-data alignment.
  • test_evaluate.py (13 new tests pass): TestBuildInternalLogAttributesThreshold (5 tests), TestExtractTestingCriteriaMetadataPassThreshold (3 tests), TestProcessCriteriaMetricsThresholdInjection (5 tests including NaN injection).

The _convert_value helper in _get_data_source was converting list values
to strings via str(), turning [] into '[]'. The AOAI API then rejected
these with 'is not of type array' errors.

Move list from the stringify branch to the pass-through branch alongside
dict, since both are structured JSON types that should be preserved as
native objects for proper serialization.

Update existing test assertions and add a new test for list/dict value
preservation including empty collections.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) on Mar 8, 2026
aprilk-ms and others added 3 commits March 7, 2026 23:40
The flat schema generator in _generate_data_source_config now samples
the first row to emit the correct JSON Schema type (array, object, or
string) instead of defaulting everything to string. This ensures the
schema aligns with the data produced by _convert_value.

Add test for schema type inference and an integration test verifying
schema-data alignment for list/dict columns including empty collections.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use 'is not None' instead of truthiness check in
  _build_internal_log_attributes so threshold=0 is not silently dropped.
- Propagate _pass_threshold from evaluator_config into
  testing_criteria_metadata in _extract_testing_criteria_metadata.
- Inject pass_threshold into metric results in _process_criteria_metrics
  when the evaluator (e.g. PythonGrader) does not emit one, without
  overwriting evaluator-provided thresholds.
- Add 12 unit tests covering all three changes including zero-value
  edge cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The flat schema generator now scans past None and NaN values to find
the first non-null sample for type inference, instead of only checking
iloc[0]. This avoids schema-data mismatches when the first row has
missing values but later rows contain lists or dicts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
aprilk-ms marked this pull request as ready for review March 8, 2026 09:20
aprilk-ms requested a review from a team as a code owner March 8, 2026 09:20
Copilot AI review requested due to automatic review settings March 8, 2026 09:20
Copilot AI (Contributor) left a comment

Pull request overview

Fixes Azure OpenAI (AOAI) evaluation payload/schema mismatches by preserving structured JSON values (lists/dicts) in the datasource payload, inferring correct JSON Schema types in flat schema mode, and improving pass_threshold/threshold propagation and logging.

Changes:

  • Preserve list and dict values in AOAI datasource JSON (stop stringifying them).
  • Infer "array" / "object" types for list/dict columns when generating flat datasource schemas (skipping null-like samples).
  • Ensure threshold=0 is logged, propagate _pass_threshold into testing-criteria metadata, and inject missing thresholds from config into metric results.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py Updates AOAI datasource value conversion and flat schema type inference for list/dict columns.
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py Fixes threshold logging for zero, propagates pass_threshold metadata, and injects thresholds into results when missing.
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_aoai_data_source.py Adds/updates unit tests validating list/dict preservation and flat schema inference behavior.
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py Adds unit tests for threshold logging, pass_threshold propagation, and threshold injection behavior.


- Use _is_none_or_nan for threshold injection check so NaN thresholds
  are also replaced by pass_threshold from config.
- Use pd.isna with guard for list/dict when skipping null sentinels
  (handles pd.NA, NaT, etc. in addition to None and float NaN).
- Infer leaf types in nested schema via leaf_type_map parameter on
  _build_schema_tree_from_paths so nested paths with list/dict data
  get array/object schema types instead of always defaulting to string.
- Add tests for leaf_type_map, nested schema type inference, pd.NA
  handling, and NaN threshold injection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use line-length=120 from eng/black-pyproject.toml config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
aprilk-ms force-pushed the fix/aoai-preserve-list-values branch from 11f0bcb to ef69f45 on March 9, 2026 20:32
aprilk-ms merged commit 6b06163 into main on Mar 10, 2026
21 checks passed
aprilk-ms deleted the fix/aoai-preserve-list-values branch March 10, 2026 02:57