[Evaluation] Fix AOAI evaluation to preserve list values instead of stringifying them #45574
Merged
The _convert_value helper in _get_data_source was converting list values to strings via str(), turning [] into '[]'. The AOAI API then rejected these with 'is not of type array' errors. Move list from the stringify branch to the pass-through branch alongside dict, since both are structured JSON types that should be preserved as native objects for proper serialization. Update existing test assertions and add a new test for list/dict value preservation including empty collections. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
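A rough sketch of the difference this commit describes. The two functions below are hypothetical stand-ins, not the SDK's `_convert_value`; the only assumption carried over from the commit message is that `dict` already passed through and that `list` used to fall into the `str()` branch.

```python
import json


def convert_value_old(value):
    """Hypothetical 'before' behaviour: only dicts pass through, lists get stringified."""
    if isinstance(value, dict):
        return value
    return str(value)


def convert_value_new(value):
    """Hypothetical 'after' behaviour: dicts and lists are both kept as native objects."""
    if isinstance(value, (dict, list)):
        return value
    return str(value)


row = {"tags": [], "meta": {}, "score": 3}

old_payload = json.dumps({k: convert_value_old(v) for k, v in row.items()})
new_payload = json.dumps({k: convert_value_new(v) for k, v in row.items()})

print(old_payload)  # {"tags": "[]", "meta": {}, "score": "3"}
print(new_payload)  # {"tags": [], "meta": {}, "score": "3"}
```

In the old payload `tags` is serialized as a JSON string, which is what a schema expecting an array would reject.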
The flat schema generator in _generate_data_source_config now samples the first row to emit the correct JSON Schema type (array, object, or string) instead of defaulting everything to string. This ensures the schema aligns with the data produced by _convert_value. Add test for schema type inference and an integration test verifying schema-data alignment for list/dict columns including empty collections. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use 'is not None' instead of truthiness check in _build_internal_log_attributes so threshold=0 is not silently dropped.
- Propagate _pass_threshold from evaluator_config into testing_criteria_metadata in _extract_testing_criteria_metadata.
- Inject pass_threshold into metric results in _process_criteria_metrics when the evaluator (e.g. PythonGrader) does not emit one, without overwriting evaluator-provided thresholds.
- Add 12 unit tests covering all three changes including zero-value edge cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
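The zero-threshold and injection behaviour can be illustrated with a minimal sketch. `build_log_attributes` and `inject_pass_threshold` are hypothetical names, and the dictionary shapes are assumptions made for the example, not the SDK's actual structures.

```python
def build_log_attributes(event_data):
    """Copy the threshold into the log attributes, keeping a legitimate 0."""
    attrs = {}
    threshold = event_data.get("threshold")
    # `if threshold:` would treat 0 as missing; `is not None` keeps it.
    if threshold is not None:
        attrs["threshold"] = threshold
    return attrs


def inject_pass_threshold(metrics, configured_threshold):
    """Fill in pass_threshold from config only when the evaluator did not emit one."""
    result = dict(metrics)
    if configured_threshold is not None and result.get("pass_threshold") is None:
        result["pass_threshold"] = configured_threshold
    return result


print(build_log_attributes({"threshold": 0}))
print(inject_pass_threshold({"score": 1}, 0))
print(inject_pass_threshold({"score": 1, "pass_threshold": 3}, 5))
```

The last call leaves the evaluator-provided value of 3 untouched, matching the "without overwriting" requirement in the commit message.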
The flat schema generator now scans past None and NaN values to find the first non-null sample for type inference, instead of only checking iloc[0]. This avoids schema-data mismatches when the first row has missing values but later rows contain lists or dicts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
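A minimal sketch of first-non-null sampling, using plain Python lists in place of DataFrame columns. `infer_json_type` is a hypothetical helper, and the fallback to `"string"` for an all-null column is an assumption of this example rather than something stated in the commit.

```python
import math


def is_null(value):
    """True for None or a float NaN (the two sentinels named in the commit)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_json_type(column_values):
    """Return the JSON Schema type of the first non-null sample in a column."""
    for value in column_values:
        if is_null(value):
            continue  # keep scanning instead of trusting the first row
        if isinstance(value, list):
            return "array"
        if isinstance(value, dict):
            return "object"
        return "string"
    return "string"  # assumption: all-null columns keep the old default


columns = {
    "tags": [None, float("nan"), ["a", "b"]],
    "meta": [{}, None],
    "query": ["what is AOAI?"],
}

schema = {
    "type": "object",
    "properties": {name: {"type": infer_json_type(vals)} for name, vals in columns.items()},
}
print(schema)
```

Checking only the first row of `tags` would have produced `"string"` here, even though the column's real values are lists.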
Pull request overview
Fixes Azure OpenAI (AOAI) evaluation payload/schema mismatches by preserving structured JSON values (lists/dicts) in the datasource payload, inferring correct JSON Schema types in flat schema mode, and improving pass_threshold/threshold propagation and logging.
Changes:
- Preserve `list` and `dict` values in AOAI datasource JSON (stop stringifying them).
- Infer `"array"`/`"object"` types for list/dict columns when generating flat datasource schemas (skipping null-like samples).
- Ensure `threshold=0` is logged, propagate `_pass_threshold` into testing-criteria metadata, and inject missing thresholds from config into metric results.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py | Updates AOAI datasource value conversion and flat schema type inference for list/dict columns. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py | Fixes threshold logging for zero, propagates pass_threshold metadata, and injects thresholds into results when missing. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_aoai_data_source.py | Adds/updates unit tests validating list/dict preservation and flat schema inference behavior. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py | Adds unit tests for threshold logging, pass_threshold propagation, and threshold injection behavior. |
- Use _is_none_or_nan for threshold injection check so NaN thresholds are also replaced by pass_threshold from config.
- Use pd.isna with guard for list/dict when skipping null sentinels (handles pd.NA, NaT, etc. in addition to None and float NaN).
- Infer leaf types in nested schema via leaf_type_map parameter on _build_schema_tree_from_paths so nested paths with list/dict data get array/object schema types instead of always defaulting to string.
- Add tests for leaf_type_map, nested schema type inference, pd.NA handling, and NaN threshold injection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
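To make the `leaf_type_map` idea concrete, here is a small sketch of building a nested schema from dotted paths with per-leaf type overrides. `build_schema_tree` is a hypothetical function written for this example; its signature, the dotted-path format, and the example paths are assumptions and do not reproduce `_build_schema_tree_from_paths`.

```python
def build_schema_tree(paths, leaf_type_map=None, default_leaf_type="string"):
    """Build a nested JSON Schema object from dotted paths.

    Assumes no path is a prefix of another (e.g. "a.b" together with "a.b.c").
    """
    leaf_type_map = leaf_type_map or {}
    root = {"type": "object", "properties": {}}
    for path in paths:
        node = root
        parts = path.split(".")
        # Walk or create the intermediate object nodes.
        for part in parts[:-1]:
            node = node["properties"].setdefault(part, {"type": "object", "properties": {}})
        # The leaf type comes from the map when present, else from the default.
        node["properties"][parts[-1]] = {"type": leaf_type_map.get(path, default_leaf_type)}
    return root


schema = build_schema_tree(
    ["item.query", "item.context.docs", "item.context.meta"],
    leaf_type_map={"item.context.docs": "array", "item.context.meta": "object"},
)
print(schema)
```

Paths missing from the map keep the `"string"` default, so callers only need to list the leaves whose sampled data is a list or dict.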
posaninagendra approved these changes on Mar 8, 2026
Use line-length=120 from eng/black-pyproject.toml config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
11f0bcb to ef69f45
w-javed approved these changes on Mar 10, 2026
Fixes
1. Preserve list values instead of stringifying them in AOAI evaluation

Problem: The `_convert_value` helper in `_get_data_source` converted `list` values to strings via `str()`, turning `[]` into the string `'[]'`. The AOAI API rejected these with 'is not of type array' errors.

Fix: Moved `list` from the stringify branch to the pass-through branch alongside `dict`, since both are structured JSON types that should be preserved as native objects for proper JSON serialization.

2. Infer correct schema types for list/dict columns

Problem: Schema generation defaulted all columns/leaves to `"string"` type in both flat and nested modes, causing schema-data mismatches when data contained lists or dicts.

Fix:
- Flat schema generation now samples the first non-null value (skipping `None`, `NaN`, `pd.NA`, `NaT`, etc.) to emit the correct JSON Schema type (`array`, `object`, or `string`).
- Added a `leaf_type_map` parameter to `_build_schema_tree_from_paths` so per-leaf types can be inferred from the DataFrame, overriding the default `force_leaf_type="string"` for paths whose data contains lists or dicts.

3. Fix pass_threshold propagation and zero-threshold logging

Problem: Three related issues with threshold handling:
- `_build_internal_log_attributes` used a truthiness check (`if event_data.get("threshold")`) which silently dropped `threshold=0`.
- `_pass_threshold` from evaluator config was not propagated into testing criteria metadata.
- Metric results had no `pass_threshold` when the evaluator (e.g. PythonGrader) did not emit one.

Fix:
- Use an `is not None` check so `threshold=0` is correctly logged.
- Propagate `_pass_threshold` from `evaluator_config` into `testing_criteria_metadata`.
- Inject `pass_threshold` into metric results when the evaluator doesn't emit one, using `_is_none_or_nan` to also handle NaN threshold values, without overwriting valid evaluator-provided thresholds.

Tests
- `test_aoai_data_source.py` (33 tests pass): Added tests for list/dict value preservation, flat schema type inference, None/NaN/pd.NA null-skipping, nested schema leaf type inference, and `leaf_type_map` in `_build_schema_tree_from_paths`; updated 3 existing assertions; added integration test for schema-data alignment.
- `test_evaluate.py` (13 new tests pass): `TestBuildInternalLogAttributesThreshold` (5 tests), `TestExtractTestingCriteriaMetadataPassThreshold` (3 tests), `TestProcessCriteriaMetricsThresholdInjection` (5 tests including NaN injection).