
[Evaluation] Fix AOAI evaluation to preserve list values instead of stringifying them #45574

Merged
aprilk-ms merged 6 commits into main from fix/aoai-preserve-list-values on Mar 10, 2026
Conversation

@aprilk-ms (Member) commented Mar 8, 2026

Fixes

1. Preserve list values instead of stringifying them in AOAI evaluation

Problem: The _convert_value helper in _get_data_source converted list values to strings via str(), turning [] into the string '[]'. The AOAI API rejected these with:

Error validating file against schema: '[]' is not of type 'array'

Fix: Moved list from the stringify branch to the pass-through branch alongside dict, since both are structured JSON types that should be preserved as native objects for proper JSON serialization.

2. Infer correct schema types for list/dict columns

Problem: Schema generation defaulted all columns/leaves to "string" type in both flat and nested modes, causing schema-data mismatches when data contained lists or dicts.

Fix:

  • Flat mode: Samples rows (skipping None, NaN, pd.NA, NaT, etc.) to emit the correct JSON Schema type (array, object, or string).
  • Nested mode: Added leaf_type_map parameter to _build_schema_tree_from_paths so per-leaf types can be inferred from the DataFrame, overriding the default force_leaf_type="string" for paths whose data contains lists or dicts.

3. Fix pass_threshold propagation and zero-threshold logging

Problem: Three related issues with threshold handling:

  • _build_internal_log_attributes used a truthiness check (if event_data.get("threshold")) which silently dropped threshold=0.
  • _pass_threshold from evaluator config was not propagated into testing criteria metadata.
  • PythonGrader/code evaluators that don't emit a threshold column had no way to get one injected from config.

Fix:

  • Use is not None check so threshold=0 is correctly logged.
  • Propagate _pass_threshold from evaluator_config into testing_criteria_metadata.
  • Inject pass_threshold into metric results when the evaluator doesn't emit one, using _is_none_or_nan to also handle NaN threshold values, without overwriting valid evaluator-provided thresholds.

Tests

  • test_aoai_data_source.py (33 tests pass): Added tests for list/dict value preservation, flat schema type inference, None/NaN/pd.NA null-skipping, nested schema leaf type inference, and leaf_type_map in _build_schema_tree_from_paths; updated 3 existing assertions; added integration test for schema-data alignment.
  • test_evaluate.py (13 new tests pass): TestBuildInternalLogAttributesThreshold (5 tests), TestExtractTestingCriteriaMetadataPassThreshold (3 tests), TestProcessCriteriaMetricsThresholdInjection (5 tests including NaN injection).

The _convert_value helper in _get_data_source was converting list values
to strings via str(), turning [] into '[]'. The AOAI API then rejected
these with 'is not of type array' errors.

Move list from the stringify branch to the pass-through branch alongside
dict, since both are structured JSON types that should be preserved as
native objects for proper serialization.

Update existing test assertions and add a new test for list/dict value
preservation including empty collections.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) on Mar 8, 2026
aprilk-ms and others added 3 commits March 7, 2026 23:40
The flat schema generator in _generate_data_source_config now samples
the first row to emit the correct JSON Schema type (array, object, or
string) instead of defaulting everything to string. This ensures the
schema aligns with the data produced by _convert_value.

Add test for schema type inference and an integration test verifying
schema-data alignment for list/dict columns including empty collections.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use 'is not None' instead of truthiness check in
  _build_internal_log_attributes so threshold=0 is not silently dropped.
- Propagate _pass_threshold from evaluator_config into
  testing_criteria_metadata in _extract_testing_criteria_metadata.
- Inject pass_threshold into metric results in _process_criteria_metrics
  when the evaluator (e.g. PythonGrader) does not emit one, without
  overwriting evaluator-provided thresholds.
- Add 12 unit tests covering all three changes including zero-value
  edge cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The flat schema generator now scans past None and NaN values to find
the first non-null sample for type inference, instead of only checking
iloc[0]. This avoids schema-data mismatches when the first row has
missing values but later rows contain lists or dicts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
aprilk-ms marked this pull request as ready for review March 8, 2026 09:20
aprilk-ms requested a review from a team as a code owner March 8, 2026 09:20
Copilot AI review requested due to automatic review settings March 8, 2026 09:20
Copilot AI (Contributor) left a comment

Pull request overview

Fixes Azure OpenAI (AOAI) evaluation payload/schema mismatches by preserving structured JSON values (lists/dicts) in the datasource payload, inferring correct JSON Schema types in flat schema mode, and improving pass_threshold/threshold propagation and logging.

Changes:

  • Preserve list and dict values in AOAI datasource JSON (stop stringifying them).
  • Infer "array" / "object" types for list/dict columns when generating flat datasource schemas (skipping null-like samples).
  • Ensure threshold=0 is logged, propagate _pass_threshold into testing-criteria metadata, and inject missing thresholds from config into metric results.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate_aoai.py Updates AOAI datasource value conversion and flat schema type inference for list/dict columns.
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py Fixes threshold logging for zero, propagates pass_threshold metadata, and injects thresholds into results when missing.
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_aoai_data_source.py Adds/updates unit tests validating list/dict preservation and flat schema inference behavior.
sdk/evaluation/azure-ai-evaluation/tests/unittests/test_evaluate.py Adds unit tests for threshold logging, pass_threshold propagation, and threshold injection behavior.


- Use _is_none_or_nan for threshold injection check so NaN thresholds
  are also replaced by pass_threshold from config.
- Use pd.isna with guard for list/dict when skipping null sentinels
  (handles pd.NA, NaT, etc. in addition to None and float NaN).
- Infer leaf types in nested schema via leaf_type_map parameter on
  _build_schema_tree_from_paths so nested paths with list/dict data
  get array/object schema types instead of always defaulting to string.
- Add tests for leaf_type_map, nested schema type inference, pd.NA
  handling, and NaN threshold injection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use line-length=120 from eng/black-pyproject.toml config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
aprilk-ms force-pushed the fix/aoai-preserve-list-values branch from 11f0bcb to ef69f45 on March 9, 2026 20:32
aprilk-ms merged commit 6b06163 into main on Mar 10, 2026
21 checks passed
aprilk-ms deleted the fix/aoai-preserve-list-values branch March 10, 2026 02:57