CodeMode Driven Generation - Executors - Pydantic Monty - Native Serializers - Schema Generator by fswair · Pull Request #10 · vcoderun/vowel

fswair · 2026-03-19T16:44:45Z

CHANGELOG

codemode_driven_generation

This document summarizes the main features added or improved on this branch.

1) Executor and ExecutionSession protocols

The code execution interface was formalized using Protocols.
The Executor async/sync API was standardized:
- execute(...)
- execute_sync(...)
- create_session(...)
ExecutionSession now compiles/executes setup code once and supports multi-snippet feed execution.
This reduces repeated parse/compile overhead while exploring the same function.
The run_sync helper was hardened for running-loop environments via nest-asyncio.
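
A minimal sketch of what such a Protocol-based interface could look like. Only the method names execute, execute_sync, and create_session come from the changelog; the signatures, the ExecutionResult container, the feed method name, and the Toy* classes are assumptions made for illustration, not the branch's actual code.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class ExecutionResult:
    """Hypothetical result container; the real field names may differ."""
    value: Any = None
    error: str | None = None


@runtime_checkable
class ExecutionSession(Protocol):
    async def feed(self, snippet: str) -> ExecutionResult: ...


@runtime_checkable
class Executor(Protocol):
    async def execute(self, code: str) -> ExecutionResult: ...
    def execute_sync(self, code: str) -> ExecutionResult: ...
    async def create_session(self, setup_code: str) -> ExecutionSession: ...


class ToySession:
    """Runs the setup code once, then evaluates each snippet in that namespace."""

    def __init__(self, setup_code: str) -> None:
        self._ns: dict[str, Any] = {}
        exec(compile(setup_code, "<setup>", "exec"), self._ns)

    async def feed(self, snippet: str) -> ExecutionResult:
        try:
            return ExecutionResult(value=eval(snippet, self._ns))
        except Exception as exc:
            return ExecutionResult(error=f"{type(exc).__name__}: {exc}")


class ToyExecutor:
    async def execute(self, code: str) -> ExecutionResult:
        return await (await self.create_session("")).feed(code)

    def execute_sync(self, code: str) -> ExecutionResult:
        # asyncio.run raises inside an already running loop; that is the
        # situation the changelog's run_sync hardening is meant to handle.
        return asyncio.run(self.execute(code))

    async def create_session(self, setup_code: str) -> ToySession:
        return ToySession(setup_code)


async def demo() -> list[Any]:
    session = await ToyExecutor().create_session("def double(x):\n    return x * 2")
    return [(await session.feed(s)).value for s in ("double(2)", "double('ab')")]
```

The setup code is compiled and executed once in ToySession.__init__, and each feed call reuses the resulting namespace, which is the overhead reduction the section describes.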

2) MontyExecutor, DefaultExecutor, MontySession, FallbackSession structures

MontyExecutor was added:
- sandboxed execution via pydantic-monty,
- ResourceLimits support (timeout/memory),
- stdout capture and normalized error typing/messages.
DefaultExecutor was added/improved:
- pure Python exec-based fallback execution,
- last-expression capture (result) and stdout capture.
MontyReplSession (MontySession role) was added:
- one-time setup load, reusable feed-run model.
FallbackSession was added:
- Session-level fallback: if Monty session initialization fails, switch entirely to DefaultSession.
- Snippet-level fallback: if Monty returns ModuleNotFoundError for a snippet, rerun that snippet via fallback executor.
Executor/fallback wiring was simplified through resolve_executors.
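
The snippet-level fallback can be sketched roughly as follows. This is an illustration under assumptions: the runner callables and their names are hypothetical, and the sandbox failure is modeled as a raised ModuleNotFoundError, whereas the real MontyExecutor may report it as a normalized error value instead.

```python
from typing import Callable

Runner = Callable[[str], object]


def run_with_fallback(snippet: str, primary: Runner, fallback: Runner) -> tuple[str, object]:
    """Try the primary runner; rerun only this snippet on ModuleNotFoundError."""
    try:
        return "primary", primary(snippet)
    except ModuleNotFoundError:
        return "fallback", fallback(snippet)


def restricted_runner(snippet: str) -> object:
    # Stand-in for a sandbox that cannot import modules (hypothetical).
    if "import" in snippet:
        raise ModuleNotFoundError("imports are not available in this runner")
    return eval(snippet, {"__builtins__": {}})


def plain_runner(snippet: str) -> object:
    # Stand-in for an exec-based fallback; reads the value bound to `result`.
    namespace: dict[str, object] = {}
    exec(snippet, namespace)
    return namespace.get("result")
```

Other exception types propagate unchanged, so a genuine failure in the snippet is not masked by the fallback.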

3) Main implementation: CodeModeGenerator

Two-phase exploration-guided generation flow:
- Phase 1: behavior exploration (exploration snippets + error snippets)
- Phase 2: spec generation from verified observations
Lazy Agent architecture:
- explorer_agent (ExplorationPlan)
- spec_agent (EvalsSource or EvalsBundle)
Prompt layers were clearly separated:
- exploration prompt: coverage, diversity, duplicate prevention
- spec prompt: expected values from verified outputs only
A refinement loop was added:
- generate -> run -> failure_context -> regenerate
Optional duration injection and a final summary run were added at the end.
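
The generate -> run -> failure_context -> regenerate loop might look like the sketch below. The function names, the round limit, and the toy generate/run callables are assumptions for illustration; CodeModeGenerator's real loop works on eval specs and LLM agents rather than a single expression.

```python
from typing import Callable, Optional


def refine(
    generate: Callable[[Optional[str]], str],
    run: Callable[[str], list],
    max_rounds: int = 3,
) -> tuple:
    """Regenerate until the run reports no failures or the round budget is spent."""
    failure_context: Optional[str] = None
    spec = ""
    for round_no in range(1, max_rounds + 1):
        spec = generate(failure_context)
        failures = run(spec)
        if not failures:
            return spec, round_no
        # Feed the failures back into the next generation attempt.
        failure_context = "\n".join(failures)
    return spec, max_rounds


def add(a, b):
    return a + b


def toy_generate(failure_context):
    # The first attempt is wrong on purpose; the retry reacts to the failure context.
    return "add(1, 2) == 4" if failure_context is None else "add(1, 2) == 3"


def toy_run(spec):
    ok = eval(spec, {"add": add})
    return [] if ok else [f"assertion failed: {spec}"]
```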

4) Runtime hierarchy and utility usage

CodeMode hierarchy:

explore()
generate_spec()
validate_and_fix_spec()
validate_expected_values()
inject_missing_error_cases()
inject_durations() (optional)
validation/refinement with RunEvals

Utilities used:

build_call_code
build_failure_context
validate_and_fix_spec
validate_expected_values
inject_missing_error_cases
inject_durations

5) Cost Manager

Generation/run cost tracking was added for CodeMode.
Features:
- generation_id and run_id lifecycle management,
- step-level usage/cost recording,
- model price resolution (genai-prices or costs.yml),
- atomic/locked JSON persistence,
- generation-level and run-level totals,
- status tracking: running/completed/failed.
The CLI costs command now supports list/by-generation/by-run views.
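
One common way to get atomic JSON persistence is to write to a temporary file in the same directory and then rename it over the target. The sketch below assumes that approach; the record layout, the function names, and the in-process threading.Lock are illustrative and may not match what src/vowel/costs.py does (cross-process locking is not shown).

```python
import json
import os
import tempfile
import threading
from pathlib import Path

_LOCK = threading.Lock()  # guards read-modify-write within one process only


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write to a temp file next to the target, then swap it in with os.replace."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def record_step(path: Path, generation_id: str, step: str, cost_usd: float) -> dict:
    """Append one step to a generation record and refresh its total (hypothetical layout)."""
    with _LOCK:
        data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
        entry = data.setdefault(generation_id, {"status": "running", "steps": [], "total_cost": 0.0})
        entry["steps"].append({"step": step, "cost": cost_usd})
        entry["total_cost"] = sum(item["cost"] for item in entry["steps"])
        write_json_atomic(path, data)
        return entry
```

Because os.replace swaps the file in a single step, a reader sees either the old file or the new one, never a half-written file.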

6) Serializer syntax and YAML-native serializer registry

Top-level serializers registry support was added at EvalsFile level.
Per-eval serializer references are now supported via serializer:.
SerializerSpec was clarified with one-of behavior:
- schema (string or dict)
- serializer (callable import path)
- not both at the same time.
Runtime resolver additions:
- import-path resolution,
- cached imports (_import_path_cached),
- per-eval resolution (_resolve_yaml_serializer_entry).
Precedence between programmatic serializer maps and YAML serializer registry was defined.
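
Import-path resolution with caching, plus the "not both" rule, could be sketched as below. The function names and the error messages are assumptions; this is not the body of _import_path_cached or SerializerSpec, and whether an entry with neither field is accepted is not stated in the changelog, so the sketch rejects only the both-set case.

```python
import importlib
from functools import lru_cache
from typing import Any, Optional


@lru_cache(maxsize=None)
def import_path_cached(path: str) -> Any:
    """Resolve 'package.module.attr' to the object, importing each path only once."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attr', got {path!r}")
    return getattr(importlib.import_module(module_name), attr)


def check_not_both(schema: Optional[Any], serializer: Optional[str]) -> None:
    """Reject entries that set schema and serializer at the same time."""
    if schema is not None and serializer is not None:
        raise ValueError("set either 'schema' or 'serializer', not both")
```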

7) Spec model / Exploration model separation

Model separation in CodeModeGenerator constructor was formalized:
- spec_model
- exploration_model
use_model_spec output mode was clarified:
- use_model_spec=True: structured output mode (schema/model output via EvalsBundle)
- use_model_spec=False: YAML string output mode (via EvalsSource.yaml_spec)
HIGHLY RECOMMENDED TO KEEP use_model_spec=False.
Model resolution order and env fallback logic were added.
Cost tracking now supports separate model usage across separate steps.

8) Adding executor/fallback executor to utilities

Utility flows were updated to accept executor and fallback executor parameters.
Monty -> Default fallback behavior was generalized in execution-aware paths.
Executor behavior was centralized across run_evals and validation stages.

9) YAML schema generator

Runtime-model-driven schema generation was improved:
- supports top-level fixtures + serializers,
- preserves function-level EvalsMapValue behavior.
Schema cache strategy was updated:
- content-hash-based filename (reduces stale editor cache issues).
File header updates are handled safely via materialize_yaml_with_schema_header.
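
A content-hash-based filename can be derived along these lines. The prefix, the digest length, and the choice of SHA-256 are assumptions for illustration; the branch's actual naming scheme is not shown in this changelog.

```python
import hashlib
import json


def schema_cache_name(schema: dict, prefix: str = "vowel-schema") -> str:
    """Name the cache file after a hash of the schema's canonical JSON form."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}.json"
```

Since the name changes whenever the schema content changes, an editor that cached the previous file by path will fetch the new one instead of serving a stale copy.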

10) CLI commands: schema, costs

vowel schema :
- update schema header after YAML + pydantic validation
vowel schema --create [path]:
- direct schema JSON generation
vowel costs:
- --list
- --by-generation
- --by-run
- --generation
- --run

11) module.function -> function alias support

Alias support was added for programmatic mapping resolution:
- function map
- serializer schema map
- serializer function map
Behavior:
- exact match first,
- short-name fallback,
- explicit error for ambiguous reverse short-name mapping.
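
The three resolution rules above can be sketched as follows. The function name and the use of KeyError are assumptions; the branch may structure its lookup differently.

```python
def resolve_name(name: str, mapping: dict) -> object:
    """Exact match first, then a short-name match, with an error on ambiguity."""
    if name in mapping:
        return mapping[name]
    short = name.rsplit(".", 1)[-1]
    candidates = [key for key in mapping if key.rsplit(".", 1)[-1] == short]
    if len(candidates) == 1:
        return mapping[candidates[0]]
    if candidates:
        raise KeyError(f"ambiguous short name {short!r}: matches {sorted(candidates)}")
    raise KeyError(name)
```

For example, with a mapping keyed by "pkg.mod.add", both "pkg.mod.add" and "add" resolve to the same entry, while a mapping that contains both "a.f" and "b.f" makes "f" an error rather than a silent guess.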

12) Feedback-guided exploration

A targeted Round-2 exploration flow was added:
- build cluster summaries from Round-1 results,
- generate snippets focused on uncovered behavior classes.
Duplicate/semantic repetition minimization was reinforced at prompt level.
Distinct failure-mode coverage was improved for error snippets.
Additional rounds now measure value via new-behavior counting.
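
New-behavior counting could be approximated as in the sketch below. The definition of a "behavior class" used here (error type, or the result's type name) is an assumption chosen to keep the example small; docs/FEEDBACK_GUIDED_EXPLORATION.md is the authoritative description of the real clustering.

```python
def behavior_key(result: dict) -> tuple:
    """Map one exploration result to a coarse behavior class (hypothetical rule)."""
    if result.get("error"):
        return ("error", result["error"])
    return ("ok", type(result["value"]).__name__)


def count_new_behaviors(seen: set, results: list) -> int:
    """Count behavior classes not observed in earlier rounds, and remember them."""
    new = {behavior_key(result) for result in results} - seen
    seen.update(new)
    return len(new)
```

A round that returns 0 adds no new behavior classes, which is one way to decide that further exploration is not worth its cost.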

13) Assertion + serializer integration

AssertionEvaluator input context is now serializer-aware.
Assertions now see serialized input for schema, serial_fn, and nested/dict schema modes.
This behavior is covered by regression tests.

14) LLM Judge env-ref improvements

create_llm_judge now supports $ENV_VAR resolution for rubric/model fields.
Missing env refs now produce clearer errors.
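
A sketch of $ENV_VAR resolution with an explicit error for missing variables. The helper name, its parameters, and the error wording are assumptions; create_llm_judge's actual behavior may differ in detail.

```python
import os
from typing import Mapping, Optional


def resolve_env_ref(value: str, field: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the value as-is, or look it up in the environment if it starts with '$'."""
    env = os.environ if environ is None else environ
    if not value.startswith("$"):
        return value
    name = value[1:]
    if name not in env:
        raise ValueError(
            f"{field!r} references ${name}, but that environment variable is not set"
        )
    return env[name]
```

Naming both the field and the variable in the message is what makes a missing reference easy to trace back to the YAML entry that caused it.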

15) Examples, documentation, and test coverage

A runnable native serializer + fixture example was added.
README and serializer docs were updated with serializer/assertion context notes.
Meaningful id fields were added to eval cases under examples.
New/updated tests include:
- test_schema
- test_llm_judge_env_refs
- serializer assertion regressions
- YAML/native serializer parsing tests

16) Fixture scope alias support

Fixture scopes now support clearer canonical names:
- case
- eval
- file
Backward-compatible aliases are still accepted:
- function (alias of case)
- module (alias of eval)
- session (alias of file)
At parse time, canonical names are normalized to legacy internal runtime values:
- case -> function
- eval -> module
- file -> session
This keeps existing runtime lifecycle behavior unchanged while allowing more descriptive scope names in YAML.
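
The normalization described above amounts to a small lookup. The mapping itself is taken from this changelog; the function name and the error message are assumptions.

```python
CANONICAL_TO_LEGACY = {"case": "function", "eval": "module", "file": "session"}
LEGACY_SCOPES = set(CANONICAL_TO_LEGACY.values())


def normalize_scope(scope: str) -> str:
    """Map a canonical scope name to its legacy runtime value; pass legacy names through."""
    if scope in CANONICAL_TO_LEGACY:
        return CANONICAL_TO_LEGACY[scope]
    if scope in LEGACY_SCOPES:
        return scope
    raise ValueError(f"unknown fixture scope: {scope!r}")
```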

Note: The old names will be deprecated after v1.0.0.

Note

This changelog is based on features observed and validated in code on this branch, without using git history.

Removed Python 3.10 from the test matrix.

Copilot

Pull request overview

This PR introduces a major “CodeMode-driven” generation + execution architecture, adds YAML-native serializer registry support, and expands CLI/schema/cost tooling while updating tests, examples, and docs for the new bundle-based loading model.

Changes:

Adds executor/session protocols with Monty + default execution backends, including fallback behavior and parity/session tests.
Introduces YAML-native serializer registry (serializers:) and serializer-aware assertion contexts, plus schema generation/cache + CLI commands.
Adds CodeMode cost tracking (JSON persistence + CLI inspection) and expands documentation/examples to cover new features.

Reviewed changes

Copilot reviewed 76 out of 78 changed files in this pull request and generated 16 comments.


File	Description
VERSION	Bumps version to 0.4.0.
pyproject.toml	Updates package version/extras and tooling excludes; adds uv workspace config.
pyrightconfig.json	Updates type-checker excludes and relaxes one diagnostic.
pytest.ini	Removes pytest discovery configuration (now relying on defaults/pyproject).
src/vowel/__init__.py	Updates public exports (bundle loaders, executors, CodeMode, costs) and triggers schema cache creation on import.
src/vowel/ai.py	Writes generated specs with schema header via schema materialization helper.
src/vowel/cli.py	Adds `vowel schema` and `vowel costs` modes, updates JSON export path, and improves some error handling.
src/vowel/context.py	Docstring update for prompt context module.
src/vowel/costs.py	Adds CodeMode cost tracking, pricing resolution, and atomic persistence.
src/vowel/eval_types.py	Adds fixture scope aliases, serializer registry models, and restricted eval helpers (with fallbacks).
src/vowel/evals.py	Makes assertions serializer-aware; adds restricted assertion/type evaluation; enhances env-ref handling for LLM judge config.
src/vowel/mcp_server.py	Switches YAML validation to bundle loader and updates imports accordingly.
src/vowel/runner.py	Adds code sanitization before exec; introduces EvalsBundle source support and executor preferences wiring.
src/vowel/schema.py	Adds runtime-model-driven YAML JSON schema generation and content-hash cache + header materialization.
src/vowel/tdd.py	Adds executor-aware expected-value validation + shared failure-context builder usage; extends agent opts plumb-through.
src/vowel/validation.py	Switches to safe_dump and adds helpers for failure context, duration injection, expected-value validation, and error-case injection.
tests/test_yaml_loading.py	Migrates tests to bundle loaders and adds serializer-registry parsing coverage.
tests/test_tdd_eval_retries.py	Uses shared `build_failure_context` and updates BAD_YAML assertion scenario.
tests/test_session.py	Adds new session-focused tests (default + optional Monty).
tests/test_executor.py	Adds extensive executor backend/factory/parity coverage.
tests/test_serializer.py	Adds YAML-native serializer registry tests and serializer-aware assertion regression tests.
tests/test_schema.py	Adds schema generation + header materialization tests.
tests/test_run_evals.py	Adds short-name/module.function matching tests and executor preference wiring tests.
tests/test_evaluators.py	Adds regression ensuring restricted-assertion fallback preserves compatibility.
tests/test_cli.py	Adds regression test for `--export-json` output shape.
tests/test_import_function.py	Adds regression ensuring local module import doesn’t mutate `sys.path`.
tests/test_llm_judge.py	Docstring update.
tests/test_llm_judge_env_refs.py	Adds env-ref resolution tests for LLM judge rubric/model.
tests/test_llm_integration.py	Adjusts default model and skip condition to OPENROUTER-only; docstring update.
tests/test_fixtures.py	Adds session-scope fixture lifecycle tests and a loader heuristic regression test.
tests/test_generation.py	Removes old standalone test script.
tests/cassettes/test_generate_spec_string.json	Updates recorded YAML spec fixture.
tests/cassettes/test_generate_spec_simple.json	Updates recorded YAML spec fixture.
tests/cassettes/test_generate_palindrome.json	Updates recorded function generation fixture.
tests/cassettes/test_generate_factorial.json	Updates recorded function generation fixture.
tests/cassettes/test_generate_and_run.json	Updates recorded generate-and-run fixture (including coverage field).
tests/cassettes/llm_judge_custom_model.json	Updates recorded judge model field.
README.md	Updates quick-start links and documents name matching + serializer/assertion behavior + new CLI commands.
docs/README.md	Adds CodeMode doc entry.
docs/CLI.md	Documents new schema and costs commands.
docs/FIXTURES.md	Documents fixture scope aliases and normalization mapping.
docs/YAML_SPEC.md	Updates fixture scope docs to preferred names + aliases.
docs/SERIALIZERS.md	Documents YAML-native serializer registry and serializer-aware assertion context.
docs/CODEMODE.md	Adds CodeMode overview and usage guide.
docs/FEEDBACK_GUIDED_EXPLORATION.md	Adds detailed design doc for feedback-guided exploration rounds.
CHANGELOG.md	Adds branch changelog summary.
CLAUDE.md	Adds agent guidance section on critical thinking.
AGENTS.md	Adds agent guidance section on critical thinking.
costs.yml	Adds model pricing table for cost estimation.
examples/basic_usage/evals.yml	Adds case `id` fields for clarity.
examples/fluent_api/evals.yml	Adds case `id` fields for clarity.
examples/fixtures/evals.yml	Adds case `id` fields for clarity.
examples/evaluators/evals.yml	Adds case `id` fields for clarity.
examples/evals/builtins.yml	Adds case `id` fields for clarity.
examples/evals/math.yml	Adds case `id` fields for clarity.
examples/evals/strings.yml	Adds case `id` fields for clarity.
examples/evals/validation.yml	Adds case `id` fields for clarity.
examples/db_fixtures/db.py	Switches to `enable_monitoring(...)` instead of commented logfire configure.
examples/serializers/__init__.py	Adds package marker for serializer example.
examples/serializers/app.py	Adds example functions under test for serializer + fixture flow.
examples/serializers/defn.py	Adds Query model + serial_fn example for YAML-native registry.
examples/serializers/fixtures.py	Adds sqlite fixture class used by serializer example.
examples/serializers/db_query_evals.yml	Adds runnable YAML-native serializer registry + fixtures example.
db_fixture.yml	Removes old top-level fixture example file.
quality-judge/runner.py	Adds runner script for quality-judge scenario.
quality-judge/evals.py	Adds CodeMode-based generation driver for quality-judge scenario (includes mock path).
quality-judge/largestPathValue_evals.yml	Adds large eval spec fixture for quality-judge scenario.
quality-judge/largest_color_value_judge.yml	Adds judge rubric eval spec for quality-judge scenario.
.github/workflows/tests.yml	Drops Python 3.10 from matrix and installs `[all]` extras in CI.
.gitmodules	Adds `codemode-benchmark` submodule entry.
.gitignore	Ignores benchmark artifacts and an additional links file.
.env.sample	Adds CodeMode model env vars and benchmark model env vars.


tests/test_yaml_loading.py

    def test_invalid_yaml_raises_error(self):
        """Test that invalid YAML raises an error."""
        with pytest.raises(Exception):  # noqa: B017
-            load_evals_from_yaml_string("invalid: [unclosed")
+            load_bundle_from_yaml_string("invalid: [unclosed")
+
+        def test_yaml_with_top_level_serializers(self):
+            """Test loading top-level serializer registry and eval references."""


fswair · 2026-03-19T17:00:14Z

src/vowel/evals.py


 MONTY_AVAILABLE = bool(importlib.util.find_spec("pydantic-monty"))



@copilot apply changes based on this feedback
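
For context on the line under review: importlib.util.find_spec takes an importable module name, and distribution names with hyphens are usually imported with underscores. The sketch below shows the check in isolation. That the import name here is pydantic_monty is an assumption inferred from the title of follow-up PR #11, and is_importable is a hypothetical helper.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the import system can locate a module with this name."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


# Assumed import name; the hyphenated distribution name would not be found.
MONTY_AVAILABLE = is_importable("pydantic_monty")
```

With a hyphenated name the lookup finds nothing, so the flag would stay False even when the package is installed.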

fswair · 2026-03-19T17:01:23Z

src/vowel/evals.py

+                error=str(exc),
+            )
            with suppress(Exception):
                if eval(self.condition, inputs, inputs):
                    return EvaluationReason(


@copilot apply changes based on [this feedback]: restrict builtins without breaking the eval fallback. (#10 (comment))
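
One way to restrict builtins in an eval-based path is to pass an explicit `__builtins__` mapping, as sketched below. The allow-list and the function name are assumptions, and this is not the PR's _eval_assertion_restricted. It narrows what an expression can reach but should not be treated as a security sandbox.

```python
SAFE_BUILTINS = {
    "abs": abs, "all": all, "any": any, "bool": bool, "dict": dict,
    "float": float, "int": int, "isinstance": isinstance, "len": len,
    "list": list, "max": max, "min": min, "set": set, "sorted": sorted,
    "str": str, "sum": sum, "tuple": tuple,
}


def eval_restricted(expression: str, env: dict) -> bool:
    """Evaluate an assertion with an allow-listed set of builtins only."""
    # Put __builtins__ last so a caller-supplied key cannot override it.
    scope = {**env, "__builtins__": SAFE_BUILTINS}
    return bool(eval(expression, scope))
```

Names outside the allow-list, such as `__import__` or `open`, raise NameError instead of executing.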

fswair · 2026-03-19T17:03:36Z

src/vowel/eval_types.py

    def evaluate(self, output: Any) -> bool:
-        return isinstance(output, eval(self.type))
+        try:
+            expected = _eval_type_restricted(self.type)
+        except Exception:
+            expected = eval(self.type)
+        return isinstance(output, expected)


@copilot apply changes based on [this feedback]: I am not sure whether IsInstanceCase.evaluate is still used; you can remove it if nothing references it. (#10 (comment))

fswair · 2026-03-19T17:04:16Z

src/vowel/eval_types.py

    def evaluate(self, input: Any, output: Any) -> bool:
        env = {"input": input, "output": output}
-        return eval(self.assertion, env, env)
+        try:
+            return _eval_assertion_restricted(self.assertion, env)
+        except Exception:
+            return bool(eval(self.assertion, env, env))


@copilot apply changes based on [this feedback]: I am not sure whether AssertionCase.evaluate is still used; you can remove it if nothing references it. (#10 (comment))

quality-judge/largest_color_value_judge.yml

quality-judge/evals.py

+async def generate_spec_mock(fn: Function):
+    return pathlib.Path(
+        "/Users/mert/Desktop/LIP/evalspec/quality-judge/largestPathValue_evals.yml"
+    ).read_text()


fswair · 2026-03-19T17:04:59Z

src/vowel/cli.py

 dotenv.load_dotenv()
 console = Console()
+COSTS_FILE = Path.home() / ".vowel" / "codemode" / "generation_costs.json"
+


@copilot apply changes based on this feedback

.env.sample

src/vowel/validation.py

+    str | None
+):  # TODO: instead of building call code, consider passing arguments through executor inputs


Copilot · 2026-03-19T17:00:23Z

@fswair I've opened a new pull request, #11, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot · 2026-03-19T17:01:32Z

@fswair I've opened a new pull request, #12, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot · 2026-03-19T17:03:44Z

@fswair I've opened a new pull request, #13, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot · 2026-03-19T17:04:24Z

@fswair I've opened a new pull request, #14, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot · 2026-03-19T17:05:07Z

@fswair I've opened a new pull request, #15, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

fswair added 7 commits March 15, 2026 21:27

demo

f5f039c

demo:save

83b84c2

final_commit

60b602f

commit before native serializers

596754e

bump version 0.3.5 -> 0.4.0

bd73afd

Update Python version matrix in tests.yml

49491a5

Removed Python 3.10 from the test matrix.

Change pip install target from 'dev' to 'all'

5d4c6d2

fswair requested a review from Copilot March 19, 2026 16:49

Copilot started reviewing on behalf of fswair March 19, 2026 16:50

Copilot AI reviewed Mar 19, 2026


Copilot AI mentioned this pull request Mar 19, 2026

fix: use correct module name in find_spec for MONTY_AVAILABLE #11

Closed

Copilot AI mentioned this pull request Mar 19, 2026

Restrict builtins in AssertionEvaluator eval fallback path #12

Closed

This was referenced Mar 19, 2026

[WIP] [WIP] Addressing feedback on CodeMode driven generation #13

Closed

[WIP] [WIP] Address feedback on CodeMode Driven Generation - Remove unused AssertionCase.evaluate #14

Closed

Copilot AI mentioned this pull request Mar 19, 2026

[WIP] [WIP] Address feedback on CodeMode Driven Generation - Executors improvements #15

Closed

Potential fix for pull request finding

5677432

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

fswair merged commit 40c0e9e into main Mar 19, 2026
6 checks passed

fswair merged 8 commits into main from codemode_driven_generation


3 participants

