CodeMode Driven Generation - Executors - Pydantic Monty - Native Serializers - Schema Generator #10

Merged
fswair merged 8 commits into main from codemode_driven_generation on Mar 19, 2026

@fswair fswair commented Mar 19, 2026

CHANGELOG

codemode_driven_generation

This document summarizes the main features added or improved on this branch.

1) Executor and ExecutionSession protocols

  • The code execution interface was formalized using Protocols.
  • The Executor async/sync API was standardized:
    • execute(...)
    • execute_sync(...)
    • create_session(...)
  • ExecutionSession now compiles/executes setup code once and supports multi-snippet feed execution.
  • This reduces repeated parse/compile overhead while exploring the same function.
  • The run_sync helper was hardened for running-loop environments via nest-asyncio.
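
A minimal sketch of how such protocols might look, assuming the method names listed above. The signatures, the `feed` method name, and the `InProcess*` classes are illustrative assumptions, not the project's actual definitions.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ExecutionSession(Protocol):
    """A session whose setup code has already run and which accepts snippets."""

    def feed(self, snippet: str) -> Any:  # hypothetical method name
        ...


@runtime_checkable
class Executor(Protocol):
    """Method names follow the changelog; the signatures are assumptions."""

    async def execute(self, code: str) -> Any: ...

    def execute_sync(self, code: str) -> Any: ...

    def create_session(self, setup_code: str) -> ExecutionSession: ...


class InProcessSession:
    """Toy session: runs the setup code once, then evaluates expressions."""

    def __init__(self, setup_code: str) -> None:
        self._namespace: dict[str, Any] = {}
        exec(compile(setup_code, "<setup>", "exec"), self._namespace)

    def feed(self, snippet: str) -> Any:
        # Every snippet reuses the namespace built by the one-time setup.
        return eval(compile(snippet, "<snippet>", "eval"), self._namespace)


class InProcessExecutor:
    """Toy executor that satisfies the Executor protocol structurally."""

    def execute_sync(self, code: str) -> Any:
        return InProcessSession("").feed(code)

    async def execute(self, code: str) -> Any:
        return self.execute_sync(code)

    def create_session(self, setup_code: str) -> InProcessSession:
        return InProcessSession(setup_code)


executor = InProcessExecutor()
session = executor.create_session("def double(x):\n    return x * 2\n")
results = [session.feed("double(2)"), session.feed("double('ab')")]
```

Because the setup code is compiled and executed only in `__init__`, each `feed` call pays only for its own snippet, which is the overhead reduction the bullets describe.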

2) MontyExecutor, DefaultExecutor, MontySession, FallbackSession structures

  • MontyExecutor was added:
    • sandboxed execution via pydantic-monty,
    • ResourceLimits support (timeout/memory),
    • stdout capture and normalized error typing/messages.
  • DefaultExecutor was added/improved:
    • pure Python exec-based fallback execution,
    • last-expression capture (result) and stdout capture.
  • MontyReplSession (MontySession role) was added:
    • one-time setup load, reusable feed-run model.
  • FallbackSession was added:
    • Session-level fallback: if Monty session initialization fails, switch entirely to DefaultSession.
    • Snippet-level fallback: if Monty returns ModuleNotFoundError for a snippet, rerun that snippet via fallback executor.
  • Executor/fallback wiring was simplified through resolve_executors.
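
The exec-based fallback with last-expression and stdout capture can be sketched as follows. This is one common way to implement the idea with the standard library; `run_with_capture` is a hypothetical name and the real DefaultExecutor may differ.

```python
import ast
import contextlib
import io
from typing import Any


def run_with_capture(code: str) -> tuple[Any, str]:
    """Run code; return (value of the trailing expression or None, stdout)."""
    tree = ast.parse(code, mode="exec")
    namespace: dict[str, Any] = {}
    last_expr = None
    # If the final statement is a bare expression, split it off so that it
    # can be evaluated separately and its value returned as the result.
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last_expr = ast.Expression(body=tree.body.pop().value)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(tree, "<snippet>", "exec"), namespace)
        result = None
        if last_expr is not None:
            result = eval(compile(last_expr, "<snippet>", "eval"), namespace)
    return result, buffer.getvalue()


result, stdout = run_with_capture("print('hi')\nx = 20\nx + 1")
```

Plain `exec` offers no isolation, which is presumably why it serves as the fallback rather than the primary backend.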

3) Main implementation: CodeModeGenerator

  • Two-phase exploration-guided generation flow:
    • Phase 1: behavior exploration (exploration snippets + error snippets)
    • Phase 2: spec generation from verified observations
  • Lazy Agent architecture:
    • explorer_agent (ExplorationPlan)
    • spec_agent (EvalsSource or EvalsBundle)
  • Prompt layers were clearly separated:
    • exploration prompt: coverage, diversity, duplicate prevention
    • spec prompt: expected values from verified outputs only
  • A refinement loop was added:
    • generate -> run -> failure_context -> regenerate
  • Optional duration injection and a final summary run were added at the end.
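
The generate -> run -> failure_context -> regenerate loop can be sketched as below. The `refine` helper and the two stand-in callables are hypothetical; in the real flow the generator is an LLM agent and the runner is the eval run.

```python
from typing import Callable, Optional


def refine(
    generate: Callable[[Optional[str]], str],
    run: Callable[[str], list[str]],
    max_rounds: int = 3,
) -> tuple[str, list[str]]:
    """Regenerate a spec until `run` reports no failures or rounds run out."""
    failure_context: Optional[str] = None
    spec, failures = "", []
    for _ in range(max_rounds):
        spec = generate(failure_context)
        failures = run(spec)
        if not failures:
            break
        # Feed the observed failures back into the next generation attempt.
        failure_context = "\n".join(failures)
    return spec, failures


# Stand-ins for the spec agent and the eval run (illustrative only).
def fake_generate(context: Optional[str]) -> str:
    return "expected: 4" if context else "expected: 5"


def fake_run(spec: str) -> list[str]:
    return [] if spec == "expected: 4" else ["case add_2_2: expected 5, got 4"]


spec, failures = refine(fake_generate, fake_run)
```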

4) Runtime hierarchy and utility usage

CodeMode hierarchy:

  1. explore()
  2. generate_spec()
  3. validate_and_fix_spec()
  4. validate_expected_values()
  5. inject_missing_error_cases()
  6. inject_durations() (optional)
  7. validation/refinement with RunEvals

Utilities used:

  • build_call_code
  • build_failure_context
  • validate_and_fix_spec
  • validate_expected_values
  • inject_missing_error_cases
  • inject_durations

5) Cost Manager

  • Generation/run cost tracking was added for CodeMode.
  • Features:
    • generation_id and run_id lifecycle management,
    • step-level usage/cost recording,
    • model price resolution (genai-prices or costs.yml),
    • atomic/locked JSON persistence,
    • generation-level and run-level totals,
    • status tracking: running/completed/failed.
  • The CLI costs command now supports list/by-generation/by-run views.
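
A minimal sketch of atomic JSON persistence with per-generation totals, assuming a simple record layout. The function names and JSON shape are hypothetical, and the locking the changelog mentions is omitted here.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so that readers never observe a partially written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def record_step(path: Path, generation_id: str, step: str, cost_usd: float) -> dict:
    """Append one step-level cost and refresh the generation-level total."""
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    entry = data.setdefault(generation_id, {"status": "running", "steps": []})
    entry["steps"].append({"step": step, "cost_usd": cost_usd})
    entry["total_usd"] = sum(s["cost_usd"] for s in entry["steps"])
    write_json_atomic(path, data)
    return data


with tempfile.TemporaryDirectory() as tmp:
    costs_path = Path(tmp) / "generation_costs.json"
    record_step(costs_path, "gen-1", "explore", 0.25)
    snapshot = record_step(costs_path, "gen-1", "generate_spec", 0.5)
```

Writing to a temporary file in the same directory and renaming it keeps the rename on one filesystem, which is what makes `os.replace` atomic.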

6) Serializer syntax and YAML-native serializer registry

  • Top-level serializers registry support was added at EvalsFile level.
  • Per-eval serializer references are now supported via serializer:.
  • SerializerSpec was clarified with one-of behavior:
    • schema (string or dict)
    • serializer (callable import path)
    • not both at the same time.
  • Runtime resolver additions:
    • import-path resolution,
    • cached imports (_import_path_cached),
    • per-eval resolution (_resolve_yaml_serializer_entry).
  • Precedence between programmatic serializer maps and YAML serializer registry was defined.
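
The "not both" rule and the cached import-path resolution can be sketched with the standard library alone. `SerializerSpecSketch` and `import_path_cached` are illustrative stand-ins for the project's SerializerSpec and _import_path_cached, whose real definitions are not shown here.

```python
import importlib
from dataclasses import dataclass
from functools import lru_cache
from typing import Any, Optional, Union


@dataclass(frozen=True)
class SerializerSpecSketch:
    """Holds either a schema or a callable import path, never both."""

    schema: Optional[Union[str, dict]] = None
    serializer: Optional[str] = None

    def __post_init__(self) -> None:
        if self.schema is not None and self.serializer is not None:
            raise ValueError("set 'schema' or 'serializer', not both")


@lru_cache(maxsize=None)
def import_path_cached(path: str) -> Any:
    """Resolve 'package.module.attr' to the attribute, caching the result."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"not a dotted import path: {path!r}")
    return getattr(importlib.import_module(module_name), attr)


spec = SerializerSpecSketch(serializer="json.loads")
loads = import_path_cached(spec.serializer)
parsed = loads('{"a": 1}')
```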

7) Spec model / Exploration model separation

  • Model separation in CodeModeGenerator constructor was formalized:
    • spec_model
    • exploration_model
  • use_model_spec output mode was clarified:
    • use_model_spec=True: structured output mode (schema/model output via EvalsBundle)
    • use_model_spec=False: YAML string output mode (via EvalsSource.yaml_spec)
  • Keeping use_model_spec=False is highly recommended.
  • Model resolution order and env fallback logic were added.
  • Cost tracking now supports separate model usage across separate steps.
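
One plausible resolution order is sketched below: explicit argument first, then environment variables in priority order. The changelog does not state the actual order, and the environment variable names used here are hypothetical.

```python
from typing import Mapping, Optional, Sequence


def resolve_model(
    explicit: Optional[str],
    env_keys: Sequence[str],
    env: Mapping[str, str],
) -> str:
    """Return the explicit model if given, else the first env value that is set."""
    if explicit:
        return explicit
    for key in env_keys:
        value = env.get(key)
        if value:
            return value
    raise ValueError(f"no model configured; set one of: {', '.join(env_keys)}")


# Hypothetical variable names; pass os.environ instead of a dict in real use.
fake_env = {"CODEMODE_MODEL": "provider:shared-model"}
spec_model = resolve_model(None, ["CODEMODE_SPEC_MODEL", "CODEMODE_MODEL"], fake_env)
exploration_model = resolve_model(
    "provider:explorer", ["CODEMODE_EXPLORATION_MODEL", "CODEMODE_MODEL"], fake_env
)
```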

8) Adding executor/fallback executor to utilities

  • Utility flows were updated to accept executor and fallback executor parameters.
  • Monty -> Default fallback behavior was generalized in execution-aware paths.
  • Executor behavior was centralized across run_evals and validation stages.
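
Snippet-level fallback can be sketched as a small wrapper. It assumes the primary backend signals a missing module by raising ModuleNotFoundError (the real executor may report it as a normalized error instead), and both backends here are toy stand-ins.

```python
from typing import Any, Callable


def run_with_fallback(
    snippet: str,
    primary: Callable[[str], Any],
    fallback: Callable[[str], Any],
) -> tuple[Any, str]:
    """Try the primary backend; rerun on the fallback if a module is missing."""
    try:
        return primary(snippet), "primary"
    except ModuleNotFoundError:
        return fallback(snippet), "fallback"


def sandbox_like(snippet: str) -> Any:
    # Stand-in for a sandbox that cannot import modules.
    if "import" in snippet:
        raise ModuleNotFoundError("No module named 'math'")
    return eval(snippet, {"__builtins__": {}})


def plain_python(snippet: str) -> Any:
    # Stand-in for the unrestricted default backend.
    return eval(snippet, {})


first = run_with_fallback("1 + 2", sandbox_like, plain_python)
second = run_with_fallback("__import__('math').sqrt(16)", sandbox_like, plain_python)
```

Only ModuleNotFoundError triggers the rerun, so genuine failures in the snippet still surface from the primary backend.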

9) YAML schema generator

  • Runtime-model-driven schema generation was improved:
    • supports top-level fixtures + serializers,
    • preserves function-level EvalsMapValue behavior.
  • Schema cache strategy was updated:
    • content-hash-based filename (reduces stale editor cache issues).
  • File header updates are handled safely via materialize_yaml_with_schema_header.
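
A content-hash-based filename can be derived as sketched below. The prefix, digest length, and function name are assumptions; the point is only that the name changes whenever the schema content changes, so an editor cannot keep serving a stale copy under the same name.

```python
import hashlib
import json


def schema_cache_filename(schema: dict, prefix: str = "vowel-schema") -> str:
    """Build a filename that depends only on the schema's content."""
    # Canonical serialization: key order must not influence the hash.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}.json"


name_a = schema_cache_filename({"type": "object", "title": "Evals"})
name_b = schema_cache_filename({"title": "Evals", "type": "object"})
name_c = schema_cache_filename({"type": "object", "title": "Evals v2"})
```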

10) CLI commands: schema, costs

  • vowel schema :
    • update schema header after YAML + pydantic validation
  • vowel schema --create [path]:
    • direct schema JSON generation
  • vowel costs:
    • --list
    • --by-generation
    • --by-run
    • --generation
    • --run

11) module.function -> function alias support

  • Alias support was added for programmatic mapping resolution:
    • function map
    • serializer schema map
    • serializer function map
  • Behavior:
    • exact match first,
    • short-name fallback,
    • explicit error for ambiguous reverse short-name mapping.
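
The lookup order described above can be sketched as follows. `resolve_name` is a hypothetical helper that works on a plain dict; the project's resolver may differ in details.

```python
from typing import Any, Mapping


def resolve_name(name: str, mapping: Mapping[str, Any]) -> Any:
    """Exact match first, then short-name fallback in either direction."""
    if name in mapping:
        return mapping[name]
    short = name.rsplit(".", 1)[-1]
    if short in mapping:  # 'module.function' requested, 'function' registered
        return mapping[short]
    # Reverse direction: 'function' requested, 'module.function' registered.
    candidates = [key for key in mapping if key.rsplit(".", 1)[-1] == name]
    if len(candidates) == 1:
        return mapping[candidates[0]]
    if len(candidates) > 1:
        raise ValueError(f"ambiguous short name {name!r}: {sorted(candidates)}")
    raise KeyError(name)


registry = {"math_utils.add": 1, "strings.reverse": 2, "other.reverse": 3, "clamp": 4}
exact = resolve_name("math_utils.add", registry)
from_qualified = resolve_name("pkg.clamp", registry)
from_short = resolve_name("add", registry)
```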

12) Feedback-guided exploration

  • A targeted Round-2 exploration flow was added:
    • build cluster summaries from Round-1 results,
    • generate snippets focused on uncovered behavior classes.
  • Duplicate/semantic repetition minimization was reinforced at prompt level.
  • Distinct failure-mode coverage was improved for error snippets.
  • Additional rounds now measure value via new-behavior counting.
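
New-behavior counting can be sketched by reducing each observation to a coarse signature and comparing rounds. The signature used here (outcome kind plus result type or error name) and the record layout are assumptions chosen for illustration.

```python
from typing import Any, Iterable


def behavior_signature(outcome: dict[str, Any]) -> tuple[str, str]:
    """Collapse one observation into a coarse behavior class."""
    if outcome.get("error"):
        return ("error", outcome["error"])
    return ("ok", type(outcome["value"]).__name__)


def count_new_behaviors(
    previous: Iterable[dict[str, Any]],
    current: Iterable[dict[str, Any]],
) -> int:
    """Count behavior classes seen in this round but in no earlier one."""
    seen = {behavior_signature(o) for o in previous}
    return len({behavior_signature(o) for o in current} - seen)


round_1 = [{"value": 1}, {"value": 2}, {"error": "ZeroDivisionError"}]
round_2 = [
    {"value": 3},
    {"value": "x"},
    {"error": "TypeError"},
    {"error": "ZeroDivisionError"},
]
new_in_round_2 = count_new_behaviors(round_1, round_2)
```

A round that yields zero new signatures adds no coverage under this measure, which gives a natural stopping condition.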

13) Assertion + serializer integration

  • AssertionEvaluator input context is now serializer-aware.
  • Assertions now see serialized input for schema, serial_fn, and nested/dict schema modes.
  • This behavior is covered by regression tests.

14) LLM Judge env-ref improvements

  • create_llm_judge now supports $ENV_VAR resolution for rubric/model fields.
  • Missing env refs now produce clearer errors.
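
A minimal sketch of $ENV_VAR resolution, assuming a reference must be the whole field value. `resolve_env_ref` is a hypothetical helper and does not reproduce the create_llm_judge signature.

```python
import re
from typing import Mapping

_ENV_REF = re.compile(r"^\$([A-Za-z_][A-Za-z0-9_]*)$")


def resolve_env_ref(value: str, field: str, env: Mapping[str, str]) -> str:
    """Return env[NAME] for a '$NAME' value; pass literal values through."""
    match = _ENV_REF.match(value)
    if match is None:
        return value
    name = match.group(1)
    if name not in env:
        # Name the field and the variable so the error is actionable.
        raise ValueError(f"{field}: environment variable {name!r} is not set")
    return env[name]


# Pass os.environ instead of a dict in real use.
fake_env = {"JUDGE_MODEL": "provider:judge-model"}
model = resolve_env_ref("$JUDGE_MODEL", "model", fake_env)
rubric = resolve_env_ref("Answer must be polite.", "rubric", fake_env)
```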

15) Examples, documentation, and test coverage

  • A runnable native serializer + fixture example was added.
  • README and serializer docs were updated with serializer/assertion context notes.
  • Meaningful id fields were added to eval cases under examples.
  • New/updated tests include:
    • test_schema
    • test_llm_judge_env_refs
    • serializer assertion regressions
    • YAML/native serializer parsing tests

16) Fixture scope alias support

  • Fixture scopes now support clearer canonical names:
    • case
    • eval
    • file
  • Backward-compatible aliases are still accepted:
    • function (alias of case)
    • module (alias of eval)
    • session (alias of file)
  • At parse time, canonical names are normalized to legacy internal runtime values:
    • case -> function
    • eval -> module
    • file -> session
  • This keeps existing runtime lifecycle behavior unchanged while allowing more descriptive scope names in YAML.
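
The parse-time normalization can be sketched as a small lookup. The mapping comes from the list above; the function name and error message are illustrative.

```python
# Canonical YAML names mapped to the legacy runtime values listed above.
_CANONICAL_TO_RUNTIME = {"case": "function", "eval": "module", "file": "session"}
_RUNTIME_SCOPES = set(_CANONICAL_TO_RUNTIME.values())


def normalize_scope(scope: str) -> str:
    """Map a canonical or legacy scope name to its internal runtime value."""
    if scope in _CANONICAL_TO_RUNTIME:
        return _CANONICAL_TO_RUNTIME[scope]
    if scope in _RUNTIME_SCOPES:  # legacy aliases pass through unchanged
        return scope
    raise ValueError(f"unknown fixture scope: {scope!r}")


normalized = [normalize_scope(s) for s in ("case", "eval", "file", "module")]
```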

Note: The old names will be deprecated after v1.0.0.

Note

This changelog is based on features observed and validated in code on this branch, without using git history.

Copilot AI left a comment

Pull request overview

This PR introduces a major “CodeMode-driven” generation + execution architecture, adds YAML-native serializer registry support, and expands CLI/schema/cost tooling while updating tests, examples, and docs for the new bundle-based loading model.

Changes:

  • Adds executor/session protocols with Monty + default execution backends, including fallback behavior and parity/session tests.
  • Introduces YAML-native serializer registry (serializers:) and serializer-aware assertion contexts, plus schema generation/cache + CLI commands.
  • Adds CodeMode cost tracking (JSON persistence + CLI inspection) and expands documentation/examples to cover new features.

Reviewed changes

Copilot reviewed 76 out of 78 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
VERSION Bumps version to 0.4.0.
pyproject.toml Updates package version/extras and tooling excludes; adds uv workspace config.
pyrightconfig.json Updates type-checker excludes and relaxes one diagnostic.
pytest.ini Removes pytest discovery configuration (now relying on defaults/pyproject).
src/vowel/__init__.py Updates public exports (bundle loaders, executors, CodeMode, costs) and triggers schema cache creation on import.
src/vowel/ai.py Writes generated specs with schema header via schema materialization helper.
src/vowel/cli.py Adds vowel schema and vowel costs modes, updates JSON export path, and improves some error handling.
src/vowel/context.py Docstring update for prompt context module.
src/vowel/costs.py Adds CodeMode cost tracking, pricing resolution, and atomic persistence.
src/vowel/eval_types.py Adds fixture scope aliases, serializer registry models, and restricted eval helpers (with fallbacks).
src/vowel/evals.py Makes assertions serializer-aware; adds restricted assertion/type evaluation; enhances env-ref handling for LLM judge config.
src/vowel/mcp_server.py Switches YAML validation to bundle loader and updates imports accordingly.
src/vowel/runner.py Adds code sanitization before exec; introduces EvalsBundle source support and executor preferences wiring.
src/vowel/schema.py Adds runtime-model-driven YAML JSON schema generation and content-hash cache + header materialization.
src/vowel/tdd.py Adds executor-aware expected-value validation + shared failure-context builder usage; extends agent opts plumb-through.
src/vowel/validation.py Switches to safe_dump and adds helpers for failure context, duration injection, expected-value validation, and error-case injection.
tests/test_yaml_loading.py Migrates tests to bundle loaders and adds serializer-registry parsing coverage.
tests/test_tdd_eval_retries.py Uses shared build_failure_context and updates BAD_YAML assertion scenario.
tests/test_session.py Adds new session-focused tests (default + optional Monty).
tests/test_executor.py Adds extensive executor backend/factory/parity coverage.
tests/test_serializer.py Adds YAML-native serializer registry tests and serializer-aware assertion regression tests.
tests/test_schema.py Adds schema generation + header materialization tests.
tests/test_run_evals.py Adds short-name/module.function matching tests and executor preference wiring tests.
tests/test_evaluators.py Adds regression ensuring restricted-assertion fallback preserves compatibility.
tests/test_cli.py Adds regression test for --export-json output shape.
tests/test_import_function.py Adds regression ensuring local module import doesn’t mutate sys.path.
tests/test_llm_judge.py Docstring update.
tests/test_llm_judge_env_refs.py Adds env-ref resolution tests for LLM judge rubric/model.
tests/test_llm_integration.py Adjusts default model and skip condition to OPENROUTER-only; docstring update.
tests/test_fixtures.py Adds session-scope fixture lifecycle tests and a loader heuristic regression test.
tests/test_generation.py Removes old standalone test script.
tests/cassettes/test_generate_spec_string.json Updates recorded YAML spec fixture.
tests/cassettes/test_generate_spec_simple.json Updates recorded YAML spec fixture.
tests/cassettes/test_generate_palindrome.json Updates recorded function generation fixture.
tests/cassettes/test_generate_factorial.json Updates recorded function generation fixture.
tests/cassettes/test_generate_and_run.json Updates recorded generate-and-run fixture (including coverage field).
tests/cassettes/llm_judge_custom_model.json Updates recorded judge model field.
README.md Updates quick-start links and documents name matching + serializer/assertion behavior + new CLI commands.
docs/README.md Adds CodeMode doc entry.
docs/CLI.md Documents new schema and costs commands.
docs/FIXTURES.md Documents fixture scope aliases and normalization mapping.
docs/YAML_SPEC.md Updates fixture scope docs to preferred names + aliases.
docs/SERIALIZERS.md Documents YAML-native serializer registry and serializer-aware assertion context.
docs/CODEMODE.md Adds CodeMode overview and usage guide.
docs/FEEDBACK_GUIDED_EXPLORATION.md Adds detailed design doc for feedback-guided exploration rounds.
CHANGELOG.md Adds branch changelog summary.
CLAUDE.md Adds agent guidance section on critical thinking.
AGENTS.md Adds agent guidance section on critical thinking.
costs.yml Adds model pricing table for cost estimation.
examples/basic_usage/evals.yml Adds case id fields for clarity.
examples/fluent_api/evals.yml Adds case id fields for clarity.
examples/fixtures/evals.yml Adds case id fields for clarity.
examples/evaluators/evals.yml Adds case id fields for clarity.
examples/evals/builtins.yml Adds case id fields for clarity.
examples/evals/math.yml Adds case id fields for clarity.
examples/evals/strings.yml Adds case id fields for clarity.
examples/evals/validation.yml Adds case id fields for clarity.
examples/db_fixtures/db.py Switches to enable_monitoring(...) instead of commented logfire configure.
examples/serializers/__init__.py Adds package marker for serializer example.
examples/serializers/app.py Adds example functions under test for serializer + fixture flow.
examples/serializers/defn.py Adds Query model + serial_fn example for YAML-native registry.
examples/serializers/fixtures.py Adds sqlite fixture class used by serializer example.
examples/serializers/db_query_evals.yml Adds runnable YAML-native serializer registry + fixtures example.
db_fixture.yml Removes old top-level fixture example file.
quality-judge/runner.py Adds runner script for quality-judge scenario.
quality-judge/evals.py Adds CodeMode-based generation driver for quality-judge scenario (includes mock path).
quality-judge/largestPathValue_evals.yml Adds large eval spec fixture for quality-judge scenario.
quality-judge/largest_color_value_judge.yml Adds judge rubric eval spec for quality-judge scenario.
.github/workflows/tests.yml Drops Python 3.10 from matrix and installs [all] extras in CI.
.gitmodules Adds codemode-benchmark submodule entry.
.gitignore Ignores benchmark artifacts and an additional links file.
.env.sample Adds CodeMode model env vars and benchmark model env vars.

Comment on lines 54 to +60
     def test_invalid_yaml_raises_error(self):
         """Test that invalid YAML raises an error."""
         with pytest.raises(Exception):  # noqa: B017
-            load_evals_from_yaml_string("invalid: [unclosed")
+            load_bundle_from_yaml_string("invalid: [unclosed")

+    def test_yaml_with_top_level_serializers(self):
+        """Test loading top-level serializer registry and eval references."""

Comment on lines 21 to 23

MONTY_AVAILABLE = bool(importlib.util.find_spec("pydantic-monty"))
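
For context on this line: importlib.util.find_spec expects an importable module name, and a hyphenated distribution name such as "pydantic-monty" would normally not match any module, so the flag would stay False. A hedged sketch of a safer check, assuming the package's import name is pydantic_monty:

```python
import importlib.util


def module_available(import_name: str) -> bool:
    """Return True if the module can be located without importing it."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except (ImportError, ValueError):
        return False


# Assumption: the import name uses an underscore, unlike the distribution name.
MONTY_AVAILABLE = module_available("pydantic_monty")
```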

Collaborator Author

@copilot apply changes based on this feedback

Comment on lines +271 to 275
                error=str(exc),
            )
        with suppress(Exception):
            if eval(self.condition, inputs, inputs):
                return EvaluationReason(

Collaborator Author

@copilot apply changes based on [this feedback]: restricting builtins must not break the eval fallback. (#10 (comment))

Comment on lines 303 to +308
     def evaluate(self, output: Any) -> bool:
-        return isinstance(output, eval(self.type))
+        try:
+            expected = _eval_type_restricted(self.type)
+        except Exception:
+            expected = eval(self.type)
+        return isinstance(output, expected)

Collaborator Author

@copilot apply changes based on [this feedback]: I am not sure whether we still use IsInstanceCase.evaluate; you can remove it if it has no remaining references. (#10 (comment))

Comment on lines 361 to +366
     def evaluate(self, input: Any, output: Any) -> bool:
         env = {"input": input, "output": output}
-        return eval(self.assertion, env, env)
+        try:
+            return _eval_assertion_restricted(self.assertion, env)
+        except Exception:
+            return bool(eval(self.assertion, env, env))

Collaborator Author

@copilot apply changes based on [this feedback]: I am not sure whether we still use AssertionCase.evaluate; you can remove it if it has no remaining references. (#10 (comment))

Comment on lines +30 to +33
async def generate_spec_mock(fn: Function):
    return pathlib.Path(
        "/Users/mert/Desktop/LIP/evalspec/quality-judge/largestPathValue_evals.yml"
    ).read_text()

Comment on lines 28 to +31
dotenv.load_dotenv()
console = Console()
COSTS_FILE = Path.home() / ".vowel" / "codemode" / "generation_costs.json"

Collaborator Author

@copilot apply changes based on this feedback

Comment on lines +420 to +421
str | None
): # TODO: instead of building call code, consider passing arguments through executor inputs

Copilot AI commented Mar 19, 2026

@fswair I've opened a new pull request, #11, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI commented Mar 19, 2026

@fswair I've opened a new pull request, #12, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI commented Mar 19, 2026

@fswair I've opened a new pull request, #13, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI commented Mar 19, 2026

@fswair I've opened a new pull request, #14, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI commented Mar 19, 2026

@fswair I've opened a new pull request, #15, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@fswair fswair merged commit 40c0e9e into main Mar 19, 2026
6 checks passed