Removed Python 3.10 from the test matrix.
Pull request overview
This PR introduces a major “CodeMode-driven” generation + execution architecture, adds YAML-native serializer registry support, and expands CLI/schema/cost tooling while updating tests, examples, and docs for the new bundle-based loading model.
Changes:
- Adds executor/session protocols with Monty + default execution backends, including fallback behavior and parity/session tests.
- Introduces YAML-native serializer registry (serializers:) and serializer-aware assertion contexts, plus schema generation/cache + CLI commands.
- Adds CodeMode cost tracking (JSON persistence + CLI inspection) and expands documentation/examples to cover new features.
Reviewed changes
Copilot reviewed 76 out of 78 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| VERSION | Bumps version to 0.4.0. |
| pyproject.toml | Updates package version/extras and tooling excludes; adds uv workspace config. |
| pyrightconfig.json | Updates type-checker excludes and relaxes one diagnostic. |
| pytest.ini | Removes pytest discovery configuration (now relying on defaults/pyproject). |
| src/vowel/__init__.py | Updates public exports (bundle loaders, executors, CodeMode, costs) and triggers schema cache creation on import. |
| src/vowel/ai.py | Writes generated specs with schema header via schema materialization helper. |
| src/vowel/cli.py | Adds vowel schema and vowel costs modes, updates JSON export path, and improves some error handling. |
| src/vowel/context.py | Docstring update for prompt context module. |
| src/vowel/costs.py | Adds CodeMode cost tracking, pricing resolution, and atomic persistence. |
| src/vowel/eval_types.py | Adds fixture scope aliases, serializer registry models, and restricted eval helpers (with fallbacks). |
| src/vowel/evals.py | Makes assertions serializer-aware; adds restricted assertion/type evaluation; enhances env-ref handling for LLM judge config. |
| src/vowel/mcp_server.py | Switches YAML validation to bundle loader and updates imports accordingly. |
| src/vowel/runner.py | Adds code sanitization before exec; introduces EvalsBundle source support and executor preferences wiring. |
| src/vowel/schema.py | Adds runtime-model-driven YAML JSON schema generation and content-hash cache + header materialization. |
| src/vowel/tdd.py | Adds executor-aware expected-value validation + shared failure-context builder usage; extends agent opts plumb-through. |
| src/vowel/validation.py | Switches to safe_dump and adds helpers for failure context, duration injection, expected-value validation, and error-case injection. |
| tests/test_yaml_loading.py | Migrates tests to bundle loaders and adds serializer-registry parsing coverage. |
| tests/test_tdd_eval_retries.py | Uses shared build_failure_context and updates BAD_YAML assertion scenario. |
| tests/test_session.py | Adds new session-focused tests (default + optional Monty). |
| tests/test_executor.py | Adds extensive executor backend/factory/parity coverage. |
| tests/test_serializer.py | Adds YAML-native serializer registry tests and serializer-aware assertion regression tests. |
| tests/test_schema.py | Adds schema generation + header materialization tests. |
| tests/test_run_evals.py | Adds short-name/module.function matching tests and executor preference wiring tests. |
| tests/test_evaluators.py | Adds regression ensuring restricted-assertion fallback preserves compatibility. |
| tests/test_cli.py | Adds regression test for --export-json output shape. |
| tests/test_import_function.py | Adds regression ensuring local module import doesn’t mutate sys.path. |
| tests/test_llm_judge.py | Docstring update. |
| tests/test_llm_judge_env_refs.py | Adds env-ref resolution tests for LLM judge rubric/model. |
| tests/test_llm_integration.py | Adjusts default model and skip condition to OPENROUTER-only; docstring update. |
| tests/test_fixtures.py | Adds session-scope fixture lifecycle tests and a loader heuristic regression test. |
| tests/test_generation.py | Removes old standalone test script. |
| tests/cassettes/test_generate_spec_string.json | Updates recorded YAML spec fixture. |
| tests/cassettes/test_generate_spec_simple.json | Updates recorded YAML spec fixture. |
| tests/cassettes/test_generate_palindrome.json | Updates recorded function generation fixture. |
| tests/cassettes/test_generate_factorial.json | Updates recorded function generation fixture. |
| tests/cassettes/test_generate_and_run.json | Updates recorded generate-and-run fixture (including coverage field). |
| tests/cassettes/llm_judge_custom_model.json | Updates recorded judge model field. |
| README.md | Updates quick-start links and documents name matching + serializer/assertion behavior + new CLI commands. |
| docs/README.md | Adds CodeMode doc entry. |
| docs/CLI.md | Documents new schema and costs commands. |
| docs/FIXTURES.md | Documents fixture scope aliases and normalization mapping. |
| docs/YAML_SPEC.md | Updates fixture scope docs to preferred names + aliases. |
| docs/SERIALIZERS.md | Documents YAML-native serializer registry and serializer-aware assertion context. |
| docs/CODEMODE.md | Adds CodeMode overview and usage guide. |
| docs/FEEDBACK_GUIDED_EXPLORATION.md | Adds detailed design doc for feedback-guided exploration rounds. |
| CHANGELOG.md | Adds branch changelog summary. |
| CLAUDE.md | Adds agent guidance section on critical thinking. |
| AGENTS.md | Adds agent guidance section on critical thinking. |
| costs.yml | Adds model pricing table for cost estimation. |
| examples/basic_usage/evals.yml | Adds case id fields for clarity. |
| examples/fluent_api/evals.yml | Adds case id fields for clarity. |
| examples/fixtures/evals.yml | Adds case id fields for clarity. |
| examples/evaluators/evals.yml | Adds case id fields for clarity. |
| examples/evals/builtins.yml | Adds case id fields for clarity. |
| examples/evals/math.yml | Adds case id fields for clarity. |
| examples/evals/strings.yml | Adds case id fields for clarity. |
| examples/evals/validation.yml | Adds case id fields for clarity. |
| examples/db_fixtures/db.py | Switches to enable_monitoring(...) instead of commented logfire configure. |
| examples/serializers/__init__.py | Adds package marker for serializer example. |
| examples/serializers/app.py | Adds example functions under test for serializer + fixture flow. |
| examples/serializers/defn.py | Adds Query model + serial_fn example for YAML-native registry. |
| examples/serializers/fixtures.py | Adds sqlite fixture class used by serializer example. |
| examples/serializers/db_query_evals.yml | Adds runnable YAML-native serializer registry + fixtures example. |
| db_fixture.yml | Removes old top-level fixture example file. |
| quality-judge/runner.py | Adds runner script for quality-judge scenario. |
| quality-judge/evals.py | Adds CodeMode-based generation driver for quality-judge scenario (includes mock path). |
| quality-judge/largestPathValue_evals.yml | Adds large eval spec fixture for quality-judge scenario. |
| quality-judge/largest_color_value_judge.yml | Adds judge rubric eval spec for quality-judge scenario. |
| .github/workflows/tests.yml | Drops Python 3.10 from matrix and installs [all] extras in CI. |
| .gitmodules | Adds codemode-benchmark submodule entry. |
| .gitignore | Ignores benchmark artifacts and an additional links file. |
| .env.sample | Adds CodeMode model env vars and benchmark model env vars. |
| def test_invalid_yaml_raises_error(self): | ||
| """Test that invalid YAML raises an error.""" | ||
| with pytest.raises(Exception): # noqa: B017 | ||
| load_evals_from_yaml_string("invalid: [unclosed") | ||
| load_bundle_from_yaml_string("invalid: [unclosed") | ||
|
|
||
| def test_yaml_with_top_level_serializers(self): | ||
| """Test loading top-level serializer registry and eval references.""" |
|
|
||
| MONTY_AVAILABLE = bool(importlib.util.find_spec("pydantic-monty")) | ||
|
|
@copilot apply changes based on this feedback
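One possible concern with the `MONTY_AVAILABLE` check quoted above is that `importlib.util.find_spec` expects an importable module name, and a hyphenated distribution name is unlikely to match one. A minimal sketch of the check follows, assuming the package is imported as `pydantic_monty` (this import name is an assumption, not confirmed by the PR); `is_module_available` is a hypothetical helper.

```python
import importlib.util


def is_module_available(import_name: str) -> bool:
    """Return True when a top-level module with this import name can be located."""
    # find_spec returns None when no finder can locate the module.
    return importlib.util.find_spec(import_name) is not None


# Assumption: the "pydantic-monty" distribution is imported as "pydantic_monty".
MONTY_AVAILABLE = is_module_available("pydantic_monty")
```

Comparing against `None` keeps the flag a plain boolean, the same result the original `bool(...)` wrapper aims for.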
| error=str(exc), | ||
| ) | ||
| with suppress(Exception): | ||
| if eval(self.condition, inputs, inputs): | ||
| return EvaluationReason( |
@copilot apply changes based on [this feedback]: restrict builtins without breaking the eval fallback. (#10 (comment))
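The request above (restrict builtins while keeping the eval fallback) could be sketched roughly as follows. The names `SAFE_BUILTINS` and `eval_condition` are hypothetical and not taken from the PR, and the allowlist contents are an assumption. Restricting `__builtins__` reduces accidental misuse but is not a security sandbox.

```python
from typing import Any

# Hypothetical allowlist; the project may expose a different set of names.
SAFE_BUILTINS = {
    "len": len, "abs": abs, "min": min, "max": max,
    "isinstance": isinstance, "str": str, "int": int,
    "float": float, "bool": bool, "list": list, "dict": dict,
}


def eval_condition(condition: str, inputs: dict[str, Any]) -> bool:
    """Evaluate a condition with restricted builtins, falling back to plain eval."""
    restricted_globals = {"__builtins__": SAFE_BUILTINS, **inputs}
    try:
        return bool(eval(condition, restricted_globals, dict(inputs)))
    except NameError:
        # Fallback: the expression uses a builtin outside the allowlist,
        # so re-evaluate with Python's default builtins for compatibility.
        return bool(eval(condition, dict(inputs), dict(inputs)))
```

For example, `eval_condition("sorted(x) == [1, 2, 3]", {"x": [3, 1, 2]})` takes the fallback path because `sorted` is not in the assumed allowlist.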
| def evaluate(self, output: Any) -> bool: | ||
| return isinstance(output, eval(self.type)) | ||
| try: | ||
| expected = _eval_type_restricted(self.type) | ||
| except Exception: | ||
| expected = eval(self.type) | ||
| return isinstance(output, expected) |
@copilot apply changes based on [this feedback]: I am not sure whether we still use IsInstanceCase.evaluate; you can remove it if nothing references it. (#10 (comment))
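If the method is kept rather than removed, one way to resolve a type string without `eval` is an allowlist lookup. This is only a sketch under assumptions: it is not the PR's `_eval_type_restricted`, the names `ALLOWED_TYPE_NAMES`, `resolve_types`, and `is_instance_of` are hypothetical, and it handles only plain builtin names joined by `|`.

```python
import builtins
from typing import Any

# Hypothetical allowlist of builtin type names accepted in specs.
ALLOWED_TYPE_NAMES = {
    "int", "float", "str", "bool", "list", "dict", "tuple", "set", "bytes",
}


def resolve_types(expr: str) -> tuple[type, ...]:
    """Resolve 'int' or 'int | str' to a tuple of builtin types without eval."""
    resolved = []
    for name in (part.strip() for part in expr.split("|")):
        if name not in ALLOWED_TYPE_NAMES:
            raise ValueError(f"type name not allowed: {name!r}")
        resolved.append(getattr(builtins, name))
    return tuple(resolved)


def is_instance_of(output: Any, expr: str) -> bool:
    """Check output against the types named in expr."""
    return isinstance(output, resolve_types(expr))
```

Anything outside the allowlist raises `ValueError`, which a caller could catch to decide whether a less strict fallback is acceptable.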
| def evaluate(self, input: Any, output: Any) -> bool: | ||
| env = {"input": input, "output": output} | ||
| return eval(self.assertion, env, env) | ||
| try: | ||
| return _eval_assertion_restricted(self.assertion, env) | ||
| except Exception: | ||
| return bool(eval(self.assertion, env, env)) |
@copilot apply changes based on [this feedback]: I am not sure whether we still use AssertionCase.evaluate; you can remove it if nothing references it. (#10 (comment))
| async def generate_spec_mock(fn: Function): | ||
| return pathlib.Path( | ||
| "/Users/mert/Desktop/LIP/evalspec/quality-judge/largestPathValue_evals.yml" | ||
| ).read_text() |
| dotenv.load_dotenv() | ||
| console = Console() | ||
| COSTS_FILE = Path.home() / ".vowel" / "codemode" / "generation_costs.json" | ||
|
|
@copilot apply changes based on this feedback
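The `COSTS_FILE` path quoted above, together with the overview's mention of atomic persistence in src/vowel/costs.py, suggests a write-then-rename pattern. A minimal sketch of that pattern follows; `write_json_atomic` is a hypothetical helper, not the project's actual function, and the payload shape is left to the caller.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_json_atomic(path: Path, payload: dict[str, Any]) -> None:
    """Write JSON to a temp file in the target directory, then rename over path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the file in one step on the same filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

Creating the temp file next to the target keeps both on the same filesystem, which is what lets the final rename replace the old file in a single step.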
| str | None | ||
): # TODO: instead of building call code, consider passing arguments through executor inputs
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
CHANGELOG
codemode_driven_generation
This document summarizes the main features added or improved on this branch.
1) Executor and ExecutionSession protocols
2) MontyExecutor, DefaultExecutor, MontySession, FallbackSession structures
3) Main implementation: CodeModeGenerator
4) Runtime hierarchy and utility usage
CodeMode hierarchy:
Utilities used:
5) Cost Manager
6) Serializer syntax and YAML-native serializer registry
7) Spec model / Exploration model separation
8) Adding executor/fallback executor to utilities
9) YAML schema generator
10) CLI commands: schema, costs
11) module.function -> function alias support
12) Feedback-guided exploration
13) Assertion + serializer integration
14) LLM Judge env-ref improvements
15) Examples, documentation, and test coverage
16) Fixture scope alias support
Note: Old names will be deprecated after v1.0.0.
Note
This changelog is based on features observed and validated in code on this branch, without using git history.