sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql #3

## Goal

Port `sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql` from Frontier-CS into an Agentics challenge as the first `separated_evaluator` migration proof-of-concept.

## Provenance

- Frontier-CS path: `research/problems/grammar_fuzzing/seed/sql`
- Frontier-CS source link: https://github.com/agentic-science/Frontier-CS/tree/main/research/problems/grammar_fuzzing/seed/sql
- Original category/tag: `pl`
- Original challenge title: SQL Parser Test Case Generation
- Original evaluator shape: a Python evaluator imports `Solution`, calls `Solution.solve(resources_path=...)`, and expects a `list[str]` of generated SQL statements. It runs those statements through an evaluator-owned SQL parser, measures line and branch coverage, and computes a score that combines weighted coverage with an efficiency term.
- Original runtime assumptions: `python:3.11-slim`, no DinD, 300-second timeout, evaluator-owned resources under `resources/` including the grammar and parser implementation.
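
For orientation, the original in-process contract can be sketched roughly as follows. Only `Solution.solve(resources_path=...)` and the `list[str]` return type come from the description above. `toy_parse` and `run_original_style` are hypothetical stand-ins, and the coverage measurement and scoring performed by the real evaluator are omitted.

```python
class Solution:
    """Participant class in the Frontier-CS style described above."""

    def solve(self, resources_path: str) -> list[str]:
        # A real solution would read the grammar under resources_path;
        # this stub returns fixed statements for illustration only.
        return ["SELECT 1", "SELECT a FROM t", "DROP"]


def toy_parse(sql: str) -> None:
    """Hypothetical stand-in for the evaluator-owned SQL parser."""
    tokens = sql.split()
    if not tokens or tokens[0].upper() != "SELECT":
        raise ValueError(f"unsupported statement: {sql!r}")


def run_original_style(resources_path: str) -> dict:
    """Call the solution in-process and count parse outcomes."""
    statements = Solution().solve(resources_path=resources_path)
    if not isinstance(statements, list) or not all(
        isinstance(s, str) for s in statements
    ):
        raise TypeError("solve() must return list[str]")

    successful = failed = 0
    for sql in statements:
        try:
            toy_parse(sql)
            successful += 1
        except ValueError:
            failed += 1

    # The real evaluator would also measure line and branch coverage here.
    return {
        "total_statements": len(statements),
        "successful_parses": successful,
        "failed_parses": failed,
    }
```

The direct import is what the `separated_evaluator` port replaces: participant code and evaluator code share one Python process here, so there is no file-based boundary between them.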

## Proposed Agentics Challenge

- Challenge name: `sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql`
- Execution mode: `separated_evaluator`
- Target: CPU target using an approved Agentics CPU image. Public validation should use a small Python parser/grammar fixture that is safe to commit.
- Public validation: one or more deterministic public run cases where the submitted zip-project writes generated SQL statements to a declared output file, for example `statements.json` or JSONL. Keep the fixture small enough for quick validation.
- Official scoring: use private official parser resources, hidden grammar variants, private seeds, or private coverage fixtures as needed. Keep official scoring data and evaluator-only reference material out of the public repo.
- Private assets: likely a `private_benchmark_data` or `private_seeds` overlay containing hidden grammar/parser resources or official scoring configuration, depending on the final evaluator design.
- Solution interface: submitted solution reads public challenge resources or run inputs and writes generated SQL statements to the declared output path. Do not require Frontier-CS-style direct Python class import in the participant run container unless the final challenge statement intentionally chooses that as the zip-project contract.
- Evaluator responsibilities: read `/solution-runs/{run_name}/agentics-run.json`, parse only declared output files from sanitized run trees, validate output shape and size, run coverage measurement against evaluator-owned parser resources, and write Agentics `result.json`.
- Ranking metric: a coverage-oriented aggregate score. If tie-breakers are useful, prefer higher weighted coverage first, then fewer valid statements.
- Public metrics: `score`, `line_coverage`, `branch_coverage`, `total_statements`, `successful_parses`, `failed_parses`, and `runs_successfully` or equivalent.
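
The evaluator responsibilities listed above could be wired together roughly like this. This is a sketch under stated assumptions, not the Agentics schema: the `"outputs"` key in `agentics-run.json`, the `feedback` field, and the shape of `result.json` are hypothetical, and `measure` stands in for the coverage step, which is not shown.

```python
import json
from pathlib import Path
from typing import Callable


def resolve_declared_output(run_dir: Path, relative_name: str) -> Path:
    """Resolve a declared output path and refuse anything outside run_dir."""
    root = run_dir.resolve()
    candidate = (root / relative_name).resolve()
    # resolve() follows symlinks and "..", so escapes are caught here.
    if not candidate.is_relative_to(root):
        raise ValueError(f"declared output escapes run tree: {relative_name!r}")
    if not candidate.is_file():
        raise ValueError(f"declared output missing: {relative_name!r}")
    return candidate


def evaluate_run(
    run_dir: Path, result_path: Path, measure: Callable[[list], dict]
) -> dict:
    """Read the run manifest, load the declared output, and write a result."""
    try:
        manifest = json.loads(
            (run_dir / "agentics-run.json").read_text(encoding="utf-8")
        )
        # ASSUMPTION: declared outputs are listed under an "outputs" key.
        declared = manifest["outputs"][0]
        output_file = resolve_declared_output(run_dir, declared)
        statements = json.loads(output_file.read_text(encoding="utf-8"))
        # measure() is a placeholder for parsing plus coverage measurement.
        result = {"runs_successfully": True, **measure(statements)}
    except (ValueError, KeyError, IndexError, TypeError, OSError) as exc:
        # Hypothetical failure shape: zero score plus readable feedback.
        result = {"runs_successfully": False, "score": 0.0, "feedback": str(exc)}
    result_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result
```

A caller would pass something like `Path("/solution-runs") / run_name` as `run_dir`. Resolving the path before opening it means a declared output such as `../secret.json` is rejected rather than read.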

## Migration Steps

1. Create the challenge bundle skeleton under `challenges/sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql/`.
2. Write `agentics.challenge.json`, `v1/spec.json`, `README.md`, and `v1/statement.md` using the `separated_evaluator` topology.
3. Adapt the participant-facing contract from `Solution.solve(resources_path) -> list[str]` to a zip-project run contract that writes SQL statements to a declared output file.
4. Port the evaluator logic so it reads Agentics run artifacts from `/solution-runs`, treats participant files as hostile input, measures parser coverage, and emits Agentics `result.json`.
5. Add a tiny deterministic public validation fixture and baseline/simple fuzzer example for local smoke testing.
6. Define the official private overlay shape, including any hidden grammar, parser, seed, or scoring config paths referenced by `private_assets[].required_paths`.
7. Run `python3 scripts/validate_challenges.py` and a public validation smoke test before opening the migration PR.
8. Document any review notes around coverage stability, output limits, and private-data handling.
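
A baseline fuzzer for steps 3 and 5 might look like the sketch below. The templates, table names, column names, and the choice of a JSON array in `statements.json` are illustrative assumptions. The final contract may declare a different output path or format.

```python
import json
import random
from pathlib import Path

# Hypothetical identifiers; a real solution would derive these from the grammar.
TABLES = ["users", "orders"]
COLUMNS = ["id", "name", "total"]


def generate_statements(count: int, seed: int = 0) -> list[str]:
    """Produce a deterministic list of simple SQL statements."""
    rng = random.Random(seed)  # fixed seed keeps public validation repeatable
    statements = []
    for _ in range(count):
        table = rng.choice(TABLES)
        column = rng.choice(COLUMNS)
        value = rng.randint(0, 99)
        templates = [
            f"SELECT {column} FROM {table}",
            f"SELECT * FROM {table} WHERE {column} = {value}",
            f"INSERT INTO {table} ({column}) VALUES ({value})",
            f"DELETE FROM {table} WHERE {column} < {value}",
        ]
        statements.append(rng.choice(templates))
    return statements


def write_statements(statements: list[str], out_path: Path) -> None:
    """Write the statements as a JSON array of strings."""
    out_path.write_text(json.dumps(statements, indent=2), encoding="utf-8")
```

Calling `write_statements(generate_statements(50), Path("statements.json"))` from the zip-project entry point would produce the declared output file. Seeding a private `random.Random` instance, rather than the module-level generator, keeps the output identical across runs.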

## Risks

- Trust boundary: the normal `separated_evaluator` boundary applies. Participant output is still hostile and must be parsed defensively, and only from sanitized run trees.
- Runtime/image dependency: coverage.py and parser dependencies must be pinned or bundled so validation is reproducible on the approved Agentics CPU image.
- Scoring stability: coverage scores can shift if parser files or coverage.py behavior change. Pin dependencies and keep public/private fixtures stable.
- Private-data leakage risk: do not commit official parser variants, private grammar seeds, hidden scoring config, or evaluator-only reference material.
- Output abuse risk: cap statement count, statement byte length, and total output size before coverage measurement.
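
The output caps could be enforced before any parsing or coverage work, along these lines. The three limits and the `OutputRejected` exception are placeholder assumptions. The real values should come from the challenge spec.

```python
import json

# Hypothetical limits; the challenge spec should define the real ones.
MAX_STATEMENTS = 1000
MAX_STATEMENT_BYTES = 4096
MAX_TOTAL_BYTES = 1_000_000


class OutputRejected(Exception):
    """Raised with a participant-readable reason when output is invalid."""


def load_statements(raw: bytes) -> list[str]:
    """Validate raw participant output and return the statement list."""
    if len(raw) > MAX_TOTAL_BYTES:
        raise OutputRejected(f"output exceeds {MAX_TOTAL_BYTES} bytes")
    try:
        data = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError, RecursionError) as exc:
        # RecursionError covers deeply nested input such as many "[" in a row.
        raise OutputRejected(f"invalid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise OutputRejected("top-level value must be a JSON array")
    if not data:
        raise OutputRejected("no statements provided")
    if len(data) > MAX_STATEMENTS:
        raise OutputRejected(f"more than {MAX_STATEMENTS} statements")
    for index, statement in enumerate(data):
        if not isinstance(statement, str):
            raise OutputRejected(f"statement {index} is not a string")
        # surrogatepass avoids an encode error on lone surrogate escapes.
        size = len(statement.encode("utf-8", "surrogatepass"))
        if size > MAX_STATEMENT_BYTES:
            raise OutputRejected(f"statement {index} exceeds {MAX_STATEMENT_BYTES} bytes")
    return data
```

Checking the file size on disk before reading it into memory would be a sensible first gate in front of this function, so that an oversized file is never loaded at all.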

## Acceptance Criteria

- Public validation succeeds for a known-good baseline/simple fuzzer.
- Invalid JSON, non-string statements, empty outputs, excessive output, and parser crashes fail with clear evaluator feedback.
- The official benchmark path uses private assets and exposes no private parser internals beyond the intended public fixture.
- Manifest passes repository validation.
- The issue or migration PR records the Frontier-CS source path and any intentional scoring changes from the original evaluator.

