Goal
Port sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql from Frontier-CS into an Agentics challenge as the first separated_evaluator migration proof-of-concept.
Provenance
- Frontier-CS path: research/problems/grammar_fuzzing/seed/sql
- Frontier-CS source link: https://github.com/agentic-science/Frontier-CS/tree/main/research/problems/grammar_fuzzing/seed/sql
- Original category/tag: pl
- Original challenge title: SQL Parser Test Case Generation
- Original evaluator shape: Python evaluator imports Solution, calls Solution.solve(resources_path=...), expects a list[str] of generated SQL statements, runs those statements through an evaluator-owned SQL parser, measures line and branch coverage, and computes a weighted coverage plus efficiency score.
- Original runtime assumptions: python:3.11-slim, no DinD, 300 second timeout, evaluator-owned resources under resources/ including grammar and parser implementation.
Proposed Agentics Challenge
- Challenge name: sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql
- Execution mode: separated_evaluator
- Target: CPU target using an approved Agentics CPU image. Public validation should use a small Python parser/grammar fixture that is safe to commit.
- Public validation: one or more deterministic public run cases where the submitted zip-project writes generated SQL statements to a declared output file, for example statements.json or JSONL. Keep the fixture small enough for quick validation.
- Official scoring: use private official parser resources, hidden grammar variants, private seeds, or private coverage fixtures as needed. Keep official scoring data and evaluator-only reference material out of the public repo.
- Private assets: likely a private_benchmark_data or private_seeds overlay containing hidden grammar/parser resources or official scoring configuration, depending on the final evaluator design.
- Solution interface: submitted solution reads public challenge resources or run inputs and writes generated SQL statements to the declared output path. Do not require Frontier-CS-style direct Python class import in the participant run container unless the final challenge statement intentionally chooses that as the zip-project contract.
- Evaluator responsibilities: read /solution-runs/{run_name}/agentics-run.json, parse only declared output files from sanitized run trees, validate output shape and size, run coverage measurement against evaluator-owned parser resources, and write Agentics result.json.
- Ranking metric: coverage-oriented aggregate score, with tie-breakers such as higher weighted coverage and fewer valid statements if useful.
- Public metrics: score, line_coverage, branch_coverage, total_statements, successful_parses, failed_parses, and runs_successfully or equivalent.
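As a rough illustration of how the public metrics listed above might be assembled, here is a minimal sketch. Everything in it is an assumption made for illustration: the function name summarize_run, the toy_parse stand-in for the evaluator-owned parser, and the line_weight parameter. The original weighted coverage plus efficiency formula is not reproduced here, and coverage values are passed in rather than measured.

```python
from typing import Callable, Iterable


def toy_parse(statement: str) -> None:
    """Hypothetical stand-in for the evaluator-owned SQL parser."""
    if not statement.strip().upper().startswith("SELECT"):
        raise ValueError(f"unsupported statement: {statement!r}")


def summarize_run(
    statements: Iterable[str],
    parse: Callable[[str], None],
    line_coverage: float,
    branch_coverage: float,
    line_weight: float = 0.5,  # placeholder weight, not the official one
) -> dict:
    """Count parse outcomes and combine coverage into a placeholder score."""
    successful = 0
    failed = 0
    for statement in statements:
        try:
            parse(statement)
        except Exception:
            # A parser error counts as a failed parse, not an evaluator crash.
            failed += 1
        else:
            successful += 1
    score = line_weight * line_coverage + (1.0 - line_weight) * branch_coverage
    return {
        "score": score,
        "line_coverage": line_coverage,
        "branch_coverage": branch_coverage,
        "total_statements": successful + failed,
        "successful_parses": successful,
        "failed_parses": failed,
        "runs_successfully": True,
    }
```

In a real evaluator the two coverage inputs would come from the coverage measurement step, and the resulting dictionary would be mapped onto whatever shape Agentics result.json requires.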
Migration Steps
- Create the challenge bundle skeleton under challenges/sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql/.
- Write agentics.challenge.json, v1/spec.json, README.md, and v1/statement.md using the separated_evaluator topology.
- Adapt the participant-facing contract from Solution.solve(resources_path) -> list[str] to a zip-project run contract that writes SQL statements to a declared output file.
- Port the evaluator logic so it reads Agentics run artifacts from /solution-runs, treats participant files as hostile input, measures parser coverage, and emits Agentics result.json.
- Add a tiny deterministic public validation fixture and baseline/simple fuzzer example for local smoke testing.
- Define the official private overlay shape, including any hidden grammar, parser, seed, or scoring config paths referenced by private_assets[].required_paths.
- Run python3 scripts/validate_challenges.py and a public validation smoke test before opening the migration PR.
- Document any review notes around coverage stability, output limits, and private-data handling.
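The baseline/simple fuzzer mentioned in the steps above could look roughly like the following sketch. It assumes the declared output is a JSON array of strings named statements.json, which is only one of the options the plan mentions. The grammar fragments, function names, and seed handling are hypothetical and do not come from Frontier-CS.

```python
import json
import random
from pathlib import Path

# Hypothetical toy vocabulary; a real baseline would derive this from the
# public grammar fixture shipped with the challenge.
TABLES = ["users", "orders"]
COLUMNS = ["id", "name", "total"]


def generate_statements(seed: int, count: int) -> list[str]:
    """Produce a deterministic list of simple SQL statements."""
    rng = random.Random(seed)  # seeded so public validation is repeatable
    statements = []
    for _ in range(count):
        table = rng.choice(TABLES)
        column = rng.choice(COLUMNS)
        value = rng.randint(1, 9)
        kind = rng.choice(["select", "insert", "delete"])
        if kind == "select":
            statements.append(f"SELECT {column} FROM {table} WHERE id = {value};")
        elif kind == "insert":
            statements.append(f"INSERT INTO {table} ({column}) VALUES ({value});")
        else:
            statements.append(f"DELETE FROM {table} WHERE id = {value};")
    return statements


def render_statements(statements: list[str]) -> str:
    """Serialize statements as a JSON array of strings."""
    return json.dumps(statements, indent=2)


def write_statements(path: Path, statements: list[str]) -> None:
    """Write the serialized statements to the declared output path."""
    path.write_text(render_statements(statements), encoding="utf-8")
```

A participant entry point would then call write_statements(Path("statements.json"), generate_statements(seed=0, count=50)), with the path replaced by whatever output location the final run contract declares.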
Risks
- Trust boundary: normal separated evaluator boundary. Participant output is still hostile and must be parsed defensively from sanitized run trees only.
- Runtime/image dependency: coverage.py and parser dependencies must be pinned or bundled so validation is reproducible on the approved Agentics CPU image.
- Scoring stability: coverage scores can shift if parser files or coverage.py behavior change. Pin dependencies and keep public/private fixtures stable.
- Private-data leakage risk: do not commit official parser variants, private grammar seeds, hidden scoring config, or evaluator-only reference material.
- Output abuse risk: cap statement count, statement byte length, and total output size before coverage measurement.
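The output-abuse and trust-boundary points above suggest validating participant output before any parsing or coverage work begins. The sketch below shows one possible way to do that for a JSON array of strings. The OutputRejected exception, the load_statements name, and all three default caps are hypothetical placeholders, not values taken from the original evaluator.

```python
import json


class OutputRejected(ValueError):
    """Raised when participant output fails validation."""


def load_statements(
    raw: bytes,
    *,
    max_statements: int = 1000,          # placeholder cap
    max_statement_bytes: int = 4096,     # placeholder cap
    max_total_bytes: int = 1_000_000,    # placeholder cap
) -> list[str]:
    """Validate hostile bytes and return the statements they contain."""
    # Check total size first so oversized files are never decoded.
    if len(raw) > max_total_bytes:
        raise OutputRejected("output exceeds total size cap")
    try:
        data = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError, RecursionError) as exc:
        raise OutputRejected(f"output is not valid UTF-8 JSON: {exc}") from exc
    if not isinstance(data, list):
        raise OutputRejected("output must be a JSON array")
    if not data:
        raise OutputRejected("output contains no statements")
    if len(data) > max_statements:
        raise OutputRejected("output exceeds statement count cap")
    for index, item in enumerate(data):
        if not isinstance(item, str):
            raise OutputRejected(f"statement {index} is not a string")
        # surrogatepass keeps lone surrogates from raising during measurement.
        size = len(item.encode("utf-8", errors="surrogatepass"))
        if size > max_statement_bytes:
            raise OutputRejected(f"statement {index} exceeds byte length cap")
    return data
```

Raising a single exception type with a descriptive message keeps it straightforward to turn each rejection into clear evaluator feedback rather than an unhandled crash.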
Acceptance Criteria
- Public validation succeeds for a known-good baseline/simple fuzzer.
- Invalid JSON, non-string statements, empty outputs, excessive output, and parser crashes fail with clear evaluator feedback.
- Official benchmark path uses private assets without exposing private parser internals beyond the intended public fixture.
- Manifest passes repository validation.
- The issue or migration PR records the Frontier-CS source path and any intentional scoring changes from the original evaluator.