sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql #3

@ifsheldon

Goal

Port sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql from Frontier-CS into an Agentics challenge as the first separated_evaluator migration proof-of-concept.

Provenance

  • Frontier-CS path: research/problems/grammar_fuzzing/seed/sql
  • Frontier-CS source link: https://github.com/agentic-science/Frontier-CS/tree/main/research/problems/grammar_fuzzing/seed/sql
  • Original category/tag: pl
  • Original challenge title: SQL Parser Test Case Generation
  • Original evaluator shape: the Python evaluator imports Solution, calls Solution.solve(resources_path=...), and expects a list[str] of generated SQL statements. It runs those statements through an evaluator-owned SQL parser, measures line and branch coverage, and computes a score that combines weighted coverage with efficiency.
  • Original runtime assumptions: python:3.11-slim, no DinD, 300 second timeout, evaluator-owned resources under resources/ including grammar and parser implementation.
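
For orientation, the original call shape described above can be sketched as follows. This is a hypothetical baseline, not code from Frontier-CS: only the class name Solution, the solve method, the resources_path parameter, and the list[str] return type come from the description; the seed statements are illustrative.

```python
from pathlib import Path


class Solution:
    """Hypothetical baseline matching the Frontier-CS call shape."""

    def solve(self, resources_path: str) -> list[str]:
        # The real resources directory holds the grammar and parser; this
        # sketch only checks that the path exists and ignores its contents.
        resources = Path(resources_path)
        if not resources.is_dir():
            raise FileNotFoundError(f"resources directory not found: {resources}")
        # Hand-written seed statements stand in for a grammar-driven fuzzer.
        return [
            "SELECT 1",
            "SELECT a, b FROM t WHERE a > 1 ORDER BY b DESC",
            "INSERT INTO t (a, b) VALUES (1, 'x')",
            "UPDATE t SET a = 2 WHERE b IS NULL",
            "DELETE FROM t WHERE a IN (1, 2, 3)",
        ]
```

The evaluator would then call Solution().solve(resources_path=...) and feed the returned strings to its own parser.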

Proposed Agentics Challenge

  • Challenge name: sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql
  • Execution mode: separated_evaluator
  • Target: CPU target using an approved Agentics CPU image. Public validation should use a small Python parser/grammar fixture that is safe to commit.
  • Public validation: one or more deterministic public run cases where the submitted zip-project writes generated SQL statements to a declared output file, for example statements.json or JSONL. Keep the fixture small enough for quick validation.
  • Official scoring: use private official parser resources, hidden grammar variants, private seeds, or private coverage fixtures as needed. Keep official scoring data and evaluator-only reference material out of the public repo.
  • Private assets: likely a private_benchmark_data or private_seeds overlay containing hidden grammar/parser resources or official scoring configuration, depending on the final evaluator design.
  • Solution interface: submitted solution reads public challenge resources or run inputs and writes generated SQL statements to the declared output path. Do not require Frontier-CS-style direct Python class import in the participant run container unless the final challenge statement intentionally chooses that as the zip-project contract.
  • Evaluator responsibilities: read /solution-runs/{run_name}/agentics-run.json, parse only declared output files from sanitized run trees, validate output shape and size, run coverage measurement against evaluator-owned parser resources, and write Agentics result.json.
  • Ranking metric: a coverage-oriented aggregate score. If tie-breakers prove useful, candidates are higher weighted coverage and fewer valid statements.
  • Public metrics: score, line_coverage, branch_coverage, total_statements, successful_parses, failed_parses, and runs_successfully or equivalent.
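
The proposed solution interface could look roughly like the sketch below on the participant side. It assumes the declared output is a single JSON array of strings in statements.json, which is only one of the formats mentioned above; generate_statements and write_statements are hypothetical helper names, and the toy generator is not grammar-aware.

```python
import json
import random
from pathlib import Path


def generate_statements(seed: int, count: int) -> list[str]:
    """Toy generator: deterministic for a given seed (hypothetical helper)."""
    rng = random.Random(seed)
    tables = ["t", "users", "orders"]
    columns = ["id", "name", "total"]
    statements = []
    for _ in range(count):
        table = rng.choice(tables)
        column = rng.choice(columns)
        value = rng.randint(0, 99)
        statements.append(f"SELECT {column} FROM {table} WHERE id = {value}")
    return statements


def write_statements(output_path: Path, statements: list[str]) -> None:
    """Write the statements as one JSON array of strings."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(json.dumps(statements, indent=2), encoding="utf-8")
```

A run entry point might call write_statements(Path("statements.json"), generate_statements(seed=0, count=10)). Seeding the generator keeps public validation runs deterministic.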

Migration Steps

  1. Create the challenge bundle skeleton under challenges/sql-parser-coverage-frontier-cs-grammar-fuzzing-seed-sql/.
  2. Write agentics.challenge.json, v1/spec.json, README.md, and v1/statement.md using the separated_evaluator topology.
  3. Adapt the participant-facing contract from Solution.solve(resources_path) -> list[str] to a zip-project run contract that writes SQL statements to a declared output file.
  4. Port the evaluator logic so it reads Agentics run artifacts from /solution-runs, treats participant files as hostile input, measures parser coverage, and emits Agentics result.json.
  5. Add a tiny deterministic public validation fixture and baseline/simple fuzzer example for local smoke testing.
  6. Define the official private overlay shape, including any hidden grammar, parser, seed, or scoring config paths referenced by private_assets[].required_paths.
  7. Run python3 scripts/validate_challenges.py and a public validation smoke test before opening the migration PR.
  8. Document any review notes around coverage stability, output limits, and private-data handling.
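
Step 4 asks the evaluator to treat participant files as hostile input. A minimal sketch of one part of that, assuming the declared output path arrives as a relative path string taken from the run manifest, is shown below. The names RunArtifactError, resolve_declared_output, and write_result are hypothetical, and the real result.json schema is not specified here.

```python
import json
from pathlib import Path


class RunArtifactError(Exception):
    """Raised when a run tree does not match the expected contract."""


def resolve_declared_output(run_dir: Path, declared_relpath: str) -> Path:
    """Return the declared output file, refusing paths that leave run_dir."""
    root = run_dir.resolve()
    candidate = (root / declared_relpath).resolve()
    # resolve() follows symlinks and collapses "..", so this check also
    # rejects symlinks that point outside the run tree.
    if not candidate.is_relative_to(root):
        raise RunArtifactError(f"declared output escapes run tree: {declared_relpath}")
    if not candidate.is_file():
        raise RunArtifactError(f"declared output is missing: {declared_relpath}")
    return candidate


def write_result(result_path: Path, metrics: dict) -> None:
    """Write metrics as JSON; the real result.json schema may differ."""
    result_path.write_text(
        json.dumps(metrics, indent=2, sort_keys=True), encoding="utf-8"
    )
```

Path.is_relative_to requires Python 3.9 or later, which fits the python:3.11-slim assumption of the original evaluator.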

Risks

  • Trust boundary: normal separated evaluator boundary. Participant output is still hostile and must be parsed defensively from sanitized run trees only.
  • Runtime/image dependency: coverage.py and parser dependencies must be pinned or bundled so validation is reproducible on the approved Agentics CPU image.
  • Scoring stability: coverage scores can shift if parser files or coverage.py behavior change. Pin dependencies and keep public/private fixtures stable.
  • Private-data leakage risk: do not commit official parser variants, private grammar seeds, hidden scoring config, or evaluator-only reference material.
  • Output abuse risk: cap statement count, statement byte length, and total output size before coverage measurement.
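
The output caps mentioned under output abuse risk might be enforced along these lines before any coverage run. The limit values are placeholders, not proposed numbers, and OutputLimits and check_limits are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputLimits:
    # Placeholder values; the real caps are a challenge design decision.
    max_statements: int = 10_000
    max_statement_bytes: int = 4_096
    max_total_bytes: int = 1_000_000


def check_limits(statements: list[str], limits: OutputLimits = OutputLimits()) -> list[str]:
    """Return human-readable violations; an empty list means within limits."""
    problems = []
    if len(statements) > limits.max_statements:
        problems.append(
            f"too many statements: {len(statements)} > {limits.max_statements}"
        )
    total = 0
    for index, statement in enumerate(statements):
        # surrogatepass keeps encode() from raising on lone surrogates,
        # which hostile JSON can introduce through \ud800-style escapes.
        size = len(statement.encode("utf-8", errors="surrogatepass"))
        total += size
        if size > limits.max_statement_bytes:
            problems.append(
                f"statement {index} is {size} bytes > {limits.max_statement_bytes}"
            )
    if total > limits.max_total_bytes:
        problems.append(f"total output is {total} bytes > {limits.max_total_bytes}")
    return problems
```

In practice the evaluator would probably also check the raw file size before parsing it at all, so an oversized file is rejected without being loaded into memory.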

Acceptance Criteria

  • Public validation succeeds for a known-good baseline/simple fuzzer.
  • Invalid JSON, non-string statements, empty outputs, excessive output, and parser crashes fail with clear evaluator feedback.
  • Official benchmark path uses private assets without exposing private parser internals beyond the intended public fixture.
  • Manifest passes repository validation.
  • The issue or migration PR records the Frontier-CS source path and any intentional scoring changes from the original evaluator.
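
The second criterion, clear feedback for malformed output, can be illustrated with a shape check like the one below. It assumes the JSON-array output format and returns a feedback string instead of raising; parse_statements is a hypothetical name, and the wording of the messages is illustrative.

```python
import json
from typing import Optional


def parse_statements(raw: bytes) -> tuple[Optional[list[str]], Optional[str]]:
    """Return (statements, None) on success or (None, feedback) on failure."""
    try:
        data = json.loads(raw.decode("utf-8"))
    except UnicodeDecodeError:
        return None, "output is not valid UTF-8"
    except json.JSONDecodeError as exc:
        return None, f"output is not valid JSON: {exc.msg} (line {exc.lineno})"
    except (ValueError, RecursionError):
        # Covers other parse failures such as extreme nesting depth.
        return None, "output could not be parsed as JSON"
    if not isinstance(data, list):
        return None, f"expected a JSON array, got {type(data).__name__}"
    if not data:
        return None, "output contains no statements"
    for index, item in enumerate(data):
        if not isinstance(item, str):
            return None, f"item {index} is {type(item).__name__}, expected a string"
    return data, None
```

For example, parse_statements(b'["SELECT 1", 2]') yields feedback naming item 1 as the offending entry. Parser crashes during coverage measurement would need separate handling in the evaluator.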

Labels: enhancement (New feature or request)
