eval-toolkit

Here are 2 public repositories matching this topic...

kroq86 / honeybadger

Formal VM benchmark and inspectable reasoning runtime for testing whether language models can follow machine-like execution semantics on synthetic tasks.

benchmark vm semantics mcp language-models copilot vm-benchmark mcp-server copilot-coding-agent dataset-factory eval-toolkit reasoning-runtime execution-semantics

Updated Mar 20, 2026
Python

krishdef7 / gemini-cli-eval-toolkit

Three-stage pipeline for behavioral eval coverage inventory, gap analysis, and automated eval generation for Gemini CLI. GSoC 2026 proposal for Project #23331.

hooks typescript ast coverage-analysis vitest google-gemini gemini-cli gsoc2026 behavioral-evals eval-toolkit

Updated Mar 24, 2026
TypeScript
