Skill Forge is a quality gate for Agent Skills.
It reviews OpenAI, Claude, and generic Agent Skill packages before they get installed, published, or shipped. It combines a deterministic package inspector with a qualitative review workflow so you can catch the boring structural failures, the dangerous edge cases, and the subtle instruction problems that only show up once an agent starts interpreting the skill in the wild.
In plain English: Skill Forge helps you find out whether a skill is actually ready — or just looks ready because the folder has a SKILL.md in it.
```
skill-forge/
├── SKILL.md
├── README.md
├── LICENSE
├── .gitignore
├── .github/
│   └── workflows/
│       └── self-tests.yml
├── agents/
│   └── openai.yaml
├── scripts/
│   ├── inspect_skill_package.py
│   └── run_self_tests.py
└── references/
    ├── audit-checklist.md
    ├── evaluation-rubric.md
    ├── example-report.md
    ├── inspector-output-schema.md
    ├── platform-compatibility.md
    ├── pressure-test-suite.md
    ├── release-gate-checklist.md
    ├── report-template.md
    └── severity-framework.md
```
Skill Forge looks at two different failure modes:
- Package integrity — the things a script can inspect reliably.
- Agent behavior quality — the things that require judgment, pressure testing, and a working understanding of how skills fail in real use.
It can help evaluate:
- Uploaded Skill ZIPs, folders, or `SKILL.md` drafts.
- Package structure and expected entrypoints.
- Cross-platform compatibility risks across OpenAI, Claude, and generic agent environments.
- Unsafe archive paths, symlinks, oversized files, suspected secrets, missing resources, and leftover template content.
- Trigger quality: when the skill should activate, when it should stay out of the way, and where ambiguity may cause bad routing.
- Instruction clarity, contradiction risk, and unnecessary context bloat.
- Progressive loading: whether the core skill stays lean while detailed rubrics and references live in linked files.
- Safety posture and risky workflow assumptions.
- Pressure-test results and release readiness.
The goal is not to produce a pretty audit for a dashboard. The goal is to stop weak, unsafe, confusing, or overbuilt skills from getting shipped with a confident little smile on their face.
The inspector is dependency-free and uses only the Python standard library.
```
python -S scripts/inspect_skill_package.py /path/to/skill-or-skill.zip --json
```

For CI-style checks, use strict mode. Strict mode exits with code 2 when any error-severity finding is present.

```
python -S scripts/inspect_skill_package.py /path/to/skill-or-skill.zip --json --strict
```

Human-readable markdown output is available by omitting `--json`:

```
python -S scripts/inspect_skill_package.py /path/to/skill-or-skill.zip
```

The markdown output ends with a status and finding-count footer. JSON output includes a top-level `summary` object for release-gate and CI integrations.
Use the inspector process exit code as the canonical machine signal for CI and release gates.
In strict mode:
- Exit code `0` means the inspected package passed strict checks.
- Exit code `2` means one or more error-severity findings were present.
The top-level JSON summary object is retained for compatibility with existing automation that reads fields such as:
- `summary.status`
- `summary.strict_pass`
- `summary.error_count`
- `summary.finding_codes`
New integrations should still treat the process exit code as the source of truth for pass/fail decisions. Use `summary` for reporting, dashboards, or compatibility with older pipelines.
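As a minimal sketch of that contract, the wrapper below runs the inspector in strict mode, uses the exit code to decide pass/fail, and reads the `summary` fields only for reporting. The flags, exit codes, and field names come from this document; the skill path and report formatting are placeholders, and it assumes the JSON report is written to stdout.

```python
import json
import subprocess
import sys

# Hypothetical gate wrapper: the exit code is the source of truth,
# summary fields are used only for reporting.
result = subprocess.run(
    [sys.executable, "-S", "scripts/inspect_skill_package.py",
     "path/to/skill.zip", "--json", "--strict"],
    capture_output=True, text=True,
)

summary = json.loads(result.stdout).get("summary", {})
print(f"status={summary.get('status')} "
      f"errors={summary.get('error_count')} "
      f"codes={summary.get('finding_codes')}")

# Strict mode: 0 = passed, 2 = error-severity findings present.
sys.exit(result.returncode)
```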
Run the regression suite after changing the inspector, output schema, release-gate behavior, or any rule that could change pass/fail results.
```
python -S scripts/run_self_tests.py
```

The tests build temporary valid, malformed, and hostile Skill fixtures. They do not require network access or third-party packages.
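As a rough illustration of the fixture approach (not the actual test code), a test can build a throwaway skill directory and assert on the inspector's exit code. The fixture contents are assumptions; only the strict-mode exit codes come from this document.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Illustrative only: builds a minimal fixture and runs the inspector
# against it in strict mode. The real suite also covers malformed and
# hostile fixtures.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "demo-skill"
    skill.mkdir()
    (skill / "SKILL.md").write_text("# Demo skill\n\nMinimal fixture.\n")

    result = subprocess.run(
        [sys.executable, "-S", "scripts/inspect_skill_package.py",
         str(skill), "--json", "--strict"],
        capture_output=True, text=True,
    )
    # Strict mode defines only these two outcomes.
    assert result.returncode in (0, 2)
```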
This repository includes a GitHub Actions workflow at .github/workflows/self-tests.yml.
It runs the bundled regression suite on pushes, pull requests, and manual dispatches:
```
python -S scripts/run_self_tests.py
```

Keep the workflow dependency-free unless the inspector intentionally adds third-party runtime requirements.
For install, publish, ship, or release-candidate decisions, use references/release-gate-checklist.md and report results using the Release Gate Review section from references/report-template.md.
A skill should not be called release-ready when a Critical gate fails.
Treat the following as blockers unless the user explicitly requested only a draft review:
- Strict inspector failures.
- Likely bundled secrets.
- Unsafe archive structures.
- Missing or multiple `SKILL.md` entrypoints.
- Official validator failures.
- Safety risks that could cause the agent to take unsafe, misleading, destructive, or unsupported actions.
Release readiness should mean something. Otherwise it is just ceremony wearing a badge.
Treat uploaded archives and bundled scripts as untrusted input.
The secret scan is heuristic and non-exhaustive. A clean scan does not prove that no secrets exist.
Do not run bundled scripts that appear destructive, credential-harvesting, network-dependent, or unrelated to validation.
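For context on what "unsafe archive paths" means here, the classic case is a zip-slip entry that escapes the extraction directory. A minimal sketch of that check, using only the standard library (an illustration, not the inspector's actual implementation):

```python
import zipfile
from pathlib import Path

def has_unsafe_paths(zip_path: str) -> bool:
    """Flag entries that are absolute or traverse out of the extraction root."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            entry = Path(name)
            if entry.is_absolute() or ".." in entry.parts:
                return True
    return False
```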
Keep SKILL.md compact. Move detailed rubrics, examples, schemas, and extended guidance into directly linked files under references/.
The skill should help the agent load the right amount of context at the right time — not bury it under a pile of instructions and then act surprised when behavior drifts.
A distributable Skill ZIP should contain the skill-forge/ directory as the archive root entry.
Keep the package under the target platform upload limit and exclude generated caches.
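A minimal packaging sketch under those constraints, again stdlib-only. The output filename and the cache filter are assumptions, not a fixed spec; the one grounded requirement is that entries start with `skill-forge/`.

```python
import zipfile
from pathlib import Path

EXCLUDE = {"__pycache__", ".git", ".pytest_cache"}  # assumed cache dirs

root = Path("skill-forge")
with zipfile.ZipFile("skill-forge.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in sorted(root.rglob("*")):
        if path.is_file() and not EXCLUDE & set(path.parts):
            # Relative arcname keeps skill-forge/ as the archive root entry.
            zf.write(path, arcname=path.as_posix())
```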
This repository is distributed under the MIT License.
Update the copyright holder in LICENSE if your organization requires a different owner or license.