Compiling and certifying agent skills with Mellea
Research preview — IBM Research, May 2026
Research preview (v0.1) — This is an early-stage research project from IBM Research. The APIs, CLI, and artifact formats are subject to change. We welcome feedback via Issues.
Coming soon (active development):
- Interactive dependency resolution during compile
- Export for additional agent harnesses — MCP, LangGraph, and Claude Code available today, all experimental
- Support for compiling non-.md agent skills
- Increased coverage for different interaction modalities (streaming, conversational session, scheduled, event-triggered)
Mellea Skills Compiler is a certification pipeline for AI agent skills. It takes a natural-language skill specification (a .md file) and produces a typed, instrumented program with policy-driven guardrails and auditable execution traces.
The pipeline composes three IBM Research technologies:
| Component | Role | Source |
|---|---|---|
| Mellea | Structured generative programs with typed schemas, validation, and hooks | Apache 2.0 |
| Granite Guardian | Runtime risk detection integrated via Mellea's hook system | Apache 2.0 |
| AI Atlas Nexus | Governance knowledge graph mapping use cases to risks across taxonomies | Apache 2.0 |
AI agents increasingly ship as natural-language specifications — Markdown files, YAML configs, system prompts — executed by LLMs without formal verification, runtime monitoring, or compliance documentation. The specification format is right for rapid development, but specifications alone don't guarantee reliable execution at scale.
Mellea Skills Compiler addresses three governance gaps:
- Specification opacity — When an LLM interprets a Markdown spec, contradictions are silently resolved through implicit judgement. Structured decomposition surfaces these conflicts as testable failures.
- Runtime unobservability — Agent outputs are typically unmonitored. Mellea Skills Compiler instruments every LLM generation with Guardian risk checks and JSONL audit trails.
- Compliance disconnect — Enterprise frameworks (NIST AI RMF, EU AI Act) require documented evidence of risk management. Mellea Skills Compiler maps governance requirements to runtime capabilities and produces evidence packages.
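As a rough illustration of the runtime-observability point above, the sketch below wraps a single generation with a risk check and appends one JSONL record per call. It is not the project's Guardian hook: `audited_call`, `RiskDetected`, the `enforce` flag, and the record fields are hypothetical names chosen for this example, and `generate` and `risk_check` are stand-ins for a model call and a Guardian verdict.

```python
import json
import time
from pathlib import Path
from typing import Callable


class RiskDetected(RuntimeError):
    """Raised in blocking mode when the risk check flags an output."""


def audited_call(
    generate: Callable[[str], str],
    risk_check: Callable[[str], bool],
    prompt: str,
    audit_path: Path,
    enforce: bool = False,
) -> str:
    """Run one generation, record the verdict, and optionally block on risk."""
    output = generate(prompt)
    flagged = risk_check(output)
    record = {  # hypothetical record layout, not the project's schema
        "timestamp": time.time(),
        "prompt": prompt,
        "flagged": flagged,
        "enforced": enforce,
    }
    # JSONL: one JSON object per line, appended so earlier records survive.
    with audit_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    if flagged and enforce:
        raise RiskDetected(f"risk check flagged output for prompt: {prompt!r}")
    return output
```

With `enforce=False` the call only records the verdict; with `enforce=True` a flagged output raises after the record has been written, so the audit file stays complete in both modes.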
Mellea Skills Compiler operates as a two-step user workflow — compile then certify:
SKILL.md / spec.md            COMPILE                         CERTIFY
Natural-language        →     mellea-fy                 →     AI Atlas Nexus → policy manifest
agent specification           spec → typed pipeline           Guardian hooks instrument runtime
                              contradictions surfaced         fixtures executed + audited
                                                              compliance classification + report
Step 1: Compile — A .md specification is decomposed into a typed Mellea pipeline package: Pydantic schemas, @generative extraction slots, requirement validators, and multi-phase orchestration code. Two compilation paths are available: the mellea-skills compile CLI command, or the /mellea-fy command inside Claude Code. See src/mellea_skills_compiler/examples/ for pre-compiled examples.
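To make the idea of a typed slot plus requirement validators concrete, here is a minimal sketch. It deliberately avoids Pydantic and Mellea so that it runs with the standard library alone; `WeatherQuery`, `REQUIREMENTS`, and `failed_requirements` are hypothetical names, and the two requirements are invented for illustration rather than taken from any compiled skill.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WeatherQuery:
    """Hypothetical typed slot that an extraction step would fill."""
    location: str
    days_ahead: int


# A requirement is a named predicate over the typed value.
Requirement = tuple[str, Callable[[WeatherQuery], bool]]

REQUIREMENTS: list[Requirement] = [
    ("location is non-empty", lambda q: bool(q.location.strip())),
    ("forecast horizon is 0 to 3 days", lambda q: 0 <= q.days_ahead <= 3),
]


def failed_requirements(query: WeatherQuery) -> list[str]:
    """Return the names of requirements the query violates."""
    return [name for name, check in REQUIREMENTS if not check(query)]
```

Because each requirement is named and checked separately, a violated rule shows up as an explicit, testable failure instead of being resolved silently inside a single prompt.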
Step 2: Certify — A single mellea-skills certify invocation performs end-to-end governance: AI Atlas Nexus identifies applicable risks from Granite Guardian, NIST AI RMF, and Credo UCF taxonomies and emits a PolicyManifest; Guardian hooks configured from that manifest monitor every m.instruct() call as fixtures execute; each governance requirement is classified as AUTOMATED, PARTIAL, or MANUAL based on runtime evidence; a compliance report and audit trail are written alongside the compiled pipeline.
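One way to picture the AUTOMATED / PARTIAL / MANUAL labels is as a function of how much runtime evidence a requirement has. The rule below is an assumption made for illustration, not the project's classification logic; `Coverage`, `classify`, and the check names are hypothetical.

```python
from enum import Enum


class Coverage(Enum):
    AUTOMATED = "AUTOMATED"
    PARTIAL = "PARTIAL"
    MANUAL = "MANUAL"


def classify(required_checks: set[str], observed_checks: set[str]) -> Coverage:
    """Classify one requirement by how many of its checks left runtime evidence."""
    if not required_checks:
        # Nothing can be checked at runtime, so a human must review it.
        return Coverage.MANUAL
    covered = required_checks & observed_checks
    if covered == required_checks:
        return Coverage.AUTOMATED
    if covered:
        return Coverage.PARTIAL
    return Coverage.MANUAL
```

For example, a requirement that needs both a harm check and a jailbreak check, where only the harm check appears in the audit trail, would come out as PARTIAL under this rule.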
- Claude Code is required to compile a Mellea skill. Install it by following the guide at https://code.claude.com/docs/en/quickstart
- Set the platform-specific environment variables needed to communicate with your Claude platform.
For example, Claude via a LiteLLM gateway requires the following environment variables:
export ANTHROPIC_BASE_URL=""
export ANTHROPIC_AUTH_TOKEN=""
or, if you have an ANTHROPIC_API_KEY:
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=""
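A small pre-flight check can catch a missing variable before a compile run fails. The sketch below assumes that either of the two variable groups shown above is sufficient; `CREDENTIAL_GROUPS` and `missing_claude_env` are hypothetical helpers, not part of the mellea-skills CLI.

```python
import os
from typing import Mapping

# Variable groups taken from the two examples above; either group is enough.
CREDENTIAL_GROUPS = (
    ("ANTHROPIC_BASE_URL", "ANTHROPIC_AUTH_TOKEN"),
    ("ANTHROPIC_API_KEY", "ANTHROPIC_BASE_URL"),
)


def missing_claude_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return [] if one group is fully set, else what the closest group lacks."""
    best = None
    for group in CREDENTIAL_GROUPS:
        # Empty strings count as unset.
        missing = [name for name in group if not env.get(name)]
        if not missing:
            return []
        if best is None or len(missing) < len(best):
            best = missing
    return best or []
```

Calling `missing_claude_env()` with no argument inspects the current process environment and returns the names that still need a value.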
Clone code repository
git clone https://github.com/generative-computing/mellea-skills-compiler
Create Python environment and install library
# Requires Python >=3.11, <3.14.4
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
Set Ollama API URL in the environment variables:
export OLLAMA_API_URL=<ollama-api-url>
Example skills: https://github.com/generative-computing/mellea-skills-compiler/tree/main/skills
We recommend downloading the Ollama models granite3.3:8b and ibm/granite3.3-guardian:8b beforehand, as they are set as defaults.
For Risk Identification
ollama pull granite3.3:8b
For Risk Assessment
ollama pull ibm/granite3.3-guardian:8b
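To confirm that both default models are present before running a certification, one option is to query Ollama's model listing endpoint (`GET /api/tags`). The sketch below assumes that OLLAMA_API_URL holds the server's base URL (for example `http://localhost:11434`); `missing_models` and `fetch_tags` are hypothetical helpers written for this example.

```python
import json
import urllib.request

REQUIRED_MODELS = ("granite3.3:8b", "ibm/granite3.3-guardian:8b")


def missing_models(tags_response: dict, required=REQUIRED_MODELS) -> list[str]:
    """Compare required model names against an Ollama /api/tags response body."""
    installed = {m.get("name") for m in tags_response.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url: str) -> dict:
    """GET <base_url>/api/tags, which lists the locally available models."""
    url = base_url.rstrip("/") + "/api/tags"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

A typical call would be `missing_models(fetch_tags(os.environ["OLLAMA_API_URL"]))` after `import os`; an empty list means nothing is left to pull.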
Start with the Mellea Skills Compiler Node.js interactive CLI, or skip to the next step to use the command-based CLI.
./mellea-skills-ui.sh
Compile a skill into a typed Mellea pipeline via the CLI:
mellea-skills compile <Your-local-path>/skills/weather/spec.md # if skill is a single spec file.
mellea-skills compile <Your-local-path>/skills/weather # if skill is a directory containing spec files
Compile uses Sonnet as the default Claude model. To use a different Claude model:
mellea-skills compile <Your-local-path>/skills/weather/spec.md --model aws/claude-opus-4-5
mellea-skills compile <Your-local-path>/skills/weather --model aws/claude-opus-4-5
Melleafy Repair: identify and correct errors in Mellea skill compilation
mellea-skills compile --repair-mode <Your-local-path>/skills/weather --model aws/claude-opus-4-5
Run /mellea-fy directly inside Claude Code:
./mellea-fy <Your-local-path>/skills/weather/spec.md
See mellea-fy/README.md for detailed usage of the Claude Code command.
Run the skill pipeline for a given fixture
mellea-skills run <Your-local-path>/skills/weather/weather_mellea --fixture rain_check # provide path to the compiled skill directory and the fixture name
mellea-skills run <Your-local-path>/skills/weather/weather_mellea --enforce # Block execution when Guardian detects risks (default: audit-only)
mellea-skills run <Your-local-path>/skills/weather/weather_mellea --no_guardian # Skip Guardian checks even if a policy manifest exists.
Run end-to-end certification (risk identification via AI Atlas Nexus, Guardian hook instrumentation, fixture execution, and compliance report) in a single command:
mellea-skills certify <Your-local-path>/skills/weather/weather_mellea # provide path to the compiled skill directory
mellea-skills certify <Your-local-path>/skills/weather/weather_mellea --enforce # Block on risk detection
mellea-skills certify <Your-local-path>/skills/weather/weather_mellea --fixture rain_check # Run specific fixture - rain_check
mellea-skills certify <Your-local-path>/skills/weather/weather_mellea --model granite3.3:8b --guardian-model ibm/granite3.3-guardian:8b --inference-engine ollama # Using different risk model, guardian model and inference engine
Export a compiled Mellea skill to a deployment target: langgraph, claude-code, or mcp
Note: This command is experimental. Output structure and CLI interface may change in future releases without a deprecation period.
mellea-skills export --target mcp <Your-local-path>/skills/weather/weather_mellea # Supported deployment target: mcp, langgraph, claude-code
mellea-skills export --target mcp --force <Your-local-path>/skills/weather/weather_mellea # '--force' overwrites output directory if it already exists.
All outputs are written to audit/ adjacent to the compiled directory:
skills/weather/audit/
├── policy_manifest.json # Policy manifest (risks + governance actions)
├── POLICY.md # Human-readable policy document
├── CERTIFICATION.md # Certification report with coverage summary
├── audit_trail.jsonl # Runtime Guardian verdicts
└── pipeline_report.json # Pipeline execution output
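Since audit_trail.jsonl holds one record per line, a few lines of Python are enough to tally it after a run. The record schema is not documented here, so the field name `verdict` below is a placeholder assumption; pass whichever field the records actually carry. `summarise_audit_trail` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def summarise_audit_trail(path: Path, field: str = "verdict") -> Counter:
    """Count the values of `field` across the records of a JSONL file."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Records without the field are grouped under "<missing>".
            counts[str(record.get(field, "<missing>"))] += 1
    return counts
```

For example, `summarise_audit_trail(Path("skills/weather/audit/audit_trail.jsonl"))` returns a counter keyed by the distinct values found in that field.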
The src/mellea_skills_compiler/examples/ directory contains pre-compiled, validated Mellea pipeline packages — runnable end-to-end against the project's Ollama + granite3.3:8b baseline. Each is a curated reference snapshot of what mellea-skills compile produces under the current architecture.
| Skill | Tier | Archetype | Description |
|---|---|---|---|
| weather | T1 | Fetch + summarise | Public no-auth HTTP to wttr.in; intent classification dispatches to one of seven URL templates |
| sentry-find-bugs | T1 / T2 | Structured analysis | Multi-phase OWASP review producing severity-classified findings; two stub helpers (search_fn, read_file_fn) for codebase-scanning fixtures |
| superpowers-systematic-debugging | T1 | Constrained reasoning | Four-phase debugging walk with hypothesis testing; fix_attempts_count >= 3 triggers architectural-issue branch |
| clawdefender | T3 | Adversarial classification | Prompt injection / SSRF / command injection / credential exfiltration detection; bundled scripts need chmod +x on Unix |
Each example includes the original spec.md (or SKILL.md), generated pipeline code, factory-shape fixtures, intermediate IR (config_emission.json, fixtures_emission.json, etc.), mapping_report.md, and melleafy.json manifest. See docs/README.md for the runnable tutorial that walks through each one and docs/FROM_STUBS_TO_RUNNING.md for the stub-implementation walkthrough.
The skills/ directory contains 16 skill specifications drawn from multiple sources (Sentry, Anthropic, community contributions, and IBM Research). Four of these ship as pre-compiled examples (see above); the rest can be compiled locally via mellea-skills compile skills/<name>/spec.md.
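To compile several specifications in one go, the per-skill command can be generated from the directory layout. This sketch only builds the argument lists and leaves execution to the caller; it assumes each skill lives in its own subdirectory with a spec.md (skills that ship a SKILL.md instead are not picked up), and `compile_commands` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def compile_commands(skills_root: Path, model: Optional[str] = None) -> list[list[str]]:
    """Build one `mellea-skills compile` argv per skill directory with a spec.md."""
    commands = []
    for spec in sorted(skills_root.glob("*/spec.md")):
        argv = ["mellea-skills", "compile", str(spec)]
        if model:
            argv += ["--model", model]
        commands.append(argv)
    return commands
```

Each returned list can be passed to `subprocess.run(argv, check=True)`; building the commands separately makes it easy to print and review them first.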
Skills are classified into three tiers by what's needed to run a fixture against the compiled package:
- T1 — Runs out of the box. No stubs, no external services, no credentials.
- T2 — Runs after filling 1–2 stubs or supplying a small bundled artifact.
- T3 — Requires external integration before any fixture completes (CLI tool, API key, OAuth, runtime helper).
See skills/README.md for the full per-skill tier table and source attribution.
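The tier definitions can be read as a simple decision rule. The sketch below is one interpretation, not the project's tooling: `skill_tier` and its parameters are hypothetical, and treating any number of stubs as T2 is an assumption, since the definitions only mention one or two.

```python
def skill_tier(
    stub_count: int,
    needs_bundled_artifact: bool,
    needs_external_integration: bool,
) -> str:
    """Map what a fixture needs onto the T1 / T2 / T3 tiers."""
    if needs_external_integration:
        # CLI tool, API key, OAuth, or runtime helper required first.
        return "T3"
    if stub_count > 0 or needs_bundled_artifact:
        return "T2"
    return "T1"
```

External integration is checked first because it blocks every fixture regardless of how many stubs are filled.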
src/mellea_skills_compiler/   # pip-installable package
  certification/              # Ingest → policy → compliance → certification report
  compile/                    # Compile Mellea skill specification into a Mellea pipeline using mellea-fy Claude command.
  guardian/                   # Granite Guardian hooks for Mellea pipelines
  toolkit/                    # Shared utilities and enums
  export/                     # Export a compiled Mellea skill to a deployment target
mellea-fy/                    # Claude Code /mellea-fy command definition
skills/                       # Skill specs, compiled pipelines, and fixtures
examples/                     # mellea-fy output examples and demos
tests/                        # Test suite
pytest -s tests
See tests/README.md for details.
Mellea Skills Compiler is an active research project. The current release demonstrates the core pipeline; several directions are in progress.
- Cross-model evaluation — We are developing a systematic comparison framework for how specification decomposition affects skill behaviour across model sizes and task types, capturing both correctness and predictability dimensions.
- Cost-benefit analysis — Decomposition increases LLM call count compared to monolithic execution. Quantifying the efficiency-governance tradeoff is part of the ongoing work.
- Compiler reflection loop: Currently, /mellea-fy is a single-pass compiler with no automated self-review. We are building a validate-and-repair cycle: generate, validate (syntax, imports, fixture execution), and repair broken files, applying the same reflection pattern the compiled pipelines already use internally.
- Modular compiler specification: The mellea-fy command spec is itself a large natural-language document. We are investigating decomposing it into smaller, independently-testable modules to improve consistency.
- Specification linting — Self-consistency analysis to detect contradictions in skill specs before compilation. Decomposition surfaces spec quality issues that monolithic execution can resolve silently; we are developing this into a standalone quality gate.
- Per-phase model routing — Decomposed pipelines enable routing each phase to a different model tier; classification and extraction phases tend to suit smaller models, while complex reasoning phases benefit from larger ones. The optimisation surface is being explored.
- Closed-loop repair — Feeding Guardian verdicts back into Mellea's existing repair loops, moving from "guardrails that flag" to "guardrails that fix."
- Ecosystem-scale governance — Applying the certification pipeline to skill registries at scale.
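The validate-and-repair cycle mentioned in the list above can be sketched in miniature. This version validates Python syntax only, using `ast.parse`, and delegates fixing to a caller-supplied `repair` function (in practice that would be a model call); import checks and fixture execution are left out. `syntax_error` and `validate_and_repair` are hypothetical names, and this is not the compiler's planned implementation.

```python
import ast
from typing import Callable, Optional


def syntax_error(source: str) -> Optional[str]:
    """Return a short error message if `source` is not valid Python, else None."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def validate_and_repair(
    source: str,
    repair: Callable[[str, str], str],
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Validate; on failure pass source and error to `repair`, up to max_rounds times."""
    for _ in range(max_rounds):
        error = syntax_error(source)
        if error is None:
            return source, True
        source = repair(source, error)
    # Report whether the last repair attempt produced valid code.
    return source, syntax_error(source) is None
```

Bounding the number of rounds keeps a repair step that never converges from looping forever, and the returned flag tells the caller whether the final source passed validation.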
- Research preview — APIs, CLI, and artifact formats may change
- Claude Code required for compilation: Both mellea-skills compile and /mellea-fy invoke Claude Code under the hood for specification decomposition
- Static compliance classification: YAML-based action-to-control mapping, not yet validated against ground truth
- Single domain evaluation — Certification pipeline has been tested primarily on security and utility skills
- Python version constraints — Python >=3.11 and <3.14.4 (ai-atlas-nexus requires 3.11+ and <3.14.4; Mellea supports 3.11+)
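The Python version bound in the list above (at least 3.11, below 3.14.4) can be checked before installation with a tuple comparison. `python_supported`, `MIN_VERSION`, and `MAX_VERSION` are hypothetical names for this sketch; the package's own metadata remains the authoritative constraint.

```python
import sys

MIN_VERSION = (3, 11)      # inclusive lower bound
MAX_VERSION = (3, 14, 4)   # exclusive upper bound


def python_supported(version: tuple = tuple(sys.version_info[:3])) -> bool:
    """True when MIN_VERSION <= version < MAX_VERSION, compared as tuples."""
    return MIN_VERSION <= tuple(version) < MAX_VERSION
```

Called without arguments, the function checks the running interpreter; passing a tuple such as `(3, 12, 1)` checks a specific version.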
This is a research preview. We welcome feedback, bug reports, and suggestions via Issues. If you're interested in contributing or collaborating, please open an issue to start the conversation.
Elizabeth M. Daly, Dhaval Salwala, Inge Vejsbjerg, Seshu Tirupathi, Rebecka Nordenlöw, Jessica He, Kush R. Varshney, and Jordan McAfoose — IBM Research
A technical report describing the system architecture, design rationale, and governance pipeline is included in this repository: docs/Mellea_Skills_Compiler-tech_report.pdf.
If you use Mellea Skills Compiler in your work, please cite:
@techreport{daly2026mellea,
title = {Mellea Skills Compiler: Compiling and Certifying Agent Skills with Mellea},
author = {Daly, Elizabeth M. and Salwala, Dhaval and Vejsbjerg, Inge and
Tirupathi, Seshu and Nordenl{\"o}w, Rebecka and He, Jessica and
Varshney, Kush R. and McAfoose, Jordan},
institution = {IBM Research},
year = {2026},
month = {May},
type = {Technical Report},
url = {https://github.com/generative-computing/mellea-skills-compiler/blob/main/docs/Mellea_Skills_Compiler-tech_report.pdf}
}
Apache 2.0. See LICENSE.