Make AI a software engineering discipline.
On the TravelPlanner benchmark (ICML 2024), LangChain passes 77.8% of tasks and CrewAI passes 73.3%. They burn 3–6× more tokens, cost 4–8× more per passing result, and lose track of instructions as context grows. GPT-4 alone scores 0.6%.
OpenSymbolicAI scores 97.9% on 1,000 tasks by splitting the LLM's job in two:
┌─────────────────────────────────────┐
│ Traditional Agent (ReAct) │
│ │
│ User ─→ LLM ─→ Tool ─→ LLM ─→ │
│ Tool ─→ LLM ─→ Tool ─→ │
│ LLM ─→ ... (loop forever) │
│ │
│ ⚠ Data in prompt = injection risk │
│ ⚠ Context bloats every iteration │
│ ⚠ LLM makes unplanned tool calls │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ OpenSymbolicAI (Plan + Execute) │
│ │
│ User ─→ LLM ─→ Plan │
│ ↓ │
│ Runtime executes plan │
│ deterministically │
│ │
│ ✓ Data never enters LLM context │
│ ✓ Fewer tokens, fewer LLM calls │
│ ✓ Every side effect is explicit │
└─────────────────────────────────────┘
The LLM plans. The runtime executes. Data stays in application memory and never gets tokenized.
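The split can be sketched in plain Python. This is a toy illustration of the plan-then-execute idea, not the library's API: the "LLM" emits a plan once, as data, and a deterministic runtime walks it while payloads stay in application memory.

```python
from dataclasses import dataclass

# Toy sketch of plan-then-execute (NOT the OpenSymbolicAI API):
# one planning call produces a plan as data; the runtime executes it
# deterministically with no further LLM involvement.

@dataclass
class Step:
    tool: str   # name of a registered primitive
    args: dict  # literal arguments chosen by the planner ("$prev" = prior result)

TOOLS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def plan(query: str) -> list[Step]:
    """Stand-in for a single LLM planning call."""
    return [
        Step("add", {"a": 2, "b": 3}),
        Step("mul", {"a": "$prev", "b": 4}),
    ]

def execute(steps: list[Step]):
    """Deterministic runtime: every side effect is an explicit step."""
    prev = None
    for step in steps:
        args = {k: (prev if v == "$prev" else v) for k, v in step.args.items()}
        prev = TOOLS[step.tool](**args)
    return prev

print(execute(plan("What is (2 + 3) * 4?")))  # → 20, with zero extra LLM calls
```

Because the plan is plain data, it can be logged, diffed, and replayed; the data flowing through the tools is never tokenized.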
Three blueprints for different problem shapes:
| Blueprint | Pattern | Use when |
|---|---|---|
| PlanExecute | Plan once, execute deterministically | Fixed sequence of steps (calculators, converters, simple QA) |
| DesignExecute | Plan with loops and conditionals | Dynamic-length data (shopping carts, batch processing) |
| GoalSeeking | Plan → execute → evaluate → repeat | Iterative problems (optimization, multi-hop research, deep research) |
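The GoalSeeking shape can be sketched as a plain loop. This is a toy stand-in, not the real blueprint: the "evaluator" scores each deterministic execution, and the "planner" adjusts before the next iteration (here, a Newton step toward a numeric target).

```python
# Toy plan -> execute -> evaluate -> repeat loop (NOT the GoalSeeking API).
# Each iteration: execute the current plan deterministically, evaluate the
# result against the goal, and replan only if the goal isn't met.

def execute(guess: float) -> float:
    return guess * guess  # deterministic tool execution

def evaluate(result: float, target: float) -> float:
    return target - result  # signed error feeds the next planning step

def goal_seek(target: float, max_iters: int = 50) -> float:
    guess = 1.0
    for _ in range(max_iters):
        result = execute(guess)
        error = evaluate(result, target)
        if abs(error) < 1e-9:
            break  # goal reached: stop iterating
        guess += error / (2 * guess)  # "replan" (a Newton step in this toy)
    return guess

print(goal_seek(2.0))  # converges to sqrt(2) ≈ 1.41421356
```

The point of the shape: each iteration's plan is inspectable before it runs, so even the iterative blueprint never degenerates into an opaque tool-calling loop.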
```bash
pip install opensymbolicai-core
```

Define primitives (what your agent can do) and decompositions (examples of how to use them). The LLM learns from your examples to plan new queries:
```python
from opensymbolicai import PlanExecute, primitive, decomposition

class Calculator(PlanExecute):
    @primitive(read_only=True)
    def add(self, a: float, b: float) -> float:
        return a + b

    @decomposition(
        intent="What is 2 + 3?",
        expanded_intent="Add the two numbers",
    )
    def _example(self) -> float:
        return self.add(a=2, b=3)
```

Every decomposition you add makes the agent better. This is the flywheel that prompt engineering doesn't have.
| Problem | How OpenSymbolicAI solves it |
|---|---|
| Prompt injection | Symbolic Firewall keeps data out of LLM context. Nothing to inject into |
| Unpredictable behavior | Execution is deterministic and fully traced. Even iterative agents (GoalSeeking) produce inspectable plans each step — no runaway tool-calling |
| High costs | Fewer LLM calls to plan, then pure code execution. No re-tokenizing on every step |
| Can't test or debug | Full execution traces, typed outputs (Pydantic), version-controlled behavior |
| Model lock-in | Model-agnostic. Swap providers without rewriting your agent |
| Language | Repo | Description |
|---|---|---|
| Python | core-py | Primitives, blueprints (PlanExecute, DesignExecute, GoalSeeking), multi-provider LLM abstraction |
| TypeScript | core-ts | TypeScript core SDK |
| Go | core-go | Go runtime with AST-based plan execution |
| C# / .NET | core-dotnet | .NET runtime |
| Repo | Description |
|---|---|
| examples-py | Python examples: RAG, multi-hop QA, deep research, unit converter, date calculator |
| examples-ts | TypeScript examples: RAG Agent, Date Agent, Unit Converter |
| cli-py | Interactive TUI for discovering and running agents |
| claude-skills | Claude Code skills for scaffolding agents, adding primitives/decompositions/evaluators, and debugging traces |
| Benchmark | Result | What it shows |
|---|---|---|
| TravelPlanner | 97.9% on 1,000 tasks — GPT-4 gets 0.6% | GoalSeeking two-stage. 100% hard constraint pass rate, 3.1× fewer tokens than LangChain. Blog post |
| MultiHopRAG | 82.9% — +7.9pp over previous best | GoalSeeking, 609 documents, 2,556 queries. Same result in Python, C# (83.8%), and Go (81.6%). Blog post |
| LegalBench | 93.1% across 162 legal reasoning tasks | GoalSeeking agent. 835 items, 0 errors, $1.88 total cost |
| FOLIO | 89.2% — outperforms GPT-4 CoT (78.1%) | PlanExecute + Z3 theorem prover. First-order logic reasoning |
Same model (gpt-oss-120b), same tools, same evaluation — only the framework differs:
```
                Pass Rate           Tokens/Task        Cost/Passing Task   LLM Calls/Task
                ─────────           ───────────        ─────────────────   ──────────────
OpenSymbolicAI  ████████████  100%  ██░░░░░░░  13,936  █░░░░░░░  $0.013    ██░░░░░░░  2.3
LangChain       █████████░░░ 77.8%  █████░░░░  43,801  ████░░░░  $0.051    ████████░ 13.5
CrewAI          ████████░░░░ 73.3%  █████████  81,331  ████████  $0.100    █████████ 39.6
```
7 models hit 100% pass rate — including Llama 3.3 70B at $0.006/task and 4.3s latency on Groq. The framework matters more than the model. See the full model landscape.
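The multipliers quoted in the intro follow from the head-to-head numbers above; a quick check (values copied from the table):

```python
# Ratios computed from the head-to-head table above.
osai      = {"tokens": 13_936, "cost": 0.013}
langchain = {"tokens": 43_801, "cost": 0.051}
crewai    = {"tokens": 81_331, "cost": 0.100}

print(f"LangChain tokens: {langchain['tokens'] / osai['tokens']:.1f}x")  # 3.1x
print(f"CrewAI tokens:    {crewai['tokens'] / osai['tokens']:.1f}x")     # 5.8x
print(f"LangChain cost:   {langchain['cost'] / osai['cost']:.1f}x")      # 3.9x
print(f"CrewAI cost:      {crewai['cost'] / osai['cost']:.1f}x")         # 7.7x
```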
- Getting Started - Build your first agent in 5 minutes
- The OpenSymbolicAI Manifesto - The philosophy behind the architecture
- The Anatomy of PlanExecute - Why the core blueprint is what it is
- Behaviour Programming vs. Tool Calling - Why executable examples beat massive prompts
- English, Spec, or Code - How you talk to the LLM decides how far you get
- LLM Attention Is Precious - A visual breakdown of token waste
- Secure by Design - How the Symbolic Firewall prevents prompt injection
- The Missing Flywheel in Agent Building - Why agents stay brittle and how to fix it
- Closing the Flywheel in Practice - Practical implementation of the flywheel
- Agent-to-Agent Is Just Function Calls - Multi-agent systems need typed interfaces, not new infrastructure
- Change Everything, Change Nothing - MultiHopRAG in Python and C# — accuracy moves by 0.9pp
- Third Language, Same Result - MultiHopRAG in Go — the framework holds across three languages
MIT