Make AI a software engineering discipline.
On the TravelPlanner benchmark (ICML 2024), LangChain passes 77.8% of tasks and CrewAI passes 73.3%. They burn 3–6× more tokens, cost 4–8× more per passing result, and lose track of instructions as context grows. GPT-4 alone scores 0.6%.
OpenSymbolicAI scores 97.9% on 1,000 tasks by splitting the LLM's job in two:
┌─────────────────────────────────────┐
│ Traditional Agent (ReAct) │
│ │
│ User ─→ LLM ─→ Tool ─→ LLM ─→ │
│ Tool ─→ LLM ─→ Tool ─→ │
│ LLM ─→ ... (loop forever) │
│ │
│ ⚠ Data in prompt = injection risk │
│ ⚠ Context bloats every iteration │
│ ⚠ LLM makes unplanned tool calls │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ OpenSymbolicAI (Plan + Execute) │
│ │
│ User ─→ LLM ─→ Plan │
│ ↓ │
│ Runtime executes plan │
│ deterministically │
│ │
│ ✓ Data never enters LLM context │
│ ✓ Fewer tokens, fewer LLM calls │
│ ✓ Every side effect is explicit │
└─────────────────────────────────────┘
The LLM plans. The runtime executes. Data stays in application memory and never gets tokenized.
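The split can be sketched in plain Python. This is a toy illustration of the plan-then-execute idea, not the library's API: the "LLM" emits a plan once, as data, and a deterministic runtime walks it while payloads stay in application memory.

```python
from dataclasses import dataclass

# Toy sketch of plan-then-execute (NOT the OpenSymbolicAI API):
# one planning call produces a plan as data; the runtime executes it
# deterministically with no further LLM involvement.

@dataclass
class Step:
    tool: str   # name of a registered primitive
    args: dict  # literal arguments chosen by the planner ("$prev" = prior result)

TOOLS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def plan(query: str) -> list[Step]:
    """Stand-in for a single LLM planning call."""
    return [
        Step("add", {"a": 2, "b": 3}),
        Step("mul", {"a": "$prev", "b": 4}),
    ]

def execute(steps: list[Step]):
    """Deterministic runtime: every side effect is an explicit step."""
    prev = None
    for step in steps:
        args = {k: (prev if v == "$prev" else v) for k, v in step.args.items()}
        prev = TOOLS[step.tool](**args)
    return prev

print(execute(plan("What is (2 + 3) * 4?")))  # → 20, with zero extra LLM calls
```

Because the plan is plain data, it can be logged, diffed, and replayed; the data flowing through the tools is never tokenized.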
Three blueprints for different problem shapes:
| Blueprint | Pattern | Use when |
|---|---|---|
| PlanExecute | Plan once, execute deterministically | Fixed sequence of steps (calculators, converters, simple QA) |
| DesignExecute | Plan with loops and conditionals | Dynamic-length data (shopping carts, batch processing) |
| GoalSeeking | Plan → execute → evaluate → repeat | Iterative problems (optimization, multi-hop research, deep research) |
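The GoalSeeking shape can be sketched as a plain loop. This is a toy stand-in, not the real blueprint: the "evaluator" scores each deterministic execution, and the "planner" adjusts before the next iteration (here, a Newton step toward a numeric target).

```python
# Toy plan -> execute -> evaluate -> repeat loop (NOT the GoalSeeking API).
# Each iteration: execute the current plan deterministically, evaluate the
# result against the goal, and replan only if the goal isn't met.

def execute(guess: float) -> float:
    return guess * guess  # deterministic tool execution

def evaluate(result: float, target: float) -> float:
    return target - result  # signed error feeds the next planning step

def goal_seek(target: float, max_iters: int = 50) -> float:
    guess = 1.0
    for _ in range(max_iters):
        result = execute(guess)
        error = evaluate(result, target)
        if abs(error) < 1e-9:
            break  # goal reached: stop iterating
        guess += error / (2 * guess)  # "replan" (a Newton step in this toy)
    return guess

print(goal_seek(2.0))  # converges to sqrt(2) ≈ 1.41421356
```

The point of the shape: each iteration's plan is inspectable before it runs, so even the iterative blueprint never degenerates into an opaque tool-calling loop.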
```bash
pip install opensymbolicai-core
```

Define primitives (what your agent can do) and decompositions (examples of how to use them). The LLM learns from your examples to plan new queries:
```python
from opensymbolicai import PlanExecute, primitive, decomposition

class Calculator(PlanExecute):
    @primitive(read_only=True)
    def add(self, a: float, b: float) -> float:
        return a + b

    @decomposition(
        intent="What is 2 + 3?",
        expanded_intent="Add the two numbers",
    )
    def _example(self) -> float:
        return self.add(a=2, b=3)
```

Every decomposition you add makes the agent better. This is the flywheel that prompt engineering doesn't have.
| Problem | How OpenSymbolicAI solves it |
|---|---|
| Prompt injection | Symbolic Firewall keeps data out of LLM context. Nothing to inject into |
| Unpredictable behavior | Execution is deterministic and fully traced. Even iterative agents (GoalSeeking) produce inspectable plans each step — no runaway tool-calling |
| High costs | Fewer LLM calls to plan, then pure code execution. No re-tokenizing on every step |
| Can't test or debug | Full execution traces, typed outputs (Pydantic), version-controlled behavior |
| Model lock-in | Model-agnostic. Swap providers without rewriting your agent |
| Language | Repo | Description |
|---|---|---|
| Python | core-py | Primitives, blueprints (PlanExecute, DesignExecute, GoalSeeking), multi-provider LLM abstraction |
| TypeScript | core-ts | TypeScript core SDK |
| Go | core-go | Go runtime with AST-based plan execution |
| C# / .NET | core-dotnet | .NET runtime |
| Repo | Description |
|---|---|
| examples-py | Python examples: RAG, multi-hop QA, deep research, unit converter, date calculator |
| examples-ts | TypeScript examples: RAG Agent, Date Agent, Unit Converter |
| cli-py | Interactive TUI for discovering and running agents |
| claude-skills | Claude Code skills for scaffolding agents, adding primitives/decompositions/evaluators, and debugging traces |
| Benchmark | Result | What it shows |
|---|---|---|
| TravelPlanner | 97.9% on 1,000 tasks — GPT-4 gets 0.6% | GoalSeeking two-stage. 100% hard constraint pass rate, 3.1× fewer tokens than LangChain. Blog post |
| MultiHopRAG | 82.9% — +7.9pp over previous best | GoalSeeking, 609 documents, 2,556 queries. Same result in Python, C# (83.8%), and Go (81.6%). Blog post |
| LegalBench | 93.1% across 162 legal reasoning tasks | GoalSeeking agent. 835 items, 0 errors, $1.88 total cost |
| FOLIO | 89.2% — outperforms GPT-4 CoT (78.1%) | PlanExecute + Z3 theorem prover. First-order logic reasoning |
Same model (gpt-oss-120b), same tools, same evaluation — only the framework differs:
```
                Pass Rate           Tokens/Task        Cost/Passing Task   LLM Calls/Task
                ─────────           ───────────        ─────────────────   ──────────────
OpenSymbolicAI  ████████████  100%  ██░░░░░░░  13,936  █░░░░░░░  $0.013    ██░░░░░░░  2.3
LangChain       █████████░░░ 77.8%  █████░░░░  43,801  ████░░░░  $0.051    ████████░ 13.5
CrewAI          ████████░░░░ 73.3%  █████████  81,331  ████████  $0.100    █████████ 39.6
```
7 models hit 100% pass rate — including Llama 3.3 70B at $0.006/task and 4.3s latency on Groq. The framework matters more than the model. See the full model landscape.
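The multipliers quoted in the intro follow from the head-to-head numbers above; a quick check (values copied from the table):

```python
# Ratios computed from the head-to-head table above.
osai      = {"tokens": 13_936, "cost": 0.013}
langchain = {"tokens": 43_801, "cost": 0.051}
crewai    = {"tokens": 81_331, "cost": 0.100}

print(f"LangChain tokens: {langchain['tokens'] / osai['tokens']:.1f}x")  # 3.1x
print(f"CrewAI tokens:    {crewai['tokens'] / osai['tokens']:.1f}x")     # 5.8x
print(f"LangChain cost:   {langchain['cost'] / osai['cost']:.1f}x")      # 3.9x
print(f"CrewAI cost:      {crewai['cost'] / osai['cost']:.1f}x")         # 7.7x
```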
- Getting Started - Build your first agent in 5 minutes
- The OpenSymbolicAI Manifesto - The philosophy behind the architecture
- The Anatomy of PlanExecute - Why the core blueprint is what it is
- Behaviour Programming vs. Tool Calling - Why executable examples beat massive prompts
- English, Spec, or Code - How you talk to the LLM decides how far you get
- LLM Attention Is Precious - A visual breakdown of token waste
- Secure by Design - How the Symbolic Firewall prevents prompt injection
- The Missing Flywheel in Agent Building - Why agents stay brittle and how to fix it
- Closing the Flywheel in Practice - Practical implementation of the flywheel
- Agent-to-Agent Is Just Function Calls - Multi-agent systems need typed interfaces, not new infrastructure
- Change Everything, Change Nothing - MultiHopRAG in Python and C# — accuracy moves by 0.9pp
- Third Language, Same Result - MultiHopRAG in Go — the framework holds across three languages
MIT