Stop prompting. Start engineering. A structured reference for taking AI agents into production.
A curated map of agentic AI systems — covering architectures, frameworks, memory, evaluation, and safety.
🌐 Live site: https://natnew.github.io/Awesome-Agentic-Engineering/
This is not a list of tools; we reject "tool list energy." It is a structured guide to building reliable, observable, production-grade agentic systems, with every entry evaluated against explicit engineering dimensions.
- 🧭 Thesis
- ⚖️ Architecture Decision Guide
- 🧩 Core Agentic Patterns
- 🏗️ Reference Architectures
- 🧠 Memory Systems
- 📊 Formal Evaluation Rubric
- Benchmark and Evidence Policy
- ⚙️ Orchestration Frameworks
- 📡 Protocols and Standards
- 🧠 Reasoning & Planning Models
- 🧪 Evaluation & Safety
- 🧠 Skills and Operating Principles
- 🚫 What NOT to Do
- 📊 Signals (How to Read This List)
- 🚀 Getting Started
- 🤝 Contributing
- 📌 Final Note
- 🌐 Browser and Desktop Agents
- 🎙 Voice Agents
- 🎨 Creative AI
- 💼 Customer Support and CRM Agents
- 🧠 Open-Source Models for Agents
- 📰 Newsletters and Communities
- 📚 Learning Resources
- ⚡ Fast-Moving Product Lists
| 📈 The Shift (Agentic systems are moving to) | 📉 The Challenge (Implementations suffer from) | 🎯 Our Focus (This repository prioritises) |
|---|---|---|
| • Stateful, multi-step reasoning<br>• Multi-agent collaboration & orchestration<br>• Feedback-driven learning loops<br>• Tool-augmented execution environments | • Fragility under iteration<br>• Poor observability & evaluation<br>• Weak memory & context management<br>• Limited safety & governance | • Reliability over novelty<br>• Evaluation over intuition<br>• Architecture over tooling<br>• Systems thinking over prompt engineering |
| If your task is... | Start with... | Escalate to... | Avoid... |
|---|---|---|---|
| bounded, tool-using, low-risk | single-agent + tools | typed state, retries | multi-agent teams |
| long-running, inspectable, enterprise | graph/workflow orchestration | approval gates, persistence | opaque emergent loops |
| open-ended research | planner/executor or supervisor | critique loops, memory | rigid pipelines only |
| high-reliability extraction | prompt chains + strict schemas | validator feedback loops | unconstrained conversational agents |
| complex parallel execution | modular multi-agent setups | shared workspace/memory | treating LLMs as deterministic |
These patterns underpin most production-grade agentic systems.
| Pattern | Description | Key Characteristic |
|---|---|---|
| Single-Agent + Tool Use | One reasoning loop with structured tool invocation | Suited to focused tasks with bounded scope |
| Supervisor / Router Agents | Central agent delegates tasks to specialised agents | Enables modularity and scalability |
| Multi-Agent Collaboration | Agents operate in parallel or sequence | Patterns: debate, critique, planning/execution split |
| Reflection / Critique Loops | Agents evaluate and refine their own outputs | Improves reliability over multiple iterations |
| Retrieval-Augmented Agents | External knowledge via vector search or APIs | Reduces hallucination and improves grounding |
| Event-Driven / Long-Running Agents | Persistent agents reacting to triggers over time | Requires memory, state, and orchestration |
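The first pattern in the table, Single-Agent + Tool Use, reduces to a small bounded loop. A minimal sketch, where the `fake_model` policy and the `add`/`upper` tools are illustrative stand-ins for an LLM call and real tools:

```python
# Minimal sketch of the Single-Agent + Tool Use pattern.
# The "model" is a stub; in production it would be an LLM call that
# returns either a structured tool invocation or a final answer.

TOOLS = {
    "add": lambda a, b: a + b,        # illustrative tools
    "upper": lambda s: s.upper(),
}

def fake_model(state):
    """Stub policy: call 'add' once, then finish."""
    if not state["tool_results"]:
        return {"tool": "add", "args": (2, 3)}
    return {"final": f"result={state['tool_results'][-1]}"}

def run_agent(task, model, max_steps=5):
    state = {"task": task, "tool_results": []}
    for _ in range(max_steps):                # bounded loop, never unbounded
        action = model(state)
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](*action["args"])
        state["tool_results"].append(result)  # structured tool result
    raise RuntimeError("step budget exhausted")

print(run_agent("add 2 and 3", fake_model))  # → result=5
```

The step budget and the explicit `state` dict are the point: even the simplest pattern should bound its loop and keep its state inspectable.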
Representative system designs for real-world use.
| Architecture | Ecosystem Maturity | Description | Architectural Strengths | Operational Constraints | Workload Suitability | Design Paradigm | Governance Fit |
|---|---|---|---|---|---|---|---|
| DeerFlow | Emerging | Is: Open-source orchestration system combining sub-agents, memory, and sandboxes. Demonstrates: Workflow-oriented orchestration across agents with shared execution context. | Strong system-level reference for memory, sandbox, and skills composition. | Higher setup complexity and a heavier runtime surface than most teams need initially. | Strong fit for compound research/coding workflows and teams studying full-stack agent architectures. Poor fit for lightweight orchestration or narrowly scoped tasks. | Hierarchical multi-agent orchestration. | Requires explicit sandbox policy, tool boundaries, and operator oversight before untrusted code execution. |
| SWE-agent | Experimental | Is: Autonomous SWE system using a specialized Agent-Computer Interface (ACI). Demonstrates: Narrow action spaces and interface design tuned for code-repair tasks. | Streamlined command space, compressed history handling, and a clear task boundary for patch workflows. | Benchmark-oriented design, high token cost, and long end-to-end fix latency on larger tasks. | Strong fit for isolated PRs and self-contained bug fixes. Poor fit for broad refactors or environments without standard build tooling. | Single agent with a highly specialized action space (ACI). | Needs tight repository scoping, review gates, and execution controls to reduce silent code regressions. |
Last reviewed: April 2026.
Memory is a first-class concern in agentic systems. Rather than treating memory as a simple array of previous messages, production systems require structured approaches to state, persistence, and retrieval. The four categories below (working, episodic, procedural, semantic) are load-bearing architectural choices, not implementation details; decide how each is handled before picking a vendor.
Different types of memory serve distinct functional roles in an agentic architecture:
| Type | Definition | Implementation Examples |
|---|---|---|
| Working Memory (Thread State) | Short-term context for the current execution loop or active conversation thread. Ephemeral. | Context window, LangGraph State, in-memory message lists. |
| Episodic Memory | Autobiographical history of past actions, inputs, and outcomes. Enables reflection on past mistakes. | Checkpoint logs, event stores, prompt / trajectory histories. |
| Procedural Memory | Reusable skills, system prompts, and tool configurations. Defines how the agent operates. | Static configuration, retrieved skill libraries, GitHub workflows. |
| Semantic Memory | Embedded, factual knowledge about the world, the user, or the domain. Defines what the agent knows. | Vector databases (FAISS, Pinecone), knowledge graphs, Letta core memory. |
In multi-agent systems, memory boundaries are architectural decisions:
- Private Agent Memory: Each agent maintains its own semantic and episodic stores. Prevents context leakage and maintains strong role boundaries.
- Shared Workspace (Global Memory): A common blackboard or shared state where multiple agents read and write. Requires collision management and strict typing.
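A minimal sketch of a shared workspace with strict typing and collision management, as described above. The `SharedWorkspace` class, its key-ownership rule, and the schema are illustrative assumptions, not any framework's API:

```python
# Sketch of a shared workspace ("blackboard") with strict typing and
# simple collision management: each key is owned by the agent that
# first wrote it, and every write is type-checked against a schema.

class SharedWorkspace:
    def __init__(self, schema):
        self.schema = schema   # key -> expected type
        self.data = {}
        self.owners = {}

    def write(self, agent, key, value):
        if not isinstance(value, self.schema[key]):
            raise TypeError(f"{key} must be {self.schema[key].__name__}")
        if self.owners.get(key, agent) != agent:
            raise PermissionError(f"{key} is owned by {self.owners[key]}")
        self.owners[key] = agent
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

ws = SharedWorkspace({"plan": str, "findings": list})
ws.write("planner", "plan", "1. search 2. summarise")
ws.write("researcher", "findings", ["fact A"])
print(ws.read("plan"))
```

Ownership-per-key is one of several collision policies; versioned writes or last-writer-wins with audit logs are equally valid, as long as the policy is explicit.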
Managing the memory lifecycle is critical for long-running agents.
| Mechanism | Description | Best Practices & Risks |
|---|---|---|
| Checkpointing | Saving the exact thread state at a specific point in time (e.g., node transitions). | Enables "time travel" (rewind and replay) and human-in-the-loop approvals. |
| Write Policies | Rules defining when and how an agent commits data to long-term storage. | Prefer explicit SaveMemory tool calls over passive auto-saving to maintain control. |
| Retrieval Triggers | Determining when to query past memory (e.g., pre-fetch vs. just-in-time). | Use vector search for semantic recall, but use explicit graph keys for structured state. |
| Summarisation / Compression | Reducing token counts of episodic histories. | Summarise older interactions into a rolling summary while preserving recent exact messages. |
| Pruning / Decay | Deleting or archiving old or irrelevant memories. | Implement TTL (time-to-live) for working memory to prevent context exhaustion. |
| Contamination / Poisoning | Malicious or incorrect data persisting in long-term memory. | Risk: Once poisoned, an agent's future logic breaks. Require validation or bounds on semantic writes. |
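Two of the lifecycle mechanisms above, rolling summarisation and TTL-based pruning, can be sketched in a few lines. The `summarise` stub stands in for an LLM summarisation call; the function names are illustrative:

```python
import time

def summarise(messages):
    """Stub; in production this would be an LLM summarisation call."""
    return f"[summary of {len(messages)} earlier messages]"

def compress_history(messages, keep_recent=4):
    """Fold everything but the most recent messages into one rolling summary."""
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarise(older)] + recent

def prune_expired(memory, ttl_seconds, now=None):
    """Drop working-memory entries older than the TTL.
    `memory` maps key -> (value, write_timestamp)."""
    now = time.time() if now is None else now
    return {k: (v, ts) for k, (v, ts) in memory.items()
            if now - ts < ttl_seconds}

history = [f"msg {i}" for i in range(10)]
print(compress_history(history))
# keeps one summary line plus the 4 most recent exact messages
```

Note that both mechanisms are lossy; pair them with episodic logs (checkpoints, event stores) if you ever need to reconstruct what actually happened.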
Specialised infrastructure for managing agent memory.
| System | Role | Description |
|---|---|---|
| LangGraph Persistence | Thread-level state | Built-in checkpointers (SQLite, Postgres) for DAG-based execution loops, enabling interrupt/resume. |
| LangMem | Long-term memory extraction | LangChain's framework for extracting user preferences and entity profiles in the background. |
| Letta (formerly MemGPT) | OS-level memory abstraction | Advanced core memory management with explicit paging (read/write limits) to mimic virtual memory. |
| Mem0 | Personalized memory layer | Managed memory API focusing on user contexts, interactions, and entity relationships. |
| Zep / Graphiti | Enterprise memory & graphs | Fast, long-term memory for AI assistants; uses temporal knowledge graphs to map entity relationships over time. |
| MCP (Model Context Protocol) | Interoperability fabric | While not a DB itself, MCP provides a standard protocol to expose memory stores and file systems universally across tools and agents. |
Every major framework and architecture in this repository is judged against the following Required Scoring Dimensions. We evaluate systems based on engineering rigor, not marketing copy.
| Dimension | Evaluation Criteria |
|---|---|
| Control flow explicitness | How observable and deterministic is the execution path? |
| State model | How is agent state typed, managed, and persisted? |
| Memory support | Are there built-in primitives for short-term, episodic, and semantic memory? |
| Observability / tracing | Is it easy to trace intermediate reasoning steps and tool calls? |
| Human-in-the-loop support | Does it natively support interrupt-and-resume or approval gates? |
| Type safety / structured outputs | Are outputs guaranteed against strict schemas? |
| Provider portability | How tightly coupled is it to one specific LLM provider? |
| Security posture | Are there built-in mechanisms for sandboxing, access control, or guardrails? |
| Architectural strengths | Which design choices materially improve decomposition, control, state handling, or interface clarity? |
| Operational constraints | What deployment burden, runtime cost, debugging friction, or failure modes does it introduce? |
| Ecosystem maturity | How stable are the APIs, docs, integrations, and operator knowledge base? |
| Governance fit | Does it support auditability, approval gates, access boundaries, policy enforcement, and regulated environments? |
| Workload suitability | Which workflows, task shapes, and team contexts does it fit well or poorly? |
Canonical resources are trusted here because they define what counts as evidence. Prefer official docs, architecture guides, papers, benchmark repos, and first-party repositories when establishing capabilities, methodology, or interface details.
| Evidence Tag | Use For |
|---|---|
| [official] | Official docs, architecture guides, specifications, benchmark documentation, or first-party repositories. |
| [benchmark] | Published benchmark runs, evaluation papers, or benchmark repos tied to a named workload. |
| [field report] | Production write-ups, incident reports, engineering blogs, or operator notes about real deployments. |
| [author assessment] | This repository's synthesis after reviewing the sources above and applying the rubric. |
- Do not treat marketing copy, launch-day demos, or GitHub stars as sufficient evidence for production claims.
- Separate benchmark performance from production maturity. A benchmark result can support workload fit, but it does not by itself prove reliability, governance fit, cost control, or operational maturity.
- Record `Last reviewed: Month YYYY` in rapidly changing sections such as product lists, vendor capability summaries, and release-sensitive guidance.
- See appendix/benchmark-and-evidence-policy.md for the full policy.
Last reviewed: April 2026.
Evidence tags follow the Benchmark and Evidence Policy. Scored against RUBRIC.md; cap of 5–8 deep-dive entries enforced.
| Framework | Ecosystem Maturity | Description | Architectural Strengths | Operational Constraints | Workload Suitability | Design Paradigm | Governance Fit | Evidence |
|---|---|---|---|---|---|---|---|---|
| LangGraph | Production-ready | Is: Stateful orchestration framework building directed graphs with typed state. Demonstrates: Deterministic execution control mixed with LLM reasoning. | Explicit state management, persistence, and support for complex multi-actor workflows. | Verbose abstractions, steep learning curve, and graph sprawl if the workflow is over-modeled. | Strong fit for multi-step, stateful, and interruptible agent systems. Poor fit for simple single-prompt completions or linear chains. | DAG-based state machine. | Good fit for auditable workflows and approval gates, but graph edges must be tightly constrained to avoid runaway loops. | [official] docs · [field report] LinkedIn SQL Bot |
| Microsoft Agent Framework | Production-ready | Is: Microsoft's unified agent framework merging Semantic Kernel and AutoGen; first-class MCP and A2A support. Demonstrates: Enterprise-grade agent composition with typed plugins, approval workflows, and Azure integration. | Strong .NET + Python parity, typed function-calling, native MCP/A2A, and OpenTelemetry tracing. | Broader Azure coupling in the managed path; framework surface is still stabilizing post-merger. | Strong fit for enterprise teams already on Azure / Semantic Kernel and needing multi-language agents. Poor fit for teams wanting a minimal Python-only stack. | Typed plugin graph with pluggable orchestration (sequential, group chat, handoff). | Strong — supports approval gates, policy plugins, and audit logging out of the box. | [official] repo · [official] announce |
| AutoGen | Production-ready | Is: Microsoft Research multi-agent conversation framework; now an orchestration pattern inside Microsoft Agent Framework. Demonstrates: Conversable agents with group chat, code-executor, and human-proxy patterns. | Battle-tested multi-agent conversation patterns, large research footprint, flexible role composition. | Emergent conversation loops need explicit termination conditions; observability requires added tooling. | Strong fit for research on multi-agent collaboration and code-gen crews. Poor fit for strictly deterministic workflows. | Conversational multi-agent loop with configurable managers. | Needs explicit stop conditions and sandboxed code execution to be safe in production. | [official] v0.4 docs · [benchmark] AutoGen paper |
| OpenAI Agents SDK | Production-ready | Is: OpenAI's official agents SDK with handoffs, guardrails, and sessions; successor path to Assistants API. Demonstrates: First-party multi-step agents with tool-use, tracing, and structured handoffs. | Tight integration with OpenAI tools, built-in tracing, ergonomic Python API, provider-agnostic via LiteLLM. | Primary optimization target is OpenAI models; porting to other providers loses some ergonomics. | Strong fit for teams shipping OpenAI-backed agents quickly with tracing. Poor fit for strict provider portability or local-only models. | Handoff-based multi-agent loop with sessions. | Viable for hosted approval flows; guardrails are first-class primitives. | [official] docs · [official] repo |
| CrewAI | Emerging | Is: Multi-agent collaboration framework where agents are assigned roles, goals, and tools. Demonstrates: Role-based agentic workflows with sequential and hierarchical processes. | Simple mental model and fast team-based decomposition for prototypes; growing enterprise feature set. | Less control for highly complex or non-standard systems; observability and typed state are weaker than LangGraph/MAF. | Strong fit for rapid prototyping of agent teams. Poor fit for deterministic execution, rigorous type safety, or custom orchestration loops. | Role-based sequential or hierarchical process execution. | Requires added guardrails and observability to manage emergent loops and inconsistent agent behaviour. | [official] docs · [field report] case studies |
| Pydantic AI | Production-ready | Is: Framework built directly on Pydantic enforcing strict data validation and type-safe outputs from LLMs. Demonstrates: Type-driven agentic execution and dependency injection. | Strong type-system integration, schema enforcement, dependency injection, and retry support. | Smaller surrounding ecosystem than older orchestration stacks; retry loops can increase latency and cost. | Strong fit for production systems needing strict type safety and predictable parsing. Poor fit for open-ended generative writing or weakly structured tasks. | Strongly typed, schema-first LLM interactions. | Good fit where schema validation and dependency control matter, but retry policies need explicit cost and failure bounds. | [official] docs |
| Smolagents | Emerging | Is: Minimalist framework using CodeAgents (Python logic code generation over JSON calling). Demonstrates: Code-first model execution bounds. | Lightweight core and direct execution model that stays close to Python control flow. | Weak typed-state enforcement and high exposure if generated code runs with broad permissions. | Strong fit for fast prototyping and Python-native experimentation. Poor fit for regulated networks or systems that need strict sandboxing and observability. | Python-native logic execution via LLM generation. | Requires strong sandboxing, network controls, and review boundaries before production use. | [official] docs |
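The schema-first pattern that Pydantic AI formalises can be illustrated with the standard library alone: validate every model output against a strict schema and re-prompt a bounded number of times on failure. `SCHEMA`, `flaky_llm`, and the retry bound below are illustrative assumptions, not Pydantic AI's API:

```python
import json

SCHEMA = {"name": str, "age": int}  # illustrative target schema

def validate(raw):
    """Parse JSON and enforce the schema, raising on any violation."""
    data = json.loads(raw)
    for key, typ in SCHEMA.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"field {key!r} must be {typ.__name__}")
    return data

def call_with_retries(llm, prompt, max_retries=2):
    """Bounded retry loop: never loop on validation failures forever."""
    last_err = None
    for attempt in range(max_retries + 1):
        try:
            return validate(llm(prompt, attempt))
        except ValueError as err:       # json.JSONDecodeError is a ValueError
            last_err = err              # in practice, feed back into the prompt
    raise RuntimeError(f"schema never satisfied: {last_err}")

def flaky_llm(prompt, attempt):
    """Stub model: first reply is malformed, second conforms."""
    return '{"name": "Ada"}' if attempt == 0 else '{"name": "Ada", "age": 36}'

print(call_with_retries(flaky_llm, "extract the user"))
```

The explicit `max_retries` bound is the operational point from the table above: retries buy correctness at the cost of latency and tokens, so the budget must be a deliberate number, not an open loop.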
Broader catalog beyond the deep-dive set. Each subsection capped at 8 entries; entries that cannot clear the rubric were removed in this phase (see PR body for cut list).
| Framework | Lang | Description | Evidence |
|---|---|---|---|
| LangChain | Py/JS | Modular framework with chains, tools, memory, and broad integration coverage. | [official] |
| LangGraph | Py/JS | Graph-based orchestration. Stateful typed-state graphs with checkpointing. | [official] |
| LlamaIndex | Py/JS | Data-centric framework for retrieval-heavy and RAG-oriented agent systems. | [official] |
| Haystack | Py | Pipeline-based framework for search, retrieval, and hybrid agent workflows. | [official] |
| Semantic Kernel | C#/Py/Java | Microsoft enterprise kernel; now a composable layer inside Microsoft Agent Framework. | [official] |
| Microsoft Agent Framework | Py/.NET | Microsoft's unified agent framework merging Semantic Kernel and AutoGen; first-class MCP and A2A support. | [official] |
| Pydantic AI | Py | Type-safe, Pydantic-native; schema-first LLM interactions with dependency injection. | [official] |
| DSPy | Py | Stanford. Programming not prompting; compiler optimizes prompts against metrics. | [official] · [benchmark] |
| Framework | Lang | Description | Evidence |
|---|---|---|---|
| AutoGen | Py | Microsoft Research multi-agent conversations; v0.4 redesigned for async event-driven execution. | [official] · [benchmark] |
| CrewAI | Py | Role-based crew members with goals, tools, and sequential/hierarchical processes. | [official] |
| OpenAI Agents SDK | Py | Official OpenAI multi-step agents with handoffs, guardrails, sessions, and tracing. | [official] |
| Google ADK | Py | Native Gemini multi-agent orchestration; deploys to Vertex AI Agent Engine. | [official] |
| MetaGPT | Py | PM / architect / engineer roles simulating a software company; research-oriented. | [official] · [benchmark] |
| CAMEL | Py | Role-based simulation and collaborative reasoning research framework. | [official] · [benchmark] |
| DeerFlow | Py | ByteDance orchestration system for planning, tools, memory, and execution. | [official] |
| AgentScope | Py | Alibaba multi-agent framework with message-passing runtime and distributed mode. | [official] |
| Framework | Lang | Description | Evidence |
|---|---|---|---|
| Smolagents | Py | HuggingFace minimal agents (~1000 lines); code-action agents with sandboxed execution. | [official] |
| Agno | Py | Lightweight, model-agnostic agent framework with native multi-modal support. | [official] |
| Upsonic | Py | MCP-first framework with minimal setup and typed task graphs. | [official] |
| Portia AI | Py | Plan-based agent framework aimed at reliable production deployments with approval gates. | [official] |
| Mastra | TS | TypeScript-first framework with observability, workflows, and memory. | [official] |
Last reviewed: April 2026.
Protocols are the stable contracts between agents, tools, and hosts. Each entry below distinguishes the specification from any specific implementation; conflating the two is a recurring anti-pattern (see ANTI-PATTERNS.md).
| Protocol | Kind | Description | Evidence |
|---|---|---|---|
| MCP (Model Context Protocol) | Open spec | Anthropic-authored open standard for exposing tools, resources, prompts, and sampling to LLM hosts; wide multi-vendor adoption in 2025–2026. | [official] spec |
| A2A (Agent2Agent) | Open spec | Google-originated, Linux Foundation–hosted protocol for secure cross-agent communication across vendors and frameworks. | [official] spec |
| OpenAI Function / Tool Calling | Vendor API | Native structured tool invocation for OpenAI models; JSON-schema-typed tool definitions. | [official] |
| Anthropic Tool Use | Vendor API | Native structured tool invocation for Claude models; supports parallel tool calls and computer-use tools. | [official] |
| OpenAPI | Open spec | Industry-standard HTTP API specification; foundation for typed, discoverable tool surfaces behind MCP or direct function-calling. | [official] |
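As a concrete example of a typed, discoverable tool surface, here is a hypothetical `get_weather` tool written in the JSON-schema style used by OpenAI-compatible function calling. The tool itself is invented; the outer `type`/`function`/`parameters` shape follows the documented format:

```python
import json

# Hypothetical tool definition: the "parameters" field is standard
# JSON Schema, which is what makes the tool surface typed and discoverable.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

print(json.dumps(get_weather_tool, indent=2))
```

The same JSON Schema core is what OpenAPI uses for request/response bodies and what MCP servers expose for tool inputs, which is why the table above treats these specs as one layered stack rather than competitors.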
Last reviewed: April 2026.
Models that do explicit reasoning or planning at inference time — chain-of-thought baked into the decoding loop, extended thinking budgets, or trained planner heads. They change the shape of agent loops: the model absorbs work that used to live in a planner node, which shifts where you spend tokens, latency, and trust. Cap of 5–8 entries; selected for agentic relevance, not general benchmark wins.
| Model | Provider | Reasoning Mode | Why it matters for agents | Evidence |
|---|---|---|---|---|
| OpenAI o3 / o4-mini | OpenAI | Internal long chain-of-thought with tool use during reasoning | First widely available tool-using reasoning models; plan, call tools, and verify inside one model call. | [official] launch · [benchmark] system card |
| Claude Opus 4 / Sonnet 4 (extended thinking) | Anthropic | Configurable thinking-token budget, interleaved with tool calls | Agent-friendly thinking budget; predictable latency/cost dials for planner-heavy tasks. | [official] docs · [field report] Claude 4 post |
| Gemini 2.5 Pro (Deep Think) | Google DeepMind | Parallel hypothesis exploration before answering | Strong on multi-step math, code, and agent planning; native long-context lets planners keep full trajectories in-window. | [official] blog · [benchmark] model card |
| DeepSeek-R1 | DeepSeek | RL-trained reasoning traces, open weights | First strong open-weight reasoning model; reproducible baseline for planner research and local agent stacks. | [official] repo · [benchmark] paper |
| Qwen3 (thinking mode) | Alibaba | Switchable thinking / non-thinking modes | Open-weight family with explicit thinking toggle — useful when you want the same model in both planner and actor roles. | [official] repo · [benchmark] tech report |
| Grok 4 | xAI | Native reasoning with tool use | Aggressive frontier-reasoning entrant; useful as a diversity source in multi-model planner ensembles. | [official] page |
Decision guide: if your agent loop already does explicit plan → act → verify steps, a reasoning model can often replace the planner node — but it rarely removes the need for typed state, tracing, and eval. Treat reasoning as a cheaper planner, not a free reliability upgrade.
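The trade-off reads cleanly in code: in a plan → act → verify loop, swapping in a reasoning model changes only the planner function, while typed state, tracing, and verification remain your responsibility. All three stages below are stubs with illustrative names:

```python
def plan(task, trace):
    """Planner node; a reasoning model would replace just this call."""
    trace.append(("plan", task))
    return ["fetch", "check"]

def act(step, trace):
    """Executor; unchanged by the planner swap."""
    trace.append(("act", step))
    return f"done:{step}"

def verify(results, trace):
    """Verification stays even when the planner gets smarter."""
    ok = all(r.startswith("done:") for r in results)
    trace.append(("verify", ok))
    return ok

def run(task):
    trace = []                      # tracing survives the planner swap
    steps = plan(task, trace)
    results = [act(s, trace) for s in steps]
    if not verify(results, trace):
        raise RuntimeError("verification failed")
    return results, trace

results, trace = run("summarise repo")
print(results)
```

If the reasoning model absorbs `plan`, the trace and the `verify` gate are what keep the loop observable and bounded; removing them because "the model reasons now" is the failure mode the decision guide warns against.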
Last reviewed: April 2026.
This section covers frameworks and operational tooling for testing agent quality, correctness, task completion, regressions, and system behaviour, as well as security scanning, red teaming, policy testing, and misalignment research. Evidence tags follow the Benchmark and Evidence Policy.
- Output correctness
- Reasoning quality
- Tool-use accuracy
- Latency and cost
- Robustness under adversarial input
| Framework | Description | Methodology / Workload Suitability | Evidence |
|---|---|---|---|
| OpenAI Evals | Core framework for testing and improving AI systems. | Foundational evaluation framework and methodology. | [official] |
| DeepEval | Open-source LLM evaluation framework with metrics for hallucination, answer relevance, and task completion. | Application-level evaluation and regression testing. | [official] |
| promptfoo | CLI and library for evaluation and red teaming of LLM apps. | Regression testing, prompt/application evals, adversarial testing. | [official] |
| Inspect | UK AI Security Institute's framework for rigorous LLM evals covering coding, reasoning, agent behavior, and model-graded scoring. | Rigorous research-grade and agent-task evaluation. | [official] · [benchmark] |
| Azure AI Evaluation SDK | Azure Foundry evaluation SDK with built-in agent, safety, and quality evaluators. | Enterprise agent evaluation tied to Foundry tracing. | [official] |
- Golden datasets
- Regression testing
- Adversarial / red-team inputs
- Continuous evaluation pipelines
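The first two items above, golden datasets and regression testing, can be as small as this sketch. The `agent` stub, dataset, and `BASELINE` value are illustrative:

```python
# Golden-dataset regression check: run the agent over fixed cases,
# score exact-match accuracy, and fail if it drops below the baseline.

GOLDEN = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def agent(query):
    """Stub; in reality this would be a full agent run."""
    return {"2+2": "4", "capital of France": "Paris"}.get(query, "")

def evaluate(agent_fn, dataset):
    hits = sum(agent_fn(case["input"]) == case["expected"] for case in dataset)
    return hits / len(dataset)

BASELINE = 1.0  # recorded from the previous release

score = evaluate(agent, GOLDEN)
assert score >= BASELINE, f"regression: {score:.2f} < {BASELINE:.2f}"
print(f"accuracy {score:.2f}")
```

Exact match is the simplest scorer; real pipelines typically layer in model-graded or rubric-based scoring, but the fail-on-regression structure stays the same.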
| Tool | Description |
|---|---|
| Langfuse | OSS LLM observability. Traces, evals, prompts. |
| LangSmith | LangChain platform. Tracing, testing, evaluation. |
| Braintrust | Eval-driven development. Experiment tracking. |
| Arize Phoenix | OSS AI observability. Traces, evals, embeddings. |
| Helicone | OSS LLM observability. One-line integration. |
| Weights and Biases Weave | Trace and evaluate LLM apps. |
| Benchmark | Description | Evidence |
|---|---|---|
| SWE-bench | Coding-agent benchmark grounded in real GitHub issues and patches; Verified subset is the canonical agent workload. | [official] · [benchmark] |
| AgentBench | 8-environment LLM agent benchmark covering OS, DB, web, and game tasks. | [official] · [benchmark] |
| Terminal-Bench | Evaluates terminal-agent execution on shell-based tasks with scored task completions. | [official] · [benchmark] |
| GAIA | General AI assistant benchmark with real-world multi-step tasks and tool use. | [official] · [benchmark] |
| WebArena / VisualWebArena | Web agent benchmark on real-website snapshots; visual variant tests multimodal web agents. | [official] · [benchmark] |
| τ-bench | Tool-use + user-simulation benchmark measuring agent reliability and consistency across trials. | [official] · [benchmark] |
| ⚠️ Common Risks | 🛡️ Mitigation Strategies |
|---|---|
| Prompt injection (direct & indirect) | Input validation and filtering |
| Tool misuse | Tool permissioning and sandboxing |
| Data exfiltration | Human-in-the-loop approval gates |
| Memory poisoning | Audit logs and traceability |
| Unbounded autonomous behaviour | Policy-driven execution |
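Several of these mitigations compose naturally: an allow-list per agent, an approval gate for high-risk tools, and an audit log. A sketch with illustrative names (`approval_gate` stands in for a real human review step):

```python
AUDIT_LOG = []  # append-only trail: every decision is recorded

def approval_gate(tool, args):
    """Stand-in for human-in-the-loop review; auto-denies the destructive tool."""
    return tool != "delete_records"

def guarded_call(agent, tool, args, allowed, high_risk, tools):
    if tool not in allowed:                       # tool permissioning
        AUDIT_LOG.append((agent, tool, "denied: not permitted"))
        raise PermissionError(f"{agent} may not call {tool}")
    if tool in high_risk and not approval_gate(tool, args):
        AUDIT_LOG.append((agent, tool, "denied: approval refused"))
        raise PermissionError(f"approval refused for {tool}")
    AUDIT_LOG.append((agent, tool, "allowed"))    # audit every allowed call too
    return tools[tool](*args)

tools = {"search": lambda q: f"results for {q}",
         "delete_records": lambda ids: f"deleted {ids}"}

print(guarded_call("researcher", "search", ("MCP spec",),
                   allowed={"search"}, high_risk={"delete_records"},
                   tools=tools))
```

The ordering matters: permission checks before approval gates, and logging on every branch, so the audit trail explains denials as well as successes.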
| Resource | Description | Workload Suitability | Official Link |
|---|---|---|---|
| garak | LLM vulnerability scanner probing for hallucination, leakage, injection, toxicity, and jailbreaks. | Automated red teaming & vulnerability scanning | GitHub |
| OWASP GenAI Security Project | Governance and mitigation framework for safety risks in LLMs and agentic systems. | Governance, controls, and secure-design reference | Project Home |
| Anthropic Alignment Stress-Testing | Research and operational approach for deliberately stress-testing alignment evals and oversight. | Research-driven safety evaluation methodology | Post |
| Model Organisms of Misalignment | In-vitro demonstrations of alignment failures so they can be studied empirically. | Advanced safety research and methodology | Post |
| AI Safety via Debate | Alignment framework for cases where direct human supervision is too hard. | Alignment and scalable oversight resource | Paper |
| Concrete Problems in AI Safety | Foundational framing paper for safety problems (side effects, reward hacking, safe exploration, shift). | Foundational safety resource | Paper |
| Anthropic Agentic Misalignment | Grounds safety concerns in concrete behaviours (blackmail, espionage) in simulated settings. | Applied safety & threat-modelling reference | Research Post |
| Tool | Description |
|---|---|
| Guardrails AI | Structural, type, quality guarantees for LLM outputs. |
| NeMo Guardrails | NVIDIA. Programmable conversation guardrails. |
| LLM Guard | Security toolkit. Input/output scanning. |
| Rebuff | Prompt injection detection. |
| Lakera Guard | Real-time protection. Prompt injection, data leakage, toxicity. |
Building agentic systems requires a shift in skillset:
- Problem decomposition
- System design and orchestration
- Tool and interface design
- Memory modelling
- Evaluation design
- Failure mode analysis
- Safety and governance thinking
To keep this repository genuinely opinionated, we advocate against these common anti-patterns:
- Do not begin with multi-agent systems when a single agent plus tools will do. Escalate to multi-agent only when task decomposition requires it.
- Do not add memory before defining what deserves persistence. Avoid "state bloat" by being intentional about what is stored and why.
- Do not treat tracing as optional for long-running systems. Observability is the only way to debug non-deterministic agentic failures.
- Do not confuse benchmark wins with production readiness. Real-world reliability requires evaluation on your specific data and edge cases.
- Do not use framework abstractions as a substitute for architecture. Understand your control flow before outsourcing it to a library.
- ⭐ Production-grade
- 🧪 Experimental
- ⚠️ Early-stage / unstable
- Choose a core pattern (e.g. single-agent + tools)
- Add structured tool use
- Introduce evaluation early
- Layer in memory only when needed
- Expand into multi-agent systems with clear roles
- Add observability and safety constraints
Contributions are welcome! Please read the CONTRIBUTING.md for full details before submitting a pull request.
At a high level, submissions must meet the following criteria:
- Clear description of purpose
- Architectural strengths and operational constraints
- Governance fit and workload suitability
- Evidence of ecosystem maturity or real-world usage (preferred)
- Evidence tags and `Last reviewed` markers where claims are time-sensitive or likely to change
This is a curated list, not an exhaustive one.
See appendix/benchmark-and-evidence-policy.md for the sourcing, evidence-tagging, and Last reviewed policy.
The shift to agentic systems is not about more tools.
It is about:
- Designing systems that can reason, act, evaluate, and improve
- Ensuring those systems are reliable, observable, and safe
Build accordingly.