A contract-to-repair diagnostic layer for AG2 multi-agent workflows.
Concord monitors a multi-agent workflow run, detects when agents violate their behavioural contracts, attributes the root cause to the responsible agent, proposes a repair targeting the correct AG2 primitive, and validates the fix with a sandboxed regression test — all automatically, end-to-end.
Current product loop: register a workflow contract, submit a run or trace, inspect deterministic violations, review one repair patch per violation, validate in Daytona, export the report, and return to persisted history.
- Live demo URL: https://concord-lite.vercel.app/
- Pipeline (CLI): python run_all.py --fixture
- North-star scorecard: docs/PLAN_VS_REALITY.md
- Architecture & Q&A doc: docs/ARCHITECTURE.md
- Demo script & cue cards: docs/DEMO_SCRIPT.md
- Deep Q&A reference (70 questions): docs/QA_DEEP.md
- Code of conduct - expectations for project participation.
- Contributing - local setup, test gates, and contribution rules.
- Security - supported branches and vulnerability reporting.
- License - Apache License 2.0.
curl -X POST $CONCORD_API_BASE/api/runs \
-H "Content-Type: application/json" \
-d '{
"workflow_id": "WF-...",
"task_spec": {
"task": "Survey reliability patterns in multi-agent systems",
"research_question": "What architectural patterns improve MAS reliability?"
}
}'

Omitting mode uses the product default: the real AG2 swarm with Tavily and Daytona regression. Explicit mode=stub remains available for deterministic internal tests, not as the public product path.
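
For callers who prefer Python over curl, the same submission can be sketched with the standard library. This is a minimal sketch, not part of the Concord codebase: the helper name `build_run_request` is hypothetical, the fallback base URL `http://localhost:8765` is borrowed from the local uvicorn example, and placing `mode` as a top-level field of the JSON body is an assumption.

```python
import json
import os
import urllib.request


def build_run_request(workflow_id, task, research_question, mode=None):
    """Build (but do not send) a POST /api/runs request.

    Hypothetical helper for illustration only.
    """
    # Assumption: fall back to the local dev server when CONCORD_API_BASE is unset.
    base = os.environ.get("CONCORD_API_BASE", "http://localhost:8765")
    payload = {
        "workflow_id": workflow_id,
        "task_spec": {"task": task, "research_question": research_question},
    }
    # Assumption: mode is a top-level body field; leaving it out selects the default.
    if mode is not None:
        payload["mode"] = mode
    return urllib.request.Request(
        f"{base}/api/runs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen()` would send it; building and sending are kept separate so the payload can be inspected first.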
- Register or import an AG2 workflow contract.
- Submit a real task or trace.
- Watch the run status move through queued, analyzing, and completed.
- Open the completed run and follow violation -> evidence -> AG2 primitive -> repair patch -> regression result.
- Export the completed report JSON.
- Reopen the run from persisted history.
For deterministic local verification, run python run_all.py --fixture (no API keys needed). Fixture mode exercises the same Zone B report path while skipping live Zone A execution.
The FastAPI layer supports a dev bootstrap path and tenant-scoped API keys:
uvicorn api.index:app --port 8765
curl -X POST http://localhost:8765/api/api-keys \
-H 'Content-Type: application/json' \
-d '{"tenant_id":"tenant-a","name":"tenant-a primary"}'
curl http://localhost:8765/api/tenant/usage \
-H "Authorization: Bearer <returned-api-key>"

When no API keys exist, unauthenticated requests use the local tenant for demo setup. After a key exists, /api/* routes require a bearer key except /api/health; browser SSE uses a short-lived stream token minted from an authenticated run.
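
The access rule described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the FastAPI layer's actual code: `resolve_tenant`, the `"local"` tenant id, and the key-to-tenant dictionary are all hypothetical, and the SSE stream-token path is not modeled.

```python
from typing import Optional


def resolve_tenant(path: str, bearer_key: Optional[str], keys: dict) -> Optional[str]:
    """Return the tenant id for a request, or raise PermissionError.

    keys maps API key value -> tenant id (hypothetical storage shape).
    """
    if path == "/api/health":
        return None  # health check is always open
    if not keys:
        return "local"  # bootstrap mode: no keys registered yet
    if not path.startswith("/api/"):
        return None  # only /api/* routes are gated
    tenant = keys.get(bearer_key)
    if tenant is None:
        raise PermissionError("missing or unknown bearer key")
    return tenant
```

Once the first key is created, the bootstrap branch stops applying, so the same unauthenticated request that worked during demo setup is rejected.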
Multi-agent systems fail silently. An agent claims it verified sources but sets verified_sources_count=0. Another runs a side-effect action without waiting for human approval. A third records no tool call despite claiming it searched. These are contract violations — the gap between what an agent says it did and what it actually produced in the trace.
Concord is a diagnostic pipeline that sits outside the target workflow, reads its execution trace, and systematically detects, attributes, and repairs those gaps.
Zone A: target workflow (demo fixture can be broken by design)
↓ execution trace (JSON)
Zone B: Concord diagnostic (detects, attributes, repairs)
↓ Contract Violation Report
┌─────────────────────────────────────────────────────────────────────────┐
│ python run_all.py (or --fixture to skip Zone A) │
└──────────────────────────────┬──────────────────────────────────────────┘
│
┌───────────────────▼──────────────────────┐
│ ZONE A │
│ Literature Review Assistant │
│ (broken by design) │
└───────────────────┬──────────────────────┘
│
task.json ──────────────┤
(task, research_question, │
run_id) │
▼
┌────────────────────────────┐
│ ResearcherAgent [step 1] │──→ Tavily API
│ tool_call_id = "tc_001" │ 3 sources retrieved
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ CriticAgent [step 2] │ critique notes
│ tool_call_id = null │ risk flags
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ VerifierAgent [step 3] │ ⚠ INTENTIONALLY BROKEN
│ tool_call_id = null ✗ │ ← Contract C2 violation
│ verified_sources = 0 ✗ │ ← Contract C1 violation
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ ReporterAgent [step 4] │ runs despite 0 verified
│ produces final_output │ sources (cascading failure)
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ HumanGate [gate] │ ⚠ INTENTIONALLY BROKEN
│ approval = "pending" ✗ │ ← never approves
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ ActionAgent [step 5] │ ⚠ INTENTIONALLY BROKEN
│ runs without approval ✗ │ ← Contract C4 violation
└──────────────┬─────────────┘
│
▼
trace_emitter.py
zone_b/fixtures/sample_trace.json
│
┌─────────────────▼──────────────────────┐
│ ZONE B │
│ Concord Diagnostic │
└─────────────────┬──────────────────────┘
│
▼
┌────────────────────────────┐
│ TraceCollector [B1] │ JSON → RunTrace
│ (no LLM) │ + ContextSnapshot
└──────────────┬─────────────┘
│
▼
┌────────────────────────────┐
│ ContractChecker [B2] │ checks 5 rules
│ (deterministic + LLM │ → 4 Violation objects
│ for human text) │ 3 HIGH, 1 MEDIUM
└──────────────┬─────────────┘
│
4 violations
│
▼
┌────────────────────────────┐
│ Attribution [B3] │──→ LLM
│ │ failed_agent = VerifierAgent
└──────────────┬─────────────┘ failed_step = 3
│
▼
┌────────────────────────────┐
│ Repair [B4] │──→ LLM
│ │ primitive = Guardrail
└──────────────┬─────────────┘ confidence = 0.85
│
▼
┌────────────────────────────┐
│ RegressionTest [B5] │──→ LLM generates test code
│ │──→ Daytona sandbox runs it
└──────────────┬─────────────┘ status: pass / fail
│
▼
┌────────────────────────────┐
│ Reporter [B6] │──→ LLM
│ │ Contract Violation Report
└──────────────┬─────────────┘ + narrative
│
▼
┌────────────────────────────┐
│ HumanGate [B7] │ auto-approves (demo)
│ │ approval_status = approved
└──────────────┬─────────────┘
│
▼
╔═══════════════════════════════════╗
║ CONTRACT VIOLATION REPORT ║
╠═══════════════════════════════════╣
║ run_id : run_041 ║
║ violations : 3 (all HIGH) ║
║ failed_agent : VerifierAgent ║
║ primitive : Guardrail ║
║ confidence : 0.85 ║
║ regression : pass / fail ║
║ approval : approved ║
╚═══════════════════════════════════╝
Zone A is deliberately broken in exactly three ways. Zone B must detect all three.
| ID | Contract type | Severity | What breaks | AG2 primitive to fix |
|---|---|---|---|---|
| C1 | evidence | HIGH | VerifierAgent sets verified_sources_count=0 in context_delta, so ReporterAgent runs with no verified evidence | Guardrail |
| C2 | tool | HIGH | VerifierAgent has tool_call_id=None despite claiming to verify sources, so no tool use is recorded | OnContextCondition |
| C3 | approval | HIGH | ActionAgent runs with approval_status="pending" because HumanGate never approves | HumanGate |
These failures cascade: C2 prevents C1 from being fixable at runtime, and C3 is independent. Zone B's ContractChecker checks each rule deterministically against the ContextSnapshot, then uses an LLM to generate human-readable expected / observed text for each violation.
All data flowing between agents — across both zones — is typed via dataclasses in shared/models.py. Nothing passes as raw dicts between pipeline stages.
ToolEvent
tool_name str "tavily_search"
input Any the query sent
output Any "3 results returned"
status str "success" | "failure"
evidence_id str "ev_001"
timestamp float
TraceEvent ← one per agent turn in Zone A
step int 1 – 5
agent str "ResearcherAgent" etc.
type str "agent_turn"
content str the agent's text output
tool_call_id str | None non-null only for ResearcherAgent
context_delta dict incremental state update
handoff_to str | None next agent in chain
timestamp float
RunTrace ← full Zone A execution record
run_id str "run_041"
workflow_name str
events list[TraceEvent]
final_output Any
ContextSnapshot ← folded state after all events
retrieved_sources list from ResearcherAgent
verified_sources_count int 0 (broken by design)
tool_events list[ToolEvent]
approval_status str "pending" | "approved" | "rejected"
failed_agent str | None
failed_step int | None
final_output Any
Violation ← one per broken contract
contract_type str "evidence" | "tool" | "approval" etc.
severity str "high" | "medium" | "low"
rule str
expected str LLM-generated human text
observed str LLM-generated human text
failed_agent str
failed_step int
Each TraceEvent carries a context_delta — the incremental state change that agent produced. TraceCollector._build_context_snapshot() folds all deltas left-to-right into a single ContextSnapshot:
- Scalar fields (verified_sources_count, approval_status, final_output): last write wins
- tool_events: accumulated (extended, not replaced)
- Unknown keys (e.g. action_event): silently dropped during ContextSnapshot construction
This means Zone A agents only need to emit what they changed — not the full state.
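
The folding rules above can be sketched in a few lines. This is a simplified illustration, not the real TraceCollector._build_context_snapshot(): the trimmed-down `ContextSnapshot` omits failed_agent and failed_step, its default values are assumptions, `fold_deltas` is a hypothetical name, and treating retrieved_sources as last-write-wins is also an assumption.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ContextSnapshot:
    # Trimmed-down stand-in for the shared model; defaults are assumptions.
    retrieved_sources: list = field(default_factory=list)
    verified_sources_count: int = 0
    tool_events: list = field(default_factory=list)
    approval_status: str = "pending"
    final_output: Any = None


def fold_deltas(deltas):
    """Fold context_delta dicts left-to-right into one snapshot."""
    known = {f.name for f in fields(ContextSnapshot)}
    state = {"tool_events": []}
    for delta in deltas:
        for key, value in delta.items():
            if key not in known:
                continue  # unknown keys (e.g. action_event) are dropped
            if key == "tool_events":
                state["tool_events"].extend(value)  # accumulate, never replace
            else:
                state[key] = value  # last write wins
    return ContextSnapshot(**state)
```

Feeding it the five Zone A deltas in order would leave verified_sources_count at 0 and carry exactly one tool event, which is the state the contract checks then inspect.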
Zone A is the target workflow — the system being monitored. It implements a Literature Review Assistant.
task.json
│
├─→ ResearcherAgent calls Tavily, returns 3 sources + summary
├─→ CriticAgent critiques sources, returns notes + risk flags
├─→ VerifierAgent [BROKEN] returns tool_call_id=None, verified_sources_count=0
├─→ ReporterAgent assembles final_output dict (runs anyway)
├─→ HumanGate [BROKEN] always returns approval_status="pending"
└─→ ActionAgent [BROKEN] runs without checking approval_status
Every Zone A agent must return a dict with exactly these keys — run.py converts them to TraceEvent via _to_trace_event():
{
    "step": int,                 # 1–5
    "agent": str,                # agent class name
    "type": "agent_turn",
    "content": str,              # agent's primary text output
    "tool_call_id": str | None,  # non-null only when a real tool was called
    "context_delta": dict,       # incremental state: only what this agent changed
    "handoff_to": str | None,    # next agent name
    "timestamp": float,          # time.time()
}

| Agent | Keys emitted in context_delta |
|---|---|
| ResearcherAgent | retrieved_sources, tool_events |
| CriticAgent | (empty) |
| VerifierAgent | verified_sources_count (= 0) |
| ReporterAgent | final_output |
| ActionAgent | action_event |
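
The "exactly these keys" requirement can be enforced at the conversion boundary. The sketch below shows one way to do it and is not the repository's `_to_trace_event()`: the function name `to_trace_event`, the `REQUIRED_KEYS` constant, and the choice to raise ValueError on missing or extra keys are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED_KEYS = {
    "step", "agent", "type", "content",
    "tool_call_id", "context_delta", "handoff_to", "timestamp",
}


@dataclass
class TraceEvent:
    step: int
    agent: str
    type: str
    content: str
    tool_call_id: Optional[str]
    context_delta: dict
    handoff_to: Optional[str]
    timestamp: float


def to_trace_event(raw: dict) -> TraceEvent:
    """Convert an agent's return dict into a TraceEvent, rejecting key drift."""
    missing = REQUIRED_KEYS - set(raw)
    extra = set(raw) - REQUIRED_KEYS
    if missing or extra:
        raise ValueError(f"bad agent output: missing={sorted(missing)} extra={sorted(extra)}")
    return TraceEvent(**raw)
```

Failing loudly here keeps a malformed agent return from surfacing later as a confusing contract violation in Zone B.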
All Zone A agents use:
- make_proxy(name): creates a standard UserProxyAgent with llm_config=False, human_input_mode="NEVER", is_termination_msg=lambda x: True, max_consecutive_auto_reply=0, code_execution_config=False
- strip_json_fences(text): removes ```json fences from LLM responses before json.loads()
Zone B is the diagnostic pipeline — it never runs Zone A's agents, only reads the trace Zone A emitted.
| Stage | Agent | LLM? | Input | Output |
|---|---|---|---|---|
| B1 | TraceCollector | No | raw JSON | RunTrace + ContextSnapshot |
| B2 | ContractChecker | Yes (text only) | RunTrace + ContextSnapshot | list[Violation] |
| B3 | Attribution | Yes | violations + trace | failed_agent, failed_step, likely_root_cause |
| B4 | Repair | Yes | violations + attribution | patches[], affected_primitive, patch_code, confidence |
| B5 | RegressionTest | Yes + Daytona | repair patch + violations | test_status, per_violation_results, sandbox_id |
| B6 | Reporter | Yes | all upstream outputs | Contract Violation Report dict |
| B7 | HumanGate | No | report | approval_status = "approved" |
repair.py maps violation type to AG2 primitive without an LLM call:
evidence → Guardrail
tool → OnContextCondition
routing → Handoff
approval → HumanGate
schema → Guardrail
The LLM is only used to generate patch_code and expected_impact text for each violation's mapped primitive.
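
The mapping above is small enough to express as a plain dictionary lookup. In this sketch the table contents come from the list above and `PRIMITIVE_MAP` is the name referenced by the test suite, while the helper `primitive_for` and its behaviour for unmapped types (raising ValueError) are assumptions.

```python
PRIMITIVE_MAP = {
    "evidence": "Guardrail",
    "tool": "OnContextCondition",
    "routing": "Handoff",
    "approval": "HumanGate",
    "schema": "Guardrail",
}


def primitive_for(contract_type: str) -> str:
    """Look up the AG2 primitive for a violation type without any LLM call."""
    try:
        return PRIMITIVE_MAP[contract_type]
    except KeyError:
        # Assumption: unmapped types are treated as an error rather than guessed.
        raise ValueError(f"no AG2 primitive mapped for {contract_type!r}") from None
```

Keeping the lookup deterministic means the choice of primitive is reproducible across runs; only the patch text varies with the model.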
Five rules, all checked deterministically against ContextSnapshot:
# C1 — evidence
lambda trace, snap: snap.verified_sources_count > 0
# C2 — tool
lambda trace, snap: any(
    e.agent == "VerifierAgent" and e.tool_call_id
    for e in trace.events
)
# C3 — approval
lambda trace, snap: snap.approval_status == "approved"

The LLM is called after a check fails, and only to produce the expected / observed strings for the report.
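
To show how such predicates can be evaluated together, here is a self-contained sketch that runs the three checks listed above against stand-in objects. The `Event`, `Trace`, and `Snapshot` classes are minimal hypothetical substitutes for the shared models, and `CONTRACTS` and `failed_contracts` are illustrative names rather than the registry the project uses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    agent: str
    tool_call_id: Optional[str] = None


@dataclass
class Trace:
    events: list = field(default_factory=list)


@dataclass
class Snapshot:
    verified_sources_count: int = 0
    approval_status: str = "pending"


# Each contract is a predicate that returns True when the rule holds.
CONTRACTS = {
    "C1": lambda trace, snap: snap.verified_sources_count > 0,
    "C2": lambda trace, snap: any(
        e.agent == "VerifierAgent" and e.tool_call_id for e in trace.events
    ),
    "C3": lambda trace, snap: snap.approval_status == "approved",
}


def failed_contracts(trace, snap):
    """Return the ids of every contract whose predicate is False."""
    return [cid for cid, check in CONTRACTS.items() if not check(trace, snap)]
```

With a trace whose VerifierAgent event has no tool_call_id and a default snapshot, all three ids come back; a repaired trace with a recorded tool call, a positive count, and an approved status yields an empty list.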
- make_proxy(name): same contract as Zone A's version
- parse_json_body(body): strips ```json fences and raises ValueError on parse failure (not silent)
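
A helper with the behaviour described for parse_json_body could look roughly like the following. This is a sketch of the idea, not the code in zone_b/utils.py: the regular expression, the `FENCE` constant, and the error message are assumptions, and only a single outer fence pair is handled.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so the example nests cleanly in Markdown
_FENCED = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_json_fences(text: str) -> str:
    """Return the text inside one surrounding code fence, or the stripped text."""
    match = _FENCED.match(text)
    return match.group(1) if match else text.strip()


def parse_json_body(body: str):
    """Parse an LLM response as JSON, failing loudly instead of silently."""
    try:
        return json.loads(strip_json_fences(body))
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM response is not valid JSON: {exc}") from exc
```

Raising at this boundary lets each stage decide whether to retry or fall back, rather than passing a half-parsed structure downstream.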
RegressionTest uses an LLM to generate a self-contained Python script that simulates the post-repair state and asserts that each violation is no longer reachable. The script is executed in a fresh Daytona sandbox via daytona.process.code_run(), and the sandbox is always deleted, even on error. The stage returns an aggregate test_status plus one per_violation_results[] row per violation. If either DAYTONA_API_KEY or DAYTONA_API_URL is absent, the stage returns test_status="error" and sandbox_id="no-sandbox" without crashing the pipeline.
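
The two safety properties described here (a graceful result when credentials are missing, and guaranteed sandbox cleanup) can be sketched without touching the Daytona SDK. Everything below is hypothetical scaffolding: `FakeSandbox` stands in for a real sandbox client, and `run_regression`, `create_sandbox`, `.run()`, and `.delete()` are illustrative names, not Daytona or Concord APIs. Mapping unexpected exceptions to "error" is also an assumption.

```python
import os


class FakeSandbox:
    """Hypothetical stand-in for a sandbox client; not the Daytona SDK."""

    def __init__(self):
        self.id = "sbx-demo"
        self.deleted = False

    def run(self, code: str) -> None:
        # Executes the script locally; a real sandbox would run it remotely.
        exec(compile(code, "<regression>", "exec"), {})

    def delete(self) -> None:
        self.deleted = True


def run_regression(test_code, create_sandbox, env=None):
    env = os.environ if env is None else env
    # Missing credentials produce an error result instead of an exception.
    if not env.get("DAYTONA_API_KEY") or not env.get("DAYTONA_API_URL"):
        return {"test_status": "error", "sandbox_id": "no-sandbox"}
    sandbox = create_sandbox()
    try:
        try:
            sandbox.run(test_code)
            status = "pass"
        except AssertionError:
            status = "fail"
        except Exception:
            status = "error"
        return {"test_status": status, "sandbox_id": sandbox.id}
    finally:
        sandbox.delete()  # cleanup runs on every path, including errors
```

The `finally` block is what backs the "always deleted" guarantee: it runs whether the script passes, fails an assertion, or raises something unexpected.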
.
├── run_all.py entry point — chains Zone A → Zone B
│
├── shared/
│ └── models.py all shared dataclasses (ToolEvent, TraceEvent,
│ RunTrace, ContextSnapshot, Violation, ...)
│
├── zone_a/ Literature Review Assistant (target workflow)
│ ├── config.py get_llm_config() — Gemini Flash via OpenRouter
│ ├── context_variables.py ZoneAContext dataclass
│ ├── workflow_contract.py C1–C5 contract definitions (reference)
│ ├── trace_emitter.py writes RunTrace → zone_b/fixtures/sample_trace.json
│ ├── run.py pipeline orchestrator + _to_trace_event()
│ ├── fixtures/
│ │ └── task.json task, research_question, run_id
│ └── agents/
│ ├── _utils.py make_proxy(), strip_json_fences()
│ ├── researcher.py Tavily search → sources + summary
│ ├── critic.py critique notes + risk flags
│ ├── verifier.py ⚠ BROKEN: tool_call_id=None, verified_sources_count=0
│ ├── reporter.py final_output dict assembly
│ ├── human_gate.py ⚠ BROKEN: always returns pending
│ └── action_agent.py ⚠ BROKEN: runs without approval
│
├── zone_b/ Concord diagnostic pipeline
│ ├── config.py get_llm_config() — same model, same pattern
│ ├── utils.py make_proxy(), parse_json_body()
│ ├── orchestrator.py wires B1–B7 sequentially
│ ├── run.py standalone Zone B runner (reads fixture)
│ ├── sandbox_run.py Daytona demo runner with mock trace
│ ├── fixtures/
│ │ └── sample_trace.json pre-baked run_041 trace (4 violations)
│ ├── contracts/ dataclass registry + YAML contract DSL
│ └── agents/
│ ├── trace_collector.py JSON → RunTrace + ContextSnapshot (no LLM)
│ ├── contract_checker.py registry-backed deterministic checks + LLM text
│ ├── attribution.py LLM → failed_agent + root cause
│ ├── repair.py primitive map + LLM patch code
│ ├── regression_test.py LLM test gen + Daytona execution
│ ├── reporter.py LLM narrative + report assembly
│ └── human_gate.py auto-approves (demo mode)
│
├── public/ frontend (Vercel-deployed mission-control dashboard)
│ ├── index.html self-contained HTML + inline React app + fixture
│ ├── styles.css monospace dark-mode UI tokens
│ ├── app.jsx 7-screen React component tree (split-file copy)
│ └── data.js extracted CONCORD_DATA fixture (split-file copy)
│
├── api/ backend HTTP layer (FastAPI; not deployed yet)
│ ├── index.py routes: /api/health, /api/runs, /api/runs/{id}.js, approval
│ ├── adapter.py Zone B report → CONCORD_DATA shape
│ └── store.py in-memory run store seeded with RUN-041
│
├── vercel.json static-only deploy config
├── requirements.txt FastAPI deps (for the api/ layer)
│
└── tests/
├── conftest.py shared fixtures (sample_trace_raw, clean_trace_raw, ...)
├── test_models.py 21 — dataclass field integrity
├── test_trace_collector.py 30 — parsing, folding, snapshot building
├── test_contract_checker.py 26 — contract lambdas + step lookup
├── test_attribution.py 10 — deterministic fallback paths
├── test_repair.py 20 — per-violation patches, PRIMITIVE_MAP, scalar aliases
├── test_regression_test.py 20 — _parse_status, fallback test execution
├── test_reporter.py 15 — report assembly, severity summary, patches
├── test_human_gate.py 6 — approval output shape
├── test_zone_a.py 22 — strip_json_fences, _to_trace_event, agent shapes
├── test_integration.py 20 — Zone A→B schema, 4 violations, clean trace = 0
├── test_per_violation_repairs.py 3 — per-violation regression statuses
├── test_rigorous.py 57 — edge cases, error paths, boundary conditions
├── test_routing_contract.py 3 — routing contract broken + clean trace cases
├── test_schema_contract.py 6 — schema contract missing-key + fixture cases
└── test_swarm.py 27 — AG2 swarm tools, handoffs, guardrails, trace extraction
────
288 total (274 non-integration + 14 integration-marked)
git clone https://github.com/d3v07/AG2_Hackathon.git
cd AG2_Hackathon
pip install -e .

Create .env in the repo root:
OPENROUTER_API_KEY=your_openrouter_key
TAVILY_API_KEY=your_tavily_key
DAYTONA_API_KEY=your_daytona_key
DAYTONA_API_URL=https://app.daytona.io/api

Only OPENROUTER_API_KEY is required to run Zone B. TAVILY_API_KEY is required to run Zone A live. DAYTONA_* is required only for the sandboxed regression test stage.
Uses the pre-baked zone_b/fixtures/sample_trace.json (run_041, 4 violations). Runs the full Zone B diagnostic pipeline live with your LLM keys.
python run_all.py --fixture

Runs Zone A end-to-end (Tavily search → agent chain → trace emission), then runs Zone B on the freshly generated trace.

python run_all.py
python zone_b/run.py
python zone_a/run.py

============================================================
CONCORD — Full Pipeline Run
============================================================
[Zone A] Skipped — using fixture trace
Trace loaded from zone_b/fixtures/sample_trace.json
[Zone B] Running diagnostic pipeline...
[1/7] TraceCollector — loading zone_b/fixtures/sample_trace.json
run_041: 5 events, 1 tool call(s), handoff path length 5
[2/7] ContractChecker — applying contracts
4 violation(s) found
[3/7] Attribution — identifying failed agent
failed_agent=VerifierAgent step=3
[4/7] Repair — mapping to AG2 primitive
affected_primitive=Guardrail confidence=0.85
[5/7] RegressionTest — running in Daytona
test_status=pass sandbox=<id>
per_violation=4 pass/0 fail/0 error
[6/7] Reporter — assembling Contract Violation Report
[7/7] HumanGate — approval check
Decision: APPROVED
============================================================
CONTRACT VIOLATION REPORT
============================================================
Run ID : run_041
Violations : 4
Severity : {'high': 3, 'medium': 1, 'low': 0}
Failed agent : VerifierAgent (step 3)
Root cause : VerifierAgent failed to use a tool to gather verified sources
Affected primitive: Guardrail
Repair confidence : 0.85
Approval status : approved
============================================================
# Fast — no API calls (~0.5s, 274 tests)
pytest tests/ -m "not integration"
# Full suite including LLM + Daytona integration tests
pytest tests/

| File | Count | What it covers |
|---|---|---|
| test_models.py | 21 | All shared dataclasses, field types, optional fields |
| test_trace_collector.py | 30 | JSON parsing, context_delta folding, edge cases |
| test_contract_checker.py | 26 | Contract lambdas, boundary values, step lookup |
| test_attribution.py | 10 | Deterministic fallback, empty violations |
| test_repair.py | 20 | Per-violation patch cardinality, scalar alias selection, full PRIMITIVE_MAP coverage |
| test_regression_test.py | 20 | _parse_status edge cases, fallback test code executes PASS |
| test_reporter.py | 15 | Report assembly, severity summary, repair patch passthrough, fallback narrative |
| test_human_gate.py | 6 | Auto-approve shape, handles empty report |
| test_zone_a.py | 22 | strip_json_fences, _to_trace_event, all 5 agent return shapes |
| test_integration.py | 20 | Zone A→B schema compatibility, exactly 4 violations, clean trace = 0 |
| test_per_violation_repairs.py | 3 | Per-violation regression status and reporter aggregation |
| test_rigorous.py | 57 | Edge cases, error paths, partial violations, data-flow contracts |
| test_routing_contract.py | 3 | Routing contract fails fixture, passes clean trace |
| test_schema_contract.py | 6 | Schema contract fails missing keys, passes fixture |
| test_swarm.py | 27 | Swarm tools, handoffs, guardrails, trace extraction |
All agents in this codebase follow these patterns. PRs that deviate will be rejected.
from autogen import ConversableAgent
from zone_b.config import get_llm_config # or zone_a.config
from zone_b.utils import make_proxy # or zone_a.agents._utils
agent = ConversableAgent(
    name="AgentName",
    llm_config=get_llm_config(),
    system_message="...",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=1,
    code_execution_config=False,  # always False
)
proxy = make_proxy("AgentNameProxy")  # shared helper, never inline
result = proxy.initiate_chat(agent, message="...", max_turns=1)
output = result.chat_history[-1]["content"]

Never trust LLMs to return bare JSON. Always strip fences first:
import json

# Zone A
from zone_a.agents._utils import strip_json_fences
parsed = json.loads(strip_json_fences(result.chat_history[-1]["content"]))

# Zone B
from zone_b.utils import parse_json_body
parsed = parse_json_body(result.chat_history[-1]["content"])  # raises ValueError on failure

Both zones use the same get_llm_config() pattern (Gemini 2.5 Flash via OpenRouter, temperature 0.1):
import os

def get_llm_config(model: str = "google/gemini-2.5-flash") -> dict:
    return {
        "config_list": [{
            "model": model,
            "api_key": os.environ["OPENROUTER_API_KEY"],
            "base_url": "https://openrouter.ai/api/v1",
            "api_type": "openai",
        }],
        "temperature": 0.1,
    }

- Never use autogen.beta.Agent
- Never call get_config(); use get_llm_config() from the zone's config.py
- Always set code_execution_config=False on both agent and proxy
- Always set human_input_mode="NEVER" on both
- Set max_consecutive_auto_reply=1 on agents, 0 on proxies
- Use make_proxy(); never copy-paste the 8-line UserProxyAgent block
- Validate JSON parsing at the boundary; never json.loads() raw LLM output

| Tool | Version | Role |
|---|---|---|
| AG2 | >=0.12 | Multi-agent framework: ConversableAgent, UserProxyAgent |
| Gemini 2.5 Flash | — | LLM for all agents, via OpenRouter |
| OpenRouter | — | OpenAI-compatible API proxy for Gemini |
| Tavily | — | Web search API for ResearcherAgent |
| Daytona | — | Sandboxed code execution for regression tests |
| Python | >=3.12 | Required for `str |

| GitHub | Zone | Sprint 1 | Sprint 2 |
|---|---|---|---|
| d3v07 | Zone B | #1 — Scaffold + models + TraceCollector + ContractChecker | #5 — Wire Zone A→B + run_all.py |
| Frex22 | Zone B | #2 — Attribution + Repair + RegressionTest + Reporter + Orchestrator | #6 — Zone B full pytest suite + Daytona tests |
| PruthviVKadam | Zone A | #3 — Scaffold + ContextVariables + ResearcherAgent + CriticAgent | #7 — Zone A live run + trace schema validation |
| niharika2701 | Zone A | #4 — VerifierAgent + ReporterAgent + ActionAgent + HumanGate + trace_emitter | #8 — Contract Violation Report validation + demo run |