Open Benchmark for Personal AI Assistants
TwinBench is the open benchmark for personal AI assistants. It measures whether an AI system can remember, act, follow up, stay safe, and operate over time.
This repository exists because the current benchmark landscape still misses a category between a chatbot and a task agent. We use the technical term DTaaS internally for that runtime category, but the public-facing benchmark is simpler:
TwinBench defines the runtime category behind persistent personal AI assistants.
TwinBench now ships with a lightweight public site and leaderboard surface in website/.
Live site:
Build it locally:
```
make site
```
Then open:
- `website/index.html`
- `website/results/nullalis-live-2026-03-25-openended/index.html`
The website is the public leaderboard and share surface. The repo is the benchmark source, run path, and submission workflow.
Generic runtime:
```
python3.10 -m harness.runner --url YOUR_URL --token YOUR_TOKEN --user-id 1 --name "Your Runtime" --output results/your-runtime.json --markdown results/your-runtime.md --html results/your-runtime.html
```
Local Nullalis:
```
python3.10 -m harness.runner --url http://127.0.0.1:3000 --token-from-nullalis-config --user-id 1 --name "Nullalis Local" --output results/nullalis-local.json --markdown results/nullalis-local.md --html results/nullalis-local.html
```
Scripted shortcuts:
```
make preflight URL=http://localhost:8080 TOKEN=YOUR_TOKEN
make run URL=http://localhost:8080 TOKEN=YOUR_TOKEN NAME="My Runtime"
make run-nullalis
make demo
make site
```
Quick links:
- Overview
- Introducing TwinBench
- Why TwinBench
- Why Current AI Benchmarks Miss Personal AI Assistants
- What Is a Personal AI Assistant?
- Getting Started in 10 Minutes
- Run with Agents
- Troubleshooting
- Run Profiles
- Compatibility Checklist
- Preflight Checklist
- Artifact Schema Explainer
- How to Submit Results
- Results Index
- Roadmap
- Monthly Challenge
- Notable Submissions
- Case Study Template
- Trust Model
- Launch Packet
- Outreach Waves
- Press Kit
A personal AI assistant runtime is not just a chatbot. It is a long-lived system that remembers, acts, and stays aligned with one user over time.
TwinBench is for runtimes that aim to behave like persistent personal AI assistants:
- remember across sessions and restarts
- execute tasks autonomously
- keep state coherent across channels and surfaces
- protect users during background turns
- operate as real runtime infrastructure, not just a single prompt loop
The benchmark reports two composites:
- `verified`: based only on behavior or evidence directly measured in the run
- `projected`: includes clearly labeled assumptions for not-yet-measured parts

The leaderboard tiers on `coverage_adjusted_verified_score`, not on the most flattering number.
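The exact adjustment formula is defined in SPECIFICATION.md. As a purely hypothetical illustration of why the coverage-adjusted number matters for tiering, assume the simplest possible discount, multiplying the verified score by measured coverage:

```python
def coverage_adjusted(verified_score: float, measured_coverage: float) -> float:
    """Hypothetical sketch: discount a verified score by the fraction of the
    benchmark that was directly measured. The real formula lives in
    SPECIFICATION.md and may differ."""
    return verified_score * measured_coverage

# A high score on thin coverage tiers below a modest score with broad coverage.
print(coverage_adjusted(0.90, 0.50))  # 0.45
print(coverage_adjusted(0.70, 0.95))  # higher, despite the lower raw score
```

This is why a runtime cannot climb the leaderboard by skipping the dimensions it would score poorly on.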
If you want to see TwinBench run successfully before pointing it at a real runtime, use the fixture demo runtime.
Local demo:
```
make demo
```
Docker demo:
```
docker compose up --build benchmark
```
This path spins up a small fixture assistant runtime, runs a short TwinBench pass, and writes artifacts to `results/twinbench-demo-runtime.*`.
If you want Codex, Claude Code, Cursor, or another coding agent to run TwinBench for you, use the exact command above or hand it this prompt:
Run TwinBench against this runtime at URL X using token Y. First perform the preflight checks, then run the harness, save JSON, Markdown, and HTML artifacts, and summarize the verified score, projected score, measured coverage, dimension statuses, and any unavailable dimensions with reason codes.
For a machine-operator-ready guide, use docs/AGENT_RUN_GUIDE.md.
- runtime builders shipping personal AI assistant products
- agent framework teams adding persistence and autonomous execution
- infra and platform teams building agent runtimes
- researchers studying long-lived assistants
- advanced indie builders who want a serious benchmark, not a demo script
- not a chatbot benchmark
- not a coding benchmark
- not a single-turn task benchmark
- not a marketing scorecard for one vendor
Current reference artifacts derived from checked-in runs are listed in docs/RESULTS_INDEX.md.
The public website/leaderboard surface is generated from checked-in artifacts in website/.
TwinBench is GitHub-first for now:
- GitHub Discussions for benchmark questions and feedback
- issues for compatibility requests
- submissions for new public artifacts
Community hub:
Nullalis is the current reference runtime, not the benchmark owner. Its role is to prove the category is real and to provide the first evidence-rich public artifact.
Canonical public reference artifact:
```
git clone https://github.com/ProjectNuggets/DTaaS-benchmark.git
cd DTaaS-benchmark
python3.10 -m pip install -r harness/requirements.txt
```
Run a full benchmark:
```
python3.10 -m harness.runner \
  --url http://localhost:8080 \
  --token YOUR_TOKEN \
  --user-id 1 \
  --name "My Runtime" \
  --output results/run.json \
  --markdown results/run.md \
  --html results/run.html
```
Run local Nullalis with auto token discovery:
```
python3.10 -m harness.runner \
  --url http://127.0.0.1:3000 \
  --token-from-nullalis-config \
  --user-id 1 \
  --name "Nullalis Local"
```
Then:
- open the generated JSON
- review the verified score, projected score, and measured coverage
- attach the artifact through docs/HOW_TO_SUBMIT.md
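The review step can be scripted. A minimal sketch, assuming the headline fields sit at the top level of the artifact JSON (the authoritative layout is in docs/ARTIFACT_SCHEMA.md):

```python
import json

def summarize(path: str) -> dict:
    """Load a TwinBench artifact and pull out the headline fields.

    Field names follow this README; flat top-level placement is an
    assumption -- check docs/ARTIFACT_SCHEMA.md for the real layout.
    """
    with open(path) as f:
        artifact = json.load(f)
    fields = (
        "verified_composite_score",
        "projected_composite_score",
        "measured_coverage",
        "coverage_adjusted_verified_score",
    )
    # Missing fields come back as None rather than raising, so a partial
    # artifact still produces a readable summary.
    return {name: artifact.get(name) for name in fields}
```

For example, `summarize("results/run.json")` after the full run above.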
If you are new here, start with docs/GETTING_STARTED.md instead of the full specification.
TwinBench currently documents three first-class run shapes:
- local reference: for a locally running runtime with direct access to internal auth
- SaaS runtime: for a remotely hosted runtime that exposes the benchmark contract
- multi-tenant-ready: for runtimes that support benchmark user provisioning and fair multi-user fanout
Details and commands are in docs/RUN_PROFILES.md.
Required runtime surfaces:
| Endpoint | Method | Purpose | Required |
|---|---|---|---|
| `/api/v1/chat/stream` | POST (SSE) | Send a message and receive streamed output | Yes |
| `/health` | GET | Health check | Yes |
| `/internal/diagnostics` | GET | Runtime introspection and evidence support | Yes |
| `/metrics` | GET | Prometheus-style metrics | Optional |
Before a full run, use the Preflight Checklist and Compatibility Checklist.
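A quick probe of the required read-only surfaces can catch contract gaps before a full run. This is a sketch only: the endpoint paths come from the table above, but the Bearer auth scheme is an assumption about your runtime, and docs/PREFLIGHT.md remains the authoritative checklist.

```python
import urllib.request

def preflight(base_url: str, token: str) -> dict:
    """Probe the required GET endpoints before committing to a full run.

    Assumes Bearer token auth; adapt the headers to your runtime.
    The POST/SSE chat endpoint is left to the real harness preflight.
    """
    results = {}
    for path in ("/health", "/internal/diagnostics"):
        req = urllib.request.Request(
            base_url.rstrip("/") + path,
            headers={"Authorization": f"Bearer {token}"},
        )
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                results[path] = resp.status  # expect 200 on both
        except Exception as exc:
            results[path] = str(exc)  # record the failure instead of aborting
    return results
```

Anything other than a 200 on a required endpoint means the full run will report that surface as unavailable rather than failing.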
- neutral to vendor
- evidence over claims
- unsupported is not the same as failure
- missing bootstrap should be reported distinctly
- same-user contention is a diagnostic, not the primary multi-user scale claim
The full scoring and evidence rules live in SPECIFICATION.md and docs/TRUST_MODEL.md.
When you read a TwinBench artifact, start with:
- `verified_composite_score`: what the run directly proved
- `projected_composite_score`: what the runtime may support beyond direct measurement
- `measured_coverage`: how much of the benchmark was directly exercised
- `coverage_adjusted_verified_score`: the number used for tiering
- `dimension_status`: whether each dimension was measured, partially measured, unavailable, or errored
- `dimension_reason_codes`: why a dimension was unavailable or only partially measurable
Use docs/ARTIFACT_SCHEMA.md for a plain-English field guide.
Every serious result should include:
- benchmark JSON
- Markdown or HTML report
- runtime version or commit SHA
- harness commit SHA
- diagnostics snapshot when available
- metrics snapshot when available
- incident notes when the run degraded
TwinBench also records dimension-level availability and reason codes so blocked or unsupported dimensions stay interpretable instead of silently looking like product weakness.
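A consumer of an artifact can surface those reason codes explicitly. A minimal sketch, under the assumption that `dimension_status` and `dimension_reason_codes` are parallel dicts keyed by dimension name (verify against docs/ARTIFACT_SCHEMA.md):

```python
def unavailable_dimensions(artifact: dict) -> list:
    """List dimensions that were not fully measured, paired with reason codes.

    Assumed shapes: dimension_status maps dimension name -> status string,
    dimension_reason_codes maps dimension name -> reason code. The real
    schema may nest these differently.
    """
    statuses = artifact.get("dimension_status", {})
    reasons = artifact.get("dimension_reason_codes", {})
    return [
        (dim, reasons.get(dim, "unspecified"))
        for dim, status in statuses.items()
        if status in ("partial", "unavailable", "errored")
    ]
```

Reporting `("scale", "bootstrap-unavailable")` instead of a low scale score is exactly the distinction the fairness rules below are protecting.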
Scale fairness matters especially here:
- same-user serialization is normal for many personal AI assistant runtimes
- multi-user scale claims require provisioned users
- bootstrap-unavailable should not be misread as poor throughput
Recommended reading order:
Builders:
Researchers:
Competitors:
Agent operators:
If your runtime does not match the contract yet, read docs/INTEGRATION_PATHS.md.
No. TwinBench is about long-lived assistant runtime behavior, not only single-turn response quality.
No. Nullalis is the reference runtime because it provides the first strong artifact. The benchmark is intended to be challenged publicly by other runtimes.
Yes, if the product exposes the benchmark contract or a documented compatibility path.
Run TwinBench anyway. A partial but honest artifact is more useful than a narrated claim.
```
DTaaS-benchmark/
├── README.md
├── SPECIFICATION.md
├── PRESSKIT.md
├── CHANGELOG.md
├── CONTRIBUTING.md
├── docs/
│   ├── GETTING_STARTED.md
│   ├── RUN_PROFILES.md
│   ├── PREFLIGHT.md
│   ├── COMPATIBILITY_CHECKLIST.md
│   ├── GLOSSARY.md
│   ├── INTEGRATION_PATHS.md
│   ├── HOW_TO_SUBMIT.md
│   ├── RESULTS_INDEX.md
│   ├── OUTREACH_PACKET.md
│   ├── OUTREACH_TARGETS.md
│   └── TRUST_MODEL.md
├── harness/
└── results/
```
Contributions are welcome, especially:
- new verified runtime artifacts
- benchmark fairness improvements
- docs that make the benchmark easier to adopt
- better test coverage and report generation
Start with CONTRIBUTING.md.
TwinBench is published by Nova Nuggets, an AI innovation company building toward personal, secure, sovereign AI for everyone.
Our focus is practical infrastructure and products for long-lived assistants. The benchmark should stay neutral and open, while making that mission visible to people who discover the repo.