testclaw is a CLI tool that uses Claude to automatically understand your mobile app, generate tests from plain English, run them on iOS simulators, and self-heal when tests break.
You describe what to test in natural language. Claude reads your codebase, writes the test code, executes it, and when something breaks, figures out whether it's a real bug or just a UI change — and fixes the test automatically.
Mobile app testing is painful. You write fragile UI tests that break every time a button moves. You spend more time maintaining tests than writing features. And when tests fail in CI, half the time it's because the test is stale, not because anything is actually broken.
testclaw flips this:
- Write tests in English, not Dart or YAML. Describe what should happen ("user logs in, sees the home screen") and Claude generates the actual test code.
- Tests heal themselves. When your app's UI changes, testclaw diffs the code, reads the new UI, and updates selectors and assertions automatically. Only real bugs get flagged.
- Agentic failure analysis — every test failure is automatically analyzed by Claude. It takes a screenshot of the simulator, looks at what's on screen, and tells you exactly why the test failed. No more guessing from stack traces.
- Agentic testing for flows that can't be scripted — OAuth popups, camera interactions, complex visual flows. Claude takes screenshots, reasons about what's on screen, and drives the app step by step.
It uses the Claude Agent SDK under the hood, so Claude has full access to read your codebase, edit test files, and run commands — the same capabilities as Claude Code, but orchestrated programmatically.
A fully agentic test runner — where Claude screenshots the app and decides what to do on every run — sounds appealing, but it doesn't scale. Each run costs API calls, takes minutes, and can produce non-deterministic results. You can't run it in CI 50 times a day.
TestClaw separates the intelligence from the execution:
- AI writes the tests — Claude reads your codebase, understands widget trees, routes, and selectors, and generates real test code (Dart integration tests or Maestro YAML flows) from plain-English descriptions. This is the expensive, smart part — and it only happens once.
- Deterministic code runs the tests — the generated tests are standard Flutter/Maestro tests. They run fast, produce consistent results, cost nothing, and work in CI. No AI in the loop during execution.
- AI kicks in again only when things break — when a test fails, Claude takes a screenshot, analyzes the failure, and either heals the test automatically or flags it as a real bug. The agentic loop is reserved for diagnosis and repair, not routine execution.
- Full agentic mode is available when you need it — for flows that genuinely can't be scripted (OAuth popups, camera, biometrics), you can mark tests as `type: agentic` and Claude will drive the app visually. But this is opt-in, not the default.
The result: you get the authoring speed of AI with the execution speed and reliability of traditional tests.
```shell
testclaw init --local ./my-flutter-app   # Analyze codebase, discover flows
testclaw suggest                         # Auto-generate test cases from discovered flows
testclaw add-test-ai "User can log in"   # Add more tests in plain English
testclaw generate                        # Claude writes the actual test code
testclaw build                           # Build app for iOS simulator
testclaw run                             # Run all tests
testclaw heal                            # Auto-fix broken tests
testclaw status                          # See results
```
`testclaw init` analyzes your codebase and discovers testable flows. `testclaw suggest` turns those into structured test cases — complete with steps, preconditions, element mappings, and priority levels. Add more at any time with `testclaw add-test-ai` using plain English.
- macOS with Xcode and iOS Simulators installed
- Bun v1.2+ — `curl -fsSL https://bun.sh/install | bash`
- Claude Code CLI — `npm install -g @anthropic-ai/claude-code`
- Flutter (if testing Flutter apps) — install guide
- Maestro (for Maestro-based and agentic tests) — `curl -fsSL "https://get.maestro.mobile.dev" | bash` (requires Java 17+)

Note: `ANTHROPIC_API_KEY` is not required if you have Claude Code CLI installed and authenticated. TestClaw uses the Claude Agent SDK, which piggybacks on your existing Claude Code session. If Claude Code works on your machine, TestClaw will too — no separate API key needed.
```shell
# Verify prerequisites
claude --version    # Claude Code CLI
flutter --version   # Flutter SDK
xcrun simctl list   # iOS Simulators
maestro --version   # Maestro CLI
```

```shell
git clone https://github.com/agarwal-sumit/testclaw.git
cd testclaw
bun install
bun link            # Makes 'testclaw' available globally
```

Now you can run `testclaw` from anywhere.

```shell
git clone https://github.com/agarwal-sumit/testclaw.git
cd testclaw
bun install
bun run src/index.ts --help   # Run directly
```

You can compile testclaw into a self-contained binary (~58 MB, includes the Bun runtime). This requires Claude Code CLI to be installed separately on the target machine.

```shell
bun run compile     # Produces ./testclaw binary
./testclaw --help   # Works anywhere on the same OS/arch
```

The binary embeds all dependencies except the Claude Code subprocess. Set `ANTHROPIC_API_KEY` and ensure the `claude` CLI is on your PATH.
```shell
# From an existing Flutter project directory
cd ~/my-flutter-app
testclaw init

# Or point to a directory
testclaw init --local ~/my-flutter-app

# Or clone from a git URL
testclaw init https://github.com/user/flutter-app.git
```

This does three things:

- Detects the framework (Flutter, React Native)
- Analyzes the codebase — Claude reads your source files and produces a map of screens, routes, widgets, API endpoints, and test identifiers
- Scaffolds a `.qa/` directory with config, test case storage, and result directories
The analysis also discovers testable flows and stores them as suggestions. Run `testclaw suggest` next to turn them into test cases.
```shell
testclaw suggest
```

This takes every suggested flow from the analysis (auth, KYC, checkout, etc.) and asks Claude to produce a complete structured test case for each — with steps, preconditions, element mappings, and priority. Results are saved as YAML files in `.qa/testcases/`.
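For illustration, a stored test case might look roughly like this. This is a hypothetical sketch, not the canonical schema (which lives in `templates/testcase.yaml`); the step fields mirror the flags accepted by `add-test`.

```yaml
# .qa/testcases/auth/login-happy-path.yaml  (illustrative sketch only)
suite: auth
name: login-happy-path
description: User can log in with valid email and password
priority: critical
type: auto
steps:
  - action: input
    target: Email input field
    value: test@example.com
    description: Enter email address
  - action: tap
    target: Login button
    description: Submit login form
  - action: assert
    target: Home screen
    value: Welcome
    description: Verify home screen appears
```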
Add tests anytime in plain English — Claude reads your codebase and generates the structured steps, selectors, and assertions:
```shell
testclaw add-test-ai "User can log in with valid email and password"
testclaw add-test-ai "User adds an item to cart and completes checkout"
testclaw add-test-ai "Verify the portfolio screen shows correct holdings after refresh"
```

Claude produces a complete test case YAML — you don't need to know widget keys or route names.
You can also include test data directly in the description:
```shell
testclaw add-test-ai "User logs in with phone 0100100001, receives OTP 0101, enters PIN 1234"
```

For full manual control, use `add-test` to specify every field yourself:
```shell
testclaw add-test \
  --suite auth \
  --name login-happy-path \
  --description "User can log in with valid email and password" \
  --priority critical \
  --type auto \
  --steps '[
    {"action":"input","target":"Email input field","value":"test@example.com","description":"Enter email address"},
    {"action":"tap","target":"Login button","description":"Submit login form"},
    {"action":"assert","target":"Home screen","value":"Welcome","description":"Verify home screen appears"}
  ]'
```

Or create YAML files directly in `.qa/testcases/<suite>/<name>.yaml`. A template is available at `templates/testcase.yaml`.
```shell
testclaw generate                # Generate tests for all test cases
testclaw generate --suite auth   # Generate only for the auth suite
```

Claude reads your test cases + the codebase analysis and produces:

- Flutter integration tests in `.qa/tests/integration/` (Dart files)
- Maestro flows in `.qa/tests/maestro/` (YAML files)
- Agentic instructions in `.qa/tests/agentic/` (YAML files with test data and step instructions for Claude's screenshot→act loop)

The `type: auto` setting lets Claude decide the best test format based on complexity.
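As a rough illustration, a generated Maestro flow for the login case might look like this. The `appId` and the selector texts are assumptions that depend on your app; the commands (`launchApp`, `tapOn`, `inputText`, `assertVisible`) are standard Maestro syntax.

```yaml
# .qa/tests/maestro/login-happy-path.yaml  (illustrative sketch)
appId: com.example.myapp   # assumption: your app's bundle identifier
---
- launchApp
- tapOn: "Email"           # selector text depends on your widgets
- inputText: "test@example.com"
- tapOn: "Log in"
- assertVisible: "Welcome"
```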
```shell
testclaw build                             # Build for iOS simulator, auto-selects device
testclaw build --device <udid>             # Target a specific simulator
testclaw build --flavor dev                # Build with a specific flavor
testclaw build --dart-define ENV=staging   # Pass dart-define flags (repeatable)
```

This runs `flutter build ios --simulator --debug --no-codesign`, finds the `.app` bundle, and installs it on the simulator. If the build fails, Claude diagnoses the error.
```shell
testclaw run                      # Run everything
testclaw run --suite auth         # Run one suite
testclaw run --type integration   # Run only integration tests
testclaw run --type maestro       # Run only Maestro tests
testclaw run --type agentic       # Run only agentic (screenshot-driven) tests
```

When any test fails, TestClaw automatically takes a screenshot and has Claude analyze what's on screen to give you a clear, visual explanation of the failure — not just a raw stack trace.

Results are saved to `.qa/results/runs/<timestamp>/` with screenshots, logs, agentic analysis, and a `summary.json`.
```shell
testclaw heal             # Classify failures and auto-repair
testclaw heal --dry-run   # Classify only, don't change anything
```

For each failure, Claude:

- Checks if element fingerprints changed
- Diffs the source code since the last green run
- Classifies the failure:
  - `real_bug` — the app has a genuine bug, flagged in results
  - `implementation_change` — UI changed intentionally, test gets updated automatically
  - `flaky` — timing issue, adds waits/retries
- If confidence >= 80%, applies the fix and optionally re-runs to verify
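The decision step above can be sketched in a few lines. This is an illustrative simplification, not the actual testclaw source; the classification labels and the 0.8 threshold come from the behavior described here, while the type and function names are assumptions.

```typescript
// Sketch of the heal decision: what happens after Claude classifies a failure.
type Classification = "real_bug" | "implementation_change" | "flaky";

interface HealVerdict {
  classification: Classification;
  confidence: number; // 0..1, as reported by Claude's failure analysis
}

// Mirrors healingConfidenceThreshold in .qa/config.yaml (default 0.8)
const HEALING_CONFIDENCE_THRESHOLD = 0.8;

type HealAction = "flag_bug" | "apply_fix" | "needs_review";

function decideHealAction(v: HealVerdict): HealAction {
  // Real bugs are never auto-healed; they get flagged in the results.
  if (v.classification === "real_bug") return "flag_bug";
  // UI changes and flaky tests are auto-repaired only above the confidence bar.
  return v.confidence >= HEALING_CONFIDENCE_THRESHOLD ? "apply_fix" : "needs_review";
}
```

The point of the threshold is that a low-confidence repair is worse than no repair: below the bar the failure is surfaced for human review instead of being silently rewritten.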
```shell
testclaw status
```

Shows framework info, test case count, and last run results.
| Type | How it works | Best for |
|---|---|---|
| `integration` | Generates Dart `integration_test` files, runs via `flutter test` | Widget interactions, navigation, data validation |
| `maestro` | Generates Maestro YAML flows, runs via `maestro test` | Cross-app flows, multi-step UI journeys |
| `agentic` | Claude takes screenshots, reasons about the screen, executes Maestro commands in a loop. Gets an instruction file with test data (phone, OTP, PIN, etc.) | OAuth, camera, complex visual verification, flows that resist scripting |
| `auto` | Claude picks the best type based on test case complexity | Default — let the AI decide |
All test types benefit from agentic failure analysis — when any test fails, Claude automatically screenshots the simulator and explains what went wrong visually.
After `testclaw init`, your repo gets a `.qa/` directory:
```
your-app/.qa/
├── config.yaml                # TestClaw settings
├── CLAUDE.md                  # Auto-generated project context for Claude
├── analysis/
│   └── app-structure.json     # Codebase analysis (screens, routes, etc.)
├── testcases/
│   └── auth/
│       └── login-happy-path.yaml
├── tests/
│   ├── integration/           # Generated Dart test files
│   ├── maestro/               # Generated Maestro YAML flows
│   └── agentic/               # Generated instruction files with test data
├── results/
│   ├── runs/<timestamp>/      # Screenshots, logs, analysis, summary per run
│   └── baselines/             # Baseline screenshots for visual regression
├── fingerprints/
│   └── elements.json          # Element fingerprints for self-healing
└── history/
    └── heal-log.json          # Healing audit trail
```
This directory is designed to be committed to git. Test results and healing history create a traceable audit trail.
`.qa/config.yaml`:

```yaml
framework: flutter
repoPath: /path/to/your/app
defaultTestType: auto
healingConfidenceThreshold: 0.8   # Only auto-heal if confidence >= this
maxAgenticTurns: 50               # Max Claude turns for agentic tests
screenshotOnFailure: true
autoCommitResults: true           # Git commit after each run
```

| Variable | Required | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | No | Anthropic API key. Not needed if Claude Code CLI is already authenticated on your machine. |
| `CLAUDE_PATH` | No | Path to the Claude Code executable (default: auto-detect) |
| `QA_USE_SYSTEM_CLAUDE` | No | Set to `1` to force using the system `claude` CLI |
testclaw uses the Claude Agent SDK to give Claude access to the same tools that power Claude Code:
- Code analysis: Claude uses `Read`, `Glob`, and `Grep` to understand your app's structure
- Test generation: Claude uses `Read`, `Write`, and `Edit` to produce runnable test files
- Agentic testing: Claude connects to a custom MCP server with `take_screenshot`, `maestro_execute`, and `get_app_logs` tools. It enters a screenshot→reason→act loop to drive the app
- Failure analysis: when any test fails, Claude takes a screenshot of the simulator and visually analyzes what went wrong
- Self-healing: Claude uses `Read`, `Grep`, and `Edit` to classify failures and repair tests
Each AI operation runs with scoped permissions — the code analyzer can only read files, the test generator can read and write, and only the self-healer can edit existing tests.
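Conceptually, the scoping works like an allow-list per role. The sketch below is illustrative rather than the actual testclaw source; the tool names match the Claude Code tool set described above, while the role names and the lookup function are assumptions for this example.

```typescript
// Per-role tool allow-lists (illustrative sketch, not the real configuration).
const ALLOWED_TOOLS: Record<string, readonly string[]> = {
  analyzer: ["Read", "Glob", "Grep"],   // code analysis: read-only
  generator: ["Read", "Write"],         // test generation: read and write new files
  healer: ["Read", "Grep", "Edit"],     // self-healing: the only role that edits tests
};

// Returns true if the given role is allowed to invoke the given tool.
function canUse(role: string, tool: string): boolean {
  return (ALLOWED_TOOLS[role] ?? []).includes(tool);
}
```

Denying `Edit` to everything except the healer is what keeps a misbehaving generation step from rewriting tests it should only be creating.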
```
testclaw init [repo-url]                      Clone + analyze + scaffold
testclaw init --local <path>                  Analyze existing local repo
testclaw analyze                              Re-analyze codebase
testclaw suggest                              Generate test cases from analysis suggestions
testclaw add-test-ai "<description>"          Create a test case from plain English
testclaw add-test --suite <s> --name <n> ...  Create a test case manually with full control
testclaw generate [--suite <name>]            Generate test code from test cases
testclaw build [--device <udid>]              Build + install on simulator
testclaw run [--suite <s>] [--type <t>]       Run tests
testclaw heal [--dry-run]                     Classify + repair failures
testclaw status                               Show status and last results
```
All commands support `--verbose` for debug logging.
- iOS only — no Android SDK support currently. The simulator manager wraps `xcrun simctl`.
- Flutter-first — React Native detection exists, but test generation is optimized for Flutter.
- Requires Claude Code CLI — the Agent SDK spawns Claude Code as a subprocess. You need it installed and authenticated (or `ANTHROPIC_API_KEY` set).
- API costs — each `analyze`, `generate`, or `heal` operation makes API calls to Claude. Costs depend on codebase size and number of test cases.
Contributions are welcome. The codebase is TypeScript, runs on Bun, and has no build step for development.
```shell
git clone https://github.com/agarwal-sumit/testclaw.git
cd testclaw
bun install
bun run src/index.ts --help   # Start developing
```

Key files:

- `src/orchestrator.ts` — main coordinator, delegates to all managers
- `src/utils/claude-sdk.ts` — thin wrapper around the Agent SDK
- `src/agentic-tester.ts` — the screenshot→reason→act loop with MCP tools + failure analysis
- `src/self-healer.ts` — failure classification and auto-repair pipeline
MIT