A pure vision-based AI agent that plays Age of Empires 2: Definitive Edition using LLMs.
The agent uses a simple loop:
- Capture - Take a screenshot of the game window
- Think - Send to LLM with game context, get back actions
- Act - Execute mouse/keyboard commands
- Remember - Update memory with observations
- Repeat
The LLM sees the raw screenshot and outputs structured JSON with reasoning, observations, and actions. No OCR, no hardcoded coordinates, no predefined action vocabulary.
- Windows 10/11 with AoE2:DE installed
- Python 3.11+
- Anthropic API key
# Create virtual environment
python -m venv venv
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txtSet your Anthropic API key:
set ANTHROPIC_API_KEY=your-key-hereOptional environment variables:
AOE2_MODEL- Claude model to use (default: claude-sonnet-4-5-20250929)AOE2_LOOP_DELAY- Seconds between decisions (default: 2.0)AOE2_SAVE_SCREENSHOTS- Save screenshots to logs/ (default: true)
# Start the agent (runs until Ctrl+C)
python -m src.main
# Run specific number of iterations
python -m src.main --iterations 10
# Test mode - single iteration, no action execution
python -m src.main --test
# Specify provider (currently only claude available)
python -m src.main --provider claudefrom src.screen import capture_screenshot, save_screenshot
# Capture returns (bytes, width, height)
screenshot_bytes, width, height = capture_screenshot()
save_screenshot(screenshot_bytes, "test.jpg")agent/
├── src/ # Core agent implementation
│ ├── main.py # Entry point, CLI argument parsing
│ ├── game_loop.py # Main capture→think→act loop
│ ├── screen.py # Screenshot capture (targets game window)
│ ├── executor.py # Mouse/keyboard action execution
│ ├── config.py # Settings via Pydantic
│ ├── window.py # AoE2 window detection and focus
│ ├── memory.py # Agent memory and game state tracking
│ ├── models.py # Pydantic models for action validation
│ └── providers/
│ ├── base.py # Provider interface (BaseLLMProvider)
│ └── claude.py # Anthropic Claude implementation
├── detection/ # YOLO entity detection system
│ ├── inference/ # Runtime detection (detector.py, models/)
│ ├── training/ # Training pipeline & config
│ ├── extraction/ # Sprite & screenshot extraction
│ ├── testing/ # Validation scripts
│ └── docs/ # Detection documentation
├── data/ # Game knowledge database
│ ├── game_knowledge.py # Structured AoE2 game rules
│ └── aoe2.db # SQLite database
├── prompts/
│ └── system.md # System prompt with game knowledge
├── logs/ # Screenshots saved here
└── requirements.txt
See detection/README.md for details on the entity detection system.
When available, YOLO-based detection provides structured game understanding:
from detection import get_detector
detector = get_detector()
entities = detector.detect(screenshot_bytes)
# Returns: [{"id": "sheep_0", "class": "sheep", "center": (x, y), "confidence": 0.95}, ...]The LLM can reference entities by ID instead of pixel coordinates:
{"type": "right_click", "target_id": "sheep_0"}The executor automatically resolves IDs to screen coordinates.
The agent maintains memory across turns:
- Working Memory - Last 10 turns with reasoning and actions
- Game State - Tracked resources, population, age, alerts
- Context Building - Memory is summarized and sent to LLM each turn
Actions are validated using Pydantic models before execution:
click- Left click at (x, y)right_click- Right click at (x, y)press- Keyboard key pressdrag- Mouse drag from (x1, y1) to (x2, y2)wait- Delay in milliseconds
Coordinates are automatically translated from screenshot-relative to screen-absolute based on window position.
The agent automatically:
- Finds the AoE2 window by title
- Captures only the game window (not full screen)
- Ensures the game is focused before executing actions
- Falls back gracefully if window detection fails
Create a new file in src/providers/ implementing BaseLLMProvider:
from .base import BaseLLMProvider
class MyProvider(BaseLLMProvider):
async def get_actions(
self,
screenshot_bytes: bytes,
context: str = "",
width: int = 1920,
height: int = 1080,
) -> dict:
# Send screenshot to your LLM
# Return {"reasoning": "...", "observations": {...}, "actions": [...]}
pass
def get_system_prompt(self) -> str:
return "..."Then register it in src/main.py in the create_provider() function.
MIT