AutoHarness

Automatically synthesize code harnesses for LLM agents

AutoHarness is a Rust library that automatically generates and optimizes code harnesses for LLM agents, following the approach described in the AutoHarness paper. It uses tree search with Thompson sampling to iteratively refine harness code; the paper reports an average of 14.5 iterations to reach a 100% legal action rate.

🚀 One-Sentence Quick Start (For OpenCode/CloudCode)

Copy and paste the sentence below directly into OpenCode or CloudCode to start:

Now use AutoHarness CLI (https://github.com/gyc567/AutoHarness) to design a Harness engineering system for this project.

Or in Chinese (中文):

现在用 AutoHarness 这个 CLI:https://github.com/gyc567/AutoHarness 对本项目进行设计 Harness 工程系统。

This prompts the coding agent to use AutoHarness to design and generate a Harness engineering system for your project.

⚡ Quick Install (One-Click)

# One-line install (recommended)
curl -fsSL https://raw.githubusercontent.com/gyc567/AutoHarness/main/install/install.sh | bash

# Or use jsDelivr CDN (faster)
curl -fsSL https://cdn.jsdelivr.net/gh/gyc567/AutoHarness@main/install/install.sh | bash

# Verify
autoharness --version

Alternative: Clone & Install

git clone https://github.com/gyc567/AutoHarness.git
cd AutoHarness/install
chmod +x install.sh
./install.sh

Installation Options

Command	Description
`./install.sh`	Install
`./install.sh install`	Install (same)
`./install.sh uninstall`	Uninstall
`./install.sh --help`	Show help

Installation Location

Default: ~/.local/bin/autoharness
Add to PATH: export PATH="$HOME/.local/bin:$PATH"

Supported Platforms

OS	Architecture	Status
macOS	Intel (x86_64)	✅ Available
macOS	Apple Silicon (ARM)	⬅️ Uses x86_64 binary
Linux	x86_64	🔨 Build from source
Windows	x86_64	🔨 Build from source

🎯 Key Features

Three Harness Modes: Filter, Verifier, and Policy harnesses
Tree Search + Thompson Sampling: Efficient exploration of code space
Sandboxed Execution: Secure code execution with resource limits
Adaptive Optimization: Self-adjusting exploration vs exploitation
High Performance: Converges in 14.5 iterations on average, as reported in the AutoHarness paper
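To make the "Tree Search + Thompson Sampling" feature concrete, here is a minimal, std-only sketch of Thompson sampling over candidate harness nodes. It is not the library's implementation: `Rng`, `NodeStats`, and `select_node` are hypothetical names, and the Beta posterior over success/failure counts is an assumption about how node quality could be modelled.

```rust
/// Small SplitMix64-style PRNG so the sketch needs no external crates.
pub struct Rng(u64);

impl Rng {
    pub fn new(seed: u64) -> Self {
        Rng(seed)
    }

    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform sample strictly inside (0, 1).
    fn uniform(&mut self) -> f64 {
        ((self.next_u64() >> 12) as f64 + 0.5) / (1u64 << 52) as f64
    }

    /// Gamma(k, 1) for integer k >= 1, as a sum of k Exp(1) draws.
    fn gamma_int(&mut self, k: u32) -> f64 {
        (0..k).map(|_| -self.uniform().ln()).sum()
    }

    /// Beta(a, b) for integer a, b >= 1, built from two Gamma draws.
    pub fn beta(&mut self, a: u32, b: u32) -> f64 {
        let x = self.gamma_int(a);
        let y = self.gamma_int(b);
        x / (x + y)
    }
}

/// Hypothetical per-node statistics: how often a candidate harness
/// passed or failed evaluation.
#[derive(Debug, Clone, Copy)]
pub struct NodeStats {
    pub successes: u32,
    pub failures: u32,
}

/// Thompson sampling: draw once from each node's Beta posterior
/// (uniform Beta(1, 1) prior) and expand the node with the largest draw.
pub fn select_node(nodes: &[NodeStats], rng: &mut Rng) -> Option<usize> {
    let mut best: Option<(usize, f64)> = None;
    for (i, n) in nodes.iter().enumerate() {
        let draw = rng.beta(n.successes + 1, n.failures + 1);
        if best.map_or(true, |(_, b)| draw > b) {
            best = Some((i, draw));
        }
    }
    best.map(|(i, _)| i)
}
```

Because each selection is a random draw, nodes with few observations still get picked occasionally, while nodes with a strong track record are picked most of the time. That is the exploration/exploitation balance the feature list refers to.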

📦 Installation (Cargo)

Add this to your Cargo.toml:

[dependencies]
autoharness = "0.1.0"

🚀 Quick Start

Basic Usage

use autoharness::core::{State, Action, Harness, HarnessType};
use autoharness::engine::{CodeSynthesisEngine, SynthesisConfig, Evaluator};
use autoharness::sandbox::{SandboxExecutor, SandboxConfig};

// Define your state
#[derive(Debug, Clone, serde::Serialize)]
struct GameState {
    board: Vec<Vec<i32>>,
    score: i32,
}

impl State for GameState {
    fn to_prompt(&self) -> String {
        format!("Board: {:?}, Score: {}", self.board, self.score)
    }

    fn validate(&self) -> autoharness::core::Result<()> {
        Ok(())
    }
}

// Define your action
#[derive(Debug, Clone, PartialEq, serde::Serialize, serde::Deserialize)]
enum GameAction {
    MoveUp,
    MoveDown,
    MoveLeft,
    MoveRight,
}

impl Action for GameAction {
    fn to_string(&self) -> String {
        format!("{:?}", self)
    }

    fn from_string(s: &str) -> autoharness::core::Result<Self> {
        match s {
            "MoveUp" => Ok(GameAction::MoveUp),
            "MoveDown" => Ok(GameAction::MoveDown),
            "MoveLeft" => Ok(GameAction::MoveLeft),
            "MoveRight" => Ok(GameAction::MoveRight),
            _ => Err(autoharness::core::HarnessError::action_parse("Unknown action")),
        }
    }
}

// Create a custom evaluator
struct GameEvaluator;

impl Evaluator for GameEvaluator {
    fn evaluate(&self, code: &str) -> autoharness::engine::Result<f64> {
        // Placeholder scoring for illustration only: a real evaluator would
        // run the harness against the environment and return a score
        // between 0.0 and 1.0.
        if code.contains("is_legal_action") {
            Ok(0.8)
        } else {
            Ok(0.2)
        }
    }
}

// Synthesize a harness
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = SynthesisConfig::new()
        .with_max_iterations(20)
        .with_convergence_threshold(0.95);

    let mut engine = CodeSynthesisEngine::new(config);
    let evaluator = GameEvaluator;

    let initial_code = r#"
        def is_legal_action(state, action):
            # TODO: Implement validation logic
            return True
    "#;

    let optimized_code = engine.synthesize(initial_code, &evaluator)?;
    println!("Optimized harness:\n{}", optimized_code);

    Ok(())
}

🏗️ Architecture

Core Components

┌──────────────────────────────────────────────────────────────┐
│                    AutoHarness Architecture                   │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐       │
│  │   Core      │    │   Engine    │    │   Sandbox   │       │
│  │   Module    │    │   Module    │    │   Module    │       │
│  └─────────────┘    └─────────────┘    └─────────────┘       │
│         │                  │                  │               │
│         ▼                  ▼                  ▼               │
│  ┌─────────────────────────────────────────────────────┐     │
│  │              Feedback Module                         │     │
│  └─────────────────────────────────────────────────────┘     │
│                                                               │
└──────────────────────────────────────────────────────────────┘

Module Overview

core: Core data models (State, Action, Harness traits)
engine: Code synthesis engine with tree search
sandbox: Secure code execution environment
feedback: Feedback collection and consolidation

📚 API Documentation

Core Module

`State` Trait

Represents the current state of an environment.

pub trait State: Serialize + Clone + Send + Sync {
    fn to_prompt(&self) -> String;
    fn validate(&self) -> Result<()>;
}

`Action` Trait

Represents an action that can be taken in an environment.

pub trait Action: Serialize + Clone + Send + Sync + PartialEq {
    fn to_string(&self) -> String;
    fn from_string(s: &str) -> Result<Self>;
}

`Harness` Trait

Core interface for all harness types.

pub trait Harness<S: State, A: Action>: Send + Sync {
    fn harness_type(&self) -> HarnessType;
    fn evaluate(&self, state: &S, action: &A) -> Result<bool>;
    fn propose_actions(&self, state: &S) -> Result<Vec<A>>;
}
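As an illustration of the trait shape above, the following sketch implements a filter-style harness for a small grid game. It is self-contained and does not use the crate: the `Harness` trait here is a simplified local stand-in (no `serde` bounds, `String` errors), and `Grid`, `Move`, and `BoundsFilter` are hypothetical names introduced only for this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum HarnessType {
    Filter,
    Verifier,
    Policy,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Move {
    Up,
    Down,
    Left,
    Right,
}

/// Hypothetical state: a player at (x, y) on a width x height grid,
/// with y growing downward.
#[derive(Debug, Clone)]
pub struct Grid {
    pub width: i32,
    pub height: i32,
    pub x: i32,
    pub y: i32,
}

/// Simplified stand-in for the trait shown above.
pub trait Harness<S, A> {
    fn harness_type(&self) -> HarnessType;
    fn evaluate(&self, state: &S, action: &A) -> Result<bool, String>;
    fn propose_actions(&self, state: &S) -> Result<Vec<A>, String>;
}

const ALL_MOVES: [Move; 4] = [Move::Up, Move::Down, Move::Left, Move::Right];

/// A filter harness: it only lets through moves that stay on the board.
pub struct BoundsFilter;

impl Harness<Grid, Move> for BoundsFilter {
    fn harness_type(&self) -> HarnessType {
        HarnessType::Filter
    }

    /// Returns Ok(true) if the move keeps the player inside the grid.
    fn evaluate(&self, s: &Grid, a: &Move) -> Result<bool, String> {
        if s.width <= 0 || s.height <= 0 {
            return Err("grid has no cells".to_string());
        }
        let (nx, ny) = match a {
            Move::Up => (s.x, s.y - 1),
            Move::Down => (s.x, s.y + 1),
            Move::Left => (s.x - 1, s.y),
            Move::Right => (s.x + 1, s.y),
        };
        Ok(nx >= 0 && nx < s.width && ny >= 0 && ny < s.height)
    }

    /// Lists every legal move, so an agent can choose only among those.
    fn propose_actions(&self, s: &Grid) -> Result<Vec<Move>, String> {
        let mut legal = Vec::new();
        for &m in ALL_MOVES.iter() {
            if self.evaluate(s, &m)? {
                legal.push(m);
            }
        }
        Ok(legal)
    }
}
```

In this sketch `propose_actions` is built on top of `evaluate`, so the legality rule lives in one place. A verifier-style harness could reuse the same `evaluate` to check an action the agent has already chosen.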

Engine Module

`CodeSynthesisEngine`

Main synthesis engine that orchestrates the search process.

pub struct CodeSynthesisEngine {
    tree: SearchTree,
    config: SynthesisConfig,
    stats: SynthesisStats,
}

impl CodeSynthesisEngine {
    pub fn new(config: SynthesisConfig) -> Self;
    pub fn synthesize(&mut self, initial_code: &str, evaluator: &dyn Evaluator) -> Result<String, SynthesisError>;
    pub fn get_best_code(&self) -> Option<&CodeNode>;
}

`SynthesisConfig`

Configuration for the synthesis engine.

pub struct SynthesisConfig {
    pub max_iterations: u32,           // Default: 50
    pub convergence_threshold: f64,    // Default: 0.95
    pub max_depth: u32,                // Default: 10
    pub mutations_per_node: usize,     // Default: 3
    pub exploration_constant: f64,     // Default: 1.414
    pub adaptive_sampling: bool,       // Default: true
    pub target_iterations: u32,        // Default: 20
    pub min_improvement: f64,          // Default: 0.01
    pub max_nodes: usize,              // Default: 1000
}
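The sketch below shows one plausible way three of these fields could interact as stopping conditions. It is an assumption, not the engine's documented behaviour: the source does not say how `min_improvement` is applied, and `StopConfig`, `Stop`, and `check_stop` are hypothetical names. Only the default values are taken from the listing above.

```rust
/// Local mirror of three SynthesisConfig fields, with the defaults
/// listed above.
#[derive(Debug, Clone)]
pub struct StopConfig {
    pub max_iterations: u32,
    pub convergence_threshold: f64,
    pub min_improvement: f64,
}

impl Default for StopConfig {
    fn default() -> Self {
        StopConfig {
            max_iterations: 50,
            convergence_threshold: 0.95,
            min_improvement: 0.01,
        }
    }
}

/// Hypothetical outcome of checking the stopping conditions.
#[derive(Debug, PartialEq)]
pub enum Stop {
    Continue,
    Converged,
    Stalled,
    OutOfBudget,
}

/// Decide whether the search should go on after `iteration` steps.
/// Assumed order of checks: convergence first, then the iteration
/// budget, then (an assumption) a stall test on the latest score gain.
pub fn check_stop(cfg: &StopConfig, iteration: u32, best: f64, previous_best: f64) -> Stop {
    if best >= cfg.convergence_threshold {
        Stop::Converged
    } else if iteration >= cfg.max_iterations {
        Stop::OutOfBudget
    } else if iteration > 0 && best - previous_best < cfg.min_improvement {
        Stop::Stalled
    } else {
        Stop::Continue
    }
}
```

With the defaults, a best score of 0.97 counts as converged, while a gain from 0.60 to 0.605 falls below `min_improvement` and would be flagged as a stall under this reading.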

Sandbox Module

`SandboxExecutor`

Secure code execution with resource limits.

pub struct SandboxExecutor {
    config: SandboxConfig,
}

impl SandboxExecutor {
    pub fn new(config: SandboxConfig) -> Result<Self, SandboxError>;
    pub async fn execute(&self, code: &str) -> Result<ExecutionResult, SandboxError>;
    pub async fn execute_with_input(&self, code: &str, input: &str) -> Result<ExecutionResult, SandboxError>;
}

`SandboxConfig`

Configuration for sandbox execution.

pub struct SandboxConfig {
    pub memory_limit_mb: u64,          // Default: 256
    pub time_limit_ms: u64,            // Default: 5000
    pub max_file_descriptors: u32,     // Default: 64
    pub max_output_size: usize,        // Default: 10MB
    pub enable_network: bool,          // Default: false
    pub working_directory: Option<PathBuf>,
    pub environment_variables: HashMap<String, String>,
}

🔧 Configuration Examples

Basic Configuration

use autoharness::engine::SynthesisConfig;

let config = SynthesisConfig::new()
    .with_max_iterations(20)
    .with_convergence_threshold(0.95)
    .with_max_depth(10);

Advanced Configuration

use autoharness::engine::SynthesisConfig;

let config = SynthesisConfig::new()
    .with_max_iterations(50)
    .with_convergence_threshold(0.99)
    .with_max_depth(15)
    .with_mutations_per_node(5)
    .with_exploration_constant(2.0)
    .with_adaptive_sampling(true)
    .with_target_iterations(30)
    .with_min_improvement(0.005)
    .with_max_nodes(2000);

Sandbox Configuration

use autoharness::sandbox::SandboxConfig;

let config = SandboxConfig::new()
    .with_memory_limit(512)
    .with_time_limit(10000)
    .with_max_file_descriptors(128)
    .with_max_output_size(20 * 1024 * 1024)  // 20MB
    .with_network(false);

🧪 Testing

Run the test suite:

cargo test

Run specific tests:

cargo test test_synthesis
cargo test test_sandbox

📊 Performance

Based on the AutoHarness paper:

Average iterations to convergence: 14.5
Legal action rate: 100% across 145 TextArena games
Performance improvement: a smaller model with a harness outperforms a larger model without one
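For reference, the legal action rate is the fraction of proposed actions that the environment accepts as legal. A minimal sketch of computing it follows; `legal_action_rate` is a hypothetical helper and not part of the crate's API.

```rust
/// Fraction of proposed actions that pass the legality check.
/// Returns None when there are no actions, so an empty run is not
/// reported as either 0% or 100%.
pub fn legal_action_rate<A, F>(proposed: &[A], is_legal: F) -> Option<f64>
where
    F: Fn(&A) -> bool,
{
    if proposed.is_empty() {
        return None;
    }
    let legal = proposed.iter().filter(|a| is_legal(*a)).count();
    Some(legal as f64 / proposed.len() as f64)
}
```

A rate of 1.0 corresponds to the 100% figure quoted above. An `Evaluator` could return this value directly as its score, since it already lies between 0.0 and 1.0.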

🔒 Security

AutoHarness implements several security measures:

Sandboxed Execution: All generated code runs in isolated processes
Resource Limits: Memory, CPU, and file descriptor limits
System Call Filtering: Only necessary syscalls are allowed
Timeout Enforcement: Processes are killed if they exceed time limits
Input Validation: Code is validated before execution
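The timeout measure in the list above can be sketched with the standard library alone. This is an illustration of the general technique (poll the child process, kill it once the deadline passes) and not the crate's sandbox code; `run_with_time_limit` and `Outcome` are hypothetical names, and memory limits and syscall filtering are out of scope here.

```rust
use std::io;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical result of a time-limited run.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The process finished on its own; carries its exit code if any.
    Exited(Option<i32>),
    /// The process exceeded the limit and was killed.
    TimedOut,
}

/// Spawn `cmd` with its standard streams detached, then poll until it
/// exits or `time_limit_ms` elapses, killing it in the latter case.
pub fn run_with_time_limit(cmd: &mut Command, time_limit_ms: u64) -> io::Result<Outcome> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    let deadline = Instant::now() + Duration::from_millis(time_limit_ms);
    loop {
        // try_wait returns Ok(None) while the child is still running.
        if let Some(status) = child.try_wait()? {
            return Ok(Outcome::Exited(status.code()));
        }
        if Instant::now() >= deadline {
            child.kill()?;
            child.wait()?; // reap the killed process
            return Ok(Outcome::TimedOut);
        }
        thread::sleep(Duration::from_millis(5));
    }
}
```

Polling keeps the sketch dependency-free at the cost of a few milliseconds of latency. For example, running `sleep 5` with a 100 ms limit should yield `Outcome::TimedOut` on a Unix-like system.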

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

AutoHarness Paper by Xinghua Lou et al.
TextArena for game environments
Thompson Sampling for exploration strategy
