Automatically synthesize code harnesses for LLM agents
AutoHarness is a Rust library that automatically generates and optimizes code harnesses for LLM agents, following the approach described in the AutoHarness paper. It uses tree search with Thompson sampling to iteratively refine harness code; the paper reports an average of 14.5 iterations to reach a 100% legal action rate.
Copy and paste the sentence below directly into OpenCode or CloudCode to start:
Now use AutoHarness CLI (https://github.com/gyc567/AutoHarness) to design a Harness engineering system for this project.
Or in Chinese (中文):
现在用 AutoHarness 这个 CLI:https://github.com/gyc567/AutoHarness 对本项目进行设计 Harness 工程系统。
This will automatically initiate the AutoHarness system to design and generate a Harness engineering system for your project.
# One-line install (recommended)
curl -fsSL https://raw.githubusercontent.com/gyc567/AutoHarness/main/install/install.sh | bash
# Or use jsDelivr CDN (faster)
curl -fsSL https://cdn.jsdelivr.net/gh/gyc567/AutoHarness@main/install/install.sh | bash
# Verify
autoharness --version

git clone https://github.com/gyc567/AutoHarness.git
cd AutoHarness/install
chmod +x install.sh
./install.sh

| Command | Description |
|---|---|
| ./install.sh | Install |
| ./install.sh install | Install (same) |
| ./install.sh uninstall | Uninstall |
| ./install.sh --help | Show help |
- Default install path: ~/.local/bin/autoharness
- Add to PATH: export PATH="$HOME/.local/bin:$PATH"
| OS | Architecture | Status |
|---|---|---|
| macOS | Intel (x86_64) | ✅ Available |
| macOS | Apple Silicon (ARM) | ⬅️ Uses x86_64 binary |
| Linux | x86_64 | 🔨 Build from source |
| Windows | x86_64 | 🔨 Build from source |
- Three Harness Modes: Filter, Verifier, and Policy harnesses
- Tree Search + Thompson Sampling: Efficient exploration of code space
- Sandboxed Execution: Secure code execution with resource limits
- Adaptive Optimization: Self-adjusting exploration vs exploitation
- High Performance: Average 14.5 iterations to convergence
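To make the Thompson sampling idea in the list above concrete, here is a minimal, std-only sketch. It is not the library's implementation: it samples over a flat set of candidates rather than a search tree, and the names `Rng`, `Arm`, `select`, and `run` are hypothetical. Each candidate keeps a Beta posterior over its success probability, and the candidate with the highest sampled value is tried next.

```rust
/// Tiny xorshift64* generator so the sketch needs no external crates.
/// The seed must be non-zero.
struct Rng(u64);

impl Rng {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let bits = self.0.wrapping_mul(0x2545_F491_4F6C_DD1D) >> 11;
        bits as f64 / (1u64 << 53) as f64 // uniform in [0, 1)
    }
}

/// Posterior for one candidate: Beta(successes + 1, failures + 1).
#[derive(Clone, Copy)]
struct Arm {
    successes: usize,
    failures: usize,
}

impl Arm {
    /// Draws from Beta(a, b) with integer parameters as the a-th smallest
    /// of a + b - 1 uniform samples (an order-statistic identity).
    fn sample(&self, rng: &mut Rng) -> f64 {
        let a = self.successes + 1;
        let b = self.failures + 1;
        let mut draws: Vec<f64> = (0..a + b - 1).map(|_| rng.next_f64()).collect();
        draws.sort_by(|x, y| x.partial_cmp(y).unwrap());
        draws[a - 1]
    }
}

/// Picks the candidate whose sampled success probability is highest.
fn select(arms: &[Arm], rng: &mut Rng) -> usize {
    let mut best = 0;
    let mut best_draw = f64::MIN;
    for (i, arm) in arms.iter().enumerate() {
        let draw = arm.sample(rng);
        if draw > best_draw {
            best = i;
            best_draw = draw;
        }
    }
    best
}

/// Runs the sampling loop against hidden success rates and returns
/// how often each candidate was tried.
fn run(true_rates: &[f64], rounds: usize, seed: u64) -> Vec<usize> {
    let mut rng = Rng(seed);
    let mut arms = vec![Arm { successes: 0, failures: 0 }; true_rates.len()];
    let mut pulls = vec![0usize; true_rates.len()];
    for _ in 0..rounds {
        let i = select(&arms, &mut rng);
        pulls[i] += 1;
        if rng.next_f64() < true_rates[i] {
            arms[i].successes += 1;
        } else {
            arms[i].failures += 1;
        }
    }
    pulls
}

fn main() {
    // Hypothetical hidden success rates for three harness candidates.
    let pulls = run(&[0.2, 0.5, 0.9], 500, 42);
    println!("tries per candidate: {:?}", pulls);
}
```

Because uncertain candidates occasionally produce high samples, they still get tried early on, while the strongest candidate receives most of the later attempts. That is the exploration versus exploitation balance the feature list refers to.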
Add this to your Cargo.toml:
[dependencies]
autoharness = "0.1.0"

use autoharness::core::{State, Action, Harness, HarnessType};
use autoharness::engine::{CodeSynthesisEngine, SynthesisConfig, Evaluator};
use autoharness::sandbox::{SandboxExecutor, SandboxConfig};

// Define your state
#[derive(Debug, Clone, serde::Serialize)]
struct GameState {
    board: Vec<Vec<i32>>,
    score: i32,
}

impl State for GameState {
    fn to_prompt(&self) -> String {
        format!("Board: {:?}, Score: {}", self.board, self.score)
    }

    fn validate(&self) -> autoharness::core::Result<()> {
        Ok(())
    }
}

// Define your action
#[derive(Debug, Clone, PartialEq, serde::Serialize, serde::Deserialize)]
enum GameAction {
    MoveUp,
    MoveDown,
    MoveLeft,
    MoveRight,
}

impl Action for GameAction {
    fn to_string(&self) -> String {
        format!("{:?}", self)
    }

    fn from_string(s: &str) -> autoharness::core::Result<Self> {
        match s {
            "MoveUp" => Ok(GameAction::MoveUp),
            "MoveDown" => Ok(GameAction::MoveDown),
            "MoveLeft" => Ok(GameAction::MoveLeft),
            "MoveRight" => Ok(GameAction::MoveRight),
            _ => Err(autoharness::core::HarnessError::action_parse("Unknown action")),
        }
    }
}

// Create a custom evaluator
struct GameEvaluator;

impl Evaluator for GameEvaluator {
    fn evaluate(&self, code: &str) -> autoharness::engine::Result<f64> {
        // Evaluate the harness code
        // Return a score between 0.0 and 1.0
        if code.contains("is_legal_action") {
            Ok(0.8)
        } else {
            Ok(0.2)
        }
    }
}

// Synthesize a harness
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = SynthesisConfig::new()
        .with_max_iterations(20)
        .with_convergence_threshold(0.95);

    let mut engine = CodeSynthesisEngine::new(config);
    let evaluator = GameEvaluator;

    let initial_code = r#"
def is_legal_action(state, action):
    # TODO: Implement validation logic
    return True
"#;

    let optimized_code = engine.synthesize(initial_code, &evaluator)?;
    println!("Optimized harness:\n{}", optimized_code);

    Ok(())
}

┌──────────────────────────────────────────────────────────────┐
│                   AutoHarness Architecture                   │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     │
│  │    Core     │     │   Engine    │     │   Sandbox   │     │
│  │   Module    │     │   Module    │     │   Module    │     │
│  └─────────────┘     └─────────────┘     └─────────────┘     │
│         │                   │                   │            │
│         ▼                   ▼                   ▼            │
│  ┌─────────────────────────────────────────────────────┐     │
│  │                   Feedback Module                   │     │
│  └─────────────────────────────────────────────────────┘     │
│                                                              │
└──────────────────────────────────────────────────────────────┘
- core: Core data models (State, Action, Harness traits)
- engine: Code synthesis engine with tree search
- sandbox: Secure code execution environment
- feedback: Feedback collection and consolidation
Represents the current state of an environment.
pub trait State: Serialize + Clone + Send + Sync {
    fn to_prompt(&self) -> String;
    fn validate(&self) -> Result<()>;
}

Represents an action that can be taken in an environment.
pub trait Action: Serialize + Clone + Send + Sync + PartialEq {
    fn to_string(&self) -> String;
    fn from_string(s: &str) -> Result<Self>;
}

Core interface for all harness types.
pub trait Harness<S: State, A: Action>: Send + Sync {
    fn harness_type(&self) -> HarnessType;
    fn evaluate(&self, state: &S, action: &A) -> Result<bool>;
    fn propose_actions(&self, state: &S) -> Result<Vec<A>>;
}

Main synthesis engine that orchestrates the search process.
pub struct CodeSynthesisEngine {
    tree: SearchTree,
    config: SynthesisConfig,
    stats: SynthesisStats,
}

impl CodeSynthesisEngine {
    pub fn new(config: SynthesisConfig) -> Self;
    pub fn synthesize(&mut self, initial_code: &str, evaluator: &dyn Evaluator) -> Result<String, SynthesisError>;
    pub fn get_best_code(&self) -> Option<&CodeNode>;
}

Configuration for the synthesis engine.
pub struct SynthesisConfig {
    pub max_iterations: u32,        // Default: 50
    pub convergence_threshold: f64, // Default: 0.95
    pub max_depth: u32,             // Default: 10
    pub mutations_per_node: usize,  // Default: 3
    pub exploration_constant: f64,  // Default: 1.414
    pub adaptive_sampling: bool,    // Default: true
    pub target_iterations: u32,     // Default: 20
    pub min_improvement: f64,       // Default: 0.01
    pub max_nodes: usize,           // Default: 1000
}

Secure code execution with resource limits.
pub struct SandboxExecutor {
    config: SandboxConfig,
}

impl SandboxExecutor {
    pub fn new(config: SandboxConfig) -> Result<Self, SandboxError>;
    pub async fn execute(&self, code: &str) -> Result<ExecutionResult, SandboxError>;
    pub async fn execute_with_input(&self, code: &str, input: &str) -> Result<ExecutionResult, SandboxError>;
}

Configuration for sandbox execution.
pub struct SandboxConfig {
    pub memory_limit_mb: u64,      // Default: 256
    pub time_limit_ms: u64,        // Default: 5000
    pub max_file_descriptors: u32, // Default: 64
    pub max_output_size: usize,    // Default: 10MB
    pub enable_network: bool,      // Default: false
    pub working_directory: Option<PathBuf>,
    pub environment_variables: HashMap<String, String>,
}

use autoharness::engine::SynthesisConfig;
let config = SynthesisConfig::new()
    .with_max_iterations(20)
    .with_convergence_threshold(0.95)
    .with_max_depth(10);

use autoharness::engine::SynthesisConfig;
let config = SynthesisConfig::new()
    .with_max_iterations(50)
    .with_convergence_threshold(0.99)
    .with_max_depth(15)
    .with_mutations_per_node(5)
    .with_exploration_constant(2.0)
    .with_adaptive_sampling(true)
    .with_target_iterations(30)
    .with_min_improvement(0.005)
    .with_max_nodes(2000);

use autoharness::sandbox::SandboxConfig;
let config = SandboxConfig::new()
    .with_memory_limit(512)
    .with_time_limit(10000)
    .with_max_file_descriptors(128)
    .with_max_output_size(20 * 1024 * 1024) // 20MB
    .with_network(false);

Run the test suite:
cargo test

Run specific tests:
cargo test test_synthesis
cargo test test_sandbox

Based on the AutoHarness paper:
- Average iterations to convergence: 14.5
- Legal action rate: 100% (145 TextArena games)
- Performance improvement: Small model + harness > Large model without harness
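As a rough illustration of the legal action rate reported above, the sketch below computes the fraction of proposed actions that pass a legality check, before and after filtering. It assumes a filter-style harness simply rejects illegal proposals. The `Grid` environment, the `Move` enum, and all function names are hypothetical stand-ins, not part of the autoharness crate.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Move {
    Up,
    Down,
    Left,
    Right,
}

/// Hypothetical environment: a token at (x, y) on a width x height grid.
struct Grid {
    width: i32,
    height: i32,
    x: i32,
    y: i32,
}

/// Legality check: a move is legal if the token stays on the board.
fn is_legal(state: &Grid, action: Move) -> bool {
    let (dx, dy) = match action {
        Move::Up => (0, -1),
        Move::Down => (0, 1),
        Move::Left => (-1, 0),
        Move::Right => (1, 0),
    };
    let (nx, ny) = (state.x + dx, state.y + dy);
    nx >= 0 && nx < state.width && ny >= 0 && ny < state.height
}

/// Keeps only the legal proposals (the assumed behavior of a filter harness).
fn filter_legal(state: &Grid, proposals: &[Move]) -> Vec<Move> {
    proposals
        .iter()
        .copied()
        .filter(|&m| is_legal(state, m))
        .collect()
}

/// Fraction of proposals that are legal; defined as 1.0 for an empty slice.
fn legal_action_rate(state: &Grid, proposals: &[Move]) -> f64 {
    if proposals.is_empty() {
        return 1.0;
    }
    let legal = proposals.iter().filter(|&&m| is_legal(state, m)).count();
    legal as f64 / proposals.len() as f64
}

fn main() {
    // Token in the top-left corner: Up and Left would leave the board.
    let state = Grid { width: 3, height: 3, x: 0, y: 0 };
    let raw = [Move::Up, Move::Right, Move::Left, Move::Down];
    let filtered = filter_legal(&state, &raw);

    println!("raw rate: {}", legal_action_rate(&state, &raw));
    println!("filtered rate: {}", legal_action_rate(&state, &filtered));
}
```

Here the unfiltered proposals are half illegal, while everything that survives the filter is legal by construction. The harder problem, which the synthesis loop addresses, is producing a correct `is_legal` for a real environment.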
AutoHarness implements several security measures:
- Sandboxed Execution: All generated code runs in isolated processes
- Resource Limits: Memory, CPU, and file descriptor limits
- System Call Filtering: Only necessary syscalls are allowed
- Timeout Enforcement: Processes are killed if they exceed time limits
- Input Validation: Code is validated before execution
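The timeout enforcement item above can be sketched with the standard library alone. This is an assumption about how such a limit might be enforced, not the crate's actual sandbox code: `Outcome` and `wait_with_timeout` are hypothetical names, and the example spawns the Unix `sleep` command as a stand-in for generated code. Memory limits and syscall filtering need OS-specific mechanisms that this sketch does not cover.

```rust
use std::io;
use std::process::{Child, Command};
use std::thread;
use std::time::{Duration, Instant};

/// Result of waiting on a child process with a deadline.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Process ended on its own; the code is None if a signal ended it.
    Exited(Option<i32>),
    /// Deadline passed and the process was killed.
    TimedOut,
}

/// Polls the child until it exits or the time limit elapses.
/// On timeout the child is killed and then reaped.
fn wait_with_timeout(child: &mut Child, limit: Duration) -> io::Result<Outcome> {
    let deadline = Instant::now() + limit;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Outcome::Exited(status.code()));
        }
        if Instant::now() >= deadline {
            child.kill()?;
            child.wait()?; // reap the process so it does not linger as a zombie
            return Ok(Outcome::TimedOut);
        }
        thread::sleep(Duration::from_millis(10));
    }
}

fn main() -> io::Result<()> {
    // A process that would run for 5 seconds, given a 200 ms budget.
    let mut child = Command::new("sleep").arg("5").spawn()?;
    let outcome = wait_with_timeout(&mut child, Duration::from_millis(200))?;
    println!("{:?}", outcome);
    Ok(())
}
```

Polling with `try_wait` keeps the sketch free of threads and signals; a production sandbox would more likely combine this with OS-level limits so that a runaway process cannot exhaust resources before the deadline fires.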
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- AutoHarness Paper by Xinghua Lou et al.
- TextArena for game environments
- Thompson Sampling for exploration strategy