# ToolGym: An Open-world Tool-using Environment for Scalable Agent Testing
ToolGym is a large-scale, open-world benchmark for evaluating LLM agents' tool-using capabilities. Built on 5,571 real tools across 204 applications, ToolGym enables realistic testing with:
- Long-horizon workflows: Multi-step tasks requiring complex tool coordination
- Wild constraints: Natural language requirements that must be satisfied
- Robustness testing: A State Controller that injects systematic perturbations
| Metric | Value |
|---|---|
| Total Tools | 5,571 |
| Applications | 204 |
| Task Instances | 3,091 |
| Avg. Tools per Task | 4.77 |
| Avg. Steps per Task | 7.46 |
```
┌───────────────────────────────────────────────────────────────┐
│                            ToolGym                            │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────┐   │
│  │  Task Creation  │  │ Tool Retrieval  │  │    State     │   │
│  │     Engine      │  │      Index      │  │  Controller  │   │
│  │                 │  │                 │  │              │   │
│  │ • Workflow      │  │ • BGE-M3        │  │ • Tool-level │   │
│  │   Synthesis     │  │ • FAISS         │  │ • State-level│   │
│  │ • Constraint    │  │ • 5,571 tools   │  │ • Constraint-│   │
│  │   Generation    │  │                 │  │   level      │   │
│  └─────────────────┘  └─────────────────┘  └──────────────┘   │
│                                                               │
├───────────────────────────────────────────────────────────────┤
│                    Planner-Actor Framework                    │
│  ┌─────────────────┐              ┌─────────────────────────┐ │
│  │     Planner     │ ──prompts──▶ │          Actor          │ │
│  │   (Decomposes   │              │  (Executes tools via    │ │
│  │  into subtasks) │ ◀─feedback── │   ReAct reasoning)      │ │
│  └─────────────────┘              └─────────────────────────┘ │
│                                                               │
├───────────────────────────────────────────────────────────────┤
│                         LLM-as-Judge                          │
│          Multi-model evaluation with majority voting          │
└───────────────────────────────────────────────────────────────┘
```
```bash
# Clone the repository
git clone https://github.com/Ziqiao-git/ToolGym.git
cd ToolGym

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your API keys
```

```bash
# Basic usage with semantic tool discovery
python runtime/run_react_agent.py "Search for latest AI news"

# With trajectory logging
python runtime/run_react_agent.py "Find GitHub repos about ML" --save-trajectory

# Custom model
python runtime/run_react_agent.py "Your query" \
  --model anthropic/claude-3.5-sonnet \
  --max-iterations 10
```

The Task Creation Engine synthesizes realistic, long-horizon tasks through:
- Workflow synthesis: Chains tool calls into coherent task sequences
- Constraint generation: Adds natural language requirements
- Diversity sampling: Ensures coverage across tool categories
Location: task_creation_engine/
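A toy sketch of the synthesis idea, assuming tools are chained by compatible outputs: the catalog, tool names, and constraint strings below are invented for illustration, not taken from the real engine in `task_creation_engine/`.

```python
import random

# Hypothetical tool-compatibility catalog: each tool maps to a tool that can
# consume its output. All names here are invented for illustration.
CATALOG = {
    "search_flights": "book_flight",
    "book_flight": "create_calendar_event",
    "search_news": "summarize_text",
}
CONSTRAINTS = ["finish within 10 tool calls", "prefer free-tier APIs"]

def synthesize_task(start: str, rng: random.Random) -> dict:
    """Chain compatible tools into a workflow, then attach one constraint."""
    chain = [start]
    while chain[-1] in CATALOG:          # follow compatible-output links
        chain.append(CATALOG[chain[-1]])
    return {"tools": chain, "constraint": rng.choice(CONSTRAINTS)}

task = synthesize_task("search_flights", random.Random(0))
print(task["tools"])  # ['search_flights', 'book_flight', 'create_calendar_event']
```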
The Tool Retrieval Index provides semantic search over 5,571 tools using:
- Embeddings: BGE-M3 (multilingual, 1024 dimensions)
- Index: FAISS for efficient similarity search
- Dynamic loading: On-demand MCP server connections
Location: tool_retrieval_index/
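The retrieval flow can be sketched with plain NumPy standing in for the BGE-M3/FAISS stack; the `embed` stub and tool names are illustrative, not the real index.

```python
import numpy as np

# Illustrative sketch of semantic tool retrieval: embed() is a deterministic
# stand-in for BGE-M3, and a brute-force inner product stands in for FAISS.
def embed(text: str, dim: int = 1024) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)         # unit-normalize for cosine similarity

def build_index(tools: list[str]) -> np.ndarray:
    """Stack tool-description embeddings (analogue of a FAISS IndexFlatIP)."""
    return np.stack([embed(t) for t in tools])

def search(index: np.ndarray, tools: list[str], query: str, k: int = 3) -> list[str]:
    """Return the top-k tools by cosine similarity to the query."""
    scores = index @ embed(query)
    return [tools[i] for i in np.argsort(scores)[::-1][:k]]

tools = ["github_search_repos", "weather_forecast", "news_headlines"]
index = build_index(tools)
print(search(index, tools, "news_headlines", k=1))  # the matching description ranks first
```

A real deployment would swap `embed` for BGE-M3 inference and the matrix product for a FAISS index, but the rank-by-similarity logic is the same.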
The State Controller enables systematic robustness testing with three control types:
| Control Type | Strategies |
|---|---|
| Tool-level | Timeout, Rate limit, Unavailable, Schema change, Partial failure |
| State-level | Response delay, Data corruption, Truncation, Session timeout, Stale data |
| Constraint-level | Add constraint, Modify constraint, Tighten deadline, Resource limit |
Location: toolgym/state_controller/
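A minimal sketch of how a perturbation wrapper of this kind could work, assuming tools are plain Python callables; the strategy names follow the table above, while the toy `search_news` tool is invented.

```python
# Hypothetical sketch of perturbation wrapping in the spirit of the State
# Controller; the real strategies live in toolgym/state_controller/.
def perturb(tool_fn, strategy: str):
    """Wrap a tool so its calls fail or degrade according to one strategy."""
    def wrapped(*args, **kwargs):
        if strategy == "unavailable":        # tool-level: hard failure
            raise ConnectionError("tool unavailable")
        if strategy == "timeout":            # tool-level: simulated timeout
            raise TimeoutError("tool call timed out")
        result = tool_fn(*args, **kwargs)
        if strategy == "truncation":         # state-level: cut the response short
            return result[: len(result) // 2]
        return result
    return wrapped

def search_news(query: str) -> str:          # toy tool standing in for a real MCP tool
    return f"Top stories about {query}: ..."

flaky = perturb(search_news, "truncation")
print(flaky("AI"))  # first half of the normal response
```

Applying the wrapper at the tool boundary lets the same agent run under clean and perturbed conditions without changing agent code.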
The Planner-Actor Framework is a two-stage agent architecture:
- Planner: Decomposes tasks into subtask sequences
- Actor: Executes subtasks using ReAct reasoning with tool calls
Location: Orchestrator/mcpuniverse/agent/
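The loop can be sketched as follows, with both LLM roles mocked as simple string rules; the subtask wording and the `echo` tool are illustrative, not the framework's actual prompts.

```python
# Minimal sketch of the Planner-Actor split; the real implementation in
# Orchestrator/mcpuniverse/agent/ backs both roles with LLM calls, which
# are mocked here as deterministic string rules.
def planner(task: str) -> list[str]:
    """Decompose the task into an ordered list of subtasks (mocked)."""
    return [f"retrieve tools for: {task}", f"execute: {task}", f"report: {task}"]

def actor(subtask: str, tools: dict) -> str:
    """One ReAct-style step: think, act with a tool, record the observation."""
    thought = f"Thought: handle '{subtask}'"
    observation = tools["echo"](subtask)     # choose and invoke a tool
    return f"{thought} | Observation: {observation}"

tools = {"echo": lambda s: f"done({s})"}
trajectory = [actor(step, tools) for step in planner("search latest AI news")]
print(len(trajectory))  # 3
```

The feedback arrow in the diagram corresponds to passing each observation back so the planner can revise remaining subtasks, which this sketch omits.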
The LLM-as-Judge module performs multi-dimensional evaluation with:
- 5 scoring dimensions: Task fulfillment, Grounding, Tool choice, Tool execution, Requirement satisfaction
- Multi-model voting: Uses multiple LLM judges for robustness
- Majority voting: Final score from consensus
Location: Orchestrator/mcpuniverse/evaluator/
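Majority voting across judges can be sketched as below; the five dimension names come from the list above, while the hard-coded verdicts stand in for actual LLM judge calls.

```python
from collections import Counter

DIMENSIONS = ["task_fulfillment", "grounding", "tool_choice",
              "tool_execution", "requirement_satisfaction"]

def majority_vote(verdicts: list[dict]) -> dict:
    """Consensus pass/fail per dimension: the most common verdict wins."""
    return {d: Counter(v[d] for v in verdicts).most_common(1)[0][0]
            for d in DIMENSIONS}

# Three mocked judge verdicts; one judge disagrees on grounding.
judges = [
    {d: True for d in DIMENSIONS},
    {**{d: True for d in DIMENSIONS}, "grounding": False},
    {d: True for d in DIMENSIONS},
]
consensus = majority_vote(judges)
print(consensus["grounding"])  # True: 2 of 3 judges passed grounding
```

Using an odd number of judges avoids ties on binary verdicts; per-dimension numeric scores could be aggregated by median instead.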
```
ToolGym/
├── README.md                  # This file
├── docs/                      # GitHub Pages website
│   └── index.html             # Leaderboard & documentation
│
├── task_creation_engine/      # Task synthesis
│   └── query_generate.py      # Workflow generation
│
├── tool_retrieval_index/      # Semantic tool search
│   └── server.py              # MCP server with search
│
├── toolgym/                   # Core library
│   └── state_controller/      # Robustness testing
│
├── Orchestrator/              # Agent framework
│   └── mcpuniverse/
│       ├── agent/             # Planner-Actor implementation
│       └── evaluator/         # LLM-as-Judge
│
├── MCP_INFO_MGR/              # Tool data management
│   ├── mcp_data/              # Tool metadata
│   └── semantic_search/       # FAISS index
│
├── runtime/                   # Agent runtime
│   └── run_react_agent.py     # CLI interface
│
└── evaluation/                # Evaluation scripts
```
The ToolGym dataset is available on HuggingFace:
🤗 [ToolGym](https://huggingface.co/ToolGym)
Contents:
- 3,091 task instances with ground-truth tool sequences
- Tool metadata for 5,571 tools across 204 applications
- Constraint annotations and perturbation configurations
```bibtex
@inproceedings{toolgym2025,
  title={ToolGym: An Open-world Tool-using Environment for LLM Agent Evaluation},
  author={...},
  booktitle={Proceedings of ACL 2025},
  year={2025}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Built on the Model Context Protocol (MCP) ecosystem
- Tool data sourced from Smithery and other MCP registries
- Evaluation framework inspired by recent LLM-as-Judge research
- Website: https://ziqiao-git.github.io/ToolGym/
- Dataset: https://huggingface.co/ToolGym
- GitHub: https://github.com/Ziqiao-git/ToolGym