SamplingEvolve: Test-Time Scaling through Experience-Guided Trajectory Evolution

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

Overview

SamplingEvolve transforms Test-Time Scaling from "single independent trajectory sampling" into "experience-driven inter-trajectory evolution". It explicitly records and manages historical trajectories and their evaluation results, reuses these exemplars and feedback to reconstruct the trajectory-generation task, and externalizes feedback as composable natural-language "soft gradients" that guide each iteration, so that problem-solving capability improves continuously while trajectories are generated. Experiments show that just three rounds of evolution on the GAIA text-only collection reach a cumulative accuracy of 70.37%, surpassing the Pass@N baseline of 62.96%, and continued evolution ultimately reaches 86.42% (+23.46%). On BrowseComp, the best score is 43.00%, with new problems still being solved in later evolution rounds, validating sustained discovery capability as the evolution tree expands.

Theoretical Background

Reinforcement Learning with Verifiable Rewards (RLVR)

In the development of large language models, Scaling Laws have been the dominant theme: more parameters, richer data, and greater compute typically mean better performance. However, as scaling model parameters and data runs into diminishing marginal returns, research focus has shifted toward Test-Time Scaling — leveraging longer, more structured reasoning and tool chains during inference to unlock model potential within existing parameter constraints.

Reinforcement Learning with Verifiable Rewards (RLVR) represents this paradigm: it uses rule-verifiable rewards and challenging prompts to guide models toward deep reasoning during sampling, thereby solving more difficult problems. This scaling manifests as natural growth in response length over the course of RL training. Three elements are crucial: robust rewards that can continuously guide model optimization, extremely difficult problems that trigger deep-thinking behaviors, and sampled trajectories that lead to correct answers.

RLVR Characteristics and Limitations

However, RLVR faces several practical issues that reduce trajectory sampling diversity and ultimately decrease the efficiency of sampling correct answers:

  1. Models reflect only within a single sampled trajectory; attempts made along the way receive no feedback from outside that trajectory, making it difficult to specifically correct erroneous steps or try completely new approaches.
  2. Once sampled trajectories receive feedback, that feedback is absorbed into the parameters as gradients rather than stored explicitly for experience reuse, so the model cannot guarantee differentiated behavior in subsequent sampling.
  3. Entropy collapse easily occurs during RL: as training progresses, multiple sampled trajectories gradually converge and become nearly identical.

A more practical engineering challenge is that when the policy model initially scores poorly on a set of tasks, the system struggles to sample high-quality examples early on, so optimization progresses slowly or stalls, forcing reliance on human annotation or strong-model distillation for an SFT cold start to improve early hit rates. Such workarounds are severely limited by human or teacher-model capability.

Thus, the problem is naturally restated as: How to more efficiently sample correct solution trajectories?

SamplingEvolve

We propose SamplingEvolve, which reconstructs inference from “single-shot independent sampling” into an “experience-driven inter-trajectory evolution system”. Intuitively, the system persists model-generated historical trajectories together with their evaluation feedback, and uses these externalized experiences to drive unlimited rounds of evolution during sampling.

Unlike traditional independent sampling,

$$x \sim P_\theta(x)$$

SamplingEvolve explicitly conditions sampling on historical experience and feedback, so the sampling distribution becomes

$$x \sim P_\theta\big(x \mid \mathcal{Y}, R\big)$$

where $\mathcal{Y}$ is a reusable set of successful/failed exemplars and $R$ denotes feedback provided by an evaluator. During evolution, newly generated trajectories can read the problem-solving rationale and correctness signals of past trajectories, enabling targeted correction of common error patterns and differentiated exploration beyond existing solution strategies. This shift increases trajectory sampling diversity and raises the attainable performance ceiling: experience is no longer "locked" inside model weights but exists explicitly and traceably.
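To make the conditioning concrete, the minimal sketch below shows one way $x \sim P_\theta(x \mid \mathcal{Y}, R)$ can be realized in practice: past exemplars and their feedback are serialized into the prompt before sampling. The helper names (build_prompt, sample_trajectory, llm_complete) are illustrative assumptions, not the repository's actual API.

# Illustrative sketch only: conditioning sampling on experience (Y) and feedback (R)
# by serializing them into the prompt. Names are hypothetical, not the repo's API.
from typing import Dict, List


def build_prompt(question: str, exemplars: List[Dict], feedback: List[str]) -> str:
    """Serialize past trajectories (Y) and evaluator feedback (R) into the context."""
    parts = [f"Question: {question}", "", "Past attempts and their outcomes:"]
    for ex, fb in zip(exemplars, feedback):
        parts.append(f"- Approach: {ex['summary']}")
        parts.append(f"  Outcome: {'correct' if ex['correct'] else 'incorrect'}")
        parts.append(f"  Feedback: {fb}")
    parts.append("")
    parts.append("Propose a new solution that corrects the errors above "
                 "and differs from the previous approaches.")
    return "\n".join(parts)


def sample_trajectory(llm_complete, question, exemplars, feedback):
    # x ~ P_theta(x | Y, R): the conditioning enters through the prompt.
    return llm_complete(build_prompt(question, exemplars, feedback))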

System Architecture

The system operates through the collaboration of three roles: the Evolution Engine handles multi-round iteration and parallel scheduling; the Trajectory Pool persists all historical trajectories, feedback, and lineage information, ensuring experience is reusable and traceable; Evolutors select multiple trajectories, summarize each trajectory's approach, and guide the model to generate new trajectories based on past feedback. Each trajectory records complete conversation messages, tool call sequences, and parent trajectory IDs, gradually forming a clear evolution tree.
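As a rough illustration of what the Trajectory Pool persists, the sketch below models a single trajectory record with messages, tool calls, feedback, and a parent pointer. The field names are assumptions made for exposition; the actual classes live in core/trajectory.py and core/trajectory_pool.py.

# Minimal sketch of the kind of record a trajectory pool could persist
# (field names are illustrative, not the repository's exact schema).
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrajectoryRecord:
    question_id: str
    trajectory_id: str
    parent_id: Optional[str]          # lineage pointer; None for round-1 roots
    messages: List[Dict]              # full conversation messages
    tool_calls: List[Dict]            # tool call sequence (search, LinkReader, ...)
    answer: str = ""
    feedback: str = ""                # evaluator feedback written back after judging
    correct: Optional[bool] = None    # LLM-Judge verdict, if available
    round_index: int = 1


@dataclass
class TrajectoryPool:
    records: List[TrajectoryRecord] = field(default_factory=list)

    def children_of(self, trajectory_id: str) -> List[TrajectoryRecord]:
        """Follow lineage pointers to reconstruct the evolution tree."""
        return [r for r in self.records if r.parent_id == trajectory_id]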

Key Features

  • Trajectory Evolution: Records and evolves solution attempts across multiple iterations
  • Trajectory Feedback Persistence: Persists all historical trajectories, feedback, and lineage information, ensuring experience is reusable and traceable
  • Tool Integration: Supports web search, LinkReader, and LinkSummary tools for information gathering
  • Comprehensive Logging: Detailed logging system for tracking experiments and debugging

Evolution Process

The process keeps the design as simple as possible while ensuring two key properties, sampling diversity and iterative evolution:

  • Round 1 is initial generation, producing multiple candidates for each question and writing them to the trajectory pool;
  • Rounds 2-N are the evolution phase: for each unsolved question, failed examples are extracted from the trajectory pool, evolutors generate new solutions, and evaluators (LLM-Judge) score them and write the feedback back.

The evolution engine schedules the next round accordingly.
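The simplified sketch below captures this round loop: independent initial generation in round 1, experience-conditioned evolution afterwards, and judging plus write-back every round. It shows control flow only; the generator, evolutor, judge, and pool objects are placeholders rather than the repository's exact interfaces.

# Control-flow sketch of the evolution loop described above (placeholder objects).
def run_evolution(questions, generator, evolutor, judge, pool,
                  max_rounds=5, icl_sample_size=3):
    for round_idx in range(1, max_rounds + 1):
        for q in questions:
            if pool.is_solved(q.id):
                continue  # only unsolved questions keep evolving
            if round_idx == 1:
                # Round 1: independent initial generation, multiple candidates
                candidates = generator.generate_initial(q)
            else:
                # Rounds 2..N: condition on failed exemplars drawn from the pool
                exemplars = pool.sample_failed(q.id, k=icl_sample_size)
                candidates = evolutor.evolve(q, exemplars)
            for traj in candidates:
                correct, feedback = judge.evaluate(q, traj)  # groundtruth-based LLM-Judge
                traj.correct, traj.feedback = correct, feedback
                pool.add(traj)  # write back trajectory, feedback, and lineage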

Experimental Results

GAIA-Text Evolution Results

| Round | Correct Questions | Cumulative Correct | Total Trajectories | Cumulative Accuracy |
|-------|-------------------|--------------------|--------------------|---------------------|
| 1 | 33 | 33 | 81 | 40.74% |
| 2 | 17 | 50 | 96 | 61.73% |
| 3 | 7 | 57 | 93 | 70.37% |
| 4 | 1 | 58 | 96 | 71.60% |
| 5 | 2 | 60 | 113 | 74.07% |
| 6 | 4 | 64 | 105 | 79.01% |
| 7 | 1 | 65 | 85 | 80.25% |
| 8 | 1 | 66 | 80 | 81.48% |
| 9 | 1 | 67 | 74 | 82.72% |
| 10 | 0 | 67 | 70 | 82.72% |
| 11 | 2 | 69 | 70 | 85.19% |
| 12 | 0 | 69 | 60 | 85.19% |
| 13 | 0 | 69 | 60 | 85.19% |
| 14 | 0 | 69 | 60 | 85.19% |
| 15 | 1 | 70 | 60 | 86.42% |
| 16 | 0 | 70 | 55 | 86.42% |
| 17 | 0 | 70 | 55 | 86.42% |
| 18 | 0 | 70 | 55 | 86.42% |
| 19 | 0 | 70 | 55 | 86.42% |
| 20 | 0 | 70 | 55 | 86.42% |
| Overall | 70 | 70 | 1478 | 86.42% |

BrowseComp Evolution Results

| Round | Correct Questions | Cumulative Correct | Total Trajectories | Cumulative Accuracy |
|-------|-------------------|--------------------|--------------------|---------------------|
| 1 | 8 | 8 | 100 | 8.00% |
| 2 | 5 | 13 | 183 | 13.00% |
| 3 | 4 | 17 | 260 | 17.00% |
| 4 | 6 | 23 | 321 | 23.00% |
| 5 | 5 | 28 | 385 | 28.00% |
| 6 | 5 | 33 | 360 | 33.00% |
| 7 | 4 | 37 | 335 | 37.00% |
| 8 | 4 | 41 | 315 | 41.00% |
| 9 | 1 | 42 | 294 | 42.00% |
| 10 | 1 | 43 | 290 | 43.00% |
| Overall | 43 | 43 | 2843 | 43.00% |

Key Insights

On GAIA text-only questions and the first 100 BrowseComp problems, we ran 20-round and 10-round evolution cycles respectively. Using Pass@N as the comparison baseline, for each question we set N to the total number of evolution trajectories generated for that question. From the results, we observe two phenomena worth emphasizing.

First, early rounds show significant benefits: On GAIA, just three rounds of evolution achieved a cumulative accuracy of 70.37%, already exceeding Pass@N's 62.96%; continued evolution ultimately reached 86.42% (+23.46%). On BrowseComp, the highest score was 43.00%.

Second, stronger long-term discovery capability: On GAIA, new correct solutions were found even in round 15, demonstrating sustained exploration capability under evolution tree expansion.

Comparative Analysis

It should be noted that these evolution scores are measured in a known-answer (training-set-style) setting, in order to evaluate the method's sampling diversity and answer-discovery capability on recognized challenging benchmarks. The experiments used a groundtruth-based LLM-Judge to determine trajectory correctness, providing a clear feedback signal for evolution (models never see the groundtruth directly, only the correctness of past trajectories). Numerical comparisons with methods that use different evaluation procedures are therefore not strictly comparable and should be interpreted with caution.
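For illustration, a groundtruth-based LLM-Judge of the kind described above can be a single model call that sees the gold answer while the solving model never does. The prompt wording and the llm_complete callable below are assumptions, not the exact implementation.

# Hedged sketch of a groundtruth-based LLM-Judge: the judge sees the reference
# answer, the evolving model only ever sees the resulting verdict and feedback.
JUDGE_PROMPT = """You are grading a candidate answer against a reference answer.
Question: {question}
Reference answer (hidden from the solving model): {groundtruth}
Candidate answer: {answer}
Reply with "CORRECT" or "INCORRECT", then one sentence of feedback."""


def judge_trajectory(llm_complete, question: str, groundtruth: str, answer: str):
    reply = llm_complete(JUDGE_PROMPT.format(
        question=question, groundtruth=groundtruth, answer=answer))
    correct = reply.strip().upper().startswith("CORRECT")
    feedback = reply.split("\n", 1)[-1].strip()  # everything after the verdict line
    return correct, feedback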

Direct Applications: Replacing Pass@N in Data Construction

In training data construction, especially for Reject Sampling/RL, SamplingEvolve can serve as the new default sampling logic for candidate generation. By using trajectories obtained through evolution to replace traditional Pass@N independent sampling, we can uncover more, and harder, high-value samples, improving data efficiency and raising the performance ceiling. From this perspective, SamplingEvolve turns data collection itself into a form of continuous evolutionary sampling: without modifying weights, it can still push capability boundaries outward. Models that have undergone Reject Sampling/RL can in turn be reused as evolution models within the evolution system, further enhancing evolution effectiveness and achieving dual iterative improvement.

This is our next goal.
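As a hedged sketch of this use case, the snippet below filters judged-correct trajectories out of a pool and exports them as training samples for Reject Sampling/SFT. The field names follow the illustrative record structure shown earlier and are assumptions, not the repository's exact schema.

# Illustrative reject-sampling export: keep only judged-correct trajectories
# and dump them as SFT/RL training samples (field names are assumptions).
import json


def export_correct_trajectories(pool, out_path: str) -> int:
    """Write all verified-correct trajectories to a JSON file and return the count."""
    samples = []
    for rec in pool.records:
        if rec.correct:  # verified by the LLM-Judge during evolution
            samples.append({
                "question_id": rec.question_id,
                "messages": rec.messages,  # full conversation, including tool calls
                "answer": rec.answer,
            })
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)
    return len(samples)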

Architecture

SamplingEvolve/
├── *.zip                   # Evolution results
├── core/                    # Core components
│   ├── evolution_engine.py  # Main evolution engine
│   ├── trajectory.py        # Trajectory class for managing solution attempts
│   ├── trajectory_pool.py   # Pool for managing multiple trajectories
│   ├── traj_generator.py    # Trajectory generator using LLM
│   ├── traj_analyzer.py     # Trajectory analyzer for extracting insights
│   ├── llm_client.py        # LLM client for API calls
│   └── bon_evaluator.py     # Pass@N evaluation
├── operators/               # Evolution operators
│   ├── base.py             # Base operator class
│   ├── icl.py              # In-Context Learning operator
│   └── refine.py           # Refinement operator
├── dataset/                 # Dataset iterators
│   ├── base.py             # Base dataset iterator
│   ├── gaia_iterator.py    # GAIA dataset iterator
│   └── bc_iterator.py      # BrowseComp dataset iterator
├── utils/                   # Utility functions
│   └── tools.py            # Web search and link reader tools
└── log/                     # Logging system
    └── logger.py           # Global logging configuration

Installation

Prerequisites

  • Python 3.11 or higher
  • CUDA-capable GPU (optional, for faster tokenization)

Setup

  1. Clone the repository:
git clone https://github.com/ByteDance-BandAI/SamplingEvolve.git
cd SamplingEvolve
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure environment variables: Create a .env file in the project root with the following variables:
# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_OPENAI_ENDPOINT=your_endpoint_here
AZURE_OPENAI_API_VERSION=2024-03-01-preview
MODEL_NAME=gemini-2.5-pro

# Tool Configuration
SEARCH_API_URL=your_search_api_url

Usage

Basic Evolution Run

from core.evolution_engine import EvolutionEngine

# Initialize the evolution engine
engine = EvolutionEngine(
    data_cfg={
        "dataset_name": "GAIA",  # or "BC" for BrowseComp
        "data_path": "/path/to/your/dataset.parquet",
    },
    max_rounds=5,                    # Maximum evolution rounds
    max_trajectories_per_round=5,    # Max trajectories per round
    icl_sample_size=3,               # Number of examples for ICL
    output_dir="results/experiment1"
)

# Run evolution on 100 questions
results = engine.run(num_questions=100)

Pass@N Evaluation

from core.bon_evaluator import BestOfNEvaluator

# Initialize BoN evaluator
evaluator = BestOfNEvaluator(
    data_cfg={
        "dataset_name": "GAIA",
        "data_path": "/path/to/dataset.parquet",
    },
    n_samples=5,  # Number of samples per question
    output_dir="results/bon_evaluation"
)

# Run evaluation
results = evaluator.run(num_questions=50)

Using Individual Components

Trajectory Generation

from core.traj_generator import TrajGenerator

generator = TrajGenerator()
trajectory = generator.generate(
    prompt="What is the capital of France?",
    question_id="q1",
    max_turns=10
)

Trajectory Analysis

from core.traj_analyzer import TrajAnalyzer

analyzer = TrajAnalyzer()
analysis = analyzer.analyze(
    trajectory=trajectory,
    analysis_prompt="Summarize the solution approach"
)

Output

Results are saved in the specified output directory:

  • pool.json: Complete trajectory pool with all attempts
  • Individual trajectory files: {question_id}_{hash}.json
  • Evolution statistics and logs
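A minimal sketch for inspecting the saved pool is shown below. Since the exact schema of pool.json is not documented here, the assumption that it is a list of trajectory dicts with "question_id" and "correct" keys is illustrative only.

# Hedged sketch: summarize a saved trajectory pool (schema assumed, not documented).
import json
from collections import Counter

with open("results/experiment1/pool.json", "r", encoding="utf-8") as f:
    pool = json.load(f)

per_question = Counter(t["question_id"] for t in pool)
solved = {t["question_id"] for t in pool if t.get("correct")}
print(f"{len(per_question)} questions, {len(pool)} trajectories, {len(solved)} solved")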

Future Directions

The core of this work is to move beyond the entrenched view of "single model-generated sampling trajectories" and instead use a sampling system to systematically expand diversity. The current implementation is still relatively primitive: reference trajectories are selected largely at random, without identifying and weighting high-potential branches; the evolution direction of trajectories relies mainly on the model's own instincts for error correction and differentiation, lacking reusable structured design. To address this, we will proceed along two parallel lines:

  • Branch selection and resource allocation: Introduce branch-potential assessment and adaptive weighting, prioritizing computational resources for lineages that are most likely to produce new solutions rather than sampling uniformly or at random (a speculative sketch appears at the end of this section).
  • Priors and heuristics for evolution direction: Drawing on work such as AlphaEvolve, inject direction-differentiated prior designs (such as problem-solving patterns and differentiated tool-usage rates), combined with heuristic diversity rewards, to systematically expand the solution family rather than relying solely on the model's "natural differentiation".

The goal is to upgrade the current experience-driven approach into measurable, controllable evolution strategies: ones that both stably produce differentiated, high-quality trajectories and keep locating and achieving key breakthroughs under resource constraints.
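As one possible shape for the branch-selection direction, the speculative sketch below scores lineages by a (left unspecified) potential estimate and samples parent trajectories with softmax weights instead of uniformly. This is our own illustration of the idea, not an implemented feature of the repository.

# Speculative sketch: softmax-weighted parent selection over branch-potential scores.
import math
import random


def pick_parent(branches, temperature: float = 1.0):
    """branches: list of (trajectory, potential_score) pairs; higher score = more promising."""
    max_s = max(score for _, score in branches)
    weights = [math.exp((score - max_s) / temperature) for _, score in branches]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for (traj, _), w in zip(branches, weights):
        acc += w
        if r <= acc:
            return traj
    return branches[-1][0]  # fallback for floating-point edge cases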

Conclusion

SamplingEvolve transforms Test-Time Scaling from "single independent trajectory sampling" into "experience-driven inter-trajectory evolution". It explicitly records and manages historical trajectories and their evaluation results, reuses these exemplars and feedback to reconstruct the trajectory-generation task, and externalizes feedback as composable natural-language "soft gradients" that guide each iteration, so that problem-solving capability improves continuously while trajectories are generated. Experiments show that just three rounds of evolution on the GAIA text-only collection reach a cumulative accuracy of 70.37%, surpassing the Pass@N baseline of 62.96%, and continued evolution ultimately reaches 86.42% (+23.46%). On BrowseComp, the best score is 43.00%, with new problems still being solved in later evolution rounds, validating sustained discovery capability as the evolution tree expands. We will open-source the complete code and evolution results to promote reproducible research and to advance trajectory-evolution-based Test-Time Scaling as an important path toward next-generation model capability enhancement.

Citation

@misc{Li2025SamplingEvolve,
  author       = {Minghao Li and Ying Zeng and Cong Ma and Siyao Song and Kai Jia},
  title        = {SamplingEvolve: Test-Time Scaling through Experience-Guided Trajectory Evolution},
  year         = {2025},
  howpublished = {\url{https://bytedancebandai.notion.site/samplingevolve-en}},
  note         = {Blog post. Accessed: 2025-09-16}
}
