SamplingEvolve: Test-Time Scaling through Experience-Guided Trajectory Evolution

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

Overview

SamplingEvolve transforms Test-Time Scaling from "single independent trajectory sampling" into "experience-driven inter-trajectory evolution". It explicitly records and manages historical trajectories and their evaluation results, reuses these exemplars and feedback to reconstruct the trajectory-generation task, and externalizes feedback as composable natural-language "soft gradients" that guide each iteration, so that problem-solving capability improves continuously while trajectories are generated. Experiments show that just three rounds of evolution on the GAIA text-only collection reach a cumulative accuracy of 70.37%, surpassing the Pass@N baseline of 62.96%, and continued evolution ultimately reaches 86.42% (+23.46%). On BrowseComp, the best score is 43.00%, with new problems still being solved in later evolution rounds, validating sustained discovery capability as the evolution tree expands.

Theoretical Background

Reinforcement Learning with Verifiable Rewards (RLVR)

In the development of large language models, Scaling Laws have been the dominant theme: more parameters, richer data, and greater compute typically mean better performance. However, as scaling model parameters and data runs into diminishing marginal returns, research focus has shifted toward Test-Time Scaling — leveraging longer, more structured reasoning and tool chains during inference to unlock model potential within existing parameter constraints.

Reinforcement Learning with Verifiable Rewards (RLVR) represents this paradigm: it uses rule-verifiable rewards and challenging prompts to guide models toward deep reasoning during sampling, thereby solving more difficult problems. This scaling manifests as natural growth in response length over the course of RL training. Three elements are crucial: robust rewards that can continuously guide model optimization, extremely difficult problems that trigger deep-thinking behaviors, and sampled trajectories that lead to correct answers.

RLVR Characteristics and Limitations

However, RLVR faces several practical issues that reduce trajectory sampling diversity and ultimately decrease the efficiency of sampling correct answers:

  1. Models reflect only within a single sampled trajectory; attempts made along the way receive no feedback from outside that trajectory, making it difficult to specifically correct erroneous steps or try completely new approaches.
  2. Once sampled trajectories receive feedback, that feedback is absorbed into the parameters as gradients rather than stored explicitly for experience reuse, so the model cannot guarantee differentiated behavior in subsequent sampling.
  3. Entropy collapse easily occurs during RL: as training progresses, multiple sampled trajectories gradually converge and become nearly identical.

A more practical engineering challenge is that when the policy model initially scores poorly on a set of tasks, the system struggles to sample high-quality examples early on, so optimization progresses slowly or stalls, forcing reliance on human annotation or strong-model distillation for an SFT cold start to improve early hit rates. Such workarounds are severely limited by human or teacher-model capability.

Thus, the problem is naturally restated as: How to more efficiently sample correct solution trajectories?

SamplingEvolve

We propose SamplingEvolve, which reconstructs inference from “single-shot independent sampling” into an “experience-driven inter-trajectory evolution system”. Intuitively, the system persists model-generated historical trajectories together with their evaluation feedback, and uses these externalized experiences to drive unlimited rounds of evolution during sampling.

Unlike traditional independent sampling,

$$x \sim P_\theta(x)$$

SamplingEvolve explicitly conditions sampling on historical experience and feedback, so the sampling distribution becomes

$$x \sim P_\theta\big(x \mid \mathcal{Y}, R\big)$$

where $\mathcal{Y}$ is a reusable set of successful/failed exemplars and $R$ denotes feedback provided by an evaluator. During evolution, newly generated trajectories can read the problem-solving rationale and correctness signals of past trajectories, enabling targeted correction of common error patterns and differentiated exploration beyond existing solution strategies. This shift increases trajectory sampling diversity and raises the attainable performance ceiling: experience is no longer "locked" inside model weights but exists explicitly and traceably.
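To make the conditioning concrete, the minimal sketch below shows one way $x \sim P_\theta(x \mid \mathcal{Y}, R)$ can be realized in practice: past exemplars and their feedback are serialized into the prompt before sampling. The helper names (build_prompt, sample_trajectory, llm_complete) are illustrative assumptions, not the repository's actual API.

# Illustrative sketch only: conditioning sampling on experience (Y) and feedback (R)
# by serializing them into the prompt. Names are hypothetical, not the repo's API.
from typing import Dict, List


def build_prompt(question: str, exemplars: List[Dict], feedback: List[str]) -> str:
    """Serialize past trajectories (Y) and evaluator feedback (R) into the context."""
    parts = [f"Question: {question}", "", "Past attempts and their outcomes:"]
    for ex, fb in zip(exemplars, feedback):
        parts.append(f"- Approach: {ex['summary']}")
        parts.append(f"  Outcome: {'correct' if ex['correct'] else 'incorrect'}")
        parts.append(f"  Feedback: {fb}")
    parts.append("")
    parts.append("Propose a new solution that corrects the errors above "
                 "and differs from the previous approaches.")
    return "\n".join(parts)


def sample_trajectory(llm_complete, question, exemplars, feedback):
    # x ~ P_theta(x | Y, R): the conditioning enters through the prompt.
    return llm_complete(build_prompt(question, exemplars, feedback))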

System Architecture

The system operates through the collaboration of three roles: the Evolution Engine handles multi-round iteration and parallel scheduling; the Trajectory Pool persists all historical trajectories, feedback, and lineage information, ensuring experience is reusable and traceable; Evolutors select multiple trajectories, summarize each trajectory's approach, and guide the model to generate new trajectories based on past feedback. Each trajectory records complete conversation messages, tool call sequences, and parent trajectory IDs, gradually forming a clear evolution tree.
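As a rough illustration of what the Trajectory Pool persists, the sketch below models a single trajectory record with messages, tool calls, feedback, and a parent pointer. The field names are assumptions made for exposition; the actual classes live in core/trajectory.py and core/trajectory_pool.py.

# Minimal sketch of the kind of record a trajectory pool could persist
# (field names are illustrative, not the repository's exact schema).
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrajectoryRecord:
    question_id: str
    trajectory_id: str
    parent_id: Optional[str]          # lineage pointer; None for round-1 roots
    messages: List[Dict]              # full conversation messages
    tool_calls: List[Dict]            # tool call sequence (search, LinkReader, ...)
    answer: str = ""
    feedback: str = ""                # evaluator feedback written back after judging
    correct: Optional[bool] = None    # LLM-Judge verdict, if available
    round_index: int = 1


@dataclass
class TrajectoryPool:
    records: List[TrajectoryRecord] = field(default_factory=list)

    def children_of(self, trajectory_id: str) -> List[TrajectoryRecord]:
        """Follow lineage pointers to reconstruct the evolution tree."""
        return [r for r in self.records if r.parent_id == trajectory_id]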

Key Features

  • Trajectory Evolution: Records and evolves solution attempts across multiple iterations
  • Trajectory Feedback Persistence: Persists all historical trajectories, feedback, and lineage information, ensuring experience is reusable and traceable
  • Tool Integration: Supports web search, LinkReader, and LinkSummary tools for information gathering
  • Comprehensive Logging: Detailed logging system for tracking experiments and debugging

Evolution Process

The process keeps the design as simple as possible while ensuring two key properties, sampling diversity and iterative evolution:

  • Round 1 is initial generation, producing multiple candidates for each question and writing them to the trajectory pool;
  • Rounds 2-N are the evolution phase: for each unsolved question, failed examples are extracted from the trajectory pool, evolutors generate new solutions, and evaluators (LLM-Judge) score them and write the feedback back.

The evolution engine schedules the next round accordingly.
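The simplified sketch below captures this round loop: independent initial generation in round 1, experience-conditioned evolution afterwards, and judging plus write-back every round. It shows control flow only; the generator, evolutor, judge, and pool objects are placeholders rather than the repository's exact interfaces.

# Control-flow sketch of the evolution loop described above (placeholder objects).
def run_evolution(questions, generator, evolutor, judge, pool,
                  max_rounds=5, icl_sample_size=3):
    for round_idx in range(1, max_rounds + 1):
        for q in questions:
            if pool.is_solved(q.id):
                continue  # only unsolved questions keep evolving
            if round_idx == 1:
                # Round 1: independent initial generation, multiple candidates
                candidates = generator.generate_initial(q)
            else:
                # Rounds 2..N: condition on failed exemplars drawn from the pool
                exemplars = pool.sample_failed(q.id, k=icl_sample_size)
                candidates = evolutor.evolve(q, exemplars)
            for traj in candidates:
                correct, feedback = judge.evaluate(q, traj)  # groundtruth-based LLM-Judge
                traj.correct, traj.feedback = correct, feedback
                pool.add(traj)  # write back trajectory, feedback, and lineage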

Experimental Results

GAIA-Text Evolution Results

| Round | Correct Questions | Cumulative Correct | Total Trajectories | Cumulative Accuracy |
|-------|-------------------|--------------------|--------------------|---------------------|
| 1 | 33 | 33 | 81 | 40.74% |
| 2 | 17 | 50 | 96 | 61.73% |
| 3 | 7 | 57 | 93 | 70.37% |
| 4 | 1 | 58 | 96 | 71.60% |
| 5 | 2 | 60 | 113 | 74.07% |
| 6 | 4 | 64 | 105 | 79.01% |
| 7 | 1 | 65 | 85 | 80.25% |
| 8 | 1 | 66 | 80 | 81.48% |
| 9 | 1 | 67 | 74 | 82.72% |
| 10 | 0 | 67 | 70 | 82.72% |
| 11 | 2 | 69 | 70 | 85.19% |
| 12 | 0 | 69 | 60 | 85.19% |
| 13 | 0 | 69 | 60 | 85.19% |
| 14 | 0 | 69 | 60 | 85.19% |
| 15 | 1 | 70 | 60 | 86.42% |
| 16 | 0 | 70 | 55 | 86.42% |
| 17 | 0 | 70 | 55 | 86.42% |
| 18 | 0 | 70 | 55 | 86.42% |
| 19 | 0 | 70 | 55 | 86.42% |
| 20 | 0 | 70 | 55 | 86.42% |
| Overall | 70 | 70 | 1478 | 86.42% |

BrowseComp Evolution Results

| Round | Correct Questions | Cumulative Correct | Total Trajectories | Cumulative Accuracy |
|-------|-------------------|--------------------|--------------------|---------------------|
| 1 | 8 | 8 | 100 | 8.00% |
| 2 | 5 | 13 | 183 | 13.00% |
| 3 | 4 | 17 | 260 | 17.00% |
| 4 | 6 | 23 | 321 | 23.00% |
| 5 | 5 | 28 | 385 | 28.00% |
| 6 | 5 | 33 | 360 | 33.00% |
| 7 | 4 | 37 | 335 | 37.00% |
| 8 | 4 | 41 | 315 | 41.00% |
| 9 | 1 | 42 | 294 | 42.00% |
| 10 | 1 | 43 | 290 | 43.00% |
| Overall | 43 | 43 | 2843 | 43.00% |

Key Insights

On GAIA text-only questions and the first 100 BrowseComp problems, we ran 20-round and 10-round evolution cycles respectively. Using Pass@N as the comparison baseline, for each question we set N to the total number of evolution trajectories generated for that question. From the results, we observe two phenomena worth emphasizing.

First, early rounds show significant benefits: On GAIA, just three rounds of evolution achieved a cumulative accuracy of 70.37%, already exceeding Pass@N's 62.96%; continued evolution ultimately reached 86.42% (+23.46%). On BrowseComp, the highest score was 43.00%.

Second, stronger long-term discovery capability: On GAIA, new correct solutions were found even in round 15, demonstrating sustained exploration capability under evolution tree expansion.

Comparative Analysis

It should be noted that these evolution scores are measured in a known-answer (training-set-style) setting, in order to evaluate the method's sampling diversity and answer-discovery capability on recognized challenging benchmarks. The experiments used a groundtruth-based LLM-Judge to determine trajectory correctness, providing a clear feedback signal for evolution (models never see the groundtruth directly, only the correctness of past trajectories). Numerical comparisons with methods that use different evaluation procedures are therefore not strictly comparable and should be interpreted with caution.
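For illustration, a groundtruth-based LLM-Judge of the kind described above can be a single model call that sees the gold answer while the solving model never does. The prompt wording and the llm_complete callable below are assumptions, not the exact implementation.

# Hedged sketch of a groundtruth-based LLM-Judge: the judge sees the reference
# answer, the evolving model only ever sees the resulting verdict and feedback.
JUDGE_PROMPT = """You are grading a candidate answer against a reference answer.
Question: {question}
Reference answer (hidden from the solving model): {groundtruth}
Candidate answer: {answer}
Reply with "CORRECT" or "INCORRECT", then one sentence of feedback."""


def judge_trajectory(llm_complete, question: str, groundtruth: str, answer: str):
    reply = llm_complete(JUDGE_PROMPT.format(
        question=question, groundtruth=groundtruth, answer=answer))
    correct = reply.strip().upper().startswith("CORRECT")
    feedback = reply.split("\n", 1)[-1].strip()  # everything after the verdict line
    return correct, feedback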

Direct Applications: Replacing Pass@N in Data Construction

In training data construction, especially for Reject Sampling/RL, SamplingEvolve can serve as the new default sampling logic for candidate generation. By using trajectories obtained through evolution to replace traditional Pass@N independent sampling, we can uncover more, and harder, high-value samples, improving data efficiency and raising the performance ceiling. From this perspective, SamplingEvolve turns data collection itself into a form of continuous evolutionary sampling: without modifying weights, it can still push capability boundaries outward. Models that have undergone Reject Sampling/RL can in turn be reused as evolution models within the evolution system, further enhancing evolution effectiveness and achieving dual iterative improvement.

This is our next goal.
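As a hedged sketch of this use case, the snippet below filters judged-correct trajectories out of a pool and exports them as training samples for Reject Sampling/SFT. The field names follow the illustrative record structure shown earlier and are assumptions, not the repository's exact schema.

# Illustrative reject-sampling export: keep only judged-correct trajectories
# and dump them as SFT/RL training samples (field names are assumptions).
import json


def export_correct_trajectories(pool, out_path: str) -> int:
    """Write all verified-correct trajectories to a JSON file and return the count."""
    samples = []
    for rec in pool.records:
        if rec.correct:  # verified by the LLM-Judge during evolution
            samples.append({
                "question_id": rec.question_id,
                "messages": rec.messages,  # full conversation, including tool calls
                "answer": rec.answer,
            })
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(samples, f, ensure_ascii=False, indent=2)
    return len(samples)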

Architecture

SamplingEvolve/
├── *.zip                   # Evolution results
├── core/                    # Core components
│   ├── evolution_engine.py  # Main evolution engine
│   ├── trajectory.py        # Trajectory class for managing solution attempts
│   ├── trajectory_pool.py   # Pool for managing multiple trajectories
│   ├── traj_generator.py    # Trajectory generator using LLM
│   ├── traj_analyzer.py     # Trajectory analyzer for extracting insights
│   ├── llm_client.py        # LLM client for API calls
│   └── bon_evaluator.py     # Pass@N evaluation
├── operators/               # Evolution operators
│   ├── base.py             # Base operator class
│   ├── icl.py              # In-Context Learning operator
│   └── refine.py           # Refinement operator
├── dataset/                 # Dataset iterators
│   ├── base.py             # Base dataset iterator
│   ├── gaia_iterator.py    # GAIA dataset iterator
│   └── bc_iterator.py      # BrowseComp dataset iterator
├── utils/                   # Utility functions
│   └── tools.py            # Web search and link reader tools
└── log/                     # Logging system
    └── logger.py           # Global logging configuration

Installation

Prerequisites

  • Python 3.11 or higher
  • CUDA-capable GPU (optional, for faster tokenization)

Setup

  1. Clone the repository:
git clone https://github.com/ByteDance-BandAI/SamplingEvolve.git
cd SamplingEvolve
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure environment variables: Create a .env file in the project root with the following variables:
# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_OPENAI_ENDPOINT=your_endpoint_here
AZURE_OPENAI_API_VERSION=2024-03-01-preview
MODEL_NAME=gemini-2.5-pro

# Tool Configuration
SEARCH_API_URL=your_search_api_url

Usage

Basic Evolution Run

from core.evolution_engine import EvolutionEngine

# Initialize the evolution engine
engine = EvolutionEngine(
    data_cfg={
        "dataset_name": "GAIA",  # or "BC" for BrowseComp
        "data_path": "/path/to/your/dataset.parquet",
    },
    max_rounds=5,                    # Maximum evolution rounds
    max_trajectories_per_round=5,    # Max trajectories per round
    icl_sample_size=3,               # Number of examples for ICL
    output_dir="results/experiment1"
)

# Run evolution on 100 questions
results = engine.run(num_questions=100)

Pass@N Evaluation

from core.bon_evaluator import BestOfNEvaluator

# Initialize BoN evaluator
evaluator = BestOfNEvaluator(
    data_cfg={
        "dataset_name": "GAIA",
        "data_path": "/path/to/dataset.parquet",
    },
    n_samples=5,  # Number of samples per question
    output_dir="results/bon_evaluation"
)

# Run evaluation
results = evaluator.run(num_questions=50)

Using Individual Components

Trajectory Generation

from core.traj_generator import TrajGenerator

generator = TrajGenerator()
trajectory = generator.generate(
    prompt="What is the capital of France?",
    question_id="q1",
    max_turns=10
)

Trajectory Analysis

from core.traj_analyzer import TrajAnalyzer

analyzer = TrajAnalyzer()
analysis = analyzer.analyze(
    trajectory=trajectory,
    analysis_prompt="Summarize the solution approach"
)

Output

Results are saved in the specified output directory:

  • pool.json: Complete trajectory pool with all attempts
  • Individual trajectory files: {question_id}_{hash}.json
  • Evolution statistics and logs
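A minimal sketch for inspecting the saved pool is shown below. Since the exact schema of pool.json is not documented here, the assumption that it is a list of trajectory dicts with "question_id" and "correct" keys is illustrative only.

# Hedged sketch: summarize a saved trajectory pool (schema assumed, not documented).
import json
from collections import Counter

with open("results/experiment1/pool.json", "r", encoding="utf-8") as f:
    pool = json.load(f)

per_question = Counter(t["question_id"] for t in pool)
solved = {t["question_id"] for t in pool if t.get("correct")}
print(f"{len(per_question)} questions, {len(pool)} trajectories, {len(solved)} solved")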

Future Directions

The core of this work is to move beyond the entrenched view of "single model-generated sampling trajectories" and instead use a sampling system to systematically expand diversity. The current implementation is still relatively primitive: reference trajectories are selected largely at random, without identifying and weighting high-potential branches; the evolution direction of trajectories relies mainly on the model's own instincts for error correction and differentiation, lacking reusable structured design. To address this, we will proceed along two parallel lines:

  • Branch selection and resource allocation: Introduce branch-potential assessment and adaptive weighting, prioritizing computational resources for lineages that are most likely to produce new solutions rather than sampling uniformly or at random (a speculative sketch appears at the end of this section).
  • Priors and heuristics for evolution direction: Drawing on work such as AlphaEvolve, inject direction-differentiated prior designs (such as problem-solving patterns and differentiated tool-usage rates), combined with heuristic diversity rewards, to systematically expand the solution family rather than relying solely on the model's "natural differentiation".

The goal is to upgrade the current experience-driven approach into measurable, controllable evolution strategies: ones that both stably produce differentiated, high-quality trajectories and keep locating and achieving key breakthroughs under resource constraints.
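As one possible shape for the branch-selection direction, the speculative sketch below scores lineages by a (left unspecified) potential estimate and samples parent trajectories with softmax weights instead of uniformly. This is our own illustration of the idea, not an implemented feature of the repository.

# Speculative sketch: softmax-weighted parent selection over branch-potential scores.
import math
import random


def pick_parent(branches, temperature: float = 1.0):
    """branches: list of (trajectory, potential_score) pairs; higher score = more promising."""
    max_s = max(score for _, score in branches)
    weights = [math.exp((score - max_s) / temperature) for _, score in branches]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for (traj, _), w in zip(branches, weights):
        acc += w
        if r <= acc:
            return traj
    return branches[-1][0]  # fallback for floating-point edge cases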

Conclusion

SamplingEvolve transforms Test-Time Scaling from "single independent trajectory sampling" into "experience-driven inter-trajectory evolution". It explicitly records and manages historical trajectories and their evaluation results, reuses these exemplars and feedback to reconstruct the trajectory-generation task, and externalizes feedback as composable natural-language "soft gradients" that guide each iteration, so that problem-solving capability improves continuously while trajectories are generated. Experiments show that just three rounds of evolution on the GAIA text-only collection reach a cumulative accuracy of 70.37%, surpassing the Pass@N baseline of 62.96%, and continued evolution ultimately reaches 86.42% (+23.46%). On BrowseComp, the best score is 43.00%, with new problems still being solved in later evolution rounds, validating sustained discovery capability as the evolution tree expands. We will open-source the complete code and evolution results to promote reproducible research and to advance trajectory-evolution-based Test-Time Scaling as an important path toward next-generation model capability enhancement.

Citation

@misc{Li2025SamplingEvolve,
  author       = {Minghao Li and Ying Zeng and Cong Ma and Siyao Song and Kai Jia},
  title        = {SamplingEvolve: Test-Time Scaling through Experience-Guided Trajectory Evolution},
  year         = {2025},
  howpublished = {\url{https://bytedancebandai.notion.site/samplingevolve-en}},
  note         = {Blog post. Accessed: 2025-09-16}
}
