
Temporal Forgetting

This is the official repository for "Temporal Sampling for Forgotten Reasoning in LLMs".

arXiv: 2505.20196 · License: MIT

Abstract

Fine-tuning large language models (LLMs) is intended to improve their reasoning capabilities, yet we uncover a counterintuitive effect: models often forget how to solve problems they had answered correctly earlier in training. We term this phenomenon temporal forgetting and show that it is widespread across model sizes, fine-tuning methods (both Reinforcement Learning and Supervised Fine-Tuning), and multiple reasoning benchmarks.

To address this gap, we introduce Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. This approach recovers forgotten solutions without retraining or ensembling, and leads to substantial improvements in reasoning performance, with gains from 4 to 19 points in Pass@k and consistent gains in Majority@k across several benchmarks. We further extend our method to LoRA-adapted models, demonstrating that storing only adapter weights across checkpoints achieves similar benefits with minimal storage cost.

By leveraging the temporal diversity inherent in training, Temporal Sampling offers a practical, compute-efficient way to surface hidden reasoning ability and rethink how we evaluate LLMs.

Overview

[Figure: overview of temporal forgetting and Temporal Sampling]

Getting Started

Get Code

git clone https://github.com/uw-nsl/Temporal_Forgetting

Training Checkpoints

We have released the RL training checkpoints of the Qwen2.5-7B model on Hugging Face. The model is trained on 4k randomly selected samples from the DeepScaleR dataset, following the DeepScaleR training hyperparameters. We save 8 checkpoints at intervals of 32 steps. Training was conducted with the verl framework.
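Each checkpoint loads like a standard Hugging Face model. Below is a minimal sketch using the transformers library (the checkpoint ID is one of the released step-32 weights; the sampling settings mirror the evaluation script later in this README):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "UWNSL/Qwen2.5-7B-deepscaler_4k_step_32"  # released RL checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is 12 * 34?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(
    inputs, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))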

Temporal Sampling

We implement Temporal Sampling offline, decoupling the pipeline into two stages: first we sample 64 responses from each checkpoint, then we compute Pass@k, Majority Voting, and Best-of-N with Temporal Sampling under various sampling budgets. A sketch of the Pass@k stage follows.
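To make the budgeted evaluation concrete, here is a minimal Python sketch of Pass@k under Temporal Sampling. It assumes a round-robin split of the budget k across T checkpoints and the standard unbiased Pass@k estimator per checkpoint; the repository's exact implementation is in the notebooks referenced below.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: n samples, c of them correct, budget k."""
    if k == 0:
        return 0.0
    if n - c < k:
        return 1.0  # every k-subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def temporal_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Pass@k when the budget k is split round-robin over T checkpoints.

    correct_counts[t] is the number of correct responses (out of n sampled)
    at checkpoint t. Failure probabilities multiply under the assumption
    that draws from different checkpoints are independent.
    """
    T = len(correct_counts)
    alloc = [k // T + (1 if t < k % T else 0) for t in range(T)]
    fail = 1.0
    for c, k_t in zip(correct_counts, alloc):
        fail *= 1.0 - pass_at_k(n, c, k_t)
    return 1.0 - fail

# Example: 8 checkpoints, 64 samples each, total budget k = 16 (2 per checkpoint).
print(temporal_pass_at_k([3, 0, 5, 2, 0, 1, 4, 2], n=64, k=16))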

Sampled Response Data

For convenience, we provide the 64 sampled responses of each checkpoint for AIME24, AIME25, and AMC in sample_64_responses.zip. To extract:

cd Temporal_Forgetting
unzip sample_64_responses.zip

Each folder contains:

  • samples_{task}: All questions and complete model responses
  • model_final_answer_{task}: Extracted model final answers with correctness judgments
  • QwenReward_{task}: ORM and PRM scores from Qwen/Qwen2.5-Math-PRM-7B and Qwen/Qwen2.5-Math-PRM-72B (a small sketch of both aggregations follows this list)
    • ORM: the score of the last step
    • PRM: the average score over all steps
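As a minimal sketch of the two aggregations above (hypothetical helpers; the released QwenReward files already contain the aggregated scores):

def orm_score(step_scores: list[float]) -> float:
    """ORM: the reward of the final step only."""
    return step_scores[-1]

def prm_score(step_scores: list[float]) -> float:
    """PRM: the mean reward over all steps."""
    return sum(step_scores) / len(step_scores)

steps = [0.91, 0.85, 0.78, 0.96]           # per-step scores from a PRM
print(orm_score(steps), prm_score(steps))  # 0.96 0.875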

Inference On Other Tasks

You can also evaluate the models on other tasks yourself. We use lm-evaluation-harness to generate model responses at inference time.

Environment Setup

cd lm-evaluation-harness
conda create -n lm_eval python=3.10
conda activate lm_eval
pip install -e .
pip install vllm==0.6.5

Available Tasks

  • AIME24
  • AIME25
  • AMC
  • Olympiad
  • hendrycks_math_500

Note: To generate a different number of samples, modify the repeats parameter in lm-evaluation-harness/lm_eval/tasks/{task_name}/{task_name}.yaml; the default is 64 samples. An illustrative excerpt is shown below.
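A minimal excerpt of such a task config (illustrative; the surrounding keys in the actual file are omitted here):

# lm-evaluation-harness/lm_eval/tasks/AIME24/AIME24.yaml (excerpt)
task: AIME24
repeats: 64   # number of responses sampled per question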

Answer Judgement

We found that rule-based matching has many edge cases and failure cases (for MATH-500 and Olympiad). Therefore, our scoring process has two steps:

  1. Initial rule-based matching
  2. LLM-based verification: For any incorrect results from rule-based matching, we use Qwen2.5-32B-Instruct to directly compare the model-generated answer in \boxed{} with the ground truth final answer.

Note: even an LLM judge may still have limitations in some cases for MATH-500 and Olympiad.
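A minimal sketch of this two-step pipeline (hypothetical helper names and prompt; the actual implementation is math_metric_llm_eval_general.py, and llm_call stands for any client of the judge model):

import re

def rule_based_match(model_answer: str, ground_truth: str) -> bool:
    """Step 1: strict comparison after light normalization."""
    norm = lambda s: re.sub(r"\s+", "", s).strip("$")
    return norm(model_answer) == norm(ground_truth)

JUDGE_PROMPT = (
    "Compare the model answer with the ground-truth answer.\n"
    "Model answer: {model}\nGround truth: {truth}\n"
    "Reply 'Yes' if they are mathematically equivalent, otherwise 'No'."
)

def judge(model_answer: str, ground_truth: str, llm_call) -> bool:
    """Step 2: fall back to the LLM judge only when rule-based matching fails."""
    if rule_based_match(model_answer, ground_truth):
        return True
    reply = llm_call(JUDGE_PROMPT.format(model=model_answer, truth=ground_truth))
    return reply.strip().lower().startswith("yes")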

Scoring Results

Execute the following command to score the evaluation results:

python math_metric_llm_eval_general.py --directory_path ${output_path}/${SANITIZED_MODEL_SAVE_LABEL} --task ${task}

You can inspect llm_answer_result_check_{task_name} to see the correctness of each sampled response for every question.

Complete Evaluation Script

#!/bin/bash

# Model configurations
models=(
    "UWNSL/Qwen2.5-7B-deepscaler_4k_step_32"
    "UWNSL/Qwen2.5-7B-deepscaler_4k_step_64"
    "UWNSL/Qwen2.5-7B-deepscaler_4k_step_96"
    "UWNSL/Qwen2.5-7B-deepscaler_4k_step_128"
    "UWNSL/Qwen2.5-7B-deepscaler_4k_step_160"
    "UWNSL/Qwen2.5-7B-deepscaler_4k_step_192"
    "UWNSL/Qwen2.5-7B-deepscaler_4k_step_224"
    "UWNSL/Qwen2.5-7B-deepscaler_4k_step_256"
    "Qwen/Qwen2.5-7B"
)

# Task configurations
tasks=("AIME24" "AIME25" "AMC")

# Model parameters
max_model_tokens=16000
max_gen_tokens=16000
model_args="tensor_parallel_size=1,data_parallel_size=4,gpu_memory_utilization=0.8,max_model_len=$max_model_tokens,dtype=bfloat16"
output_path="Sampling_Results"
batch_size="auto"

# Main evaluation loop
for task in "${tasks[@]}"; do
    for model in "${models[@]}"; do
        # Sanitize model name for file paths
        SANITIZED_MODEL_SAVE_LABEL=$(echo ${model} | sed 's/\//__/g')
        SAMPLE_FILE="${output_path}/${SANITIZED_MODEL_SAVE_LABEL}/samples_${task}_"*".jsonl"
        
        # Check if results already exist
        if ls $SAMPLE_FILE >/dev/null 2>&1; then
            echo "Results already exist for: $model - $task"
            continue
        fi    

        echo "Running evaluation for model: $model, task: $task"
        
        # Run lm-evaluation-harness
        lm_eval --model vllm \
            --model_args pretrained="$model",$model_args \
            --gen_kwargs do_sample=True,temperature=0.6,top_p=0.95,max_gen_toks=$max_gen_tokens \
            --tasks "$task" \
            --batch_size "$batch_size" \
            --log_samples \
            --trust_remote_code \
            --output_path "$output_path" \
            --apply_chat_template
       
        # Score the results
        echo "Scoring results for: ${SANITIZED_MODEL_SAVE_LABEL}"
        python math_metric_llm_eval_general.py \
            --directory_path ${output_path}/${SANITIZED_MODEL_SAVE_LABEL} \
            --task ${task}
    done
done

Inference-time Scaling with Temporal Sampling

To reproduce our results with Temporal Sampling, follow the instructions in the notebooks:

Pass@k

  • See Temporal_sampling/calculate_pass_at_k.ipynb

Majority Voting

  • See Temporal_sampling/calculate_Majority_Voting.ipynb
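For reference, a minimal sketch of Majority Voting under Temporal Sampling (illustrative; the notebook is the authoritative implementation): pool k final answers drawn round-robin across checkpoints and return the most frequent one.

from collections import Counter

def temporal_majority(answers_per_ckpt: list[list[str]], k: int) -> str:
    """answers_per_ckpt[t] holds extracted final answers at checkpoint t."""
    T = len(answers_per_ckpt)
    pooled, idx = [], [0] * T
    t = 0
    while len(pooled) < k:
        ckpt = t % T  # round-robin over checkpoints
        pooled.append(answers_per_ckpt[ckpt][idx[ckpt]])
        idx[ckpt] += 1
        t += 1
    return Counter(pooled).most_common(1)[0][0]

# Example: two checkpoints, vote over k = 4 pooled answers.
print(temporal_majority([["42", "41", "42"], ["42", "40", "41"]], k=4))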

Best-of-N

  • See Temporal_sampling/calculate_BoN.ipynb (see Temporal_Forgetting/best-of-n/README.md for details on how to calculate rewards); a small selection sketch follows.
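And a minimal Best-of-N sketch (illustrative): pool responses across checkpoints and return the one with the highest reward, e.g. the ORM or PRM score described earlier.

def temporal_best_of_n(responses: list[str], scores: list[float]) -> str:
    """responses/scores are parallel lists pooled across checkpoints."""
    best = max(range(len(responses)), key=scores.__getitem__)
    return responses[best]

print(temporal_best_of_n(["answer A", "answer B"], [0.31, 0.87]))  # answer B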

The notebooks will generate visualization results similar to:

[Figure: Pass@k results]

We welcome any questions or feedback! 🚀


Acknowledgements

This repository is built upon lm-evaluation-harness. We would like to thank all contributors for their support.


Citation

@article{li2025temporal,
  title={Temporal Sampling for Forgotten Reasoning in LLMs},
  author={Li, Yuetai and Xu, Zhangchen and Jiang, Fengqing and Ramasubramanian, Bhaskar and Niu, Luyao and Lin, Bill Yuchen and Yue, Xiang and Poovendran, Radha},
  journal={arXiv preprint arXiv:2505.20196},
  year={2025}
}
