
In this project, we provide an implementation of CLIPO (Contrastive Learning in Policy Optimization). CLIPO introduces a lightweight contrastive head and an intra-group contrastive objective into RLVR-style policy optimization. By aligning multiple successful reasoning trajectories and contrasting them against incorrect ones, the method extracts invariant reasoning structures shared by correct rollouts. This yields a dense contrastive reward that complements standard outcome-based verifiable rewards, improving robustness and generalization, especially under distribution shift and on hard mathematical reasoning benchmarks.
Figure 1. Overview of the CLIPO framework. We introduce a Contrastive Head to provide fine-grained, intra-group contrastive reward signals for each response.
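The intra-group contrastive objective can be sketched as follows. This is an illustrative NumPy formulation, not the repository's API: the function name, shapes, and the choice to average InfoNCE over positives are our assumptions. Within one rollout group, verifier-correct responses serve as each other's positives and incorrect responses as negatives.

```python
import numpy as np

def intra_group_infonce(embs, correct, temperature=0.05):
    """InfoNCE-style loss per response within one rollout group (illustrative).

    embs:    (G, D) embeddings of G responses to the same prompt.
    correct: (G,) boolean mask of verifier-correct responses.
    Returns per-response losses; responses without positives get 0.
    """
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = embs @ embs.T / temperature            # pairwise similarity logits
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity
    losses = np.zeros(len(embs))
    for i in np.flatnonzero(correct):
        pos = correct.copy()
        pos[i] = False                           # other correct responses
        if not pos.any():
            continue
        log_denom = np.log(np.exp(sim[i]).sum())
        # average InfoNCE over each positive for anchor i
        losses[i] = np.mean([log_denom - sim[i, j] for j in np.flatnonzero(pos)])
    return losses
```

Each correct response is pulled toward its fellow correct responses and pushed away from incorrect ones; the resulting per-response loss is what gets mapped to a dense contrastive reward.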
We use the official verl Docker image with verl 0.6 and sglang 0.5.2:
Patch SGLang for sequence-level hidden states. We obtain sequence-level hidden states via mean pooling during rollout. This is implemented by patching sglang’s scheduler_output_processor_mixin.py. From the repo root, run:
```bash
bash modified/sglang_modify.sh
```
The script copies `modified/scheduler_output_processor_mixin.py` into your sglang installation (and backs up the original).
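The mean pooling that the patch performs can be illustrated in a few lines (the helper name is ours; the actual logic lives inside sglang's scheduler output processing):

```python
import numpy as np

def sequence_hidden_state(token_hidden_states):
    """Mean-pool per-token hidden states into one sequence-level vector.

    token_hidden_states: (T, D) array of decoder hidden states for one rollout.
    Illustrative sketch of what the SGLang patch computes during rollout.
    """
    return token_hidden_states.mean(axis=0)
```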
For convenience, we provide the full training and evaluation data used in the paper under the data directory:
- Training data: `data/train/`
- Test data: `data/test/`
Clone this repository to your working directory and prepare a base model (e.g. Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct). Set CLIPO_PATH and MODEL_PATH in the training scripts accordingly.
Prerequisites: Set CLIPO_PATH (repository root) and MODEL_PATH (base model path) at the top of the script. Ensure the SGLang patch has been applied and data/train/ and data/test/ are in place.
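A minimal example of setting the two variables the scripts expect (the paths below are placeholders; substitute your own checkout and model locations):

```shell
# Illustrative paths -- replace with your own.
export CLIPO_PATH=/workspace/public-clipo
export MODEL_PATH=/models/Qwen2.5-7B-Instruct
```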
From the repository root, run:
| Track | Script | Training data | Evaluation sets |
|---|---|---|---|
| Track I | `scripts/train_gsm8k.sh` | `gsm8k.parquet` | GSM8K, GSM variants, TheoremQA, MMLU, TruthfulQA, CommonsenseQA |
| Track II | `scripts/train_math.sh` | `math75.parquet` | MATH, AIME, AIME 2025, AMC, MATH-Perturb (simple/hard) |
```bash
bash scripts/train_gsm8k.sh   # Track I: GSM8K
bash scripts/train_math.sh    # Track II: MATH
```
Logs are appended to `train_gsm8k.log` and `train_math.log`, respectively.
Key CLIPO (contrastive head) hyperparameters:
| Parameter | Default | Description |
|---|---|---|
| `con_lm_head_output_size` | 512 (gsm8k) / 2048 (math) | Projection dimension after the contrastive head. |
| `con_lm_head_loss_type` | `infonce_loss` | Loss type: `infonce_loss`, `supcon_loss_out`, `supcon_loss_in`, `soft_nn_loss`. |
| `con_lm_head_temperature` | 0.05 | Temperature for similarity logits; lower values focus more on hard negatives. |
| `con_lm_head_lambda` | 0.2 | Weight of the contrastive reward: `contrastive_reward = -lambda * loss` (min reward is clamped to -0.5). |
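The reward mapping `contrastive_reward = -lambda * loss` with its -0.5 floor is a one-liner; a minimal illustration (the helper name is ours):

```python
def contrastive_reward(loss, lam=0.2, floor=-0.5):
    """Map a non-negative contrastive loss to a bounded, non-positive reward."""
    return max(-lam * loss, floor)

# With lam=0.2: loss=1.0 -> -0.2; loss=10.0 hits the floor -> -0.5.
```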
For the full set of contrastive-head options (e.g. con_lm_head_type, con_lm_head_masked_label, con_lm_head_contrastive_level, con_lm_head_debias), see verl/trainer/config/model/hf_model.yaml.
```text
public-clipo/
├── README.md
├── modified/                        # SGLang patch (mean pooling for sequence-level hidden states)
│   ├── scheduler_output_processor_mixin.py
│   └── sglang_modify.sh
├── scripts/                         # Training scripts
│   ├── train_gsm8k.sh               # Track I: GSM8K
│   └── train_math.sh                # Track II: MATH
├── verl/                            # verl + CLIPO
│   ├── trainer/
│   │   ├── main_ppo.py              # entrypoint
│   │   ├── config/
│   │   │   └── model/hf_model.yaml  # Model & contrastive-head config
│   │   └── ppo/
│   │       └── ray_trainer.py       # 🌟 Main CLIPO implementation (ContrastiveHead, supcon loss, contrastive rewards)
│   ├── workers/                     # Actor, rollout, ref; fsdp_workers exposes hidden_size for con_lm_head
│   │   └── config/model.py          # HFModelConfig
│   ├── models/
│   └── ...
└── data/                            # Train and test data
    ├── train/
    └── test/
```
Core implementations (verl/trainer/ppo/ray_trainer.py):
- class `ContrastiveHead`
- function `supcon_loss`
- function `compute_contrastive_rewards`
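For orientation, the textbook supervised-contrastive loss (the L_out form of Khosla et al.) that `supcon_loss` names can be sketched as follows. This is the generic formulation, not the repository's exact code; shapes and the NumPy style are our assumptions.

```python
import numpy as np

def supcon_loss(embs, labels, temperature=0.05):
    """Supervised contrastive loss, L_out form: each anchor is pulled toward
    all same-label samples and pushed from the rest. Textbook sketch only.

    embs:   (N, D) embeddings.
    labels: (N,) integer labels (e.g. correct vs. incorrect rollouts).
    """
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = embs @ embs.T / temperature
    np.fill_diagonal(sim, -np.inf)            # exclude self-similarity
    total, count = 0.0, 0
    for i in range(len(embs)):
        pos = labels == labels[i]
        pos[i] = False                        # positives: same label, not self
        if not pos.any():
            continue
        log_denom = np.log(np.exp(sim[i]).sum())
        total += np.mean(log_denom - sim[i, pos])
        count += 1
    return total / max(count, 1)
```

Embeddings clustered by label yield a near-zero loss, while mismatched labels drive it up, which is what makes the per-group loss usable as a dense reward signal.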
This project is built upon several fantastic open-source libraries. We would like to extend our heartfelt gratitude to the developers and communities of:
If you find our work useful in your research, please consider citing our paper:
```bibtex
@misc{cui2026clipo,
      title={CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR},
      author={Sijia Cui and Pengyu Cheng and Jiajun Song and Yongbo Gai and Guojun Zhang and Zhechao Yu and Jianhe Lin and Xiaoxi Jiang and Guanjun Jiang},
      year={2026},
      eprint={2603.10101},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.10101},
}
```