📃 Paper | ⚙️ Code | 🤗 Model | 📭 Contact
We propose WALAR, a reinforcement learning method that uses only monolingual text to elevate LLMs' translation capabilities across a massive number of low-resource languages. Our key insight is to mend the holes in current state-of-the-art neural machine translation metrics, since training directly against these metrics amplifies those holes in the trained LLMs. Specifically, we integrate a quality estimation score, a word alignment score, and language alignment into WALAR's reward to mitigate the reward hacking these holes invite. Finally, we trained three LLMs using WALAR. Extensive experiments on over 1400 language directions demonstrate that our models outperform the strongest prior multilingual models of the same size.
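At a high level, the reward combines the three signals above. The following is a minimal illustrative sketch, not the exact logic in serve_rm.py: the function names, the zero reward for off-target-language outputs, and the additive combination are assumptions; alpha corresponds to the word-alignment weight exposed by serve_rm.sh (default 20).

```python
def walar_reward(source: str, translation: str, target_lang: str,
                 qe_score_fn, word_align_fn, detect_lang_fn,
                 alpha: float = 20.0) -> float:
    """Illustrative sketch of WALAR's composite reward (assumed, not the real code).

    qe_score_fn:    quality-estimation score, e.g. MetricX or xCOMET
    word_align_fn:  word-alignment score between source and translation
    detect_lang_fn: language identifier, e.g. GlotLID
    alpha:          word-alignment weight (default 20, as in serve_rm.sh)
    """
    # Language alignment: if the output is not in the requested target
    # language, it gets no credit regardless of its QE score.
    # (Zero is an assumed penalty; the actual reward shaping may differ.)
    if detect_lang_fn(translation) != target_lang:
        return 0.0

    # QE alone can be hacked (e.g. fluent but unfaithful output), so the
    # word-alignment term rewards keeping the source content covered.
    return qe_score_fn(source, translation) + alpha * word_align_fn(source, translation)
```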
We conducted extensive experiments on FLORES-101 and report xCOMET and MetricX scores for over 1400 language directions. The results demonstrate that WALAR improves LLM translation quality by a large margin. Comparing Qwen3-8B, Translategemma-4B-it, and LLaMAX3-8B-Alpaca before and after training with WALAR, we observe significant average improvements across all metrics, demonstrating that WALAR generalizes across model families.
We also used Gemini 3 Flash as an LLM-as-a-Judge to provide a more comprehensive evaluation of the translations generated by LLaMAX and LLaMAX+WALAR. The results show that LLaMAX3-8B-Alpaca trained with WALAR outperforms the base model on all language directions, and its average score exceeds 66, which corresponds to translations with only minor issues according to the judging rubric.
To systematically assess an LLM's ability to generate translations in the desired target language, we define the Language Consistency Rate (LCR) as the proportion of test instances whose outputs are identified as being in the correct target language. As shown in the figure below, WALAR also improves language consistency by a large margin, especially for low-resource target languages such as Swahili.
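Computing LCR only requires running a language identifier over the model outputs. A minimal sketch using the GlotLID fastText model downloaded in the setup steps below; the label format follows standard fastText conventions, but treat the exact matching rule as an assumption:

```python
import fasttext

# GlotLID model from Step 0 (model_v3.bin)
lid = fasttext.load_model("model_v3.bin")

def lcr(outputs: list[str], target_lang: str) -> float:
    """Language Consistency Rate: fraction of outputs detected as target_lang.
    target_lang uses GlotLID's label format, e.g. "swh_Latn" for Swahili."""
    hits = 0
    for text in outputs:
        # fastText predict() rejects newlines, so flatten them first
        labels, _ = lid.predict(text.replace("\n", " "))
        predicted = labels[0].removeprefix("__label__")
        hits += predicted == target_lang
    return hits / len(outputs)
```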
Our models trained with WALAR also demonstrate strong generalization to language directions unseen during training. These results indicate that the improvements induced by WALAR can transfer beyond the training language set, potentially reducing the amount of parallel data and the number of language directions required to train massively multilingual models.
Configure the environment
pip install -r requirements.txt
Download Models
LLaMAX: https://huggingface.co/LLaMAX/LLaMAX3-8B-Alpaca
MetricX: https://huggingface.co/google/metricx-24-hybrid-xxl-v2p6-bfloat16
MetricX Tokenizer: https://huggingface.co/google/mt5-xl
MaskLID model:
wget https://huggingface.co/cis-lmu/glotlid/resolve/main/model_v3.bin
Language Detector: https://huggingface.co/cis-lmu/glotlid
Word alignment (BGE-M3): https://huggingface.co/BAAI/bge-m3
HanLP (Chinese tokenizer): https://file.hankcs.com/hanlp/tok/coarse_electra_small_20220616_012050.zip
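If you prefer fetching these programmatically, here is an optional sketch using huggingface_hub; the local_dir paths are placeholders you can change, and the HanLP tokenizer above is a plain zip that must be downloaded separately:

```python
from huggingface_hub import hf_hub_download, snapshot_download

# Full model repos (paths are examples; point serve_rm.py at them later)
snapshot_download("LLaMAX/LLaMAX3-8B-Alpaca", local_dir="models/LLaMAX3-8B-Alpaca")
snapshot_download("google/metricx-24-hybrid-xxl-v2p6-bfloat16", local_dir="models/metricx-24")
snapshot_download("google/mt5-xl", local_dir="models/mt5-xl")  # MetricX tokenizer
snapshot_download("BAAI/bge-m3", local_dir="models/bge-m3")    # word alignment

# GlotLID: a single fastText binary used for both MaskLID and language detection
hf_hub_download("cis-lmu/glotlid", "model_v3.bin", local_dir="models/glotlid")
```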
Prerequisite: 1 GPU needed
Please replace all the paths in serve_rm.py with the models you downloaded in Step 0.
Run bash serve_rm.sh under scripts/
bash serve_rm.sh
Parameter Explanation
- model_name: the Quality Estimation (QE) model to use. Can be set to metricX or XComet.
- base_model: the base model you want to evaluate. The model paths are hard-coded in model_path_dict.
- port: the port the reward model listens on.
- max_len: the maximum input sequence length.
- lang_detect: whether to enable the language detector. Set to True to turn it on.
- align: whether to use word alignment. Set to True to turn it on.
- masklid: whether to mask code-mixed spans in the translation outputs. Set to True to turn it on.
- alpha: the weight of the word alignment score. The default value is 20.
- batch_size: the batch size the QE model uses at each evaluation step.
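Once the server is running, you can smoke-test it from Python. OpenRLHF-style remote reward models are typically queried via an HTTP POST that returns per-sample rewards; the endpoint path and payload keys below are assumptions, so check serve_rm.py for the actual interface:

```python
import requests

# Hypothetical smoke test for the reward server started by serve_rm.sh.
# Endpoint name and payload schema are assumptions; verify against serve_rm.py.
resp = requests.post(
    "http://localhost:5000/get_reward",  # use the port set in serve_rm.sh
    json={"query": ["Translate the following text from English to Swahili.\n"
                    "English: Good morning.\nSwahili: Habari za asubuhi."]},
    timeout=60,
)
print(resp.json())  # expected shape: {"rewards": [<float>]}
```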
Prerequisite: 4 or more GPUs recommended
Run bash examples/scripts/train.sh under openrlhf/
bash examples/scripts/train.sh
Parameter Explanation
- model: the model you want to train. Please follow path_dict. (You can add more models by directly modifying path_dict; see the sketch below.)
- dataname: the dataset you want to use. (e.g., if your dataset file is abc.jsonl, set dataname=abc)
- size: the model size tag. You can set it to anything; it only affects the name of your checkpoint directory and the wandb logging, and won't affect the final results.
- reward_name: the reward name tag. You can set it to anything; like size, it only affects the checkpoint directory name and wandb logging, and won't affect the final results.
For the usage of the other parameters, please refer to the documentation of OpenRLHF.
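For illustration only, path_dict is a simple name-to-checkpoint mapping. Below is a hypothetical Python rendering; the real entries live in examples/scripts/train.sh, and the names and paths here are placeholders:

```python
# Hypothetical rendering of the path_dict mapping used by train.sh;
# the real entries and paths live in the script itself.
path_dict = {
    "llamax3-8b-alpaca": "models/LLaMAX3-8B-Alpaca",  # placeholder path
    "qwen3-8b": "models/Qwen3-8B",                    # placeholder path
}

# Setting model=<key> in train.sh resolves to the checkpoint path:
print(path_dict["llamax3-8b-alpaca"])
```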
Hyperparameters
- Train Batch Size: 1024
- Epochs: 1
- Learning Rate: 5e-7
- Rollout Batch Size: 128
- Rollout Nums: 8
- Temperature: 1



