This repository contains the NPU-supported implementation of:
We propose MoE-LPR (Mixture-of-Experts with Language Priors Routing),
a two-stage training framework that enhances an LLM’s multilingual capability.
- Stage 1 – Post-pretraining with MoE:
  Upcycle a dense model into a Mixture-of-Experts (MoE) architecture, freezing the original parameters while adding new experts.
  This stage focuses on improving new-language capabilities without using original-language data.
- Stage 2 – Review with LPR:
  Train only the router using language priors, guiding the model to retain knowledge of the original languages while maintaining new-language performance.
Evaluations show that MoE-LPR achieves superior performance and strong resistance to catastrophic forgetting across multiple multilingual benchmarks.
As shown below, we upcycle the dense model to an MoE structure.
The parameters of the original model are frozen, preserving previously learned knowledge, while new experts and routers are trainable.
This design allows the model to reuse old knowledge or store new knowledge adaptively.
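For intuition, here is a minimal sketch of the upcycling idea, assuming the frozen copy of the original FFN sits at expert index 0. This is not the repository's actual implementation; the function and argument names are illustrative only.

```python
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, hidden_size: int, num_experts: int = 4):
    """Illustrative upcycling: expert 0 is a frozen copy of the dense FFN,
    the remaining experts are trainable copies, and a new router is added."""
    experts = nn.ModuleList([copy.deepcopy(dense_ffn) for _ in range(num_experts)])

    # Freeze the upcycled original expert so old-language knowledge is preserved.
    for p in experts[0].parameters():
        p.requires_grad = False

    # The new experts and the router remain trainable.
    router = nn.Linear(hidden_size, num_experts, bias=False)
    return experts, router
```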
After post-pretraining, the router might misassign experts for previously supported languages.
To address this, we train only the router using language priors that encourage:
- Original-language tokens → routed to frozen experts
- Expanded-language tokens → routing unchanged
We use a cross-entropy loss guided by token languages.
See our paper for detailed mathematical formulations.
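As a rough illustration of this idea (a sketch only, assuming the frozen original expert sits at index 0 and that a boolean mask marks original-language tokens; see the paper and the training code for the exact formulation):

```python
import torch
import torch.nn.functional as F

def lpr_loss(router_logits: torch.Tensor, old_lang_mask: torch.Tensor, frozen_expert: int = 0):
    """Cross-entropy that pushes original-language tokens toward the frozen expert.

    router_logits: (num_tokens, num_experts) raw router scores
    old_lang_mask: (num_tokens,) bool, True for tokens from the original languages
    """
    if old_lang_mask.sum() == 0:
        return router_logits.new_zeros(())
    logits = router_logits[old_lang_mask]  # only original-language tokens contribute
    target = torch.full((logits.size(0),), frozen_expert,
                        dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)
```

Expanded-language tokens are excluded from this loss, so their routing is left unchanged.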
MoE-LPR shows consistent improvements across benchmarks like ARC-Challenge, HellaSwag, and Belebele,
demonstrating superior multilingual learning and robustness against catastrophic forgetting.
Additionally, experiments on LLaMA3-8B confirm the method’s scalability.
Below are the instructions for training MoE-LPR using this repository.
Ensure these commands exist on your NPU server:
```bash
which lspci modprobe udevadm modinfo
```

Expected output:

```
/usr/bin/lspci
/usr/sbin/modprobe
/usr/bin/udevadm
/usr/sbin/modinfo
```

If missing, install them (with sudo if not root):

```bash
sudo apt update && sudo apt install -y pciutils kmod udev libaio-dev
```

- Ensure the CANN toolkit and CANN kernels are installed.
- If not, follow the Ascend installation guide according to your server type (inference or training).
Python 3.10 (recommended)

```bash
pip install -r requirements.txt
```

For DeepSpeed installation, see the "Ascend Open Source - DeepSpeed Installation Guide" (昇腾开源 - Deepspeed 安装指南).
Convert a pretrained model to the `llama_moelpr` type via MoE upcycling:

```bash
# Under MoE-LPR/
python3 upcycling.py \
    --model_path path/to/your/model \
    --output_path path/to/output/dir \
    --num_experts 4
```

We provide example data under `MoE-LPR/LLaMA-Factory/data/moelpr_examples`:

- `ja.jsonl` → new-language corpus
- `zh.jsonl` → original-language corpus
Stage 1:

```bash
bash scripts/stage1.sh
```

Stage 2:

```bash
bash scripts/stage2.sh
```

Place monolingual documents here:

```
MoE-LPR/LLaMA-Factory/data
```
Each file should be a `.jsonl` or `.json` file formatted as:

```json
{"text": "your one doc"}
```
Then register it in `MoE-LPR/LLaMA-Factory/data/dataset_info.json`.
Example:

```json
"hu_1b": {
  "file_name": "hu_part1b_00000.jsonl",
  "file_sha1": "e70375e28eda542a90c68213640cc371898ce184",
  "language": "new",
  "columns": {
    "prompt": "text"
  }
}
```

`language` can be "old" or "new"; it is used during Stage 2 training.
Control this with the hyperparameter `generate_lang_mask`.
Execute Stage 1:

```bash
bash scripts/stage1.sh
```

Key hyperparameters:

- `--moe_num_experts`: total number of experts
- `--topk`: number of experts selected per token
- `--aux_loss_coef`: weight of the load-balancing loss
Logs include per-expert scores and load-balancing metrics. Full details are in the paper.
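For context, the widely used load-balancing auxiliary loss (as in Switch Transformer-style MoE models) looks roughly like the sketch below; the exact formulation used in this repository may differ, so treat it only as an illustration of what `--aux_loss_coef` weights.

```python
import torch

def load_balancing_loss(router_probs: torch.Tensor, expert_mask: torch.Tensor) -> torch.Tensor:
    """Standard auxiliary loss that pushes token-to-expert assignments toward uniform.

    router_probs: (num_tokens, num_experts) softmax router probabilities
    expert_mask:  (num_tokens, num_experts) 0/1 top-k assignment per token
    """
    num_experts = router_probs.size(-1)
    tokens_per_expert = expert_mask.float().mean(dim=0)  # fraction of tokens routed to each expert
    prob_per_expert = router_probs.mean(dim=0)           # mean routing probability per expert
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```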
Execute Stage 2:

```bash
bash scripts/stage2.sh
```

Key hyperparameters:

- `--lpr_loss_coef`: weight of the LPR loss
- `--max_samples`: number of documents per language
Logs show LPR loss and average expert selections.
We use PEFT to manage parameters:

```python
from transformers import AutoModelForCausalLM

peftpath = ""  # path to your trained MoE-LPR checkpoint
model = AutoModelForCausalLM.from_pretrained(peftpath)
```

Evaluation via lm-evaluation-harness:
```bash
lm_eval --model hf \
    --model_args pretrained=$BASE_MODEL_PATH,peft=$PEFT_MODEL_PATH,dtype="float16" \
    --tasks hellaswag_tr \
    --device cuda:0 \
    --num_fewshot 10 \
    --output_path $OUTPUT_PATH \
    --batch_size $BATCH_SIZE
```

If you have questions about the paper or code, contact:

- Hao Zhou (zhouh@smail.nju.edu.cn)
- Zhijun Wang (wangzj@smail.nju.edu.cn)
Or open an issue on GitHub with detailed information.
For questions about NPU, contact: Shen Yunzhi (shenyunzhi@smail.nju.edu.cn)
Please cite MoE-LPR if you use it in your work:
@article{zhou2024MoE-LPR,
author = {Zhou, Hao and Wang, Zhijun and Huang, Shujian and Huang, Xin and Han, Xue and Feng, Junlan and Deng, Chao and Luo, Weihua and Chen, Jiajun},
journal = {arXiv preprint arXiv:2408.11396},
title = {MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing},
url = {https://arxiv.org/abs/2408.11396},
year = {2024}
}

