MoE-LPR-NPU

This repository contains the NPU-supported implementation of:

MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Overview

We propose MoE-LPR (Mixture-of-Experts with Language Priors Routing),
a two-stage training framework that enhances an LLM’s multilingual capability.

  1. Stage 1 – Post-pretraining with MoE:
    Upcycle a dense model into a Mixture-of-Experts (MoE) architecture, freezing the original parameters while adding new experts.
    This stage focuses on improving new-language capabilities without using original-language data.

  2. Stage 2 – Review with LPR:
    Train only the router using language priors, guiding the model to retain knowledge of the original languages while maintaining new-language performance.

Evaluations show that MoE-LPR achieves superior performance and strong resistance to catastrophic forgetting across multiple multilingual benchmarks.

Stage 1: Post-pretraining with MoE

We upcycle the dense model into an MoE structure.
The parameters of the original model are frozen, preserving previously learned knowledge, while new experts and routers are trainable.
This design allows the model to reuse old knowledge or store new knowledge adaptively.
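
For intuition only, here is a minimal PyTorch-style sketch of this idea at the level of a single FFN layer. It is not the repository's implementation (the actual conversion is done by upcycling.py below), and names such as UpcycledMoELayer are hypothetical: expert 0 is the frozen original FFN, the new experts are trainable copies of it, and the router is a small trainable linear layer.

import copy
import torch
import torch.nn as nn

class UpcycledMoELayer(nn.Module):
    # Hypothetical sketch: expert 0 is the frozen original FFN, experts 1..N-1
    # are trainable copies of it, and a linear router picks top-k experts per token.
    def __init__(self, original_ffn: nn.Module, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [original_ffn] + [copy.deepcopy(original_ffn) for _ in range(num_experts - 1)]
        )
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k
        for p in self.experts[0].parameters():    # freeze the original expert
            p.requires_grad = False

    def forward(self, x):                          # x: (batch, seq, hidden)
        probs = self.router(x).softmax(dim=-1)     # (batch, seq, num_experts)
        weights, idx = probs.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1).to(x.dtype)
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out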

Stage 2: Review with LPR

After post-pretraining, the router might misassign experts for previously supported languages.
To address this, we train only the router using language priors that encourage:

  • Original-language tokens → routed to frozen experts
  • Expanded-language tokens → routing unchanged

We use a cross-entropy loss guided by token languages.
See our paper for detailed mathematical formulations.
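
As a rough sketch only (the exact objective is given in the paper), the language-priors term can be thought of as a cross-entropy on the router logits that pushes original-language tokens toward the frozen original expert (assumed to be index 0 here) and ignores new-language tokens:

import torch
import torch.nn.functional as F

def lpr_loss(router_logits: torch.Tensor, old_lang_mask: torch.Tensor) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts) raw router scores
    # old_lang_mask: (num_tokens,) bool, True for tokens of originally supported languages
    if not old_lang_mask.any():
        return router_logits.new_zeros(())            # no old-language tokens in the batch
    old_logits = router_logits[old_lang_mask]         # keep only old-language tokens
    target = old_logits.new_zeros(old_logits.size(0), dtype=torch.long)  # expert 0 = frozen original
    return F.cross_entropy(old_logits, target)

During Stage 2 only the router is trainable, and a term like this would be added to the language-modeling loss with the weight given by --lpr_loss_coef (listed under Stage 2 below).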

Main Results

MoE-LPR shows consistent improvements across benchmarks like ARC-Challenge, HellaSwag, and Belebele,
demonstrating superior multilingual learning and robustness against catastrophic forgetting.

Additionally, experiments on LLaMA3-8B confirm that the method scales to larger models; see the paper for the full results.

Train MoE-LPR

Below are the instructions for training MoE-LPR using this repository.

Tool Installation

Ensure these commands exist on your NPU server:

which lspci modprobe udevadm modinfo

Expected output:

/usr/bin/lspci
/usr/sbin/modprobe
/usr/bin/udevadm
/usr/sbin/modinfo

If missing, install them (with sudo if not root):

sudo apt update && sudo apt install -y pciutils kmod udev libaio-dev

Ascend Env Installation

  • Ensure CANN toolkit and CANN kernels are installed.
  • If not, follow the Ascend installation guide for your server type (inference or training); a typical environment check is shown below.
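
As a reference (paths can vary by installation), the CANN environment is typically activated and the driver checked with:

# Typical CANN location; adjust if installed elsewhere
source /usr/local/Ascend/ascend-toolkit/set_env.sh
npu-smi info    # lists the NPU devices if the driver is installed correctly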

Python Env Installation

Python 3.10 (recommended)

pip install -r requirements.txt

For DeepSpeed installation, see the Ascend open-source DeepSpeed installation guide (昇腾开源 - Deepspeed 安装指南).
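
To confirm that PyTorch can see the NPUs (assuming torch and torch_npu are installed via requirements.txt), a quick check:

import torch
import torch_npu  # Ascend plugin that registers the "npu" device with PyTorch

print(torch.npu.is_available())   # True once the driver and CANN environment are set up
print(torch.npu.device_count())   # number of visible NPU devices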

Upcycle Your Model

Convert a pretrained model to llama_moelpr type via MoE upcycling:

# Under MoE-LPR/
python3 upcycling.py \
  --model_path path/to/your/model \
  --output_path path/to/output/dir \
  --num_experts 4

Train

Toy Example

We provide example data under: MoE-LPR/LLaMA-Factory/data/moelpr_examples

  • ja.jsonl → new language corpus
  • zh.jsonl → original language corpus

Stage 1:

bash scripts/stage1.sh

Stage 2:

bash scripts/stage2.sh

Data

Place monolingual documents here:

MoE-LPR/LLaMA-Factory/data

Each file should be .jsonl or .json, containing one document per JSON object:

{"text": "one document"}

Then register it in:

MoE-LPR/LLaMA-Factory/data/dataset_info.json

Example:

"hu_1b": {
  "file_name": "hu_part1b_00000.jsonl",
  "file_sha1": "e70375e28eda542a90c68213640cc371898ce184",
  "language": "new",
  "columns": {
    "prompt": "text"
  }
}

The "language" field can be "old" or "new" and is used during Stage 2 training; whether the language mask is generated from it is controlled by the hyperparameter generate_lang_mask.
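
For illustration, a small helper (the document list is hypothetical; the file name follows the example above) that writes documents into this format and prints the SHA-1 for the file_sha1 field:

import hashlib
import json

docs = ["first document ...", "second document ..."]    # your monolingual corpus
out_path = "LLaMA-Factory/data/hu_part1b_00000.jsonl"   # file registered in dataset_info.json

with open(out_path, "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")

with open(out_path, "rb") as f:                          # SHA-1 for the "file_sha1" field
    print(hashlib.sha1(f.read()).hexdigest())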

Post-pretraining

Execute Stage 1:

bash scripts/stage1.sh

Key hyperparameters:

  • --moe_num_experts: total number of experts
  • --topk: number of experts selected per token
  • --aux_loss_coef: weight of the load-balancing loss (a generic formulation is sketched below)

Logs include per-expert scores and load-balancing metrics. Full details are in the paper.
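
The exact load-balancing term is defined in the paper; as a point of reference, the widely used Switch-Transformer-style auxiliary loss that --aux_loss_coef would typically weight looks like this (the repository's formulation may differ):

import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs: torch.Tensor, top1_expert: torch.Tensor, num_experts: int) -> torch.Tensor:
    # router_probs: (num_tokens, num_experts) softmaxed router probabilities
    # top1_expert:  (num_tokens,) index of the expert each token was dispatched to
    dispatch_frac = F.one_hot(top1_expert, num_experts).float().mean(dim=0)  # fraction of tokens per expert
    prob_frac = router_probs.mean(dim=0)                                     # mean routing probability per expert
    return num_experts * torch.sum(dispatch_frac * prob_frac)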

Language Priors Router Training

Execute Stage 2:

bash scripts/stage2.sh

Key hyperparameters:

  • --lpr_loss_coef: weight of LPR loss
  • --max_samples: number of documents per language

Logs show LPR loss and average expert selections.

Evaluate

We use PEFT to manage the trained parameters. Load a checkpoint by applying the PEFT weights on top of the upcycled base model (paths below are placeholders):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base_path = "path/to/upcycled/model"   # output of upcycling.py
peft_path = "path/to/peft/checkpoint"  # Stage 1 / Stage 2 training output
model = AutoModelForCausalLM.from_pretrained(base_path)
model = PeftModel.from_pretrained(model, peft_path)

Evaluation via lm-evaluation-harness:

lm_eval --model hf \
        --model_args pretrained=$BASE_MODEL_PATH,peft=$PEFT_MODEL_PATH,dtype="float16" \
        --tasks hellaswag_tr \
        --device cuda:0 \
        --num_fewshot 10 \
        --output_path $OUTPUT_PATH \
        --batch_size $BATCH_SIZE

Bugs or Questions?

If you have questions about the paper or code, contact Hao Zhou (zhouh@smail.nju.edu.cn) or Zhijun Wang (wangzj@smail.nju.edu.cn).

Or open an issue on GitHub with detailed information.

For NPU-related questions, contact Shen Yunzhi (shenyunzhi@smail.nju.edu.cn).

Citation

Please cite MoE-LPR if you use it in your work:

@article{zhou2024MoE-LPR,
  author = {Zhou, Hao and Wang, Zhijun and Huang, Shujian and Huang, Xin and Han, Xue and Feng, Junlan and Deng, Chao and Luo, Weihua and Chen, Jiajun},
  journal = {arXiv preprint arXiv:2408.11396},
  title = {MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing},
  url = {https://arxiv.org/abs/2408.11396},
  year = {2024}
}
