Edge MoE: A Survey of Optimization Strategies for Mixture-of-Experts LLMs on the Edge

This repository provides a comprehensive collection of research papers, open-source projects, and optimization strategies for deploying Mixture-of-Experts (MoE) Large Language Models on Edge Devices. It includes contents from our survey paper 📖"Edge MoE: A Survey of Optimization Strategies for Mixture-of-Experts LLMs on the Edge" and will be continuously updated.

🤗 You are very welcome to contribute to this repository by launching an issue or a pull request.

📫 Contact us via emails: zhaoyong@uestc.edu.cn

📜 Catalog

Awesome Edge MoE

🔥 News

📖 Overview

📚 Related Survey

🪴 Taxonomy

System Optimization

Architecture Optimization

Parameter Optimization

🚀 Application Scenarios

📃 Citation

🔥 News

[2026-03] 🔥🔥 Our survey on Edge MoE optimization strategies is released!

📚 Related Survey Papers

(arXiv'25) A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications [paper]
(arXiv'24) A Survey on Mixture of Experts in Large Language Models[paper]
(ACM CSUR'24) A Review on Edge Large Language Models: Design, Execution, and Applications[paper]
(TMLR'24) Efficient Large Language Models: A Survey[paper]
(INLG'2025) Taming the Titans: A Survey of Efficient LLM Inference Serving[paper]
(Proc. IEEE'24) Edge intelligence: Paving the last mile of artificial intelligence with edge computing[paper]
(COMST'17) Mobile edge computing: A survey on architecture and computation offloading[paper]
(COMST'17) A survey on mobile edge computing: The communication perspective[paper]
(Computer'17) The emergence of edge computing[paper]
(IoT-J'16) Edge computing: Vision and challenges[paper]
(COMST'29) Convergence of edge computing and deep learning: A comprehensive survey[paper]
(arXiv'20) Communication-Efficient edge ai: Algorithms and systems[paper]
(COMST'20) Federated learning in mobile edge networks: A comprehensive survey[paper]
(COMST'21) Federated learning for internet of things: A comprehensive survey[paper]
(COMST'21) Federated learning for internet of things: Recent advances, taxonomy, and open challenges[paper]
(COMST'25) Mobile edge intelligence for large language models: A contemporary survey[paper]
(TKDE'25) A survey on mixture of experts in large language models[paper]

🪴 Taxonomy

System Optimization

Memory Management

Hierarchical Storage

(TMC'25) EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices [paper]
(arXiv'24) Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing [paper]
(RAICS'24) Accelerating native inference model performance in edge devices using tensorrt [paper]
(SOSP'25) Ktransformers: Unleashing the full potential of cpu/gpu hybrid inference for moe models[paper]

Expert Caching & Prefetching

(TMC'25) EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices [paper]
(arXiv'24) Expertflow: Optimized expert activation and token allocation for efficient mixture-of-experts inference[paper]
(arXiv'25) Mixture of cache-conditional experts for efficient mobile device inference[paper]
(ACL'23) Adaptive gating in mixture-of-experts based language models[paper]

Expert Swapping & Offloading

(Euro-Par'25) Cache management for mixture-of-experts llms,” in European Conference on Parallel Processing[paper]
(TC'25'25) Serving moe models on resource-constrained edge devices via dynamic expert swapping[paper]
(CAL'25) Ssd offloading for llm mixture-of-experts weights considered harmful in energy efficiency[paper]

Pipeline Scheduling

Communication-Computation Co-scheduling

(INFOCOM'23) Pipemoe: Accelerating mixture-of-experts through adaptive pipelining[paper]
(INFOCOM'24) Parm: Efficient training of large sparsely-activated models with dedicated schedules[paper]
(ICDCS'25) Mast: Efficient training of mixture-of-experts transformers with task pipelining and ordering[paper]
(NeurIPS'25) Flowmoe: A scalable pipeline scheduling framework for distributed mixture-of-experts training[paper]
(ICDCS'25) Mitigating contention in stream multiprocessors for pipelined mixture of experts: An sm-aware scheduling approach[paper]
(PPOPP'25) Harnessing inter-GPU shared memory for seamless moe communication-computation fusion[paper]

Expert Resource Elastic Management

(PACMMOD'23) Flexmoe: Scaling large-scale sparse pre-trained model training via dynamic device placement[paper]
(ASPLO'25) Klotski: Efficient mixture-of-expert inference via expert-aware multi-batch pipeline[paper]
(PPOPP'25) Harnessing inter-GPU shared memory for seamless moe communication-computation fusion[paper]
(TPDS'24) Mpmoe: Memory efficient moe for pre-trained models with adaptive pipeline parallelism[paper]

Hardware Adaptation & Co-design

Heterogeneous Compute Resource Coordination

(ICLR'25) Fiddler: Cpu-GPU orchestration for fast inference of mixture-of-experts models[paper]
(aXiv'25) eiq neutron: Redefining edge-ai inferencewith integrated npu and compiler innovations[paper]
(DAC'25) Pimoe:Towards efficient moe transformer deployment on npu-pim system through throttle-aware task offloading[paper]
(ASPLOS'24) Ianus: Integrated accelerator based on npu-pim unified memory system[paper]

Domain-Specific Hardware Architecture Design

(ISCA'21) Elsa: hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks[paper]
(HPCA'22) Transpim: A memory-based acceleration via software-hardware co-design for transformer[paper]
(TECS'25) Slim:A heterogeneous accelerator for edge inference of sparse large language model via adaptive thresholding[paper]
(TCAD'25) Atleus: Accelerating transformers on the edge enabled by 3d heterogeneous manycore architectures[paper]
(ICLR'21) Gshard: Scaling giant models with conditional computation and automatic sharding[paper]

Distributed & Collaborative Deployment

Topology-Aware Communication and Routing

(IJCAI'24) Locmoe: a low-overhead moe for large language model training[paper]
(aXiv'25) Grace-moe: Grouping and replication with locality-aware routing for efficient distributed moe inference[paper]
(IPDPS'24) Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference[paper]
(EuroSys'24) “Schemoe: An extensible mixture-of-experts distributed training system with tasks scheduling[paper]
(USENIX ATC'23) Accelerating distributed {MoE} training and inference with lina[paper]

Edge-Cloud Collaborative Inference

(aXiv'25) Ec2moe: Adaptive end-cloud pipeline collaboration enabling scalable mixture-of-experts inference[paper]
(INFOCOM'25) Multi-tier multi-node scheduling of llm for collaborative ai computing[paper]
(IPDPS'24) Exploiting inter-layer expert affinity for accelerating mixture-of-experts model inference[paper]

Cost-Driven Elastic Deployment

(TSC'24) Moesys: A distributed and efficient mixture-of-experts training and inference system for internet services[paper]
(USENIX ATC'23) Accelerating distributed {MoE} training and inference with lina[paper]
(INFOCOM'25) “Optimizing distributed deployment of mixture-of-experts model inference in serverless computing[paper]

Architecture Optimization

Efficient Attention

Sparse Attention

(HPCA'21) Spatten: Efficient sparse attention architecture with cascade token and head pruning[paper]
(ICML'24) Quest:query-aware sparsity for efficient long-context llm inference[paper]
(NeurIPS'24) Infllm: Training-free long-context extrapolation for llms with an efficient context memory[paper]

Linear / Kernelized Attention

(ICLR'25) LoLCATs: On low-rank linearizing of large language models[paper]
(MM'25) Elfatt: Efficient linear fast attention for vision transformers[paper]
(PMLR'24) Mobile attention: Mobile-friendly linear-attention for vision transformers[paper]

Mixture of Attention Experts

(EMNLP'22) Mixture of attention heads: Selecting attention heads per token[paper]
(NeurIPS'24) Switchhead:Accelerating transformers with mixture-of-experts attention[paper]

Tailored Routers

Token-Aware Sparsity

(ACL'24) Harder task needs more experts: Dynamic routing in MoE models[paper]
(ACL'23) Adaptive gating in mixture-of-experts based language models[paper]
(ICCAD'25) AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference[paper]

Cache-Aware Routing

(TMLR'25) Mixture of cache-conditional experts for efficient mobile device inference[paper]
(arXiv'24) Expertflow: Optimized expert activation and token allocation for efficient mixture-of-experts inference[paper]
(arXiv'24) Promoe: Fast moe-based llm serving using proactive caching[paper]

Load Balance Optimization

(NeurIPS'24) Toward efficient inference for mixture of experts[paper]
(PPOPP'25) Harnessing inter-GPU shared memory for seamless moe communication-computation fusion[paper]
(ICLR'25) Remoe: Fully differentiable mixture-of-experts with reLU routing[paper]

Diverse Experts

Internal Expert Architecture

(SenSys'24) Litemoe: Customizing on-device llm serving via proxy submodel tuning[paper]
(ACL'24) Swapmoe: Serving off-the-shelf moe-based large language models with tunable memory budget[paper]
(EMNLP'25) HMoE: Heterogeneous mixture of experts for language modeling[paper]

Expert Heterogeneity

(TMC'25) EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices [paper]
(COMMAG'25) The moe-empowered edge llms deployment: Architecture, challenges, and opportunities[paper]
(MobiCom'25) D2MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving[paper]
(aXiv'24) Hobbit: A mixed precision expert offloading system for fast moe inference[paper]

Inter-Expert Parameter Sharing

(ACL'24) DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models[paper]
(ICML'25) Delta decompression for moe-based LLMs compression[paper]
(ICML'25) Moe-SVD: Structured mixture-of-experts LLMs compression via singular value decomposition[paper]
(EMNLP'25) Genpoe: Generative passage-level mixture of experts for knowledge enhancement of llms[paper]

Parameter Optimization

Multi-Granularity Quantization

Expert-Level Granularity

(ICCV'25) Mopeq: Mixture of mixed precision quantized experts[paper]
(ICLR'26) Towards global expert-level mixed-precision quantization for mixture-of-experts LLMs[paper]
(TC'25) Serving moe models on resource-constrained edge devices via dynamic expert swapping[paper]
(ICCAD'25) AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference[paper]

Intra-Layer Component Granularity

(ICML'25) Mxmoe: Mixed-precision quantization for moe with accuracy and performance co-design[paper]
(TC'24) Edgempq: Layer-wise mixed-precision quantization with tightly integrated versatile inference units for edge computing[paper]
(DAC'25) Pimoe:Towards efficient moe transformer deployment on npu-pim system through throttle-aware task offloading[paper]
(TCAD'24) Block-wise mixed-precision quantization: Enabling high efficiency for practical reram-based dnn accelerators[paper]

Channel-Level Granularity

(ACL'25) Automated fine-grained mixture-of-experts quantization[paper]
(TCAD'25) Oiso:Outlier-isolated data format for low-bit large language model quantization[paper]

Low-Rank Approximation

Structured Low-Rank Approximation

(TCAD'25) Ultra memory-efficient on-fpga training of transformers via tensor-compressed optimization[paper]
(NeurIPS'23) Qlora:efficient finetuning of quantized llms[paper]
(Access'25) Toward generating quality test questions and answers using quantized low-rank adapters in llms[paper]

Post-Training Low-Rank Approximation

(aXiv'23) Qmoe: Practical sub-1-bit compression of trillion-parameter models[paper]
(ICLR'25) Mixture compressor for mixture-of-experts LLMs gains more[paper]

Adaptive Rank Selection

(ACL'25) MoRE: A mixture of low-rank experts for adaptive multi-task learning[paper]
(ACL'24) LoRAMoE: Alleviating world knowledge forgetting in large language models via MoE-style plugin[paper]
(TPDS'25) Cannikin:No lagger of slo in concurrent multiple lora llm serving[paper]

Pruning

Structured Pruning

(JMLR'22) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity[paper]
(TC'25) Serving moe models on resource-constrained edge devices via dynamic expert swapping[paper]
(ISCA'24) Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference[paper]
(NeurIPS'22) Confident adaptive language modeling[paper]
(JMLR'24) Memory3 : Language modeling with explicit memory[paper]

Unstructured Pruning

(TCAD'23) Transcode: Co-design of transformers and accelerators for efficient training and inference[paper]
(TCAD'24) Mobile transformer accelerator exploiting various line sparsity and tile-based dynamic quantization[paper]
(TPDS'25) Efficientmoe: Optimizing mixture-of-experts model training with adaptive load balance[paper]
(aXiv'22) Megablocks:Efficient sparse training with mixture-of-experts[paper]

Hybrid Pruning

(ICLR'25) Mixture compressor for mixture-of-experts LLMs gains more[paper]
(NeurIPS'22) Mixture-of-experts with expert choice routing[paper]
(IPDPS'23) Mpipemoe: Memory efficient moe for pre-trained models with adaptive pipeline parallelism[paper]

Knowledge Distillation

MoE-to-Dense Distillation

(JMLR'22) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity[paper]
(aXiv'22) One student knows all experts know: From sparse to dense[paper]

MoE-to-Small-MoE Distillation

(COLM'25) Slimmoe: Structured compression of large moe models via expert slimming and distillation[paper]
(EMNLP'23) Scaling vision-language models with sparse mixture of experts[paper]
(ICLR'25) LLaVA-mod: Making LLaVA tiny via moe-knowledge distillation[paper]
(AAAI'24) “Mode: a mixture-of-experts model with mutual distillation among the experts[paper]
(NeurIPS'24) Exploiting activation sparsity with dense to dynamic-k mixture-of-experts conversion[paper]
(PMLR'20) Deep mixture of experts via shallow embedding[paper]

Task-Specific Distillation

(aXiv'26) Distilling lightweight domain experts from large ml models by identifying relevant subspaces[paper]
(EMNLP'21) Muppet: Massive multi-task representations with pre-finetuning[paper]
(TMC'25) EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices [paper]

🚀 Application Scenarios

On-Device Intelligent Assistants:

Supporting real-time conversations, content generation, and context-aware interactions locally on smartphones and wearables

(TMC'25) EdgeMoE: Empowering Sparse Large Language Models on Mobile Devices [paper]
(SAGE'23) Towards large language models at the edge on mobile, augmented reality, and virtual reality devices with unity [paper]
(aXiv'25) A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications [paper]
(ICML'24) Mobilellm: Optimizing sub-billion parameter language models for on-device use cases [paper]
(SOSP'24) Powerinfer: Fast large language model serving with a consumer-grade gpu [paper]

Autonomous Driving:

Enabling real-time road condition analysis, V2V collision warning, and trajectory prediction while maintaining a strict 20-100 ms latency constraint without sending sensitive raw sensor data to the cloud.

(COMST'25) Mobile edge intelligence for large language models: A contemporary survey[paper]
(aXiv'17) Outrageously large neural networks: The sparsely-gated mixture-of-experts layer [paper]
(JMLR'22) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity[paper]
(Computer'17) The emergence of edge computing[paper]
(IoT-J'16) Edge computing: Vision and challenges[paper]
(AAAI'25) Language prompt for autonomous driving [paper]
(aXiv'23) ** Can you text what is happening? Integrating pre-trained language encoders into trajectory prediction models for autonomous driving** [paper]
(aXiv'23) Gpt-driver: Learning to drive with gpt [paper]
(WACV'24) Drive as you speak: Enabling human-like interaction with large language models in autonomous vehicles [paper]
(CVPR'23) Planning-oriented autonomous driving [paper]
(TMC'24) Pacp: Priority-aware collaborative perception for connected and autonomous vehicles [paper]
(TIV'16) A Survey of Motion Planning and Control Techniques for Self-driving Urban Vehicles [paper]
(Access'20) V2x support in 3gpp specifications: From 4g to 5g and beyond [paper]
(MONET'21) Vehicular Edge Computing and Networking: A Survey [paper]
(ASPLOS'17) Neurosurgeon: Collaborative intelligence between the cloud and mobile edge [paper]
(Netw.'20) Security and privacy challenges in 5g-enabled vehicular networks [paper]
(ECVA'23) Drivelm: Driving with graph visual question answering [paper]
(aXiv'23) Universal and transferable adversarial attacks on aligned language models [paper]

Intelligent Healthcare:

Assisting clinicians in local diagnostic reasoning and patient monitoring, mitigating the privacy risks of centralized cloud inference for highly sensitive patient medical records.

(Proc. IEEE'24) Edge intelligence: Paving the last mile of artificial intelligence with edge computing[paper]
(COMST'25) Mobile edge intelligence for large language models: A contemporary survey[paper]
(ICLR'17) Outrageously large neural networks: The sparsely-gated mixture-of-experts layer [paper]
(TMI'23) Lvit: language meets vision transformer in medical image segmentation [paper]
(Nature'23) Large language models encode clinical knowledge [paper]
(ETT'22) Edge computing in smart health care systems: Review, challenges, and research directions [paper]
(S&P'24) Gpu. zip: On the side-channel implications of hardware-based graphical data compression [paper]

Embodied Robotics:

Operating in smart factories and humanoid platforms with strict 10-100 ms end-to-end latency requirements, reducing massive communication bandwidth overheads while interacting with physical environments.

(COMST'25) Mobile edge intelligence for large language models: A contemporary survey[paper]
(ICML'22) GLaM: Efficient scaling of language models with mixture�of-experts [paper]
(aXiv'24) OpenVLA: An Open-Source Vision-Language-Action Model [paper]
(CUI'23) Harnessing Large Language Models for Cognitive Assistants in Factories [paper]
(TS 23.401) 3rd generation partnership project; technical specification group services and system aspects; general packet radio service (gprs) enhancements for evolved universal terrestrial radio access network (e-utran) access [paper]
(IPC'09) The case for vm-based cloudlets in mobile computing [paper]
(TWC'19) Edge ai: On-demand accelerating deep neural network inference via edge computing [paper]
(TII'19) Sparse feature learning for correlation filter tracking toward 5g-enabled tactile internet [paper]
(PIEEE'19) Edge computing for autonomous driving: Opportunities and challenges [paper]
(PMLR'17) Communication-efficient learning of deep networks from decentralized data [paper]
(PMLR'23) Rt-2: Vision-language-action models transfer web knowledge to robotic control [paper]
(aXiv'23) VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [paper]

📃 Citation

@article{mo2025edge,
  title={Edge MoE: A Survey of Optimization Strategies for Mixture-of-Experts LLMs on the Edge},
  author={Mo, Zhenjia and Zhao, Yong and He, Qiang and Zhang, Mingjin and Chen, Yicong and Li, Ruitao and Qiu, Zihao and Wen, Hao and Chen, Shengyuan and Zhang, Qinggang and Ren, Wei and Cao, Jiannong},
  journal={arXiv preprint},
  year={2025}
}

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation