Design Goals
Technical Architecture
In a typical MoE model, the dispatch and combine operators sit in the MoE layer, and the operators are scheduled in the following order:
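[token] → [dispatch] → [gmm] → [swiglu] → [gmm] → [combine]
Here, dispatch and combine are communication steps, while [gmm] → [swiglu] → [gmm] is the expert computation. In the current single-stream flow, all of these operators run one after another on a single stream.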
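The single-stream order can be made concrete with a minimal pure-Python sketch of one MoE layer. It rests on several assumptions. Routing is top-1 on a single device. `dispatch` and `combine` are reduced to local regrouping instead of the cross-device communication the real fused operators perform. `combine` also skips the router-probability weighting that a real top-k combine applies. All function names (`dispatch`, `gmm`, `swiglu`, `combine`, `moe_layer`) are hypothetical stand-ins and are not the Ascend operator APIs.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # shape [in_dim][out_dim]
Groups = Dict[int, List[Vector]]  # expert id -> that expert's token block


def matvec(x: Vector, w: Matrix) -> Vector:
    """Row vector x (length in_dim) times matrix w (in_dim x out_dim)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu(h: Vector) -> Vector:
    """Split h into (gate, up) halves and return SiLU(gate) * up elementwise."""
    half = len(h) // 2
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(h[:half], h[half:])]


def dispatch(tokens: List[Vector], expert_ids: List[int],
             num_experts: int) -> Tuple[Dict[int, List[int]], Groups]:
    """Hypothetical [dispatch] stand-in: group tokens by routed expert (local only)."""
    index: Dict[int, List[int]] = {e: [] for e in range(num_experts)}
    for t, e in enumerate(expert_ids):
        index[e].append(t)
    groups = {e: [tokens[t] for t in ts] for e, ts in index.items()}
    return index, groups


def gmm(groups: Groups, weights: Dict[int, Matrix]) -> Groups:
    """Hypothetical [gmm] stand-in: each expert's token block times its own weight."""
    return {e: [matvec(x, weights[e]) for x in xs] for e, xs in groups.items()}


def combine(outputs: Groups, index: Dict[int, List[int]],
            num_tokens: int) -> List[Vector]:
    """Hypothetical [combine] stand-in: put expert outputs back in token order."""
    result: List[Vector] = [[] for _ in range(num_tokens)]
    for e, ts in index.items():
        for t, y in zip(ts, outputs[e]):
            result[t] = y
    return result


def moe_layer(tokens: List[Vector], expert_ids: List[int],
              w_up: Dict[int, Matrix], w_down: Dict[int, Matrix]) -> List[Vector]:
    """Single-stream order: dispatch -> gmm -> swiglu -> gmm -> combine."""
    index, groups = dispatch(tokens, expert_ids, len(w_up))  # communication
    hidden = gmm(groups, w_up)                                 # expert computation
    hidden = {e: [swiglu(h) for h in hs] for e, hs in hidden.items()}
    outputs = gmm(hidden, w_down)
    return combine(outputs, index, len(tokens))                # communication


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
expert_ids = [1, 0, 1]  # illustrative top-1 router decision per token
w_up = {e: [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]] for e in range(2)}
w_down = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[0.0, 1.0], [1.0, 0.0]]}
print(moe_layer(tokens, expert_ids, w_up, w_down))
```

Each stage consumes the previous stage's output, so on one stream every operator waits for the one before it. The dual-stream scheme breaks this chain by splitting the experts into independent groups.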
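Ascend chips can use their tensor/vector compute units, communication resources, and memory-access resources in parallel to maximize compute and bandwidth utilization. Tasks are submitted to the hardware engines serially within each stream. When tasks have no dependencies and do not contend for the same resources, the multi-stream mechanism lets them execute in parallel, which improves resource utilization. Here we split the experts into two groups, Group A and Group B, and overlap one group's communication with the other group's computation to raise both communication-bandwidth and compute utilization. The dual-stream parallel flow is as follows: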
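```mermaid
graph LR
    subgraph mainStream [Main stream]
        B1 --> C1[Expert computation]
        C1 --> E1[Combine communication]
    end
    subgraph auxStream [Auxiliary stream]
        B2 --> C2[Expert computation]
        C2 --> E2[Combine communication]
    end
    A[Input] --> A1{Balanced expert split}
    A1 --> |Group A| B1[Dispatch communication]
    A1 --> |Group B| B2[Dispatch communication]
    E1 --> F[Result aggregation]
    E2 --> F
    F --> J[Output]
```
Current status: On hold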
DONE: Dual-stream parallelism based on the original dispatch and combine operators has been implemented.
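TODO: dispatch and combine use the AI Core in their communication flow. These two fused operators are therefore not pure communication operators and cannot truly run in parallel with the expert computation flow [gmm] → [swiglu] → [gmm] defined above. To achieve communication/computation dual-stream parallelism, each fused dispatch and combine operator must be split into a communication part and a compute part. Once the compute part finishes, the operator releases the compute resources so that the MoE computation flow can use them concurrently. Adaptation will continue once these split operators are available (the operator request has been submitted), and performance testing will follow the adaptation.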
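A toy timing model shows why the split matters. It is a sketch under several assumptions:

- The hardware is reduced to two resources, `comm` and `aicore`.
- After a balanced split, each stream runs one expert group of equal size.
- All durations are made-up units.
- Pessimistically, a fused dispatch/combine holds the AI Core for its whole run.
- A hypothetical split leaves only a short AI Core part in each operator.

The scheduler is a simple greedy list scheduler and does not reproduce the Ascend runtime's actual behavior. `group_tasks` and `makespan` are illustrative names only.

```python
from typing import Dict, List, Tuple

COMM, AICORE = "comm", "aicore"
# A task is (name, duration, set of hardware resources it occupies while running).
Task = Tuple[str, float, frozenset]


def group_tasks(split: bool, dispatch: float = 2.0, experts: float = 4.0,
                combine: float = 2.0, core_part: float = 0.5) -> List[Task]:
    """Tasks for one expert group; all durations are illustrative, not measured."""
    if not split:
        # Assumed fused behavior: the AI Core is held for the whole operator.
        return [("dispatch", dispatch, frozenset({COMM, AICORE})),
                ("experts", experts, frozenset({AICORE})),
                ("combine", combine, frozenset({COMM, AICORE}))]
    # Hypothetical split: a short AI Core part plus a pure communication part.
    return [("dispatch_core", core_part, frozenset({AICORE})),
            ("dispatch_comm", dispatch - core_part, frozenset({COMM})),
            ("experts", experts, frozenset({AICORE})),
            ("combine_comm", combine - core_part, frozenset({COMM})),
            ("combine_core", core_part, frozenset({AICORE}))]


def makespan(streams: List[List[Task]]) -> float:
    """Greedy list scheduling without backfilling.

    Tasks in one stream run in order. A task starts once the previous task in its
    stream has finished and every resource it needs is free. Among the streams'
    next tasks, the one that can start earliest is scheduled first.
    """
    free_at: Dict[str, float] = {COMM: 0.0, AICORE: 0.0}
    ready = [0.0] * len(streams)
    pos = [0] * len(streams)
    while any(p < len(s) for p, s in zip(pos, streams)):
        best = None
        for i, stream in enumerate(streams):
            if pos[i] == len(stream):
                continue
            _, _, res = stream[pos[i]]
            start = max([ready[i]] + [free_at[r] for r in res])
            if best is None or start < best[0]:
                best = (start, i)
        start, i = best
        _, duration, res = streams[i][pos[i]]
        end = start + duration
        for r in res:
            free_at[r] = end
        ready[i] = end
        pos[i] += 1
    return max(ready)


for split in (False, True):
    a, b = group_tasks(split), group_tasks(split)
    print(f"split={split}: single stream {makespan([a + b])}, "
          f"dual stream {makespan([a, b])}")
```

With these illustrative numbers, the fused operators gain nothing from a second stream (16.0 in both cases) because every task needs the AI Core. The split version drops to 12.0 because one group's communication overlaps the other group's expert computation.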