
[RFC][FastDeploy] MetaX C500 PaddleOCR-VL-1.5 Performance Analysis (oldzhu) #1360

Open

oldzhu wants to merge 1 commit into PaddlePaddle:master from oldzhu:metax-paddleocr-vl-perf-analysis

Conversation

oldzhu commented Apr 30, 2026

Summary

This RFC documents the Phase 1 performance analysis and Phase 2 optimization results for running PaddleOCR-VL-1.5 on a MetaX C500 GPU (64 GB, MACA 3.3.0) via FastDeploy 2.5.

Environment

  • GPU: MetaX C500 (64 GB GDDR6, MACA 3.3.0.15)
  • Model: PaddleOCR-VL-1.5 (0.9B parameters, bfloat16)
  • Framework: FastDeploy 2.5 (release/2.5)
  • Profiler: mcTracer attach mode

Key Findings (Phase 1)

Kernel Hotspots (628 input / 165 output tokens, 4.38s wall time)

| Kernel | % GPU time |
| --- | --- |
| FlashAttention (SDPA) | 33.4% |
| GEMV (vocab projection) | 12.3% |
| GEMV (MLP) | 6.3% |
| RMSNorm | 5.8% |
| TopK | 3.3% |
| SigLIP backbone | 1.5% |
  • GPU utilization: 19.5% (67 W / 350 W TDP)
  • Root cause: Python/CPU dispatch overhead accounts for 80.5% of wall time (~21 ms of CPU dispatch vs ≤2 ms of GPU kernel time per decode step); see the measurement sketch below.
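
The 80.5% figure comes from the mcTracer traces; as a rough cross-check, the same CPU-vs-GPU split can be approximated from Python. A minimal sketch, assuming a hypothetical `decode_step` callable standing in for one FastDeploy decode iteration, and assuming `paddle.device.synchronize()` routes through the MACA custom-device backend:

```python
import time

import paddle


def measure_dispatch_overhead(decode_step, iters=50):
    """Approximate CPU dispatch time vs end-to-end time per decode step."""
    # Warm up so shader/kernel compilation doesn't pollute the numbers.
    for _ in range(5):
        decode_step()
    paddle.device.synchronize()  # assumed to cover the MACA device

    # Without a sync, the loop mostly measures Python-side launch cost:
    # when CPU dispatch dominates, the GPU work queue never fills up.
    t0 = time.perf_counter()
    for _ in range(iters):
        decode_step()
    cpu_ms = (time.perf_counter() - t0) * 1e3 / iters

    # With syncs, we measure CPU dispatch plus GPU execution per step.
    paddle.device.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        decode_step()
    paddle.device.synchronize()
    total_ms = (time.perf_counter() - t0) * 1e3 / iters

    print(f"dispatch ~{cpu_ms:.1f} ms/step, end-to-end ~{total_ms:.1f} ms/step")
```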

Phase 2 Optimizations

| Action | Outcome | Improvement |
| --- | --- | --- |
| 8.1 SOT pre-compile | DISCARDED (crash on MACA 3.3.0) | n/a |
| 8.2 MACA shader cache warm-up | KEEP (warm-start benefit) | cold start: 135.2 s → 4.28 s (−97%) |
| 8.3 Concurrent batching | KEEP (async client pool) | ~10 → ~88 tok/s aggregate (+780%) |
| 8.4 RMSNorm+Linear fusion | BLOCKED / Future Work | needs custom MACA kernel |

Minimal sketches of the 8.2 warm-up call, the 8.3 client pool, and the 8.4 reference semantics follow below.
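
For Action 8.2, the warm-start benefit can be locked in by firing one throwaway request right after server start, so MACA shader compilation is paid before real traffic arrives. A minimal sketch; the endpoint path, port, and model name are assumptions (FastDeploy is assumed to expose an OpenAI-compatible chat endpoint), so adjust to your deployment:

```python
import requests

WARMUP_URL = "http://localhost:8188/v1/chat/completions"  # assumed endpoint

payload = {
    "model": "PaddleOCR-VL-1.5",
    "messages": [{"role": "user", "content": "warm-up"}],
    "max_tokens": 8,
}

# The first request after a cold start triggers MACA shader compilation
# (135.2 s observed without a populated cache, 4.28 s with one), so give
# it a generous timeout.
resp = requests.post(WARMUP_URL, json=payload, timeout=600)
resp.raise_for_status()
print("warm start ready:", resp.status_code)
```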

The concrete gains from concurrent batching (Action 8.3) meet and far exceed the >20% improvement requirement; see the client-pool sketch below.
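
A minimal sketch of an async client pool of the kind used in Action 8.3; the URL, model name, and concurrency level are illustrative assumptions. Submitting requests concurrently lets the continuous batcher amortize the per-step CPU dispatch cost across many sequences instead of serving one at a time:

```python
import asyncio

import aiohttp

URL = "http://localhost:8188/v1/chat/completions"  # assumed endpoint/port
CONCURRENCY = 8  # illustrative; tune to the scheduler's max batch size


async def one_request(session, prompt):
    payload = {
        "model": "PaddleOCR-VL-1.5",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 165,
    }
    async with session.post(URL, json=payload) as resp:
        body = await resp.json()
        # Assumes an OpenAI-style `usage` block in the response.
        return body["usage"]["completion_tokens"]


async def main():
    async with aiohttp.ClientSession() as session:
        counts = await asyncio.gather(
            *(one_request(session, f"page {i}") for i in range(CONCURRENCY))
        )
    print(f"{sum(counts)} tokens across {CONCURRENCY} concurrent requests")


asyncio.run(main())
```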

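Action 8.4 is blocked because the stack has no fused RMSNorm+Linear kernel. For reference, a sketch of the semantics such a custom MACA kernel would have to match, in plain Paddle (float32 and a hidden size of 2048 are illustrative; the deployed model runs bfloat16):

```python
import paddle


def rmsnorm_linear_reference(x, gamma, weight, eps=1e-6):
    # RMSNorm: normalize by the root-mean-square over the hidden dim.
    rms = paddle.rsqrt(paddle.mean(x * x, axis=-1, keepdim=True) + eps)
    normed = x * rms * gamma
    # Unfused, `normed` round-trips through global memory before the
    # GEMV reads it back; a fused kernel would keep it on-chip.
    return paddle.matmul(normed, weight)


x = paddle.randn([1, 2048])          # one token at decode time
gamma = paddle.ones([2048])
weight = paddle.randn([2048, 2048])
print(rmsnorm_linear_reference(x, gamma, weight).shape)  # [1, 2048]
```
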
Files Added

  • rfcs/FastDeploy/20260430_metax_paddleocr_vl_perf_analysis.md — English
  • rfcs/FastDeploy/20260430_metax_paddleocr_vl_perf_analysis.zh.md — Chinese

Hackathon track: 文心伙伴赛道-沐曦-进阶 (ERNIE Partner Track / MetaX / Advanced)

… report

Add Phase 1 performance bottleneck analysis for PaddleOCR-VL-1.5 running
via FastDeploy 2.5 on MetaX C500 GPU (MACA 3.3.0).

Report includes:
- Inference framework scheduling analysis (vLLM-style continuous batching)
- GPU utilization profile (19.5%, 67W/350W TDP)
- 6 kernel function analyses (FlashAttention 33.4%, GEMV 12.3%+6.3%,
  RMSNorm 5.8%, TopK 3.3%, SigLIP 1.5%)
- Phase 2 optimization validation results:
  - Action 8.1 (SOT graph compile): DISCARDED - crash on MACA 3.3.0
  - Action 8.2 (MACA shader cache): KEEP - 135.2s -> 4.28s cold start (-97%)
  - Action 8.3 (concurrent batching): KEEP - ~10 -> ~88 tok/s (+780%)
  - Action 8.4 (RMSNorm+Linear fusion): Future Work - no kernel in stack

GitHub: https://github.com/oldzhu/paddle-mx
Author: oldzhu

paddle-bot (bot) commented Apr 30, 2026

Your PR has been submitted. Thanks for your contribution!
Please check that its format and content are complete; for this, you can refer to the Template and Demo.


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
