
[RFC][FastDeploy] MetaX C500 PaddleOCR-VL-1.5 Performance Analysis (oldzhu) #1360

Open

oldzhu wants to merge 1 commit into PaddlePaddle:master from oldzhu:metax-paddleocr-vl-perf-analysis

Conversation

oldzhu commented Apr 30, 2026

Summary

This RFC documents the Phase 1 performance analysis and Phase 2 optimization results for running PaddleOCR-VL-1.5 on a MetaX C500 GPU (64 GB, MACA 3.3.0) via FastDeploy 2.5.

Environment

  • GPU: MetaX C500 (64 GB GDDR6, MACA 3.3.0.15)
  • Model: PaddleOCR-VL-1.5 (0.9B parameters, bfloat16)
  • Framework: FastDeploy 2.5 (release/2.5)
  • Profiler: mcTracer attach mode

Key Findings (Phase 1)

Kernel Hotspots (628 input / 165 output tokens, 4.38s wall time)

| Kernel | % GPU time |
| --- | --- |
| FlashAttention (SDPA) | 33.4% |
| GEMV (vocab projection) | 12.3% |
| GEMV (MLP) | 6.3% |
| RMSNorm | 5.8% |
| TopK | 3.3% |
| SigLIP backbone | 1.5% |
  • GPU utilization: 19.5% (67 W / 350 W TDP)
  • Root cause: Python/CPU dispatch overhead accounts for 80.5% of wall time (~21 ms of CPU dispatch vs ≤2 ms of GPU kernel time per decode step); see the measurement sketch below.
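
The 80.5% figure comes from the mcTracer traces; as a rough cross-check, the same CPU-vs-GPU split can be approximated from Python. A minimal sketch, assuming a hypothetical `decode_step` callable standing in for one FastDeploy decode iteration, and assuming `paddle.device.synchronize()` routes through the MACA custom-device backend:

```python
import time

import paddle


def measure_dispatch_overhead(decode_step, iters=50):
    """Approximate CPU dispatch time vs end-to-end time per decode step."""
    # Warm up so shader/kernel compilation doesn't pollute the numbers.
    for _ in range(5):
        decode_step()
    paddle.device.synchronize()  # assumed to cover the MACA device

    # Without a sync, the loop mostly measures Python-side launch cost:
    # when CPU dispatch dominates, the GPU work queue never fills up.
    t0 = time.perf_counter()
    for _ in range(iters):
        decode_step()
    cpu_ms = (time.perf_counter() - t0) * 1e3 / iters

    # With syncs, we measure CPU dispatch plus GPU execution per step.
    paddle.device.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        decode_step()
    paddle.device.synchronize()
    total_ms = (time.perf_counter() - t0) * 1e3 / iters

    print(f"dispatch ~{cpu_ms:.1f} ms/step, end-to-end ~{total_ms:.1f} ms/step")
```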

Phase 2 Optimizations

| Action | Outcome | Improvement |
| --- | --- | --- |
| 8.1 SOT pre-compile | DISCARDED (crash on MACA 3.3.0) | n/a |
| 8.2 MACA shader cache warm-up | KEEP (warm-start benefit) | cold start: 135.2 s → 4.28 s (−97%) |
| 8.3 Concurrent batching | KEEP (async client pool) | ~10 → ~88 tok/s aggregate (+780%) |
| 8.4 RMSNorm+Linear fusion | BLOCKED / Future Work | needs custom MACA kernel |

Minimal sketches of the 8.2 warm-up call, the 8.3 client pool, and the 8.4 reference semantics follow below.
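
For Action 8.2, the warm-start benefit can be locked in by firing one throwaway request right after server start, so MACA shader compilation is paid before real traffic arrives. A minimal sketch; the endpoint path, port, and model name are assumptions (FastDeploy is assumed to expose an OpenAI-compatible chat endpoint), so adjust to your deployment:

```python
import requests

WARMUP_URL = "http://localhost:8188/v1/chat/completions"  # assumed endpoint

payload = {
    "model": "PaddleOCR-VL-1.5",
    "messages": [{"role": "user", "content": "warm-up"}],
    "max_tokens": 8,
}

# The first request after a cold start triggers MACA shader compilation
# (135.2 s observed without a populated cache, 4.28 s with one), so give
# it a generous timeout.
resp = requests.post(WARMUP_URL, json=payload, timeout=600)
resp.raise_for_status()
print("warm start ready:", resp.status_code)
```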

The concrete gains from concurrent batching (Action 8.3) meet and far exceed the >20% improvement requirement; see the client-pool sketch below.
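
A minimal sketch of an async client pool of the kind used in Action 8.3; the URL, model name, and concurrency level are illustrative assumptions. Submitting requests concurrently lets the continuous batcher amortize the per-step CPU dispatch cost across many sequences instead of serving one at a time:

```python
import asyncio

import aiohttp

URL = "http://localhost:8188/v1/chat/completions"  # assumed endpoint/port
CONCURRENCY = 8  # illustrative; tune to the scheduler's max batch size


async def one_request(session, prompt):
    payload = {
        "model": "PaddleOCR-VL-1.5",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 165,
    }
    async with session.post(URL, json=payload) as resp:
        body = await resp.json()
        # Assumes an OpenAI-style `usage` block in the response.
        return body["usage"]["completion_tokens"]


async def main():
    async with aiohttp.ClientSession() as session:
        counts = await asyncio.gather(
            *(one_request(session, f"page {i}") for i in range(CONCURRENCY))
        )
    print(f"{sum(counts)} tokens across {CONCURRENCY} concurrent requests")


asyncio.run(main())
```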

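Action 8.4 is blocked because the stack has no fused RMSNorm+Linear kernel. For reference, a sketch of the semantics such a custom MACA kernel would have to match, in plain Paddle (float32 and a hidden size of 2048 are illustrative; the deployed model runs bfloat16):

```python
import paddle


def rmsnorm_linear_reference(x, gamma, weight, eps=1e-6):
    # RMSNorm: normalize by the root-mean-square over the hidden dim.
    rms = paddle.rsqrt(paddle.mean(x * x, axis=-1, keepdim=True) + eps)
    normed = x * rms * gamma
    # Unfused, `normed` round-trips through global memory before the
    # GEMV reads it back; a fused kernel would keep it on-chip.
    return paddle.matmul(normed, weight)


x = paddle.randn([1, 2048])          # one token at decode time
gamma = paddle.ones([2048])
weight = paddle.randn([2048, 2048])
print(rmsnorm_linear_reference(x, gamma, weight).shape)  # [1, 2048]
```
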
Files Added

  • rfcs/FastDeploy/20260430_metax_paddleocr_vl_perf_analysis.md — English
  • rfcs/FastDeploy/20260430_metax_paddleocr_vl_perf_analysis.zh.md — Chinese

Hackathon track: 文心伙伴赛道-沐曦-进阶 (ERNIE Partner Track / MetaX / Advanced)

… report

Add Phase 1 performance bottleneck analysis for PaddleOCR-VL-1.5 running
via FastDeploy 2.5 on MetaX C500 GPU (MACA 3.3.0).

Report includes:
- Inference framework scheduling analysis (vLLM-style continuous batching)
- GPU utilization profile (19.5%, 67W/350W TDP)
- 6 kernel function analyses (FlashAttention 33.4%, GEMV 12.3%+6.3%,
  RMSNorm 5.8%, TopK 3.3%, SigLIP 1.5%)
- Phase 2 optimization validation results:
  - Action 8.1 (SOT graph compile): DISCARDED - crash on MACA 3.3.0
  - Action 8.2 (MACA shader cache): KEEP - 135.2s -> 4.28s cold start (-97%)
  - Action 8.3 (concurrent batching): KEEP - ~10 -> ~88 tok/s (+780%)
  - Action 8.4 (RMSNorm+Linear fusion): Future Work - no kernel in stack

GitHub: https://github.com/oldzhu/paddle-mx
Author: oldzhu

paddle-bot (bot) commented Apr 30, 2026

Your PR has been submitted. Thanks for your contribution!
Please check that its format and content are complete; for this, you can refer to the Template and Demo.


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
