
Conversation

@sufubao (Collaborator) commented Jan 22, 2026

Summary

  • Add complete GLM-4.7-Flash (glm4_moe_lite) model family support with MoE architecture
  • Add MTP (Multi-Token Prediction) layer support for speculative decoding
  • Add GLM-4.7 function call parser with XML-style argument format
  • Add benchmark orchestrator and launcher scripts for performance testing
  • Increase decode attention batch size limit from 2048 to 8192 for better throughput

Model Architecture

  • Grouped MoE with top-k expert selection
  • Support for vanilla_with_att and eagle_with_att MTP modes
  • Compatible with FlashAttention3 backend

Function Call Support

  • New Glm47Detector class for parsing GLM-4.7's XML-style tool calls
  • Format: <tool_call>func_name\n<arg_key>key</arg_key><arg_value>value</arg_value></tool_call> (a parsing sketch of this format follows below)
  • Full streaming support for incremental parsing
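
For orientation, here is a minimal standalone sketch of how this XML-style format could be parsed with regular expressions. It is illustrative only and ignores streaming; the names (parse_tool_calls, TOOL_CALL_RE, ARG_RE) are hypothetical and are not the actual Glm47Detector implementation.

import re

# Hypothetical helper (not the Glm47Detector API): extracts calls of the form
# <tool_call>func_name\n<arg_key>k</arg_key><arg_value>v</arg_value></tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ARG_RE = re.compile(r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL)

def parse_tool_calls(text: str):
    """Return a list of {"name": ..., "arguments": {...}} dicts."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        name, _, rest = block.partition("\n")  # the function name is the first line
        args = {k.strip(): v.strip() for k, v in ARG_RE.findall(rest)}
        calls.append({"name": name.strip(), "arguments": args})
    return calls

# Example:
# parse_tool_calls("<tool_call>get_weather\n<arg_key>city</arg_key>"
#                  "<arg_value>Beijing</arg_value></tool_call>")
# -> [{"name": "get_weather", "arguments": {"city": "Beijing"}}]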

Recommended Launch Script (H200)

LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \
python -m lightllm.server.api_server \
    --model_dir /path/to/GLM-4.7-Flash/ \
    --tp 1 \
    --max_req_total_len 202752 \
    --chunked_prefill_size 8192 \
    --llm_prefill_att_backend flashinfer \
    --llm_decode_att_backend flashinfer \
    --graph_max_batch_size 512 \
    --tool_call_parser glm47 \
    --reasoning_parser glm45 \
    --host 0.0.0.0 \
    --port 8000

Note: add --mtp_step 4 and --mtp_mode eagle_with_att to enable MTP speculative decoding; the combined command is shown below.
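
For convenience, the same launch command with the MTP options folded in (identical to the script above except for the two extra flags):

LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \
python -m lightllm.server.api_server \
    --model_dir /path/to/GLM-4.7-Flash/ \
    --tp 1 \
    --max_req_total_len 202752 \
    --chunked_prefill_size 8192 \
    --llm_prefill_att_backend flashinfer \
    --llm_decode_att_backend flashinfer \
    --graph_max_batch_size 512 \
    --tool_call_parser glm47 \
    --reasoning_parser glm45 \
    --mtp_mode eagle_with_att \
    --mtp_step 4 \
    --host 0.0.0.0 \
    --port 8000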


Function Calling Test Results (BFCL v3)

Category LightLLM
simple 62.50%
multiple 54.50%
parallel 69.50%
parallel_multiple 61.50%
java 66.00%
javascript 48.00%
irrelevance 83.33%
live_simple 45.74%
live_multiple 34.00%
live_parallel 25.00%
live_parallel_multiple 37.50%
rest 2.86%
sql 28.00%
OVERALL 49.12%

Speed Test Results (ShareGPT 2000 prompts, 4× H200)

LightLLM Performance by Workload

Workload Output (tok/s) TTFT (ms) E2E Latency (ms)
burst 6442 11476 27719
high-conc (512) 6728 1099 11240
moderate (10 req/s) 1798 196 5746
steady (5 req/s) 917 154 2797

Test Plan

  • Code imports successfully (Glm47Detector, Glm4MoeLiteMTPModel, ModeBackend)
  • Black/flake8 formatting passes
  • Function calling test: BFCL v3 benchmark (49.12% overall, aligned with SGLang)
  • Speed test: ShareGPT benchmark (6442 tok/s, comparable to SGLang)

Hardware

  • 4× NVIDIA H200 (141 GB HBM3e each)
  • NVLink 4.0 interconnect

Related

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @sufubao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for the GLM-4.7-Flash model, a powerful Mixture-of-Experts (MoE) model, by integrating its unique architectural features like Multi-Head Latent Attention (MLA) and a specialized Multi-Token Prediction (MTP) layer for speculative decoding. It also adds a dedicated parser for the model's XML-style function calls, enhancing its utility in tool-use scenarios. Furthermore, a new benchmarking framework has been added to facilitate performance comparisons, alongside an increase in the decode attention batch size to boost overall throughput.

Highlights

  • GLM-4.7-Flash Model Support: Added comprehensive support for the GLM-4.7-Flash (glm4_moe_lite) model family, including its Mixture-of-Experts (MoE) architecture, which features Grouped MoE with top-k expert selection and a sigmoid scoring function for routing.
  • Multi-Token Prediction (MTP) Layer Integration: Implemented support for the GLM-4.7-Flash's Multi-Token Prediction (MTP) layer, enabling speculative decoding with both vanilla_with_att and eagle_with_att modes for enhanced inference speed.
  • GLM-4.7 Function Call Parser: Introduced a new Glm47Detector class to parse GLM-4.7's unique XML-style tool call format, providing full streaming support for incremental parsing of function arguments.
  • Benchmarking Orchestrator and Scripts: Added a new benchmarking suite, including an orchestrator script and dedicated launcher scripts, to compare the performance of LightLLM and SGLang inference frameworks under various configurations and workloads.
  • Increased Decode Attention Batch Size: The decode attention batch size limit has been significantly increased from 2048 to 8192, improving throughput for larger batch inference scenarios.
  • MLA Attention Adaptations: Modified the Multi-Head Latent Attention (MLA) kernels to correctly handle GLM-4.7-Flash's architecture where v_head_dim can differ from qk_nope_head_dim, ensuring accurate output projection.
  • MoE Gate Precision Enhancement: The MoE gate computation for GLM-4.7-Flash now uses float32 precision, which is crucial for the sigmoid scoring function and improves expert routing accuracy.
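
To make the last bullet concrete, here is a minimal PyTorch sketch of a sigmoid-scored top-k gate computed in float32. This is an illustrative assumption, not LightLLM's kernel code: tensor names and shapes are hypothetical, and the grouped-expert selection mentioned above is omitted for brevity.

import torch

def sigmoid_topk_gate(hidden: torch.Tensor, gate_weight: torch.Tensor, top_k: int = 8):
    # hidden: [num_tokens, hidden_dim] (often bf16); gate_weight: [num_experts, hidden_dim].
    # Routing logits are computed in float32 so the sigmoid scores and the
    # subsequent top-k selection are not distorted by low-precision rounding.
    logits = torch.nn.functional.linear(hidden.float(), gate_weight.float())
    scores = torch.sigmoid(logits)                            # [num_tokens, num_experts]
    topk_scores, topk_ids = torch.topk(scores, top_k, dim=-1)
    # Normalize the selected scores so each token's expert weights sum to 1.
    topk_weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids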

@gemini-code-assist bot left a comment


Code Review

This pull request adds support for the GLM-4.7-Flash model, including its MoE architecture, MTP speculative decoding, and a custom XML-style function call parser. The changes are extensive, introducing new model implementations, benchmark scripts, and updates to the attention kernels. Overall, the implementation appears solid and well-integrated. I've provided a couple of suggestions to improve maintainability in the new benchmark script and to enhance the robustness of the new function call parser's streaming logic.

Comment on lines +1424 to +1425
sent = len(self.streamed_args_for_tool[self.current_tool_id])
argument_diff = current_args_json[sent:]

Severity: high

The logic for calculating argument_diff by slicing the new JSON string (current_args_json[sent:]) is fragile. It assumes that the new JSON string is always an append-only extension of the previously streamed string. This assumption may not hold if the underlying dictionary's key order changes for any reason during serialization, or if a value is modified rather than appended to. If the new JSON string is not a simple superset, this slicing logic could lead to incorrect or malformed argument chunks being sent to the client.

A more robust approach would be to use a diffing mechanism that doesn't rely on simple string extension, for example by finding the common prefix between the old and new JSON strings before calculating the difference.
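
As a concrete illustration of that suggestion, here is a hedged sketch of a prefix-based diff; compute_argument_diff is a hypothetical helper, not the parser's actual code.

import os

def compute_argument_diff(prev_json: str, new_json: str) -> str:
    # Stream only the portion of the new JSON string that extends the longest
    # common prefix with what was already sent, instead of assuming the new
    # string is a strict append-only extension of the previous one.
    common = os.path.commonprefix([prev_json, new_json])
    return new_json[len(common):]

# Example: compute_argument_diff('{"city": "Bei', '{"city": "Beijing"}') -> 'jing"}'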

@sufubao (Collaborator, Author) commented Jan 22, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

  • Replace hardcoded batch size limits with a dynamic MAX_BATCH_SIZE parameter
  • Use triton.next_power_of_2() instead of the custom _next_power_of_2 function (see the sketch below)
  • Remove hardcoded assert statements for batch size limits
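
To illustrate the second bullet, a minimal sketch of how a kernel launch helper might size its block dynamically; pick_block_size and its default cap are hypothetical, not LightLLM's actual code.

import triton

def pick_block_size(batch_size: int, max_batch_size: int = 8192) -> int:
    # Round the runtime batch size up to a power of two for the kernel's block
    # constexpr, capped by a configurable maximum instead of a hardcoded assert.
    return min(triton.next_power_of_2(batch_size), max_batch_size)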
