
Conversation

@sufubao (Collaborator) commented Jan 22, 2026

Summary

  • Add complete GLM-4.7-Flash (glm4_moe_lite) model family support with MoE architecture
  • Add MTP (Multi-Token Prediction) layer support for speculative decoding
  • Add GLM-4.7 function call parser with XML-style argument format
  • Add benchmark orchestrator and launcher scripts for performance testing
  • Increase decode attention batch size limit from 2048 to 8192 for better throughput

Model Architecture

  • Grouped MoE with top-k expert selection
  • Support for vanilla_with_att and eagle_with_att MTP modes
  • Compatible with FlashAttention3 backend

Function Call Support

  • New Glm47Detector class for parsing GLM-4.7's XML-style tool calls
  • Format: <tool_call>func_name\n<arg_key>key</arg_key><arg_value>value</arg_value></tool_call> (a parsing sketch of this format follows below)
  • Full streaming support for incremental parsing
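
For orientation, here is a minimal standalone sketch of how this XML-style format could be parsed with regular expressions. It is illustrative only and ignores streaming; the names (parse_tool_calls, TOOL_CALL_RE, ARG_RE) are hypothetical and are not the actual Glm47Detector implementation.

import re

# Hypothetical helper (not the Glm47Detector API): extracts calls of the form
# <tool_call>func_name\n<arg_key>k</arg_key><arg_value>v</arg_value></tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ARG_RE = re.compile(r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL)

def parse_tool_calls(text: str):
    """Return a list of {"name": ..., "arguments": {...}} dicts."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        name, _, rest = block.partition("\n")  # the function name is the first line
        args = {k.strip(): v.strip() for k, v in ARG_RE.findall(rest)}
        calls.append({"name": name.strip(), "arguments": args})
    return calls

# Example:
# parse_tool_calls("<tool_call>get_weather\n<arg_key>city</arg_key>"
#                  "<arg_value>Beijing</arg_value></tool_call>")
# -> [{"name": "get_weather", "arguments": {"city": "Beijing"}}]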

Recommended Launch Script (H200)

LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \
python -m lightllm.server.api_server \
    --model_dir /path/to/GLM-4.7-Flash/ \
    --tp 1 \
    --max_req_total_len 202752 \
    --chunked_prefill_size 8192 \
    --llm_prefill_att_backend flashinfer \
    --llm_decode_att_backend flashinfer \
    --graph_max_batch_size 512 \
    --tool_call_parser glm47 \
    --reasoning_parser glm45 \
    --host 0.0.0.0 \
    --port 8000

Note: add --mtp_step 4 and --mtp_mode eagle_with_att to enable MTP speculative decoding; the combined command is shown below.
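
For convenience, the same launch command with the MTP options folded in (identical to the script above except for the two extra flags):

LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \
python -m lightllm.server.api_server \
    --model_dir /path/to/GLM-4.7-Flash/ \
    --tp 1 \
    --max_req_total_len 202752 \
    --chunked_prefill_size 8192 \
    --llm_prefill_att_backend flashinfer \
    --llm_decode_att_backend flashinfer \
    --graph_max_batch_size 512 \
    --tool_call_parser glm47 \
    --reasoning_parser glm45 \
    --mtp_mode eagle_with_att \
    --mtp_step 4 \
    --host 0.0.0.0 \
    --port 8000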


Function Calling Test Results (BFCL v3)

Category LightLLM
simple 62.50%
multiple 54.50%
parallel 69.50%
parallel_multiple 61.50%
java 66.00%
javascript 48.00%
irrelevance 83.33%
live_simple 45.74%
live_multiple 34.00%
live_parallel 25.00%
live_parallel_multiple 37.50%
rest 2.86%
sql 28.00%
OVERALL 49.12%

Speed Test Results (ShareGPT 2000 prompts, 4× H200)

LightLLM Performance by Workload

Workload Output (tok/s) TTFT (ms) E2E Latency (ms)
burst 6442 11476 27719
high-conc (512) 6728 1099 11240
moderate (10 req/s) 1798 196 5746
steady (5 req/s) 917 154 2797

Test Plan

  • Code imports successfully (Glm47Detector, Glm4MoeLiteMTPModel, ModeBackend)
  • Black/flake8 formatting passes
  • Function calling test: BFCL v3 benchmark (49.12% overall, aligned with SGLang)
  • Speed test: ShareGPT benchmark (6442 tok/s, comparable to SGLang)

Hardware

  • 4× NVIDIA H200 (141 GB HBM3e each)
  • NVLink 4.0 interconnect

Related

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @sufubao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for the GLM-4.7-Flash model, a powerful Mixture-of-Experts (MoE) model, by integrating its unique architectural features like Multi-Head Latent Attention (MLA) and a specialized Multi-Token Prediction (MTP) layer for speculative decoding. It also adds a dedicated parser for the model's XML-style function calls, enhancing its utility in tool-use scenarios. Furthermore, a new benchmarking framework has been added to facilitate performance comparisons, alongside an increase in the decode attention batch size to boost overall throughput.

Highlights

  • GLM-4.7-Flash Model Support: Added comprehensive support for the GLM-4.7-Flash (glm4_moe_lite) model family, including its Mixture-of-Experts (MoE) architecture, which features Grouped MoE with top-k expert selection and a sigmoid scoring function for routing.
  • Multi-Token Prediction (MTP) Layer Integration: Implemented support for the GLM-4.7-Flash's Multi-Token Prediction (MTP) layer, enabling speculative decoding with both vanilla_with_att and eagle_with_att modes for enhanced inference speed.
  • GLM-4.7 Function Call Parser: Introduced a new Glm47Detector class to parse GLM-4.7's unique XML-style tool call format, providing full streaming support for incremental parsing of function arguments.
  • Benchmarking Orchestrator and Scripts: Added a new benchmarking suite, including an orchestrator script and dedicated launcher scripts, to compare the performance of LightLLM and SGLang inference frameworks under various configurations and workloads.
  • Increased Decode Attention Batch Size: The decode attention batch size limit has been significantly increased from 2048 to 8192, improving throughput for larger batch inference scenarios.
  • MLA Attention Adaptations: Modified the Multi-Head Latent Attention (MLA) kernels to correctly handle GLM-4.7-Flash's architecture where v_head_dim can differ from qk_nope_head_dim, ensuring accurate output projection.
  • MoE Gate Precision Enhancement: The MoE gate computation for GLM-4.7-Flash now uses float32 precision, which is crucial for the sigmoid scoring function and improves expert routing accuracy.
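
To make the last bullet concrete, here is a minimal PyTorch sketch of a sigmoid-scored top-k gate computed in float32. This is an illustrative assumption, not LightLLM's kernel code: tensor names and shapes are hypothetical, and the grouped-expert selection mentioned above is omitted for brevity.

import torch

def sigmoid_topk_gate(hidden: torch.Tensor, gate_weight: torch.Tensor, top_k: int = 8):
    # hidden: [num_tokens, hidden_dim] (often bf16); gate_weight: [num_experts, hidden_dim].
    # Routing logits are computed in float32 so the sigmoid scores and the
    # subsequent top-k selection are not distorted by low-precision rounding.
    logits = torch.nn.functional.linear(hidden.float(), gate_weight.float())
    scores = torch.sigmoid(logits)                            # [num_tokens, num_experts]
    topk_scores, topk_ids = torch.topk(scores, top_k, dim=-1)
    # Normalize the selected scores so each token's expert weights sum to 1.
    topk_weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids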

@gemini-code-assist bot left a comment


Code Review

This pull request adds support for the GLM-4.7-Flash model, including its MoE architecture, MTP speculative decoding, and a custom XML-style function call parser. The changes are extensive, introducing new model implementations, benchmark scripts, and updates to the attention kernels. Overall, the implementation appears solid and well-integrated. I've provided a couple of suggestions to improve maintainability in the new benchmark script and to enhance the robustness of the new function call parser's streaming logic.

Comment on lines +1424 to +1425
sent = len(self.streamed_args_for_tool[self.current_tool_id])
argument_diff = current_args_json[sent:]

Severity: high

The logic for calculating argument_diff by slicing the new JSON string (current_args_json[sent:]) is fragile. It assumes that the new JSON string is always an append-only extension of the previously streamed string. This assumption may not hold if the underlying dictionary's key order changes for any reason during serialization, or if a value is modified rather than appended to. If the new JSON string is not a simple superset, this slicing logic could lead to incorrect or malformed argument chunks being sent to the client.

A more robust approach would be to use a diffing mechanism that doesn't rely on simple string extension, for example by finding the common prefix between the old and new JSON strings before calculating the difference.
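
As a concrete illustration of that suggestion, here is a hedged sketch of a prefix-based diff; compute_argument_diff is a hypothetical helper, not the parser's actual code.

import os

def compute_argument_diff(prev_json: str, new_json: str) -> str:
    # Stream only the portion of the new JSON string that extends the longest
    # common prefix with what was already sent, instead of assuming the new
    # string is a strict append-only extension of the previous one.
    common = os.path.commonprefix([prev_json, new_json])
    return new_json[len(common):]

# Example: compute_argument_diff('{"city": "Bei', '{"city": "Beijing"}') -> 'jing"}'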

@sufubao (Collaborator, Author) commented Jan 22, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

  • Replace hardcoded batch size limits with a dynamic MAX_BATCH_SIZE parameter
  • Use triton.next_power_of_2() instead of the custom _next_power_of_2 function (see the sketch below)
  • Remove hardcoded assert statements for batch size limits
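
To illustrate the second bullet, a minimal sketch of how a kernel launch helper might size its block dynamically; pick_block_size and its default cap are hypothetical, not LightLLM's actual code.

import triton

def pick_block_size(batch_size: int, max_batch_size: int = 8192) -> int:
    # Round the runtime batch size up to a power of two for the kernel's block
    # constexpr, capped by a configurable maximum instead of a hardcoded assert.
    return min(triton.next_power_of_2(batch_size), max_batch_size)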
