feat: Add GLM-4.7-Flash (glm4_moe_lite) model support #1188
base: main
Conversation
Summary of Changes
Hello @sufubao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces support for the GLM-4.7-Flash model, a powerful Mixture-of-Experts (MoE) model, by integrating its unique architectural features like Multi-Head Latent Attention (MLA) and a specialized Multi-Token Prediction (MTP) layer for speculative decoding. It also adds a dedicated parser for the model's XML-style function calls, enhancing its utility in tool-use scenarios. Furthermore, a new benchmarking framework has been added to facilitate performance comparisons, alongside an increase in the decode attention batch size to boost overall throughput.
Highlights
Code Review
This pull request adds support for the GLM-4.7-Flash model, including its MoE architecture, MTP speculative decoding, and a custom XML-style function call parser. The changes are extensive, introducing new model implementations, benchmark scripts, and updates to the attention kernels. Overall, the implementation appears solid and well-integrated. I've provided a couple of suggestions to improve maintainability in the new benchmark script and to enhance the robustness of the new function call parser's streaming logic.
sent = len(self.streamed_args_for_tool[self.current_tool_id])
argument_diff = current_args_json[sent:]
The logic for calculating argument_diff by slicing the new JSON string (current_args_json[sent:]) is fragile. It assumes that the new JSON string is always an append-only extension of the previously streamed string. This assumption may not hold if the underlying dictionary's key order changes for any reason during serialization, or if a value is modified rather than appended to. If the new JSON string is not a simple superset, this slicing logic could lead to incorrect or malformed argument chunks being sent to the client.
A more robust approach would be to use a diffing mechanism that doesn't rely on simple string extension, for example by finding the common prefix between the old and new JSON strings before calculating the difference.
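A minimal sketch of that suggestion (the helper name and return convention are illustrative, not part of this PR): compute the longest common prefix of the previously streamed JSON and the newly serialized JSON, emit only the suffix beyond it, and signal the caller when the new string diverges from what was already sent instead of slicing blindly.

```python
from typing import Optional


def compute_argument_diff(streamed: str, current: str) -> Optional[str]:
    """Return the chunk of `current` that still needs to be streamed.

    Returns None when `current` diverges from what was already sent,
    so the caller can resynchronize instead of emitting a malformed suffix.
    """
    # Length of the common prefix of the two serializations.
    prefix_len = 0
    for a, b in zip(streamed, current):
        if a != b:
            break
        prefix_len += 1

    if prefix_len < len(streamed):
        # The new JSON rewrites bytes we already streamed (e.g. key order
        # changed or a value was edited); a plain slice would corrupt output.
        return None

    return current[prefix_len:]


# Example: append-only growth streams cleanly, a rewrite is detected.
assert compute_argument_diff('{"city": "Par', '{"city": "Paris"') == 'is"'
assert compute_argument_diff('{"a": 1}', '{"b": 1}') is None
```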
Force-pushed from 04e1928 to 09fa05d (compare)
Force-pushed from 8a88aa6 to 7540cf5 (compare)
Code Review
No issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code
- Replace hardcoded batch size limits with a dynamic MAX_BATCH_SIZE parameter
- Use triton.next_power_of_2() instead of the custom _next_power_of_2 function
- Remove hardcoded assert statements for batch size limits
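As context for the triton.next_power_of_2() change above, a minimal sketch of padding a runtime batch size to a power of two before kernel dispatch; the helper name and the way max_batch_size is wired in are assumptions, not the PR's actual kernel code.

```python
import triton


def padded_batch_size(batch_size: int, max_batch_size: int) -> int:
    """Pad a runtime batch size to the next power of two for kernel dispatch.

    `max_batch_size` stands in for the dynamic MAX_BATCH_SIZE parameter
    mentioned in the commit message; the real wiring may differ.
    """
    # triton.next_power_of_2 replaces a hand-rolled _next_power_of_2 helper.
    padded = triton.next_power_of_2(batch_size)
    # Never dispatch past the configured ceiling (previously a hardcoded assert).
    return min(padded, max_batch_size)


# Example: 100 pads to 128; 3000 is clamped to a 2048 ceiling.
print(padded_batch_size(100, 2048))   # 128
print(padded_batch_size(3000, 2048))  # 2048
```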
Summary
Model Architecture
Function Call Support
- Glm47Detector class for parsing GLM-4.7's XML-style tool calls
- Tool call format: <tool_call>func_name\n<arg_key>key</arg_key><arg_value>value</arg_value></tool_call> (a hedged parsing sketch follows below)
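As an illustration of that format, a minimal regex-based sketch; this is not the PR's Glm47Detector implementation, and the regexes and edge-case handling are assumptions.

```python
import re
from typing import Dict, List, Tuple

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ARG_RE = re.compile(r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Extract (function_name, arguments) pairs from GLM-4.7 style output."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        # The function name is the first line inside the <tool_call> block.
        name, _, rest = block.partition("\n")
        args = {key.strip(): value.strip() for key, value in ARG_RE.findall(rest)}
        calls.append((name.strip(), args))
    return calls


# Example
sample = "<tool_call>get_weather\n<arg_key>city</arg_key><arg_value>Paris</arg_value></tool_call>"
print(parse_tool_calls(sample))  # [('get_weather', {'city': 'Paris'})]
```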
Recommended Launch Script (H200)
LIGHTLLM_TRITON_AUTOTUNE_LEVEL=1 LOADWORKER=18 \
python -m lightllm.server.api_server \
  --model_dir /path/to/GLM-4.7-Flash/ \
  --tp 1 \
  --max_req_total_len 202752 \
  --chunked_prefill_size 8192 \
  --llm_prefill_att_backend flashinfer \
  --llm_decode_att_backend flashinfer \
  --graph_max_batch_size 512 \
  --tool_call_parser glm47 \
  --reasoning_parser glm45 \
  --host 0.0.0.0 \
  --port 8000
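Once the server is up, a hedged example of exercising the function-call path. This assumes LightLLM exposes an OpenAI-compatible /v1/chat/completions endpoint and uses the openai Python client; the model name and tool schema are illustrative only.

```python
from openai import OpenAI

# Points at the server started by the launch script above (assumption).
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="GLM-4.7-Flash",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# With --tool_call_parser glm47, the XML-style call should come back as
# structured tool_calls rather than raw text.
print(response.choices[0].message.tool_calls)
```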
Function Calling Test Results (BFCL v3)
Speed Test Results (ShareGPT 2000 prompts, 4x H200)
LightLLM Performance by Workload
Test Plan
- Glm47Detector, Glm4MoeLiteMTPModel, ModeBackend
Hardware
Related