[mllm] support longcat_next#1637

Draft
xin3he wants to merge 3 commits into main from xinhe/3-30

Conversation

@xin3he
Contributor

@xin3he xin3he commented Mar 30, 2026

Description

Running AutoRound on LongCat-Next fails with:

ValueError: Cannot use apply_chat_template because this processor does not have a chat template.

To reproduce: auto-round /storage/xinhe/meituan-longcat/LongCat-Next/
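
The error above suggests the multimodal processor ships without a chat template. A minimal sketch of one possible fallback, assuming a transformers-style `processor`/`tokenizer` that expose a `chat_template` attribute (the function name `build_prompt` and the plain-concatenation fallback are illustrative, not AutoRound's actual fix):

```python
# Hedged sketch: prefer the processor's chat template, fall back to the
# tokenizer's, and as a last resort join role/content pairs manually.
# `processor` and `tokenizer` are hypothetical stand-ins for the objects
# the quantizer receives; `chat_template` follows the transformers convention.
def build_prompt(messages, processor=None, tokenizer=None):
    template_owner = None
    if processor is not None and getattr(processor, "chat_template", None):
        template_owner = processor
    elif tokenizer is not None and getattr(tokenizer, "chat_template", None):
        template_owner = tokenizer
    if template_owner is not None:
        # apply_chat_template renders the conversation with the model's template.
        return template_owner.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    # Last resort: plain role/content concatenation.
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```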

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: Xin He <xin3.he@intel.com>
Copilot AI review requested due to automatic review settings March 30, 2026 06:28
@xin3he xin3he requested review from lvliang-intel and n1ck-guo and removed request for Copilot March 30, 2026 06:31
@XuehaoSun
Contributor

2026-03-30 15:56:33 INFO __main__.py L599: start to quantize meituan-longcat/LongCat-Next
2026-03-30 15:56:34 INFO autoround.py L178: using MLLM mode for multimodal model.
/data3/hf_new_model_cache/modules/transformers_modules/meituan_hyphen_longcat/LongCat_hyphen_Next/522f2020e5ed353429cc403b72491ba1899ef0e6/modular_longcat_next_audio.py:220: Fut
  @autocast(enabled=True, dtype=torch.float32)
2026-03-30 15:56:41 WARNING modeling_utils.py L2446: You are attempting to use Flash Attention 2 without specifying a torch dtype. This might lead to unexpected behaviour
/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/diffusers/models/lora.py:393: FutureWarning: `LoRACompatibleLinear` is deprecated and will be removed in
  deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
self.visual_offset_vals=tensor([150581, 166965, 183349, 199733, 216117, 232501, 248885, 265269])
self.audio_offset_vals=tensor([131125, 139317, 143413, 145461, 146485, 147509, 148533, 149557])
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:01<00:00, 11.14it/s]
2026-03-30 15:57:01 WARNING compressor.py L286: longcat_next does not support for NeelNanda/pile-10k, will use liuhaotian/llava_conv_58k with default config as an alternative.
2026-03-30 15:57:01 WARNING compressor.py L296: reset batch_size(8) to 1 and gradient_accumulate_steps(1) to 8, because batch_size=8 cannot be used for liuhaotian/llava_conv_58k
2026-03-30 15:57:01 INFO base.py L517: using torch.bfloat16 for quantization tuning
2026-03-30 15:57:01 INFO base.py L834: 'enable_torch_compile' is set to `False` by default. Enabling it can reduce tuning cost by 20%, but it might throw an exception.
2026-03-30 15:57:01 WARNING formats.py L166: some layers are skipped quantization (shape not divisible by 32): audio_head.heads.[0-7], lm_head, model.audio_tokenizer.audio_flow_
2026-03-30 15:57:01 INFO base.py L1660: Using predefined ignore_layers: classifier
2026-03-30 15:57:02 INFO base.py L1818: start to cache block inputs
2026-03-30 15:57:07 WARNING base.py L2328: Some layers are offloaded to cpu, which may severely impact calibration speed. Please consider using more cards.
Some parameters are on the meta device because they were offloaded to the cpu.
2026-03-30 15:57:28 WARNING dataset.py L251: seqlen(2048) is greater than the maximum length supported by the liuhaotian/llava_conv_58k, reset to 512
2026-03-30 15:57:28 INFO dataset.py L99: use dataset llava_conv_58k, downloading...
cache block inputs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [14:44<00:00,  6.91s/it]
2026-03-30 16:12:42 INFO base.py L1835: caching done
Quantizing model.layers.0:   0%|                                                                                                                         | 0/100 [00:10<?, ?it/s]
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
2026-03-30 16:54:23 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.000444 -> iter 194: 0.000079,'peak_ram': 86.58GB, 'peak_vram': 66.75GB
Quantizing model.layers.1:   1%|█                                                                                                           | 1/100 [42:12<69:39:08, 2532.81s/it]
quantized 784/785 layers in the block, loss iter 0: 0.001716 -> iter 195: 0.000445,'peak_ram': 94.89GB, 'peak_vram': 66.75GB
Quantizing model.layers.2:   2%|██                                                                                                        | 2/100 [1:23:27<68:01:31, 2498.89s/it]
quantized 784/785 layers in the block, loss iter 0: 0.002576 -> iter 199: 0.001224,'peak_ram': 103.3GB, 'peak_vram': 66.75GB
Quantizing model.layers.3:   3%|███▏                                                                                                      | 3/100 [2:04:37<66:58:09, 2485.46s/it]
quantized 784/785 layers in the block, loss iter 0: 0.003595 -> iter 197: 0.001099,'peak_ram': 104.32GB, 'peak_vram': 66.75GB
Quantizing model.layers.4:   4%|████▏                                                                                                     | 4/100 [2:43:32<64:41:42, 2426.07s/it]
quantized 784/785 layers in the block, loss iter 0: 0.003605 -> iter 192: 0.001413,'peak_ram': 116.5GB, 'peak_vram': 66.75GB
Quantizing model.layers.5:   5%|█████▎                                                                                                    | 5/100 [3:21:56<62:51:48, 2382.19s/it]
quantized 784/785 layers in the block, loss iter 0: 0.004384 -> iter 192: 0.002084,'peak_ram': 116.6GB, 'peak_vram': 66.75GB
Quantizing model.layers.6:   6%|██████▎                                                                                                   | 6/100 [4:00:49<61:45:39, 2365.31s/it]
quantized 784/785 layers in the block, loss iter 0: 0.006060 -> iter 196: 0.002672,'peak_ram': 121.61GB, 'peak_vram': 66.75GB
Quantizing model.layers.7:   7%|███████▍                                                                                                  | 7/100 [4:39:00<60:28:36, 2341.03s/it]2026-03-30 21:30:55 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.009842 -> iter 169: 0.003777,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.8:   8%|████████▍                                                                                                 | 8/100 [5:18:48<60:12:30, 2355.99s/it]2026-03-30 22:10:08 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.009777 -> iter 199: 0.004623,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.9:   9%|█████████▌                                                                                                | 9/100 [5:57:55<59:29:10, 2353.30s/it]2026-03-30 22:48:58 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.018928 -> iter 191: 0.008281,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.10:  10%|██████████▍                                                                                             | 10/100 [6:36:50<58:41:27, 2347.64s/it]2026-03-30 23:28:31 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.022149 -> iter 180: 0.011693,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.11:  11%|███████████▍                                                                                            | 11/100 [7:16:19<58:12:02, 2354.18s/it]2026-03-31 00:09:23 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.041877 -> iter 196: 0.017732,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.12:  12%|████████████▍                                                                                           | 12/100 [7:57:11<58:16:20, 2383.87s/it]2026-03-31 00:52:34 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.072172 -> iter 197: 0.030324,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.13:  13%|█████████████▌                                                                                          | 13/100 [8:40:29<59:10:31, 2448.64s/it]2026-03-31 01:34:45 INFO base.py L3187: Unquantized layers: ['mlp.router.classifier']
quantized 784/785 layers in the block, loss iter 0: 0.134645 -> iter 190: 0.045848,'peak_ram': 121.7GB, 'peak_vram': 66.75GB
Quantizing model.layers.13:  14%|██████████████▌                                                                                         | 14/100 [9:22:36<59:03:33, 2472.25s/it]Traceback (most recent call last):
  File "/home/uttest/miniforge3/envs/autoround_test/bin/auto-round", line 10, in <module>
    sys.exit(run())
             ^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 822, in run
    start()
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 541, in start
    tune(args)
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/__main__.py", line 761, in tune
    model, folders = autoround.quantize_and_save(export_dir, format=args.format)  # pylint: disable=E1101
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 1018, in quantize_and_save
    model, _ = self.quantize()
               ^^^^^^^^^^^^^^^
  File "/home/uttest/miniforge3/envs/autoround_test/lib/python3.12/site-packages/auto_round/compressors/base.py", line 1850, in quantize
    inputs = all_inputs[block_names[0]]
             ~~~~~~~~~~^^^^^^^^^^^^^^^^
KeyError: 'model.audio_tokenizer.audio_model.layers.0'
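
The `KeyError` occurs because calibration never cached inputs for the audio-tokenizer blocks (the image/text-only dataset never routes data through them), yet the quantizer still iterates over them. A minimal sketch of the kind of guard the traceback points at, assuming `all_inputs` maps block names to cached activations (names and the helper `select_calibrated_blocks` are illustrative, not AutoRound's API):

```python
# Hedged sketch: quantize only blocks whose inputs were actually cached
# during calibration; blocks untouched by the calibration data (e.g. audio
# blocks on an image/text-only dataset) are skipped instead of raising
# KeyError when their cache entry is looked up.
def select_calibrated_blocks(block_names, all_inputs):
    skipped = [name for name in block_names if name not in all_inputs]
    if skipped:
        print(f"skipping blocks with no cached calibration inputs: {skipped}")
    return [name for name in block_names if name in all_inputs]
```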

@xin3he
Contributor Author

xin3he commented Mar 31, 2026

Thanks for checking, @XuehaoSun.
The audio part should be skipped, since the dataset only contains image and text. I will fix it and let you know.

@xin3he xin3he marked this pull request as draft April 1, 2026 11:09
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he
Contributor Author

xin3he commented Apr 2, 2026

This is more complex than originally expected. Since it's an omni model, more time is needed to enable it.
