Skip to content

[MLX] Low speed on Qwen3.5-27B-4bit: 13.28 tok/s (M4 Pro 48GB) #112

@heykb

Description

@heykb

python test_mlx.py

....

   import random
   # 在 partition 前加一句:
   arr[random.randint(low, high)], arr[high] = arr[high], arr[random.randint(low, high)]
2.

==================================================
Throughput: 13.28 tok/s
Download complete: : 0.00B [02:34, ?B/s]
a1-6@192 dflash % 
#!/usr/bin/env python3
"""DFlash MLX 测试脚本"""
from dflash.model_mlx import load, load_draft, stream_generate

# 加载主模型和草稿模型
print("Loading target model: mlx-community/Qwen3.6-27B-4bit...")
model, tokenizer = load("mlx-community/Qwen3.6-27B-4bit")

print("Loading draft model: z-lab/Qwen3.6-27B-DFlash...")
draft = load_draft("z-lab/Qwen3.6-27B-DFlash")

# 准备 prompt
messages = [{"role": "user", "content": "写个 Python 快排"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)

print(f"\nPrompt: {messages[0]['content']}")
print("=" * 50)

# 生成
tps = 0.0
print("\n生成结果:\n")
for r in stream_generate(
    model,
    draft,
    tokenizer,
    prompt,
    block_size=16,
    max_tokens=2048,
    temperature=0.6
):
    print(r.text, end="", flush=True)
    tps = r.generation_tps

print(f"\n\n{'=' * 50}")
print(f"Throughput: {tps:.2f} tok/s")

same Qwen3.6-27

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions