
Hang when running examples/test_dream_dvllm_human_eval.py #6

@luozixin2

Description

  • Running examples/test_dream_dvllm_human_eval.py directly on remote origin/main (commit e9e9bb08ad8396646c8c1378d252c0facdfabeb9) repeatedly hangs in the decode phase, inside the inner while loop of model_runner.prepare_decode.
  • The stack captured at the hang points into the decode path of diffulex/legacy/engine/model_runner.py. The block selected by cur_map[local_start_idx()] is none of is_in_cache, is_to_cache, or is_active, so start_idx never advances and the loop never exits.

Reproduction environment

  • Code: working tree /home/lzx/Diffulex-remote-main, checked out from origin/main (the commit above), with no local modifications (only .venv is untracked).
  • CUDA: CUDA_HOME=$HOME/cuda-12.2, PATH="$CUDA_HOME/bin:$PATH", LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
  • GPU: CUDA_VISIBLE_DEVICES=0,1,2,3
  • Proxy: http_proxy/https_proxy/all_proxy=http://127.0.0.1:17780 (local proxy).
  • Runner: uv run (the Python venv lives in the repository's .venv/).
  • Also set: PYTHONFAULTHANDLER=1, UV_HTTP_TIMEOUT=180

Steps to reproduce

cd /home/lzx/Diffulex-remote-main
# Export CUDA_HOME on its own line first: within a single `export` command,
# "$CUDA_HOME" in PATH/LD_LIBRARY_PATH is expanded before CUDA_HOME is assigned.
export CUDA_HOME=$HOME/cuda-12.2
export PYTHONFAULTHANDLER=1 \
  http_proxy=http://127.0.0.1:17780 https_proxy=http://127.0.0.1:17780 \
  HTTP_PROXY=http://127.0.0.1:17780 HTTPS_PROXY=http://127.0.0.1:17780 \
  all_proxy=http://127.0.0.1:17780 ALL_PROXY=http://127.0.0.1:17780 \
  no_proxy=localhost,127.0.0.1,::1 NO_PROXY=localhost,127.0.0.1,::1 \
  PATH="$CUDA_HOME/bin:$PATH" \
  LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
  CUDA_VISIBLE_DEVICES=0,1,2,3 UV_HTTP_TIMEOUT=180
# The redirect below fails if log/ does not exist yet.
mkdir -p log
uv run python examples/test_dream_dvllm_human_eval.py > log/test_dvllm_dream_human_eval.remote_main.log 2>&1
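
Because the hang leaves the process alive, it can be useful to capture a stack dump without interrupting the run. The following is a minimal sketch, assuming a POSIX system and only the standard-library faulthandler module. The helper name `install_hang_diagnostics` and the 600-second default are illustrative assumptions and are not part of Diffulex.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_s: float = 600.0, stream=sys.stderr) -> None:
    """Hypothetical helper: make a stuck process report where it is.

    `stream` must be a real file object (it needs a file descriptor).
    """
    # Same effect as PYTHONFAULTHANDLER=1, but with an explicit output stream.
    faulthandler.enable(file=stream, all_threads=True)
    # `kill -USR1 <pid>` now dumps every thread's traceback and lets the
    # process keep running (POSIX only).
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)
    # Watchdog: dump all tracebacks every `timeout_s` seconds until cancelled
    # with faulthandler.cancel_dump_traceback_later().
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)
```

Calling `install_hang_diagnostics()` once near the top of the example script would be one way to use it. Repeated dumps that all end at the same line in prepare_decode would support the infinite-loop reading over a slow step.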

Observed behavior

  • Output stops at about Generating: 79%|█████ | 130/164 ...; GPU utilization drops to 0% while the process keeps consuming CPU.
  • Python stack printed on manual interrupt (excerpt):
[rank0]:   File "/home/lzx/Diffulex-remote-main/examples/test_dream_dvllm_human_eval.py", line 74, in <module>
[rank0]:     outputs = LLM.generate(prompts[:], sampling_params)
[rank0]:   File "/home/lzx/Diffulex/diffulex/legacy/engine/llm_engine.py", line 118, in generate
[rank0]:     output, num_tokens, is_prefill, cur_n_diff_steps, _ = self.step()
[rank0]:   File "/home/lzx/Diffulex/diffulex/legacy/engine/llm_engine.py", line 77, in step
[rank0]:     sample_output = self.model_runner.call("run", seqs, is_prefill)
[rank0]:   File "/home/lzx/Diffulex/diffulex/legacy/engine/model_runner.py", line 678, in run
[rank0]:     input_ids, positions = self.prepare_prefill(seqs) if is_prefill else self.prepare_decode(seqs)
[rank0]:   File "/home/lzx/Diffulex/diffulex/legacy/engine/model_runner.py", line 586, in prepare_decode
[rank0]:     if cur_map[local_start_idx()] == seq.num_diffusion_blocks - 1:
  • The while loop in prepare_decode:
while start_idx < end_idx and not is_last_block and not meet_active_block:
    local_start_idx = lambda: start_idx % seq.block_size
    diffusion_block = seq.diffusion_blocks[cur_map[local_start_idx()]]
    ...
    if diffusion_block.is_in_cache:
        ...
        start_idx += step
    elif diffusion_block.is_to_cache:
        ...
        start_idx += step
    elif diffusion_block.is_active:
        meet_active_block = True
    # Any other state is unhandled → start_idx stays unchanged and the loop never exits
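
The stall can be illustrated outside the engine. The sketch below is a simplified model, not the Diffulex implementation: `ToyBlock` and `walk` are hypothetical names, the block carries only the three flags named in this issue, and an iteration cap is added so that the demonstration terminates instead of hanging.

```python
from dataclasses import dataclass


@dataclass
class ToyBlock:
    """Hypothetical stand-in for a diffusion block (three flags only)."""
    is_in_cache: bool = False
    is_to_cache: bool = False
    is_active: bool = False


def walk(blocks, step=1, max_iters=1000):
    """Simplified pointer walk; returns (start_idx, stalled)."""
    start_idx, end_idx = 0, len(blocks)
    meet_active_block = False
    iters = 0
    while start_idx < end_idx and not meet_active_block:
        iters += 1
        if iters > max_iters:
            # Without this cap the loop would spin forever on CPU.
            return start_idx, True
        block = blocks[start_idx]
        if block.is_in_cache or block.is_to_cache:
            start_idx += step
        elif block.is_active:
            meet_active_block = True
        # No else branch: a block with all three flags False changes nothing,
        # so the loop condition stays true.
    return start_idx, False


# A block with no flag set in the middle of the sequence stalls the walk.
stalled_case = walk([ToyBlock(is_in_cache=True), ToyBlock(), ToyBlock(is_active=True)])
normal_case = walk([ToyBlock(is_in_cache=True), ToyBlock(is_to_cache=True), ToyBlock(is_active=True)])
```

In this model `stalled_case` reports a stall at index 1, while `normal_case` stops at the active block. That matches the symptom above (CPU busy, GPU idle), but it does not by itself explain how a block reaches the unflagged state.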

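Expected behavior

  • The decode phase should not enter an infinite loop. On an unexpected state it should at least advance the pointer or raise an error, so that the run either continues or fails fast.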

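Preliminary hypothesis

  • Some diffusion blocks are in a state that is none of cache, to_cache, or active, so the pointer does not advance. Suggest adding a guard in that branch (for example else: break, or log the anomaly and advance start_idx) and printing the state of the offending block to help determine the correct semantics.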
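
One possible shape for the fail-fast variant of such a guard is sketched below. It is an assumption-laden illustration, not a patch for model_runner.py: `describe_block` and `next_start_idx` are hypothetical helpers, and the only block attributes used are the three flags quoted in this issue. Whether raising, breaking, or advancing is the right semantics is exactly what remains to be confirmed.

```python
from types import SimpleNamespace


def describe_block(block) -> str:
    """Hypothetical helper: render the three flags for a log or error message."""
    flags = ("is_in_cache", "is_to_cache", "is_active")
    return ", ".join(f"{name}={getattr(block, name, None)}" for name in flags)


def next_start_idx(block, start_idx: int, step: int):
    """Return (new_start_idx, meet_active_block); raise on an unhandled state."""
    if block.is_in_cache or block.is_to_cache:
        return start_idx + step, False
    if block.is_active:
        return start_idx, True
    # Fail fast instead of spinning, and report the state that was seen.
    raise RuntimeError(
        f"unhandled diffusion block state at start_idx={start_idx}: "
        f"{describe_block(block)}"
    )


# Usage with a dummy object standing in for a block with no flag set.
dummy = SimpleNamespace(is_in_cache=False, is_to_cache=False, is_active=False)
try:
    next_start_idx(dummy, start_idx=4, step=2)
except RuntimeError as exc:
    print(exc)
```

Raising keeps the failure visible and attaches the observed state to it, whereas a silent else: break would end the hang but could hide whatever produced the unexpected block.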
