Description
- Running examples/test_dream_dvllm_human_eval.py directly on the remote origin/main (commit e9e9bb08ad8396646c8c1378d252c0facdfabeb9) repeatedly hangs in the decode phase, inside the inner while loop of model_runner.prepare_decode.
- The stack captured at the hang stops in the decode path of diffulex/legacy/engine/model_runner.py. The block selected by cur_map[local_start_idx()] is neither is_in_cache, nor is_to_cache, nor is_active, so start_idx never advances and the loop never exits.
Reproduction environment
- Code: worktree /home/lzx/Diffulex-remote-main, checked out from origin/main (the commit above), with no local modifications (only .venv is untracked).
- CUDA: CUDA_HOME=$HOME/cuda-12.2, PATH="$CUDA_HOME/bin:$PATH", LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}".
- GPU: CUDA_VISIBLE_DEVICES=0,1,2,3.
- Proxy: http_proxy/https_proxy/all_proxy=http://127.0.0.1:17780 (local proxy).
- Runner: uv run (the Python venv lives in the repository's .venv/).
- Also set: PYTHONFAULTHANDLER=1, UV_HTTP_TIMEOUT=180.
Steps to reproduce
cd /home/lzx/Diffulex-remote-main
export PYTHONFAULTHANDLER=1 \
http_proxy=http://127.0.0.1:17780 https_proxy=http://127.0.0.1:17780 \
HTTP_PROXY=http://127.0.0.1:17780 HTTPS_PROXY=http://127.0.0.1:17780 \
all_proxy=http://127.0.0.1:17780 ALL_PROXY=http://127.0.0.1:17780 \
no_proxy=localhost,127.0.0.1,::1 NO_PROXY=localhost,127.0.0.1,::1 \
CUDA_HOME=$HOME/cuda-12.2 PATH="$CUDA_HOME/bin:$PATH" \
LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
CUDA_VISIBLE_DEVICES=0,1,2,3 UV_HTTP_TIMEOUT=180
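uv run python examples/test_dream_dvllm_human_eval.py > log/test_dvllm_dream_human_eval.remote_main.log 2>&1
Observed behavior
- Progress stalls at around Generating: 79%|█████ | 130/164 ... with no further output. GPU utilization drops to 0% while the process keeps consuming CPU.
- Python stack printed on manual interrupt (excerpt):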
[rank0]: File "/home/lzx/Diffulex-remote-main/examples/test_dream_dvllm_human_eval.py", line 74, in <module>
[rank0]: outputs = LLM.generate(prompts[:], sampling_params)
[rank0]: File "/home/lzx/Diffulex/diffulex/legacy/engine/llm_engine.py", line 118, in generate
[rank0]: output, num_tokens, is_prefill, cur_n_diff_steps, _ = self.step()
[rank0]: File "/home/lzx/Diffulex/diffulex/legacy/engine/llm_engine.py", line 77, in step
[rank0]: sample_output = self.model_runner.call("run", seqs, is_prefill)
[rank0]: File "/home/lzx/Diffulex/diffulex/legacy/engine/model_runner.py", line 678, in run
[rank0]: input_ids, positions = self.prepare_prefill(seqs) if is_prefill else self.prepare_decode(seqs)
[rank0]: File "/home/lzx/Diffulex/diffulex/legacy/engine/model_runner.py", line 586, in prepare_decode
[rank0]: if cur_map[local_start_idx()] == seq.num_diffusion_blocks - 1:
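The stack above was only available after a manual interrupt. PYTHONFAULTHANDLER=1 by itself dumps tracebacks on fatal signals such as SIGSEGV, not on a busy loop, so a hang like this one produces no output until the process is interrupted. A minimal sketch of getting the same stack from a live process follows, using only the standard-library faulthandler module. The helper name install_hang_diagnostics and its parameters are hypothetical and not part of Diffulex; SIGUSR1 is Unix-only.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(file=sys.stderr, timeout_s=600):
    """Hypothetical helper: make stack dumps available without killing the run.

    `file` must be a real file object (it needs a working fileno()).
    """
    # `kill -USR1 <pid>` now writes every thread's stack to `file`
    # and lets the process keep running.
    faulthandler.register(signal.SIGUSR1, file=file, all_threads=True)
    # Additionally dump all stacks every `timeout_s` seconds until
    # faulthandler.cancel_dump_traceback_later() is called.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
```

Assuming it is called once near the top of each worker process, a later stall can be inspected with kill -USR1 on the busy PID, and the periodic dump shows whether successive snapshots keep landing on the same line of prepare_decode.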
The while loop in prepare_decode:
    while start_idx < end_idx and not is_last_block and not meet_active_block:
        local_start_idx = lambda: start_idx % seq.block_size
        diffusion_block = seq.diffusion_blocks[cur_map[local_start_idx()]]
        ...
        if diffusion_block.is_in_cache:
            ...
            start_idx += step
        elif diffusion_block.is_to_cache:
            ...
            start_idx += step
        elif diffusion_block.is_active:
            meet_active_block = True
        # Any other state is unhandled: start_idx stays unchanged and the loop never exits.
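Expected behavior
- The decode phase should never enter an infinite loop. On an unexpected block state it should at least advance the pointer or raise an error, so that the run either continues or fails fast.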
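Preliminary hypothesis
- Some diffusion blocks are in a state that is none of in-cache, to_cache, or active, so the pointer does not advance. Suggest adding a guard for that case (for example else: break, or log the anomaly and advance start_idx) and printing the state of the offending block to help confirm the intended semantics.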
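One possible shape for that guard is sketched below on a toy model rather than on the real model_runner.py. BlockState, ToyBlock, and scan_blocks are hypothetical stand-ins: the real code exposes is_in_cache / is_to_cache / is_active on its diffusion blocks and indexes cur_map through local_start_idx(), whereas this sketch indexes a plain list by position and omits the is_last_block condition. It illustrates the fail-fast variant only and makes no claim about which recovery is semantically correct.

```python
from dataclasses import dataclass
from enum import Enum, auto


class BlockState(Enum):
    IN_CACHE = auto()
    TO_CACHE = auto()
    ACTIVE = auto()
    OTHER = auto()  # stand-in for whatever unhandled state triggers the hang


@dataclass
class ToyBlock:
    state: BlockState


def scan_blocks(blocks, block_map, start_idx, end_idx, step):
    """Walk positions like the decode loop, but fail fast on an unhandled state."""
    meet_active_block = False
    while start_idx < end_idx and not meet_active_block:
        block_id = block_map[start_idx]
        block = blocks[block_id]
        if block.state in (BlockState.IN_CACHE, BlockState.TO_CACHE):
            start_idx += step  # handled states always make progress
        elif block.state is BlockState.ACTIVE:
            meet_active_block = True  # exits via the loop condition
        else:
            # Without this branch the loop would spin forever on the same index.
            raise RuntimeError(
                f"unhandled block state {block.state.name} "
                f"(block {block_id}, start_idx={start_idx}, end_idx={end_idx})"
            )
    return start_idx, meet_active_block
```

Raising includes the block index and state in the message, which is the diagnostic the hypothesis asks for. Replacing the raise with a logged break or a forced advance of start_idx would keep the run alive, at the risk of hiding an upstream state bug.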