Qwen3VL-30A3 multi-image training parameters: help needed #6920

@jiangsongtao

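Hello, I want to run mixed training on multi-image data (32-64 images) and video data. In the single-image case, pp1, ep8, mbs2 with imagetokennum=1024 runs fine. Now pp4, ep8, mbs1 fails, reducing token_num does not seem to help, and tp2pp4, ep4 does not work either: every configuration runs out of GPU memory. What values are generally appropriate for the video and image token_num (SFT stage)? The training command is below. Many thanks.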

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
IMAGE_MAX_TOKEN_NUM=64 \
VIDEO_MIN_TOKEN_NUM=4 \
VIDEO_MAX_TOKEN_NUM=64 \
FPS=1 \
FPS_MAX_FRAMES=16 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
    --model Qwen3vl_30a3 \
    --load_safetensors true \
    --save_safetensors true \
    --dataset ./data.jsonl \
    --load_from_cache_file true \
    --sequence_parallel true \
    --packing false \
    --freeze_llm false \
    --freeze_vit false \
    --freeze_aligner false \
    --split_dataset_ratio 0 \
    --moe_permute_fusion true \
    --pipeline_model_parallel_size 4 \
    --tensor_model_parallel_size 1 \
    --context_parallel_size 1 \
    --expert_model_parallel_size 8 \
    --expert_tensor_parallel_size 1 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-6 \
    --micro_batch_size 1 \
    --global_batch_size 512 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --overlap_param_gather true \
    --overlap_grad_reduce true \
    --lr 5e-5 \
    --vit_lr 2e-6 \
    --min_lr 1e-6 \
    --lr_warmup_fraction 0.05 \
    --max_epochs 1 \
    --save ./ckpt \
    --save_interval 200000 \
    --vit_gradient_checkpointing false \
    --max_length 32768 \
    --num_workers 32 \
    --no_save_optim true \
    --eval_iters -1 \
    --no_save_rng true \
    --sequence_parallel true \
    --moe_expert_capacity_factor 2 \
    --optimizer_cpu_offload true \
    --use_precision_aware_optimizer true \
    --optimizer_offload_fraction 0.2 \
    --attention_backend flash \
    --dataset_num_proc 64
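
As a rough sanity check on the token budget in the command above, the per-sample visual token count can be estimated with simple arithmetic. This is only a back-of-envelope sketch: the helper `visual_token_bound` is hypothetical, and it assumes each image contributes at most `IMAGE_MAX_TOKEN_NUM` tokens and each sampled video frame at most `VIDEO_MAX_TOKEN_NUM` tokens, ignoring text tokens and any temporal merging of frames.

```shell
#!/usr/bin/env bash
# Back-of-envelope upper bound on visual tokens per sample.
# Assumptions (not confirmed by the issue): each image contributes at most
# IMAGE_MAX_TOKEN_NUM tokens, each sampled video frame at most
# VIDEO_MAX_TOKEN_NUM tokens; text tokens and frame merging are ignored.

# Usage: visual_token_bound NUM_IMAGES IMAGE_MAX_TOKENS NUM_FRAMES VIDEO_MAX_TOKENS
visual_token_bound() {
    local num_images=$1 image_max=$2 num_frames=$3 video_max=$4
    echo $(( num_images * image_max + num_frames * video_max ))
}

# 64 images at the single-image setting of 1024 tokens each, no video:
visual_token_bound 64 1024 0 0    # prints 65536, above --max_length 32768

# 64 images at the values from the command above, plus one 16-frame video:
visual_token_bound 64 64 16 64    # prints 5120 (4096 + 1024)
```

Under these assumptions, 64 images at 1024 tokens each would exceed `--max_length 32768` before any text is counted, while the settings in the command above stay far below it.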
