Closed
Description
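Hello, I would like to run mixed training on multi-image data (32-64 images per sample) and video data. In the single-image case, pp1, ep8, mbs2 with an image token num of 1024 runs fine. Now even pp4, ep8, mbs1 fails, lowering token_num does not seem to help, and tp2 pp4 ep4 does not work either: every configuration runs out of GPU memory (OOM). What values are generally appropriate for the video and image token_num in the SFT stage? The training command is below. Thanks a lot: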
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
IMAGE_MAX_TOKEN_NUM=64 \
VIDEO_MIN_TOKEN_NUM=4 \
VIDEO_MAX_TOKEN_NUM=64 \
FPS=1 \
FPS_MAX_FRAMES=16 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
--model Qwen3vl_30a3 \
--load_safetensors true \
--save_safetensors true \
--dataset ./data.jsonl \
--load_from_cache_file true \
--sequence_parallel true \
--packing false \
--freeze_llm false \
--freeze_vit false \
--freeze_aligner false \
--split_dataset_ratio 0 \
--moe_permute_fusion true \
--pipeline_model_parallel_size 4 \
--tensor_model_parallel_size 1 \
--context_parallel_size 1 \
--expert_model_parallel_size 8 \
--expert_tensor_parallel_size 1 \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 1e-6 \
--micro_batch_size 1 \
--global_batch_size 512 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--overlap_param_gather true \
--overlap_grad_reduce true \
--lr 5e-5 \
--vit_lr 2e-6 \
--min_lr 1e-6 \
--lr_warmup_fraction 0.05 \
--max_epochs 1 \
--save ./ckpt \
--save_interval 200000 \
--vit_gradient_checkpointing false \
--max_length 32768 \
--num_workers 32 \
--no_save_optim true \
--eval_iters -1 \
--no_save_rng true \
--sequence_parallel true \
--moe_expert_capacity_factor 2 \
--optimizer_cpu_offload true \
--use_precision_aware_optimizer true \
--optimizer_offload_fraction 0.2 \
--attention_backend flash \
--dataset_num_proc 64
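For context on the token_num question, here is a rough back-of-the-envelope sketch of how many visual tokens one sample could contain under the settings mentioned above. It assumes that `IMAGE_MAX_TOKEN_NUM` caps the tokens per image and `VIDEO_MAX_TOKEN_NUM` caps the tokens per sampled video frame; that is my reading of the variables, not something confirmed here. The helper `visual_token_upper_bound` is a hypothetical name used only for illustration, and the result is an upper bound on sequence length, not a memory estimate.

```shell
#!/bin/sh
# Rough upper bound on visual tokens in one sample.
# ASSUMPTION: the *_MAX_TOKEN_NUM variables cap tokens per image / per video frame.
# visual_token_upper_bound is a hypothetical helper, not part of any library.
visual_token_upper_bound() {
  # $1 = number of images (or frames), $2 = max tokens per image (or frame)
  echo $(( $1 * $2 ))
}

MAX_LENGTH=32768   # matches --max_length in the command above

# 64 images at the single-image setting of 1024 tokens each
multi_image_1024=$(visual_token_upper_bound 64 1024)
# 64 images at IMAGE_MAX_TOKEN_NUM=64
multi_image_64=$(visual_token_upper_bound 64 64)
# one video: FPS_MAX_FRAMES=16 frames at VIDEO_MAX_TOKEN_NUM=64
video_64=$(visual_token_upper_bound 16 64)

for n in "$multi_image_1024" "$multi_image_64" "$video_64"; do
  if [ "$n" -gt "$MAX_LENGTH" ]; then
    echo "$n visual tokens: exceeds max_length=$MAX_LENGTH"
  else
    echo "$n visual tokens: fits within max_length=$MAX_LENGTH"
  fi
done
```

Under these assumptions, 64 images at 1024 tokens each would already exceed `--max_length 32768` before any text is counted, whereas the values in the command above (64 tokens per image or per frame) stay well below it. That suggests the OOM with the current settings is not explained by visual sequence length alone.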