
While using megatron distributed flash-checkpoint to recover, an error occurs in load_checkpoint #1233

@deepcoldfish

Description


Env: 16 GPUs + llama2 pretraining + Megatron-LM
Strategy: TP 8 + PP 1 + DP 2
Case: after killing a training process to retrigger fault tolerance with megatron distributed flash-checkpoint, load_checkpoint fails for the DP 1 group with the following log:

WARNING: on rank 11 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 10 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 14 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.

The reason is that the DP 1 group loads the checkpoint from storage (it has no model state in memory) and performs an allreduce inside read_metadata, while the DP 0 group loads only from memory and never joins that collective.
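A minimal sketch of the symptom, assuming read_metadata resolves the restart iteration with a MAX reduction across ranks: if the DP 0 ranks skip the storage read, their contribution to the reduction is an uninitialized buffer rather than the real iteration, and the garbage value wins the MAX. The rank numbers and the value 4160813071 are taken from the log above; the garbage contribution from a DP 0 rank is a hypothetical illustration, not traced code.

```python
# Hypothetical simulation of a MAX reduction over per-rank iterations in
# read_metadata. Ranks 10/11/14 (DP 1) read iteration 15 from storage;
# a DP 0 rank never filled its buffer, so it contributes garbage.
GARBAGE = 4160813071  # uninitialized buffer content, matching the log

contributions = {
    10: 15,        # DP 1: read metadata from storage
    11: 15,        # DP 1: read metadata from storage
    14: 15,        # DP 1: read metadata from storage
    0: GARBAGE,    # DP 0 (hypothetical): skipped the storage read
}

# The MAX reduction picks the garbage value over the real iteration.
max_iter = max(contributions.values())

for rank, it in sorted(contributions.items()):
    if it != max_iter:
        print(f"WARNING: on rank {rank} found iteration {it} in the metadata "
              f"while max iteration across the ranks is {max_iter}, "
              f"replacing it with max iteration.")
```

Running this reproduces the shape of the warnings in the log: every rank that read the true iteration 15 is told to replace it with the bogus maximum.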
