
While using megatron distributed flash-checkpoint to recover, an error occurs in load_checkpoint #1233

@deepcoldfish

Description


Env: 16 GPUs + llama2 pretraining + Megatron-LM
Strategy: TP 8 + PP 1 + DP 2
Case: after killing a training process to retrigger fault tolerance with megatron distributed flash-checkpoint, load_checkpoint fails for the DP 1 group with the following log:

WARNING: on rank 11 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 10 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.
WARNING: on rank 14 found iteration 15 in the metadata while max iteration across the ranks is 4160813071, replacing it with max iteration.

The reason is that the DP 1 group loads the checkpoint from storage (it has no model state in memory) and performs an allreduce inside read_metadata, while the DP 0 group loads only from memory and never joins that collective.
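A minimal sketch of the symptom, assuming read_metadata resolves the restart iteration with a MAX reduction across ranks: if the DP 0 ranks skip the storage read, their contribution to the reduction is an uninitialized buffer rather than the real iteration, and the garbage value wins the MAX. The rank numbers and the value 4160813071 are taken from the log above; the garbage contribution from a DP 0 rank is a hypothetical illustration, not traced code.

```python
# Hypothetical simulation of a MAX reduction over per-rank iterations in
# read_metadata. Ranks 10/11/14 (DP 1) read iteration 15 from storage;
# a DP 0 rank never filled its buffer, so it contributes garbage.
GARBAGE = 4160813071  # uninitialized buffer content, matching the log

contributions = {
    10: 15,        # DP 1: read metadata from storage
    11: 15,        # DP 1: read metadata from storage
    14: 15,        # DP 1: read metadata from storage
    0: GARBAGE,    # DP 0 (hypothetical): skipped the storage read
}

# The MAX reduction picks the garbage value over the real iteration.
max_iter = max(contributions.values())

for rank, it in sorted(contributions.items()):
    if it != max_iter:
        print(f"WARNING: on rank {rank} found iteration {it} in the metadata "
              f"while max iteration across the ranks is {max_iter}, "
              f"replacing it with max iteration.")
```

Running this reproduces the shape of the warnings in the log: every rank that read the true iteration 15 is told to replace it with the bogus maximum.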
