Resuming training from a checkpoint: Incorrect function call and misspelt attribute in ckpt loading logic #71

@DhruvaRajwade

Describe the bug
The checkpoint loading logic contains two bugs, one incorrect attribute reference and one incorrect function name, that prevent training from resuming from a saved checkpoint.

  1. In examples/text/logic/state.py, line 62: self._data_state.test.load_state_dict(loaded_state["test_sampler"])
    (FIX: self._data_state.test.sampler.load_state_dict(loaded_state["test_sampler"]) )
    Here, the .sampler attribute is missing, so the call tries to load the sampler's state_dict into the Dataset class instead of into its sampler.

  2. In examples/text/main_train.py, line 27: cfg = checkpointing.load_hydra_config_from_run(cfg.load_dir)
    (FIX: cfg = checkpointing.load_cfg_from_path(cfg.load_dir) )
    Here, the function name is incorrect; the function exists in utils/checkpointing.py under the name load_cfg_from_path.
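
To illustrate the first bug in isolation, here is a minimal sketch. The `StatefulSampler` and `Dataset` classes below are hypothetical stand-ins, not the repository's actual classes; the only assumption carried over from the issue is that the dataset object exposes a `sampler` attribute and that only the sampler implements `load_state_dict`.

```python
from dataclasses import dataclass, field


class StatefulSampler:
    """Hypothetical stand-in for a sampler that can save and restore its state."""

    def __init__(self) -> None:
        self.epoch = 0

    def state_dict(self) -> dict:
        return {"epoch": self.epoch}

    def load_state_dict(self, state: dict) -> None:
        self.epoch = state["epoch"]


@dataclass
class Dataset:
    """Hypothetical container: holds samples and a sampler, but has no load_state_dict."""

    samples: list = field(default_factory=list)
    sampler: StatefulSampler = field(default_factory=StatefulSampler)


test = Dataset(samples=[1, 2, 3])
loaded_state = {"test_sampler": {"epoch": 5}}

# Buggy form: the container itself has no load_state_dict method.
try:
    test.load_state_dict(loaded_state["test_sampler"])
    buggy_call_failed = False
except AttributeError:
    buggy_call_failed = True

# Fixed form: restore the state through the .sampler attribute.
test.sampler.load_state_dict(loaded_state["test_sampler"])
```

In this sketch the buggy call raises AttributeError, while the fixed call restores the sampler state. The second bug is of the same kind: a name that is not defined in the module it is looked up in.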

To Reproduce
Set load_dir = 'path_to_ckpt_parent' in examples/text/configs/config.yaml, then run examples/text/run_train.py.

Expected behavior
The checkpoint gets picked up, and training resumes.
