Skip to content

Test multi-gpu setup #10

@BrownianNotion

Description

@BrownianNotion

Description

Spin up an instance on the cloud with GPU and test that the training/eval pipeline works with multiple GPUs.
Start by changing --num_gpus=1 to 2. Understand and attempt to fix any basic errors encountered, otherwise close the issue and report the errors.

Multi-GPU training would allow us to:

  • train faster (at the cost of paying more)
  • train larger models like 3/7B potentially

Checklist before resolving issue

  • I have included the issue number in my PR description
  • I have rebased/pulled in changes from main and fixed any conflicts
  • If I have changed the models/training code, I have run dry_run.sh on a CUDA device successfully

Metadata

Metadata

Assignees

No one assigned

    Labels

    evalChanges related to the eval pipelinegood first issueGood for newcomerstrainingChanges related to training pipeline

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions