
Fix: CUDA OOM issue after training before inference #96

Open

Hongbin10 wants to merge 1 commit into TIO-IKIM:main from Hongbin10:fix/CPPNet-OOM-release-gpu-memory-before-inference

Conversation

@Hongbin10

Problem

When running run_cpp_net.py, training and inference are executed sequentially in the same process. On hardware with limited GPU memory (tested on an NVIDIA L40S with 48 GB of VRAM), allocated GPU memory reached ~97% by the end of training. The experiment object, which holds the model, optimizer states, and gradients, was not explicitly released before inference started, leaving insufficient VRAM for the inference model to load.

This caused a torch.OutOfMemoryError at the start of inference:

```
CUDA out of memory. Tried to allocate 6.00 GiB.
GPU 0 has a total capacity of 44.42 GiB of which 607.38 MiB is free.
```

Root Cause

In both the checkpoint-resume and normal-run branches of run_cpp_net.py, inference = InferenceCellViTCPP(...) was constructed immediately after experiment.run_experiment(), without first releasing the training objects from GPU memory.
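For illustration, the problematic sequence looks roughly like this (a sketch based on the description above; the exact constructor arguments and surrounding code are assumptions):

```python
# Sketch of the sequence in run_cpp_net.py (names taken from this PR)
experiment = ExperimentCellViTCPP(...)  # holds model, optimizer states, gradients
experiment.run_experiment()             # training; VRAM usage peaks near capacity

# `experiment` is still referenced here, so its CUDA tensors remain allocated
inference = InferenceCellViTCPP(...)    # tries to allocate ~6 GiB -> torch.OutOfMemoryError
```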

Fix

Added explicit GPU memory cleanup between training and inference in both branches:

```python
import gc

import torch

# Drop the last reference to the training objects (model, optimizer, gradients),
# let Python reclaim them, then return the now-unused cached CUDA blocks so the
# inference model can allocate them.
del experiment
gc.collect()
torch.cuda.empty_cache()
```
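The crucial step is dropping the last strong reference before inference allocates. A stdlib-only sketch (using a stand-in `Experiment` class, not the real one from run_cpp_net.py) shows that `del` plus `gc.collect()` actually reclaims the object, which can be observed with a weak reference:

```python
import gc
import weakref


class Experiment:
    """Stand-in for the training experiment object (hypothetical; the
    real object holds GPU-resident model and optimizer tensors)."""

    def __init__(self):
        self.model = object()      # placeholder for model weights
        self.optimizer = object()  # placeholder for optimizer state


experiment = Experiment()
probe = weakref.ref(experiment)  # observe the object's lifetime

del experiment  # drop the last strong reference
gc.collect()    # also reclaim anything caught in reference cycles

assert probe() is None  # the experiment object has been freed
```

With real CUDA tensors, freeing the Python objects returns their memory to PyTorch's caching allocator; the final `torch.cuda.empty_cache()` then hands those cached blocks back to the driver so the next large allocation can succeed.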

Result

Training and inference now run successfully end-to-end in a single job without CUDA OOM errors.

[Screenshot 2026-04-02 09:28:55: the job completing training and inference without an OOM error]

