Conversation

@vijk777 vijk777 commented Jan 22, 2026

  • Implemented chunked streaming: load 64K chunks on demand instead of the full dataset
  • Background thread prefetches chunks while the GPU trains
  • Prefetch buffer size: 6 chunks for optimal overlap
  • Unit tests to check the async prefetch logic
  • Script to run on the YouTube dataset with a mock computation

vijk777 and others added 20 commits January 21, 2026 11:42
- implements RandomChunkLoader for streaming random 64K chunks
- uses background thread for disk I/O to CPU pinned memory
- uses CUDA streams for async CPU→GPU transfer
- enables overlap of disk I/O, transfer, and training
- comprehensive unit tests with mock training (sleeps)

design:
- loads random overlapping windows (no chunk boundaries)
- queue-based producer-consumer pattern
- batches_per_chunk = chunk_size / batch_size
- chunks_per_epoch = dataset_size * passes_per_epoch / chunk_size
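
A minimal sketch of those two formulas as code (the function name matches the calculate_chunk_params helper added later in this PR; the exact signature is an assumption):

```python
def calculate_chunk_params(total_timesteps: int, chunk_size: int, batch_size: int,
                           data_passes_per_epoch: int = 1) -> tuple[int, int]:
    """Derive the chunk schedule from the streaming design above (sketch)."""
    # each resident chunk yields chunk_size / batch_size batches
    batches_per_chunk = chunk_size // batch_size
    # enough chunks to cover the dataset data_passes_per_epoch times
    chunks_per_epoch = (total_timesteps * data_passes_per_epoch) // chunk_size
    return chunks_per_epoch, batches_per_chunk

# e.g. 6,553,600 timesteps, 64K chunks, batch 1024 -> (100, 64)
```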

- use fixed 200ms load time, 100ms train time
- increase prefetch buffer to 10 to avoid blocking the background thread
- expected: 1.5x speedup (1500ms sequential -> 1000ms with overlap)
- makes the test more predictable and easier to debug

numpy random generation was taking ~200ms per chunk, dominating the
sleep delay and making the overlap test unpredictable.

now using np.zeros (effectively instant) so the load time is purely the sleep delay.

- both 100ms now (was 200ms load, 100ms train)
- sequential: 5 * (100+100) = 1000ms
- with overlap: ~500ms (2x speedup expected)
- test passes if total < 650ms (0.65 threshold)
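
A self-contained sketch of the timing pattern this test exercises (a hypothetical stand-in for the real test, using the same 100 ms load, 100 ms train, and 0.65 threshold numbers):

```python
import queue
import threading
import time

def mock_load(q: queue.Queue, n_chunks: int, load_s: float = 0.1) -> None:
    for i in range(n_chunks):
        time.sleep(load_s)      # simulated disk I/O
        q.put(i)
    q.put(None)                 # sentinel: epoch finished

def run_overlapped(n_chunks: int = 5, train_s: float = 0.1) -> float:
    q = queue.Queue(maxsize=10) # prefetch buffer
    producer = threading.Thread(target=mock_load, args=(q, n_chunks))
    start = time.perf_counter()
    producer.start()
    while q.get() is not None:
        time.sleep(train_s)     # simulated training step overlaps with the next load
    producer.join()
    return time.perf_counter() - start

elapsed = run_overlapped()
# sequential would be 5 * (0.1 + 0.1) = 1.0 s; overlapped lands near 0.6 s,
# i.e. under the 0.65 * 1.0 s threshold the test uses
print(f"{elapsed:.3f} s")
```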

- amortizes cold start overhead over more iterations
- sequential: 20 * (100+100) = 4000ms
- with overlap: ~2000ms (2x speedup expected)
- should show clearer overlap benefit

implements:
- ChunkLatencyStats: tracks chunk get time, batch forward/backward/step times
- create_zarr_loader: wraps zarr loading for RandomChunkLoader
- sample_batch_within_chunk: samples batches within current chunk
- calculate_chunk_params: computes chunks_per_epoch, batches_per_chunk
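
A minimal sketch of what sample_batch_within_chunk could look like (the real signature isn't shown in this PR, so the arguments are assumptions):

```python
import torch

def sample_batch_within_chunk(chunk: torch.Tensor, batch_size: int) -> torch.Tensor:
    """Draw a random batch of timesteps from the chunk currently resident on the GPU."""
    idx = torch.randint(0, chunk.shape[0], (batch_size,), device=chunk.device)
    return chunk[idx]  # advanced indexing stays on-device, no host round-trip
```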

documentation:
- CHUNKED_STREAMING_INTEGRATION.md: step-by-step guide
- shows memory savings: 60 GB -> 8 GB (87% reduction)
- includes expected latencies and tuning parameters

ready to integrate into latent.py training loop

changes:
- modified load_dataset to return chunk_loader instead of full training data
- updated reconstruction warmup to use chunked iteration
- replaced batch iterator with chunked iteration in main training loop
- added ChunkLatencyStats tracking for chunk_get, forward, backward, step times
- log latency stats to tensorboard every epoch
- cleanup chunk_loader at end of training
- removed train_loss_constant_model baseline (requires full dataset)

memory savings:
- before: ~60 GB (full training data on GPU)
- after: ~8 GB (2 chunk buffers + validation data)
- 87% reduction in GPU memory usage

fixes:
- add safety check in LossComponents.mean() for count == 0
- add validation in calculate_chunk_params() to ensure:
  - batches_per_chunk > 0 (chunk_size >= batch_size)
  - batches_per_epoch > 0 (dataset not too small)
  - chunks_per_epoch > 0 (enough batches for at least one chunk)

the bug occurred when warmup completed without processing any batches,
causing count == 0 and a division by zero in mean().
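
A minimal sketch of the guard (LossComponents' real fields aren't shown in the PR, so this is illustrative):

```python
from dataclasses import dataclass

@dataclass
class LossComponents:
    total: float = 0.0
    count: int = 0

    def mean(self) -> float:
        # warmup can finish without processing a single batch; avoid division by zero
        return self.total / self.count if self.count > 0 else 0.0
```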

changes:
- remove validation error when chunk_size > total_timesteps
- set max_start_idx = 0 when dataset < chunk_size (always loads full dataset)
- in calculate_chunk_params: if dataset < chunk_size:
  - chunks_per_epoch = data_passes_per_epoch (load once per pass)
  - batches_per_chunk = total_timesteps // batch_size

now works with small test datasets where total_timesteps < 65536
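
Extending the earlier parameter sketch with this small-dataset branch (illustrative, mirroring the rules listed above):

```python
def calculate_chunk_params(total_timesteps: int, chunk_size: int, batch_size: int,
                           data_passes_per_epoch: int = 1) -> tuple[int, int]:
    if total_timesteps < chunk_size:
        # the whole dataset fits inside one chunk: load it once per pass
        chunks_per_epoch = data_passes_per_epoch
        batches_per_chunk = total_timesteps // batch_size
    else:
        batches_per_chunk = chunk_size // batch_size
        chunks_per_epoch = (total_timesteps * data_passes_per_epoch) // chunk_size
    return chunks_per_epoch, batches_per_chunk
```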

- test_multiple_epochs_back_to_back: verifies consecutive epochs work
- test_early_break_from_epoch: simulates warmup breaking early

both tests currently fail, demonstrating thread cleanup issues that
need to be fixed in chunk_loader.py

fixes three issues:
1. handle early break from epoch - stop thread gracefully instead of raising error
2. clear queue between epochs to remove None sentinel
3. update test to reflect new behavior (chunk_size > dataset is allowed)

changes:
- start_epoch() now stops alive threads by setting stop_flag and draining queue
- clears queue before starting new epoch
- renamed test_chunk_size_validation -> test_chunk_size_larger_than_dataset
- test now verifies chunk_size > dataset returns full dataset
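
A rough sketch of this start_epoch cleanup pattern (the class internals aren't shown in the PR, so the attribute names and structure here are assumptions):

```python
import queue
import threading

class RandomChunkLoader:
    """Producer lifecycle only; real chunk loading is elided."""

    def __init__(self, prefetch: int = 6):
        self._queue: queue.Queue = queue.Queue(maxsize=prefetch)
        self._stop_flag = threading.Event()
        self._thread: threading.Thread | None = None

    def start_epoch(self, n_chunks: int) -> None:
        # if the previous epoch ended early (e.g. warmup broke out of the loop),
        # stop its producer thread gracefully instead of raising an error
        if self._thread is not None and self._thread.is_alive():
            self._stop_flag.set()
            self._drain()                    # unblock a producer stuck on a full queue
            self._thread.join()
        self._drain()                        # clear leftover chunks and the None sentinel
        self._stop_flag = threading.Event()
        self._thread = threading.Thread(target=self._produce, args=(n_chunks,), daemon=True)
        self._thread.start()

    def get_next_chunk(self):
        return self._queue.get()             # None means the epoch is finished

    def _produce(self, n_chunks: int) -> None:
        for i in range(n_chunks):
            if self._stop_flag.is_set():
                return
            self._queue.put(self._load_chunk(i))   # blocks while the prefetch buffer is full
        self._queue.put(None)                      # end-of-epoch sentinel

    def _load_chunk(self, i: int):
        return i                                   # placeholder for the real disk read

    def _drain(self) -> None:
        try:
            while True:
                self._queue.get_nowait()
        except queue.Empty:
            pass
```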

when chunk_size > dataset, end_idx was exceeding total_timesteps.
now clamped: end_idx = min(start_idx + chunk_size, total_timesteps)

fixes test_chunk_size_larger_than_dataset

tests chunk prefetching with real zarr data and 3.7s simulated training.
measures:
- queue size before each get_next_chunk()
- chunk get times
- queue empty/full counts

tests prefetch values: 2, 4, 6 to diagnose why background loading isn't
staying ahead during training
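
A sketch of the instrumentation such a diagnostic run collects (get_next_chunk comes from the loader; the queue attribute and the rest are assumptions):

```python
import time

def diagnose_prefetch(loader, n_chunks: int, train_s: float = 3.7) -> None:
    get_times, empty_hits, full_hits = [], 0, 0
    for _ in range(n_chunks):
        qsize = loader.queue.qsize()            # how far ahead the background thread is
        empty_hits += int(qsize == 0)
        full_hits += int(qsize == loader.queue.maxsize)
        t0 = time.perf_counter()
        chunk = loader.get_next_chunk()         # blocks whenever the prefetcher fell behind
        get_times.append(time.perf_counter() - t0)
        time.sleep(train_s)                     # simulated training on the chunk
    mean_ms = 1000 * sum(get_times) / len(get_times)
    print(f"mean get {mean_ms:.0f} ms, empty {empty_hits}/{n_chunks}, full {full_hits}/{n_chunks}")
```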

FlyVisSim is in LatentEvolution.load_flyvis, not NeuralGraph.FlyVis

changes:
- added simulate_gpu_training() that does matmul on loaded chunks
- uses actual chunk data for computation (realistic GPU memory pressure)
- changed from 3.7s sleep to 1s GPU computation
- loader now transfers chunks to cuda device
- prints iterations per chunk to verify computation is happening
- requires CUDA to run

this better simulates real training workflow: disk → cpu → gpu → compute
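
A condensed PyTorch sketch of that disk -> pinned CPU -> GPU -> compute path, assuming a 2-D (time, feature) zarr chunk and a dedicated copy stream; the names and structure are illustrative, not the loader's actual internals:

```python
import numpy as np
import torch

def load_and_train_one_chunk(zarr_array, start: int, chunk_size: int,
                             copy_stream: torch.cuda.Stream) -> torch.Tensor:
    # disk -> CPU: read the chunk into page-locked (pinned) host memory
    cpu_chunk = torch.from_numpy(np.asarray(zarr_array[start : start + chunk_size])).pin_memory()
    # CPU -> GPU: asynchronous copy on a side stream so it can overlap with compute
    with torch.cuda.stream(copy_stream):
        gpu_chunk = cpu_chunk.to("cuda", non_blocking=True)
    torch.cuda.current_stream().wait_stream(copy_stream)  # compute must wait for the copy
    # GPU compute: a small matmul stands in for the real training step
    return gpu_chunk.T @ gpu_chunk
```

In an overlapped setup, the copy for the next chunk would be issued while the current chunk is still being trained on, which is what the loader's background thread and side stream provide.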

based on diagnostics showing:
- prefetch=2: queue empty 100% of time, 411ms mean get time
- prefetch=6: queue has items 90% of time, 162ms mean get time

expected improvements:
- reduce chunk overhead from 11.5s to 4.5s per epoch (~7s savings)
- 6% speedup overall (~12 min saved per 100 epochs)
- cpu memory: +14 GB (21 GB total for prefetch buffer)
@vijk777 vijk777 merged commit b4c18b9 into main Jan 22, 2026
1 of 2 checks passed
@vijk777 vijk777 deleted the vj/youtube branch January 22, 2026 02:07