Skip to content

Commit d610919

Browse files
committed
Improve flow performance and video progress reporting.
Add CPU buffer reuse and batched CUDA flow processing while preserving output semantics, and make video progress bars use fast probed frame totals with safe fallback behavior. Made-with: Cursor
1 parent b1a3ce2 commit d610919

4 files changed

Lines changed: 483 additions & 72 deletions

File tree

README.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@ For `build` with video input:
2626
- Primary path: frames are decoded by `ffmpeg` as raw RGB to memory, then consumed by a bounded in-memory worker pipeline.
2727
- Backpressure: in-flight frame processing is capped and scales with worker count, so decode cannot run far ahead of strip generation.
2828
- Fallback path: if `ffprobe` is unavailable, the tool falls back to disk-backed temporary frame extraction.
29+
- Progress totals: with `--progress`, the pipeline performs a fast metadata probe for total frame count and shows determinate progress when available; if probing is unavailable/slow, it falls back to indeterminate progress updates.
2930
- With `--cuda` and `--mode flow`, decode and strip generation are pipelined in-memory for better overlap.
3031
- If NVDEC cannot decode the source stream/profile, decode automatically falls back to software while keeping CUDA strip compute enabled.
3132

@@ -198,7 +199,7 @@ Use zero-padded names so lexical sort matches chronological order:
198199
- In single-worker video `flow` mode, CUDA processing uses bounded frame batching to reduce launch/transfer overhead while preserving row order.
199200
- For most videos this is significantly faster than CPU flow mode, especially at 4K widths.
200201
- Hardware decode acceleration depends on codec/profile support; unsupported streams transparently use software decode.
201-
- On this repository's current implementation, a 1000-frame 4K flow run completed in roughly 43 seconds on an RTX 4080 during local validation.
202+
- On this repository's current implementation, a 1000-frame 4K flow CUDA run completed in about 39 seconds on an RTX 4080 during local validation.
202203

203204
## Troubleshooting
204205

0 commit comments

Comments
 (0)