Hi, thanks for sharing this impressive project. I have a question about the technical details of the image-to-world mode.
From the demos, it looks like the Gaussian scene keeps expanding as new video trajectories are generated beyond the currently reconstructed region. However, I am not fully sure how this is implemented, given that the diffusion model appears to operate on a fixed 81-frame input.
I am wondering which of the following better matches your implementation:
- **Chunk-wise expansion with later merging.**
  Each 81-frame diffusion pass only extrapolates from the currently reconstructed Gaussian scene. Then, to obtain the final world Gaussian, you either:
  - combine all generated video chunks and feed them jointly into the reconstruction model, or
  - reconstruct separate Gaussian scenes from different chunks and then align/merge them.
- **Progressive expansion within overlapping 81-frame windows.**
  For each 81-frame diffusion pass, the trajectory is chosen so that the generated video contains both:
  - regions already covered by the current Gaussian scene, and
  - newly extrapolated regions beyond it.
  In this case, the final Gaussian would simply be reconstructed from the last generated 81-frame video, without explicitly merging multiple video chunks or Gaussian reconstructions.
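To make the distinction concrete, here is a toy sketch of the two control flows I have in mind, using 1-D integer "regions" in place of real frames and Gaussians. All names here are hypothetical placeholders I made up for illustration, not APIs from this repo:

```python
WINDOW = 81  # frames per diffusion pass

def diffuse(start, length=WINDOW):
    """Stand-in for an 81-frame diffusion pass: returns covered region ids."""
    return list(range(start, start + length))

def reconstruct(frames):
    """Stand-in for Gaussian reconstruction: the set of regions it covers."""
    return set(frames)

def chunkwise_expansion(n_chunks):
    """Option 1: each pass purely extrapolates; chunks are merged at the end."""
    chunks = []
    for i in range(n_chunks):
        chunks.append(diffuse(start=i * WINDOW))  # no overlap between passes
    # Joint reconstruction from all chunks (or per-chunk scenes, then align/merge):
    return reconstruct([f for chunk in chunks for f in chunk])

def progressive_expansion(n_passes, overlap=40):
    """Option 2: each window overlaps the current scene and extends beyond it."""
    scene = reconstruct(diffuse(start=0))
    edge = WINDOW
    for _ in range(n_passes - 1):
        frames = diffuse(start=edge - overlap)  # covers old + new regions
        scene = reconstruct(frames)             # final scene = last window only
        edge = edge - overlap + WINDOW
    return scene
```

The key difference I am asking about is visible in the return values: option 1 accumulates coverage across all chunks, while option 2 keeps only what the last 81-frame window sees.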
Could you clarify which of these is closer to the actual implementation, or whether the real pipeline is different from both?
Thanks a lot.
