Model Card - Fast-FoundationStereo

Overview

Description:

The Fast-FoundationStereo model estimates the disparity of each pixel in a rectified binocular stereo pair of images. This is a transformer based foundational model which shows strong generalization running in real-time. This model is for research and evaluation purposes only.

License/Terms of Use:

The code is released using the NVIDIA Source Code License. https://github.com/NVlabs/FoundationStereo/blob/master/LICENSE

Deployment Geography:

Global

Use Case:

Researchers and developers in the field of computer vision, specifically those interested in depth estimation, are expected to use this method for tasks such as three dimensional reconstruction, object detection, object pose estimation, and scene understanding.

Release Date:

Github [02/01/2026] via [https://github.com/NVlabs/Fast-FoundationStereo]

Reference(s):

Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Model Architecture:

Architecture Type: Transformers and convolutional neural networks (CNNs).

Network Architecture: The network contains three parts: 1) EdgeNeXt student module that distills the original FoundationStereo feature extractor. 2) Set of blocks (CNNs and transformers) that performs matching with long-range dependencies. 3) Reduced set of convGRU blocks.

Number of model parameters: 14.6M.

Input:

Input Type(s): A pair of two rectified binocular stereo images

Input Format(s): Red, Green, Blue (RGB)

Input Parameters: The input parameters to this model are rectified stereo images, specifically Two-Dimensional (2D) images from a camera like Zed. In addition, the baseline is needed to convert disparity to depth.

Other Properties Related to Input: Additional input properties:

No Alpha Channel or Pre-Processing Needed. Bit: 24-bit.

Output:

Output Type(s): Disparity image

Output Format(s): 16-bit unsigned integer

Output Parameters: The output parameter of this model is the final 2D disparity map.

Other Properties Related to Output: No Alpha Channel or Post-Processing Needed Bit: 16 bit.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration :

Runtime Engine(s): Not Applicable (N/A)

Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere

[Preferred/Supported] Operating System(s): Linux

Model Version(s):

v1.0: Initial model version with full capabilities, unpruned and trained.

Training and Evaluation Datasets:

Training Dataset:

Data Modality: Image

Image Training Data Size: 1 Million to 1 Billion Images.

Link: Internal, proprietary dataset, and Stereo4D dataset

Data Collection Method by dataset [Hybrid: Synthetic, Automatic/Sensors]

Labeling Method by dataset [Hybrid: Synthetic, Automatic/Sensors]

Properties: The training dataset includes: 1) a large-scale synthetic dataset featuring 1.4 million stereo pairs with large diversity of objects and scenes and high photorealism; 2) real dataset from Stereo4D (external)

Testing Dataset:

Link: Middlebury dataset

Data Collection Method by dataset [Automatic/Sensors]

Labeling Method by dataset [Automatic/Sensors]

Properties: The dataset encompasses a wide range of scenarios, includes diverse three dimensional assets, captures stereo images under diversely randomized camera parameters, and achieves high fidelity in both rendering and spatial layouts.

Evaluation Dataset:

We evaluated the model using public leaderboards well-known to the stereo community:
Middlebury: https://vision.middlebury.edu/stereo/
This is the classic leaderboard for dense stereo matching, developed at Middlebury College.
ETH3D: https://www.eth3d.net/
This is another popular leaderboard for stereo, developed at ETH.
KITTI: https://www.cvlibs.net/datasets/kitti/eval_stereo.php

Data Collection Method by dataset [Automatic/Sensors]

** Labeling Method by dataset** [Automatic/Sensors]

Properties:

The Middlebury Stereo dataset consists of high-resolution stereo sequences with complex geometry and pixel-accurate ground-truth disparity data. The ground-truth disparities were acquired using a novel technique that employs structured lighting and infrared paint.
ETHD is a multi-view stereo benchmark / 3D reconstruction benchmark that covers a variety of indoor and outdoor scenes. Ground truth geometry was obtained using a high-precision laser scanner. A DSLR camera as well as a synchronized multi-camera rig with varying field-of-view was used to capture images.
KITTI stereo dataset is a cornerstone of autonomous driving research, developed by the Karlsruhe Institute of Technology (KIT) and the Toyota Technological Institute at Chicago (TTIC). It provides real-world, high-resolution stereo imagery paired with precise ground-truth depth data collected from a moving vehicle in diverse urban environments.

Inference:

Engine: Tensor(RT)

Test Hardware :

Zed Stereo Camera, 3090

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overview

Description:

License/Terms of Use:

Deployment Geography:

Use Case:

Release Date:

Reference(s):

Model Architecture:

Input:

Output:

Software Integration :

Model Version(s):

Training and Evaluation Datasets:

Training Dataset:

Testing Dataset:

Evaluation Dataset:

Inference:

Ethical Considerations:

FilesExpand file tree

model_card.md

Latest commit

History

model_card.md

File metadata and controls

Overview

Description:

License/Terms of Use:

Deployment Geography:

Use Case:

Release Date:

Reference(s):

Model Architecture:

Input:

Output:

Software Integration :

Model Version(s):

Training and Evaluation Datasets:

Training Dataset:

Testing Dataset:

Evaluation Dataset:

Inference:

Ethical Considerations: