Official code for "Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation"
Navigating unseen environments based on natural language instructions remains difficult for egocentric agents in Vision-and-Language Navigation (VLN). Intuitively, humans inherently ground concrete semantic knowledge within spatial layouts during indoor navigation. Although previous studies have introduced diverse environmental representations to enhance reasoning, other co-occurrence modalities are often naively concatenated with RGB features, resulting in suboptimal utilization of each modality's distinct contribution. Inspired by this, we propose a hierarchical Semantic Understanding and Spatial Awareness (SUSA) architecture to enable agents to perceive and ground environments at diverse scales. Specifically, the Textual Semantic Understanding (TSU) module supports local action prediction by generating view-level descriptions, thereby capturing fine-grained environmental semantics and narrowing the modality gap between instructions and environments. Complementarily, the Depth-enhanced Spatial Perception (DSP) module incrementally constructs a trajectory-level depth exploration map, providing the agent with a coarse-grained comprehension of the global spatial layout. Experiments demonstrate that SUSA’s hierarchical representation enrichment not only boosts the navigation performance of the baseline on discrete VLN benchmarks (REVERIE, R2R, and SOON), but also exhibits superior generalization to the continuous R2R-CE benchmark.
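As a rough, non-authoritative illustration of the idea (not the released implementation), the sketch below shows how view-level caption features (TSU) and trajectory-level depth-map features (DSP) could complement RGB features at different granularities; all module, variable, and dimension choices here are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class HierarchicalEnrichmentSketch(nn.Module):
    """Toy sketch: fuse fine-grained semantics (view captions) with RGB for
    local action scoring, and score global candidates from a coarse
    trajectory-level depth map. Names and shapes are illustrative only."""

    def __init__(self, dim=768):
        super().__init__()
        self.rgb_proj = nn.Linear(dim, dim)     # RGB view features
        self.txt_proj = nn.Linear(dim, dim)     # TSU: view-level caption features
        self.depth_proj = nn.Linear(dim, dim)   # DSP: depth exploration-map node features
        self.local_head = nn.Linear(dim, 1)     # fine-grained, view-level action score
        self.global_head = nn.Linear(dim, 1)    # coarse-grained, map-level action score

    def forward(self, rgb_feats, caption_feats, depth_node_feats):
        # rgb_feats, caption_feats: (num_views, dim); depth_node_feats: (num_nodes, dim)
        local = self.rgb_proj(rgb_feats) + self.txt_proj(caption_feats)
        local_logits = self.local_head(torch.tanh(local)).squeeze(-1)
        global_logits = self.global_head(torch.tanh(self.depth_proj(depth_node_feats))).squeeze(-1)
        return local_logits, global_logits
```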
Environment setup is listed in environment.txt.
- Install the Matterport3D simulator for R2R, REVERIE and SOON: follow the instructions here, then put the build directory on PYTHONPATH (a quick import check is shown after the export):
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
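A quick way to verify the simulator bindings are importable (assumes the build finished and PYTHONPATH is set as above; the camera settings here are placeholders):

```python
# Sanity check for the MatterSim Python bindings (run after the export above).
import MatterSim

sim = MatterSim.Simulator()
sim.setCameraResolution(640, 480)   # placeholder resolution
sim.setDepthEnabled(True)           # depth rendering is needed later for get_depth.py
print("MatterSim bindings loaded:", sim)
```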
- Install requirements:
conda create --name SUSA python=3.8.5
conda activate SUSA
pip install -r requirements.txt
- Download data from Dropbox, including processed annotations, features and pretrained models for the REVERIE, SOON, R2R and R4R datasets. Put the data in the `datasets` directory.
- Download the pretrained LXMERT model:
mkdir -p datasets/pretrained
wget https://nlp.cs.unc.edu/data/model_LXRT.pth -P datasets/pretrained
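Optionally, a quick sanity check that the download is a loadable PyTorch checkpoint (a small snippet assuming the file is a plain state dict; it is not part of the training pipeline):

```python
# Verify the downloaded LXMERT checkpoint loads and count its parameters.
import torch

ckpt = torch.load("datasets/pretrained/model_LXRT.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
tensors = [v for v in state.values() if torch.is_tensor(v)]
print(f"{len(tensors)} tensors, {sum(v.numel() for v in tensors) / 1e6:.1f}M parameters")
```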
- Download the CLIP-based RGB features and depth features (gibson and imagenet) from Baidu Netdisk (link: https://pan.baidu.com/s/1lKend8xnwuy1uxn-aIDBtw?pwd=n8gv, extraction code: n8gv).
python get_depth.py
The ground-truth depth images (undistorted_depth_images) are obtained from the Matterport3D simulator, and per-view depth features are extracted from them; the per-view extraction code follows HAMT and here. A simplified sketch of the rendering step is shown below.
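The following sketch only illustrates how the 36 discretized views of a viewpoint could be rendered with depth enabled and pooled into simple statistics; the paths, camera settings and 4-dim pooled descriptor are placeholder assumptions, not the actual get_depth.py.

```python
# Simplified illustration of per-view depth rendering (NOT the actual get_depth.py).
import math
import numpy as np
import MatterSim

def build_depth_sim(dataset_path="data/v1/scans", connectivity_path="connectivity"):
    sim = MatterSim.Simulator()
    sim.setDatasetPath(dataset_path)          # must contain undistorted_depth_images
    sim.setNavGraphPath(connectivity_path)
    sim.setCameraResolution(640, 480)
    sim.setCameraVFOV(math.radians(60))
    sim.setDiscretizedViewingAngles(True)
    sim.setDepthEnabled(True)
    sim.setBatchSize(1)
    sim.initialize()
    return sim

def pooled_depth_features(sim, scan_id, viewpoint_id):
    feats = []
    for ix in range(36):                       # 12 headings x 3 elevations
        if ix == 0:
            sim.newEpisode([scan_id], [viewpoint_id], [0], [math.radians(-30)])
        elif ix % 12 == 0:
            sim.makeAction([0], [1.0], [1.0])  # next heading, move up one elevation
        else:
            sim.makeAction([0], [1.0], [0])    # next heading, same elevation
        depth = np.array(sim.getState()[0].depth, copy=False).astype(np.float32)
        feats.append([depth.mean(), depth.std(), depth.min(), depth.max()])
    return np.asarray(feats, dtype=np.float32)  # (36, 4) toy per-view descriptor
```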
- Download the R2R view captions (generated with BLIP-2) from caption.json. A sketch of how such captions can be produced is shown below.
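The released caption.json was generated by the authors; the snippet below only sketches how comparable view captions could be produced with BLIP-2 through HuggingFace Transformers (the model name and image path are assumptions, not the authors' exact setup).

```python
# Caption a single discretized view image with BLIP-2 (illustrative recipe only).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

image = Image.open("view_rgb.png").convert("RGB")   # placeholder path to one view image
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.batch_decode(out, skip_special_tokens=True)[0].strip()
print(caption)
```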
- Download the SUSA checkpoints for the three VLN tasks from Baidu Netdisk (link: https://pan.baidu.com/s/1i5eldIr5kiodl7UUAhytaQ?pwd=yabc, extraction code: yabc).
The pretrained checkpoints for REVERIE, R2R and SOON are available here. You can also pretrain the model yourself by switching DUET's pretraining RGB features from ViT-based to CLIP-based. Pretraining combines behavior cloning with auxiliary proxy tasks:
cd pretrain_src
bash run_r2r.sh # (run_reverie.sh, run_soon.sh)
Before training, hyperparameters can be modified in the bash files.
Fine-tune the model with the pseudo-interactive demonstrator:
cd map_nav_src
bash scripts/run_r2r.sh # (run_reverie.sh, run_soon.sh)
Note: we found that replacing line 585 of agent_obj.py (the setting used for the paper results) with line 584 gives better validation results; val_unseen results for both settings are listed below.
# 584 line (better)
REVERIE: Env name: val_unseen, action_steps: 8.33, steps: 12.10, lengths: 23.45, sr: 54.79, oracle_sr: 60.47, spl: 39.46, rgs: 37.26, rgspl: 27.08
R2R: Env name: val_unseen, action_steps: 7.63, steps: 7.63, lengths: 14.59, nav_error: 3.08, oracle_error: 1.64, sr: 73.86, oracle_sr: 82.08, spl: 62.84, nDTW: 68.25, SDTW: 60.25, CLS: 67.35
# 585 line (paper reported)
REVERIE: Env name: val_unseen, action_steps: 8.11, steps: 11.61, lengths: 22.59, sr: 51.66, oracle_sr: 55.95, spl: 38.78, rgs: 34.90, rgspl: 26.44
R2R: Env name: val_unseen, action_steps: 7.23, steps: 7.23, lengths: 13.77, nav_error: 3.03, oracle_error: 1.59, sr: 73.65, oracle_sr: 81.86, spl: 63.73, nDTW: 69.85, SDTW: 61.31, CLS: 68.90
The main training logs and weights are available here.
Our reported test-set results are obtained from the official EvalAI leaderboards:
R2R: https://eval.ai/web/challenges/challenge-page/97/submission
REVERIE: https://eval.ai/web/challenges/challenge-page/606/overview
SOON: https://eval.ai/web/challenges/challenge-page/1275/overview
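For reference, the R2R leaderboard expects a JSON list of trajectories keyed by instruction id; below is a minimal formatting sketch (field names follow the public R2R submission format, while `results` is a hypothetical in-memory structure you would adapt to the repo's actual output).

```python
# Minimal sketch of packaging R2R predictions for an EvalAI submission.
# `results` is a hypothetical dict {instr_id: [(viewpoint_id, heading, elevation), ...]}.
import json

def write_r2r_submission(results, out_path="submit_r2r_test.json"):
    submission = [
        {"instr_id": instr_id,
         "trajectory": [[vp, heading, elevation] for vp, heading, elevation in traj]}
        for instr_id, traj in results.items()
    ]
    with open(out_path, "w") as f:
        json.dump(submission, f)

# Example with a single dummy trajectory:
write_r2r_submission({"1234_0": [("vp_start", 0.0, 0.0), ("vp_goal", 1.57, 0.0)]})
```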
- Panoramic trajectory visualization is provided by Speaker-Follower.
- Top-down maps for Matterport3D are available in NRNS.
- Instructions for extracting image features from Matterport3D scenes can be found in VLN-HAMT.
@article{zhang2024agent,
title={Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation},
author={Zhang, Xuesong and Xu, Yunbo and Li, Jia and Liu, Ruonan and Hu, Zhenzhen},
journal={arXiv preprint arXiv:2412.06465},
year={2025}
}

Our code is based on VLN-DUET, partially references Paonogen for captioning, and follows HAMT for extracting view features. Thanks for their great work!