TACMT

Overview

This is the official implementation of TACMT: Text-Aware Cross-Modal Transformer for Visual Grounding on High-Resolution SAR Images, which constructs a new benchmark dataset and develops a multimodal deep learning model for the SAR visual grounding (SARVG) task.

Introduction

We have built a new benchmark for SARVG based on images from different SAR sensors to fully promote SARVG research. Building on this, we propose a novel text-aware cross-modal Transformer (TACMT) that follows DETR's architecture. We develop a cross-modal encoder to enhance the visual features associated with the textual descriptions. Next, a text-aware query selection module is devised to select relevant context features as the decoder queries. To retrieve objects from various scenes, we further design a cross-scale fusion module that fuses features from different levels for accurate target localization. A minimal sketch of the query-selection idea appears below.
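To make the query-selection idea concrete, here is a minimal, self-contained PyTorch sketch. This is not the repository's actual code; the function name, the use of cosine similarity, and all shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def text_aware_query_selection(visual_feats, text_feat, num_queries):
    # visual_feats: (B, N, C) flattened encoder tokens
    # text_feat:    (B, C)    pooled text embedding
    # Score every visual token against the text embedding, then keep
    # the top-k tokens as initial decoder queries (illustrative only).
    sim = F.cosine_similarity(visual_feats, text_feat.unsqueeze(1), dim=-1)  # (B, N)
    topk = sim.topk(num_queries, dim=1).indices                              # (B, K)
    batch_idx = torch.arange(visual_feats.size(0)).unsqueeze(-1)             # (B, 1)
    return visual_feats[batch_idx, topk]                                     # (B, K, C)

feats = torch.randn(2, 196, 256)  # e.g. a 14x14 feature map, flattened
text = torch.randn(2, 256)
queries = text_aware_query_selection(feats, text, num_queries=10)
print(queries.shape)  # torch.Size([2, 10, 256])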

Installation

1. Clone the repository:

git clone https://github.com/CAESAR-Radi/TACMT.git

2. Install PyTorch 1.9.1 and torchvision 0.10.1:

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

3. Install the other dependencies:

pip install pytorch-pretrained-bert
pip install rasterio
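As an optional sanity check (not part of the official instructions), you can confirm the installed versions and CUDA availability:

python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"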

Dataset and Weights

You can download the dataset and checkpoint weights in either of two ways:
  • from Google Drive
  • from Baidu Netdisk

Data Usage

This dataset was created from data provided by the following sources:

  1. Capella SAR Data
  2. GF-3 SAR Data
    • Data source: "GF-3 SAR Dataset" by China Center for Resources Satellite Data and Application.
  3. Iceye SAR Data

Modifications

  • This dataset has been processed and annotated for this project.
  • Modifications include:
    • Image Cropping: Images have been cropped to focus on relevant areas of interest.
    • Image Stretching: Grayscale pixel values have been stretched to the range 0-255 for better visualization (see the sketch after this list).
    • Data Annotation: Relevant features in the images have been annotated for visual grounding tasks.
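A minimal sketch of the grayscale stretching step, assuming numpy and rasterio; the 2nd/98th percentile bounds and the file name are illustrative assumptions, not the authors' exact recipe:

import numpy as np
import rasterio

with rasterio.open("scene.tif") as src:   # hypothetical input file
    img = src.read(1).astype(np.float32)  # first band as float

# Clip to robust percentile bounds, then rescale to 0-255.
lo, hi = np.percentile(img, (2, 98))
stretched = np.clip((img - lo) / (hi - lo), 0, 1) * 255
stretched = stretched.astype(np.uint8)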

Training

The following is an example of distributed model training on two GPUs.

python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --config configs/SARVG_R101.py --world_size 2 --checkpoint_best --enable_batch_accum --batch_size 10 --freeze_epochs 10
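If only one GPU is available, the same launcher should work with a single process (an assumption; the flags simply mirror the command above):

python -m torch.distributed.launch --nproc_per_node=1 --use_env train.py --config configs/SARVG_R101.py --world_size 1 --checkpoint_best --batch_size 10 --freeze_epochs 10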

Evaluation

Run the following script to evaluate the trained model with a single GPU.

python inference.py --config configs/SARVG_R101.py --checkpoint ./checkpoint_best_acc.pth --test_split val
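To evaluate on the test split instead, changing the split flag should suffice (assuming the script accepts test as a split name):

python inference.py --config configs/SARVG_R101.py --checkpoint ./checkpoint_best_acc.pth --test_split test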

Acknowledgement

Part of our code is based on the previous works VLTVG and RT-DETR. Thanks to the authors.
