
Commit 85bc85d: update Oriented GroundingDINO configs
1 parent 80a802e

File tree

5 files changed: 935 additions & 0 deletions

projects/GroundingDINO/README.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# [Oriented GroundingDINO] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

> - [An Open and Comprehensive Pipeline for Unified Object Grounding and Detection](https://arxiv.org/abs/2401.02361)
> - [Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection](https://arxiv.org/abs/2303.05499)

## Quick Start

```shell
bash projects/GroundingDINO/run.sh
```

## Dataset Preparation

- Step 1: download the NWPU dataset and organize it as follows:

```text
├── NWPU-RESISC45
    └── NWPU-RESISC45
        ├── CLASS 1
        ├── CLASS 2
        └── ...
```
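
A quick sanity check of the layout (a minimal sketch, assuming the dataset sits under `data/` as in the commands below; this helper is not part of the repo):

```python
from pathlib import Path

# hypothetical layout check for the tree shown above
root = Path('data/NWPU-RESISC45/NWPU-RESISC45')
classes = sorted(p for p in root.iterdir() if p.is_dir())
print(f'{len(classes)} class folders')  # NWPU-RESISC45 has 45 scene classes
for cls in classes[:3]:
    print(cls.name, sum(1 for _ in cls.glob('*.jpg')))  # 700 images per class
```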

- Step 2: prepare the OVD dataset:

```shell
python projects/GroundingDINO/tools/prepare_ovdg_dataset.py \
    --data_dir data/NWPU-RESISC45/NWPU-RESISC45 \
    --save_path data/NWPU-RESISC45/annotations/nwpu45_unlabeled_2.json
```
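
The exact schema of `nwpu45_unlabeled_2.json` is defined by `prepare_ovdg_dataset.py`; assuming it is COCO-style (an assumption, not verified here), a quick inspection looks like:

```python
import json

# hypothetical inspection; 'images'/'categories' are assumed COCO-style keys
with open('data/NWPU-RESISC45/annotations/nwpu45_unlabeled_2.json') as f:
    ann = json.load(f)
print(list(ann))                              # top-level keys
print(len(ann.get('images', [])), 'images')
print(len(ann.get('categories', [])), 'categories')
```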

## Training

> **Note**: we follow a training pipeline similar to CastDet's.

- Step 1: train the base detector:

```shell
exp1="grounding_dino_swin-t_visdrone_base-set_adamw"
python tools/train.py \
    projects/GroundingDINO/configs/$exp1.py
```
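
By default, the mmengine-based `tools/train.py` writes checkpoints to `work_dirs/$exp1/` (the work directory is derived from the config file name); Step 2 below loads `work_dirs/$exp1/iter_20000.pth` from there.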

- **[Optional]** Step 2: pseudo-labeling:

```shell
# 2.1. pseudo-labeling
exp2="grounding_dino_swin-t_visdrone_base-set_adamw_nwpu45_pseudo_labeling"
python tools/test.py \
    projects/GroundingDINO/configs/$exp2.py \
    work_dirs/$exp1/iter_20000.pth

# 2.2. merge predictions (keep the top-1 prediction per image)
python projects/GroundingDINO/tools/merge_ovdg_preds.py \
    --ann_path data/NWPU-RESISC45/annotations/nwpu45_unlabeled_2.json \
    --pred_path work_dirs/$exp2/nwpu45_pseudo_labeling_2.bbox.json \
    --save_path work_dirs/$exp2/nwpu45_unlabeled_with_gdino_pseudos_swin-t_adamw_top1.json \
    --topk 1

# move the merged annotations to the data folder
cp work_dirs/$exp2/nwpu45_unlabeled_with_gdino_pseudos_swin-t_adamw_top1.json data/NWPU-RESISC45/annotations/nwpu45_unlabeled_with_gdino_pseudos_swin-t_adamw_top1.json
```
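
Conceptually, the merge step promotes the detector's most confident detections on the unlabeled set to pseudo-labels. A simplified sketch of the top-k idea (the authoritative logic is in `merge_ovdg_preds.py`; the COCO-style field names here are assumptions):

```python
import json
from collections import defaultdict

def merge_topk(ann_path, pred_path, save_path, topk=1):
    """Hypothetical re-implementation for illustration only."""
    with open(ann_path) as f:
        ann = json.load(f)      # unlabeled images, COCO-style (assumed)
    with open(pred_path) as f:
        preds = json.load(f)    # detector outputs from tools/test.py

    # group predictions by image and keep the top-k by score
    by_img = defaultdict(list)
    for p in preds:
        by_img[p['image_id']].append(p)
    pseudo = []
    for ps in by_img.values():
        pseudo.extend(sorted(ps, key=lambda p: p['score'], reverse=True)[:topk])

    # attach the pseudo-labels as annotations and save
    for i, p in enumerate(pseudo):
        p['id'] = i
    ann['annotations'] = pseudo
    with open(save_path, 'w') as f:
        json.dump(ann, f)
```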

- **[Optional]** Step 3: post-training:

```shell
exp3="grounding_dino_swin-t_visdrone_base-set_adamw_nwpu45"
exp3_="grounding_dino_swin-t_visdrone_base-set_adamw_nwpu45_"
python tools/train.py \
    projects/GroundingDINO/configs/$exp3.py \
    --work-dir work_dirs/$exp3_
```

## Evaluation

```shell
python tools/test.py \
    projects/GroundingDINO/configs/$exp3.py \
    work_dirs/$exp3_/iter_10000.pth \
    --work-dir work_dirs/$exp3_/dior_test
```

## Acknowledgement

Thanks to the wonderful open-source projects [MMDetection](https://github.com/open-mmlab/mmdetection), [MMRotate](https://github.com/open-mmlab/mmrotate), [RHINO](https://github.com/SIAnalytics/RHINO), and [GroundingDINO](https://github.com/IDEA-Research/GroundingDINO)!

## Citation

```bibtex
% Oriented GroundingDINO (this repo)
@misc{li2024exploitingunlabeleddatamultiple,
    title={Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation},
    author={Yan Li and Weiwei Guo and Xue Yang and Ning Liao and Shaofeng Zhang and Yi Yu and Wenxian Yu and Junchi Yan},
    year={2024},
    eprint={2411.02057},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.02057},
}

% GroundingDINO (horizontal detection)
@article{liu2023grounding,
    title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
    author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
    journal={arXiv preprint arXiv:2303.05499},
    year={2023}
}
```
Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
_base_ = [
    'mmrotate::_base_/datasets/visdronezsd.py',
    'mmrotate::_base_/default_runtime.py'
]
angle_version = 'le90'
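# le90: mmrotate's long-edge angle convention, theta in [-pi/2, pi/2)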
lang_model_name = 'bert-base-uncased'
batch_size = 8
num_workers = 2

custom_imports = dict(
    imports=['projects.GroundingDINO.groundingdino'], allow_failed_imports=False)
# pretrained = 'https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth'  # noqa
pretrained = 'checkpoints/swin_tiny_patch4_window7_224.pth'

model = dict(
    type='RotatedGroundingDINO',
    num_queries=900,
    with_box_refine=True,
    as_two_stage=True,
    data_preprocessor=dict(
        type='mmdet.DetDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        bgr_to_rgb=True,
        pad_mask=False,
        boxtype2tensor=False),
    language_model=dict(
        type='mmdet.BertModel',
        name=lang_model_name,
        pad_to_max=False,
        use_sub_sentence_represent=True,
        special_tokens_list=['[CLS]', '[SEP]', '.', '?'],
        add_pooling_layer=False,
    ),
    backbone=dict(
        type='mmdet.SwinTransformer',
        embed_dims=96,
        depths=[2, 2, 6, 2],
        num_heads=[3, 6, 12, 24],
        window_size=7,
        mlp_ratio=4,
        qkv_bias=True,
        qk_scale=None,
        drop_rate=0.,
        attn_drop_rate=0.,
        drop_path_rate=0.2,
        patch_norm=True,
        out_indices=(1, 2, 3),
        with_cp=True,
        convert_weights=True,
        frozen_stages=-1,
        init_cfg=dict(type='Pretrained', checkpoint=pretrained)),
    neck=dict(
        type='mmdet.ChannelMapper',
        in_channels=[192, 384, 768],
        kernel_size=1,
        out_channels=256,
        act_cfg=None,
        bias=True,
        norm_cfg=dict(type='GN', num_groups=32),
        num_outs=4),
    encoder=dict(
        num_layers=6,
        num_cp=6,
        # visual layer config
        layer_cfg=dict(
            self_attn_cfg=dict(embed_dims=256, num_levels=4, dropout=0.0),
            ffn_cfg=dict(
                embed_dims=256, feedforward_channels=2048, ffn_drop=0.0)),
        # text layer config
        text_layer_cfg=dict(
            self_attn_cfg=dict(num_heads=4, embed_dims=256, dropout=0.0),
            ffn_cfg=dict(
                embed_dims=256, feedforward_channels=1024, ffn_drop=0.0)),
        # fusion layer config
        fusion_layer_cfg=dict(
            v_dim=256,
            l_dim=256,
            embed_dim=1024,
            num_heads=4,
            init_values=1e-4),
    ),
    decoder=dict(
        num_layers=6,
        return_intermediate=True,
        layer_cfg=dict(
            # query self-attention layer
            self_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),
            # cross-attention layer, query to text
            cross_attn_text_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),  ###
            # cross-attention layer, query to image
            cross_attn_cfg=dict(embed_dims=256, num_heads=8, dropout=0.0),  ###
            ffn_cfg=dict(
                embed_dims=256, feedforward_channels=2048, ffn_drop=0.0)),
        post_norm_cfg=None),
    positional_encoding=dict(
        num_feats=128, normalize=True, offset=0.0, temperature=20),
    bbox_head=dict(
        type='RotatedGroundingDINOHead',  ###
        num_classes=20,
        sync_cls_avg_factor=True,
        contrastive_cfg=dict(max_text_len=256, log_scale='auto', bias=True),
        loss_cls=dict(
            type='mmdet.FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),  # 2.0 in DeformDETR
        loss_bbox=dict(type='mmdet.L1Loss', loss_weight=5.0),
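        # rotated-box regression: GDLoss with loss_type='kld' models each
        # rotated box as a 2-D Gaussian and penalizes their KL divergence,
        # standing in for the GIoU loss of horizontal Grounding DINO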
        loss_iou=dict(
            type='GDLoss',
            loss_type='kld',
            fun='log1p',
            tau=1,
            sqrt=False,
            loss_weight=2.0)),
    dn_cfg=dict(  # TODO: Move to model.train_cfg ?
        label_noise_scale=0.5,
        box_noise_scale=1.0,  # 0.4 for DN-DETR
        group_cfg=dict(dynamic=True, num_groups=None,
                       num_dn_queries=100)),  # TODO: half num_dn_queries
    # training and testing settings
    train_cfg=dict(
        assigner=dict(
            type='mmdet.HungarianAssigner',
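            # matching costs mirror the training losses: focal classification
            # cost, L1 cost on (cx, cy, w, h, angle) boxes, and a Gaussian
            # KLD cost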
            match_costs=[
                dict(type='mmdet.BinaryFocalLossCost', weight=2.0),
                dict(type='RBoxL1Cost', weight=5.0, box_format='xywha'),
                dict(
                    type='GDCost',
                    loss_type='kld',
                    fun='log1p',
                    tau=1,
                    sqrt=False,
                    weight=2.0)
            ])),
    test_cfg=dict(max_per_img=300))

# dataset settings
train_pipeline = [
    dict(type='mmdet.LoadImageFromFile', backend_args=_base_.backend_args),
    dict(type='mmdet.LoadAnnotations', with_bbox=True, box_type='qbox'),
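    # annotations are loaded as quadrilaterals ('qbox'), then converted to
    # rotated boxes ('rbox') below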
    dict(type='ConvertBoxType', box_type_mapping=dict(gt_bboxes='rbox')),
    dict(type='mmdet.Resize', scale=(800, 800), keep_ratio=True),
    dict(type='mmdet.FilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
    dict(
        type='mmdet.RandomFlip',
        prob=0.75,
        direction=['horizontal', 'vertical', 'diagonal']),
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor', 'flip', 'flip_direction', 'text',
                   'custom_entities'))
]

val_pipeline = [
    dict(type='mmdet.LoadImageFromFile', backend_args=_base_.backend_args),
    dict(type='mmdet.Resize', scale=(800, 800), keep_ratio=True),
    dict(type='mmdet.LoadAnnotations', with_bbox=True, box_type='qbox'),
    dict(type='ConvertBoxType', box_type_mapping=dict(gt_bboxes='rbox')),
    dict(
        type='mmdet.PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor', 'text', 'custom_entities'))
]

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
    sampler=dict(type='DefaultSampler'),
    dataset=dict(
        pipeline=train_pipeline,
        return_classes=True))

val_dataloader = dict(
    batch_size=batch_size,
    num_workers=num_workers,
    dataset=dict(
        pipeline=val_pipeline,
        return_classes=True))

# test_dataloader = val_dataloader
test_dataloader = dict(
    batch_size=2,
    num_workers=num_workers,
    dataset=dict(
        ann_file='ImageSets/Main/test.txt',
        # data_prefix=dict(img_path='JPEGImages-trainval'),
        pipeline=val_pipeline,
        return_classes=True))

# training schedule for 20k
train_cfg = dict(
    type='IterBasedTrainLoop', max_iters=20000, val_interval=4000)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

# learning rate policy
param_scheduler = [
    dict(
        type='LinearLR', start_factor=1.0 / 3, by_epoch=False, begin=0,
        end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=20000,
        by_epoch=False,
        milestones=[16000, 18000],
        gamma=0.1)
]

# optimizer
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(
        type='AdamW',
        lr=0.0001,  # 0.0002 for DeformDETR
        weight_decay=0.0001),
    clip_grad=dict(max_norm=0.1, norm_type=2),
    paramwise_cfg=dict(custom_keys={
        'absolute_pos_embed': dict(decay_mult=0.),
        'backbone': dict(lr_mult=0.1)
    }))

default_hooks = dict(
    logger=dict(type='LoggerHook', interval=20),
    checkpoint=dict(by_epoch=False, interval=2000, max_keep_ckpts=1))
log_processor = dict(by_epoch=False)

_base_.visualizer.vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='TensorboardVisBackend')
]
