A question about the strategy of DropPos #7

Hi author, thank you for contributing such interesting and solid work.

I have a question (maybe a trivial one): the reconstruction targets of DropPos are the actual positions of the masked PEs, right? But why did you choose to first mask a subset of patches? (I can understand that this is necessary for MAE, because its target is RGB pixels.) Is this because reconstructing the masked PEs alone would be too simple a pretext task for pre-training ViT? **(as the paper claims: trivial solution)**

If so, directly feeding all patches into the encoder would produce suboptimal results, since all patches are visible to the encoder and it can infer the masked PEs from all possible positions. In contrast, if we only allow it to "see" a subset of the patches, it has to infer the masked PEs _only from the visible patches._
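To make sure I am describing the two-stage masking correctly, here is a minimal sketch of how I understand it. This is not the authors' code: the function name `sample_droppos_inputs`, both ratio parameters, and the example values are my own assumptions, used only for illustration.

```python
import random


def sample_droppos_inputs(num_patches, patch_mask_ratio, pos_mask_ratio, seed=0):
    """Hypothetical helper (not from the DropPos repo).

    Returns (visible, dropped):
      visible: indices of patches that are fed to the encoder
      dropped: indices of visible patches whose PE is removed; the
               prediction target for each of them is its own index
    """
    rng = random.Random(seed)

    # Stage 1: mask a subset of patches, so the encoder sees only the rest.
    num_visible = int(num_patches * (1 - patch_mask_ratio))
    visible = sorted(rng.sample(range(num_patches), num_visible))

    # Stage 2: among the visible patches only, drop the PE of a subset.
    num_dropped = int(num_visible * pos_mask_ratio)
    dropped = sorted(rng.sample(visible, num_dropped))

    return visible, dropped


# Illustrative values only: 14 x 14 = 196 patches, both ratios set to 0.75.
visible, dropped = sample_droppos_inputs(196, 0.75, 0.75)
print(len(visible), len(dropped))
```

With `patch_mask_ratio=0.0`, every patch is visible, which is the "feed all patches" case I am asking about; with a nonzero ratio, the positions have to be inferred from the visible patches alone.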

Is my understanding correct? I hope you can provide some insight. Thanks a lot!

