Hi author, thank you for contributing such interesting and solid work.
I have a question (maybe a trivial one): the reconstruction targets of DropPos are the actual positions of the patches whose position embeddings (PE) were masked, right? Then why do you first mask a subset of patches? (I can understand that this is necessary for MAE, since its target is RGB pixels.) Is it because reconstructing the masked PE alone is too simple a pretext task for pre-training a ViT (what the paper calls a trivial solution)?
If so, feeding all patches into the encoder would produce suboptimal results: since every patch is visible to the encoder, it can infer the masked PE by considering all possible positions. In contrast, if we only allow it to "see" a subset of the patches, it has to infer the masked PE from the visible patches alone.
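To make the question concrete, here is a minimal NumPy sketch of the two-stage sampling as I understand it. This is only my reading, not the repository's actual code: the helper name `sample_droppos_masks`, the 196-patch grid, and the 0.75 ratios are assumptions chosen for illustration.

```python
import numpy as np

def sample_droppos_masks(num_patches, patch_mask_ratio, pos_mask_ratio, rng):
    """Hypothetical helper: sample visible patches, then drop some of their PEs."""
    # Stage 1: only a random subset of patches is fed to the encoder.
    num_visible = int(round(num_patches * (1 - patch_mask_ratio)))
    visible_idx = rng.permutation(num_patches)[:num_visible]

    # Stage 2: among the visible patches, drop the PE of a random subset.
    num_dropped = int(round(num_visible * pos_mask_ratio))
    pe_dropped = np.zeros(num_visible, dtype=bool)
    pe_dropped[rng.permutation(num_visible)[:num_dropped]] = True

    # Targets: the true position index of every visible patch without a PE.
    targets = visible_idx[pe_dropped]

    # Positions not already claimed by a visible patch that kept its PE.
    num_candidates = num_patches - int((~pe_dropped).sum())
    return visible_idx, pe_dropped, targets, num_candidates

rng = np.random.default_rng(0)

# With patch masking: few PE-less patches, many candidate positions.
_, _, targets, num_candidates = sample_droppos_masks(196, 0.75, 0.75, rng)
print(len(targets), num_candidates)

# Without patch masking: as many candidate positions as PE-less patches,
# so the task reduces to assigning patches to a fixed set of empty slots.
_, _, targets_all, num_candidates_all = sample_droppos_masks(196, 0.0, 0.75, rng)
print(len(targets_all), num_candidates_all)
```

In the second call, the number of candidate positions equals the number of patches without a PE, which is the case I suspect makes the task easier.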
Is my understanding correct? I hope you can share some insight. Thanks a lot!