A question about the strategy of DropPos #7

@go-ahead-maker

Hi author, thank you for contributing such interesting and solid work.

I have a question (maybe a trivial one): the reconstruction targets of DropPos are the actual positions of the patches whose PEs were masked, right? But why do you first mask a subset of the patches? (I can understand that this is necessary for MAE, since its target is RGB pixels.) Is it because reconstructing the masked PEs alone would be too simple a pretext task for pre-training a ViT (what the paper calls a trivial solution)?

If so, feeding all patches directly into the encoder would produce suboptimal results: since every patch is visible to the encoder, it can infer the masked PEs by reasoning over all possible positions. In contrast, if we only allow it to "see" a subset of the patches, it has to infer the masked PEs from the visible patches alone.

Is my understanding correct? I hope you can provide some insight. Thanks a lot!
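
To make the intuition in the question concrete, here is a minimal sketch of the two-stage sampling it describes: first hide a subset of patches, then drop the positional embedding (PE) of some of the remaining visible patches. It is not taken from the DropPos codebase. The function name `sample_droppos_style`, its parameters, and the ratios used below are assumptions chosen only for illustration. The sketch counts how many grid positions are still "free" for the PE-dropped patches; when no patches are hidden, that count equals the number of PE-dropped patches, so the task reduces to assigning a fixed set of slots.

```python
import random


def sample_droppos_style(num_patches, patch_mask_ratio, pos_drop_ratio, seed=0):
    """Hypothetical helper (illustration only, not the authors' implementation)."""
    rng = random.Random(seed)

    # Stage 1: keep a random subset of patches visible; the rest never reach the encoder.
    num_visible = int(num_patches * (1 - patch_mask_ratio))
    visible = sorted(rng.sample(range(num_patches), num_visible))

    # Stage 2: among the visible patches, drop the PE of a random subset.
    # The prediction target for each such patch is its true grid index.
    num_dropped = int(num_visible * pos_drop_ratio)
    dropped = sorted(rng.sample(visible, num_dropped))

    # Positions already occupied by visible patches that kept their PE.
    known = set(visible) - set(dropped)
    # Every other grid position is still a candidate for a PE-dropped patch.
    num_candidates = num_patches - len(known)

    return {
        "visible": visible,
        "dropped": dropped,
        "num_candidates": num_candidates,
    }


# Illustrative ratios (assumed, not quoted from the paper).
with_masking = sample_droppos_style(196, patch_mask_ratio=0.75, pos_drop_ratio=0.75)
no_masking = sample_droppos_style(196, patch_mask_ratio=0.0, pos_drop_ratio=0.75)

print(len(with_masking["dropped"]), with_masking["num_candidates"])
print(len(no_masking["dropped"]), no_masking["num_candidates"])
```

Under these assumed settings, the no-masking case leaves exactly as many free positions as there are PE-dropped patches, while the masked case leaves many more free positions than patches to place. Whether this is the authors' actual motivation is for them to confirm.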
