Computer Science > Computer Vision and Pattern Recognition

arXiv:2206.00790 (cs)

[Submitted on 1 Jun 2022 (v1), last revised 20 Jun 2022 (this version, v2)]

Title: Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction

Authors: Jun Chen, Ming Hu, Boyang Li, Mohamed Elhoseiny


Abstract: Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among self-supervised methods, generative approaches such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7$\times$7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image. Extensive experiments show that LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, outperforming MAE by 0.5%. After finetuning the pretrained LoMaR on 384$\times$384 images, it reaches 85.4% top-1 accuracy, surpassing MAE by 0.6%. On MS COCO, LoMaR outperforms MAE by 0.5 $\text{AP}^\text{box}$ on object detection and 0.5 $\text{AP}^\text{mask}$ on instance segmentation. LoMaR is especially computation-efficient when pretraining on high-resolution images, e.g., it is 3.1$\times$ faster than MAE with 0.2% higher classification accuracy when pretraining on 448$\times$448 images. This local masked reconstruction learning mechanism can be easily integrated into any other generative self-supervised learning approach. Our code is publicly available at this https URL.
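To make the idea of reconstructing within a 7$\times$7 window concrete, the following is a minimal sketch rather than the authors' implementation. It samples one 7$\times$7 window of patch coordinates from a patch grid and splits it into masked and visible patches. The 14$\times$14 grid (a 224$\times$224 image with 16$\times$16 patches), the 0.8 mask ratio, and all function names are assumptions made for illustration; the abstract does not state them. The final pairwise count only illustrates why full attention over 49 tokens is cheaper than over 196 tokens; it is not a model of either method's measured cost.

```python
import random

def sample_local_window(grid_h, grid_w, window=7, rng=random):
    """Return the (row, col) patch coordinates of one random window x window block."""
    top = rng.randrange(grid_h - window + 1)
    left = rng.randrange(grid_w - window + 1)
    return [(top + r, left + c) for r in range(window) for c in range(window)]

def split_masked_visible(patches, mask_ratio, rng=random):
    """Randomly split the window's patches into (masked, visible) lists."""
    n_mask = int(len(patches) * mask_ratio)
    shuffled = patches[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled[:n_mask], shuffled[n_mask:]

def attention_pairs(num_tokens):
    """Number of query-key pairs in one full self-attention layer."""
    return num_tokens * num_tokens

rng = random.Random(0)             # fixed seed so the sketch is repeatable

# Assumed setup: 14x14 patch grid, 7x7 window, hypothetical mask ratio of 0.8.
window_patches = sample_local_window(14, 14, window=7, rng=rng)
masked, visible = split_masked_visible(window_patches, mask_ratio=0.8, rng=rng)

print(len(window_patches), len(masked), len(visible))

# Rough illustration only: token pairs for the whole grid vs. one window.
print(attention_pairs(14 * 14), attention_pairs(7 * 7))
```

In this sketch the model would see only the visible patches of the window and be trained to reconstruct the masked ones, so every attention computation is confined to the 49 window tokens instead of the whole image.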

Comments:	Add code
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2206.00790 [cs.CV]
	(or arXiv:2206.00790v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.00790

Submission history

From: Jun Chen
[v1] Wed, 1 Jun 2022 22:46:34 UTC (17,392 KB)
[v2] Mon, 20 Jun 2022 13:28:04 UTC (17,392 KB)
