Computer Science > Computer Vision and Pattern Recognition

arXiv:2206.07706 (cs)

[Submitted on 15 Jun 2022 (v1), last revised 25 Apr 2023 (this version, v2)]

Title: Masked Frequency Modeling for Self-Supervised Visual Pre-Training

Authors: Jiahao Xie, Wei Li, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, Chen Change Loy

Abstract: We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of the frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that, because of the heavy spatial redundancy of images, predicting masked components in the frequency domain is better suited to revealing underlying image patterns than predicting masked patches in the spatial domain. Our findings suggest that, with the right configuration of the mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among low-frequency components are useful for learning good representations. For the first time, MFM demonstrates that, for both ViTs and CNNs, a simple non-Siamese framework can learn meaningful representations without using any of the following: (i) extra data, (ii) an extra model, or (iii) a mask token. Experimental results on image classification and semantic segmentation, as well as on several robustness benchmarks, show that MFM achieves competitive performance and stronger robustness compared with recent masked image modeling approaches. Furthermore, we comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach.
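The mask-and-predict idea described in the abstract can be illustrated with a minimal NumPy sketch for a single-channel image. It assumes a circular mask of radius `radius` on the centered (fftshift-ed) spectrum that keeps either the low or the high frequencies, and a mean-squared error computed only on the removed frequencies. The helper names `circular_mask`, `mask_frequencies`, and `masked_frequency_loss` are hypothetical, and this is not the authors' implementation or their exact loss.

```python
import numpy as np


def circular_mask(h: int, w: int, radius: float, keep_low: bool = True) -> np.ndarray:
    """Boolean mask over a centered (fftshift-ed) h x w spectrum.

    True marks frequencies that are KEPT in the corrupted input.
    Hypothetical helper: the circular shape is an illustrative assumption.
    """
    yy, xx = np.ogrid[:h, :w]
    # After fftshift, the zero-frequency (DC) term sits at (h // 2, w // 2).
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low = dist <= radius
    return low if keep_low else ~low


def mask_frequencies(img: np.ndarray, radius: float, keep_low: bool = True):
    """Drop part of the spectrum of a 2-D image; return (corrupted_image, keep_mask)."""
    if img.ndim != 2:
        raise ValueError("this sketch expects a single-channel H x W image")
    spec = np.fft.fftshift(np.fft.fft2(img))
    keep = circular_mask(img.shape[0], img.shape[1], radius, keep_low)
    # Zero out the masked frequencies, undo the shift, and go back to pixels.
    corrupted = np.fft.ifft2(np.fft.ifftshift(spec * keep))
    # The mask is symmetric about DC, so the result is real up to rounding error.
    return corrupted.real, keep


def masked_frequency_loss(pred: np.ndarray, target: np.ndarray, keep: np.ndarray) -> float:
    """Mean squared spectral error on the frequencies that were removed.

    Assumption: a plain MSE on complex coefficients; the paper's loss may differ.
    """
    pred_spec = np.fft.fftshift(np.fft.fft2(pred))
    target_spec = np.fft.fftshift(np.fft.fft2(target))
    dropped = ~keep
    diff = pred_spec[dropped] - target_spec[dropped]
    return float(np.mean(np.abs(diff) ** 2))


# Usage: corrupt a random image with complementary low-pass and high-pass masks,
# then score the low-pass view itself as a (trivial) prediction of the missing frequencies.
rng = np.random.default_rng(0)
img = rng.random((32, 32))
low_only, keep = mask_frequencies(img, radius=8, keep_low=True)
high_only, _ = mask_frequencies(img, radius=8, keep_low=False)
print(np.allclose(low_only + high_only, img))  # True: the two masks are complementary
print(masked_frequency_loss(low_only, img, keep))
```

Because the low-pass and high-pass masks are complements, the two corrupted views sum back to the original image. In an actual pre-training setup, `pred` would be the output of a network (ViT or CNN) fed the corrupted image, rather than the corrupted image itself.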

Comments:	ICLR 2023. Project page: this https URL Code: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2206.07706 [cs.CV]
	(or arXiv:2206.07706v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.07706

Submission history

From: Jiahao Xie
[v1] Wed, 15 Jun 2022 17:58:30 UTC (7,951 KB)
[v2] Tue, 25 Apr 2023 17:29:15 UTC (8,699 KB)