Abstract
Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning. A seminal work, BEiT, casts MIM as a classification task over a visual vocabulary, tokenizing the continuous visual signals into discrete vision tokens with a pre-learned dVAE. Although this is a feasible solution, the improper discretization hinders further improvements of image pre-training. Since image discretization has no ground-truth answers, we believe that a masked patch should not be assigned a unique token id even if a better “tokenizer” can be obtained. In this work, we introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs the MIM proxy task towards eased and refined multi-choice training objectives. Specifically, the multi-choice supervision for masked image patches is formed by the soft probability vectors over the discrete token ids, which are predicted by an off-the-shelf image “tokenizer” and further refined by high-level inter-patch perceptions, resorting to the observation that similar patches should share their choices. Extensive experiments on classification, segmentation, and detection tasks demonstrate the superiority of our method: e.g., the pre-trained ViT-B achieves 84.1% top-1 fine-tuning accuracy on ImageNet-1K classification, 49.2% \(\text {AP}^{b}\) and 44.0% \(\text {AP}^{m}\) for object detection and instance segmentation on COCO, and 50.8% mIoU on ADE20K semantic segmentation, outperforming competitive counterparts. The code is available at https://github.com/lixiaotong97/mc-BEiT.
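To make the idea concrete, below is a minimal PyTorch sketch of how such multi-choice soft targets could be formed and used in a MIM loss. This is an illustration under stated assumptions, not the authors' implementation: the function names `multi_choice_targets` and `mim_loss`, the `mix_weight` and `temperature` hyper-parameters, and the use of cosine-similarity patch affinities are our own hypothetical choices; see the paper and the repository above for the actual method.

```python
import torch
import torch.nn.functional as F

def multi_choice_targets(tokenizer_logits, patch_features,
                         mix_weight=0.5, temperature=1.0):
    """Sketch of multi-choice soft targets (hypothetical, not the paper's exact code).

    tokenizer_logits: (B, N, V) logits over the visual vocabulary from an
                      off-the-shelf tokenizer (e.g., a pre-learned dVAE).
    patch_features:   (B, N, D) patch embeddings used to estimate inter-patch
                      similarity (an assumed source of "inter-patch perceptions").
    """
    # Soft probabilities over token ids instead of a single hard token id.
    p = F.softmax(tokenizer_logits / temperature, dim=-1)                   # (B, N, V)

    # Inter-patch perception: similar patches should share their choices,
    # so propagate each patch's distribution from its similar neighbours.
    feat = F.normalize(patch_features, dim=-1)
    affinity = F.softmax(feat @ feat.transpose(1, 2) / temperature, dim=-1)  # (B, N, N)
    p_refined = affinity @ p                                                 # (B, N, V)

    # Ease the objective by mixing the tokenizer's own prediction with the
    # semantically refined one (mix_weight is a hypothetical hyper-parameter).
    return mix_weight * p + (1.0 - mix_weight) * p_refined


def mim_loss(pred_logits, targets, mask):
    """Soft cross-entropy computed on masked patches only.

    pred_logits: (B, N, V) model predictions; mask: (B, N) bool, True = masked.
    """
    mask = mask.float()
    log_q = F.log_softmax(pred_logits, dim=-1)
    loss = -(targets * log_q).sum(-1)                                        # (B, N)
    return (loss * mask).sum() / mask.sum().clamp(min=1)


if __name__ == "__main__":
    # Toy shapes: 2 images, 196 patches, vocabulary of 8192 visual tokens.
    B, N, V, D = 2, 196, 8192, 768
    targets = multi_choice_targets(torch.randn(B, N, V), torch.randn(B, N, D))
    print(mim_loss(torch.randn(B, N, V), targets, torch.rand(B, N) < 0.4))
```

The key design point the sketch captures is that the target is a probability vector rather than a one-hot token id, so visually ambiguous patches can distribute their supervision across several plausible "choices".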
References
Bao, H., Dong, L., Piao, S., Wei, F.: BEiT: BERT pre-training of image transformers. In: ICLR (2022)
Bardes, A., Ponce, J., LeCun, Y.: VICReg: variance-invariance-covariance regularization for self-supervised learning. In: ICLR (2022)
Brown, T., et al.: Language models are few-shot learners. In: NIPS, vol. 33, pp. 1877–1901 (2020)
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: NIPS, pp. 9912–9924 (2020)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML, pp. 1597–1607 (2020)
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-supervised models are strong semi-supervised learners. In: NIPS, pp. 22243–22255 (2020)
Chen, X., et al.: Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026 (2022)
Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020)
Chen, X., He, K.: Exploring simple siamese representation learning. In: CVPR, pp. 15750–15758 (2021)
Chen, X., Xie, S., He, K.: An empirical study of training self-supervised vision transformers. In: ICCV, pp. 9640–9649 (2021)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186 (2019)
Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV, pp. 1422–1430 (2015)
Dong, X., et al.: PeCo: perceptual codebook for BERT pre-training of vision transformers. arXiv preprint arXiv:2111.12710 (2021)
Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. In: ICLR (2021)
Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: CVPR, pp. 12873–12883 (2021)
Fang, Y., Yang, S., Wang, S., Ge, Y., Shan, Y., Wang, X.: Unleashing vanilla vision transformer with masked image modeling for object detection. arXiv preprint arXiv:2204.02964 (2022)
Ge, Y., et al.: MILES: visual BERT pre-training with injected language semantics for video-text retrieval. arXiv preprint arXiv:2204.12408 (2022)
Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)
Grill, J.B., et al.: Bootstrap your own latent-a new approach to self-supervised learning. In: NIPS, pp. 21271–21284 (2020)
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377 (2021)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR, pp. 9729–9738 (2020)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV, pp. 2980–2988 (2017)
Henaff, O.: Data-efficient image recognition with contrastive predictive coding. In: ICML, pp. 4182–4192 (2020)
Krasin, I., et al.: OpenImages: a public dataset for large-scale multi-label and multi-class image classification (2017)
Lample, G., Conneau, A.: Cross-lingual language model pretraining. In: NIPS (2019)
Larsson, G., Maire, M., Shakhnarovich, G.: Colorization as a proxy task for visual understanding. In: CVPR, pp. 6874–6883 (2017)
Li, Y., Xie, S., Chen, X., Dollar, P., He, K., Girshick, R.: Benchmarking detection transfer learning with vision transformers. arXiv preprint arXiv:2111.11429 (2021)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, X., et al.: Self-supervised learning: generative or contrastive. TKDE (2021)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV (2021)
Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
Ramesh, A., et al.: Zero-shot text-to-image generation. In: ICML, pp. 8821–8831 (2021)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers and distillation through attention. In: ICML, pp. 10347–10357 (2021)
Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133 (2021)
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 432–448. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_26
Xie, Z., et al.: SimMIM: a simple framework for masked image modeling. arXiv preprint arXiv:2111.09886 (2021)
Yi, K., et al.: Masked image modeling with denoising contrast. arXiv preprint arXiv:2205.09616 (2022)
Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: self-supervised learning via redundancy reduction. In: ICML, pp. 12310–12320 (2021)
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: CVPR, pp. 5122–5130 (2017)
Zhou, J., et al.: Image BERT pre-training with online tokenizer. In: ICLR (2022)
Acknowledgement
This work was supported by the National Natural Science Foundation of China under Grant 62088102, and in part by the PKU-NTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Li, X., Ge, Y., Yi, K., Hu, Z., Shan, Y., Duan, LY. (2022). mc-BEiT: Multi-choice Discretization for Image BERT Pre-training. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13690. Springer, Cham. https://doi.org/10.1007/978-3-031-20056-4_14