TokenCut: Segmenting Objects in Images and Videos With Self-Supervised Transformer and Normalized Cut

Published: 18 August 2023

Abstract

In this paper, we describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos. With this approach, the patches that compose an image or video frame are organized into a fully connected graph, in which the edge between each pair of patches is labeled with a similarity score based on the features learned by the transformer. Detection and segmentation of salient objects can then be formulated as a graph-cut problem and solved using the classical Normalized Cut algorithm. Despite its simplicity, this approach achieves state-of-the-art results on several common image and video detection and segmentation tasks. For unsupervised object discovery, it outperforms competing approaches by margins of 6.1%, 5.7%, and 2.6% on the VOC07, VOC12, and COCO20K datasets. For unsupervised saliency detection in images, it improves the Intersection over Union (IoU) score by 4.4%, 5.6%, and 5.2% on the ECSSD, DUTS, and DUT-OMRON datasets. It also achieves competitive results for unsupervised video object segmentation on the DAVIS, SegTV2, and FBMS datasets.
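
To make the graph-cut step concrete, here is a minimal NumPy/SciPy sketch of bipartitioning patch features with a Normalized Cut. It is an illustration, not the authors' reference implementation: the feature source (e.g., a self-supervised ViT such as DINO), the similarity threshold `tau`, and the foreground-selection heuristic are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import eigh

def tokencut_bipartition(features, tau=0.2, eps=1e-5):
    """Split N patches into foreground/background with a Normalized Cut.

    features: (N, D) array of per-patch embeddings, e.g. taken from a
    self-supervised vision transformer. `tau` and `eps` are illustrative
    values for this sketch.
    """
    # Fully connected graph: edge weights are cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T

    # Binarize weak edges; the small eps keeps the graph connected.
    W = np.where(sim > tau, 1.0, eps)
    D = np.diag(W.sum(axis=1))

    # Relaxed Normalized Cut (Shi & Malik, 2000): the second-smallest
    # generalized eigenvector of (D - W) x = lambda * D x.
    _, eigvecs = eigh(D - W, D)
    fiedler = eigvecs[:, 1]

    # Bipartition at the eigenvector's mean; take the side holding the
    # largest-magnitude entry as the salient (foreground) group.
    fg = fiedler > fiedler.mean()
    if not fg[np.argmax(np.abs(fiedler))]:
        fg = ~fg
    return fg  # (N,) boolean mask over patches
```

Reshaping the returned mask to the patch grid and upsampling it to the image resolution gives a coarse object mask, which post-processing such as a CRF or bilateral solver can then refine.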


Cited By

  • (2024) Expander Hierarchies for Normalized Cuts on Graphs. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1016–1027. DOI: 10.1145/3637528.3671978. Online publication date: 25 Aug 2024.


Information

Publisher: IEEE Computer Society, United States

Qualifiers: Research-article
