TokenCut: Segmenting Objects in Images and Videos With Self-Supervised Transformer and Normalized Cut

Published: 18 August 2023

Abstract

In this paper, we describe a graph-based algorithm that uses the features obtained by a self-supervised transformer to detect and segment salient objects in images and videos. With this approach, the patches that compose an image or video frame are organized into a fully connected graph, in which the edge between each pair of patches is labeled with a similarity score based on the features learned by the transformer. Detection and segmentation of salient objects can then be formulated as a graph-cut problem and solved using the classical Normalized Cut algorithm. Despite its simplicity, this approach achieves state-of-the-art results on several common image and video detection and segmentation tasks. For unsupervised object discovery, it outperforms competing approaches by margins of 6.1%, 5.7%, and 2.6% on the VOC07, VOC12, and COCO20K datasets. For unsupervised saliency detection in images, it improves the Intersection over Union (IoU) score by 4.4%, 5.6%, and 5.2% on the ECSSD, DUTS, and DUT-OMRON datasets. It also achieves competitive results for unsupervised video object segmentation on the DAVIS, SegTV2, and FBMS datasets.
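
To make the graph-cut step concrete, here is a minimal NumPy/SciPy sketch of bipartitioning patch features with a Normalized Cut. It is an illustration, not the authors' reference implementation: the feature source (e.g., a self-supervised ViT such as DINO), the similarity threshold `tau`, and the foreground-selection heuristic are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import eigh

def tokencut_bipartition(features, tau=0.2, eps=1e-5):
    """Split N patches into foreground/background with a Normalized Cut.

    features: (N, D) array of per-patch embeddings, e.g. taken from a
    self-supervised vision transformer. `tau` and `eps` are illustrative
    values for this sketch.
    """
    # Fully connected graph: edge weights are cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T

    # Binarize weak edges; the small eps keeps the graph connected.
    W = np.where(sim > tau, 1.0, eps)
    D = np.diag(W.sum(axis=1))

    # Relaxed Normalized Cut (Shi & Malik, 2000): the second-smallest
    # generalized eigenvector of (D - W) x = lambda * D x.
    _, eigvecs = eigh(D - W, D)
    fiedler = eigvecs[:, 1]

    # Bipartition at the eigenvector's mean; take the side holding the
    # largest-magnitude entry as the salient (foreground) group.
    fg = fiedler > fiedler.mean()
    if not fg[np.argmax(np.abs(fiedler))]:
        fg = ~fg
    return fg  # (N,) boolean mask over patches
```

Reshaping the returned mask to the patch grid and upsampling it to the image resolution gives a coarse object mask, which post-processing such as a CRF or bilateral solver can then refine.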


Cited By

  • (2024) Expander Hierarchies for Normalized Cuts on Graphs. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1016–1027. DOI: 10.1145/3637528.3671978. Online publication date: 25 Aug 2024.


Information

Publisher: IEEE Computer Society, United States

Qualifiers: Research-article
