DOI: 10.1145/3512527.3531359

3D-Augmented Contrastive Knowledge Distillation for Image-based Object Pose Estimation

Published: 27 June 2022

Abstract

Image-based object pose estimation is attractive because, in real applications, the 3D shape of an object is often unavailable or not as easy to capture as a photo. Although this is an advantage to some extent, leaving shape information unexploited in a 3D vision learning problem is a "flaw in the jade". In this paper, we address the problem in a new and reasonable setting: 3D shape is exploited during training, while testing remains purely image-based. We improve the performance of image-based methods for category-agnostic object pose estimation by exploiting 3D knowledge learned by a multi-modal method. Specifically, we propose a novel contrastive knowledge distillation framework that effectively transfers 3D-augmented image representations from a multi-modal model to an image-based model. We integrate contrastive learning into the two-stage training procedure of knowledge distillation, which provides a way to combine these two approaches for cross-modal tasks. Experimentally, we report state-of-the-art results that outperform existing category-agnostic image-based methods by a large margin (up to +5% on the ObjectNet3D dataset), demonstrating the effectiveness of our method.
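The abstract does not state the loss used for the contrastive distillation step, so the following is only a minimal sketch of one common choice, an InfoNCE-style objective, and should not be read as the authors' formulation. It assumes that each image yields one student feature (image-only model) and one teacher feature (multi-modal model), that the pair from the same image is the positive, and that the other teacher features in the batch serve as negatives. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def contrastive_distillation_loss(student_feats, teacher_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed objective, not taken from the paper).

    student_feats[i] and teacher_feats[i] come from the same image and form
    the positive pair; teacher_feats[j] with j != i act as negatives.
    """
    if not student_feats or len(student_feats) != len(teacher_feats):
        raise ValueError("need two non-empty batches of equal size")

    students = [l2_normalize(v) for v in student_feats]
    teachers = [l2_normalize(v) for v in teacher_feats]

    total = 0.0
    for i, s_i in enumerate(students):
        # Temperature-scaled cosine similarity to every teacher feature.
        logits = [
            sum(a * b for a, b in zip(s_i, t_j)) / temperature
            for t_j in teachers
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching teacher as the target class.
        total += log_denominator - logits[i]
    return total / len(students)


# Toy data: three 2-D "teacher" features and student features close to them.
teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Rotating the student batch breaks the pairing with the teacher batch.
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = contrastive_distillation_loss(aligned, teacher)
loss_shuffled = contrastive_distillation_loss(shuffled, teacher)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is lower when each student feature sits next to its own teacher feature than when the pairing is broken, which is the behavior a distillation objective of this kind relies on. In a two-stage procedure such as the one the abstract mentions, a term like this would typically be combined with a task loss for pose prediction; how the paper weights or schedules the two is not stated here.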

Supplementary Material

MP4 File (ICMR22-fp043.mp4)
Presentation video for the paper "3D-Augmented Contrastive Knowledge Distillation for Image-based Object Pose Estimation".

Cited By

  • (2023) MultiCAD: Contrastive Representation Learning for Multi-modal 3D Computer-Aided Design Models. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 1766-1776. DOI: 10.1145/3583780.3614982. Online publication date: 21-Oct-2023
  • (2023) PanoSwin: a Pano-style Swin Transformer for Panorama Understanding. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17755-17764. DOI: 10.1109/CVPR52729.2023.01703. Online publication date: Jun-2023
  • (2023) Overcoming the TradeOff between Accuracy and Plausibility in 3D Hand Shape Reconstruction. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 544-553. DOI: 10.1109/CVPR52729.2023.00060. Online publication date: Jun-2023

Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022
714 pages
ISBN:9781450392389
DOI:10.1145/3512527
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. category-agnostic object pose estimation
  2. cross-modal contrastive learning
  3. generalized knowledge distillation

Qualifiers

  • Research-article

Funding Sources

  • the National Key Research and Development Program of China

Conference

ICMR '22

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%

Article Metrics

  • Downloads (Last 12 months): 31
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 12 Nov 2024
