Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1007/978-3-030-58545-7_4guideproceedingsArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
Article

AUTO3D: Novel View Synthesis Through Unsupervisely Learned Variational Viewpoint and Global 3D Representation

Published: 23 August 2020 Publication History

Abstract

This paper targets on learning-based novel view synthesis from a single or limited 2D images without the pose supervision. In the viewer-centered coordinates, we construct an end-to-end trainable conditional variational framework to disentangle the unsupervisely learned relative-pose/rotation and implicit global 3D representation (shape, texture and the origin of viewer-centered coordinates, etc.). The global appearance of the 3D object is given by several appearance-describing images taken from any number of viewpoints. Our spatial correlation module extracts a global 3D representation from the appearance-describing images in a permutation invariant manner. Our system can achieve implicitly 3D understanding without explicitly 3D reconstruction. With an unsupervisely learned viewer-centered relative-pose/rotation code, the decoder can hallucinate the novel view continuously by sampling the relative-pose in a prior distribution. In various applications, we demonstrate that our model can achieve comparable or even better results than pose/3D model-supervised learning-based novel view synthesis (NVS) methods with any number of input views.

References

[1]
Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. arXiv preprint arXiv:1612.00410 (2016)
[2]
Barron JT and Malik J Shape, illumination, and reflectance from shading IEEE Trans. Pattern Anal. Mach. Intell. 2014 37 8 1670-1687
[3]
Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 2, pp. 60–65. IEEE (2005)
[4]
Cao, J., Hu, Y., Yu, B., He, R., Sun, Z.: Load balanced GANs for multi-view face image synthesis. arXiv preprint arXiv:1802.07447 (2018)
[5]
Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
[6]
Che, T., et al.: Deep verifier networks: Verification of deep discriminative models with deep generative models. arXiv preprint arXiv:1911.07421 (2019)
[7]
Chen, X., Song, J., Hilliges, O.: Monocular neural image based rendering with continuous view control. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4090–4100 (2019)
[8]
Choy CB, Xu D, Gwak JY, Chen K, and Savarese S Leibe B, Matas J, Sebe N, and Welling M 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction Computer Vision – ECCV 2016 2016 Cham Springer 628-644
[9]
Chung, F.R., Graham, F.C.: Spectral Graph Theory. No. 92. American Mathematical Soc. (1997)
[10]
Dosovitskiy A, Springenberg JT, Tatarchenko M, and Brox T Learning to generate chairs, tables and cars with convolutional networks IEEE Trans. Pattern Anal. Mach. Intell. 2016 39 4 692-705
[11]
Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1538–1546 (2015)
[12]
Du Q, Gunzburger M, Lehoucq RB, and Zhou K Analysis and approximation of nonlocal diffusion problems with volume constraints SIAM Rev. 2012 54 4 667-696
[13]
Feng, Y., Wu, F., Shao, X., Wang, Y., Zhou, X.: Joint 3D face reconstruction and dense alignment with position map regression network. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 534–551 (2018)
[14]
Flynn, J., Neulander, I., Philbin, J., Snavely, N.: DeepStereo: learning to predict new views from the world’s imagery. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5515–5524 (2016)
[15]
Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference (2002)
[16]
Garg R, B.G. VK, Carneiro G, and Reid I Leibe B, Matas J, Sebe N, and Welling M Unsupervised CNN for single view depth estimation: geometry to the rescue Computer Vision – ECCV 2016 2016 Cham Springer 740-756
[17]
Gilboa G and Osher S Nonlocal linear image regularization and supervised segmentation Multisc. Model. Simul. 2007 6 2 595-630
[18]
Goodfellow, I.: Nips 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)
[19]
Goodfellow I, Bengio Y, and Courville A Deep Learning 2016 Cambridge MIT Press
[20]
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in neural information processing systems, pp. 2672–2680 (2014)
[21]
Han, Y., et al.: Wasserstein loss-based deep object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 998–999 (2020)
[22]
He, G., Liu, X., Fan, F., You, J.: Classification-aware semi-supervised domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 964–965 (2020)
[23]
He, G., Liu, X., Fan, F., You, J.: Image2Audio: facilitating semi-supervised audio emotion recognition with facial expression image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 912–913 (2020)
[24]
Henderson, P., Ferrari, V.: Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. Int. J. Comput. Vis. 1–20 (2019)
[25]
Higgins, I., et al.: beta-vae: Learning basic visual concepts with a constrained variational framework. In: ICLR, vol. 2, no. 5, p. 6 (2017)
[26]
Huang, H., He, R., Sun, Z., Tan, T., et al.: IntroVAE: introspective variational autoencoders for photographic image synthesis. In: Advances in Neural Information Processing Systems, pp. 52–63 (2018)
[27]
Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510 (2017)
[28]
Insafutdinov, E., Dosovitskiy, A.: Unsupervised learning of shape and pose with differentiable point clouds. In: Advances in Neural Information Processing Systems, pp. 2802–2812 (2018)
[29]
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017)
[30]
Ji, D., Kwon, J., McFarland, M., Savarese, S.: Deep view morphing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2155–2163 (2017)
[31]
Kanazawa, A., Tulsiani, S., Efros, A.A., Malik, J.: Learning category-specific mesh reconstruction from image collections. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 371–386 (2018)
[32]
Kholgade N, Simon T, Efros A, and Sheikh Y 3D object manipulation in a single photograph using stock 3D models ACM Trans. Graph. (TOG) 2014 33 4 1-12
[33]
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[34]
Kingma, D.P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., Welling, M.: Improved variational inference with inverse autoregressive flow. In: Advances in Neural Information Processing Systems, pp. 4743–4751 (2016)
[35]
Koestinger, M., Wohlhart, P., Roth, P.M., Bischof, H.: Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 2144–2151. IEEE (2011)
[36]
Larsen, A.B.L., Sønderby, S.K., Larochelle, H., Winther, O.: Autoencoding beyond pixels using a learned similarity metric. In: ICML (2016)
[37]
Lin, C.H., Kong, C., Lucey, S.: Learning efficient point cloud generation for dense 3D object reconstruction. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
[38]
Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: Advances in Neural Information Processing Systems, pp. 700–708 (2017)
[39]
Liu, R., et al.: An intriguing failing of convolutional neural networks and the CoordConv solution. In: Advances in Neural Information Processing Systems, pp. 9605–9616 (2018)
[40]
Liu, X.: Disentanglement for discriminative visual recognition. arXiv preprint arXiv:2006.07810 (2020)
[41]
Liu, X., B.V.K., K., Yang, C., Tang, Q., You, J.: Dependency-aware attention control for unconstrained face recognition with image sets. In: European Conference on Computer Vision (2018)
[42]
Liu, X., Fan, F., Kong, L., Diao, Z., Xie, W., Lu, J., You, J.: Unimodal regularized neuron stick-breaking for ordinal classification. Neurocomputing (2020)
[43]
Liu X, Ge Y, Yang C, and Jia P Adaptive metric learning with deep neural networks for video-based facial expression recognition J. Electron. Imaging 2018 27 1 013022
[44]
Liu, X., Guo, Z., Jia, J., Kumar, B.: Dependency-aware attention control for imageset-based face recognition. In: IEEE Transactions on Information Forensics and Security (2019)
[45]
Liu, X., Guo, Z., Li, S., Kong, L., Jia, P., You, J., Kumar, B.V.: Permutation-invariant feature restructuring for correlation-aware image set-based recognition. In: The IEEE International Conference on Computer Vision (ICCV), October 2019
[46]
Liu, X., et al.: Importance-aware semantic segmentation in self-driving with discrete wasserstein training. In: AAAI, pp. 11629–11636 (2020)
[47]
Liu, X., Ji, W., You, J., Fakhri, G.E., Woo, J.: Severity-aware semantic segmentation with reinforced wasserstein training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12566–12575 (2020)
[48]
Liu X, Kong L, Diao Z, and Jia P Line-scan system for continuous hand authentication Opt. Eng. 2017 56 3 033106
[49]
Liu, X., Kumar, B.V., Ge, Y., Yang, C., You, J., Jia, P.: Normalized face image generation with perceptron generative adversarial networks. In: 2018 IEEE 4th International Conference on Identity, Security, and Behavior Analysis (ISBA), pp. 1–8 (2018)
[50]
Liu X, Kumar BV, Jia P, and You J Hard negative generation for identity-disentangled facial expression recognition Pattern Recogn. 2019 88 1-12
[51]
Liu, X., Li, S., Kong, L., Xie, W., Jia, P., You, J., Kumar, B.: Feature-level Frankenstein: eliminating variations for discriminative recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 637–646 (2019)
[52]
Liu, X., Vijaya Kumar, B., You, J., Jia, P.: Adaptive deep metric learning for identity-aware facial expression recognition. In: CVPR Workshops, pp. 20–29 (2017)
[53]
Liu, X., et al.: Conservative wasserstein training for pose estimation. In: The IEEE International Conference on Computer Vision (ICCV), October 2019
[54]
Liu, X., et al.: Data augmentation via latent space interpolation for image classification. In: 24th International Conference on Pattern Recognition (ICPR), pp. 728–733 (2018)
[55]
Liu, X., Zou, Y., Song, Y., Yang, C., You, J., K Vijaya Kumar, B.: Ordinal regression with neuron stick-breaking for medical diagnosis. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
[56]
Mathieu, M.F., Zhao, J.J., Zhao, J., Ramesh, A., Sprechmann, P., LeCun, Y.: Disentangling factors of variation in deep representation using adversarial training. In: Advances in Neural Information Processing Systems, pp. 5040–5048 (2016)
[57]
Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., Yang, Y.L.: HoloGAN: unsupervised learning of 3D representations from natural images. arXiv preprint arXiv:1904.01326 (2019)
[58]
Nguyen-Phuoc, T.H., Li, C., Balaban, S., Yang, Y.: RenderNet: a deep convolutional network for differentiable rendering from 3D shapes. In: Advances in Neural Information Processing Systems, pp. 7891–7901 (2018)
[59]
Olszewski, K., Tulyakov, S., Woodford, O., Li, H., Luo, L.: Transformable bottleneck networks. arXiv preprint arXiv:1904.06458 (2019)
[60]
Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3D view synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3500–3509 (2017)
[61]
Paszke, A., et al.: Automatic differentiation in PyTorch (2017)
[62]
Pontes JK, Kong C, Sridharan S, Lucey S, Eriksson A, and Fookes C Jawahar CV, Li H, Mori G, and Schindler K Image2Mesh: a learning framework for single image 3D reconstruction Computer Vision – ACCV 2018 2019 Cham Springer 365-381
[63]
Rajeswar, S., Mannan, F., Golemo, F., Vazquez, D., Nowrouzezahrai, D., Courville, A.: Pix2Scene: learning implicit 3D representations from images (2018)
[64]
Rematas K, Nguyen CH, Ritschel T, Fritz M, and Tuytelaars T Novel views of objects from a single image IEEE Trans. Pattern Anal. Mach. Intell. 2016 39 8 1576-1590
[65]
Saxe, A.M., et al.: On the information bottleneck theory of deep learning (2018)
[66]
Shin, D., Fowlkes, C.C., Hoiem, D.: Pixels, voxels, and views: A study of shape representations for single view 3D object shape prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3061–3069 (2018)
[67]
Sturm P and Triggs B Buxton B and Cipolla R A factorization based algorithm for multi-image projective structure and motion Computer Vision — ECCV 1996 1996 Heidelberg Springer 709-720
[68]
Sun, S.H., Huh, M., Liao, Y.H., Zhang, N., Lim, J.J.: Multi-view to novel view: synthesizing novel views with self-learned confidence. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 155–171 (2018)
[69]
Szabó, A., Favaro, P.: Unsupervised 3D shape learning from image collections in the wild. arXiv preprint arXiv:1811.10519 (2018)
[70]
Tao, Y., Sun, Q., Du, Q., Liu, W.: Nonlocal neural networks, nonlocal diffusion and nonlocal modeling. arXiv preprint arXiv:1806.00681 (2018)
[71]
Tatarchenko M, Dosovitskiy A, and Brox T Leibe B, Matas J, Sebe N, and Welling M Multi-view 3D models from single images with a convolutional network Computer Vision – ECCV 2016 2016 Cham Springer 322-337
[72]
Tian, Y., Peng, X., Zhao, L., Zhang, S., Metaxas, D.N.: CR-GAN: learning complete representations for multi-view generation. arXiv preprint arXiv:1806.11191 (2018)
[73]
Tran, L., Yin, X., Liu, X.: Disentangled representation learning GAN for pose-invariant face recognition. In: CVPR, vol. 3, p. 7 (2017)
[74]
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
[75]
Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
[76]
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP, et al. Image quality assessment: from error visibility to structural similarity IEEE Trans. Image Process. 2004 13 4 600-612
[77]
Wu, J., Zhang, C., Zhang, X., Zhang, Z., Freeman, W.T., Tenenbaum, J.B.: Learning shape priors for single-view 3D completion and reconstruction. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 646–662 (2018)
[78]
Xie J, Girshick R, and Farhadi A Leibe B, Matas J, Sebe N, and Welling M Deep3D: fully automatic 2D-to-3D video conversion with deep convolutional neural networks Computer Vision – ECCV 2016 2016 Cham Springer 842-857
[79]
Xu, X., Chen, Y.C., Jia, J.: View independent generative adversarial network for novel view synthesis. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 7791–7800 (2019)
[80]
Yang, C., Liu, X., Tang, Q., Kuo, C.C.J.: Towards disentangled representations for human retargeting by multi-view learning. arXiv preprint arXiv:1912.06265 (2019)
[81]
Yang, C., Song, Y., Liu, X., Tang, Q., Kuo, C.C.J.: Image inpainting using block-wise procedural training with annealed adversarial counterpart. arXiv preprint arXiv:1803.08943 (2018)
[82]
Zakharov, E., Shysheya, A., Burkov, E., Lempitsky, V.: Few-shot adversarial learning of realistic neural talking head models. arXiv preprint arXiv:1905.08233 (2019)
[83]
Zhang, X., Zhang, Z., Zhang, C., Tenenbaum, J., Freeman, B., Wu, J.: Learning to reconstruct shapes from unseen classes. In: Advances in Neural Information Processing Systems, pp. 2257–2268 (2018)
[84]
Zhou, B., Andonian, A., Torralba, A.: Temporal relational reasoning in videos. In ECCV (2018)
[85]
Zhou T, Tulsiani S, Sun W, Malik J, and Efros AA Leibe B, Matas J, Sebe N, and Welling M View synthesis by appearance flow Computer Vision – ECCV 2016 2016 Cham Springer 286-301
[86]
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232 (2017)
[87]
Zhu, X., Lei, Z., Liu, X., Shi, H., Li, S.Z.: Face alignment across large poses: a 3D solution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 146–155 (2016)

Cited By

View all
  • (2023)Novel View Synthesis from a Single Unposed Image via Unsupervised LearningACM Transactions on Multimedia Computing, Communications, and Applications10.1145/358746719:6(1-23)Online publication date: 31-May-2023
  • (2022)Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous TranslatorMedical Image Computing and Computer Assisted Intervention – MICCAI 202210.1007/978-3-031-16446-0_36(376-386)Online publication date: 18-Sep-2022

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings
Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX
Aug 2020
860 pages
ISBN:978-3-030-58544-0
DOI:10.1007/978-3-030-58545-7

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 23 August 2020

Author Tags

  1. Unsupervised novel view synthesis
  2. Viewer-centered coordinates
  3. Variational viewpoints
  4. Global 3D representation

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 30 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Novel View Synthesis from a Single Unposed Image via Unsupervised LearningACM Transactions on Multimedia Computing, Communications, and Applications10.1145/358746719:6(1-23)Online publication date: 31-May-2023
  • (2022)Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous TranslatorMedical Image Computing and Computer Assisted Intervention – MICCAI 202210.1007/978-3-031-16446-0_36(376-386)Online publication date: 18-Sep-2022

View Options

View options

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media