Abstract
Audio-visual learning, which aims to exploit the relationship between the audio and visual modalities, has drawn considerable attention since deep learning began to be applied successfully. Researchers tend to leverage these two modalities either to improve the performance of tasks previously tackled with a single modality or to address new, challenging problems. In this paper, we provide a comprehensive survey of recent developments in audio-visual learning. We divide current audio-visual learning tasks into four subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are further discussed. Finally, we summarize the commonly used datasets and challenges.
Acknowledgments
This work was supported by National Key Research and Development Program of China (No. 2016YFB1001001), Beijing Natural Science Foundation (No. JQ18017), and National Natural Science Foundation of China (No. 61976002).
Additional information
Recommended by Associate Editor Nazim Mir-Nasiri
Colored figures are available in the online version at https://link.springer.com/journal/11633
Author information
Hao Zhu received the B. Eng. degree from Anhui Polytechnic University, China in 2018. He is currently a master student in the Research Center of Cognitive Computing, Anhui University, China, and a joint master student in the Center for Research on Intelligent Perception and Computing (CRIPAC), Institute of Automation, Chinese Academy of Sciences (CAS), China.
His research interests include deepfake generation, computer vision, and pattern recognition.
Man-Di Luo received the B.Eng. degree in automation engineering from the University of Electronic Science and Technology of China, China in 2017, and the B.Sc. and M.Sc. degrees in electronic engineering from Katholieke Universiteit Leuven, Belgium in 2017 and 2018, respectively. She is currently a Ph. D. candidate in computer application technology at the University of Chinese Academy of Sciences, China.
Her research interests include biometrics, pattern recognition, and computer vision.
Rui Wang received the B.Sc. degree in computer science and technology from Hefei University, China in 2018. He is currently a master student in the Department of Computer Science and Technology, Anhui University, China. He is also an intern at the Center for Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences, China.
His research interests include style transfer, computer vision, and pattern recognition.
Ai-Hua Zheng received the B. Eng. and M. Sc. degrees in computer science and technology from Anhui University, China in 2006 and 2008, respectively, and received the Ph. D. degree in computer science from the University of Greenwich, UK in 2012. She visited the University of Stirling, UK from June to September 2013 and Texas State University, USA from September 2019 to August 2020. She is currently an associate professor and a Ph. D. supervisor at Anhui University, China.
Her research interests include vision-based artificial intelligence and pattern recognition, especially person/vehicle re-identification, audio-visual computing, and motion detection and tracking.
Ran He received the B.Eng. and M.Sc. degrees in computer science from Dalian University of Technology, China in 2001 and 2004, respectively, and the Ph. D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences (CASIA), China in 2009. In September 2010, he joined the National Laboratory of Pattern Recognition (NLPR), CASIA, where he is currently a full professor. He is a Fellow of the International Association for Pattern Recognition (IAPR). He serves as an editorial board member of Pattern Recognition and on the program committees of several conferences. His work won the IEEE SPS Young Author Best Paper Award (2020), the IAPR ICPR Best Scientific Paper Award (2020), the IAPR/IEEE ICB Honorable Mention Paper Award (2019), an IEEE ISM Best Paper Candidate (2016), and the IAPR ACPR Best Poster Award (2013).
His research interests include information theoretic learning, pattern recognition, and computer vision.
Rights and permissions
Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhu, H., Luo, MD., Wang, R. et al. Deep Audio-visual Learning: A Survey. Int. J. Autom. Comput. 18, 351–376 (2021). https://doi.org/10.1007/s11633-021-1293-0