DOI: 10.1145/3536221.3556628

Research Article

Improved Word-level Lipreading with Temporal Shrinkage Network and NetVLAD

Published: 07 November 2022

Abstract

In most recent word-level lipreading architectures, the temporal feature extraction module employs a Multi-Scale Temporal Convolutional Network (MS-TCN). In our experiments, we observed that MS-TCN struggles to handle the noisy information that image sequences may contain. To address this problem, we propose a lipreading architecture based on a temporal shrinkage network and NetVLAD. We first design a Temporal Shrinkage Unit, following the Residual Shrinkage Network, and use it to replace the temporal convolution unit. The resulting network, named the Multi-Scale Temporal Shrinkage Network (MS-TSN), can focus more on relevant information. After MS-TSN suppresses noisy frames, NetVLAD aggregates the local features into a global representation. Compared with Global Average Pooling, NetVLAD can extract key features by clustering. Experiments on the Lipreading in the Wild (LRW) dataset show that our architecture achieves an accuracy of 89.41%, setting a new state of the art in word-level lipreading. In addition, we build a new Mandarin Chinese lipreading dataset, MCLR-100, and verify the proposed architecture on it.
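
The Temporal Shrinkage Unit is easiest to grasp in code. Below is a minimal PyTorch sketch of a 1-D residual block with channel-wise soft thresholding, in the style of the Deep Residual Shrinkage Network (Zhao et al., 2019) from which the unit is derived; the class name, layer sizes, and gating network here are our own illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class TemporalShrinkageUnit(nn.Module):
    """Sketch: 1-D residual block with learned channel-wise soft thresholding."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
        )
        # Gating net: maps the per-channel mean of |features| to a factor in
        # (0, 1); the threshold is that mean scaled by the factor, so it
        # adapts to each sample's activation magnitudes.
        self.gate = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        feats = self.conv(x)
        abs_mean = feats.abs().mean(dim=2)                    # (B, C)
        tau = (abs_mean * self.gate(abs_mean)).unsqueeze(2)   # (B, C, 1)
        # Soft thresholding: activations with magnitude below tau are shrunk
        # to zero, which is how noisy frames get suppressed.
        shrunk = torch.sign(feats) * torch.relu(feats.abs() - tau)
        return x + shrunk
```

Replacing the plain temporal convolution unit in each MS-TCN branch with a block like this gives the multi-scale variant (MS-TSN) described above.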
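
The NetVLAD pooling that replaces Global Average Pooling can be sketched the same way. This follows the published NetVLAD formulation (Arandjelovic et al., 2016): each frame descriptor is soft-assigned to K learned cluster centres and the residuals to those centres are accumulated; the initialization and the plain linear assignment layer are simplifications for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetVLAD(nn.Module):
    """Sketch: pools (B, T, D) frame features into a (B, K*D) global descriptor."""

    def __init__(self, num_clusters: int, dim: int):
        super().__init__()
        self.centroids = nn.Parameter(0.1 * torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim), one descriptor per video frame
        soft = F.softmax(self.assign(x), dim=2)               # (B, T, K)
        # Residual of every descriptor to every cluster centre: (B, T, K, D)
        res = x.unsqueeze(2) - self.centroids.unsqueeze(0).unsqueeze(0)
        vlad = (soft.unsqueeze(3) * res).sum(dim=1)           # (B, K, D)
        vlad = F.normalize(vlad, p=2, dim=2)                  # intra-normalization
        return F.normalize(vlad.flatten(1), p=2, dim=1)       # (B, K*D)
```

Where Global Average Pooling collapses all T frames into a single mean vector, this layer keeps K cluster-specific residual sums, which is the sense in which NetVLAD "extracts key features by clustering".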


Cited By

• (2023) Multi-Temporal Lip-Audio Memory for Visual Speech Recognition. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. DOI: 10.1109/ICASSP49357.2023.10094693. Online publication date: 4-Jun-2023.


Published In

ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction
November 2022, 830 pages
ISBN: 9781450393904
DOI: 10.1145/3536221

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. NetVLAD
2. Temporal Shrinkage Network
3. Word-level lipreading


Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions, 42%
