AdaStreamLite: Environment-adaptive Streaming Speech Recognition on Mobile Devices

Published: 12 January 2024

Abstract

Streaming speech recognition aims to transcribe speech to text in a streaming manner, providing real-time speech interaction for smartphone users. However, it is not trivial to develop a high-performance streaming speech recognition system running purely on mobile platforms, due to the complex real-world acoustic environments and the limited computational resources of smartphones. Most existing solutions lack generalization to unseen environments and have difficulty working with streaming speech. In this paper, we design AdaStreamLite, an environment-adaptive streaming speech recognition tool for smartphones. AdaStreamLite interacts with its surroundings to capture the characteristics of the current acoustic environment and improve robustness against ambient noise in a lightweight manner. We design an environment representation extractor to model acoustic environments with compact feature vectors, and construct a representation lookup table to improve the generalization of AdaStreamLite to unseen environments. We train our system using large, publicly available speech datasets covering different languages, and conduct experiments in a wide range of real acoustic environments with different smartphones. The results show that AdaStreamLite outperforms state-of-the-art methods in terms of recognition accuracy, computational resource consumption, and robustness against unseen environments.
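To make the mechanism concrete, here is a minimal sketch of the representation-lookup-table idea described in the abstract. This is not the authors' implementation: the class name, the 128-dimensional embeddings, and the cosine-similarity matching are all assumptions made for illustration; in the paper, the embeddings would come from the environment representation extractor.

```python
import numpy as np

class EnvironmentLookupTable:
    """Stores unit-norm environment embeddings and returns the
    closest known environment for a query embedding."""

    def __init__(self):
        self.names = []    # environment labels, e.g. "cafe"
        self.vectors = []  # L2-normalized embedding vectors

    def add(self, name, embedding):
        # Normalize once so cosine similarity reduces to a dot product.
        self.names.append(name)
        self.vectors.append(embedding / np.linalg.norm(embedding))

    def nearest(self, embedding):
        query = embedding / np.linalg.norm(embedding)
        sims = np.stack(self.vectors) @ query  # cosine similarities
        best = int(np.argmax(sims))
        return self.names[best], float(sims[best])

# Hypothetical usage: embeddings would come from the environment
# representation extractor; random vectors stand in for them here.
rng = np.random.default_rng(0)
table = EnvironmentLookupTable()
table.add("cafe", rng.standard_normal(128))
table.add("street", rng.standard_normal(128))
label, score = table.nearest(rng.standard_normal(128))
print(f"closest known environment: {label} (cosine similarity {score:.2f})")
```

Under this reading, the representation retrieved for the closest known environment would then condition the recognizer, which is how a fixed table can help the system generalize to acoustic environments never seen during training.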

Cited By

  • (2024) Emotional Resonance Unleashed by exploring Novel Audio Classification Techniques with Log-Melspectrogram Augmentation. 2024 OPJU International Technology Conference (OTCON) on Smart Computing for Innovation and Advancement in Industry 4.0, pp. 1-5. https://doi.org/10.1109/OTCON60325.2024.10687403. Online publication date: 5-Jun-2024.
  • (2024) Harmonizing Emotions: A Novel Approach to Audio Emotion Classification using Log-Melspectrogram with Augmentation. 2024 International Conference on Communication, Computing and Internet of Things (IC3IoT), pp. 1-4. https://doi.org/10.1109/IC3IoT60841.2024.10550216. Online publication date: 17-Apr-2024.

    Published In

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 7, Issue 4
    December 2023
    1613 pages
    EISSN: 2474-9567
    DOI: 10.1145/3640795
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 January 2024
    Published in IMWUT Volume 7, Issue 4

    Author Tags

    1. acoustic environment sensing
    2. ambient noise adaptation
    3. on-device speech recognition
    4. streaming speech recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Article Metrics

    • Downloads (Last 12 months): 263
    • Downloads (Last 6 weeks): 24

    Reflects downloads up to 21 Nov 2024

