Research article • Open access

ClearSpeech: Improving Voice Quality of Earbuds Using Both In-Ear and Out-Ear Microphones

Published: 12 January 2024

Abstract

Wireless earbuds have been gaining popularity, and using them to make phone calls or issue voice commands requires the earbud microphones to pick up human speech. When the speaker is in a noisy environment, speech quality degrades significantly, calling for speech enhancement (SE). In this paper, we present ClearSpeech, a novel deep-learning-based SE system designed for wireless earbuds. Specifically, by jointly using the earbud's in-ear and out-ear microphones, we devised a suite of techniques to effectively fuse the two signals and enhance both the magnitude and the phase of the speech spectrogram. We built an earbud prototype to evaluate ClearSpeech under various settings with data collected from 20 subjects. Our results suggest that ClearSpeech improves SE performance significantly compared to conventional approaches that use the out-ear microphone only. We also show that ClearSpeech can process user speech in real time on smartphones.
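The fusion idea described above, feeding the in-ear and out-ear microphone spectrograms jointly to a model and enhancing the noisy spectrogram's magnitude while handling its phase, can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual network: `stft`, `fuse_and_enhance`, and the pass-through `mask_fn` are all assumptions standing in for ClearSpeech's learned components.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive STFT: Hann-windowed frames -> complex spectrogram (frames x bins)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def fuse_and_enhance(in_ear, out_ear, mask_fn):
    """Illustrative fusion: stack the magnitudes of both microphones as the
    model input, apply the predicted mask to the out-ear magnitude, and
    recombine with the out-ear phase."""
    S_in, S_out = stft(in_ear), stft(out_ear)
    feats = np.stack([np.abs(S_in), np.abs(S_out)], axis=0)  # 2 x frames x bins
    mask = mask_fn(feats)                                    # frames x bins in [0, 1]
    return mask * np.abs(S_out) * np.exp(1j * np.angle(S_out))

# Toy usage: one second of audio and an identity "model" (all-ones mask).
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)                # stand-in for the in-ear signal
noisy = clean + 0.1 * np.random.randn(fs)          # stand-in for the out-ear signal
enhanced = fuse_and_enhance(clean, noisy, lambda f: np.ones_like(f[1]))
```

With the identity mask, the output simply reproduces the out-ear spectrogram; a trained network would instead predict a mask (and phase correction) from the stacked two-microphone features.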

Supplementary Material

ma.zip: supplemental movie, appendix, image, and software files for "ClearSpeech: Improving Voice Quality of Earbuds Using Both In-Ear and Out-Ear Microphones"


Cited By

  • Functional Now, Wearable Later: Examining the Design Practices of Wearable Technologists. In Proceedings of the 2024 ACM International Symposium on Wearable Computers (2024), 71-81. DOI: 10.1145/3675095.3676615.
  • Enabling Hands-Free Voice Assistant Activation on Earphones. In Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services (2024), 155-168. DOI: 10.1145/3643832.3661890.
  • Understanding Real-Time Collaborative Programming: A Study of Visual Studio Live Share. ACM Transactions on Software Engineering and Methodology 33, 4 (2024), 1-28. DOI: 10.1145/3643672.


Published In

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 7, Issue 4
December 2023, 1613 pages
EISSN: 2474-9567
DOI: 10.1145/3640795

Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. Audio Processing
  2. Earables
  3. Smart Earbuds
  4. Speech Enhancement

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1 Grant

Article Metrics

  • Downloads (last 12 months): 689
  • Downloads (last 6 weeks): 114

Reflects downloads up to 20 Nov 2024.

