Paralinguistic Privacy Protection at the Edge

Published: 13 April 2023

Abstract

Voice user interfaces and digital assistants are rapidly entering our lives and becoming singular touch points spanning our devices. These always-on services capture our audio and transmit it to powerful cloud services for further processing and subsequent actions. The raw voice signals collected through these devices carry a host of sensitive paralinguistic information that reaches service providers whether a trigger was deliberate or false. Because emotional patterns and sensitive attributes such as identity, gender, and well-being are easily inferred from this data by deep acoustic models, using these services exposes us to a new generation of privacy risks. One approach to mitigating paralinguistic privacy breaches is to combine cloud-based processing with privacy-preserving, on-device learning that identifies and filters paralinguistic information before voice data is transmitted.
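
To make this hybrid edge/cloud arrangement concrete, the sketch below encodes raw audio on-device, keeps only the representation intended for the cloud, and never transmits the paralinguistic factors. The toy encoder, its dimensions, and the content/paralinguistic split are hypothetical placeholders for illustration, not EDGY's actual architecture.

```python
# Minimal sketch (assumed names and dimensions) of on-device filtering:
# raw audio -> disentangled factors; only the "content" factors leave the device.
import torch
import torch.nn as nn

class ToyDisentanglingEncoder(nn.Module):
    """Maps raw audio to two factor groups: linguistic content vs. paralinguistic."""
    def __init__(self, content_dim=64, para_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(128, content_dim + para_dim, kernel_size=8, stride=4),
        )
        self.content_dim = content_dim

    def forward(self, wav):                  # wav: (batch, 1, samples)
        z = self.conv(wav)                   # (batch, content+para, frames)
        content = z[:, : self.content_dim]   # retained for cloud processing
        para = z[:, self.content_dim :]      # sensitive; stays on the device
        return content, para

encoder = ToyDisentanglingEncoder().eval()
wav = torch.randn(1, 1, 16000)               # one second of 16 kHz audio (dummy)
with torch.no_grad():
    content, _sensitive = encoder(wav)
payload = content                             # only this tensor is offloaded
```
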
In this article, we introduce EDGY, a configurable, lightweight, disentangled representation learning framework that transforms and filters high-dimensional voice data to identify and contain sensitive attributes at the edge, prior to offloading to the cloud. We evaluate EDGY’s on-device performance and explore optimization techniques, including model quantization and knowledge distillation, to enable private, accurate, and efficient representation learning on resource-constrained devices. Our results show that EDGY runs in tens of milliseconds, with a 0.2% relative improvement in “zero-shot” ABX score or a minimal performance penalty of approximately 5.95% word error rate (WER) when learning linguistic representations from raw voice signals, using a CPU and a single-core ARM processor without specialized hardware.
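
The two optimizations named above are standard techniques rather than anything EDGY-specific, so they can be illustrated with stock PyTorch utilities. The sketch below applies post-training dynamic quantization to a toy model and defines a conventional temperature-scaled distillation loss; the model shapes and the temperature are assumptions for illustration only.

```python
# Hedged sketch of the two on-device optimizations mentioned in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1) Post-training dynamic quantization: Linear weights are stored as int8,
#    shrinking the model and speeding up inference on CPU/ARM cores.
float_model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 64))
quantized_model = torch.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

# 2) Knowledge distillation: a small student is trained to match the
#    temperature-softened output distribution of a larger teacher.
def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature**2

loss = distillation_loss(torch.randn(8, 64), torch.randn(8, 64))  # smoke test
```
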



Published In

ACM Transactions on Privacy and Security, Volume 26, Issue 2
May 2023
335 pages
ISSN: 2471-2566
EISSN: 2471-2574
DOI: 10.1145/3572849

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 13 April 2023
Online AM: 03 November 2022
Accepted: 26 September 2022
Revised: 10 March 2022
Received: 29 May 2021
Published in TOPS Volume 26, Issue 2


Author Tags

  1. voice user interface
  2. Internet of Things (IoT)
  3. privacy
  4. speech analysis
  5. voice synthesis
  6. deep learning
  7. disentangled representation learning
  8. model optimization

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 171
  • Downloads (last 6 weeks): 19
Reflects downloads up to 17 Feb 2025

Cited By

  • Edge Computing and Cloud Computing for Internet of Things: A Review. Informatics 11, 4 (2024), Article 71. DOI: 10.3390/informatics11040071. Online publication date: 30-Sep-2024.
  • Maximizing the Capabilities of Tiny Speech Foundation Models in a Privacy Preserving Manner. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (2024), 1677–1679. DOI: 10.1145/3636534.3697457. Online publication date: 4-Dec-2024.
  • Privacy-Oriented Manipulation of Speaker Representations. IEEE Access 12 (2024), 82949–82971. DOI: 10.1109/ACCESS.2024.3409067.
  • Privacy preservation in sensor-based Human Activity Recognition through autoencoders for low-power IoT devices. Internet of Things 26 (2024), 101189. DOI: 10.1016/j.iot.2024.101189. Online publication date: Jul-2024.
  • Construction and Management of Doctor-Patient Privacy Protection System under Big Data Computing Environment. In 2023 3rd International Conference on Mobile Networks and Wireless Communications (ICMNWC) (2023), 1–6. DOI: 10.1109/ICMNWC60182.2023.10435861. Online publication date: 4-Dec-2023.
