Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer

Authors:

Ziyao Liu

SecTL '24: Proceedings of the 2nd ACM Workshop on Secure and Trustworthy Deep Learning Systems

Pages 47 - 55

https://doi.org/10.1145/3665451.3665532

Published: 23 July 2024

Abstract

With the widespread deployment of Automatic Speech Recognition (ASR) systems, their security has received more attention than ever before, primarily because of the susceptibility of the underlying Deep Neural Networks. Previous studies have shown that surreptitiously crafted adversarial perturbations can manipulate speech recognition systems into producing malicious commands. Most of these attacks add noise perturbations under ℓp norm constraints, which inevitably leaves artifacts of manual modification. Recent research has alleviated this limitation by manipulating style vectors to synthesize adversarial examples from Text-to-Speech (TTS) audio. However, style modifications driven by optimization objectives significantly reduce the controllability and editability of audio styles. In this paper, we propose an attack on ASR systems based on user-customized style transfer. We first test the effect of the Style Transfer Attack (STA), which applies style transfer and an adversarial attack in sequence. Then, as an improvement, we propose an iterative Style Code Attack (SCA) that maintains audio quality. Experimental results show that our method meets the need for user-customized styles and achieves an attack success rate of 82%, while preserving sound naturalness according to our user study.
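To make the ℓp norm constraint mentioned above concrete, the following is a minimal sketch of how a noise perturbation can be bounded before it is added to a waveform. It is not the paper's STA or SCA method; the function names (`project_linf`, `project_l2`, `perturb`), the budget `eps`, and the assumption that audio samples lie in [-1, 1] are all illustrative choices.

```python
import math

def project_linf(delta, eps):
    """Clip every sample of the perturbation to [-eps, eps] (l-infinity ball)."""
    return [max(-eps, min(eps, d)) for d in delta]

def project_l2(delta, eps):
    """Rescale the perturbation so its Euclidean norm is at most eps (l2 ball)."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(delta)
    scale = eps / norm
    return [d * scale for d in delta]

def perturb(audio, delta, eps, project=project_linf):
    """Add a norm-bounded perturbation and keep samples in the assumed [-1, 1] range."""
    bounded = project(delta, eps)
    return [max(-1.0, min(1.0, a + d)) for a, d in zip(audio, bounded)]

if __name__ == "__main__":
    clean = [0.0, 0.95, -0.3]          # hypothetical waveform samples
    noise = [0.5, 0.5, -0.02]          # hypothetical unconstrained perturbation
    print(perturb(clean, noise, eps=0.1))
```

In a typical norm-bounded attack a projection like this runs after every optimization step, so the perturbation never exceeds the budget `eps`. The abstract's point is that even a bounded perturbation remains additive noise, which is the limitation style-based attacks try to avoid.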

Index Terms

1. Computing methodologies
  1. Machine learning
2. Security and privacy

Information & Contributors

Information

Published In

SecTL '24: Proceedings of the 2nd ACM Workshop on Secure and Trustworthy Deep Learning Systems

July 2024

69 pages

ISBN:9798400706912

DOI:10.1145/3665451

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Natural Science Foundation of China

Conference

ASIA CCS '24

Sponsor:

SIGSAC

ASIA CCS '24: ACM Asia Conference on Computer and Communications Security

July 2 - 20, 2024

Singapore, Singapore
