DOI: 10.1145/3552466.3556524
Research article | Open access

Acoustic or Pattern? Speech Spoofing Countermeasure based on Image Pre-training Models

Published: 10 October 2022

Abstract

Traditional speech spoofing countermeasures (CMs) typically consist of a frontend that extracts a two-dimensional feature from the waveform and a Convolutional Neural Network (CNN) based backend classifier. To some degree, this pipeline resembles that of an image classification task. Pre-training is a widely used paradigm in many fields. Self-supervised pre-trained frontends such as Wav2Vec 2.0 have shown substantial improvements on the speech spoofing detection task. However, these pre-trained models are trained only on bonafide utterances. Moreover, acoustic pre-trained frontends can also be used for text-to-speech (TTS) and voice conversion (VC), which suggests that they learn commonalities of speech rather than information that discriminates real from fake data. The speech spoofing detection task and the image classification task share the same pipeline. Based on the hypothesis that CNNs follow the same pattern when capturing artefacts in these two tasks, we counterintuitively apply an image pre-trained CNN model to detect spoofed utterances. To supplement the model with potentially missing acoustic features, we concatenate Jitter and Shimmer features to the output embedding. Our proposed CM achieves top-level performance on the ASVspoof 2019 dataset.
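The abstract describes the pipeline only at a high level, so the following is a minimal, hypothetical sketch of how such a countermeasure could be wired together: an ImageNet pre-trained CNN backbone (ResNet-18 is an assumption here, not necessarily the backbone the paper uses) consumes a two-dimensional time-frequency feature as if it were an image, and the two scalar acoustic features (jitter and shimmer) are concatenated to the resulting embedding before a binary bonafide/spoof classifier. The input feature shape, backbone choice, and classifier head are all illustrative assumptions, and jitter/shimmer extraction (e.g., via Praat) is not shown.

import torch
import torch.nn as nn
import torchvision.models as models


class ImagePretrainedCM(nn.Module):
    """Hypothetical sketch: image pre-trained backbone + jitter/shimmer concatenation."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # ImageNet pre-trained backbone (an assumption; the paper's backbone may differ).
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Drop the final fully connected layer; keep the convolutional stages and pooling.
        self.feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
        embed_dim = backbone.fc.in_features  # 512 for ResNet-18
        # +2 for the concatenated jitter and shimmer scalars.
        self.classifier = nn.Linear(embed_dim + 2, num_classes)

    def forward(self, spec: torch.Tensor, jitter_shimmer: torch.Tensor) -> torch.Tensor:
        # spec: (B, 1, F, T) time-frequency feature; replicate to 3 channels so it
        # matches the RGB input the image model expects.
        x = spec.repeat(1, 3, 1, 1)
        emb = self.feature_extractor(x).flatten(1)     # (B, embed_dim)
        emb = torch.cat([emb, jitter_shimmer], dim=1)  # (B, embed_dim + 2)
        return self.classifier(emb)                    # (B, num_classes)


# Dummy usage: batch of 4 utterances, 80 frequency bins, 400 frames, 2 acoustic scalars.
model = ImagePretrainedCM()
scores = model(torch.randn(4, 1, 80, 400), torch.randn(4, 2))
print(scores.shape)  # torch.Size([4, 2])

Replicating the single-channel spectrogram to three channels is one simple way to reuse RGB-trained weights; in practice the choice of 2-D feature, normalization, and fine-tuning schedule would need to follow the paper.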

Supplementary Material

MP4 File (ddam22-012.mp4)
Presentation video of the paper "Acoustic or Pattern? Speech Spoofing Countermeasure based on Image Pre-training Models" at ACM Multimedia 2022

Cited By

  • Audio Multi-View Spoofing Detection Framework Based on Audio-Text-Emotion Correlations. IEEE Transactions on Information Forensics and Security 19 (2024), 7133-7146. https://doi.org/10.1109/TIFS.2024.3431888
  • An explainable deepfake of speech detection method with spectrograms and waveforms. Journal of Information Security and Applications 81 (2024). https://doi.org/10.1016/j.jisa.2024.103720


Information

Published In

DDAM '22: Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia
October 2022
107 pages
ISBN:9781450394963
DOI:10.1145/3552466
  • General Chairs:
  • Jianhua Tao,
  • Haizhou Li,
  • Helen Meng,
  • Dong Yu,
  • Masato Akagi,
  • Program Chairs:
  • Jiangyan Yi,
  • Cunhang Fan,
  • Ruibo Fu,
  • Shan Lian,
  • Pengyuan Zhang
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2022


Author Tags

  1. anti-spoofing
  2. audio deepfakes
  3. image pre-training
  4. wav2vec2

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

DDAM '22 Paper Acceptance Rate 12 of 14 submissions, 86%;
Overall Acceptance Rate 12 of 14 submissions, 86%


Bibliometrics

Article Metrics

  • Downloads (last 12 months): 256
  • Downloads (last 6 weeks): 44
Reflects downloads up to 22 Nov 2024

