research-article

Detection of Synthetic Speech Based on Spectrum Defects

Authors:

Mingyu Dong

DDAM '22: Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia

Pages 3 - 8

https://doi.org/10.1145/3552466.3556529

Published: 10 October 2022

Abstract

Synthetic spoofed speech has become a threat to online communication and to deep-learning-based automatic speaker verification (ASV) systems, since synthesis models can imitate anyone's voice. The first Audio Deep Synthesis Detection Challenge (ADD 2022) was launched to spur researchers around the world to build new technologies that accelerate research on detecting deeply synthesized and manipulated speech. This paper presents a spoofing detection system submitted to the ADD 2022 Track 3.2 detection task (FG-D). The system detects synthetic speech in two parts. First, Mel-frequency cepstral coefficients (MFCCs), linear-frequency cepstral coefficients (LFCCs), delta coefficients, and delta-delta coefficients derived from the speech spectrogram are fed into a DenseNet to build the DenseNet detection system (DDS). Second, three algorithms, a mute segment classifier (MSC), a high-frequency classifier (HFC), and a block spectrogram classifier (BSC), are designed to target the defects that synthetic speech leaves in the spectrogram; together they form the spectrum defect detection system (SPECT). The fusion of SPECT and DDS achieves an equal error rate (EER) of 8.5% on ADD FG-D, and our final submission ranks 6th in the evaluation phase of ADD FG-D.
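
The abstract reports its result as an EER, the operating point at which the false acceptance rate and the false rejection rate are equal. The following is a minimal sketch of how that metric can be computed from detector scores; it is not the authors' evaluation code. The function name `equal_error_rate`, the score convention (higher means more likely genuine), and the label encoding (1 = genuine, 0 = spoof) are assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Estimate the EER of a detector (illustrative helper, not from the paper).

    Assumptions: a higher score means "more likely genuine";
    labels are 1 for genuine speech and 0 for spoofed speech.
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    genuine = scores[labels == 1]
    spoof = scores[labels == 0]

    # Every distinct score is a candidate decision threshold.
    thresholds = np.unique(scores)

    # FAR: fraction of spoofed utterances accepted (score >= threshold).
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # FRR: fraction of genuine utterances rejected (score < threshold).
    frr = np.array([(genuine < t).mean() for t in thresholds])

    # With finitely many scores the two curves rarely cross exactly,
    # so take the closest point and average the two rates there.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


if __name__ == "__main__":
    # Toy scores: two genuine and two spoofed utterances.
    eer, thr = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
    print(f"EER = {eer:.3f} at threshold {thr:.2f}")
```

On a real evaluation set the same function would be applied to the fused scores of the two subsystems; how those scores are fused is not specified in the abstract.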

Index Terms

Detection of Synthetic Speech Based on Spectrum Defects
1. Applied computing
  1. Arts and humanities
    1. Sound and music computing
  2. Computer forensics
    1. System forensics

Published In

DDAM '22: Proceedings of the 1st International Workshop on Deepfake Detection for Audio Multimedia

October 2022

107 pages

ISBN:9781450394963

DOI:10.1145/3552466

General Chairs:
Jianhua Tao, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Haizhou Li, National University of Singapore, Singapore
Helen Meng, Chinese University of Hong Kong, Hong Kong, China
Dong Yu, Tencent AI Lab, Beijing, China
Masato Akagi, Japan Advanced Institute of Science and Technology, Japan

Program Chairs:
Jiangyan Yi, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Cunhang Fan, School of Computer Science and Technology, Anhui University, Hefei, China
Ruibo Fu, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Shan Lian, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Pengyuan Zhang, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Ningbo Natural Science Foundation
Ningbo Science and Technology Innovation Project
the National Natural Science Foundation of China

Conference

MM '22

Sponsor:

SIGMM

MM '22: The 30th ACM International Conference on Multimedia

October 14, 2022

Lisboa, Portugal

Acceptance Rates

DDAM '22 Paper Acceptance Rate 12 of 14 submissions, 86%;

Overall Acceptance Rate 12 of 14 submissions, 86%
