Abstract
Benefiting from significant advances in deep learning, widespread Deepfake videos with convincing manipulations have posed a series of threats to public security, and the identification of fake videos has therefore become an increasingly active research topic. However, most existing Deepfake detection methods concentrate on exposing facial defects through direct facial feature analysis and rarely consider the synergy with authentic behavioral information outside the facial region. Meanwhile, schemes based on meticulously designed neural networks can rarely provide subjective interpretations of the final identification evidence. Therefore, to enrich the diversity of detection methods and increase the interpretability of detection evidence, this paper proposes a self-referential method that exploits audio-visual consistency by introducing the synchronous audio track as a reference. In the preprocessing phase, we propose a phoneme-based audio-visual matching strategy to segment videos, and control experiments show that this strategy outperforms the common equal-length partition. For these video segments, an audio-visual coupling model (AVCM) is employed to learn audio-visual feature representations, and similarity metrics are then computed between mouth frames and the corresponding speech segments: synchronized pairs yield high similarity scores, while asynchronous pairs yield low ones. Evaluations on DeepfakeVidTIMIT indicate that our method achieves competitive results compared with current mainstream methods, especially on the high-quality dataset.
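As a concrete illustration of the scoring step summarized above, the following minimal sketch measures cosine similarity between per-segment mouth-frame embeddings and speech embeddings and flags a video when the average similarity is low. The function names, the 0.5 threshold, and the assumption that embeddings arrive as equal-length vectors are illustrative choices made here; this is not the paper's AVCM implementation.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def score_video(mouth_embeddings, speech_embeddings, threshold=0.5):
    # Score phoneme-aligned mouth/speech embedding pairs; a low mean
    # similarity suggests audio-visual asynchrony, i.e. a possible Deepfake.
    # The embeddings are assumed to come from some audio-visual coupling
    # model (hypothetical here) and to be lists of same-dimension vectors.
    scores = [cosine_similarity(v, a)
              for v, a in zip(mouth_embeddings, speech_embeddings)]
    mean_score = float(np.mean(scores))
    return {"mean_similarity": mean_score, "is_fake": mean_score < threshold}

In practice the decision threshold would be calibrated on a validation set rather than fixed in advance.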
This work was supported by the NSFC under Grant 61902391, the National Key Technology R&D Program under Grants 2019QY2202 and 2019QY(Y)0207, and the IIE CAS Climbing Program.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Gu, Y., Zhao, X., Gong, C., Yi, X. (2021). Deepfake Video Detection Using Audio-Visual Consistency. In: Zhao, X., Shi, YQ., Piva, A., Kim, H.J. (eds) Digital Forensics and Watermarking. IWDW 2020. Lecture Notes in Computer Science(), vol 12617. Springer, Cham. https://doi.org/10.1007/978-3-030-69449-4_13