
Deepfake Video Detection Using Audio-Visual Consistency

Conference paper, Digital Forensics and Watermarking (IWDW 2020)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12617)

Abstract

Benefiting from significant advances in deep learning, widespread Deepfake videos with convincing manipulations have posed serious threats to public security, so the identification of fake videos has become an increasingly active research topic. However, most existing Deepfake detection methods concentrate on exposing facial defects through direct analysis of facial features, while largely overlooking the synergy with authentic behavioral information outside the facial regions. Meanwhile, schemes built on meticulously designed neural networks rarely provide interpretable evidence for their final decisions. Therefore, to enrich the diversity of detection methods and increase the interpretability of detection evidence, this paper proposes a self-referential method that exploits audio-visual consistency by introducing the synchronous audio recording as a reference. In the preprocessing phase, we propose a phoneme-based audio-visual matching strategy for segmenting videos, and controlled experiments show that this strategy outperforms common equal-length partitioning. An audio-visual coupling model (AVCM) then extracts feature representations from each segment, and similarity metrics are computed between the mouth frames and the corresponding speech segments: synchronized pairs yield high similarity scores, whereas asynchronous pairs yield low ones. Evaluations on DeepfakeVidTIMIT show that our method achieves results competitive with current mainstream methods, especially on the high-quality datasets.

This work was supported by the NSFC under Grant 61902391, the National Key Technology R&D Program under Grants 2019QY2202 and 2019QY(Y)0207, and the IIE CAS Climbing Program.
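At inference time, the pipeline described in the abstract reduces to two steps: cutting the video at phoneme boundaries, and scoring the similarity between the mouth-frame and speech embeddings of each segment. Below is a minimal sketch of those steps, assuming the AVCM embeddings are already computed; the `frames_for_interval` helper, the 256-dimensional embedding size, the cosine similarity metric, and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# --- Step 1: phoneme-based segmentation (hedged sketch) ---------------------
# The paper segments videos by phonemes rather than into equal-length chunks.
# Assuming a forced aligner has produced (start, end) times in seconds for
# each phoneme, the video frames of a segment can be looked up as:
def frames_for_interval(start_s, end_s, fps=25):
    """Map a phoneme time interval to the indices of the covered video frames."""
    return list(range(int(start_s * fps), int(np.ceil(end_s * fps))))

# --- Step 2: audio-visual consistency scoring -------------------------------
# Placeholder embeddings: in the paper's pipeline these would come from the
# audio-visual coupling model (AVCM), one vector per phoneme segment for the
# mouth frames and one for the matching speech segment.
rng = np.random.default_rng(0)
visual_emb = rng.normal(size=(12, 256))  # 12 segments, 256-d (assumed size)
audio_emb = rng.normal(size=(12, 256))

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

# Genuine (synchronized) pairs should score consistently high; manipulated
# segments break the audio-visual coupling and pull the video's score down.
scores = cosine_similarity(visual_emb, audio_emb)
video_score = scores.mean()

THRESHOLD = 0.5  # hypothetical decision threshold, tuned on a validation set
print("fake" if video_score < THRESHOLD else "real")
```

In practice the segment boundaries would come from the phoneme-based matching strategy, and the similarity metric and decision threshold would be learned during training (for instance, with a contrastive objective that pulls synchronized pairs together and pushes asynchronous pairs apart), rather than fixed as above.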



Author information

Correspondence to Xiaowei Yi.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Gu, Y., Zhao, X., Gong, C., Yi, X. (2021). Deepfake Video Detection Using Audio-Visual Consistency. In: Zhao, X., Shi, Y.Q., Piva, A., Kim, H.J. (eds.) Digital Forensics and Watermarking. IWDW 2020. Lecture Notes in Computer Science, vol. 12617. Springer, Cham. https://doi.org/10.1007/978-3-030-69449-4_13


  • DOI: https://doi.org/10.1007/978-3-030-69449-4_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-69448-7

  • Online ISBN: 978-3-030-69449-4

  • eBook Packages: Computer Science (R0)
