A preliminary study of challenges in extracting purity videos from the AV Speech Benchmark

Authors:

Ling Qian

ICMIP '22: Proceedings of the 2022 7th International Conference on Multimedia and Image Processing

Pages 86 - 92

https://doi.org/10.1145/3517077.3517091

Published: 22 May 2022

Abstract

Recently reported deep audio-visual models have shown promising results on the cocktail party problem and are attracting new studies. Audio-visual datasets are an important basis for these studies. Here we investigate the AVSpeech dataset[1], a popular dataset released by Google for training deep audio-visual models for multi-talker speech separation. Our goal is to derive a special kind of video, called a purity video, from the dataset. A purity video is a clip made up of consecutive frames that all show the face of the same person over a given time span.

A natural question is how to extract as many purity videos as possible from the AVSpeech dataset. This paper presents the tools and methods we used, the problems we encountered, and the purity videos we obtained. Our main contributions are as follows: 1) We propose a solution for extracting a derived subset of the AVSpeech dataset that is of high quality and larger than the existing publicly available training sets. 2) We implemented this solution, ran experiments on the AVSpeech dataset, and obtained insightful results. 3) We evaluated the proposed solution on our manually labeled dataset, called VTData. Experiments show that our solution is effective and robust. We hope this work helps the community exploit the AVSpeech dataset for other video understanding tasks.
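The definition of a purity video given in the abstract can be illustrated with a small sketch. This is not the method of the paper, whose details are not given on this page. It assumes that some face detection and identification step has already produced one label per frame: a person identifier when exactly one face is visible, and None otherwise. The function name find_purity_segments, the min_frames parameter, and the label format are all hypothetical.

```python
from itertools import groupby
from typing import Hashable, List, Optional, Sequence, Tuple

# A segment is (start_frame, end_frame_exclusive, person_id).
Segment = Tuple[int, int, Hashable]


def find_purity_segments(
    frame_ids: Sequence[Optional[Hashable]],
    min_frames: int = 1,
) -> List[Segment]:
    """Return maximal runs of consecutive frames that show the same person.

    frame_ids[i] is a hypothetical per-frame label: a person identifier
    when exactly one face was found in frame i, or None when no usable
    face was found. Runs shorter than min_frames are discarded.
    """
    segments: List[Segment] = []
    start = 0
    # groupby merges adjacent equal labels into one run.
    for person, run in groupby(frame_ids):
        length = sum(1 for _ in run)
        if person is not None and length >= min_frames:
            segments.append((start, start + length, person))
        start += length
    return segments


# Example: person "a" for 3 frames, a gap, person "b" for 2 frames,
# then a single frame of "a" that is too short to keep.
labels = ["a", "a", "a", None, "b", "b", "a"]
print(find_purity_segments(labels, min_frames=2))
# prints [(0, 3, 'a'), (4, 6, 'b')]
```

Treating frames without a usable face as breaks between segments is one possible design choice. A real pipeline might instead tolerate short detection gaps before splitting a clip.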

References

[1]

Qian Y, Weng C, Chang X, et al. Past review, current progress, and challenges ahead on the cocktail party problem[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(1): 40-63.

[2]

Zhu H, Luo M, Wang R, et al. Deep audio-visual learning: A survey[J]. arXiv preprint arXiv:2001.04758, 2020.

[3]

Ephrat A, Mosseri I, Lang O, et al. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation[J]. arXiv preprint arXiv:1804.03619, 2018.

[4]

Rigal R, Chodorowski J, Zerr B. Deep audio-visual speech separation based on facial motion[J]. Proc. Interspeech 2021, 2021: 3540-3544.

[5]

Owens A, Efros A A. Audio-visual scene analysis with self-supervised multisensory features[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 631-648.

[6]

Zhao H, Gan C, Rouditchenko A, et al. The sound of pixels[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 570-586.

[7]

Xiang J, Zhu G. Joint face detection and facial expression recognition with MTCNN[C]//2017 4th International Conference on Information Science and Control Engineering (ICISCE). IEEE, 2017: 424-427.

Information & Contributors

Information

Published In

ICMIP '22: Proceedings of the 2022 7th International Conference on Multimedia and Image Processing

January 2022

250 pages

ISBN:9781450387408

DOI:10.1145/3517077

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICMIP 2022

ICMIP 2022: 2022 7th International Conference on Multimedia and Image Processing

January 14 - 16, 2022

Tianjin, China
