MAuD: a multivariate audio database of samples collected from benchmark conferencing platforms

Multimedia Tools and Applications

Abstract

This paper presents a unique audio database, the Multivariate Audio Database (MAuD), in which audio data were collected in real-life scenarios. MAuD contains 229 audio files, each approximately 5 minutes long, collected across different conferencing apps, spoken languages, background noises and discussion topics. Various audio conferencing applications were used to collect these data, e.g. mobile conference calls, Zoom, Google Meet, Skype and Hangout. During the collection, speakers of different ages and sexes spoke in several languages and on various topics. Audio was recorded on the device of one of the speakers, and background noises were then introduced synthetically. Researchers may find this database useful for several signal processing experiments, e.g. conferencing-app identification, background-noise identification, speaker identification, and identification of who speaks when. We have explored classification for some of the above-mentioned mismatch cases (conferencing app and background noise) using pre-trained deep learning models (ResNet18 and DenseNet201). We achieved more than 98% accuracy in both experiments, which confirms that MAuD captures high-quality, audio-specific properties.
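
The pipeline above is described only at a high level. As a rough, non-authoritative sketch of how such a setup might look, the Python snippet below (i) mixes a background-noise recording into a clean clip at a chosen signal-to-noise ratio and (ii) fine-tunes a pre-trained ResNet18 on log-mel spectrograms for conferencing-app identification. All file names, label names, sampling rates and hyperparameters are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (not the authors' code): synthetic noise mixing and
# conferencing-app classification with a pre-trained ResNet18 on log-mel
# spectrograms. File names, labels and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torchaudio
import torchvision.models as models

APP_CLASSES = ["mobile_call", "zoom", "google_meet", "skype", "hangout"]  # assumed label set

def mix_noise(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Add a background-noise recording to a clean clip at the given SNR (dB)."""
    if noise.shape[-1] < clean.shape[-1]:          # loop the noise if it is too short
        reps = clean.shape[-1] // noise.shape[-1] + 1
        noise = noise.repeat(1, reps)
    noise = noise[..., : clean.shape[-1]]
    scale = torch.sqrt(clean.pow(2).mean() / (noise.pow(2).mean() * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Log-mel front end; ResNet18 expects a 3-channel "image"
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024,
                                               hop_length=256, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

def audio_to_input(wav: torch.Tensor, sr: int) -> torch.Tensor:
    """Convert a waveform to a (1, 3, n_mels, time) log-mel tensor.

    In practice the 5-minute recordings would be split into shorter segments.
    """
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    wav = wav.mean(dim=0, keepdim=True)            # downmix to mono
    spec = to_db(melspec(wav))
    return spec.repeat(3, 1, 1).unsqueeze(0)

# Pre-trained ResNet18 with its classifier head replaced for the app labels
# (the weights API requires torchvision >= 0.13)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(APP_CLASSES))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a single (hypothetical) clip
clean, sr = torchaudio.load("zoom_clip.wav")       # hypothetical file
noise, _ = torchaudio.load("street_noise.wav")     # hypothetical file
noisy = mix_noise(clean, noise, snr_db=10.0)
x, y = audio_to_input(noisy, sr), torch.tensor([APP_CLASSES.index("zoom")])
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

Under the same assumptions, DenseNet201 could be swapped in via models.densenet201, replacing model.classifier rather than model.fc with a new linear head.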


Data Availability

Sample audio files can be found at the link in [6], and detailed information about the database is provided in the supplementary file named MAuD-DataDetails. The entire database will be made available to researchers after acceptance of this paper.

References

  1. Aronowitz H, Aronowitz V (2010) Efficient score normalization for speaker recognition. In: ICASSP, IEEE International conference on acoustics, speech and signal processing - proceedings. pp. 4402–4405

  2. Bakkouri I, Afdel K (2020) Computer-aided diagnosis (CAD) system based on multi-layer feature fusion network for skin lesion recognition in dermoscopy images. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-019-07988-1

  3. Barai B, Chakraborty T, Das N, Basu S, Nasipuri M (2022) Closed-set speaker identification using VQ and GMM based models. International Journal of Speech Technology, Springer

  4. Campbell WM, Sturim DE, Reynolds DA (2006) Support vector machines using GMM supervectors for speaker verification. IEEE signal processing letters 13(5):308–311

  5. Canavan A, David G, George Z (1997) CALLHOME American English Speech. Linguistic Data Consortium, Philadelphia

  6. Chakraborty T (2021) Audio files recorded using different voice calling platforms. figshare media. https://doi.org/10.6084/m9.figshare.14731629.v1

  7. Chakraborty T, Barai B, Chatterjee B, Das N, Basu S, Nasipuri M (2020) Closed-set device-independent speaker identification using CNN. In: Bhateja V, Satapathy SC, Zhang YD, Aradhya VNM (eds) Intelligent Computing and Communication. Springer Singapore, Singapore, pp 291–299

  8. Dieleman S, Schrauwen B (2014) End-to-end learning for music audio. In: 2014 IEEE International conference on acoustics, speech and signal processing (ICASSP). pp. 6964–6968. IEEE

  9. Esmaeilpour M, Cardinal P, Koerich AL (2022) From environmental sound representation to robustness of 2D CNN models against adversarial attacks. Applied Acoustics 195:108817. https://www.sciencedirect.com/science/article/pii/S0003682X22001918

  10. Fujihara H, Kitahara T, Goto M, Komatani K, Ogata T, Okuno HG (2006) Speaker identification under noisy environments by using harmonic structure extraction and reliable frame weighting. In: Ninth international conference on spoken language processing

  11. Gemmeke JF, Ellis DPW, Freedman D, Jansen A, Lawrence W, Moore RC, Plakal M, Ritter M (2017) Audio set: An ontology and human-labeled dataset for audio events. In: Proc. IEEE ICASSP 2017. New Orleans, LA

  12. Ghahabi O, Hernando J (2018) Restricted Boltzmann machines for vector representation of speech in speaker recognition. Computer Speech & Language 47:16–29

  13. Godfrey JJ, Edward H (1993) Switchboard-1 release 2. Linguistic Data Consortium, Philadelphia

  14. Haris B, Pradhan G, Misra A, Shukla S, Sinha R, Prasanna S (2011) Multi-variability speech database for robust speaker recognition. In: Communications (NCC), 2011 national conference on. pp 1–5. IEEE

  15. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on computer vision and pattern recognition (CVPR). pp 770–778

  16. Hershey S, Chaudhuri S, Ellis DPW, Gemmeke JF, Jansen A, Moore C, Plakal M, Platt D, Saurous RA, Seybold B, Slaney M, Weiss R, Wilson K (2017) CNN architectures for large-scale audio classification. In: International conference on acoustics, speech and signal processing (ICASSP). arXiv:1609.09430

  17. Jumelle M, Sakmeche T (2018) Speaker clustering with neural networks and audio processing. arXiv:1803.08276

  18. Madikeri S, Bourlard H (2015) KL-HMM based speaker diarization system for meetings. In: 2015 IEEE International conference on acoustics, speech and signal processing (ICASSP). pp 4435–4439. IEEE

  19. McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, Nieto O (2015) librosa: Audio and music signal analysis in Python

  20. Mesaros A, Heittola T, Virtanen T (2016) TUT database for acoustic scene classification and sound event detection. In: 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary

  21. Piczak KJ (2015) ESC: Dataset for environmental sound classification. https://doi.org/10.7910/DVN/YDEPUT

  22. Rao KS, Sarkar S (2014) Robust speaker recognition in noisy environments. Springer

  23. Ren J, Hu Y, Tai YW, Wang C, Xu L, Sun W, Yan Q (2016) Look, listen and learn - a multimodal LSTM for speaker identification. In: Thirtieth AAAI Conference on Artificial Intelligence

  24. Robotham T, Singla A, Rummukainen OS, Raake A, Habets EAP (2022) Audiovisual database with 360° video and higher-order ambisonics audio for perception, cognition, behavior, and QoE evaluation research. In: 2022 14th International conference on quality of multimedia experience (QoMEX). pp 1–6

  25. Rose P (2006) Technical forensic speaker recognition: Evaluation, types and testing of evidence. Computer Speech & Language 20(2–3):159–191

  26. Salamon J, Jacoby C, Bello JP (2014) A dataset and taxonomy for urban sound research. 22nd ACM International Conference on Multimedia (ACM-MM’14). Orlando, FL, USA, pp 1041–1044

  27. Singh N, Khan R, Shree R (2012) Applications of speaker recognition. Procedia engineering 38:3122–3126

  28. Yamada T, Wang L, Kai A (2013) Improvement of distant-talking speaker identification using bottleneck features of DNN. In: Interspeech. pp 3661–3664

  29. Zheng J (2022) Construction and application of music audio database based on collaborative filtering algorithm. Discrete Dynamics in Nature and Society, Hindawi

Acknowledgements

This project is partially supported by the CMATER laboratory of the Computer Science and Engineering Department, Jadavpur University, India. We acknowledge the contribution of Ms Adrita Chakrabarti for assisting in the data collection process.

Author information

Corresponding author

Correspondence to Tapas Chakraborty.

Ethics declarations

Conflicts of interest

The authors state that there are no conflicts of interest. For this research work, we have not received any funding from any of the conferencing application providers. Our aim was to include all frequently used conferencing applications without any bias towards a particular corporate entity. We were not able to include some conferencing apps (for example, WhatsApp) because they do not have built-in recording features.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chakraborty, T., Bhattacharyya, R., Das, N. et al. MAuD: a multivariate audio database of samples collected from benchmark conferencing platforms. Multimed Tools Appl 83, 38465–38479 (2024). https://doi.org/10.1007/s11042-023-16879-5
