Abstract
This paper presents a unique audio database, the Multivariate Audio Database (MAuD), in which audio data were collected in real-life scenarios. MAuD contains 229 audio files, each approximately 5 minutes long, recorded across different conferencing applications, spoken languages, background noises, and discussion topics. Several audio conferencing applications were used to collect these data, e.g., mobile conference calls, Zoom, Google Meet, Skype, and Hangout. Speakers of different ages and sexes spoke in several languages and on various topics, and the audio was recorded on a device belonging to one of the speakers. Background noises were then introduced synthetically. Researchers may find this database useful for several signal processing experiments, e.g., conferencing-app identification, background-noise identification, speaker identification, and identification of who speaks when. We have explored classification for some of the above-mentioned mismatch cases (conferencing app and background noise), using pre-trained deep learning models (ResNet18 and DenseNet201). We achieved more than 98% accuracy in both experiments, which confirms that MAuD captures high-quality, application-specific audio properties.
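The abstract does not detail how the audio clips are fed to image-based CNNs such as ResNet18 and DenseNet201; a common approach for such models is to first convert each clip into a log-mel spectrogram, which can then be treated as an image. The following is a minimal NumPy sketch of that feature-extraction step only; all parameter values (16 kHz sampling rate, 512-point FFT, 64 mel bands) are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=64):
    # Slice the signal into overlapping frames and apply a Hann window
    window = np.hanning(n_fft)
    frames = np.array([signal[s:s + n_fft] * window
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame: shape (n_frames, n_fft // 2 + 1)
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Triangular mel filterbank spanning 0 Hz .. sr / 2
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Project onto the mel bands and compress with a log
    return np.log(spec @ fbank.T + 1e-10)

# Example: one second of a 440 Hz tone at 16 kHz
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feat = log_mel_spectrogram(sig)   # shape: (n_frames, n_mels)
```

In a pipeline like the one the abstract describes, the resulting 2-D feature matrix would be replicated (or mapped) to three channels and resized to match the input expected by the pre-trained network before fine-tuning on the conferencing-app or background-noise labels.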
Data Availability
Sample audio files can be found at the link in [6], and detailed information about the database can be found in the supplementary file named MAuD-DataDetails. The entire database will be made available to researchers after acceptance of this paper.
References
Aronowitz H, Aronowitz V (2010) Efficient score normalization for speaker recognition. In: ICASSP, IEEE International conference on acoustics, speech and signal processing - proceedings. pp. 4402–4405
Bakkouri I, Afdel K (2020) Computer-aided diagnosis (cad) system based on multi-layer feature fusion network for skin lesion recognition in dermoscopy images. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-019-07988-1
Barai B, Chakraborty T, Das N, Basu S, Nasipuri M (2022) Closed-set speaker identification using VQ and GMM based models. International Journal of Speech Technology, Springer
Campbell WM, Sturim DE, Reynolds DA (2006) Support vector machines using gmm supervectors for speaker verification. IEEE signal processing letters 13(5):308–311
Canavan A, David G, George Z (1997) Callhome american english speech. Linguistic Data Consortium, Philadelphia
Chakraborty T (2021) Audio files recorded using different voice calling platforms. Figshare. https://doi.org/10.6084/m9.figshare.14731629.v1
Chakraborty T, Barai B, Chatterjee B, Das N, Basu S, Nasipuri M (2020) Closed-set device-independent speaker identification using CNN. In: Bhateja V, Satapathy SC, Zhang YD, Aradhya VNM (eds) Intelligent Computing and Communication. Springer Singapore, Singapore, pp 291–299
Dieleman S, Schrauwen B (2014) End-to-end learning for music audio. In: 2014 IEEE International conference on acoustics, speech and signal processing (ICASSP). pp. 6964–6968. IEEE
Esmaeilpour M, Cardinal P, Koerich AL (2022) From environmental sound representation to robustness of 2d cnn models against adversarial attacks. Applied Acoustics 195:108817. https://www.sciencedirect.com/science/article/pii/S0003682X22001918
Fujihara H, Kitahara T, Goto M, Komatani K, Ogata T, Okuno HG (2006) Speaker identification under noisy environments by using harmonic structure extraction and reliable frame weighting. In: Ninth international conference on spoken language processing
Gemmeke JF, Ellis DPW, Freedman D, Jansen A, Lawrence W, Moore RC, Plakal M, Ritter M (2017) Audio set: An ontology and human-labeled dataset for audio events. In: Proc. IEEE ICASSP 2017. New Orleans, LA
Ghahabi O, Hernando J (2018) Restricted boltzmann machines for vector representation of speech in speaker recognition. Computer Speech & Language 47:16–29
Godfrey JJ, Edward H (1993) Switchboard-1 release 2. Linguistic Data Consortium, Philadelphia
Haris B, Pradhan G, Misra A, Shukla S, Sinha R, Prasanna S (2011) Multi-variability speech database for robust speaker recognition. In: Communications (NCC), 2011 national conference on. pp 1–5. IEEE
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on computer vision and pattern recognition (CVPR). pp 770–778
Hershey S, Chaudhuri S, Ellis DPW, Gemmeke JF, Jansen A, Moore C, Plakal M, Platt D, Saurous RA, Seybold B, Slaney M, Weiss R, Wilson K (2017) CNN architectures for large-scale audio classification. In: International conference on acoustics, speech and signal processing (ICASSP). arXiv:1609.09430
Jumelle M, Sakmeche T (2018) Speaker clustering with neural networks and audio processing. arXiv:1803.08276
Madikeri S, Bourlard H (2015) KL-HMM based speaker diarization system for meetings. In: 2015 IEEE International conference on acoustics, speech and signal processing (ICASSP). pp 4435–4439. IEEE
McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, Nieto O (2015) librosa: Audio and music signal analysis in python
Mesaros A, Heittola T, Virtanen T (2016) TUT database for acoustic scene classification and sound event detection. In: 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary
Piczak KJ (2015) ESC: Dataset for environmental sound classification. https://doi.org/10.7910/DVN/YDEPUT
Rao KS, Sarkar S (2014) Robust speaker recognition in noisy environments. Springer
Ren J, Hu Y, Tai YW, Wang C, Xu L, Sun W, Yan Q (2016) Look, listen and learn-a multimodal lstm for speaker identification. In: Thirtieth AAAI Conference on Artificial Intelligence
Robotham T, Singla A, Rummukainen OS, Raake A, Habets EAP (2022) Audiovisual database with 360° video and higher-order ambisonics audio for perception, cognition, behavior, and QoE evaluation research. In: 2022 14th International conference on quality of multimedia experience (QoMEX). pp 1–6
Rose P (2006) Technical forensic speaker recognition: Evaluation, types and testing of evidence. Computer Speech & Language 20(2–3):159–191
Salamon J, Jacoby C, Bello JP (2014) A dataset and taxonomy for urban sound research. 22nd ACM International Conference on Multimedia (ACM-MM’14). Orlando, FL, USA, pp 1041–1044
Singh N, Khan R, Shree R (2012) Applications of speaker recognition. Procedia engineering 38:3122–3126
Yamada T, Wang L, Kai A (2013) Improvement of distant-talking speaker identification using bottleneck features of dnn. In: Interspeech. pp 3661–3664
Zheng J (2022) Construction and application of music audio database based on collaborative filtering algorithm. Discrete Dynamics in Nature and Society, Hindawi
Acknowledgements
This project is partially supported by the CMATER laboratory of the Computer Science and Engineering Department, Jadavpur University, India. We acknowledge the contribution of Ms Adrita Chakrabarti for assisting in the data collection process.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflicts of interest
The authors state that there are no conflicts of interest. For this research work, we have not received any funding from any of these conferencing application providers. Our aim was to include all the frequently used conferencing applications without any bias towards a corporate entity. We were not able to include some conferencing apps (for example, WhatsApp) because they do not have built-in recording features.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chakraborty, T., Bhattacharyya, R., Das, N. et al. MAuD: a multivariate audio database of samples collected from benchmark conferencing platforms. Multimed Tools Appl 83, 38465–38479 (2024). https://doi.org/10.1007/s11042-023-16879-5