MAuD: a multivariate audio database of samples collected from benchmark conferencing platforms

Multimedia Tools and Applications

Abstract

This paper presents a unique audio database, the Multivariate Audio Database (MAuD), in which audio data were collected in real-life scenarios. MAuD contains 229 audio files, each approximately 5 minutes long, collected across different conferencing apps, spoken languages, background noises and discussion topics. Various audio conferencing applications were used to collect these data, e.g. mobile conference calls, Zoom, Google Meet, Skype and Hangout. During the collection, speakers of different ages and sexes spoke in several languages and on various topics. Audio was recorded on the device of one of the speakers, and background noises were then introduced synthetically. Researchers may find this database useful for several signal processing experiments, e.g. conferencing-app identification, background-noise identification, speaker identification, and identification of who speaks when. We have explored classification for some of the above-mentioned mismatch cases (conferencing app and background noise) using pre-trained deep learning models (ResNet18 and DenseNet201). We achieved more than 98% accuracy in both experiments, which confirms that MAuD captures high-quality, audio-specific properties.
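
The pipeline above is described only at a high level. As a rough, non-authoritative sketch of how such a setup might look, the Python snippet below (i) mixes a background-noise recording into a clean clip at a chosen signal-to-noise ratio and (ii) fine-tunes a pre-trained ResNet18 on log-mel spectrograms for conferencing-app identification. All file names, label names, sampling rates and hyperparameters are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (not the authors' code): synthetic noise mixing and
# conferencing-app classification with a pre-trained ResNet18 on log-mel
# spectrograms. File names, labels and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torchaudio
import torchvision.models as models

APP_CLASSES = ["mobile_call", "zoom", "google_meet", "skype", "hangout"]  # assumed label set

def mix_noise(clean: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Add a background-noise recording to a clean clip at the given SNR (dB)."""
    if noise.shape[-1] < clean.shape[-1]:          # loop the noise if it is too short
        reps = clean.shape[-1] // noise.shape[-1] + 1
        noise = noise.repeat(1, reps)
    noise = noise[..., : clean.shape[-1]]
    scale = torch.sqrt(clean.pow(2).mean() / (noise.pow(2).mean() * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Log-mel front end; ResNet18 expects a 3-channel "image"
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_fft=1024,
                                               hop_length=256, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()

def audio_to_input(wav: torch.Tensor, sr: int) -> torch.Tensor:
    """Convert a waveform to a (1, 3, n_mels, time) log-mel tensor.

    In practice the 5-minute recordings would be split into shorter segments.
    """
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    wav = wav.mean(dim=0, keepdim=True)            # downmix to mono
    spec = to_db(melspec(wav))
    return spec.repeat(3, 1, 1).unsqueeze(0)

# Pre-trained ResNet18 with its classifier head replaced for the app labels
# (the weights API requires torchvision >= 0.13)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(APP_CLASSES))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a single (hypothetical) clip
clean, sr = torchaudio.load("zoom_clip.wav")       # hypothetical file
noise, _ = torchaudio.load("street_noise.wav")     # hypothetical file
noisy = mix_noise(clean, noise, snr_db=10.0)
x, y = audio_to_input(noisy, sr), torch.tensor([APP_CLASSES.index("zoom")])
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```

Under the same assumptions, DenseNet201 could be swapped in via models.densenet201, replacing model.classifier rather than model.fc with a new linear head.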


Data Availability

Sample audio files can be found at the link in [6], and detailed information about the database is provided in the supplementary file named MAuD-DataDetails. The entire database will be made available to researchers after acceptance of this paper.

References

  1. Aronowitz H, Aronowitz V (2010) Efficient score normalization for speaker recognition. In: ICASSP, IEEE International conference on acoustics, speech and signal processing - proceedings. pp. 4402–4405

  2. Bakkouri I, Afdel K (2020) Computer-aided diagnosis (CAD) system based on multi-layer feature fusion network for skin lesion recognition in dermoscopy images. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-019-07988-1

  3. Barai B, Chakraborty T, Das N, Basu S, Nasipuri M (2022) Closed-set speaker identification using VQ and GMM based models. International Journal of Speech Technology, Springer

  4. Campbell WM, Sturim DE, Reynolds DA (2006) Support vector machines using GMM supervectors for speaker verification. IEEE signal processing letters 13(5):308–311

  5. Canavan A, David G, George Z (1997) CALLHOME American English Speech. Linguistic Data Consortium, Philadelphia

  6. Chakraborty T (2021) Audio files recorded using different voice calling platforms. figshare media. https://doi.org/10.6084/m9.figshare.14731629.v1

  7. Chakraborty T, Barai B, Chatterjee B, Das N, Basu S, Nasipuri M (2020) Closed-set device-independent speaker identification using CNN. In: Bhateja V, Satapathy SC, Zhang YD, Aradhya VNM (eds) Intelligent Computing and Communication. Springer Singapore, Singapore, pp 291–299

  8. Dieleman S, Schrauwen B (2014) End-to-end learning for music audio. In: 2014 IEEE International conference on acoustics, speech and signal processing (ICASSP). pp. 6964–6968. IEEE

  9. Esmaeilpour M, Cardinal P, Koerich AL (2022) From environmental sound representation to robustness of 2D CNN models against adversarial attacks. Applied Acoustics 195:108817. https://www.sciencedirect.com/science/article/pii/S0003682X22001918

  10. Fujihara H, Kitahara T, Goto M, Komatani K, Ogata T, Okuno HG (2006) Speaker identification under noisy environments by using harmonic structure extraction and reliable frame weighting. In: Ninth international conference on spoken language processing

  11. Gemmeke JF, Ellis DPW, Freedman D, Jansen A, Lawrence W, Moore RC, Plakal M, Ritter M (2017) Audio set: An ontology and human-labeled dataset for audio events. In: Proc. IEEE ICASSP 2017. New Orleans, LA

  12. Ghahabi O, Hernando J (2018) Restricted Boltzmann machines for vector representation of speech in speaker recognition. Computer Speech & Language 47:16–29

  13. Godfrey JJ, Edward H (1993) Switchboard-1 release 2. Linguistic Data Consortium, Philadelphia

  14. Haris B, Pradhan G, Misra A, Shukla S, Sinha R, Prasanna S (2011) Multi-variability speech database for robust speaker recognition. In: Communications (NCC), 2011 national conference on. pp 1–5. IEEE

  15. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on computer vision and pattern recognition (CVPR). pp 770–778

  16. Hershey S, Chaudhuri S, Ellis DPW, Gemmeke JF, Jansen A, Moore C, Plakal M, Platt D, Saurous RA, Seybold B, Slaney M, Weiss R, Wilson K (2017) CNN architectures for large-scale audio classification. In: International conference on acoustics, speech and signal processing (ICASSP). arXiv:1609.09430

  17. Jumelle M, Sakmeche T (2018) Speaker clustering with neural networks and audio processing. arXiv:1803.08276

  18. Madikeri S, Bourlard H (2015) KL-HMM based speaker diarization system for meetings. In: 2015 IEEE International conference on acoustics, speech and signal processing (ICASSP). pp 4435–4439. IEEE

  19. McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, Nieto O (2015) librosa: Audio and music signal analysis in Python

  20. Mesaros A, Heittola T, Virtanen T (2016) TUT database for acoustic scene classification and sound event detection. In: 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary

  21. Piczak KJ (2015) ESC: Dataset for environmental sound classification. https://doi.org/10.7910/DVN/YDEPUT

  22. Rao KS, Sarkar S (2014) Robust speaker recognition in noisy environments. Springer

  23. Ren J, Hu Y, Tai YW, Wang C, Xu L, Sun W, Yan Q (2016) Look, listen and learn - a multimodal LSTM for speaker identification. In: Thirtieth AAAI Conference on Artificial Intelligence

  24. Robotham T, Singla A, Rummukainen OS, Raake A, Habets EAP (2022) Audiovisual database with 360° video and higher-order ambisonics audio for perception, cognition, behavior, and QoE evaluation research. In: 2022 14th International conference on quality of multimedia experience (QoMEX). pp 1–6

  25. Rose P (2006) Technical forensic speaker recognition: Evaluation, types and testing of evidence. Computer Speech & Language 20(2–3):159–191

  26. Salamon J, Jacoby C, Bello JP (2014) A dataset and taxonomy for urban sound research. 22nd ACM International Conference on Multimedia (ACM-MM’14). Orlando, FL, USA, pp 1041–1044

  27. Singh N, Khan R, Shree R (2012) Applications of speaker recognition. Procedia engineering 38:3122–3126

  28. Yamada T, Wang L, Kai A (2013) Improvement of distant-talking speaker identification using bottleneck features of DNN. In: Interspeech. pp 3661–3664

  29. Zheng J (2022) Construction and application of music audio database based on collaborative filtering algorithm. Discrete Dynamics in Nature and Society, Hindawi

Acknowledgements

This project is partially supported by the CMATER laboratory of the Computer Science and Engineering Department, Jadavpur University, India. We acknowledge the contribution of Ms Adrita Chakrabarti for assisting in the data collection process.

Author information

Corresponding author

Correspondence to Tapas Chakraborty.

Ethics declarations

Conflicts of interest

The authors state that there are no conflicts of interest. For this research work, we have not received any funding from any of the conferencing application providers. Our aim was to include all frequently used conferencing applications without any bias towards a particular corporate entity. We were not able to include some conferencing apps (for example, WhatsApp) because they do not have built-in recording features.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chakraborty, T., Bhattacharyya, R., Das, N. et al. MAuD: a multivariate audio database of samples collected from benchmark conferencing platforms. Multimed Tools Appl 83, 38465–38479 (2024). https://doi.org/10.1007/s11042-023-16879-5
