
DOI: 10.1145/3512527.3531393

GIO: A Timbre-informed Approach for Pitch Tracking in Highly Noisy Environments

Published: 27 June 2022

Abstract

As one of the fundamental tasks in music and speech signal processing, pitch tracking has attracted attention for decades. While humans can focus on a voiced pitch even in highly noisy environments, most existing automatic pitch tracking systems perform unsatisfactorily in the presence of noise. To mimic human auditory perception, this paper proposes a data-driven model named GIO, in which timbre information is introduced to guide pitch tracking. The proposed model takes two inputs: a short audio segment from which to extract pitch, and a timbre embedding derived from the speaker's or singer's voice. In the experiments, a music artist classification model is used to extract the timbre embedding vectors. A dual-branch structure and a two-step training method are designed to enable the model to predict voice presence. The experimental results show that the proposed model achieves a significant improvement in noise robustness and outperforms existing state-of-the-art methods with fewer parameters.
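
As a rough illustration of the idea the abstract describes, the sketch below shows a dual-branch network that consumes a noisy spectrogram patch together with a timbre embedding, and emits both pitch logits and a voice-presence logit. All names, layer sizes, the embedding dimension, and the gated conditioning scheme are assumptions made for illustration only; they are not the published GIO architecture.

```python
# Hypothetical sketch of a timbre-conditioned, dual-branch pitch tracker.
# Every layer size and the sigmoid-gate conditioning are illustrative
# assumptions, not the architecture from the paper.
import torch
import torch.nn as nn

class TimbreConditionedPitchTracker(nn.Module):
    def __init__(self, n_bins=360, timbre_dim=256):
        super().__init__()
        # Shared front end over a log-mel (or CQT) patch of the noisy audio.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
            nn.Flatten(),                      # -> (batch, 64 * 8 * 8)
        )
        # Project the speaker/singer timbre embedding into the same space
        # so it can steer the encoder features toward the target voice.
        self.timbre_proj = nn.Linear(timbre_dim, 64 * 8 * 8)
        # Branch 1: pitch classification over quantized frequency bins.
        self.pitch_head = nn.Sequential(
            nn.Linear(64 * 8 * 8, 512), nn.ReLU(), nn.Linear(512, n_bins))
        # Branch 2: voicing detection (is the target voice present at all?).
        self.voicing_head = nn.Sequential(
            nn.Linear(64 * 8 * 8, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, spec_patch, timbre_emb):
        h = self.encoder(spec_patch)
        # Gate encoder features with the timbre embedding (an assumed
        # conditioning mechanism; the paper may fuse them differently).
        h = h * torch.sigmoid(self.timbre_proj(timbre_emb))
        return self.pitch_head(h), self.voicing_head(h)

# Smoke test with dummy shapes: one 128x128 spectrogram patch, one embedding.
model = TimbreConditionedPitchTracker()
pitch_logits, voicing_logit = model(torch.randn(1, 1, 128, 128),
                                    torch.randn(1, 256))
print(pitch_logits.shape, voicing_logit.shape)  # (1, 360) and (1, 1)
```

Under this sketch, one plausible reading of the abstract's two-step training would be to first fit the pitch branch on voiced frames, then train the voicing branch to predict voice presence; that split is likewise an assumption about the described method, not a confirmed detail.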

Supplementary Material

MP4 File (ICMR22-fp190.mp4)
Presentation Video


Cited By

  • (2024) Visual signatures for music mood and timbre. The Visual Computer. https://doi.org/10.1007/s00371-024-03417-z. Online publication date: 31 May 2024.
  • (2023) TAPE: An End-to-End Timbre-Aware Pitch Estimator. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. https://doi.org/10.1109/ICASSP49357.2023.10096762. Online publication date: 4 June 2023.




Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022
714 pages
ISBN: 9781450392389
DOI: 10.1145/3512527
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 June 2022



Author Tags

  1. music information retrieval
  2. noise robustness
  3. pitch tracking
  4. timbre

Qualifiers

  • Research-article

Conference

ICMR '22

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%

Article Metrics

  • Downloads (last 12 months): 16
  • Downloads (last 6 weeks): 1
Reflects downloads up to 12 Nov 2024

