Transposition of Simple Waveforms from Raw Audio with Deep Learning

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13988)

Included in the following conference series:

International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar)


Abstract

A system that is able to automatically transpose an audio recording would have many potential applications, from music production to hearing aid design. We present a deep learning approach to transpose an audio recording directly from the raw time domain signal. We train recurrent neural networks with raw audio samples of simple waveforms (sine, square, triangle, sawtooth) covering the linear range of possible frequencies. We examine our generated transpositions for each musical semitone step size up to the octave and compare our results against two popular pitch shifting algorithms. Although our approach is able to accurately transpose the frequencies in a signal, these signals suffer from a significant amount of added noise. This work represents exploratory steps towards the development of a general deep transposition model able to quickly transpose to any desired spectral mapping.
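The training setup described in the abstract can be pictured with a small sketch: generate one of the four simple waveforms at a source frequency, then generate the same waveform at the frequency shifted by a number of equal-tempered semitones. This is an illustration only, not the authors' code. The sample rate, clip duration, function names, and the use of naive (non-band-limited) waveforms are all assumptions made here; the abstract does not specify them.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed value; the abstract does not state a sample rate


def semitone_ratio(steps: int) -> float:
    """Equal-temperament frequency ratio for a shift of `steps` semitones."""
    return 2.0 ** (steps / 12.0)


def waveform(kind: str, freq_hz: float, duration_s: float = 0.1,
             sample_rate: int = SAMPLE_RATE) -> np.ndarray:
    """Return a naive (not band-limited) waveform with values in [-1, 1]."""
    n = int(round(duration_s * sample_rate))
    t = np.arange(n) / sample_rate
    phase = (freq_hz * t) % 1.0  # position within the current cycle, in [0, 1)
    if kind == "sine":
        return np.sin(2.0 * np.pi * phase)
    if kind == "square":
        return np.where(phase < 0.5, 1.0, -1.0)
    if kind == "sawtooth":
        return 2.0 * phase - 1.0
    if kind == "triangle":
        return 1.0 - 4.0 * np.abs(phase - 0.5)
    raise ValueError(f"unknown waveform: {kind}")


def transposition_pair(kind: str, freq_hz: float, steps: int):
    """Hypothetical helper: (source, target) clips `steps` semitones apart."""
    source = waveform(kind, freq_hz)
    target = waveform(kind, freq_hz * semitone_ratio(steps))
    return source, target


# Example: a 440 Hz sine and its transposition up one octave (12 semitones).
src, tgt = transposition_pair("sine", 440.0, 12)
```

A pair like `(src, tgt)` is the kind of input and desired output a sequence model could be trained on for a single step size; how the paper actually frames, batches, or normalizes its samples is not stated in the abstract.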



Author information

Authors and Affiliations

Oregon State University, Corvallis, OR, 97331, USA
Patrick J. Donnelly & Parker Carlson


Corresponding author

Correspondence to Patrick J. Donnelly.

Editor information

Editors and Affiliations

University of Nottingham, Nottingham, UK
Colin Johnson
University of A Coruña, A Coruña, Spain
Nereida Rodríguez-Fernández
University of Coimbra, Coimbra, Portugal
Sérgio M. Rebelo


About this paper

Cite this paper

Donnelly, P.J., Carlson, P. (2023). Transposition of Simple Waveforms from Raw Audio with Deep Learning. In: Johnson, C., Rodríguez-Fernández, N., Rebelo, S.M. (eds) Artificial Intelligence in Music, Sound, Art and Design. EvoMUSART 2023. Lecture Notes in Computer Science, vol 13988. Springer, Cham. https://doi.org/10.1007/978-3-031-29956-8_22


DOI: https://doi.org/10.1007/978-3-031-29956-8_22
Published: 01 April 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-29955-1
Online ISBN: 978-3-031-29956-8
eBook Packages: Computer Science, Computer Science (R0)
