
Audiovisual Tool for Solfège Assessment

Published: 16 December 2016

Abstract

Solfège is a technique widely used in music education that involves the vocal performance of melodies, respecting the timing and duration of the musical sounds specified in the score, in coordination with meter-mimicking hand movements. This article presents an audiovisual approach for the automatic assessment of this important musical study practice. The proposed system combines the meter-mimicking gesture (video information) with the melodic transcription (audio information), where the hand movement works as a metronome, controlling the time flow (tempo) of the musical piece. Meter-mimicking is thus used to align the music score (ground truth) with the sung melody, allowing assessment even in scenarios with time-varying tempo. Audio analysis is applied to obtain the melodic transcription of the sung notes, and the solfège performances are evaluated by a set of Bayesian classifiers trained on real evaluations made by expert listeners.
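As a rough illustration of the alignment idea described in the abstract, the sketch below maps score positions given in beats to absolute time using the beat timestamps derived from the hand gesture, then compares each sung note against its ground-truth pitch. All function names and data layouts here are hypothetical, not the authors' implementation, and the pitch-tracker front end (e.g., YIN) is assumed to have already produced a frame-wise pitch track.

```python
def beats_to_seconds(beat_pos, beat_times):
    """Map a (possibly fractional) beat position to seconds by linear
    interpolation between the beat timestamps detected from the
    meter-mimicking hand gesture."""
    i = min(max(int(beat_pos), 0), len(beat_times) - 2)
    frac = beat_pos - i
    return beat_times[i] + frac * (beat_times[i + 1] - beat_times[i])

def align_notes(score, beat_times):
    """score: list of (onset_beat, duration_beats, midi_pitch) taken
    from the music score. Returns (onset_sec, offset_sec, midi_pitch)
    per note, so the ground truth follows the performer's tempo."""
    out = []
    for onset_b, dur_b, pitch in score:
        on = beats_to_seconds(onset_b, beat_times)
        off = beats_to_seconds(onset_b + dur_b, beat_times)
        out.append((on, off, pitch))
    return out

def median_pitch_error(aligned, f0_track):
    """f0_track: list of (time_sec, midi_pitch) frames from a pitch
    tracker. Returns the per-note median absolute deviation in
    semitones (None for notes with no voiced frames), a feature that
    could then feed a Bayesian performance classifier."""
    errors = []
    for on, off, pitch in aligned:
        frames = sorted(p for t, p in f0_track if on <= t < off and p > 0)
        if not frames:
            errors.append(None)
            continue
        med = frames[len(frames) // 2]
        errors.append(abs(med - pitch))
    return errors
```

Because the score is warped onto the gesture-derived beat grid rather than a fixed metronome, the comparison stays valid even when the student speeds up or slows down, which is the role the abstract assigns to meter-mimicking.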


Cited By

  • (2022) DeepSolfège: Recognizing Solfège Hand Signs Using Convolutional Neural Networks. In Advances in Visual Computing, 39–50. DOI: 10.1007/978-3-030-90439-5_4. Online publication date: 1 January 2022.



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 1
February 2017
278 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3012406
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 December 2016
Accepted: 01 October 2016
Revised: 01 July 2016
Received: 01 March 2016
Published in TOMM Volume 13, Issue 1


Author Tags

  1. Sight-singing
  2. Solfège
  3. automatic assessment
  4. melodic transcription
  5. meter-mimicking
  6. music education

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • CAPES Foundation
  • Ministry of Education of Brazil


