1 Introduction
Symbolic music datasets are important for music information retrieval (MIR) and musical analysis. In the Western music tradition, musicians write music using musical notation, which encodes pitches, rhythms, and chords, and musicologists analyse music works by reading this notation. Computers are now widely used in MIR to process and analyse music collections at scale. However, there is a lack of large-scale symbolic music datasets covering a wide range of solo piano works.
One difficulty of computer-based MIR is that musical notation such as staff notation is not directly readable by a computer. Therefore, converting music notation into computer-readable formats is important. Early work on converting music into symbolic representations can be traced back to the early 1900s, when piano rolls (; ) were developed to record music that could be played back on a musical instrument. Piano rolls are continuous rolls of paper with perforations punched into them. In 1981, the Musical Instrument Digital Interface (MIDI) () was proposed as a technical standard to represent music in a computer-readable form. MIDI files use event messages to specify musical instructions, including the pitch, onset, offset, and velocity of notes. MIDI files also carry rich information about music events such as sustain pedals. The MIDI format has been popular for music production in recent years.
In this work, we focus on building a large-scale MIDI dataset for classical solo piano music. There are several previous piano MIDI datasets, including the Piano-midi.de dataset, the MAESTRO dataset (), the Classical Archives dataset, and the Kunstderfuge dataset. However, those datasets are limited to hundreds of composers and hundreds of hours of unique works. MusicXML () is another symbolic music format, but there are fewer MusicXML datasets than MIDI datasets. Other machine-readable formats include the Music Encoding Initiative (MEI) (), Humdrum (), and LilyPond (). Optical music recognition (OMR) (; ) is a technique to transcribe score images into symbolic formats. However, the performance of OMR systems is limited by score quality.
In this article, we collect and transcribe a large-scale classical piano MIDI dataset called GiantMIDI-Piano. To our knowledge, GiantMIDI-Piano is the largest piano MIDI dataset so far. GiantMIDI-Piano is collected as follows: 1) We parse the names of composers and the names of music works from the International Music Score Library Project (IMSLP); 2) We search and download audio recordings of all matching music works from YouTube; 3) We build a solo piano detection system to detect solo piano recordings; 4) We transcribe solo piano recordings into MIDI files using a high-resolution piano transcription system (). In this article, we analyse the statistics of GiantMIDI-Piano, including the number of works, durations of works, and nationalities of composers. In addition, we analyse the statistics of note, interval, and chord distributions of six composers from different eras to show that GiantMIDI-Piano can be used for musical analysis.
1.1 Applications
The GiantMIDI-Piano dataset can be used in many research areas, including: 1) Computer-based musical analysis (; ) such as using computers to analyse the structure, chords, and melody of music works. 2) Symbolic music generation (; ). 3) Computer-based music information retrieval (; ) such as music transcription and music tagging. 4) Expressive performance analysis () such as analysing the performance of different pianists.
This paper is organised as follows: Section 2 surveys piano MIDI datasets; Section 3 introduces the collection of the GiantMIDI-Piano dataset; Section 4 investigates the statistics of the GiantMIDI-Piano dataset; Section 5 evaluates the quality of the GiantMIDI-Piano dataset; and Section 6 concludes this work.
2 Dataset Survey
We introduce several piano MIDI datasets as follows. The Piano-midi.de dataset contains classical solo piano works entered via a MIDI sequencer. As of Feb. 2020, Piano-midi.de contains 571 works composed by 26 composers, with a total duration of 36.7 hours. The Classical Archives collection contains a large number of MIDI files of classical music, including both piano and non-piano works; it covers 133 composers with a total duration of 46.3 hours of MIDI files. The KernScores dataset () contains classical music in the Humdrum format obtained by an optical music recognition system. The Kunstderfuge dataset contains solo piano and non-solo piano works of 598 composers. The Piano-midi.de, Classical Archives, and Kunstderfuge datasets are all entered using MIDI sequencers and are not played by pianists.
The MAPS dataset () used MIDI files from Piano-midi.de to render audio recordings by playing back the MIDI files on a Yamaha Disklavier. The MAESTRO dataset () contains over 200 hours of finely aligned MIDI files and audio recordings. In MAESTRO, virtuoso pianists performed on Yamaha Disklaviers and were recorded with the integrated MIDI capture system. MAESTRO contains music works from 62 composers. There are several duplicated works in MAESTRO. For example, there are 11 versions of Scherzo No. 2 in B-flat Minor, Op. 31 composed by Chopin. All duplicated works are removed when calculating the number and duration of works.
Table 1 shows the number of composers, the number of unique works, total durations, and data types of different MIDI datasets. Data types include sequenced (Seq.) MIDI files input using MIDI sequencers and performed (Perf.) MIDI files played by pianists. There are other MIDI datasets including the Lakh dataset (), the Bach Doodle dataset (), the Bach Chorales dataset (), the URMP dataset (), the Bach10 dataset (), the CrestMusePEDB dataset (), the SUPRA dataset (), and the ASAP dataset (). Huang et al. () collected 10,000 hours of piano recordings for music generation, but the dataset is not publicly available.
Table 1: The number of composers, unique works, total durations (hours), and MIDI types of several piano MIDI datasets.

DATASET | COMPOSERS | WORKS | HOURS | TYPE |
---|---|---|---|---|
Piano-midi.de | 26 | 571 | 37 | Seq. |
Classical Archives | 133 | 856 | 46 | Seq. |
Kunstderfuge | 598 | – | – | Seq. |
KernScores | – | – | – | Seq. |
SUPRA | 111 | 410 | – | Perf. |
ASAP | 16 | 222 | – | Perf. |
MAESTRO | 62 | 529 | 84 | Perf. |
MAPS | – | 270 | 19 | Perf. |
GiantMIDI-Piano | 2,786 | 10,855 | 1,237 | 90% Perf. |
Curated GP | 1,787 | 7,236 | 875 | 89% Perf. |
3 GiantMIDI-Piano Dataset
3.1 Metadata from IMSLP
To begin with, we acquire the names of composers and the names of music works by parsing the web pages of the International Music Score Library Project (IMSLP), the largest publicly available music score library in the world. In IMSLP, each composer has a web page containing a list of their works. We acquire 143,701 music works composed by 18,067 composers by parsing those web pages. For each composer, if there exists a biography link on the composer page, we access that biography link to search for their birth year, death year, and nationality. We set the birth year, death year, and nationality to “unknown” if a composer does not have such a biography link. We obtain the nationalities of 4,274 composers and the birth years of 5,981 composers out of 18,067 composers by automatically parsing the biography links.
As the automatically parsed meta-information of composers from the Internet is incomplete, we manually check the nationalities, birth years, and death years of 2,786 composers. We label 2,291 birth years, 2,254 death years, and 2,115 nationalities by searching for information about the composers on the Internet. We label birth years, death years, and nationalities that cannot be found as “unknown”. We create metadata files containing the information of composers and music works, respectively.
3.2 Search Audio
We search audio recordings on YouTube using a keyword built from the first name, surname, and music work name in the metadata. For each keyword, we select the first result returned by YouTube. However, the returned YouTube title may not exactly match the keyword. For example, for the keyword Frédéric Chopin, Scherzo No.2 Op.31, the top returned result can be Chopin – Scherzo No. 2, Op. 31 (Rubinstein). Although the keyword and the returned YouTube title are different, they indicate the same music work. We denote the set of words in a search keyword as X, and the set of words in a returned YouTube title as Y. We propose a modified Jaccard similarity () to evaluate how well a keyword and a returned result match.
The original Jaccard similarity is defined as J = |X ∩ Y| / |X ∪ Y|. The drawback of this definition is that the word set Y of a searched YouTube title can be large, so that J will be small. This is often the case because searched YouTube titles usually contain extra words such as the names of performers and the dates of performances. Our aim is to define a metric whose denominator only depends on the search keyword X and is independent of the length of the searched YouTube title Y. We propose a modified Jaccard similarity () between X and Y as:

J(X, Y) = |X ∩ Y| / |X|.
Higher J indicates that X and Y have larger similarity, and lower J indicates that X and Y have less similarity. We empirically set a similarity threshold to 0.6 to balance the precision and recall of searched results. If J is strictly larger than this threshold, then we say X and Y are matched; otherwise they are not matched. In total, we retrieve and download 60,724 audio recordings out of 143,701 music works.
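For concreteness, the following minimal Python sketch applies this matching step to the Chopin example above; the tokenization (lower-casing and splitting on word characters) is our assumption and not necessarily how the released GiantMIDI-Piano code normalises titles.

```python
import re

def modified_jaccard(keyword: str, title: str) -> float:
    """Modified Jaccard similarity |X ∩ Y| / |X| between word sets."""
    def words(s: str) -> set:
        return set(re.findall(r"\w+", s.lower()))
    x, y = words(keyword), words(title)
    return len(x & y) / len(x) if x else 0.0

keyword = "Frédéric Chopin, Scherzo No.2 Op.31"
title = "Chopin – Scherzo No. 2, Op. 31 (Rubinstein)"
similarity = modified_jaccard(keyword, title)
print(round(similarity, 2), similarity > 0.6)  # 0.86 True: the pair is kept
```

Because the denominator only counts the keyword words, extra words in the YouTube title (here the performer's name) do not lower the score.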
3.3 Solo Piano Detection
We detect solo piano works from IMSLP to build the GiantMIDI-Piano dataset. Filtering music works by whether their titles contain the word “piano” may lead to incorrect results. For example, a “Piano Concerto” is an ensemble of piano and orchestra, which is not solo piano. On the other hand, the keyword Chopin, Frédéric, Nocturnes, Op.62 does not contain the word “piano”, but the work is indeed solo piano. To address this problem, we train an audio-based solo piano detection system using a convolutional neural network (CNN) (). The piano detection system splits each recording into 1-second segments and extracts their log mel spectrograms as input to the CNN.
The CNN consists of four convolutional layers. Each convolutional layer consists of a linear convolution with a 3 × 3 kernel, batch normalization (), and a ReLU nonlinearity (). The output of the CNN predicts the solo piano probability of a segment, and binary cross-entropy is used as the loss function to train the CNN. We collect solo piano recordings as positive samples, and other music and sounds as negative samples. In addition, mixtures of piano and other sounds are also used as negative samples. At inference time, we average the predictions over all 1-second segments of a recording to calculate its solo piano probability. We regard an audio recording as solo piano if this probability is strictly larger than 0.5, and as non-solo piano otherwise. In total, we obtain 10,855 solo piano recordings, composed by 2,786 composers, out of 60,724 downloaded audio recordings. These 10,855 audio files are transcribed into MIDI files, which constitute the full GiantMIDI-Piano dataset.
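A minimal PyTorch sketch of such a classifier is shown below, assuming four 3 × 3 convolutional layers with batch normalization and ReLU as described above; the channel sizes, pooling, and input shape are illustrative assumptions rather than the exact released configuration.

```python
import torch
import torch.nn as nn

class SoloPianoCNN(nn.Module):
    def __init__(self, in_channels: int = 1):
        super().__init__()
        layers, channels = [], [in_channels, 32, 64, 128, 128]
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU(),
                       nn.AvgPool2d(2)]
        self.cnn = nn.Sequential(*layers)
        self.fc = nn.Linear(channels[-1], 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, mel_bins), log mel spectrograms of 1-s segments
        h = self.cnn(x).mean(dim=(2, 3))               # global average pooling
        return torch.sigmoid(self.fc(h)).squeeze(1)    # solo piano probability

model = SoloPianoCNN()
segments = torch.randn(8, 1, 100, 64)    # eight 1-second segments of one recording
recording_prob = model(segments).mean()  # average the segment probabilities
is_solo_piano = recording_prob.item() > 0.5
```

Training would minimise the binary cross-entropy between segment probabilities and the solo piano labels; at inference, the averaged probability is compared against the 0.5 threshold as described above.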
3.4 Constrain Composer Surnames
Among the detected 10,855 solo piano works, there are several music works composed by lesser-known composers but incorrectly attributed to famous composers. For example, there are 273 searched music works assigned to Chopin, but only 102 of them are actually composed by Chopin, while the others are composed by other composers. To alleviate this problem, we create a curated subset by constraining the titles of downloaded audio recordings to contain the surnames of composers. After this constraint, we obtain a curated GiantMIDI-Piano dataset containing 7,236 music works composed by 1,787 composers.
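The constraint itself amounts to a case-insensitive substring check, sketched below with illustrative titles; the outcome of this check is recorded in the released metadata (Section 5.2).

```python
def title_contains_surname(youtube_title: str, surname: str) -> bool:
    """Keep a recording in the curated subset only if the title names the composer."""
    return surname.lower() in youtube_title.lower()

print(title_contains_surname("Chopin - Scherzo No. 2, Op. 31 (Rubinstein)", "Chopin"))  # True
print(title_contains_surname("Nocturne No. 1 (solo piano)", "Chartier"))                # False
```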
3.5 Piano Transcription
We transcribe all 10,855 solo piano recordings into MIDI files using an open-sourced high-resolution piano transcription system (), an improvement over the onsets and frames piano transcription system (, ) and other systems (; ). The piano transcription system is trained on the training subset of the MAESTRO dataset version 2.0.0 (). The training and testing subsets contain 161.3 and 20.5 hours of aligned piano recordings and MIDI files, respectively. The piano transcription system predicts the pitch, onset, offset, and velocity of each note, and the transcribed results also include sustain pedals. For piano note transcription, the system consists of a frame-wise classification sub-module and onset, offset, and velocity regression sub-modules. Each sub-module is modeled by a convolutional recurrent neural network (CRNN) with eight convolutional layers and two bi-directional gated recurrent unit (GRU) layers. The output of each sub-module has a dimension of 88, corresponding to the 88 keys of a modern piano.
The pedal transcription system has the same architecture as the note transcription system, except that there is only one output after the CRNN sub-module, indicating the onset or offset probabilities of the sustain pedal. At inference time, all piano recordings are converted to mono with a sampling rate of 16 kHz. We use a short-time Fourier transform (STFT) with a Hann window of size 2048 and a hop size of 160 to extract spectrograms, so there are 100 frames per second. Then, mel filter banks with 229 bins are used to extract a log mel spectrogram as the input feature (). The transcription system outputs frame-wise predictions of pitch, onset, offset, and velocity, which are finally post-processed into MIDI events.
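A minimal librosa sketch of this feature extraction is given below; the sampling rate, window, hop size, and number of mel bins follow the description above, while the mel frequency range and the floor of the logarithm are our assumptions.

```python
import numpy as np
import librosa

def extract_logmel(path: str) -> np.ndarray:
    audio, sr = librosa.load(path, sr=16000, mono=True)        # mono, 16 kHz
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=2048, hop_length=160,             # 100 frames per second
        window="hann", n_mels=229, fmin=30.0, fmax=8000.0)
    return np.log(np.clip(mel, 1e-10, None)).T                  # shape: (frames, 229)

# logmel = extract_logmel("recording.wav")  # hypothetical audio file
```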
4 Statistics
We analyse the statistics of GiantMIDI-Piano, including the number and duration of music works composed by different composers, the nationalities of composers, and the distribution of notes by composers. Then, we investigate the statistics of six composers from different eras by calculating their pitch class, interval, trichord, and tetrachord distributions. All of Figure 1 to Figure 11, except Figure 3, are plotted with the statistics of the curated GiantMIDI-Piano dataset. Figure 3 shows the manually checked nationalities of 2,786 composers in the full GiantMIDI-Piano dataset.
4.1 The Number of Solo Piano Works
Figure 1 shows the number of piano works composed by each composer, sorted in descending order, for the curated GiantMIDI-Piano dataset. Figure 1 shows the statistics of the top 100 of the 1,787 composers in the curated subset. Blue bars show the number of solo piano works. Pink bars show the total number of works, including both solo piano and non-solo piano works. Figure 1 shows that there are 141 solo piano works composed by Liszt, followed by 140 and 129 solo piano works composed by Scarlatti and J. S. Bach, respectively. Some composers, such as Chopin, composed more solo piano works than non-solo piano works. For example, there are 96 solo piano works out of 109 works in total composed by Chopin in the curated GiantMIDI-Piano dataset. Figure 1 shows that the number of solo piano works per composer follows a long-tailed distribution.
4.2 The Duration of Solo Piano Works
Figure 2 shows the duration of solo piano works for each composer, sorted in descending order, for the curated GiantMIDI-Piano dataset. The duration of works composed by Liszt is the longest at 25 hours, followed by Beethoven at 21 hours and Schubert at 20 hours. Some composers composed more non-piano works than solo piano works. For example, there are 108 hours of works composed by Handel in the dataset, while only 2 hours of them are played on a modern piano. The rank of composers in Figure 2 is different from Figure 1, indicating that the average durations of solo piano works composed by different composers are different.
4.3 Nationalities of Composers
Figure 3 shows the number of composers of each nationality, sorted in descending order, for the full GiantMIDI-Piano dataset. The nationalities of 2,786 composers are initially obtained from Wikipedia and are later manually checked. Figure 3 shows that there are 671 composers with unknown nationality. There are 364 German composers, followed by 322 French composers and 267 American composers. We color-code nationalities by continent: “Unknown”, “European”, “North American”, “South American”, “Asian”, and “African”. In GiantMIDI-Piano, the nationalities of most composers are European. There are fewer composers with nationalities from South America, Asia, and Africa.
4.4 Note Histogram
Figure 4 shows the note histogram of the curated GiantMIDI-Piano dataset, which contains 24,253,495 transcribed notes. The horizontal axis shows scientific pitch notation, covering the 88 notes of a modern piano from A0 to C8; middle C is denoted as C4. We do not distinguish enharmonic notes; for example, a note C♯/D♭ is simply denoted as C♯. The white and black bars correspond to the white and black keys on a modern piano, respectively. Figure 4 shows that the note histogram approximately follows a bell-shaped distribution. The most played note is G4; there are more notes close to G4 and fewer notes far from G4. The most played notes are within the octave between C4 and C5. White keys are played more often than black keys.
Figure 5 visualizes the note histograms of three composers from different eras: J. S. Bach, Beethoven, and Liszt. The note range of J. S. Bach is mostly between C2 and C6 covering four octaves, which is consistent with the note range of a conventional harpsichord or organ. The note range of Beethoven is mostly between F1 and C7 covering five and a half octaves. The note range of Liszt is the widest, covering the whole range of a modern piano.
4.5 Pitch Distribution of the Top 100 Composers
Figure 6 shows the pitch distribution sorted in ascending order of average pitch for the top 100 composers in Figure 2 from the curated GiantMIDI-Piano dataset. The average pitches of most composers are between C4 and C5, where C4 corresponds to a MIDI pitch value 60. The shades indicate the one standard deviation area of pitch distributions. Jeffrey Michael Harrington has the lowest average pitch value of C4. Carl Czerny has the highest average pitch value of A4.
4.6 The Number of Notes Per Second Distribution of the Top 100 Composers
Figure 7 shows the number of notes per second distribution, sorted in ascending order, for the top 100 composers in Figure 2 from the curated GiantMIDI-Piano dataset. The number of notes per second is calculated by dividing the number of notes in all works by the total duration of all works of a composer. The average numbers of notes per second of most composers are between 5 and 10. The shades indicate the one standard deviation area of the number of notes per second distribution. Alfred Grünfeld has the smallest number of notes per second with a value of 4.18. Carl Czerny has the largest number of notes per second with a value of 13.61.
4.7 Pitch Class Distribution
We denote the set of pitch classes as {C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B}. The notes from C to B are denoted as 0 to 11 (), respectively. We calculate the statistics of six composers from different eras including J. S. Bach, Mozart, Beethoven, Chopin, Liszt, and Debussy. Figure 8 shows that J. S. Bach used D, E, G, and A most in his solo piano works. Mozart used C, D, F, and G most in his solo piano works and used more A♯/B♭ than other composers. Beethoven used more C, D, and G than other notes. Chopin used D♯/E♭ and G♯/A♭ most in his solo piano works. Liszt and Debussy used all twelve pitch classes more uniformly in their solo piano works than other composers. Liszt used E most, and Debussy used C♯/D♭ most. As expected, most Baroque and Classical solo piano works were in keys close to C, whereas Romantic and later composers explored distant keys and tended to use all twelve pitch classes more uniformly.
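These pitch class statistics can be reproduced from the released MIDI files; the following sketch, assuming the pretty_midi package and a hypothetical file name, counts pitch classes by taking each MIDI pitch modulo 12.

```python
import numpy as np
import pretty_midi

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class_distribution(midi_path: str) -> np.ndarray:
    midi = pretty_midi.PrettyMIDI(midi_path)
    counts = np.zeros(12)
    for instrument in midi.instruments:
        for note in instrument.notes:
            counts[note.pitch % 12] += 1   # map MIDI pitch to pitch class 0-11
    return counts / counts.sum() if counts.sum() else counts

# dist = pitch_class_distribution("Chopin_Nocturne_Op9_No2.mid")  # hypothetical file
# print(dict(zip(PITCH_CLASSES, dist.round(3))))
```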
4.8 Interval Distribution
An interval is the pitch difference between two notes. Intervals can be either melodic or harmonic. A harmonic interval is the pitch difference between two notes played at the same time. A melodic interval is the pitch difference between two successive notes. We consider both harmonic intervals and melodic intervals as intervals, and we calculate the distribution of intervals of the six composers. Notes are represented as a list of events in a MIDI format. We calculate an interval as:

Δ = y_{n+1} − y_n,

where y_n is the MIDI pitch of the n-th note in the list.
We calculate ordered intervals, including both positive (upward) and negative (downward) intervals. For example, the interval Δ for an upward progression from C4 to D4 is 2, and the interval Δ for a downward progression from C4 to A3 is –3. We only consider intervals between –11 and 11 (inclusive) and discard intervals outside this range. For example, the value 11 indicates an upward major seventh. Figure 9 shows the interval distribution of the six composers. All composers used major seconds and minor thirds most in their works. The interval distribution is not symmetric about the origin; for example, J. S. Bach and Mozart used more downward major seconds than upward major seconds. In the works of J. S. Bach, the dip at interval 0 indicates that repeated notes are used less commonly than non-repeated notes; the other composers used repeated notes more than J. S. Bach. Major sevenths and tritones are the least used intervals for all composers. Some Romantic and later composers, including Chopin, Liszt, and Debussy, used all intervals more uniformly than J. S. Bach from the Baroque era.
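A minimal sketch of the interval histogram is shown below, assuming the transcribed notes of a work are given as MIDI pitches ordered by onset time; intervals outside [−11, 11] are discarded as described above.

```python
import numpy as np

def interval_histogram(pitches) -> np.ndarray:
    """Histogram of intervals Δ = y_{n+1} − y_n, restricted to [-11, 11]."""
    deltas = np.diff(np.asarray(pitches))
    deltas = deltas[(deltas >= -11) & (deltas <= 11)]
    counts = np.zeros(23)                         # one bin per interval from -11 to 11
    for d in deltas:
        counts[int(d) + 11] += 1
    return counts / counts.sum() if counts.sum() else counts

# C4 -> D4 is +2 (upward major second); D4 -> C4 is -2; C4 -> A3 is -3
print(interval_histogram([60, 62, 60, 57]).round(2))
```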
4.9 Trichord Distribution
We adopt musical set theory () to analyse the chord distribution in GiantMIDI-Piano. A trichord is a set of any three pitch classes (). Since GiantMIDI-Piano is transcribed from real recordings, the notes of a chord are usually not played exactly simultaneously. We therefore consider notes with onsets within a 50 ms window as one chord. The windows are non-overlapping, so each note belongs to at most one chord. For a special case of onsets at 0, 25, 50, 75, and 100 ms, our system first searches for chords in a window starting at 0 ms and returns {0, 25, 50}; it then searches for chords in a window starting at 75 ms and returns {75, 100}. We discard pitch sets with more or fewer than three notes within a 50 ms window. A major triad can be written as {0, 4, 7}, where the interval between 0 and 4 is a major third and the interval between 4 and 7 is a minor third. We transpose all chords so that their lowest pitch class is C; for example, a chord {2, 6, 9} is transposed into {0, 4, 7}. We then merge chords with the same prime form. Figure 10 shows the trichord distribution of the six composers. All composers used the major triad {0, 4, 7} most, followed by the minor triad {0, 3, 7}. Liszt used more augmented triads {0, 4, 8} than the other composers. Debussy used more {0, 2, 5} and {0, 2, 4} than the other composers, which distinguishes him from them. Figure 10 shows that composers from different eras have different preferences for trichords.
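The trichord counting can be sketched as follows, assuming the transcribed notes of a work are given as (onset, MIDI pitch) pairs; the grouping into non-overlapping 50 ms windows and the transposition so that the lowest pitch class becomes 0 follow the description above, while requiring three distinct pitch classes is our simplifying assumption.

```python
from collections import Counter

def trichord_counts(notes, window: float = 0.05) -> Counter:
    """notes: list of (onset_seconds, midi_pitch) pairs for one work."""
    notes = sorted(notes)
    counts, i = Counter(), 0
    while i < len(notes):
        start = notes[i][0]
        group = [p for t, p in notes[i:] if t - start < window]   # one 50 ms window
        i += len(group)                                            # windows do not overlap
        pcs = sorted({p % 12 for p in group})
        if len(group) == 3 and len(pcs) == 3:
            counts[tuple((pc - pcs[0]) % 12 for pc in pcs)] += 1   # transpose lowest to 0
    return counts

# A C major triad (C4, E4, G4) played within 30 ms is counted as (0, 4, 7)
print(trichord_counts([(0.00, 60), (0.01, 64), (0.02, 67), (1.0, 62)]))
```

Replacing the three-note condition with a four-note condition gives the tetrachord counts used in the next subsection.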
4.10 Tetrachord Distribution
A tetrachord is a set of any four pitch classes (). Similar to trichords, we only consider notes with onsets within a 50 ms window as a chord and discard pitch sets with more or fewer than four notes within the window. A dominant seventh chord can be denoted as {0, 4, 7, 10}, and seventh chords such as {0, 2, 6, 9} are transposed to the root position {0, 4, 7, 10}. Figure 11 shows the tetrachord distributions of the six composers. J. S. Bach, Beethoven, Mozart, and Chopin used the dominant seventh {0, 4, 7, 10} most. Liszt used the diminished seventh {0, 3, 6, 9} most, and Debussy used the minor seventh {0, 3, 7, 10} most. J. S. Bach used the dominant seventh less than the other five composers. The tetrachord distribution of Debussy differs from those of the other composers. Figure 11 shows that composers from different eras have different preferences for tetrachords.
5 Evaluation of GiantMIDI-Piano
5.1 Solo Piano Evaluation
We evaluate the solo piano detection system as follows. We manually label 200 randomly selected music works from 60,724 downloaded audio recordings. We calculate the precision, recall, and F1 scores of the solo piano detection system with different thresholds ranging from 0.1 to 0.9 and show results in Figure 12. Horizontal and vertical axes show different thresholds and scores, respectively. Figure 12 shows that higher thresholds lead to higher precision but lower recall. When we set the threshold to 0.5, the solo piano detection system achieves a precision, recall, and F1 score of 89.66%, 86.67%, and 88.14%, respectively. In this work, we set the threshold to 0.5 to balance the precision and recall for solo piano detection.
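The threshold sweep can be reproduced with a few lines of scikit-learn, assuming y_true holds the 200 manual labels and y_prob the corresponding probabilities output by the detection system; both variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def sweep(y_true, y_prob, thresholds=np.arange(0.1, 1.0, 0.1)):
    """Precision, recall, and F1 of the detector at each decision threshold."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    scores = []
    for th in thresholds:
        y_pred = (y_prob > th).astype(int)
        scores.append((round(float(th), 1),
                       precision_score(y_true, y_pred, zero_division=0),
                       recall_score(y_true, y_pred, zero_division=0),
                       f1_score(y_true, y_pred, zero_division=0)))
    return scores

# for th, p, r, f1 in sweep(y_true, y_prob): print(th, p, r, f1)
```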
5.2 Metadata Evaluation
We randomly select 200 solo piano works from the full GiantMIDI-Piano dataset and manually check how many audio recordings and metadata are matched. We observe that 174 out of 200 solo piano works are correctly matched, leading to a metadata accuracy of 87%. Most errors are caused by mismatched composer names. For example, when the keyword X is Chartier, Mathieu, Nocturne No.1 composed by Chartier, the retrieved YouTube title Y is Nocturne No. 1 composed by Chopin. After constraining surnames, 136 out of 140 solo piano works are correctly matched, leading to a precision of 97.14%. We also observe that there are 180 live performances and 20 sequenced MIDI files out of the 200 solo piano works.
Furthermore, Table 2 shows the number of matched music works composed by six different composers. Correct indicates that the retrieved solo piano works are indeed composed by the composer. Incorrect indicates that the retrieved music works are not composed by the composer but by someone else, and are incorrectly attributed to that composer. Without the surname constraint, Liszt achieves the highest match accuracy of 90%, while Chopin achieves the lowest match accuracy of 37%. With the surname constraint, Table 3 shows that the match accuracy of Chopin increases from 37% to 82%; the accuracies of the other composers also increase. The curated GiantMIDI-Piano dataset contains 7,236 MIDI files composed by 1,787 composers. We use a youtube_title_contains_surname flag in the metadata file to indicate whether the surname is verified.
Table 2: The number of correctly and incorrectly matched works of six composers without the surname constraint.

| | J. S. BACH | MOZART | BEETHOVEN | CHOPIN | LISZT | DEBUSSY |
|---|---|---|---|---|---|---|
Correct | 147 | 85 | 82 | 102 | 197 | 29 |
Incorrect | 102 | 35 | 70 | 171 | 22 | 9 |
Accuracy | 59% | 71% | 54% | 37% | 90% | 76% |
Table 3: The number of correctly and incorrectly matched works of six composers with the surname constraint.

| | J. S. BACH | MOZART | BEETHOVEN | CHOPIN | LISZT | DEBUSSY |
|---|---|---|---|---|---|---|
Correct | 129 | 72 | 76 | 96 | 141 | 27 |
Incorrect | 44 | 16 | 5 | 21 | 6 | 3 |
Accuracy | 75% | 82% | 94% | 82% | 96% | 90% |
5.3 Piano Transcription Evaluation
The piano transcription system achieves a state-of-the-art onset F1 score of 96.72%, an onset and offset F1 score of 82.47%, and an onset, offset, and velocity F1 score of 80.92% on the test set of the MAESTRO dataset. The sustain pedal transcription system achieves an onset F1 of 91.86%, and a sustain-pedal onset and offset F1 of 86.58%. The piano transcription system outperforms the previous onsets and frames system (, ) with an onset F1 score of 94.80%.
We evaluate the quality of GiantMIDI-Piano on 52 music works that appear in all of the GiantMIDI-Piano, the MAESTRO, and the Kunstderfuge datasets. Long music works such as Sonatas are split into movements. Repeated music sections are removed. Evaluating GiantMIDI-Piano is a challenging problem because there are no aligned ground-truth MIDI files, so the metrics of Hawthorne et al. () are not usable. In this work, we propose to use an alignment metric () called error rate (ER) to evaluate the quality of transcribed MIDI files. This metric reflects the substitutions, deletions, and insertions between a transcribed MIDI file and a target MIDI file. For a solo piano work, we align a transcribed MIDI file with its sequenced MIDI version using a hidden Markov model (HMM) tool (), where the sequenced MIDI files are from the Kunstderfuge dataset. The ER is defined as the normalized summation of substitutions, insertions, and deletions:
ER = (S + D + I) / N,

where N is the number of reference notes, and S, I, and D are the numbers of substitutions, insertions, and deletions, respectively. A substitution indicates that a note replaces a ground truth note. An insertion indicates that an extra note is played. A deletion indicates that a note is missing. Lower ER indicates better transcription performance.
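As a worked sketch, once a note-level alignment (here, the HMM-based aligner mentioned above) has produced the counts, the ER is a single division; the numbers below are illustrative only.

```python
def error_rate(substitutions: int, deletions: int, insertions: int,
               n_reference_notes: int) -> float:
    """ER = (S + D + I) / N."""
    return (substitutions + deletions + insertions) / n_reference_notes

# Illustrative counts: 5 substitutions, 12 deletions, and 8 insertions
# against 500 reference notes give ER = 25 / 500 = 0.05.
print(error_rate(5, 12, 8, 500))  # 0.05
```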
The ER of music works from GiantMIDI-Piano consists of three parts: 1) performance errors, 2) transcription errors, and 3) alignment errors:

ER_G = E_G^perf + E_G^trans + E_G^align, (4)

where the subscript G is the abbreviation for GiantMIDI-Piano. The performance errors E_G^perf come from a pianist accidentally missing or adding notes while performing (). The transcription errors E_G^trans come from piano transcription system errors. The alignment errors E_G^align come from the sequence alignment algorithm (). Audio recordings and MIDI files are perfectly aligned in the MAESTRO dataset, so there are no transcription errors. The ER of MAESTRO can be written as:

ER_M = E_M^perf + E_M^align, (5)

where the subscript M is the abbreviation for MAESTRO. For a given music work, we assume the approximation E_G^perf ≈ E_M^perf despite the differences in performance among pianists. Similarly, we assume the approximation E_G^align ≈ E_M^align, although the alignment errors can be different. Those approximations are more accurate when the skill levels of the two pianists are closer. Then, we propose a relative error by subtracting (5) from (4):

r = ER_G − ER_M ≈ E_G^trans.

The relative error r is a rough approximation of the transcription errors E_G^trans.
A lower r value indicates better transcription quality. Table 4 shows the alignment performance. The median alignment S_M, D_M, I_M, and ER_M on the MAESTRO dataset are 0.009, 0.024, 0.021, and 0.061, respectively. The median alignment S_G, D_G, I_G, and ER_G on the GiantMIDI-Piano dataset are 0.015, 0.051, 0.069, and 0.154, respectively. The relative error r between MAESTRO and GiantMIDI-Piano is 0.094. The first column of Figure 13 shows the box plot metrics of MAESTRO. Some outliers, mostly caused by different interpretations of trills and tremolos, are omitted from the figures for better visualization. The second column of Figure 13 shows the box plot metrics of GiantMIDI-Piano. In GiantMIDI-Piano, Keyboard Sonata in E-flat Major, Hob. XVI/49 composed by Haydn achieves the lowest ER of 0.037, while Prelude and Fugue in A-flat Major, BWV 862 composed by Bach achieves the highest ER of 0.679 (an outlier beyond the plot range). This underperformance is due to the piano in that recording not being tuned to the standard pitch of A4 = 440 Hz. The third column of Figure 13 shows the relative ER between MAESTRO and GiantMIDI-Piano. The relative median scores of S, D, I, and ER are 0.006, 0.026, 0.047, and 0.094, respectively. Figure 13 also shows that there are fewer deletions than insertions.
Table 4: Median deletion (D), insertion (I), substitution (S), and error rates (ER) of aligned works in the MAESTRO and GiantMIDI-Piano datasets, and their relative difference.

| | D | I | S | ER |
|---|---|---|---|---|
MAESTRO | 0.009 | 0.024 | 0.018 | 0.061 |
GiantMIDI-Piano | 0.015 | 0.051 | 0.069 | 0.154 |
Relative difference | 0.006 | 0.026 | 0.047 | 0.094 |
6 Conclusion
We collect and transcribe the large-scale GiantMIDI-Piano dataset containing 38,700,838 transcribed piano notes from 10,855 unique classical piano works composed by 2,786 composers. The total duration of GiantMIDI-Piano is 1,237 hours. The curated subset contains 24,253,495 piano notes from 7,236 works composed by 1,787 composers. GiantMIDI-Piano is transcribed from YouTube audio recordings searched using meta-information from IMSLP.
The solo piano detection system used in GiantMIDI-Piano achieves an F1 score of 88.14%, and the piano transcription system achieves a relative error rate of 0.094. The limitations of GiantMIDI-Piano include: 1) There are no pitch spellings to distinguish enharmonic notes; 2) GiantMIDI-Piano does not provide beats, time signatures, key signatures, and scores; and 3) GiantMIDI-Piano does not disentangle the music score and the expressive performance of pianists.
We have released the source code for acquiring GiantMIDI-Piano. In the future, GiantMIDI-Piano can be used in many research areas, including but not limited to musical analysis, music generation, music information retrieval, and expressive performance analysis.