1 Introduction
Symbolic music datasets are important for music information retrieval (MIR) and musical analysis. In the Western music tradition, musicians write music using musical notation, which encodes pitches, rhythms, and chords, and musicologists analyse music works by reading this notation. Computers are now widely used in MIR to process and analyse music collections at scale. However, there is a lack of large-scale symbolic music datasets covering a wide range of solo piano works.
One difficulty of computer-based MIR is that musical notation such as staff notation is not directly readable by a computer. Therefore, converting music notation into computer-readable formats is important. Early work on converting music into symbolic representations can be traced back to the early 1900s, when piano rolls (; ) were developed to record music that could be played back on a musical instrument. Piano rolls are continuous rolls of paper with perforations punched into them. In 1981, the Musical Instrument Digital Interface (MIDI) () was proposed as a technical standard to represent music in a computer-readable form. MIDI files use event messages to specify musical instructions, including the pitch, onset, offset, and velocity of notes. MIDI files also carry rich information about music events such as sustain pedals. The MIDI format has been popular for music production in recent years.
In this work, we focus on building a large-scale MIDI dataset for classical solo piano music. There are several previous piano MIDI datasets, including the Piano-midi.de dataset, the MAESTRO dataset (), the Classical Archives dataset, and the Kunstderfuge dataset. However, those datasets are limited to hundreds of composers and hundreds of hours of unique works. MusicXML () is another symbolic music format, but there are fewer MusicXML datasets than MIDI datasets. Other machine-readable formats include the Music Encoding Initiative (MEI) (), Humdrum (), and LilyPond (). Optical music recognition (OMR) (; ) is a technique to transcribe score images into symbolic formats. However, the performance of OMR systems is limited by score quality.
In this article, we collect and transcribe a large-scale classical piano MIDI dataset called GiantMIDI-Piano. To our knowledge, GiantMIDI-Piano is the largest piano MIDI dataset so far. GiantMIDI-Piano is collected as follows: 1) We parse the names of composers and the names of music works from the International Music Score Library Project (IMSLP); 2) We search and download audio recordings of all matching music works from YouTube; 3) We build a solo piano detection system to detect solo piano recordings; 4) We transcribe solo piano recordings into MIDI files using a high-resolution piano transcription system (). In this article, we analyse the statistics of GiantMIDI-Piano, including the number of works, durations of works, and nationalities of composers. In addition, we analyse the statistics of note, interval, and chord distributions of six composers from different eras to show that GiantMIDI-Piano can be used for musical analysis.
1.1 Applications
The GiantMIDI-Piano dataset can be used in many research areas, including: 1) Computer-based musical analysis (; ) such as using computers to analyse the structure, chords, and melody of music works. 2) Symbolic music generation (; ). 3) Computer-based music information retrieval (; ) such as music transcription and music tagging. 4) Expressive performance analysis () such as analysing the performance of different pianists.
This paper is organised as follows: Section 2 surveys piano MIDI datasets; Section 3 introduces the collection of the GiantMIDI-Piano dataset; Section 4 investigates the statistics of the GiantMIDI-Piano dataset; Section 5 evaluates the quality of the GiantMIDI-Piano dataset; and Section 6 concludes this work.
2 Dataset Survey
We introduce several piano MIDI datasets as follows. The Piano-midi.de dataset contains classical solo piano works entered via a MIDI sequencer. As of Feb. 2020, Piano-midi.de contains 571 works composed by 26 composers, with a total duration of 36.7 hours. The Classical Archives collection contains a large number of MIDI files of classical music, including both piano and non-piano works; it covers 133 composers with a total duration of 46.3 hours of MIDI files. The KernScores dataset () contains classical music in the Humdrum format obtained by an optical music recognition system. The Kunstderfuge dataset contains solo piano and non-solo piano works of 598 composers. The Piano-midi.de, Classical Archives, and Kunstderfuge datasets are all entered using MIDI sequencers and are not played by pianists.
The MAPS dataset () used MIDI files from Piano-midi.de to render audio recordings by playing back the MIDI files on a Yamaha Disklavier. The MAESTRO dataset () contains over 200 hours of finely aligned MIDI files and audio recordings. In MAESTRO, virtuoso pianists performed on Yamaha Disklaviers and were recorded with the integrated MIDI capture system. MAESTRO contains music works from 62 composers. There are several duplicated works in MAESTRO. For example, there are 11 versions of Scherzo No. 2 in B-flat Minor, Op. 31 composed by Chopin. All duplicated works are removed when calculating the number and duration of works.
Table 1 shows the number of composers, the number of unique works, total durations, and data types of different MIDI datasets. Data types include sequenced (Seq.) MIDI files input using MIDI sequencers and performed (Perf.) MIDI files played by pianists. There are other MIDI datasets including the Lakh dataset (), the Bach Doodle dataset (), the Bach Chorales dataset (), the URMP dataset (), the Bach10 dataset (), the CrestMusePEDB dataset (), the SUPRA dataset (), and the ASAP dataset (). Huang et al. () collected 10,000 hours of piano recordings for music generation, but the dataset is not publicly available.
Table 1: The number of composers, unique works, total durations (hours), and MIDI types of several piano MIDI datasets.

DATASET | COMPOSERS | WORKS | HOURS | TYPE |
---|---|---|---|---|
Piano-midi.de | 26 | 571 | 37 | Seq. |
Classical Archives | 133 | 856 | 46 | Seq. |
Kunstderfuge | 598 | – | – | Seq. |
KernScores | – | – | – | Seq. |
SUPRA | 111 | 410 | – | Perf. |
ASAP | 16 | 222 | – | Perf. |
MAESTRO | 62 | 529 | 84 | Perf. |
MAPS | – | 270 | 19 | Perf. |
GiantMIDI-Piano | 2,786 | 10,855 | 1,237 | 90% Perf. |
Curated GP | 1,787 | 7,236 | 875 | 89% Perf. |
3 GiantMIDI-Piano Dataset
3.1 Metadata from IMSLP
To begin with, we acquire the names of composers and the names of music works by parsing the web pages of the International Music Score Library Project (IMSLP), the largest publicly available music score library in the world. In IMSLP, each composer has a web page containing a list of their works. We acquire 143,701 music works composed by 18,067 composers by parsing those web pages. For each composer, if there exists a biography link on the composer page, we access that biography link to search for their birth year, death year, and nationality. We set the birth year, death year, and nationality to “unknown” if a composer does not have such a biography link. We obtain the nationalities of 4,274 composers and the birth years of 5,981 composers out of 18,067 composers by automatically parsing the biography links.
As the automatically parsed meta-information of composers from the Internet is incomplete, we manually check the nationalities, birth years, and death years of 2,786 composers. We label 2,291 birth years, 2,254 death years, and 2,115 nationalities by searching for information about the composers on the Internet. We label birth years, death years, and nationalities that cannot be found as “unknown”. We create metadata files containing the information of composers and music works, respectively.
3.2 Search Audio
We search audio recordings on YouTube using a keyword built from the first name, surname, and music work name in the metadata. For each keyword, we select the first result returned by YouTube. However, the returned YouTube title may not exactly match the keyword. For example, for the keyword Frédéric Chopin, Scherzo No.2 Op.31, the top returned result can be Chopin – Scherzo No. 2, Op. 31 (Rubinstein). Although the keyword and the returned YouTube title are different, they indicate the same music work. We denote the set of words in a search keyword as X, and the set of words in a returned YouTube title as Y. We propose a modified Jaccard similarity () to evaluate how well a keyword and a returned result match.
The original Jaccard similarity is defined as J = |X ∩ Y| / |X ∪ Y|. The drawback of this definition is that the word set Y of a searched YouTube title can be large, so that J will be small. This is often the case because searched YouTube titles usually contain extra words such as the names of performers and the dates of performances. Our aim is to define a metric whose denominator only depends on the search keyword X and is independent of the length of the searched YouTube title Y. We propose a modified Jaccard similarity () between X and Y as:

J(X, Y) = |X ∩ Y| / |X|.
Higher J indicates that X and Y have larger similarity, and lower J indicates that X and Y have less similarity. We empirically set a similarity threshold to 0.6 to balance the precision and recall of searched results. If J is strictly larger than this threshold, then we say X and Y are matched; otherwise they are not matched. In total, we retrieve and download 60,724 audio recordings out of 143,701 music works.
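For concreteness, the following minimal Python sketch applies this matching step to the Chopin example above; the tokenization (lower-casing and splitting on word characters) is our assumption and not necessarily how the released GiantMIDI-Piano code normalises titles.

```python
import re

def modified_jaccard(keyword: str, title: str) -> float:
    """Modified Jaccard similarity |X ∩ Y| / |X| between word sets."""
    def words(s: str) -> set:
        return set(re.findall(r"\w+", s.lower()))
    x, y = words(keyword), words(title)
    return len(x & y) / len(x) if x else 0.0

keyword = "Frédéric Chopin, Scherzo No.2 Op.31"
title = "Chopin – Scherzo No. 2, Op. 31 (Rubinstein)"
similarity = modified_jaccard(keyword, title)
print(round(similarity, 2), similarity > 0.6)  # 0.86 True: the pair is kept
```

Because the denominator only counts the keyword words, extra words in the YouTube title (here the performer's name) do not lower the score.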
3.3 Solo Piano Detection
We detect solo piano works from IMSLP to build the GiantMIDI-Piano dataset. Filtering music works by whether their titles contain the word “piano” may lead to incorrect results. For example, a “Piano Concerto” is an ensemble of piano and orchestra, which is not solo piano. On the other hand, the keyword Chopin, Frédéric, Nocturnes, Op.62 does not contain the word “piano”, but the work is indeed solo piano. To address this problem, we train an audio-based solo piano detection system using a convolutional neural network (CNN) (). The piano detection system splits each recording into 1-second segments and extracts their log mel spectrograms as input to the CNN.
The CNN consists of four convolutional layers. Each convolutional layer consists of a linear convolution with a 3 × 3 kernel, batch normalization (), and a ReLU nonlinearity (). The output of the CNN predicts the solo piano probability of a segment, and binary cross-entropy is used as the loss function to train the CNN. We collect solo piano recordings as positive samples, and other music and sounds as negative samples. In addition, mixtures of piano and other sounds are also used as negative samples. At inference time, we average the predictions over all 1-second segments of a recording to calculate its solo piano probability. We regard an audio recording as solo piano if this probability is strictly larger than 0.5, and as non-solo piano otherwise. In total, we obtain 10,855 solo piano recordings, composed by 2,786 composers, out of 60,724 downloaded audio recordings. These 10,855 audio files are transcribed into MIDI files, which constitute the full GiantMIDI-Piano dataset.
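A minimal PyTorch sketch of such a classifier is shown below, assuming four 3 × 3 convolutional layers with batch normalization and ReLU as described above; the channel sizes, pooling, and input shape are illustrative assumptions rather than the exact released configuration.

```python
import torch
import torch.nn as nn

class SoloPianoCNN(nn.Module):
    def __init__(self, in_channels: int = 1):
        super().__init__()
        layers, channels = [], [in_channels, 32, 64, 128, 128]
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU(),
                       nn.AvgPool2d(2)]
        self.cnn = nn.Sequential(*layers)
        self.fc = nn.Linear(channels[-1], 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, mel_bins), log mel spectrograms of 1-s segments
        h = self.cnn(x).mean(dim=(2, 3))               # global average pooling
        return torch.sigmoid(self.fc(h)).squeeze(1)    # solo piano probability

model = SoloPianoCNN()
segments = torch.randn(8, 1, 100, 64)    # eight 1-second segments of one recording
recording_prob = model(segments).mean()  # average the segment probabilities
is_solo_piano = recording_prob.item() > 0.5
```

Training would minimise the binary cross-entropy between segment probabilities and the solo piano labels; at inference, the averaged probability is compared against the 0.5 threshold as described above.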
3.4 Constrain Composer Surnames
Among the detected 10,855 solo piano works, there are several music works composed by lesser-known composers but incorrectly attributed to famous composers. For example, there are 273 searched music works assigned to Chopin, but only 102 of them are actually composed by Chopin, while the others are composed by other composers. To alleviate this problem, we create a curated subset by constraining the titles of downloaded audio recordings to contain the surnames of composers. After this constraint, we obtain a curated GiantMIDI-Piano dataset containing 7,236 music works composed by 1,787 composers.
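The constraint itself amounts to a case-insensitive substring check, sketched below with illustrative titles; the outcome of this check is recorded in the released metadata (Section 5.2).

```python
def title_contains_surname(youtube_title: str, surname: str) -> bool:
    """Keep a recording in the curated subset only if the title names the composer."""
    return surname.lower() in youtube_title.lower()

print(title_contains_surname("Chopin - Scherzo No. 2, Op. 31 (Rubinstein)", "Chopin"))  # True
print(title_contains_surname("Nocturne No. 1 (solo piano)", "Chartier"))                # False
```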
3.5 Piano Transcription
We transcribe all 10,855 solo piano recordings into MIDI files using an open-sourced high-resolution piano transcription system (), an improvement over the onsets and frames piano transcription system (, ) and other systems (; ). The piano transcription system is trained on the training subset of the MAESTRO dataset version 2.0.0 (). The training and testing subsets contain 161.3 and 20.5 hours of aligned piano recordings and MIDI files, respectively. The piano transcription system predicts the pitch, onset, offset, and velocity of each note, and the transcribed results also include sustain pedals. For piano note transcription, the system consists of a frame-wise classification sub-module and onset, offset, and velocity regression sub-modules. Each sub-module is modeled by a convolutional recurrent neural network (CRNN) with eight convolutional layers and two bi-directional gated recurrent unit (GRU) layers. The output of each sub-module has a dimension of 88, corresponding to the 88 keys of a modern piano.
The pedal transcription system has the same architecture as the note transcription system, except that there is only one output after the CRNN sub-module, indicating the onset or offset probabilities of the sustain pedal. At inference time, all piano recordings are converted to mono with a sampling rate of 16 kHz. We use a short-time Fourier transform (STFT) with a Hann window of size 2048 and a hop size of 160 to extract spectrograms, so there are 100 frames per second. Then, mel filter banks with 229 bins are used to extract a log mel spectrogram as the input feature (). The transcription system outputs frame-wise predictions of pitch, onset, offset, and velocity, which are finally post-processed into MIDI events.
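A minimal librosa sketch of this feature extraction is given below; the sampling rate, window, hop size, and number of mel bins follow the description above, while the mel frequency range and the floor of the logarithm are our assumptions.

```python
import numpy as np
import librosa

def extract_logmel(path: str) -> np.ndarray:
    audio, sr = librosa.load(path, sr=16000, mono=True)        # mono, 16 kHz
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=2048, hop_length=160,             # 100 frames per second
        window="hann", n_mels=229, fmin=30.0, fmax=8000.0)
    return np.log(np.clip(mel, 1e-10, None)).T                  # shape: (frames, 229)

# logmel = extract_logmel("recording.wav")  # hypothetical audio file
```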
4 Statistics
We analyse the statistics of GiantMIDI-Piano, including the number and duration of music works composed by different composers, the nationalities of composers, and the distribution of notes by composers. Then, we investigate the statistics of six composers from different eras by calculating their pitch class, interval, trichord, and tetrachord distributions. All of Figure 1 to Figure 11, except Figure 3, are plotted with the statistics of the curated GiantMIDI-Piano dataset. Figure 3 shows the manually checked nationalities of 2,786 composers in the full GiantMIDI-Piano dataset.
4.1 The Number of Solo Piano Works
Figure 1 shows the number of piano works composed by each composer, sorted in descending order, for the curated GiantMIDI-Piano dataset. Figure 1 shows the statistics of the top 100 of the 1,787 composers in the curated subset. Blue bars show the number of solo piano works. Pink bars show the total number of works, including both solo piano and non-solo piano works. Figure 1 shows that there are 141 solo piano works composed by Liszt, followed by 140 and 129 solo piano works composed by Scarlatti and J. S. Bach, respectively. Some composers, such as Chopin, composed more solo piano works than non-solo piano works. For example, there are 96 solo piano works out of 109 works in total composed by Chopin in the curated GiantMIDI-Piano dataset. Figure 1 shows that the number of solo piano works per composer follows a long-tailed distribution.
4.2 The Duration of Solo Piano Works
Figure 2 shows the duration of solo piano works for each composer, sorted in descending order, for the curated GiantMIDI-Piano dataset. The duration of works composed by Liszt is the longest at 25 hours, followed by Beethoven at 21 hours and Schubert at 20 hours. Some composers composed more non-piano works than solo piano works. For example, there are 108 hours of works composed by Handel in the dataset, while only 2 hours of them are played on a modern piano. The rank of composers in Figure 2 is different from Figure 1, indicating that the average durations of solo piano works composed by different composers are different.
4.3 Nationalities of Composers
Figure 3 shows the number of composers of each nationality, sorted in descending order, for the full GiantMIDI-Piano dataset. The nationalities of 2,786 composers are initially obtained from Wikipedia and are later manually checked. Figure 3 shows that there are 671 composers with unknown nationality. There are 364 German composers, followed by 322 French composers and 267 American composers. We color-code nationalities by continent: “Unknown”, “European”, “North American”, “South American”, “Asian”, and “African”. In GiantMIDI-Piano, the nationalities of most composers are European. There are fewer composers with nationalities from South America, Asia, and Africa.
4.4 Note Histogram
Figure 4 shows the note histogram of the curated GiantMIDI-Piano dataset, which contains 24,253,495 transcribed notes. The horizontal axis shows scientific pitch notation, covering the 88 notes of a modern piano from A0 to C8; middle C is denoted as C4. We do not distinguish enharmonic notes; for example, a note C♯/D♭ is simply denoted as C♯. The white and black bars correspond to the white and black keys on a modern piano, respectively. Figure 4 shows that the note histogram approximately follows a bell-shaped distribution. The most played note is G4; there are more notes close to G4 and fewer notes far from G4. The most played notes are within the octave between C4 and C5. White keys are played more often than black keys.
Figure 5 visualizes the note histograms of three composers from different eras: J. S. Bach, Beethoven, and Liszt. The note range of J. S. Bach is mostly between C2 and C6 covering four octaves, which is consistent with the note range of a conventional harpsichord or organ. The note range of Beethoven is mostly between F1 and C7 covering five and a half octaves. The note range of Liszt is the widest, covering the whole range of a modern piano.
4.5 Pitch Distribution of the Top 100 Composers
Figure 6 shows the pitch distribution sorted in ascending order of average pitch for the top 100 composers in Figure 2 from the curated GiantMIDI-Piano dataset. The average pitches of most composers are between C4 and C5, where C4 corresponds to a MIDI pitch value 60. The shades indicate the one standard deviation area of pitch distributions. Jeffrey Michael Harrington has the lowest average pitch value of C4. Carl Czerny has the highest average pitch value of A4.
4.6 The Number of Notes Per Second Distribution of the Top 100 Composers
Figure 7 shows the number of notes per second distribution, sorted in ascending order, for the top 100 composers in Figure 2 from the curated GiantMIDI-Piano dataset. The number of notes per second is calculated by dividing the number of notes in all works by the total duration of all works of a composer. The average numbers of notes per second of most composers are between 5 and 10. The shades indicate the one standard deviation area of the number of notes per second distribution. Alfred Grünfeld has the smallest number of notes per second with a value of 4.18. Carl Czerny has the largest number of notes per second with a value of 13.61.
4.7 Pitch Class Distribution
We denote the set of pitch classes as {C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, B}. The notes from C to B are denoted as 0 to 11 (), respectively. We calculate the statistics of six composers from different eras including J. S. Bach, Mozart, Beethoven, Chopin, Liszt, and Debussy. Figure 8 shows that J. S. Bach used D, E, G, and A most in his solo piano works. Mozart used C, D, F, and G most in his solo piano works and used more A♯/B♭ than other composers. Beethoven used more C, D, and G than other notes. Chopin used D♯/E♭ and G♯/A♭ most in his solo piano works. Liszt and Debussy used all twelve pitch classes more uniformly in their solo piano works than other composers. Liszt used E most, and Debussy used C♯/D♭ most. As expected, most Baroque and Classical solo piano works were in keys close to C, whereas Romantic and later composers explored distant keys and tended to use all twelve pitch classes more uniformly.
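These pitch class statistics can be reproduced from the released MIDI files; the following sketch, assuming the pretty_midi package and a hypothetical file name, counts pitch classes by taking each MIDI pitch modulo 12.

```python
import numpy as np
import pretty_midi

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class_distribution(midi_path: str) -> np.ndarray:
    midi = pretty_midi.PrettyMIDI(midi_path)
    counts = np.zeros(12)
    for instrument in midi.instruments:
        for note in instrument.notes:
            counts[note.pitch % 12] += 1   # map MIDI pitch to pitch class 0-11
    return counts / counts.sum() if counts.sum() else counts

# dist = pitch_class_distribution("Chopin_Nocturne_Op9_No2.mid")  # hypothetical file
# print(dict(zip(PITCH_CLASSES, dist.round(3))))
```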
4.8 Interval Distribution
An interval is the pitch difference between two notes. Intervals can be either melodic or harmonic. A harmonic interval is the pitch difference between two notes played at the same time. A melodic interval is the pitch difference between two successive notes. We consider both harmonic intervals and melodic intervals as intervals, and we calculate the distribution of intervals of the six composers. Notes are represented as a list of events in a MIDI format. We calculate an interval as:

Δ = y_{n+1} − y_n,

where y_n is the MIDI pitch of the n-th note in the list.
We calculate ordered intervals, including both positive (upward) and negative (downward) intervals. For example, the interval Δ for an upward progression from C4 to D4 is 2, and the interval Δ for a downward progression from C4 to A3 is –3. We only consider intervals between –11 and 11 (inclusive) and discard intervals outside this range. For example, the value 11 indicates an upward major seventh. Figure 9 shows the interval distribution of the six composers. All composers used major seconds and minor thirds most in their works. The interval distribution is not symmetric about the origin; for example, J. S. Bach and Mozart used more downward major seconds than upward major seconds. In the works of J. S. Bach, the dip at interval 0 indicates that repeated notes are used less commonly than non-repeated notes; the other composers used repeated notes more than J. S. Bach. Major sevenths and tritones are the least used intervals for all composers. Some Romantic and later composers, including Chopin, Liszt, and Debussy, used all intervals more uniformly than J. S. Bach from the Baroque era.
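A minimal sketch of the interval histogram is shown below, assuming the transcribed notes of a work are given as MIDI pitches ordered by onset time; intervals outside [−11, 11] are discarded as described above.

```python
import numpy as np

def interval_histogram(pitches) -> np.ndarray:
    """Histogram of intervals Δ = y_{n+1} − y_n, restricted to [-11, 11]."""
    deltas = np.diff(np.asarray(pitches))
    deltas = deltas[(deltas >= -11) & (deltas <= 11)]
    counts = np.zeros(23)                         # one bin per interval from -11 to 11
    for d in deltas:
        counts[int(d) + 11] += 1
    return counts / counts.sum() if counts.sum() else counts

# C4 -> D4 is +2 (upward major second); D4 -> C4 is -2; C4 -> A3 is -3
print(interval_histogram([60, 62, 60, 57]).round(2))
```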
4.9 Trichord Distribution
We adopt musical set theory () to analyse the chord distribution in GiantMIDI-Piano. A trichord is a set of any three pitch classes (). Since GiantMIDI-Piano is transcribed from real recordings, the notes of a chord are usually not played exactly simultaneously. We therefore consider notes with onsets within a 50 ms window as one chord. The windows are non-overlapping, so each note belongs to at most one chord. For a special case of onsets at 0, 25, 50, 75, and 100 ms, our system first searches for chords in a window starting at 0 ms and returns {0, 25, 50}; it then searches for chords in a window starting at 75 ms and returns {75, 100}. We discard pitch sets with more or fewer than three notes within a 50 ms window. A major triad can be written as {0, 4, 7}, where the interval between 0 and 4 is a major third and the interval between 4 and 7 is a minor third. We transpose all chords so that their lowest pitch class is C; for example, a chord {2, 6, 9} is transposed into {0, 4, 7}. We then merge chords with the same prime form. Figure 10 shows the trichord distribution of the six composers. All composers used the major triad {0, 4, 7} most, followed by the minor triad {0, 3, 7}. Liszt used more augmented triads {0, 4, 8} than the other composers. Debussy used more {0, 2, 5} and {0, 2, 4} than the other composers, which distinguishes him from them. Figure 10 shows that composers from different eras have different preferences for trichords.
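The trichord counting can be sketched as follows, assuming the transcribed notes of a work are given as (onset, MIDI pitch) pairs; the grouping into non-overlapping 50 ms windows and the transposition so that the lowest pitch class becomes 0 follow the description above, while requiring three distinct pitch classes is our simplifying assumption.

```python
from collections import Counter

def trichord_counts(notes, window: float = 0.05) -> Counter:
    """notes: list of (onset_seconds, midi_pitch) pairs for one work."""
    notes = sorted(notes)
    counts, i = Counter(), 0
    while i < len(notes):
        start = notes[i][0]
        group = [p for t, p in notes[i:] if t - start < window]   # one 50 ms window
        i += len(group)                                            # windows do not overlap
        pcs = sorted({p % 12 for p in group})
        if len(group) == 3 and len(pcs) == 3:
            counts[tuple((pc - pcs[0]) % 12 for pc in pcs)] += 1   # transpose lowest to 0
    return counts

# A C major triad (C4, E4, G4) played within 30 ms is counted as (0, 4, 7)
print(trichord_counts([(0.00, 60), (0.01, 64), (0.02, 67), (1.0, 62)]))
```

Replacing the three-note condition with a four-note condition gives the tetrachord counts used in the next subsection.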
4.10 Tetrachord Distribution
A tetrachord is a set of any four pitch classes (). Similar to trichords, we only consider notes with onsets within a 50 ms window as a chord and discard pitch sets with more or fewer than four notes within the window. A dominant seventh chord can be denoted as {0, 4, 7, 10}, and seventh chords such as {0, 2, 6, 9} are transposed to the root position {0, 4, 7, 10}. Figure 11 shows the tetrachord distributions of the six composers. J. S. Bach, Beethoven, Mozart, and Chopin used the dominant seventh {0, 4, 7, 10} most. Liszt used the diminished seventh {0, 3, 6, 9} most, and Debussy used the minor seventh {0, 3, 7, 10} most. J. S. Bach used the dominant seventh less than the other five composers. The tetrachord distribution of Debussy differs from those of the other composers. Figure 11 shows that composers from different eras have different preferences for tetrachords.
5 Evaluation of GiantMIDI-Piano
5.1 Solo Piano Evaluation
We evaluate the solo piano detection system as follows. We manually label 200 randomly selected music works from 60,724 downloaded audio recordings. We calculate the precision, recall, and F1 scores of the solo piano detection system with different thresholds ranging from 0.1 to 0.9 and show results in Figure 12. Horizontal and vertical axes show different thresholds and scores, respectively. Figure 12 shows that higher thresholds lead to higher precision but lower recall. When we set the threshold to 0.5, the solo piano detection system achieves a precision, recall, and F1 score of 89.66%, 86.67%, and 88.14%, respectively. In this work, we set the threshold to 0.5 to balance the precision and recall for solo piano detection.
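The threshold sweep can be reproduced with a few lines of scikit-learn, assuming y_true holds the 200 manual labels and y_prob the corresponding probabilities output by the detection system; both variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def sweep(y_true, y_prob, thresholds=np.arange(0.1, 1.0, 0.1)):
    """Precision, recall, and F1 of the detector at each decision threshold."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    scores = []
    for th in thresholds:
        y_pred = (y_prob > th).astype(int)
        scores.append((round(float(th), 1),
                       precision_score(y_true, y_pred, zero_division=0),
                       recall_score(y_true, y_pred, zero_division=0),
                       f1_score(y_true, y_pred, zero_division=0)))
    return scores

# for th, p, r, f1 in sweep(y_true, y_prob): print(th, p, r, f1)
```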
5.2 Metadata Evaluation
We randomly select 200 solo piano works from the full GiantMIDI-Piano dataset and manually check how many audio recordings and metadata are matched. We observe that 174 out of 200 solo piano works are correctly matched, leading to a metadata accuracy of 87%. Most errors are caused by mismatched composer names. For example, when the keyword X is Chartier, Mathieu, Nocturne No.1 composed by Chartier, the retrieved YouTube title Y is Nocturne No. 1 composed by Chopin. After constraining surnames, 136 out of 140 solo piano works are correctly matched, leading to a precision of 97.14%. We also observe that there are 180 live performances and 20 sequenced MIDI files out of the 200 solo piano works.
Furthermore, Table 2 shows the number of matched music works composed by six different composers. Correct indicates that the retrieved solo piano works are indeed composed by the composer. Incorrect indicates that the retrieved music works are not composed by the composer but by someone else, and are incorrectly attributed to that composer. Without the surname constraint, Liszt achieves the highest match accuracy of 90%, while Chopin achieves the lowest match accuracy of 37%. With the surname constraint, Table 3 shows that the match accuracy of Chopin increases from 37% to 82%; the accuracies of the other composers also increase. The curated GiantMIDI-Piano dataset contains 7,236 MIDI files composed by 1,787 composers. We use a youtube_title_contains_surname flag in the metadata file to indicate whether the surname is verified.
Table 2: The number of correctly and incorrectly matched works of six composers without the surname constraint.

| | J. S. BACH | MOZART | BEETHOVEN | CHOPIN | LISZT | DEBUSSY |
|---|---|---|---|---|---|---|
Correct | 147 | 85 | 82 | 102 | 197 | 29 |
Incorrect | 102 | 35 | 70 | 171 | 22 | 9 |
Accuracy | 59% | 71% | 54% | 37% | 90% | 76% |
Table 3: The number of correctly and incorrectly matched works of six composers with the surname constraint.

| | J. S. BACH | MOZART | BEETHOVEN | CHOPIN | LISZT | DEBUSSY |
|---|---|---|---|---|---|---|
Correct | 129 | 72 | 76 | 96 | 141 | 27 |
Incorrect | 44 | 16 | 5 | 21 | 6 | 3 |
Accuracy | 75% | 82% | 94% | 82% | 96% | 90% |
5.3 Piano Transcription Evaluation
The piano transcription system achieves a state-of-the-art onset F1 score of 96.72%, an onset and offset F1 score of 82.47%, and an onset, offset, and velocity F1 score of 80.92% on the test set of the MAESTRO dataset. The sustain pedal transcription system achieves an onset F1 of 91.86%, and a sustain-pedal onset and offset F1 of 86.58%. The piano transcription system outperforms the previous onsets and frames system (, ) with an onset F1 score of 94.80%.
We evaluate the quality of GiantMIDI-Piano on 52 music works that appear in all of the GiantMIDI-Piano, the MAESTRO, and the Kunstderfuge datasets. Long music works such as Sonatas are split into movements. Repeated music sections are removed. Evaluating GiantMIDI-Piano is a challenging problem because there are no aligned ground-truth MIDI files, so the metrics of Hawthorne et al. () are not usable. In this work, we propose to use an alignment metric () called error rate (ER) to evaluate the quality of transcribed MIDI files. This metric reflects the substitutions, deletions, and insertions between a transcribed MIDI file and a target MIDI file. For a solo piano work, we align a transcribed MIDI file with its sequenced MIDI version using a hidden Markov model (HMM) tool (), where the sequenced MIDI files are from the Kunstderfuge dataset. The ER is defined as the normalized summation of substitutions, insertions, and deletions:
ER = (S + D + I) / N,

where N is the number of reference notes, and S, I, and D are the numbers of substitutions, insertions, and deletions, respectively. A substitution indicates that a note replaces a ground truth note. An insertion indicates that an extra note is played. A deletion indicates that a note is missing. Lower ER indicates better transcription performance.
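As a worked sketch, once a note-level alignment (here, the HMM-based aligner mentioned above) has produced the counts, the ER is a single division; the numbers below are illustrative only.

```python
def error_rate(substitutions: int, deletions: int, insertions: int,
               n_reference_notes: int) -> float:
    """ER = (S + D + I) / N."""
    return (substitutions + deletions + insertions) / n_reference_notes

# Illustrative counts: 5 substitutions, 12 deletions, and 8 insertions
# against 500 reference notes give ER = 25 / 500 = 0.05.
print(error_rate(5, 12, 8, 500))  # 0.05
```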
The ER of music works from GiantMIDI-Piano consists of three parts: 1) performance errors, 2) transcription errors, and 3) alignment errors:

ER_G = E_G^perf + E_G^trans + E_G^align, (4)

where the subscript G is the abbreviation for GiantMIDI-Piano. The performance errors E_G^perf come from a pianist accidentally missing or adding notes while performing (). The transcription errors E_G^trans come from piano transcription system errors. The alignment errors E_G^align come from the sequence alignment algorithm (). Audio recordings and MIDI files are perfectly aligned in the MAESTRO dataset, so there are no transcription errors. The ER of MAESTRO can be written as:

ER_M = E_M^perf + E_M^align, (5)

where the subscript M is the abbreviation for MAESTRO. For a given music work, we assume the approximation E_G^perf ≈ E_M^perf despite the differences in performance among pianists. Similarly, we assume the approximation E_G^align ≈ E_M^align, although the alignment errors can be different. Those approximations are more accurate when the skill levels of the two pianists are closer. Then, we propose a relative error by subtracting (5) from (4):

r = ER_G − ER_M ≈ E_G^trans.

The relative error r is a rough approximation of the transcription errors E_G^trans.
A lower r value indicates better transcription quality. Table 4 shows the alignment performance. The median alignment S_M, D_M, I_M, and ER_M on the MAESTRO dataset are 0.009, 0.024, 0.021, and 0.061, respectively. The median alignment S_G, D_G, I_G, and ER_G on the GiantMIDI-Piano dataset are 0.015, 0.051, 0.069, and 0.154, respectively. The relative error r between MAESTRO and GiantMIDI-Piano is 0.094. The first column of Figure 13 shows the box plot metrics of MAESTRO. Some outliers, mostly caused by different interpretations of trills and tremolos, are omitted from the figures for better visualization. The second column of Figure 13 shows the box plot metrics of GiantMIDI-Piano. In GiantMIDI-Piano, Keyboard Sonata in E-flat Major, Hob. XVI/49 composed by Haydn achieves the lowest ER of 0.037, while Prelude and Fugue in A-flat Major, BWV 862 composed by Bach achieves the highest ER of 0.679 (an outlier beyond the plot range). This underperformance is due to the piano in that recording not being tuned to the standard pitch of A4 = 440 Hz. The third column of Figure 13 shows the relative ER between MAESTRO and GiantMIDI-Piano. The relative median scores of S, D, I, and ER are 0.006, 0.026, 0.047, and 0.094, respectively. Figure 13 also shows that there are fewer deletions than insertions.
Table 4: Median deletion (D), insertion (I), substitution (S), and error rates (ER) of aligned works in the MAESTRO and GiantMIDI-Piano datasets, and their relative difference.

| | D | I | S | ER |
|---|---|---|---|---|
MAESTRO | 0.009 | 0.024 | 0.018 | 0.061 |
GiantMIDI-Piano | 0.015 | 0.051 | 0.069 | 0.154 |
Relative difference | 0.006 | 0.026 | 0.047 | 0.094 |
6 Conclusion
We collect and transcribe the large-scale GiantMIDI-Piano dataset containing 38,700,838 transcribed piano notes from 10,855 unique classical piano works composed by 2,786 composers. The total duration of GiantMIDI-Piano is 1,237 hours. The curated subset contains 24,253,495 piano notes from 7,236 works composed by 1,787 composers. GiantMIDI-Piano is transcribed from YouTube audio recordings searched using meta-information from IMSLP.
The solo piano detection system used in GiantMIDI-Piano achieves an F1 score of 88.14%, and the piano transcription system achieves a relative error rate of 0.094. The limitations of GiantMIDI-Piano include: 1) There are no pitch spellings to distinguish enharmonic notes; 2) GiantMIDI-Piano does not provide beats, time signatures, key signatures, and scores; and 3) GiantMIDI-Piano does not disentangle the music score and the expressive performance of pianists.
We have released the source code for acquiring GiantMIDI-Piano. In the future, GiantMIDI-Piano can be used in many research areas, including but not limited to musical analysis, music generation, music information retrieval, and expressive performance analysis.