CN110580458A

CN110580458A - Music score image recognition method combining a multi-scale residual CNN and SRU

Info

Publication number: CN110580458A
Application number: CN201910787184.3A
Authority: CN
Inventors: 吴琼; 李锵; 关欣
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2019-08-25
Filing date: 2019-08-25
Publication date: 2019-12-17

Abstract

The invention relates to a music score image recognition method combining a multi-scale residual CNN and an SRU, comprising the following steps. First, a data set of music score images is established. Second, a model is constructed by combining the multi-scale residual CNN with the SRU. Third, the model is trained on the augmented data set: the model input is a music score image from the data set, the ground-truth label is the semantic label corresponding to the image, the network parameters are adjusted step by step through a connectionist temporal classification (CTC) loss function until they reach their optimum, and the model finally outputs the predicted note semantic information.

Description

Music score image recognition method combining a multi-scale residual CNN and SRU

Technical Field

The invention belongs to an important branch of sequential image recognition. It applies neural networks to image recognition, optimizes the note recognition network for difficult notes, and achieves more accurate and faster conversion of music score images.

Background

A music score records information such as notes, pitch, and duration in detail, and has become the most direct way for musicians to learn, share, and pass on music. However, many classical scores have been damaged or even lost through environmental change and the passage of time, so manual storage cannot preserve all scores completely and without loss. With the rapid development of computer applications and image scanning, Optical Music Recognition (OMR) can convert paper music documents into electronic files that a computer can read and understand, which makes them widely usable in fields such as music information retrieval and computer-aided music teaching. However, general-purpose score recognition algorithms are structurally complex and hard to implement, and existing commercial recognition software has low accuracy, so an OMR algorithm that is both easy to implement and accurate is urgently needed.

Bainbridge et al. [1] proposed an early general framework for OMR algorithms, consisting mainly of image preprocessing, note recognition, music information reconstruction, and construction of the final representation. Staff detection and removal, note segmentation and recognition, and recombination of note information are the technical difficulties; each step is hard to implement, and the overall recognition accuracy is insufficient. In recent years, driven by big data, machine learning and deep neural networks have been widely applied. Sober-Mira et al. [2] applied a convolutional neural network (CNN) to the note recognition part, improving the accuracy of the general-framework algorithm. Shi et al. [3] first proposed the convolutional recurrent neural network (CRNN) and applied it to scene text recognition with notable results. Calvo-Zaragoza et al. [4] adopted the method of Shi et al. [3] for music score recognition and carried out model optimization and quantitative analysis: the input pictures are preprocessed, three monophonic music score images with a 1:4 ratio are fed into the network in a uniform format, and a bidirectional long short-term memory (BiLSTM) network serves as the feature recognition part of the CRNN, forming a C-BiLSTM network. With 60 × 240 input images the method reaches a sequence error rate of about 22.37% and a symbol error rate of 2.16%, but because its feature extraction capability is insufficient, the recognition accuracy for difficult notes such as partials and minor pitch lines is inadequate.
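
The two error measures quoted above can be illustrated with a short sketch. It assumes the usual definitions: the symbol error rate is the total edit distance between predicted and reference symbol sequences divided by the total number of reference symbols, and the sequence error rate is the fraction of sequences containing at least one error. The function names and the example symbol strings are hypothetical and are not taken from the cited work.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # symbol missing from hyp
                            curr[j - 1] + 1,      # extra symbol in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rates(references, hypotheses):
    """Return (sequence_error_rate, symbol_error_rate) as fractions."""
    wrong_sequences = 0
    total_edits = 0
    total_symbols = 0
    for ref, hyp in zip(references, hypotheses):
        d = edit_distance(ref, hyp)
        wrong_sequences += d > 0   # a sequence counts as wrong on any error
        total_edits += d
        total_symbols += len(ref)
    return wrong_sequences / len(references), total_edits / total_symbols


# Illustrative symbol names only.
refs = [["clef-G2", "note-C4_quarter", "barline"], ["clef-G2", "note-D4_half"]]
hyps = [["clef-G2", "note-C4_quarter", "barline"], ["clef-G2", "note-E4_half"]]
seq_er, sym_er = error_rates(refs, hyps)
```

In this toy case one of two sequences contains an error, and one of five reference symbols is wrong, which shows why the sequence error rate is typically much higher than the symbol error rate.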

OMR research to date has the following problems. 1) Algorithms based on the general framework involve many steps, each of which is difficult: staff detection must balance robustness to noise against robustness to deformation; staff removal makes punctuation notes harder to recognize; and note recognition and classification chooses different methods according to the characteristics of different notes, so a general algorithm is hard to select and classification quality varies markedly between note types. These problems leave the overall recognition accuracy of the OMR task insufficient. 2) End-to-end trained deep neural networks simplify the general framework: the key steps of the OMR task no longer need to be analyzed and studied separately, which reduces the chance of errors introduced by a multi-step pipeline. However, the OMR task is sensitive to detail, and for difficult notes in particular the model's insufficient feature extraction capability severely limits recognition accuracy. 3) The note sequences in the data sets are only combinations of simple notes; this lack of richness and diversity gives the model poor generalization and makes overfitting likely. 4) Network models that use a BiLSTM for feature recognition usually converge slowly during training and take a long time.

A note sequence is sequential in nature: the note at the current moment is strongly correlated with the notes immediately before and after it. A recurrent neural network (RNN) can recognize such sequential data effectively and can therefore be used for note recognition. During training, long sequences are prone to vanishing gradients, so most RNN structures control the information flow with a gating mechanism such as LSTM or GRU to alleviate vanishing or exploding gradients. However, the forget gate, input gate, and cell state of LSTM/GRU models require the hidden-unit output of the previous time step in addition to the current input, which severely limits parallel computation. The Simple Recurrent Unit (SRU) effectively solves this problem. The SRU has a light recurrence and high parallelism: its gate states depend only on the input at the current moment, which removes the dependence of the current step on the previous state, lets most of the computation run in parallel, and thus speeds up model convergence.
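
The property described above, gates that depend only on the current input, can be sketched in plain Python. The equations follow the original SRU formulation (forget gate f, reset gate r, cell state c, highway connection to the input); this is an assumption, since the text does not say which SRU variant is used. The names `sru_layer`, `W`, `Wf`, `Wr`, `bf`, and `br` are hypothetical, and the highway connection requires the input and hidden sizes to be equal.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sru_layer(xs, W, Wf, bf, Wr, br):
    """Run one unidirectional SRU layer over a list of input vectors xs."""
    # Step 1: projections and gates use x_t only, so every time step
    # could be computed in parallel (written as a loop here for clarity).
    projected = []
    for x in xs:
        x_tilde = matvec(W, x)
        f = [sigmoid(a + b) for a, b in zip(matvec(Wf, x), bf)]
        r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), br)]
        projected.append((x_tilde, f, r))

    # Step 2: the only sequential part is a cheap element-wise update.
    c = [0.0] * len(W)
    hs = []
    for x, (x_tilde, f, r) in zip(xs, projected):
        c = [fi * ci + (1.0 - fi) * xt for fi, ci, xt in zip(f, c, x_tilde)]
        h = [ri * math.tanh(ci) + (1.0 - ri) * xi
             for ri, ci, xi in zip(r, c, x)]
        hs.append(h)
    return hs
```

In an LSTM the matrix products in step 1 would also need the previous hidden state, so they could not be moved out of the sequential loop; here only the element-wise update in step 2 has to run in order.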

References:

[1] Bainbridge D, Bell T. The challenge of optical music recognition[J]. Computers and the Humanities, 2001, 35(2): 95–121.

[2] Sober-Mira J, Calvo-Zaragoza J, Rizo D, et al. Pen-Based Music Document Transcription with Convolutional Neural Networks. In: Fornés A., Lamiroy B. (eds) Graphics Recognition. Current Trends and Evolutions[C]. GREC. 2017, 71-80.

[3] Shi B, Bai X, Yao C. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015, 39(11): 2298-2304.

[4] Calvo-Zaragoza J, Valero-Mas J J, Pertusa A. End-to-end Optical Music Recognition using Neural Networks[C]. 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017, 472-477.

Disclosure of Invention

The invention provides a note recognition method based on a deep neural network, addressing the problem of note recognition in optical music score images. The method can recognize difficult notes in score images of varying quality with high accuracy while also speeding up model training. The technical solution is as follows:

A music score image recognition method combining a multi-scale residual CNN and an SRU, comprising the following steps:

First, establishing a data set of music score images: score samples are selected and image augmentation techniques are applied so that the data set contains score images under non-ideal conditions, thereby expanding the data set;

Second, constructing a model by combining the multi-scale residual CNN with the SRU;

(1) Multi-scale residual CNN network: the multi-scale residual CNN consists of five convolutional residual blocks and performs multi-scale feature fusion. The input image passes through the five residual blocks in sequence to produce feature maps C1, C2, C3, C4, and C5. All convolution kernels are 3 × 3, and the number of kernels increases block by block as 32, 64, 128, 256, and 256. The last feature map C5 is upsampled by a factor of 2 and fused with the result of applying a 1 × 1 convolution to feature map C4 to obtain feature F5; F5 and C3 are then processed in the same way as C5 and C4 to obtain feature F4.

(2) SRU part: the SRU part comprises two bidirectional SRU layers. Because the height of the score image and the number of convolution kernels are fixed, the recurrence length of each layer stays constant, and each SRU uses 512 hidden units for forward learning and backpropagation of the weights;

Third, training the model: the model is trained on the augmented data set. The model input is a music score image from the data set, the ground-truth label is the semantic label corresponding to the image, the network parameters are adjusted step by step through a connectionist temporal classification (CTC) loss function until they reach their optimum, and the model finally outputs the predicted note semantic information.

The invention provides a network combining a residual CNN and an SRU that takes difficult notes in optical music score images as its research object. The feature extraction part uses a residual CNN structure with added multi-scale feature fusion, so that multi-level features are concentrated in a single feature map to improve the accuracy of subsequent recognition; the feature recognition part uses an SRU, so that the model trains faster.

Drawings

Fig. 1 shows the algorithm structure.

Fig. 2 shows the effect of multi-scale fusion: (a) is the original image, (b) is a feature map extracted by shallow convolution, (c) is a feature map extracted by deeper convolution, (d) is a feature map extracted by deep convolution, and (e) is the feature map obtained after multi-scale fusion.

Table 1 lists the network-specific parameters.

Table 2 compares the accuracy of different networks.

Detailed Description

To make the technical solution of the invention clearer, the invention is further described below with reference to the drawings. The invention is implemented through the following steps:

The experimental environment is as follows: Ubuntu 16.04 operating system, Intel Core i7-8700 CPU, 16 GB of RAM, Nvidia GTX 1080 Ti GPU, and the TensorFlow deep learning framework. The network is optimized with Adam at a learning rate of 1e-3 and a batch_size of 16, and BN (batch normalization) layers are added to accelerate convergence. Every 1000 training iterations the loss is printed and the accuracy is validated, for a total of 64000 iterations.
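
A minimal sketch of the optimizer schedule described above, in plain Python. The learning rate of 1e-3, the 64000 iterations, and the 1000-iteration reporting interval come from the text. The Adam decay constants (0.9, 0.999, 1e-8) are the commonly used defaults and are assumed here, and the one-parameter quadratic objective is a hypothetical stand-in for the network loss.

```python
import math


class ScalarAdam:
    """Adam update for a single scalar parameter (illustration only)."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0      # first-moment estimate
        self.v = 0.0      # second-moment estimate
        self.b1t = 1.0    # running value of beta1 ** t
        self.b2t = 1.0    # running value of beta2 ** t

    def step(self, p, grad):
        self.b1t *= self.beta1
        self.b2t *= self.beta2
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.b1t)   # bias-corrected moments
        v_hat = self.v / (1 - self.b2t)
        return p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def train(total_iters=64000, report_every=1000):
    opt = ScalarAdam(lr=1e-3)
    p = 0.0
    reports = []
    for t in range(1, total_iters + 1):
        loss = (p - 3.0) ** 2              # hypothetical stand-in loss
        grad = 2.0 * (p - 3.0)
        p = opt.step(p, grad)
        if t % report_every == 0:
            reports.append((t, loss))      # where loss and accuracy are reported
    return p, reports


final_p, reports = train()
```

With these settings the loop produces 64 reporting points, one per 1000 iterations.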

First, a music score image data set is established.

The data used in the invention come from the public PrIMuS dataset (Printed Images of Music Staves), which collects 87687 real score examples, each consisting of a single staff containing about 4 to 7 bars. Image augmentation methods such as Perlin noise, Gaussian white noise, and elastic deformation are applied to a randomly selected part of the data set to simulate score images under various non-ideal conditions.
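
As an illustration of one of the augmentation methods named above, the following sketch adds Gaussian white noise to a randomly selected part of a set of grayscale images stored as nested lists of 0 to 255 pixel values. The noise level `sigma`, the selected `fraction`, and the function names are hypothetical; the text does not give these values. The other named augmentation methods are not covered here.

```python
import random


def add_gaussian_noise(image, sigma=10.0, rng=None):
    """Return a copy of a grayscale image with additive Gaussian noise."""
    rng = rng or random.Random()
    return [[min(255, max(0, round(px + rng.gauss(0.0, sigma))))
             for px in row]
            for row in image]


def augment_subset(images, fraction=0.5, sigma=10.0, seed=0):
    """Add noise to a randomly chosen fraction of the images."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(images)), int(len(images) * fraction)))
    return [add_gaussian_noise(img, sigma, rng) if i in chosen else img
            for i, img in enumerate(images)]


# Four tiny 2 x 3 mid-grey "images" stand in for score images.
clean = [[[128, 128, 128], [128, 128, 128]] for _ in range(4)]
augmented = augment_subset(clean, fraction=0.5, sigma=10.0, seed=0)
```

Pixel values are clipped to the valid range after the noise is added, so the augmented images remain valid 8-bit grayscale data.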

Second, the model is constructed.

The whole algorithm combines a multi-scale residual CNN network with an SRU.

(1) Multi-scale residual CNN: the multi-scale residual CNN consists of five convolutional residual blocks and performs multi-scale feature fusion, as shown in Fig. 1. The input image passes through the five residual blocks in sequence to produce feature maps C1, C2, C3, C4, and C5. All convolution kernels are 3 × 3, and the number of kernels increases block by block as 32, 64, 128, 256, and 256. The last feature map C5 is upsampled by a factor of 2 and fused with the result of applying a 1 × 1 convolution to feature map C4 to obtain feature F5; F5 and C3 are then processed in the same way as C5 and C4 to obtain feature F4.
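
The fusion step can be sketched on toy feature maps stored as nested lists indexed [channel][row][column]. Two points are assumptions, since the text does not specify them: the upsampling is nearest-neighbour, and "fusing" is element-wise addition. The channel counts and map sizes below are deliberately tiny and do not match the real network, and the function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a [C][H][W] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out


def conv1x1(fmap, weights):
    """1 x 1 convolution: weights[c_out][c_in] mixes channels per pixel."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[[sum(wk * fmap[k][i][j] for k, wk in enumerate(w_out))
              for j in range(width)]
             for i in range(height)]
            for w_out in weights]


def fuse(top, lateral, weights):
    """Upsample the deeper map, project the shallower one, and add them."""
    up = upsample2x(top)
    lat = conv1x1(lateral, weights)
    return [[[a + b for a, b in zip(row_up, row_lat)]
             for row_up, row_lat in zip(ch_up, ch_lat)]
            for ch_up, ch_lat in zip(up, lat)]


# Toy maps: C5 is 1 x 2, C4 is 2 x 4, C3 is 4 x 8.
c5 = [[[1.0, 2.0]]]
c4 = [[[1.0] * 4, [1.0] * 4], [[2.0] * 4, [2.0] * 4]]
c3 = [[[0.0] * 8 for _ in range(4)]]
f5 = fuse(c5, c4, [[1.0, 1.0]])   # C5 and C4 give F5
f4 = fuse(f5, c3, [[1.0]])        # F5 and C3 give F4
```

The 1 × 1 convolution brings the shallower map to the same channel count as the upsampled deeper map, which is what makes an element-wise fusion possible.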

(2) SRU network: building on the multi-scale residual CNN from the previous step, the network is combined with an SRU to form the multi-scale residual CNN and SRU network. The SRU part consists of two bidirectional SRU layers. Because the height of the score image and the number of convolution kernels are fixed, the recurrence length of each layer stays constant. Each SRU uses 512 hidden units for forward learning and backpropagation of the weights; the network-specific parameters are given in Table 1.
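
How a feature map might be handed to a bidirectional recurrent layer can be sketched as follows. The sketch assumes the map-to-sequence convention commonly used in CRNN-style models: each image column becomes one time step whose feature vector stacks all channels and rows, so the frame size is fixed by the map height and the channel count. The recurrent layers are passed in as plain callables; `running_sum` is a hypothetical stand-in used only to make the example runnable and is not an SRU.

```python
def map_to_sequence(fmap):
    """Turn a [C][H][W] feature map into W frames of length C * H."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[fmap[c][i][j] for c in range(channels) for i in range(height)]
            for j in range(width)]


def bidirectional(forward_layer, backward_layer, frames):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    fwd = forward_layer(frames)
    bwd = backward_layer(frames[::-1])[::-1]   # re-align to original order
    return [f + b for f, b in zip(fwd, bwd)]


def running_sum(frames):
    """Stand-in recurrent layer: cumulative sum over time steps."""
    total = [0.0] * len(frames[0])
    out = []
    for frame in frames:
        total = [t + x for t, x in zip(total, frame)]
        out.append(total)
    return out


fmap = [[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]]]   # C=2, H=1, W=3
frames = map_to_sequence(fmap)
outputs = bidirectional(running_sum, running_sum, [[1.0], [2.0], [3.0]])
```

Each output frame then carries context from both the notes before it and the notes after it, which matches the motivation for using bidirectional layers on note sequences.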

Third, the model is trained. The constructed model is trained on the data set, and the best model is saved. The input of the deep learning model is a music score image from the data set, and the ground-truth label is the semantic information of the notes in that image. The network parameters are adjusted step by step through a connectionist temporal classification (CTC) loss function until they reach their optimum, and the model finally outputs the predicted note semantic information.
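
Assuming the loss function named above is the connectionist temporal classification (CTC) loss, the following sketch shows how it scores a label sequence against per-frame class probabilities, and how a prediction can be read off by best-path decoding. The decoding rule is an assumption, since the text does not state how the final prediction is obtained. Class 0 is taken as the blank symbol, and the probabilities are toy values.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one sequence.

    probs[t][k] is the probability of class k at frame t;
    target is the list of label ids without blanks.
    """
    # Extended label sequence: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


def ctc_greedy_decode(probs, blank=0):
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    decoded, prev = [], None
    for frame in probs:
        k = max(range(len(frame)), key=frame.__getitem__)
        if k != blank and k != prev:
            decoded.append(k)
        prev = k
    return decoded


uniform = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(uniform, [1])
decoded = ctc_greedy_decode([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.3, 0.7]])
```

The loss sums over every frame-level alignment that collapses to the target, so training needs no per-frame position labels for the notes. For long sequences a practical implementation would work in log space to avoid numerical underflow.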

Table 1 Network-specific parameters

Table 2 Accuracy comparison of different networks

The method recognizes music score images with a multi-scale residual CNN and an SRU. First, the music score image data set is augmented by deformation, added noise, and similar methods to improve the generalization ability of the model. Second, the feature extraction part uses a residual CNN structure, and the features extracted by each residual block are fused across scales, so that multi-level features are concentrated in a single feature map; this strengthens the model's feature representation and in turn improves the accuracy of the subsequent RNN recognition. Finally, the RNN feature recognition part uses an SRU model, which speeds up model convergence and shortens training time.

Claims

1. A music score image recognition method combining a multi-scale residual CNN and an SRU, comprising the following steps:

First, establishing a data set of music score images: score samples are selected and image augmentation techniques are applied so that the data set contains score images under non-ideal conditions, thereby expanding the data set.

(1) Multi-scale residual CNN network: the multi-scale residual CNN consists of five convolutional residual blocks and performs multi-scale feature fusion. The input image passes through the five residual blocks in sequence to produce feature maps C1, C2, C3, C4, and C5. All convolution kernels are 3 × 3, and the number of kernels increases block by block as 32, 64, 128, 256, and 256. The last feature map C5 is upsampled by a factor of 2 and fused with the result of applying a 1 × 1 convolution to feature map C4 to obtain feature F5; F5 and C3 are then processed in the same way as C5 and C4 to obtain feature F4.