Singer Identification by Vocal Parts Detection and Singer Classification Using LSTM Neural Networks
https://doi.org/10.22214/ijraset.2022.46473
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue VIII August 2022- Available at www.ijraset.com
Abstract: Singer identification is an important research area in audio signal processing. It has attracted researchers' interest along two main lines: 1) detecting the vocal parts of polyphonic music, and 2) classifying the singer. Here, we address both problems simultaneously. Techniques such as GMM, SVM, and HMM have previously been used for singer classification. In this work, we propose a system for singer identification using deep learning and feed-forward neural networks, which to the best of our knowledge have not been applied to this task before. Preprocessing involves examining large sets of audio features to extract the most effective subset for the recognition stage. Our work proceeds in several stages. First, the vocal parts of all music files are detected using an LSTM network, which performs well on time-series data such as audio signals. Then, an MLP network is built and compared with an SVM classifier to classify the gender of the singer. Finally, another LSTM network is used to recognize each singer's identity and is compared with an MLP network on the same task. At each step, several classifiers are evaluated and their results compared, confirming the effectiveness of our method relative to the state of the art.
Keywords: Long Short-Term Memory, Mel-Frequency Cepstral Coefficients, Support Vector Machine, Multi-Layer Perceptron, Non-Negative Matrix Factorization, Non-Negative Matrix Partial Co-Factorization.
I. INTRODUCTION
Singer identification (SID) is a significant research area in signal processing. As a subcategory of speaker identification (SPID), it is a hot topic that has attracted researchers' attention. Its aim is to determine the identity of a singer in music accompanied by various instruments.
With the growth of the music industry and the increasing convenience of sophisticated recording techniques, an enormous number of songs are released and played over the Internet, television, and radio channels every day. One of the most salient attributes of a song is the singer: many people use the singer's voice to recognize songs quickly.
The distinctive characteristics of the vocal line make it memorable for listeners and a suitable basis for organizing music databases.
What makes this problem particularly challenging are the accompanying instruments, background vocals, and recording artifacts. Hence, an SID system must discriminate between different sources of sound. Two main approaches have been proposed so far to tackle this problem: 1) computing MFCC features directly from the polyphonic music and using them to classify the singer, and 2) applying a separation technique to extract the singer's voice from the music, followed by an SPID method.
The first approach relies on the assumption that the computed features are sufficiently robust to distinguish the singing-voice segments. One of the important stages in this approach is detecting the vocal segments, and the accuracy of the classification system improves when it is well trained on spoken data [5]. In [6], the authors model the query song as a GMM built from features extracted from the sung notes of the song. The model is then compared with a singer-based GMM, or with a GMM built from another song performed by the same singer, and the dissimilarity is measured by the Kullback-Leibler divergence.
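As a rough sketch of this pipeline (illustrative only, not the configuration of [6]; since GMMs admit no closed-form KL divergence, a Monte-Carlo estimate is one common approximation):

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def song_gmm(path, n_components=8, n_mfcc=20):
    """Fit a GMM to the MFCC frames of one song (sizes are assumptions)."""
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # frames x coeffs
    return GaussianMixture(n_components=n_components).fit(mfcc)

def gmm_kl(p, q, n_samples=5000):
    """Monte-Carlo estimate of KL(p || q) = E_p[log p(x) - log q(x)]."""
    x, _ = p.sample(n_samples)
    return np.mean(p.score_samples(x) - q.score_samples(x))

# Compare a query song against a singer model or another song's GMM:
# dissimilarity = gmm_kl(song_gmm("query.wav"), song_gmm("reference.wav"))
```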
The second approach (i.e., separating the singer's voice) is investigated in [7] and [8]. The work in [7] performs non-negative matrix factorization (NMF) to isolate the singing voice from polyphonic music. In [8], a two-stage method is proposed: the first stage uses non-negative matrix partial co-factorization (NMPCF), and the second stage separates the pitch and harmonic components of the singer's voice.
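As a minimal single-stage illustration of spectrogram NMF (a toy sketch, not the multi-stage method of [7] or the NMPCF of [8]; the file name is hypothetical):

```python
import librosa
import numpy as np
from sklearn.decomposition import NMF

y, sr = librosa.load("song.wav")      # hypothetical polyphonic input
S = np.abs(librosa.stft(y))           # non-negative magnitude spectrogram

# Factor S ~ W @ H; separation methods then assign the learned
# components either to the singing voice or to the accompaniment.
model = NMF(n_components=16, init="nndsvd", max_iter=400)
W = model.fit_transform(S)            # spectral templates (freq x comp)
H = model.components_                 # activations (comp x time)
S_hat = W @ H                         # low-rank reconstruction
```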
Our method is closest to the first approach, in which we provide an evaluation of several classification techniques for polyphonic music. We define our list of features, and then exploit artificial neural networks and recurrent deep neural networks at these stages and implement them.
A. Machine Learning
Machine learning helps us build a mathematical model for understanding data. We feed input data to the model for training; from this information the model learns, analyzes the patterns, and finds the best way to predict results. Machine-learning models are therefore well suited to our system.
The Gaussian Mixture Model (GMM) is used as a baseline classification method. The GMM classifier combines the advantages of parametric and non-parametric methods: it does not require storing the entire set of training vectors to make a classification, and it is a flexible model that can adapt to almost any distribution of data. The GMM is trained on the extracted features using the Expectation-Maximization (EM) algorithm to fit its components; when tested, however, it produces less accurate results if the voice contains background noise.
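A minimal sketch of such a baseline, assuming per-class MFCC feature matrices have already been extracted (a generic one-GMM-per-class scheme, not a tuned configuration):

```python
from sklearn.mixture import GaussianMixture

def train_gmm_classifier(features_by_class, n_components=8):
    """Fit one GMM per class with the EM algorithm."""
    return {label: GaussianMixture(n_components=n_components).fit(X)
            for label, X in features_by_class.items()}

def classify(gmms, X):
    """X: (frames, n_mfcc) MFCCs of one clip; pick the most likely class."""
    return max(gmms, key=lambda label: gmms[label].score(X))
```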
These earlier algorithms have limitations: they cannot always produce results within the given time, and they do not always produce correct results, which leads to low accuracy. Compared with these two models used in existing systems, GMM and HMM, our proposed model, Long Short-Term Memory (LSTM), achieves higher accuracy.
IV. TRAINING
The LSTM is a supervised model. It is first trained over several epochs on shuffled data, using gradient-descent optimization methods to compute the gradient of the error with respect to each weight, so that the weights of the LSTM network can be updated accordingly. A problem arises in gradient-descent training of recurrent networks when the error gradients vanish exponentially in proportion to the size of the time lag between significant events. During backpropagation, the error values computed at the output layer are fed back to the previous layers, and this process continues until the model learns to fit the data correctly and to suppress the components that raise the error; the LSTM's gating architecture is what lets these error signals survive over long lags.
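The vanishing-gradient effect itself is easy to demonstrate with a toy calculation (a schematic illustration, not our network's actual gradients):

```python
# Backpropagating through T time steps multiplies the error signal by the
# recurrent Jacobian at every step; if its norm is below 1, the gradient
# shrinks exponentially with the lag T, which is what LSTM gating avoids.
recurrent_factor = 0.5   # stand-in for the recurrent Jacobian norm
grad = 1.0               # error gradient at the output
for t in range(20):      # propagate back over 20 time steps
    grad *= recurrent_factor
print(f"gradient after 20 steps: {grad:.1e}")  # ~9.5e-07, effectively zero
```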
We build two LSTM models: an identification model used to identify gender, and a classification model used to identify the singer. Each model is built with a convolutional 2D input layer, the hidden layers mentioned above, and a dense output layer with activation functions such as softmax and ReLU, which lets us classify the singer ID. The model is then compiled with the Adam optimizer to compute the loss and metrics, and the fit function is used to fit the training inputs and outputs to the model. The trained weights are saved against the output labels, the model architecture is converted to JSON format, and the history file, which contains the information used to build the model, is dumped to a pickle file. Once the model has been built, it can therefore be loaded directly instead of being rebuilt, and the data in the pickle file can be loaded to compute the accuracy.
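A minimal sketch of this build-compile-fit-save pipeline in Keras; the layer sizes, file names, dummy data, and the pure-LSTM stack (in place of the Conv2D front end mentioned above) are illustrative assumptions:

```python
import pickle
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_classifier(timesteps, n_mfcc, n_classes):
    """LSTM classifier over MFCC sequences; sizes are assumptions."""
    model = Sequential([
        LSTM(128, input_shape=(timesteps, n_mfcc), return_sequences=True),
        LSTM(64),
        Dropout(0.3),
        Dense(64, activation="relu"),
        Dense(n_classes, activation="softmax"),  # one unit per label
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Dummy stand-ins for the MFCC sequences and one-hot labels that the
# feature-extraction stage would actually produce.
X_train = np.random.rand(200, 100, 40).astype("float32")
y_train = np.eye(19)[np.random.randint(0, 19, size=200)]

model = build_classifier(timesteps=100, n_mfcc=40, n_classes=19)
history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                    shuffle=True, validation_split=0.1)

with open("model.json", "w") as f:          # architecture as JSON
    f.write(model.to_json())
model.save_weights("model.weights.h5")      # learned weights
with open("history.pkl", "wb") as f:        # training history for accuracy
    pickle.dump(history.history, f)
```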
The MIR-1K dataset is given as input to the LSTM models: MFCC features are extracted from it and used for training. The testing process then starts, in which new (test) data is supplied and the actual results are compared with the predicted results to compute the accuracy.
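Sketching that comparison, assuming test features and one-hot labels prepared exactly like the training data above:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# X_test, y_test: held-out MFCC sequences and one-hot labels, assumed
# prepared the same way as X_train and y_train in the previous sketch.
y_pred = np.argmax(model.predict(X_test), axis=1)  # predicted singer IDs
y_true = np.argmax(y_test, axis=1)                 # actual singer IDs
print("test accuracy:", accuracy_score(y_true, y_pred))
```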
In the figure above, after clicking the Upload Test Audio & Classify button, a file dialog opens containing audio clips of unseen data; we select a clip and click Open. The prediction code then extracts the MFCC features of the clip using librosa, ravels them into a sequence of features, resizes and reshapes them into an array, and converts them to a NumPy array for processing. This data is passed to the identification and classification models. The argmax function assigns the class with the maximum probability: for example, if the voice scores 80% female and 20% male, the result is female, and the singer is classified in the same way. After uploading an audio clip, the results are displayed in the window, e.g., "Uploaded Vocal Parts identified as Female" and "Uploaded Audio file classified as singer name: Annar".
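A sketch of that prediction path (the fixed feature length, the reshape, and the model and label names are illustrative assumptions; the two models are assumed already loaded):

```python
import librosa
import numpy as np

FEATURE_LEN = 4000   # assumed fixed input length: 100 timesteps x 40 MFCCs

def predict_clip(path, gender_model, singer_model, singer_names):
    """Extract MFCCs with librosa, flatten and resize, run both models."""
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    feat = mfcc.ravel()                  # flatten to a feature sequence
    feat = np.resize(feat, FEATURE_LEN)  # force a fixed length
    feat = feat.reshape(1, 100, 40)      # (batch, timesteps, n_mfcc)

    gender_prob = gender_model.predict(feat)[0]   # e.g. [0.2, 0.8]
    gender = "Female" if np.argmax(gender_prob) == 1 else "Male"
    singer = singer_names[np.argmax(singer_model.predict(feat)[0])]
    return gender, singer

# e.g. predict_clip("test.wav", id_model, cls_model, names)
# -> ("Female", "Annar")
```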
REFERENCES
[1] Fu, Zhouyu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang. "A survey of audio-based music classification and annotation." IEEE Transactions on Multimedia 13, no. 2 (2011): 303-319.
[2] Tsai, Wei-Ho, and Hao-Ping Lin. "Background music removal based on cepstrum transformation for popular singer identification." IEEE Transactions on
Audio, Speech, and Language Processing 19, no. 5 (2011): 1196-1205.
[3] Pikrakis, Aggelos, Yannis Kopsinis, Nadine Kroher, and José-Miguel Díaz-Báñez. "Unsupervised singing voice detection using dictionary learning." In 2016 24th European Signal Processing Conference (EUSIPCO), pp. 1212-1216. IEEE, 2016.
[4] Song, Liming, Ming Li, and Yonghong Yan. "Automatic vocal segments detection in popular music." In 2013 Ninth International Conference on
Computational Intelligence and Security, pp. 349-352. IEEE, 2013.
[5] Tsai, Wei-Ho, and Hsin-Chieh Lee. "Singer identification based on spoken data in voice characterization." IEEE Transactions on Audio, Speech, and Language
Processing 20, no. 8 (2012): 2291-2300.
[6] Regnier, Lise, and Geoffroy Peeters. "Singer verification: singer model vs. song model." In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 437-440. IEEE, 2012.
[7] Zhu, Bilei, Wei Li, Ruijiang Li, and Xiangyang Xue. "Multi-stage non-negative matrix factorization for monaural singing voice separation." IEEE Transactions on Audio, Speech, and Language Processing 21, no. 10 (2013): 2096-2107.
[8] Hu, Ying, and Guizhong Liu. "Separation of singing voice using nonnegative matrix partial co-factorization for singer identification." IEEE Transactions on
Audio, Speech, and Language Processing 23, no. 4 (2015): 643-653.
[9] Logan, Beth. "Mel Frequency Cepstral Coefficients for Music Modeling." In ISMIR, vol. 270, pp. 1-11. 2000.
[10] Eronen, Antti, and Anssi Klapuri. "Musical instrument recognition using cepstral coefficients and temporal features." In 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '00), vol. 2, pp. II753-II756. IEEE, 2000.
[11] Kooshan, S., H. Fard, and R. M. Toroghi. "Singer identification by vocal parts detection and singer classification using LSTM neural networks." In 2019 4th International Conference on Pattern Recognition and Image Analysis (IPRIA), pp. 246-250. IEEE, 2019. doi: 10.1109/PRIA.2019.8786009.
[12] Zhang, Xulong, Jiale Qian, Yi Yu, Yifu Sun, and Wei Li. "Singer identification using deep timbre feature learning with KNN-NET." In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3380-3384. IEEE, 2021.
[13] Biswas, Sharmila, and Sandeep Singh Solanki. "Speaker recognition: an enhanced approach to identify singer voice using neural network." International Journal of Speech Technology 24, no. 1 (2021): 9.
[14] Graves, Alex, Navdeep Jaitly, and Abdel-rahman Mohamed. "Hybrid speech recognition with deep bidirectional LSTM." In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 273-278. IEEE, 2013.
[15] Gers, Felix A., Nicol N. Schraudolph, and Jürgen Schmidhuber. "Learning precise timing with LSTM recurrent networks." Journal of Machine Learning Research 3 (2002): 115-143.