Spotify Top Hits
Abstract: Music is eternal. Since the beginning of human existence, music has been in constant development, and so has one of today's best-known applications for bringing it to everyone: Spotify. Spotify is undoubtedly a success story in how people listen to music and organize their own playlists and songs according to their tastes, and its use keeps growing. Creating a playlist can involve many parameters for arranging songs in a specific way: a playlist could be organized, for example, by music genre, by the danceability of its songs, or by their popularity. This research is based on the relationships between the songs in a Spotify dataset selected for this article and their placement in different playlists according to their tempo, duration, etc. To carry out this analysis, the article relies on three algorithms: the Decision Tree, the Random Forest, and Naïve Bayes.
I. INTRODUCTION & PROBLEM DEFINITION
For as long as we can remember, music has been a feature of all societies, and its diversity reflects their cultural differences and eras. With a billion-dollar turnover, the music industry continues to expand with each passing year. The global recorded music market generated 20.2 billion US dollars in revenue in 2019 and 21.6 billion US dollars in 2020. Thanks to the digitalization of the music industry, people can now listen to music whenever and wherever they choose. Modern computer technologies have also made it easier to collect and store large amounts of data. [1]
With a fast-growing user base, Spotify is one of the most popular on-demand music services. Over 1.2 trillion songs were expected to be streamed on demand in 2018, underscoring the cultural significance of music in our lives. [2]
Music's popularity and trends change over very short periods of time. The popularity and success of songs are influenced by a variety of elements, including genre, chords, danceability, tempo, artist, play-counts, and so on. If music producers can learn more about the factors that influence a track's popularity, they will also be able to tailor tracks to the preferences of their target audiences, ensuring their success to some extent. [1]
The attributes of the songs can be broken down and analysed using artificial intelligence and machine learning to predict their popularity. [1]
Personalization of music services is also one of the primary concerns in the music industry and its digitalization. Automatically creating customised music playlists and offering a way to switch styles result in substantial progress in the personalization and intelligence of music services, as well as increased customer happiness. [3]
Starting from a Spotify dataset extracted from Kaggle, this work applies predictive analysis, using regression and classification techniques, to obtain a model that can be used to generate a personalized music playlist based on a given criterion.
After discussing the background theory in Section II, the methodology of the proposed approach is provided in Section III, followed by the results and discussion in Section IV. Finally, the paper is concluded in Section V, followed by an acknowledgment of those who provided indispensable help.
II. BACKGROUND THEORY
In this work, three different models were used: the Decision Tree, the Random Forest, and Naïve Bayes.
A. Decision Tree Algorithm
The Decision Tree algorithm is part of the family of supervised learning algorithms. Unlike many other supervised learning algorithms, the decision tree approach can be used to solve both regression and classification problems. [4]
The purpose of a Decision Tree is to build a model that predicts the class or value of the target variable by learning simple decision rules inferred from prior (training) data. [4]
To predict a class label for a record, we start from the root of the tree. The value of the root attribute is compared with the record's attribute. Based on the comparison, we follow the branch that corresponds to that value and jump to the next node. [4]
The decision to make strategic splits has a significant impact on a tree's accuracy. The decision criteria for classification and regression trees are different. [4]
Decision trees employ a variety of techniques to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resulting sub-nodes. In other words, the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables, then chooses the split that produces the most homogeneous sub-nodes. [4]
We chose this algorithm because it handles non-linear data sets effectively.
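The split selection described for the Decision Tree can be illustrated with a short sketch. It assumes Gini impurity as the purity measure, which is only one common choice and is not necessarily the criterion used in this work; the song values, the attribute names and the helper functions `gini`, `split_impurity` and `best_split` are all hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, attribute, threshold):
    """Weighted Gini impurity after splitting on rows[i][attribute] <= threshold."""
    left = [lab for row, lab in zip(rows, labels) if row[attribute] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[attribute] > threshold]
    total = len(labels)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

def best_split(rows, labels):
    """Try every attribute and every observed value as a threshold; keep the purest split."""
    best = None
    for attribute in rows[0]:
        for threshold in sorted({row[attribute] for row in rows}):
            score = split_impurity(rows, labels, attribute, threshold)
            if best is None or score < best[0]:
                best = (score, attribute, threshold)
    return best

# Made-up songs with normalized attributes (assumption, not the paper's data).
songs = [
    {"energy": 0.9, "speechiness": 0.05},
    {"energy": 0.8, "speechiness": 0.30},
    {"energy": 0.3, "speechiness": 0.10},
    {"energy": 0.2, "speechiness": 0.40},
]
very_popular = ["yes", "yes", "no", "no"]

score, attribute, threshold = best_split(songs, very_popular)
print(attribute, threshold, score)
```

In this toy data, splitting on `energy` separates the two classes completely, so it yields the most homogeneous sub-nodes and is the split the sketch selects.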
The Naïve Bayes model is simple to construct and is especially good for huge data sets. Despite its simplicity, Naïve Bayes is known to outperform, in some cases, even highly sophisticated classification methods. [7]
Naïve Bayes uses a similar method to predict the probability of each class based on various attributes. This algorithm is mostly used in text classification and in problems with multiple classes.
III. METHODOLOGY
The research also identifies the factors that have played a significant role in determining the songs' popularity. The entire methodology can be divided into 3 major sections: Pre-processing of dataset, Model Development, and Analysis of Graphs.
A. Pre-processing of dataset
The data for this paper was collected from Spotify, a popular music service. Spotify's own music sorting criteria are built into the data, and this work relies on the same metrics to compare songs.
Valence: A scale ranging from 0.0 to 1.0 that describes how positive a track is musically. Tracks with a high valence sound more optimistic (e.g., joyful, cheerful, euphoric), whereas tracks with a low valence sound gloomier (e.g., tragic, melancholy). [8]
Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terms, the tempo is the speed or pace of a piece, and it is derived directly from the average beat duration. These values were normalized, so a high tempo (near 1.0) translates into a higher speed. [8]
All parameters were normalized to make the graphs relating popularity to parameter values easier to read, similar to the work done in [8]. This also helps in understanding which parameters have the most influence on a song's popularity.
A new column was created in the database to serve as the label for the models' predictions and analysis. It classifies a song as very popular or not: a song with a popularity index over 70 was considered very popular. A total of 699 out of 1942 songs were in this category.
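The two pre-processing steps, normalizing the parameters and adding a "very popular" label, can be sketched as follows. The paper does not state which normalization was used, so min-max scaling is an assumption here; the function names, the sample values and the "yes"/"no" label strings are hypothetical. Only the threshold of 70 comes from the text.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0.0, 1.0] (assumed normalization)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

def label_very_popular(popularity, threshold=70):
    """A song counts as very popular only when its popularity is over the threshold."""
    return "yes" if popularity > threshold else "no"

# Made-up example values (assumption, not the paper's data).
tempos_bpm = [80.0, 100.0, 120.0, 160.0]
popularities = [45, 71, 70, 88]

normalized_tempo = min_max_normalize(tempos_bpm)
labels = [label_very_popular(p) for p in popularities]
print(normalized_tempo, labels)
```

Note that a song with a popularity of exactly 70 is labelled "no" in this sketch, following the "over 70" wording.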
B. Model Development
This section presents the models developed for this work. They were developed using the software “RapidMiner”.
1) Decision Tree Model
The Decision Tree Model is a very elementary method. The generated tree contains various nodes, each representing a splitting rule for one specific attribute, and the generation of new nodes is repeated until the stopping criteria are met. [9]
Fig. 1: Decision Tree Model.
After generation, the decision tree model can be applied to new examples using the Apply Model operator. [9]
2) Random Forest Model
The Random Forest Model is essentially a learning method that operates by constructing a multitude of decision trees at training time. Because of this, it generally outperforms an individual Decision Tree Model, although boosted trees might achieve better results. A total of 200 trees were used to maximize the accuracy of the predictions.
3) Naïve Bayes Model
One process was created with a very straightforward application of the Naïve Bayes operator. The classification model is delivered at the output port and can be applied to the unlabelled data to generate predictions, which are subsequently evaluated by the Performance operator. Furthermore, the Split Data operator divides the original data set into two parts: one is used to train the Naïve Bayes model, and the other serves to evaluate it. [10]
With the results obtained it is possible to create different types of song playlists with distinct characteristics. This is further demonstrated in the results section of the article.
C. Analysis of Graphs
The final part of the approach is dedicated to analysis. After training and testing, the model displays the maximum number of songs, with significant diversity between them.
Further analysis focuses on the musical parameters, namely acousticness, liveness, danceability, instrumentalness, tempo, loudness, speechiness and energy. These parameters are plotted against popularity for both sets to check how they fare against each other. In the following graphs, the parameters are plotted on the X-axis and the popularity on the Y-axis. [8]
The results are reported in the following section.
IV. RESULTS AND DISCUSSION
The scatter plots of both sets show some striking results across the different attributes when compared with the popularity of a song. Popularity is an important attribute for understanding the behaviour of the other attributes.
Fig. 6 shows that a song becomes more popular when it has high energy values. Songs with low energy values are not popular at all.
Fig. 10 shows that popularity is higher when the speechiness attribute has low values.
Fig. 13: Attribute weights in the popularity, using the Random Forest Model.
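The model results in this section are discussed in terms of accuracy and class recall. As a reminder of how these two measures are obtained from actual and predicted labels, here is a minimal sketch. The labels are made up and do not reproduce the paper's figures, and the `evaluate` function is a hypothetical helper, not a RapidMiner operator.

```python
def evaluate(actual, predicted, classes=("yes", "no")):
    """Return overall accuracy and per-class recall for two equal-length label lists."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    accuracy = correct / len(actual)
    recall = {}
    for cls in classes:
        # Predictions made for the songs whose true class is `cls`.
        relevant = [p for a, p in zip(actual, predicted) if a == cls]
        recall[cls] = relevant.count(cls) / len(relevant) if relevant else 0.0
    return accuracy, recall

# Made-up labels: "yes" means very popular (assumption, not the paper's data).
actual    = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["no",  "no",  "yes", "no", "no", "no", "no", "yes"]

accuracy, recall = evaluate(actual, predicted)
print(accuracy, recall)
```

A model that predicts very few Yes's can still reach a moderate accuracy while its recall for the "yes" class stays low, which is why both measures are worth reading together.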
1) Decision Tree Model
First, the results obtained with the Decision Tree model were not desirable, with very low class recall and very few predicted Yes's compared to the predicted No's.
Fig. 14: Decision Tree Model statistics.
2) Random Forest Model
Using the Random Forest method, which predicts based on not one but multiple decision trees, the following results were achieved.
Fig. 15: Random Forest statistics.
The accuracy achieved with the Random Forest method is far from ideal, as is visible in Fig. 15. This is due to the individual decision trees being far from optimized, but it is still better than the outcome of a single tree model (Decision Tree).
Even changing the number of trees, as well as the depth, and applying different pruning and prepruning values did not translate into a significant improvement in accuracy.
3) Naïve Bayes Model
On the other hand, the Naïve Bayes Model yielded the best results of the three models tested. Its accuracy and class recall percentages were fairly reasonable, and the conclusions of this paper are based on this model.
Fig. 16: Naïve Bayes statistics.
4) Creation of Playlists
This model can therefore be used to create playlists based on its predictions. An interesting idea would be, for example, to create a playlist made up of songs that the model predicted would be very popular, based on the values of the attributes mentioned above, but that in reality have a popularity rating below 70.
Fig. 17: Example of songs for the playlist.
Fig. 17 highlights three examples of songs that the model wrongly predicted to be very popular. These could be used to create a playlist named “Great songs you might not know”, for example.
V. CONCLUSIONS
This work led to the following conclusions:
• Some parameters have a big influence on a song’s popularity.
• Loudness, danceability, and energy were three of the most important parameters in a song’s popularity.
• Three different models were used to address our problem, using the following algorithms: the Decision Tree, the Random Forest, and Naïve Bayes.
• The results showed that the Naïve Bayes model had the highest accuracy, making it the most suitable for our work.
• Moreover, we were also able to generate a playlist based on the model's results, made up of songs wrongly predicted to be top hits.
There were more ideas for this project; however, this was the most that could be done in the time given. For future work, it would be interesting to use the results in this paper to create different playlists that consider the influence of certain parameters while using different models.
ACKNOWLEDGMENTS
This paper and the research behind it would not have been possible without the exceptional support of Professors Eduardo Oliveira, Teresa Galvão and José Luís Borges.
REFERENCES
1. Kamal, J., et al. A Classification Based Approach to the Prediction of Song Popularity. in 2021 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES). 2021.
2. Gulmatico, J.S., et al. SpotiPred: A Machine Learning Approach Prediction of Spotify Music Popularity by Audio Features. in 2022 Second International Conference on Power, Control and Computing Technologies (ICPC2T). 2022.
3. Bakhshizadeh, M., et al. Automated Mood Based Music Playlist Generation By Clustering The Audio Features. in 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE). 2019.
4. Chauhan, N.S. Decision Tree Algorithm, Explained. Machine Learning 2022 [cited 2022]; Available from: https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html.
5. Introduction to Random Forest in Machine Learning. Machine Learning 2020; Available from: https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/.
6. Chauhan, N.S. Naïve Bayes Algorithm: Everything You Need to Know. Machine Learning 2022; Available from: https://www.kdnuggets.com/2020/06/naive-bayes-algorithm-everything.html.
7. Ray, S. 6 Easy Steps to Learn Naive Bayes Algorithm with codes in Python and R. 2017; Available from: https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/.
8. Datta, A., et al. Multi-Class Classification of Different Region Pop Songs using Spotify Database. in 2021 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA). 2021.
9. Decision Tree. [cited 2022]; Available from: https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/trees/parallel_decision_tree.html#:~:text=Description,rule%20for%20one%20specific%20Attribute.
10. Naive Bayes. Available from: https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/bayesian/naive_bayes.html