Spotify Top Hits

Daniel Moraes, Faculty of Engineering, University of Porto, Porto, Portugal, up201800181@edu.fe.up.pt
Gonçalo Moutinho, Faculty of Engineering, University of Porto, Porto, Portugal, up201506397@edu.fe.up.pt
Helena Ferreira, Faculty of Engineering, University of Porto, Porto, Portugal, up201706264@edu.fe.up.pt
Ines David, Faculty of Engineering, University of Porto, Porto, Portugal, up201806552@edu.fe.up.pt
Rui Gomes, Faculty of Engineering, University of Porto, Porto, Portugal, up201705528@edu.fe.up.pt
Tomás Miranda, Faculty of Engineering, University of Porto, Porto, Portugal, up201704718@edu.fe.up.pt

Abstract: Music is eternal. From the beginning of human existence, music has been in constant development, and so has one of today's best-known applications for bringing it to everyone: Spotify. Spotify is undoubtedly a success story when it comes to listening to music and to how people organize their own playlists and songs according to their musical tastes, and its use keeps growing over time. Creating a playlist can involve many parameters for arranging songs in a specific way; a playlist could be organized, for example, by music genre, by the danceability of its songs, or by their popularity. This research is based on the relationships between the songs in a Spotify database selected for this article and their placement in different playlists according to their tempo, duration, etc. To carry out the analysis, this article relies on three algorithms: the Decision Tree, the Random Forest, and Naïve Bayes.

I. INTRODUCTION & PROBLEM DEFINITION

For as long as we can remember, music has been a feature of all societies. Its diversity stems from cultural differences and from the eras in which it was made. With a billion-dollar turnover, the music industry continues to expand with each passing year. The global recorded music market generated 20.2 billion US dollars in revenue in 2019 and 21.6 billion US dollars in 2020. People may now listen to music whenever and wherever they choose, thanks to the digitalization of the music industry. Modern computer technologies have also made it easier to collect and store large amounts of data. [1]

With a fast-growing user base, Spotify is one of the most popular on-demand music services. Over 1.2 trillion songs were expected to be streamed on demand in 2018, underscoring the cultural significance of music in our lives. [2]

Music's popularity and trends change over very short periods of time. The popularity and success of songs are influenced by a variety of elements, including genre, chords, danceability, tempo, artist, play-counts, and so on. If music producers can learn more about the factors that influence a track's popularity, they will also be able to tailor tracks to the preferences of their target audiences, ensuring their success to some extent. [1]

The attributes of songs can be broken down and analysed using artificial intelligence and machine learning to predict their popularity. [1]

Personalization of music services is also one of the primary concerns in the music industry and its digitalization. Automatically creating customised music playlists and offering a way to switch styles result in substantial progress in the personalization and intelligence of music services, as well as increased customer happiness. [3]

Starting from the extraction of the Spotify dataset from Kaggle, this work applies predictive analysis, using regression and classification techniques, to obtain a model that can be used to generate a personalized music playlist based on a given criterion.

After discussing the background theory in Section II, the methodology of the proposed approach is provided in Section III, followed by the results and discussion in Section IV. Finally, Section V concludes the paper, followed by an acknowledgment of those who provided indispensable help.

II. BACKGROUND THEORY

In this work, three different models were used: the Decision Tree, the Random Forest, and Naïve Bayes.

A. Decision Tree Algorithm

The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike some other supervised learning algorithms, the decision tree approach can be used to solve both regression and classification problems. [4]

The purpose of using a Decision Tree is to build a model that can predict the class or value of the target variable by learning simple decision rules inferred from prior data (training data). [4]

When using a Decision Tree to predict a class label for a record, we start from the root of the tree. The value of the root attribute is compared with the record's attribute. Based on the comparison, we follow the branch that corresponds to that value and jump to the next node. [4]

The decision to make strategic splits has a significant impact on a tree's accuracy. The decision criteria for classification trees and regression trees are different. [4]

Decision trees employ a variety of techniques to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resulting sub-nodes. To put it another way, the purity of the nodes increases with respect to the target variable. The decision tree splits the nodes on all available variables, then chooses the split that produces the most homogeneous sub-nodes. [4]

We decided to use this algorithm because it handles non-linear data sets effectively.
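The split selection described for the Decision Tree algorithm can be illustrated with a minimal sketch. It assumes Gini impurity as the measure of node homogeneity; the paper does not state which criterion was used, and the function names and toy values below are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a perfectly pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) of the best binary split on one numeric attribute."""
    best = (None, float("inf"))
    # The largest value is skipped: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: danceability values and a "very popular" label.
danceability = [0.2, 0.3, 0.4, 0.7, 0.8, 0.9]
very_popular = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_threshold(danceability, very_popular)
```

A full tree would repeat this search over every attribute at every node and stop once a stopping criterion, such as a maximum depth, is met.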
10th June 2022

B. Random Forest Algorithm

A random forest is a machine learning technique for solving classification and regression problems. It makes use of ensemble learning, which is a technique for solving complicated problems by combining multiple classifiers. [5]

A random forest is made up of many decision trees. Bagging, or bootstrap aggregation, is used to train the 'forest' formed by the random forest method. Bagging is an ensemble meta-algorithm that increases the accuracy of machine learning algorithms by combining them. [5]

The random forest algorithm determines the outcome based on the predictions of its decision trees. It predicts by averaging the outputs of the individual trees (for regression) or by taking a majority vote among them (for classification). Increasing the number of trees generally improves the precision of the result. [5]

A random forest overcomes some of the drawbacks of a single decision tree. It reduces overfitting to the dataset and improves precision. It generates predictions without requiring much configuration. [5]

C. Naïve Bayes Algorithm

Naïve Bayes is a good example of how the simplest answers are often the most powerful. Despite recent advancements in Machine Learning, it has proven to be not only easy, but also quick, accurate, and dependable. [6]

It has been used effectively for a variety of purposes, but it excels at natural language processing (NLP) problems. [6]

Bayes' Theorem provides the basis for the Naïve Bayes algorithm, which is used in a wide range of classification problems. [6]

It is a classification method based on Bayes' Theorem and the assumption of independence among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of one feature in a class is unrelated to the presence of any other feature. [7]

For example, a fruit may be considered an apple if it is red, round, and roughly 3 inches in diameter. Even if these characteristics depend on one another or on the presence of other characteristics, each of them contributes independently to the probability that the fruit is an apple, which is why the method is called 'Naive.' [7]

The Naïve Bayes model is simple to construct and is especially useful for very large data sets. Despite its simplicity, Naïve Bayes is known to outperform even highly sophisticated classification methods in some cases. [7]

Naïve Bayes uses this approach to predict the probability of each class based on various attributes. The algorithm is mostly used in text classification and in problems with multiple classes.

III. METHODOLOGY

The research also identifies the factors that have played a significant role in determining the songs' popularity. The methodology can be divided into 3 major sections: Pre-processing of the dataset, Model Development, and Analysis of Graphs.

A. Pre-processing of dataset

The paper's data was collected from Spotify, a popular music service. Spotify has built-in criteria for describing music. This work relies on those same metrics, which were used to compare songs in the project.

The parameters and their significance are as follows:

Acousticness: A scale from 0.0 to 1.0 that shows whether the track is acoustic. The value 1.0 denotes a high level of certainty that the track is acoustic. [8]

Danceability: Describes how suitable a track is for dancing based on a variety of factors such as tempo, rhythm stability, beat strength, and overall regularity. The least danceable value is 0.0, and the most danceable value is 1.0. [8]

Energy: A measure from 0.0 to 1.0 that represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, “death metal” has a high energy level, whereas a Bach prelude has a low one. Perceptual features contributing to this attribute include dynamic range, timbre, onset rate, and general entropy. [8]

Instrumentalness: Predicts whether a track has no vocals. In this context, the sounds “ooh” and “aah” are treated as instrumental. Rap or spoken word tracks are clearly “vocal.” The closer the instrumentalness value is to 1.0, the more likely the track is devoid of vocals. Instrumental tracks are represented by values above 0.5, but confidence increases as the value approaches 1.0. [8]

Liveliness: Detects the presence of an audience in the recording. Higher liveliness values indicate a greater likelihood that the track was performed live. If the value is greater than 0.8, the track is almost certainly live. [8]

Loudness: Loudness values are averaged over the whole track and can be used to compare the relative loudness of different tracks. Loudness is the quality of a sound that is the main psychological correlate of physical strength (amplitude). Typical values range from -60 to 0 decibels. These values were normalized between 0.0 and 1.0, so calmer music has a lower value, near 0.0. [8]

Speechiness: Detects the presence of spoken words in a track. The closer the value is to 1.0, the more purely speech-like the recording is (e.g., talk show, audio book, poetry). Tracks with a value greater than 0.66 are almost always made up entirely of spoken language. Values between 0.33 and 0.66 describe tracks that include both music and speech, in parts or in layers, such as rap music. Music and other non-speech-like tracks are likely to have values below 0.33. [8]

Valence: A scale ranging from 0.0 to 1.0 that describes how positive a track is musically. Tracks with a high valence sound more optimistic (e.g., joyful, cheerful, euphoric), whereas tracks with a low valence sound gloomier (e.g., tragic, melancholy). [8]

Tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terms, the tempo is the speed or pace of a piece, and it is derived directly from the average beat duration. These values were normalized, so a high tempo (near 1.0) translates into a higher speed. [8]

All parameters were normalized to make the graphs that relate popularity to the parameter values easier to read and compare. This is similar to the work done in [8]. It also helps in understanding which parameters are the most influential in the popularity of a song.

A new column was created in the database to serve as the label for the models' predictions and analysis. This label classifies a song as very popular or not. A song with a popularity index over 70 was considered very popular. A total of 699 out of 1942 songs were in this category.
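The pre-processing steps above, normalizing the parameters and creating the "very popular" label, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' actual process: min-max scaling is assumed because the paper does not name the normalization method, and the function names and sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0.0, 1.0] (assumed method)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def label_very_popular(popularity, threshold=70):
    """Label a song "Yes" when its popularity index is over the threshold."""
    return ["Yes" if p > threshold else "No" for p in popularity]


# Hypothetical sample values: loudness in decibels and popularity indices.
loudness_db = [-60.0, -30.0, -6.0, 0.0]
popularity = [45, 70, 71, 88]

loudness_norm = min_max_normalize(loudness_db)
labels = label_very_popular(popularity)
```

With the threshold taken as strictly greater than 70, a song with a popularity index of exactly 70 is labelled "No".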
B. Model Development

We now present the models developed for this work. They were developed using the software RapidMiner.

1) Decision Tree Model

The Decision Tree Model is a very elementary method. The generated tree contains various nodes, each of which represents a splitting rule for one specific attribute, and the creation of new nodes is repeated until the stopping criteria are met. [9]

Fig. 1: Decision Tree Model.

After generation, the decision tree model can be applied to new examples using the Apply Model operator. [9]

2) Random Forest Model

The Random Forest Model is essentially a learning method that operates by constructing a multitude of decision trees at training time. Because of this, it generally outperforms an individual Decision Tree Model, although a boosted tree model might achieve better results. A total of 200 trees were used to maximize the accuracy of the predictions.

Fig. 2: Random Forest Model.

Two processes were created for the Random Forest Model. The first one predicts whether a song will be very popular or not and also evaluates the performance of the model. The second gives us the weight of each of the attributes in that prediction.

3) Naïve Bayes Model

The Naïve Bayes model is a high-bias, low-variance classifier and is generally able to build a good model. It is simple to use and computationally inexpensive, with good accuracy, especially when larger amounts of data are available. [10]

Fig. 3: Naive Bayes Model.

One process was created with a very straightforward application of the Naïve Bayes operator. The classification model is delivered from the output port and can be applied to the unlabelled data, generating predictions that are subsequently evaluated by the Performance operator. Furthermore, the Split Data operator divides the original data set into two parts: one is used to train the Naïve Bayes model, and the other serves to evaluate it. [10]

With the results obtained, it is possible to create different types of song playlists with distinct characteristics. This is further demonstrated in the results section of the article.

C. Analysis of Graphs

The final section of the approach is dedicated to analysis. After training and testing, the model covers the maximum number of songs, with significant diversity between them.

Further analysis focuses on the musical parameters, namely acousticness, liveliness, danceability, instrumentalness, tempo, loudness, speechiness and energy. These parameters are plotted against popularity for the dataset to check how they relate to it. In the following graphs, the parameters are plotted on the X-axis and the popularity on the Y-axis. [8]

The results are presented and discussed in the following section.

IV. RESULTS AND DISCUSSION

The scatter plots show some striking patterns across the different attributes when compared with the popularity of a song. Popularity is an important attribute for understanding the behaviour of the other attributes.

Fig. 4: Popularity vs. Acousticness of songs.

In Fig. 4, it is visible that songs tend to be more popular when the track has some acoustic content, but not too much. It is also visible that there are more songs with low values of acousticness than with high values.

Fig. 5: Popularity vs. Danceability of songs.

In Fig. 5, it is visible that songs tend to be more popular when they have high values of danceability, and less popular when they have intermediate or low values of danceability.
Fig. 6: Popularity vs. Energy of songs.

In Fig. 6, it is visible that songs tend to be more popular when they have high values of energy. Songs with low values of energy are not popular at all.

Fig. 7: Popularity vs. Instrumentalness of songs.

In Fig. 7, it is visible that few songs have a non-zero instrumentalness attribute, and that, among the songs where it is present, popularity does not increase when instrumentalness is low. We can conclude that this parameter does not have much effect on popularity in this database, considering that the majority of songs have similar values of instrumentalness.

Fig. 8: Popularity vs. Liveliness of songs.

In Fig. 8, it is visible that songs tend to be more popular when the liveliness attribute has low values.

Fig. 9: Popularity vs. Loudness of songs.

In Fig. 9, it is visible that songs tend to be more popular when the loudness attribute has high values. There are almost no popular songs with low values of loudness.

Fig. 10: Popularity vs. Speechiness of songs.

In Fig. 10, it is visible that popularity is higher when the speechiness attribute has low values.

Fig. 11: Popularity vs. Tempo of songs.

In Fig. 11, it is visible that there are highly popular songs both when the tempo has high values and when it has low values. In other words, this parameter is shown not to be influential when it comes to the popularity of a song.

Fig. 12: Popularity vs. Valence of songs.

In Fig. 12, it is visible that there are highly popular songs both when the valence has high values and when it has low values. Just like tempo, this parameter does not influence the popularity of a song, given the diversity of valence values among the songs in the database used.

After this graphical analysis, it is possible to understand which song attributes have the most influence on a song's popularity. Using the different algorithms, we tried to determine which attributes carry the most weight in defining the popularity of a song. Loudness, danceability and energy were, for example, three important attributes according to the Random Forest Model.

Fig. 13: Attribute weights in the popularity, using the Random Forest Model.
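One way attribute weights of this kind can be derived from a forest is sketched below. This is a simplified illustration, not the computation performed by RapidMiner: it assumes depth-one trees (stumps), Gini impurity, one randomly chosen attribute per tree, and weights defined as the normalized total impurity decrease. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_gain(rows, labels, attr):
    """Largest Gini impurity decrease achievable with one threshold on `attr`."""
    parent = gini(labels)
    best = 0.0
    for t in sorted({row[attr] for row in rows})[:-1]:
        left = [y for row, y in zip(rows, labels) if row[attr] <= t]
        right = [y for row, y in zip(rows, labels) if row[attr] > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        best = max(best, parent - child)
    return best


def attribute_weights(rows, labels, attrs, n_trees=200, seed=0):
    """Normalized impurity decrease per attribute over a forest of stumps."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in attrs}
    for _ in range(n_trees):
        # Bagging: each stump is trained on a bootstrap sample of the rows.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Each stump considers one randomly chosen attribute.
        attr = rng.choice(attrs)
        totals[attr] += best_split_gain(sample_rows, sample_labels, attr)
    norm = sum(totals.values())
    return {a: (v / norm if norm else 0.0) for a, v in totals.items()}


# Hypothetical toy data: loudness separates the classes, tempo does not.
rows = [{"loudness": 0.9 if i % 2 else 0.2, "tempo": i / 19} for i in range(20)]
labels = ["Yes" if i % 2 else "No" for i in range(20)]
weights = attribute_weights(rows, labels, ["loudness", "tempo"])
```

In this toy data the loudness attribute receives most of the weight because it separates the two classes, while tempo is constructed to be uninformative.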
1) Decision Tree Model

Firstly, with the Decision Tree model, the results obtained were not satisfactory, with very low class recall values and very few predicted Yes's compared to the predicted No's.

Fig. 14: Decision Tree Model statistics.

2) Random Forest Model

Using the Random Forest method, which predicts based on not one but multiple decision trees, the following results were achieved.

Fig. 15: Random Forest statistics.

The accuracy achieved with the Random Forest method is far from ideal, as is visible in Fig. 15. This is due to the individual decision trees being far from optimized, but it is still better than the outcome of a single tree model (Decision Tree).

Even changing the number of trees, as well as the depth, and applying different pruning and prepruning values did not translate into a significant improvement in accuracy.

3) Naïve Bayes Model

On the other hand, the Naïve Bayes Model yielded the best results of the three models tested. Its accuracy and class recall percentages were fairly reasonable, and the conclusions of this paper are based on this model.

Fig. 16: Naïve Bayes statistics.

4) Creation of Playlists

This model can therefore be used to create playlists based on its predictions. An interesting idea would be, for example, the creation of a playlist comprised of songs that the model predicted would be very popular, based on the values of the attributes mentioned above, but which in reality have a popularity rating below 70.

Fig. 17: Example of songs for the playlist.

In Fig. 17, three examples of songs that the model wrongly predicted to be very popular are highlighted. This could be used to create a playlist named “Great songs you might not know”, for example.

V. CONCLUSIONS

In this work, the following conclusions were reached:

• Some parameters have a big influence on a song's popularity.
• Loudness, danceability and energy were three of the most important parameters in a song's popularity.
• Three different models were applied to the problem addressed in this work, using the following algorithms: the Decision Tree, the Random Forest, and Naïve Bayes.
• The results showed that the Naïve Bayes model was the one with the highest accuracy, making it the most suitable for our work.
• Moreover, we were also able to generate a playlist based on the model's results, made up of songs that were wrongly predicted to be top hits.

There were more ideas for this project; however, this was as much as we could do in the time given. For future work, it would be interesting to build on the results in this paper and create different playlists that consider the influence of certain parameters while using different models.

ACKNOWLEDGMENTS

This paper and the research behind it would not have been possible without the exceptional support of Professors Eduardo Oliveira, Teresa Galvão and José Luís Borges.

REFERENCES

1. Kamal, J., et al. A Classification Based Approach to the Prediction of Song Popularity. in 2021 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES). 2021.
2. Gulmatico, J.S., et al. SpotiPred: A Machine Learning Approach Prediction of Spotify Music Popularity by Audio Features. in 2022 Second International Conference on Power, Control and Computing Technologies (ICPC2T). 2022.
3. Bakhshizadeh, M., et al. Automated Mood Based Music Playlist Generation By Clustering The Audio Features. in 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE). 2019.
4. Chauhan, N.S. Decision Tree Algorithm, Explained. Machine Learning 2022 [cited 2022]; Available from: https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html.
5. Introduction to Random Forest in Machine Learning. Machine Learning 2020; Available from: https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/.
6. Chauhan, N.S. Naïve Bayes Algorithm: Everything You Need to Know. Machine Learning 2022; Available from: https://www.kdnuggets.com/2020/06/naive-bayes-algorithm-everything.html.
7. Ray, S. 6 Easy Steps to Learn Naive Bayes Algorithm with codes in Python and R. 2017; Available from: https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/.
8. Datta, A., et al. Multi-Class Classification of Different Region Pop Songs using Spotify Database. in 2021 5th International Conference on Electronics, Communication and Aerospace Technology (ICECA). 2021.
9. Decision Tree. [cited 2022]; Available from: https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/trees/parallel_decision_tree.html#:~:text=Description,rule%20for%20one%20specific%20Attribute.
10. Naive Bayes. Available from: https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/bayesian/naive_bayes.html
