1. Introduction
Recommender systems (RS) are computerized services that diminish information overload by filtering out unnecessary and annoying information, thus simplifying the process of finding interesting and/or relevant content and thereby improving comfort and quality of life. The output of a typical RS is a list of recommendations produced by one of several prediction generation algorithms (e.g., word vectors [1,2], decision trees [3], (naïve) Bayes classifiers [2,3], k-nearest neighbors [4], support vector machines [5,6], etc.) built upon a specific user model (e.g., collaborative, content-based, or hybrid) [7]. The first application of a recommender algorithm was recorded in the 1980s, when Salton [8] published an article about a word-vector-based algorithm for text document search. The algorithm was later extended to a wider range of content, with applications ranging from document search [7,9,10,11] to e-mail filtering [12] and personalized multimedia item retrieval [3,4,13,14,15]. Nowadays, RSs are used in on-line shops such as Amazon to recommend additional articles to the user, in video streaming services such as Netflix to help users find something interesting to view, in advertising to limit the number of displayed advertisements to those that meet the interests of the target audience, and even in home appliances. The field of RSs is undergoing a new evolution in which researchers are tackling the topics of recommendation diversity [16], contextualization [17], and general optimization/automation of the recommendation process. It is important to note that these aspects often counteract each other; for example, increased diversity often leads to lower accuracy and vice versa [18].
Recommender systems include a large number of parameters that can (and should) be optimized, such as the number of nearest neighbors, the number of items to recommend, the number of latent features, and which context fields are used, to name just a few. We are therefore dealing with a multidimensional optimization problem which is also time dependent, as stated in [19]. The optimization of these parameters requires a lot of manual work by a system administrator/developer and often cannot be performed in real time, as it requires the system to go off-line until the new values are determined. In this article, we focus on the matrix factorization (MF) algorithm [20], which is currently one of the most widespread collaborative RS algorithms and is implemented in several software packages and server frameworks. Despite this, the most commonly used approach to selecting the best values for the algorithm's parameters is still trial and error (see, for example, the Surprise [21] and Scikit-Learn [22] documentation, as well as articles such as [23,24]). In addition, the MF approach is highly sensitive to the learning rate, whose initial choice and adaptation strategy are crucial, as stated in [25].
Evolutionary computing is emerging as one of the automatic optimization approaches in recommender systems [26], and there have been several attempts to use genetic algorithms on the matrix factorization (MF) algorithm [20,27,28,29]. Balcar [30] used the multiple island method to find a better way of calculating the latent factors using the stochastic gradient descent method. Navgaran et al. [28] had a similar idea and used a genetic algorithm to directly calculate the latent factor matrices for users and items, which worked but ran into scalability issues on larger datasets. Razaei et al. [31], on the other hand, focused on the initialization of optimal parameters and used a combination of the multiple island method and a genetic algorithm to achieve this. Lara-Cabrera et al. [32] went a step further and used Genetic Programming to evolve new latent factor calculation strategies.
With the exception of [32], the presented approaches focus on algorithm initialization or on the direct calculation of factor matrices (which introduces large chromosomes even for relatively small datasets). They do not, however, optimize the latent feature calculation procedure itself.
In this article, we present a novel approach that uses grammatical evolution (GE) [33] to automatically optimize the MF algorithm. Our aim is not only to use GE for optimization of the MF algorithm but also to do so in a black-box manner that is completely autonomous and does not rely on any domain knowledge. The reasoning behind this approach is that we want to create a tool for an average user who wants to use the MF algorithm but lacks any domain/specialized knowledge of the algorithm's settings. Our approach would therefore enable such a user to activate and correctly use modules featured in their server framework (such as Apache). Among the existing evolutionary approaches, we selected GE because we want to develop new equations, i.e., latent factor calculation methods. Alternatives such as genetic algorithms or particle swarm optimization focus on parameter optimization and cannot easily be used for the task of evolving new expressions. Genetic Programming [32] would also be an option but is more restrictive in terms of its grammar rules. GE, on the other hand, allows the creation of specialized functions and the incorporation of expert knowledge.
In the first experiment, we used GE for automatic optimization of the parameter values of the MF algorithm to achieve the best possible performance. We then expanded our experiments to use GE for modification of the latent feature update equations of the MF algorithm to further optimize its performance. This is a classical problem of meta-optimization, where we tune the parameters of an optimization algorithm using another optimization algorithm. We evaluated our approach using four diverse datasets (CoMoDa, MovieLens, Jester, and Book Crossing) featuring different sizes, saturation levels, and content types.
In Section 2, we outline the original MF algorithm and the n-fold cross-validation procedure that we used to evaluate the algorithm's performance in our experiments. Section 3 summarizes some basic concepts behind GE, with a focus on the specific settings that we implemented in our work. In Section 4, we describe the datasets used and the hardware on which the experiments were run. Finally, in Section 5, we present and discuss the evolved equations and the numerical results obtained by using those equations in the MF algorithm.
2. Matrix Factorization Algorithm
Matrix factorization (MF), as presented in [19], is a collaborative filtering approach that builds a vector of latent factors for each user and item to describe its character. The higher the coherence between the user and item factor vectors, the more likely the item will be recommended to the user. One of the benefits of this approach is that it can be fairly easily modified to include additional options such as contextualization [17], diversification, or any additional criteria.
The algorithm, however, still has some weak points. Among them, we would like to mention the so-called cold start problem (i.e., adding new and unrated content items to the system or adding users who have not yet rated a single item), the dimensionality problem (the algorithm requires several passes over the whole dataset, which can grow quite large during the lifespan of the service), and the optimization problem (the algorithm contains several parameters which must be tuned by hand before running the service).
The MF algorithm used in this research is the original singular value decomposition (SVD) model, which is one of the state-of-the-art methods in collaborative filtering [20,21,25,34]. The MF algorithm itself is based on an approach similar to principal component analysis because it decomposes the user-by-item sparse matrix into static biases and latent factors of each existing user u and item i. These factors are then used to calculate the missing values in the original matrix which, in turn, are used as predicted ratings.
2.1. An Overview of a Basic Matrix Factorization Approach
In a matrix factorization model, users and items are represented as vectors of latent features in a joint f-dimensional latent factor space. Specifically, item i is represented by vector $q_i \in \mathbb{R}^f$ and user u is represented by vector $p_u \in \mathbb{R}^f$. Individual elements of these vectors express either how much of a specific factor the item possesses or how interested the user is in a specific factor. Although there is no clear explanation of these features (i.e., one cannot directly interpret them as genres, actors, or other metadata), it has been shown that these vectors can be used in MF to predict the user's interest in items they have not yet rated. This is achieved by calculating the dot product of the selected user's feature vector $p_u$ and the feature vector of the potentially interesting item $q_i$, as shown in (1):

$\hat{r}_{ui} = q_i^T p_u$   (1)

The result of this calculation is the predicted rating $\hat{r}_{ui}$, which serves as a measure of whether or not this item should be presented to the user.
The most intriguing challenge of the MF algorithm is the calculation of the factor vectors $q_i$ and $p_u$, which is usually accomplished by using a regularized model to avoid overfitting [20]. A system learns the factor vectors through the minimization of the regularized square error on the training set of known ratings:

$\min_{q_*, p_*} \sum_{(u,i) \in K} (r_{ui} - q_i^T p_u)^2 + \lambda \left( \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)$   (2)

with $K$ representing the training set (i.e., the set of user/item pairs for which rating $r_{ui}$ is known). Because the system uses the calculated latent factor values to predict future, unknown ratings, it must avoid overfitting the observed data. This is accomplished by regularizing the calculated factors using the constant $\lambda$, whose value is usually determined via cross-validation.
2.2. Biases
In reality, Equation (1) does not completely explain the observed variations in rating values. A great deal of these variations is contributed by effects independent of any user/item interaction. These contributions are called biases because they model inclinations of some users to give better/worse ratings than others or tendencies of some items to receive higher/lower ratings than others. Put differently, biases measure how much a certain user or item deviates from the average. Apart from the user ($b_u$) and item ($b_i$) biases, the overall average rating ($\mu$) is also considered as part of a rating:

$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$   (3)

With the addition of biases and the average rating, the regularized square error (2) expands into

$\min_{q_*, p_*, b_*} \sum_{(u,i) \in K} (r_{ui} - \mu - b_u - b_i - q_i^T p_u)^2 + \lambda \left( \lVert q_i \rVert^2 + \lVert p_u \rVert^2 + b_u^2 + b_i^2 \right)$   (4)
2.3. The Algorithm
The system calculates the latent factor vectors by minimizing Equation (4). For our research, we used the stochastic gradient descent algorithm, as it is one of the most often used approaches for this task. The parameter values used for the baseline evaluation were the same as presented in [17], and we summarize them in Table 1. The initial value assigned to all the latent factors was 0.03 for the CoMoDa dataset [35] and random values between 0.01 and 0.09 for the others.
The minimization procedure is depicted in Algorithm 1 and can be summarized as follows. The algorithm begins by initializing the latent factor vectors $p_u$ and $q_i$ with default values (0.03) for all users and items. These values, together with the constant biases and the overall average rating, are then used to calculate the prediction error for each observed rating in the dataset:

$e_{ui} = r_{ui} - \hat{r}_{ui} = r_{ui} - (\mu + b_u + b_i + q_i^T p_u)$   (5)
The crucial part of this procedure is the computation of new user/item latent factor values. We compute a new kth latent factor value $p_{uk}$ of user u and a new kth latent factor value $q_{ik}$ of item i as follows:

$p_{uk} = p_{uk} + \eta_u (e_{ui} q_{ik} - \lambda p_{uk})$   (6)

$q_{ik} = q_{ik} + \eta_i (e_{ui} p_{uk} - \lambda q_{ik})$   (7)

where $\eta_u$ and $\eta_i$ determine the learning rates for users and items, respectively, and $\lambda$ controls the regularization.
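To make the connection explicit, update rule (6) can be read as a single stochastic gradient step on the per-rating term of (4), with the constant factor 2 absorbed into the learning rate (a standard convention):

$\frac{\partial}{\partial p_{uk}} \Big[ (r_{ui} - \hat{r}_{ui})^2 + \lambda p_{uk}^2 \Big] = -2 e_{ui} q_{ik} + 2 \lambda p_{uk}$

Stepping in the direction opposite to this gradient with step size $\eta_u$ yields $p_{uk} \leftarrow p_{uk} + \eta_u (e_{ui} q_{ik} - \lambda p_{uk})$, which is exactly Equation (6); Equation (7) follows symmetrically for $q_{ik}$.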
The computation of Equations (5)–(7) is carried out for all observed ratings and is repeated until a certain stopping criterion is met, as outlined in Algorithm 1.
Algorithm 1 Stochastic gradient descent
1: Initialize user and item latent factor vectors $p_u$ and $q_i$
2: Calculate constant biases $b_u$ and $b_i$ and the overall average rating $\mu$
3: for each iteration do
4:   for each observed rating $r_{ui}$ do
5:     Compute the prediction error (5)
6:     for each active latent factor k do
7:       Compute new factor values (6) and (7)
8:     end for
9:   end for
10: end for
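As an illustration of Algorithm 1, the following minimal Python sketch implements the procedure under simplifying assumptions: all function and parameter names are ours, the default parameter values are placeholders rather than the hand-tuned settings from Table 1, and one latent factor is activated every 100 iterations, as in our baseline.

import random
from collections import defaultdict

def train_mf(ratings, f=10, iters=1000,
             eta_u=0.01, eta_i=0.01, lam=0.02, seed=42):
    # ratings: list of (user, item, rating) triples.
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    # Constant biases: mean deviation from mu per user and per item.
    dev_u, dev_i = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        dev_u[u].append(r - mu)
        dev_i[i].append(r - mu)
    b_u = {u: sum(d) / len(d) for u, d in dev_u.items()}
    b_i = {i: sum(d) / len(d) for i, d in dev_i.items()}
    # Latent factor vectors, initialized to small random values.
    p = {u: [rng.uniform(0.01, 0.09) for _ in range(f)] for u in b_u}
    q = {i: [rng.uniform(0.01, 0.09) for _ in range(f)] for i in b_i}
    for it in range(iters):
        active = min(f, it // 100 + 1)  # add one latent factor every 100 iterations
        for u, i, r in ratings:
            pred = mu + b_u[u] + b_i[i] + sum(
                p[u][k] * q[i][k] for k in range(active))
            e = r - pred                                          # Equation (5)
            for k in range(active):
                p[u][k] += eta_u * (e * q[i][k] - lam * p[u][k])  # Equation (6)
                q[i][k] += eta_i * (e * p[u][k] - lam * q[i][k])  # Equation (7)
    return mu, b_u, b_i, p, q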
2.4. Optimization Task
As already mentioned, the original MF algorithm still suffers from a few weaknesses. In our research, we focused on the problem of algorithm initialization and optimization during its lifetime. As seen in (6) and (7), several parameter values must be optimized for the algorithm to work properly: the learning rates $\eta_u$ and $\eta_i$, the regularization factor $\lambda$, and the initial latent feature values used during the first iteration. Apart from that, the optimal values of these parameters change as the dataset matures through time and the users provide more and more ratings.
Despite the widespread use of the MF algorithm and its implementation in several frameworks, an efficient methodology to automatically determine the values of the algorithm's parameters is still missing. During our initial work with the MF algorithm [17], we spent a lot of time comparing RMSE plots and CSV values in order to determine the initial parameter values. The listed articles [21,22,23,24] show that this is still the usual practice with this algorithm.
The main goal of our research was to use GE to achieve the same (or possibly better) RMSE performance as the original (hand-tuned) MF algorithm without manually setting any of its parameters. This way, we would be able to build an automatic recommender service yielding the best algorithm performance without any need for human intervention.
2.5. Evaluation of Algorithm’s Performance
The selection of the evaluation metric for MF depends on whether we look at the algorithm as a regression model or as a pure RS problem. With a regression model, we focus on fitting all the values and do not discern between "good" and "bad" values (i.e., ratings above and below a certain value). An RS model, on the other hand, is focused on classification accuracy, which is interpreted as the number of correctly recommended items. Such an item has both ratings (predicted and true) above the selected threshold value, for example, above 4 in a system with ratings on a scale from 1 to 5. Such a metric is therefore more forgiving because it does not care how far the predicted rating is from the actual rating as long as both of them end up in the same category (above/below the threshold value). Typical examples of regression metrics include mean-squared error, root-mean-squared error (RMSE), and R-squared, while classification metrics include precision, recall, F-measure, the ROC curve, and intra-list diversity.
Because our focus was on optimizing the MF algorithm, we chose the regression aspect and wanted to match all the existing ratings as closely as possible. We used the RMSE measure in combination with n-fold cross-validation, as this combination is most often used in the RS research community [36,37,38,39,40,41], providing us with a lot of benchmark values with which to compare.
In addition, we also verified our findings using the Wilcoxon rank-sum test. All of our experiments resulted in a p-value lower than the selected significance level, which confirms that our results are statistically significant. A summary of our statistical testing is given in Table 2.
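For readers who wish to reproduce this kind of comparison, the following minimal example shows how such a test can be run with SciPy's two-sided Wilcoxon rank-sum test; the arrays here are made-up placeholders, not our measured RMSE values.

from scipy.stats import ranksums

# Hypothetical per-run RMSE samples (placeholders, not the paper's data).
baseline_rmse = [1.02, 1.04, 1.01, 1.03, 1.05, 1.02, 1.04, 1.03, 1.02, 1.04]
evolved_rmse = [0.97, 0.98, 0.96, 0.99, 0.97, 0.98, 0.96, 0.97, 0.99, 0.98]

stat, p = ranksums(baseline_rmse, evolved_rmse)  # two-sided by default
if p < 0.05:  # significance level alpha = 0.05 assumed for illustration
    print(f"significant difference (p = {p:.4f})")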
2.5.1. RMSE
RMSE is one of the established measures for regression models and is calculated using the following equation:

$\mathrm{RMSE} = \sqrt{\frac{1}{\lvert K \rvert} \sum_{(u,i) \in K} (r_{ui} - \hat{r}_{ui})^2}$   (8)

where $r_{ui}$ is the actual rating of user u for item i and $\hat{r}_{ui}$ is the system's predicted rating. $\lvert K \rvert$ is the cardinality of the training set (i.e., the number of known user ratings).
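A direct transcription of Equation (8) into Python might look as follows (names are illustrative):

import math

def rmse(ratings, predict):
    # ratings -- iterable of (user, item, true_rating) triples (the set K)
    # predict -- callable predict(user, item) -> predicted rating r_hat_ui
    se = [(r - predict(u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(se) / len(se))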
The RMSE values from experiments of other research groups on the datasets used in this article are summarized in Table 3.
2.5.2. Overfitting and Cross-Validation
Overfitting is one of the greatest nemeses of RS algorithms because an overfitted algorithm fits the existing data perfectly instead of predicting missing (future) data, which is what RSs are meant for. All algorithms must therefore either include a special safety mechanism or be trained using additional techniques such as cross-validation.
The original MF algorithm used in this research uses a safety mechanism in the form of a regularization parameter whose value had to be set manually. The value of the parameter was set in a way that achieved the best performance on the test set and thus reduced the amount of overfitting. It is important to note that the optimal value of the regularization parameter changes depending on the dataset as well as with time (i.e., with additional ratings added to the system).
Regularization alone is, however, not enough to avoid overfitting, especially when its value is subject to automated optimization, which could reduce it to zero and thus produce maximum fitting to the existing data. In order to avoid overfitting, we used cross-validation in our experiments. n-fold cross-validation is one of the gold standard tests for RS assessment. The test aims to reduce the chance of overfitting the training data by going over the dataset multiple times while rearranging the data before each pass. The dataset is split into n equal parts. During each pass over the dataset, one of the parts is used as the test set for evaluation, while the remaining parts represent the training set that is used for training (calculating) the system's parameters. The final quality of the system is calculated as the average RMSE over the n tests thus performed.
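The following sketch illustrates this procedure, reusing the rmse helper from the previous sketch; the train callable, which must fit the MF model on a training set and return a prediction function, is an assumed placeholder.

import random

def cross_validated_rmse(ratings, train, n=10, seed=42):
    # ratings -- list of (user, item, rating) triples
    # train   -- callable train(train_set) -> predict(user, item)
    # Returns the average RMSE over the n test folds.
    data = ratings[:]
    random.Random(seed).shuffle(data)           # rearrange before splitting
    folds = [data[k::n] for k in range(n)]      # n (nearly) equal parts
    scores = []
    for k in range(n):
        test = folds[k]
        train_set = [t for j, f in enumerate(folds) if j != k for t in f]
        predict = train(train_set)
        scores.append(rmse(test, predict))
    return sum(scores) / n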
3. Grammatical Evolution
The first and most important step when using GE is defining a suitable grammar using the Backus–Naur form (BNF). This also defines the search space and the complexity of our task. Once the grammar is selected, the second step is to choose the evolution parameters, such as the population size and the type and probability of crossover and mutation.
3.1. The Grammar
Because we want to control the initialization process and avoid forming invalid individuals due to infinite mapping of the chromosome, we need three different sections of our grammar: recursive, non-recursive, and normal [46]. The recursive grammar includes rules that never terminate a branch; instead, they result in direct or indirect recursion. The non-recursive grammar, on the other hand, never leads back to the same derivation rule and thus avoids recursion. We then control the depth of a derivation tree by alternating between the non-recursive and recursive grammars: using the recursive grammar results in tree growth, while switching to the non-recursive one stops this growth. The last of the grammars, the normal one, is the combination of the recursive and non-recursive grammars and can therefore produce trees of varying depth.
The following derivation rules constitute our non-recursive grammar:
<expr> ::= <const> | <var>
<const> ::= <sign><n>.<n><n>
<sign> ::= + | −
<n> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<var> ::= p_uk | q_ik | e_ui
This is the recursive grammar:
<expr> ::= <expr> <binOper> <expr> | <unOper> <expr>
<binOper> ::= + | − | ∗ | /
<unOper> ::= log() |
The normal grammar is the combination of the above two grammars where both derivation rules for an expression are merged into a single one:
<expr> ::= <expr> <binOper> <expr> | <unOper> <expr> | <const> | <var>
3.2. Initialization
In GE, the genotype consists of a sequence of codons, which are represented by eight-bit integers. We used this codon sequence to synthesize the derivation tree (i.e., the phenotype) based on the selected start symbol. Each codon was then used to select the next derivation rule from the grammar. The start symbol determined the first rule, while the remaining rules were selected based on whatever symbols appeared in the derivation tree until all the branches ended in a terminal rule. Because the range of possible codon values is usually larger than the number of available rules for the current symbol, we used the modulo operator to map the codon value to the range of the number of rules for that symbol. This also means that we seldom used the complete codon sequence in the genotype: either we produced a complete tree before using all the codons, or we ran out of codons and used wrapping to complete the tree (i.e., we simply carried on with the mapping using the same string of codons again).
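The following Python sketch illustrates the mapping just described on a compact encoding of the normal grammar from Section 3.1. The terminals of <unOper> and <var> are partly garbled in the source, so the names used here (sqrt, p_uk, q_ik, e_ui) are our assumptions; note that a codon is consumed only when a non-terminal offers more than one production, which matches the worked example in Section 3.3.

GRAMMAR = {
    "<expr>": [["<expr>", "<binOper>", "<expr>"],
               ["<unOper>", "<expr>"],
               ["<const>"],
               ["<var>"]],
    "<binOper>": [["+"], ["-"], ["*"], ["/"]],
    "<unOper>": [["log"], ["sqrt"]],          # second operator assumed
    "<const>": [["<sign>", "<n>", ".", "<n>", "<n>"]],
    "<sign>": [["+"], ["-"]],
    "<n>": [[str(d)] for d in range(10)],
    "<var>": [["p_uk"], ["q_ik"], ["e_ui"]],  # variable terminals assumed
}

def map_genotype(codons, start="<expr>", max_wraps=3):
    # Depth-first genotype-to-phenotype mapping: rule = codon mod #rules.
    symbols, out, used = [start], [], 0
    while symbols:
        if used >= len(codons) * (max_wraps + 1):
            raise ValueError("mapping did not terminate")
        sym = symbols.pop(0)
        if sym not in GRAMMAR:           # terminal symbol: emit it
            out.append(sym)
            continue
        rules = GRAMMAR[sym]
        if len(rules) == 1:              # only one production: no codon used
            rule = rules[0]
        else:
            rule = rules[codons[used % len(codons)] % len(rules)]  # wrapping
            used += 1
        symbols = rule + symbols         # expand the leftmost non-terminal
    return " ".join(out)

# map_genotype([14, 126, 200, 20, 75]) -> "+ 0 . 0 5"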
In order to overcome the problem of starting genotype strings failing to map to finite phenotypes [33], we used the sensible initialization technique proposed in [46]. This technique enables a monitored selection of rules from either the non-recursive or recursive grammar. The method mimics the original ramped half-and-half method described in [47], which results in a population in which half of the trees have all leaves at the same depth, while the other half have leaves at arbitrary depths no deeper than the specified maximum depth.
Such simple procedures often result in improved performance because they allow us to avoid the pitfall of having all individuals start close to a localized extreme. In such a case, the algorithm could become stuck there or spend a lot of time (generations) finding the true optimal value. In our previous experiments [42,48,49], using the sensible initialization proved to be sufficient. Should this fail, however, one could also consider additional techniques such as the Cluster-Based Population Initialization proposed by Poikolainen et al. [50].
3.3. Generating Individuals from Chromosomes—An Example
Each of the individuals begins as a single chromosome from which an update equation is derived and used in our GE enhanced MF algorithm (see Algorithm 2). In most of our experiments, the chromosome produces an equation that is used to calculate the values of user/item latent factors. To help understand our approach, let us look at an example and assume that we have the chromosome {14, 126, 200, 20, 75, 12, 215, 178, 48, 88, 78, 240, 137, 160, 190, 98, 247, 11} and use the start symbol from
Section 5.2 (in fact, we will derive the two equations from Equation (
9)). The first symbol in the start symbol is
<expr> and, using our grammar from
Section 3.1, we perform the following modulo operation:
(rule number) =
(codon integer value) mod
(number of rules for the current non-terminal)
As our start symbol has four possible rules (we use the normal grammar), the codon value of 15 selects the third production rule (i.e., 14 mod
), hence choosing the constant (
<const>). The next four codons are then used to derive the constant value which, according to the grammar, consists of a sign and three digits. The second equation (i.e., the right equation in Equation (
9)) is derived from the remaining codons. The whole mapping process is summarized in
Table 4. The first two columns list the codons and the symbols that are being used for the derivation, while the second two columns give the results of the modulo operations and selected terminals/non-terminals using the corresponding production rule on non-terminals from the second row.
The result of this procedure is Equation (9), which was then used as part of our further experiments (see Section 5.3 for further details).
3.4. Optimizing MF—Evaluating Individuals
For each of the experiments, the just described derivation procedure is used to compose code representing Equations (
6) and (
7) within Algorithm 2. Thus, the obtained algorithm represents an individual in the population whose fitness we evaluated according to standard 10-fold cross-validation: First, the algorithm was run on a train set to obtain user and item latent factor values. Second, RMSE is computed using those factor values on a test set. The procedure was repeated for all ten folds and the final fitness of an individual is the average of the obtained ten RMSEs.
Algorithm 2 GE enhanced stochastic gradient descent
1: Initialize user and item latent factor vectors $p_u$ and $q_i$
2: Calculate constant biases $b_u$ and $b_i$ and the overall average rating $\mu$
3: for each iteration do
4:   for each observed rating $r_{ui}$ do
5:     Compute the prediction error (5)
6:     for each active latent factor k do
7:       Compute new factor values using functions from the individual's chromosome
8:     end for
9:   end for
10: end for
In cases where the individual’s chromosome would result in an unfeasible solution (e.g., division by zero or too large latent factor values), we immediately mark the individual for deletion by setting its fitness function to an extremely large value even before the evaluation of the individual starts. With this approach, we quickly trimmed the population of unwanted individuals and start searching within the feasible solution set after the first few generations.
3.5. Crossover and Mutation
The original one-point crossover which was proposed by O’Neil et al. [
33] has a destructive effect on the information contained by the parents. This occurs because changing the position in the genotype string results in a completely different phenotype.
For our experiments, we decided to use the LHS replacement crossover [51] instead because it does not ruin the information contained in the parents' phenotypes. This is a two-point crossover where only the first crossover point is randomly selected. The crossover point in the second parent is then limited so that the selected codon expands the same type of non-terminal as the selected codon of the first parent. Both crossover points therefore feature codons that expand the same type of expression.
The LHS replacement crossover has many similarities with the crossover proposed in [47] but also has some additional advantages: it is not limited by closure and maintains the advantages of using a BNF grammar.
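The following sketch captures the core idea under one assumption: that the genotype-to-phenotype mapping records, for every expanded non-terminal, the codon span that derives its subtree. All names are illustrative.

import random

def lhs_replacement_crossover(p1, p2, trace1, trace2, rng=random):
    # p1, p2     -- parent genotypes (lists of codons)
    # trace1/2   -- per parent, a list of (non_terminal, start, end) triples
    #               recording which codon span [start, end) derives the
    #               subtree rooted at that non-terminal
    nt, s1, e1 = rng.choice(trace1)              # first point: fully random
    candidates = [(a, b) for m, a, b in trace2 if m == nt]
    if not candidates:
        return p1[:], p2[:]                      # no compatible point found
    s2, e2 = rng.choice(candidates)              # second point: same non-terminal
    child1 = p1[:s1] + p2[s2:e2] + p1[e1:]       # swap the two codon spans
    child2 = p2[:s2] + p1[s1:e1] + p2[e2:]
    return child1, child2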
Mutation can have the same destructive effect when not properly controlled; Byrne [52] called this a structural mutation. For our experiments, we created a mutation operator that works in a manner similar to the LHS replacement crossover: it mutates a randomly selected codon and reinterprets the remaining codons.
As proposed in [47], we set the probabilities of crossover and mutation to relatively small values. This approach is easy to implement and reduces the number of situations where only a single terminal is mutated or two terminals are exchanged by a crossover operation. Although there is not much evidence for or against this practice, it has been demonstrated that changing node selection to favor larger subtrees can noticeably improve GE performance on standard benchmark problems [53].
3.6. Evolution Settings
Table 5 shows the parameters used in our experiments. In order to control bloat, we limited the maximal tree depth [54] and the node count of individuals by penalizing those that overstepped either of the two limits; the penalty consisted of raising the fitness values of such individuals to infinity.
Each following generation was created in three steps. First, we created duplicates of a randomly selected 10% of individuals from the current population and mutated them. We then selected 20% of the current individuals to act as parents in crossover operations. This selection was performed using tournament selection [55] with a tournament size of two. Tournament selection chooses fitter individuals with higher probability but disregards how much better they are, which flattens the influence of raw fitness values and creates constant selection pressure on the population. We determined the two above percentages based on previous experience and some preliminary experiments. In the last step, we substituted the worst 30% of individuals in the population with the offspring obtained in the previous two steps (i.e., the mutated copies (10%) and the crossover offspring (20%)), as shown in the sketch below.
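A minimal sketch of this generational scheme follows, with mutate and crossover assumed to be callables such as the operators sketched earlier (crossover returning two children); lower fitness (RMSE) is better.

import random

def tournament(fitness, k=2, rng=random):
    # Return the index of the winner of a size-k tournament.
    contestants = rng.sample(range(len(fitness)), k)
    return min(contestants, key=lambda idx: fitness[idx])

def next_generation(pop, fitness, mutate, crossover, rng=random):
    n = len(pop)
    # 1) mutated duplicates of a randomly selected 10% of the population
    mutants = [mutate(list(pop[i])) for i in rng.sample(range(n), n // 10)]
    # 2) crossover offspring bred from 20% of individuals chosen by tournaments
    offspring = []
    for _ in range(n // 10):            # each crossover yields two children
        a, b = tournament(fitness), tournament(fitness)
        offspring.extend(crossover(pop[a], pop[b]))
    # 3) replace the worst 30% with the new individuals from steps 1 and 2
    ranked = sorted(range(n), key=lambda i: fitness[i])   # best first
    survivors = [pop[i] for i in ranked[:n - len(mutants) - len(offspring)]]
    return survivors + mutants + offspring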
Using a maximum depth of 12 in our generated trees and assuming that a number is a terminal, we can estimate the size of the search space to be on the order of $10^{18}$ (4 expressions at depth level one, 361 at depth level two, 4272 at depth level three, and so on). With our number of generations (150), population size (50), and elite size (10%), we only need to evaluate 6755 individuals to find our solution (the initial 50 individuals plus 45 new individuals in each of the remaining 149 generations: 50 + 149 × 45 = 6755). Considering the total search space size, this is a very low number of evaluations to perform. We can also assume that methods such as exhaustive search would have serious issues in such a large space.
5. Results
In this section, we present the results of our work using the four datasets listed in Table 3. The CoMoDa dataset is covered in Section 5.1 and Section 5.2. On this dataset, we first optimized only the learning rates and regularization factors (as real constants) from the latent factor update equations (Equations (6) and (7)). After that, in Section 5.2, we also evolved the complete latent factor update equations. Armed with these preliminary results, and with the convergence results presented in Section 5.3, we then moved to the three much larger datasets (MovieLens, Book-Crossing, and Jester) in the last three subsections of this section. As the reader will learn, one of the most important findings of our work shows how GE detects when the static biases prevail over the information stored in the latent factors, a phenomenon commonly observed in large datasets.
5.1. Automatic MF Algorithm Initialization
In the first part of the experiment, we only optimized the real parameters used in the original MF algorithm outlined in Algorithm 2. Specifically, we optimized the parameters $\eta_u$ and $\eta_i$ representing the user and item learning rates and the regularization parameter $\lambda$. By doing this, we used GE as a tool that can automatically select the values that work best without requiring any input from the operator, thus avoiding problems arising from the selection of wrong values at the start (as stated in [25]). It can be argued that this approach does not use the full power of GE, as it is "demoted" to the level of a genetic algorithm (i.e., it only searches for the best possible set of constants in a given equation). Although this is partially true, we wanted to determine the best baseline value with which to compete in the following experiments, and our GE method was flexible enough to be used in such a way. Otherwise, we would have had to use some other genetic algorithm, which would most likely produce the same results but would require additional complications in our existing framework.
In order to constrain GE to act only on the parameter values of the algorithm, we used a start symbol that fixes the form of Equations (6) and (7) and exposes only the constants to the evolution:

p_uk = p_uk + <const> · (e_ui · q_ik − <const> · p_uk)
q_ik = q_ik + <const> · (e_ui · p_uk − <const> · q_ik)

Because we only wished to optimize certain parameters of the algorithm, 50 generations were enough for the evolution of an acceptable solution.
Table 7 shows the results of 10 evolution runs, ordered by the RMSE score of the best obtained programs. Note that there is a slight change from the original equation, because the evolution produced two different values for the regularization factor $\lambda$, denoted by $\lambda_u$ and $\lambda_i$ for the user and item regularization factors, respectively.
As shown, all our programs exhibit a better RMSE than the baseline algorithm using the default settings from Table 1. The average RMSE of our ten genetically evolved programs and its standard deviation confirm this improvement, and the two-tailed p-value obtained by the Wilcoxon rank-sum test confirms that our results are significantly better than the result obtained by the baseline algorithm.
A final remark can be made by comparing the RMSE values as the number of iterations increases, as seen in Figure 1. We can see that the original baseline first drops rapidly but jumps back to a higher value after a maximum dip around the 100th iteration. The optimized baseline, on the other hand, shows a constant reduction in the RMSE value without any jumps (except after every 100 iterations, when an additional latent feature is introduced). This shows that even though we believed we had found the best possible settings in our previous work [17], we had set the learning rate too high, which prevented us from finding the current minimum possible RMSE value. The new algorithm, on the other hand, detected this issue and adjusted the values accordingly.
Although a 0.41% decrease in RMSE does not seem to be a very significant improvement, we should note that the GE enhanced MF algorithm achieved this level of performance without any manual tuning beyond the initial grammar settings. Compared to the amount of time we had to spend during our previous work for manual determination of the appropriate parameter values, this represents a significant reduction in the time which a practitioner has to spend tinkering with settings to obtain the first results. One should also note that an evaluation of 150 generations takes less time than setting parameter ranges manually.
5.2. Evolving New Latent Factors Update Equations
After the successful application of GE to a simple real parameter optimization, we now wanted to evolve a complete replacement for Equations (6) and (7). For that purpose, we used the following start symbol for the grammar:

p_uk = <expr>
q_ik = <expr>

Using this symbol, we can either evolve a complex equation (as seen in most of the item latent factor equations in Table 8) or a simple constant value. Note that during the evolution, we still used the values from Table 1 for the evaluation of each single program in the population.
This time, we let the evolution run for 150 generations, again using the CoMoDa dataset. Table 8 shows the ten best resulting equations for user and item latent factor calculation, together with their RMSE scores. Note that the actual equations produced by the evolution were quite hieroglyphic; the versions shown are mathematically equivalent equations simplified by hand. All 10 produced equations performed not only better than the hand-optimized baseline algorithm but also better than the GE optimized baseline version from the first part of the experiment. Compared to the average RMSE value that we were able to achieve using only parameter optimization, the average RMSE value obtained now is lower, although with a greater standard deviation. It was to be expected that the dispersion would now be greater, as GE was given more freedom in how to build the latent factor update expressions. The obtained p-value signifies that the results are significantly better than those of the optimized baseline from the first part of the experiment. What is even more important, this time we managed to obtain a program (i.e., the best program in Table 8) whose performance is more than 10% better than that of the baseline algorithm.
It is quite interesting to compare the evolved equations from Table 8 with those of the original algorithm (i.e., Equations (6) and (7)). We noticed that the evolved equations fixed the latent factor values of a user to constants. There seems to be one exception (i.e., the fifth row of the table), but as that equation simply copies a factor value to the next iteration, the value retains its initial value and can therefore be considered constant as well. The right column of Table 8 contains the equations for calculating the latent factor values of an item. After a closer inspection, we can observe that they are all very similar to Equation (7), only with quite a large learning rate $\eta_i$ and a very small or even zero regularization constant $\lambda$; this holds, for example, for the equations in both the first and the last rows of the table. In more than half of the equations, there is an additional constant factor whose interpretation is somewhat ambiguous; it could be a kind of combination of a static bias and regularization. In summary, it is clear that GE diminished or sometimes even removed the regularization factor and greatly increased the learning rate in order to achieve the best possible result. Apart from that, it assigned the same constant value to all user latent factors. This could in turn signify that we are starting to experience the so-called bias-variance trade-off [59], where static biases take over a major part of the contribution to variations in rating values.
5.3. Convergence Analysis
So far, we have succeeded in evolving MF update equations that produce significantly better RMSE values than the original hand-optimized algorithm. During the second part of our experiment, we observed a notable increase in the learning rate, which made us believe that we do not actually need 100 iterations to reach the final values of the user and item factors. We therefore generated a plot of the RMSE value as a function of the iteration number to observe how the algorithm converges toward the final values.
The blue line in Figure 1 shows how the RMSE value changed during a run of the baseline MF algorithm using the original parameter values from Table 1, and the orange line shows the algorithm's convergence using the optimized parameter values from the first row of Table 7. The noticeable jumps in the curves that happen every 100 iterations are a consequence of adding an additional latent factor into the calculation, as prescribed by the baseline algorithm.
We can observe that the unoptimized algorithm results in quite curious behavior. The smallest RMSE value is reached after only 65 iterations, but then it rapidly increases to a value even greater than the initial guess. The RMSE value stays larger than the initial one even after adding additional factors and letting the MF algorithm run for 100 iterations for each of the added factors. Conversely, using the GE optimized version of the MF algorithm, we obtain a curve whose RMSE score falls steadily toward the final, much better RMSE score.
Figure 2 shows how the RMSE value converges when the equations from the last and fourth rows of Table 8 (Equations (9) and (10), respectively) are used as the update equations. The most obvious difference from Figure 1 is a much more rapid convergence, which was to be expected as the learning rates are several orders of magnitude larger (this can be seen by rewriting Equations (10) and (9) back into the forms of (6) and (7), respectively). It seems that a learning rate of 40 combined with a nonzero regularization factor (Equation (10)) yields quite appropriate values, as seen in Figure 2. However, the figure also shows how a learning rate of 140 (Equation (9)) already causes overshoots. At the same time, it seems that the absence of regularization produces overfitting, which manifests itself as an increase in the RMSE values when the last two latent factors are added.
Either way, the algorithm converged in just a few iterations each time a new latent factor was added, indicating that far fewer than 100 iterations are actually needed. Thus, we reran all of the evolved programs on the same dataset, this time using only 20 iterations for each of the latent factors. We obtained exactly the same RMSE scores as we did with 100 iterations, which means that the calculation is now at least five times faster. This speedup is very important, especially when dealing with the huge datasets that usually reside behind recommender systems. The results obtained by (9) and (10) using only 20 iterations are shown in Figure 3.
5.4. MovieLens Dataset
In the next experiment, we wanted to evolve the update equations using a much larger dataset. We selected the MovieLens 100k dataset for this task, while using the same start symbol as in Section 5.2.
Because we switched to a different dataset, we first had to rerun our original baseline algorithm in order to obtain a new baseline RMSE. We then ran over 20 evolutions, each of 150 generations. Table 9 shows the five best (and unique, as some RMSE values were repeated over several runs) evolved equations and their corresponding RMSE values. The average RMSE value of the evolved programs again improved on the baseline, and we used the Wilcoxon rank-sum test to confirm that our numbers significantly differ from the baseline value.
The most striking difference from the previous experiment is the fact that now all but one of the latent factors are in fact constants. Note that the third item latent factor in Table 9 is effectively a constant as well, because the expression it depends on is itself a constant. This is a sign that the bias-variance trade-off described in [59] has occurred. Because the MovieLens dataset contains many more ratings than the CoMoDa dataset, the information contained in the static biases becomes more prominent. This in turn means that there is almost no reason to calculate the remaining variance, which was automatically detected by GE. Using fixed latent factors also means that the algorithm can be completed within a single iteration, which is a huge improvement in the case of a very large dataset. More likely, the occurrence of constant factors will be used as a detection of over-saturation: when the equations produced by GE become constants, the operator knows that they must either introduce a time-dependent calculation of biases [19] or otherwise modify the algorithm (e.g., introduce new biases [17]).
5.5. Book-Crossing Dataset
In this experiment, we used the Book-Crossing dataset, which is larger than the MovieLens dataset by a factor of 10. In addition, it features a different range of ratings (from 1 to 10 instead of 1 to 5) and covers a different type of item, namely books. Again, we first calculated the new baseline RMSE, and we then ran 20 evolutions of 150 generations each. Table 10 shows the five best evolved equations and their corresponding RMSE values. The average RMSE value of the evolved programs again improved on the baseline, and we used the Wilcoxon rank-sum test to confirm that our numbers significantly differ from the baseline value.
Observing the results, we can see a similar pattern as in the previous experiment: one set of latent factors ($p_{uk}$) is set to a static value, while the other changes its value during each iteration according to the current error value $e_{ui}$. Because this dataset bears some similarities to the previous one (MovieLens) in terms of saturation and ratios (ratings per user and per item), this was somewhat expected.
5.6. Jester Dataset
In the last experiment, we used the Jester dataset. Again, we first calculated the new baseline RMSE. One should note that this number differs from those presented in Table 3 due to the fact that both cited articles ([41,44]) used a slightly different version of the dataset (either an expanded version or a filtered version). Although this means that we cannot directly compare our results, we can still see that we are in the same value range.
We ran 20 evolutions of 150 generations each. Table 11 shows the five best evolved equations and their corresponding RMSE values. The average RMSE value of the evolved programs again improved on the baseline, and we used the Wilcoxon rank-sum test to confirm that our numbers significantly differ from the baseline value.
Observing this last set of results, we find some similarities with the previous two experiments: one set of latent factors is again set to a static value. An interesting twist, however, is the fact that in this experiment the static value is assigned to the item latent factors ($q_{ik}$) instead of the user latent factors. Upon reviewing the dataset characteristics, we can see that this further confirms the bias-variance trade-off: in this dataset, the number of ratings per item is a hundred times larger than the number of ratings per user. The items' biases therefore carry a lot more weight and thus reduce the importance of their latent factors. Once again, the GE approach detected this and adjusted the equations accordingly.
5.7. Result Summary
Table 12 shows a condensed version of our results. It can be seen that we managed to match or improve the performance of the MF algorithm on all four datasets. All the results were also tested using the Wilcoxon rank-sum test and, in each case, the p-value was lower than the selected significance level, which confirms that our results were significantly different (better) than those of the baseline approach.
6. Conclusions
We used GE to successfully optimize the latent factors update equations of the MF algorithm on four diverse datasets. The approach works in an autonomous way which requires only the information about the range of ratings (e.g., 1 to 5, −10 to 10, etc.) and no additional domain knowledge. Such an approach is therefore friendly to any non-expert user who wants to use the MF algorithm on their database and does not have the knowledge and resources for a lengthy optimization study.
In the first part of the research, we limited the optimization process to the parameter values only, which already produced statistically significantly better RMSE values under 10-fold cross-validation. After using GE's full potential to produce latent factor update equations, we observed an even greater increase in the algorithm's accuracy. Apart from the even better RMSE values, this modification accelerated the basic MF algorithm by more than a factor of five.
We then switched to three larger datasets that contained different item types and repeated our experiments. The results showed that GE almost exclusively assigned constant values to either user or item latent factors. It is remarkable how GE is able to gradually change, depending on the dataset size, the nature of the update equations from the classical stochastic gradient descent form, through a modified form where the user latent factors are constants, to a form where both latent factors are constants. In this way, GE adapts to the degree of bias-variance trade-off present in a dataset and is able to dynamically warn about over-saturation.
A great potential of using GE to support an MF-based recommender system lies in the fact that GE can be used as a black box to aid or even replace an administrator in the initial as well as on-the-fly adjustments of a recommender system. Apart from that, GE can be used to generate warnings when the system becomes over-saturated and requires an administrator's intervention. We believe that our results make a valuable contribution to the emerging field of employing evolutionary computing techniques in the development of recommender systems [26].
The presented approach therefore offers a nice quality-of-life upgrade to the existing MF algorithm but still has a few kinks that need to be ironed out. Although we are able to consistently optimize the algorithm and fit it to the selected dataset, the approach does require a lot of computational resources and time. This is not necessarily a drawback because we do not require real-time optimization but only need to run the algorithm once in a while to find the optimal settings.
For our future applications, we therefore plan to improve the algorithm's performance by introducing parallel processing and by exporting parts of the code to C using the Cython library. In addition, we will experiment with expanding the scope of our algorithm to optimizing the remaining MF algorithm parameters as well (the number of iterations and the number of latent factors, for example). We believe that this could lead to further improvements and potentially even speed up the MF method itself (by realizing that we need fewer iterations/factors, for example). It would also be interesting to test whether we can apply our algorithm to other versions of the MF algorithm (graph-based, non-negative, etc.) and whether we can apply our strategy to a completely different recommender system as well.