1. Introduction
With the rapid development of the Internet, web search engines have become very popular tools for information retrieval. Because a web search engine can usually find an immense set of Web pages matching a search query, it is necessary to rank the most important pages higher to make this tool practical. For this purpose, the PageRank model was developed by the Google team to rank the importance of Web pages based on the frequency of page visits recorded by a random user who keeps browsing the World Wide Web, choosing the hyperlinks on each page with equal probability. Mathematically speaking, it requires the computation of the stationary distribution of this Markov random walk on the linking structure of pages; the values of the distribution represent the frequency of visits to each Web page. The linking structure of the Web is represented by a large directed graph (called the Web link graph) and by its adjacency matrix $G \in \mathbb{R}^{n \times n}$ (here $n$ is the number of Web pages) such that $G_{ij} \neq 0$ (being 1) only when page $j$ has a hyperlink pointing to page $i$. The transition probability matrix $P \in \mathbb{R}^{n \times n}$ of this random walk process is defined as
$$P_{ij} = \begin{cases} \dfrac{G_{ij}}{\sum_{k=1}^{n} G_{kj}}, & \text{if } \sum_{k=1}^{n} G_{kj} \neq 0, \\[4pt] 0, & \text{otherwise.} \end{cases} \qquad (1)$$
However, to avoid that the process stagnates when a dangling page without hyperlinks is visited, in the model $P$ is usually modified as $\bar{P} = P + v d^{T}$, where $v \in \mathbb{R}^{n}$ is a probability distribution vector, $d \in \{0, 1\}^{n}$ is a binary vector, and $d_i = 1$ if page $i$ has no hyperlinks. According to the Perron-Frobenius theorem, the unique existence of the stationary distribution vector is guaranteed when $\bar{P}$ is an irreducible stochastic matrix. To enforce this assumption, $\bar{P}$ is further modified into the matrix
$$A = \alpha \bar{P} + (1 - \alpha)\, v e^{T}, \qquad (2)$$
where $\alpha \in (0, 1)$ is called the damping factor and $e = [1, 1, \ldots, 1]^{T}$. Matrix $A$ is often called the Google matrix [1]. Finally, the PageRank model can be stated mathematically as finding the unit positive eigenvector corresponding to the eigenvalue 1 of $A$, that is, the solution of the eigen-problem
$$A x = x, \qquad \|x\|_{1} = 1, \qquad x > 0. \qquad (3)$$
Because of the assumption that $A$ is an irreducible stochastic matrix, given by Equation (2), it follows that Equation (3) is uniquely solvable. The unique stationary distribution vector $x$ is called the PageRank vector. Note that the PageRank model is now often used as a network centrality measure to identify the most important nodes within large networks arising in several applications, such as chemistry, bioinformatics, neuroscience, bibliometrics, web search engines, social networks, etc. [2]. In some fields, such as bioinformatics, the related networks are usually undirected, thus the corresponding adjacency matrices are symmetric. This symmetry can be exploited to build efficient storage formats and efficient implementations of the matrix-vector multiplications, so that the difficulty of solving the PageRank model (also called the PageRank problem) can be reduced.
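As an illustration (not part of the original formulation), the following Python sketch assembles $P$, $\bar{P}$ and the Google matrix $A$ of Equation (2) for a hypothetical 4-page graph and checks that $A$ is column-stochastic; all variable names are our own.

```python
import numpy as np

# Hypothetical 4-page web graph: G[i, j] = 1 if page j links to page i.
G = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # page 4 has no out-links (dangling)
n = G.shape[0]
alpha = 0.85                                 # damping factor
v = np.full(n, 1.0 / n)                      # teleportation (personalization) vector
e = np.ones(n)

col_sums = G.sum(axis=0)
d = (col_sums == 0).astype(float)            # d_j = 1 if page j has no out-links
P = np.divide(G, col_sums, out=np.zeros_like(G), where=col_sums != 0)

P_bar = P + np.outer(v, d)                   # dangling-page correction
A = alpha * P_bar + (1 - alpha) * np.outer(v, e)   # Google matrix, Eq. (2)

print(np.allclose(A.sum(axis=0), 1.0))       # True: A is column-stochastic
```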
With the rapid development of science and technology, the dimension of PageRank problems from various application fields has grown hugely in the last decades and keeps growing. Accordingly, iterative methods have become the only viable option to solve Equation (3) numerically. Stationary iterative methods such as the Power, Jacobi and Gauss-Seidel methods are effective when the damping parameter $\alpha$ is not too close to 1; e.g., for search engine applications Google initially used $\alpha = 0.85$. However, larger values of $\alpha$ (not too close to 1) such as $\alpha = 0.99$ may sometimes give a better ranking result [3], and stationary methods tend to converge significantly more slowly when $\alpha$ is large. These computational issues make the PageRank problem continuously demand new algorithmic solutions that are more time- and memory-efficient.
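To illustrate the effect of the damping factor on stationary iterations, the short sketch below (a hypothetical test graph and helper function of our own, not taken from the paper) runs the Power method for $\alpha = 0.85$ and $\alpha = 0.99$ and reports the iteration counts. The graph consists of two disconnected components, so the second eigenvalue of $A$ equals $\alpha$ (a fact recalled later in Theorem 2) and the asymptotic convergence factor of the Power method is exactly $\alpha$.

```python
import numpy as np

def power_method(A, x0, tol=1e-8, max_it=100000):
    """Power iteration x <- A x with a 1-norm stopping test."""
    x = x0.copy()
    for it in range(1, max_it + 1):
        x_new = A @ x
        if np.linalg.norm(x_new - x, 1) < tol:
            return x_new, it
        x = x_new
    return x, max_it

# Hypothetical test problem: two disjoint directed rings of 100 pages each.
n = 200
P = np.zeros((n, n))
for i in range(100):
    P[(i + 1) % 100, i] = 1.0                # ring over pages 0..99
    P[100 + (i + 1) % 100, 100 + i] = 1.0    # ring over pages 100..199
v = np.full(n, 1.0 / n)
x0 = np.zeros(n)
x0[0] = 1.0                                  # all initial mass on page 1

for alpha in (0.85, 0.99):
    A = alpha * P + (1 - alpha) * np.outer(v, np.ones(n))
    _, its = power_method(A, x0)
    print(f"alpha = {alpha}: {its} power iterations")
```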
The development of more efficient algorithms for solving PageRank problems has been ongoing for the past decade or so. One research direction is to accelerate the convergence of stationary iterative methods; related works include adaptive [
4] and extrapolation methods [
5,
6,
7], multigrid solvers [
8,
9], and inner-outer iterations [
10,
11,
12,
13,
14]. Meanwhile a significant amount of work has been devoted in particular to the analysis of Krylov subspace methods for computing PageRank. Golub and Greif proposed in [
3] a new variant of the Arnoldi algorithm that is particularly suited for solving problems with a large range of damping values, and can outperform many conventional stationary iterative solvers when $\alpha$ is closer to 1. Yin et al. have used weighted inner products instead of the standard inner products in the Arnoldi process, reporting faster convergence due to a more favourable distribution of the harmonic Ritz values [
15]. The Power-Arnoldi method [
16], its weighted version [
17] and the Arnoldi-Inout method [
18] are hybrid variants of the Arnoldi process that periodically combine stationary and Krylov iterations, and can improve upon other established PageRank solvers on many examples. All of these techniques can accelerate the Arnoldi method from [3], but there is still considerable room for further improvement. In particular, preconditioning, which is widely used to accelerate Krylov subspace methods for solving linear systems, has not yet been considered for accelerating Arnoldi-type methods applied to the PageRank eigen-problem. This is an area worth investigating, because successful preconditioning usually yields a significant acceleration of the solution process and can be combined well with other acceleration techniques.
In this work we propose a new theoretical formulation of the PageRank eigenvector problem (3) that is characterized by a better separation between the dominant and the second dominant eigenvalues of the coefficient matrix, so that the Arnoldi process may require significantly fewer iterations to converge, and consequently fewer BLAS-1 (inner product and SAXPY) operations in the Gram-Schmidt orthogonalization. Our strategy of transforming the original problem into a new one that is more amenable to iterative solution can be seen as a form of preconditioning. The optimal parameter setting for this preconditioning technique is also discussed. Our experiments confirm the theoretical findings and demonstrate that Arnoldi-type methods applied to the preconditioned eigen-problem can converge much faster than their standard counterparts. Thus the approach has the potential to solve large-scale PageRank computations more effectively on both sequential and parallel machines. Accordingly, we can expect the proposed technique to accelerate projects from various fields that use the PageRank model. For example, for the GeneRank problem [19] and the ProteinRank problem [20], which use the PageRank model on gene-gene and protein-protein networks, users can more quickly identify the genes and proteins that are most likely to be pathogenic.
The paper is organized as follows. In
Section 2, we outline the refined Arnoldi method proposed in [
3] that is at the basis of our development, along with its main convergence results. In
Section 3, we present a preconditioned variant of the refined Arnoldi method for computing PageRank eigenvectors. Numerical experiments are reported in
Section 4 to support our theoretical findings. Finally, some conclusions from this study are presented in
Section 5. Note that MATLAB notation is used throughout this article.
2. The Refined Arnoldi Method for PageRank
The Arnoldi process proposed in 1951 is an algorithm based on the modified Gram-Schmidt orthogonalization that, after $m$ steps, computes an orthonormal basis of the Krylov subspace $\mathcal{K}_m(A, v_0) = \mathrm{span}\{v_0, A v_0, \ldots, A^{m-1} v_0\}$, where $A \in \mathbb{R}^{n \times n}$ and $v_0 \in \mathbb{R}^{n}$ are a given matrix and an initial vector, respectively. Here we sketch it in Algorithm 1.
In matrix form, Algorithm 1 yields after $m$ steps the Arnoldi decompositions
$$A V_m = V_m H_m + h_{m+1,m}\, v_{m+1} e_m^{T} \qquad (4)$$
and
$$A V_m = V_{m+1} \bar{H}_m, \qquad (5)$$
where $H_m = (h_{ij}) \in \mathbb{R}^{m \times m}$ is an upper Hessenberg matrix, and we denote by $V_m = [v_1, v_2, \ldots, v_m]$ the matrix collecting the orthonormal basis vectors, by $e_m$ the last column of the identity matrix of dimension $m$, and by $\bar{H}_m \in \mathbb{R}^{(m+1) \times m}$ the matrix obtained by appending the row $h_{m+1,m} e_m^{T}$ to $H_m$. In Table 1 we summarize the computational cost of Algorithm 1 for the case of a general matrix $A$, and in Table 2 for the special case of the Google matrix (2). Note that in the latter case, from the expression $Au = \alpha P u + (w^{T}u)\, v$ with $w = \alpha d + (1 - \alpha)e$, it follows that the product $Au$ requires 1 sparse matrix-vector operation $Pu$ (hereafter $u$ represents an arbitrary vector with appropriate dimension), 1 inner product $w^{T}u$, where the vector $w$ is precomputed and stored, 1 vector scaling operation and 1 vector addition.
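The operation count above corresponds to a matrix-free implementation of the product $Au$. A possible Python sketch (with a SciPy sparse $P$ and the stored vector $w = \alpha d + (1-\alpha)e$, a naming choice of ours) is the following.

```python
import numpy as np
import scipy.sparse as sp

def google_matvec(P, w, v, alpha, u):
    """Compute A u = alpha*(P u) + (w^T u) v without assembling A.

    With alpha folded into the stored matrix (i.e., keeping alpha*P), this
    amounts to 1 sparse matrix-vector product, 1 inner product, 1 vector
    scaling and 1 vector addition, matching the count discussed in the text."""
    return alpha * (P @ u) + (w @ u) * v

# Tiny usage example on a hypothetical 3-page graph (page 3 is dangling).
P = sp.csc_matrix(np.array([[0.0, 0.5, 0.0],
                            [1.0, 0.0, 0.0],
                            [0.0, 0.5, 0.0]]))
n, alpha = 3, 0.85
v = np.full(n, 1.0 / n)
d = np.array([0.0, 0.0, 1.0])
w = alpha * d + (1 - alpha) * np.ones(n)   # stored once and reused
u = np.full(n, 1.0 / n)
print(google_matvec(P, w, v, alpha, u))    # = A u for the uniform vector u
```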
Algorithm 1 The Arnoldi process $[V_m, V_{m+1}, H_m, \bar{H}_m, \beta] = \mathrm{Arnoldi}(A, v_0, m)$
- 1: Compute $\beta = \|v_0\|_2$, $v_1 = v_0 / \beta$.
- 2: for $j = 1, 2, \ldots, m$ do
- 3: Compute $w = A v_j$.
- 4: for $i = 1, 2, \ldots, j$ do
- 5: Compute $h_{ij} = v_i^{T} w$.
- 6: Compute $w = w - h_{ij} v_i$.
- 7: end for
- 8: Compute $h_{j+1,j} = \|w\|_2$.
- 9: if $h_{j+1,j} = 0$ then
- 10: Set $m = j$, truncate $V_{m+1}$ and $\bar{H}_m$ accordingly, and break.
- 11: else
- 12: Set $v_{j+1} = w / h_{j+1,j}$.
- 13: end if
- 14: end for
- 15: Generate $V_{m+1} = [v_1, v_2, \ldots, v_{m+1}]$ and $\bar{H}_m = (h_{ij})$.
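For concreteness, a compact NumPy transcription of Algorithm 1 (our own sketch, written for a generic `matvec` callable so that either $Au$ or, later, $A^{k}u$ can be plugged in) could look as follows.

```python
import numpy as np

def arnoldi(matvec, v0, m):
    """Modified Gram-Schmidt Arnoldi process (Algorithm 1).

    Returns the orthonormal basis V (n x (m+1)), the (m+1) x m upper
    Hessenberg matrix H_bar, the number of steps performed and beta = ||v0||_2."""
    n = v0.shape[0]
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    beta = np.linalg.norm(v0)
    V[:, 0] = v0 / beta
    for j in range(m):
        w = matvec(V[:, j])
        for i in range(j + 1):              # modified Gram-Schmidt orthogonalization
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:              # lucky breakdown: the subspace is A-invariant
            m = j + 1                       # (the last returned basis column stays zero)
            break
        V[:, j + 1] = w / H[j + 1, j]
    return V[:, :m + 1], H[:m + 1, :m], m, beta
```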
The Arnoldi process is the main ingredient of many efficient numerical techniques for solving linear systems and eigen-problems. After $m$ steps, the scalars $\lambda_i$ and vectors $y_i = V_m z_i$ satisfying
$$H_m z_i = \lambda_i z_i, \qquad i = 1, 2, \ldots, m, \qquad (6)$$
are called the Ritz values and the Ritz vectors of $A$ onto the Krylov subspace $\mathcal{K}_m(A, v_0)$, respectively. The Ritz values with largest real parts and their corresponding Ritz vectors are often used to approximate the eigenvalues with largest real parts of $A$ and their associated eigenvectors [21]. However, in-depth convergence analysis of the Arnoldi method shows that the Ritz vectors are not guaranteed to converge to the actual eigenvectors, even when the Ritz values do converge. To overcome this difficulty, Jia proposed in [22] to compute the refined Ritz vector $\tilde{y}_i$ associated to the Ritz value $\lambda_i$, defined as
$$\tilde{y}_i = \arg\min_{u \in \mathcal{K}_m(A, v_0),\; \|u\|_2 = 1} \|(A - \lambda_i I) u\|_2. \qquad (7)$$
The solution of Equation (7) is given by
$$\tilde{y}_i = V_m W(:, m), \qquad (8)$$
where $U \Sigma W^{T}$ is the singular value decomposition of the matrix $\bar{H}_m - \lambda_i \tilde{I}$, with $\tilde{I} \in \mathbb{R}^{(m+1) \times m}$ denoting the identity matrix of dimension $m$ appended with a zero row, the singular values are stored in the diagonal part of $\Sigma$ in decreasing order and the columns of $W$ are the right singular vectors. Besides, the 2-norm of the residual vector $(A - \lambda_i I)\tilde{y}_i$ is equal to the smallest singular value $\sigma_m$.
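In NumPy terms, the refined Ritz vector of Equation (8) can be extracted from the small SVD as in the sketch below (a helper of our own, taking `V` and `H_bar` as returned by the Arnoldi sketch above).

```python
import numpy as np

def refined_ritz_vector(V, H_bar, shift):
    """Refined Ritz vector for a given shift, following Equations (7)-(8).

    The minimiser is V_m times the right singular vector associated with the
    smallest singular value of H_bar - shift*[I; 0]; that singular value also
    equals the residual 2-norm ||(A - shift*I) y||_2."""
    m = H_bar.shape[1]
    E = np.vstack([np.eye(m), np.zeros((1, m))])   # the matrix [I_m; 0]
    _, sigma, Wt = np.linalg.svd(H_bar - shift * E)
    y = V[:, :m] @ Wt[-1, :]                       # singular values are sorted, so the last is smallest
    return y, sigma[-1]
```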
Although in exact arithmetic the refined Ritz vector $\tilde{y}_i$ is guaranteed to converge to the eigenvector associated to the sought eigenvalue when the corresponding Ritz value $\lambda_i$ converges to it, some potential numerical difficulties may still arise in practice. Firstly, the largest Ritz value may be complex, and the use of complex arithmetic may be a memory burden for large-scale computations. Secondly, when $\alpha$ is close to 1, slow or irregular convergence can still happen due to a weak separation of the eigenvalues of $A$. Golub and Greif propose to use the largest eigenvalue of the Google matrix, which is known and equal to 1, as a shift in (8) instead of the largest Ritz value $\lambda_1$, as an attempt to overcome the problems related to slow convergence [3]. We present the Golub and Greif variant of the refined Arnoldi method for PageRank in Algorithm 2, and refer to it shortly as the GG-Arnoldi method hereafter.
Algorithm 2 The Golub and Greif variant of the restarted refined Arnoldi method (shortly referred to as GG-Arnoldi)
Input: the PageRank coefficient matrix $A$, initial guess $x_0$, parameters $m$ and $tol$.
- 1: Run Algorithm 1 as $[V_m, V_{m+1}, H_m, \bar{H}_m, \beta] = \mathrm{Arnoldi}(A, x_0, m)$.
- 2: Compute the SVD factorization $U \Sigma W^{T} = \bar{H}_m - \tilde{I}$.
- 3: Compute the updated approximate solution $x = V_m W(:, m)$.
- 4: Compute the residual 2-norm as the smallest singular value $\sigma_m$.
- 5: if $\sigma_m < tol$ then
- 6: Output $x = x / \|x\|_1$ and end.
- 7: else
- 8: Set $x_0 = x$ and go to step 1.
- 9: end if
- 10: return $x$.
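Putting the pieces together, a minimal driver for Algorithm 2 (again a sketch of ours, reusing the `arnoldi` and `refined_ritz_vector` helpers defined above and any `matvec` implementing $u \mapsto Au$, for instance a wrapper around the `google_matvec` sketch) might read:

```python
import numpy as np

def gg_arnoldi(matvec, x0, m, tol, max_cycles=200):
    """Restarted refined Arnoldi with the known shift 1 (Algorithm 2)."""
    x = x0.copy()
    for cycle in range(max_cycles):
        V, H_bar, _, _ = arnoldi(matvec, x, m)
        x, res = refined_ritz_vector(V, H_bar, shift=1.0)
        if res < tol:
            break
    # Fix the sign and normalise so that ||x||_1 = 1 (x is positive up to sign).
    return x / x.sum(), cycle + 1, res
```

Dividing by the signed sum at the end both fixes the sign of the refined Ritz vector and enforces the unit 1-norm required by Equation (3).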
The overall computational cost of Algorithm 2 is summarized in Table 3. Note that the singular value decomposition computed at line 2 is omitted because its cost is negligible when $m$ is very small. Analogously, the vector scaling $x / \|x\|_1$ at line 6 is not reported in the table, as it is computed only at the last cycle. Finally, the operation $x = V_m W(:, m)$ is equivalent to $m$ vector scaling operations and $m - 1$ vector additions.
The convergence analysis of the GG-Arnoldi method is presented in [20]. Here we recall the main result. We denote as $V_{\perp}$ the standard orthogonal basis of the orthogonal complement of the space $\mathcal{K}_m(A, x_0)$, so that $[V_m, V_{\perp}]$ is orthonormal. Then, we have
$$[V_m, V_{\perp}]^{T} A\, [V_m, V_{\perp}] = \begin{bmatrix} H_m & G_{12} \\ G_{21} & L \end{bmatrix}, \qquad (9)$$
where $L = V_{\perp}^{T} A V_{\perp}$. According to the Perron-Frobenius theorem, the spectral radius of $A$ is equal to 1, $A$ has only one eigenvalue 1 on the unit circle, and the eigenspace of this eigenvalue is 1-dimensional. As 1 is a simple eigenvalue of $A$, we introduce a separation function named "sep" and defined as
$$\mathrm{sep}(1, L) = \sigma_{\min}(L - I) = \frac{1}{\|(L - I)^{-1}\|_2}. \qquad (10)$$
Since $\mathrm{sep}(1, L) \leq |1 - \mu|$ for any eigenvalue $\mu$ of $L$, and the eigenvalues of $L$ tend to approximate the non-dominant eigenvalues of $A$, the largest of which is the second eigenvalue $\lambda_2$ of $A$ [20], we obtain that $\mathrm{sep}(1, L)$ is essentially bounded by the gap $|1 - \lambda_2|$ and is therefore small when $\lambda_2$ is close to 1. Under these assumptions, the theorem below suggests that the larger the second dominant eigenvalue $\lambda_2$, the slower the convergence rate of the GG-Arnoldi method applied to the eigen-problem (3).
Theorem 1 (Theorem 2.2 in [20]). Let $\Pi_m$ be the orthogonal projector onto the subspace $\mathcal{K}_m(A, x_0)$, let $\epsilon = \|(I - \Pi_m) x\|_2$ be the distance between the PageRank vector $x$ and the subspace $\mathcal{K}_m(A, x_0)$, and let $\tilde{x}$ be the solution generated by the GG-Arnoldi method. Then it holds
$$\sin \angle(x, \tilde{x}) \;\leq\; \epsilon \,\sqrt{1 + \frac{\gamma^{2}}{\mathrm{sep}^{2}(1, L)}}, \qquad \gamma = \|\Pi_m A (I - \Pi_m)\|_2.$$
We recall below another property of the eigenvalues of the Google matrix A that is very relevant to our analysis.
Theorem 2 ([23]). If a column-stochastic matrix $\bar{P}$ has at least two irreducible closed subsets (which is the case for the web hyperlink matrix), then the second eigenvalue of $A = \alpha \bar{P} + (1 - \alpha)\, v e^{T}$, where $\alpha \in (0, 1)$ and $v$ is a vector with non-negative elements satisfying $e^{T} v = 1$, is given by $\alpha$.

Therefore, slow convergence of the GG-Arnoldi method for PageRank problems may be expected in particular when $\alpha$ approaches 1. As this situation arises in several applications, some acceleration techniques have been developed in the past years to enhance the robustness of the GG-Arnoldi method. They can be classified in two types: (1) using weighted inner products instead of the standard inner products in Algorithm 2 to ensure a more favourable distribution of Ritz values [
15,
17]; (2) combining the GG-Arnoldi method with a few cycles of stationary iterative solvers, like in the Power-Arnoldi method [
16], the Inout-Arnoldi method [
18], the Arnoldi-PET method [
17] and others. Both approaches attempt to provide better initial guesses for the Arnoldi process. To the best of our knowledge, no research work has investigated efficient ways to precondition the PageRank eigen-problem (
3), in order to make it easier to solve with Arnoldi-type methods. This is the main objective of our study.
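Theorem 2 is easy to check numerically on a toy example. The snippet below (our own construction of a graph made of two disjoint 3-cycles, so that $\bar{P}$ has two irreducible closed subsets) confirms that the second largest eigenvalue modulus of $A$ equals $\alpha$.

```python
import numpy as np

# Two disjoint 3-cycles: the column-stochastic P has two irreducible
# closed subsets, so Theorem 2 applies and lambda_2(A) = alpha exactly.
n, alpha = 6, 0.99
P = np.zeros((n, n))
for i in range(3):
    P[(i + 1) % 3, i] = 1.0            # first cycle over pages 0, 1, 2
    P[3 + (i + 1) % 3, 3 + i] = 1.0    # second cycle over pages 3, 4, 5
v = np.full(n, 1.0 / n)
A = alpha * P + (1 - alpha) * np.outer(v, np.ones(n))   # no dangling pages here

mods = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
print(mods[0], mods[1])   # prints approximately 1.0 and 0.99
```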
3. Preconditioning the Refined Arnoldi Method
Preconditioning is an established technique used to accelerate Krylov subspace methods for solving linear systems. A nonsingular linear system $Ax = b$ can be transformed into an equivalent one of either the form $M^{-1}Ax = M^{-1}b$ (left preconditioned system) or the form $AM^{-1}y = b$ with $x = M^{-1}y$ (right preconditioned system), where $M$ is called the preconditioner matrix or simply the preconditioner. The goal of preconditioning is to improve the spectral distribution of the coefficient matrix $A$ so that a Krylov subspace algorithm can solve the transformed preconditioned system much faster than the original one. The development of efficient preconditioners is a very important topic in numerical linear algebra, because it can enable the rapid solution of problems that may initially appear numerically intractable. However, preconditioning is seldom used for solving eigen-problems, because the left preconditioned ($M^{-1}Ax = \lambda x$) and right preconditioned ($AM^{-1}y = \lambda y$, $x = M^{-1}y$) eigen-problems are no longer equivalent to the original one.
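As a generic illustration of this idea for linear systems (unrelated to the PageRank-specific construction developed below; the matrix and preconditioner are hypothetical choices of ours), the following sketch applies a simple Jacobi preconditioner and compares the spread of the eigenvalue moduli before and after left preconditioning.

```python
import numpy as np

# Left preconditioning of B x = b with a Jacobi (diagonal) preconditioner
# M = diag(B): M^{-1} B x = M^{-1} b has the same solution, but its eigenvalues
# cluster around 1 instead of spreading over [1, 1e4], which is the property
# Krylov solvers benefit from.
rng = np.random.default_rng(0)
n = 200
B = np.diag(np.linspace(1.0, 1e4, n)) + rng.standard_normal((n, n)) / (10 * n)
Minv = np.diag(1.0 / np.diag(B))

for name, C in (("original B", B), ("preconditioned M^{-1} B", Minv @ B)):
    ev = np.abs(np.linalg.eigvals(C))
    print(f"{name}: |lambda| in [{ev.min():.3f}, {ev.max():.3f}]")
```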
Here we propose a new approach to precondition the PageRank eigen-problem (3) for the computation of the dominant eigenvector $x$ corresponding to the largest eigenvalue $\lambda_1 = 1$ of the Google matrix $A$. According to the analysis presented in Section 2, the convergence speed of the GG-Arnoldi method is mainly determined by the separation between the largest and second largest eigenvalues of $A$, namely by the quantity $|1 - \lambda_2|$. Therefore, our strategy for preconditioning Equation (3) is to transform it into an equivalent eigen-problem that has a better separation between its two dominant eigenvalues. The main theoretical result underlying our method is presented in the theorem below.
Theorem 3. For the Google matrix $A$ in (2) and its eigenvalues $\lambda_1 = 1, \lambda_2, \ldots, \lambda_n$, if a polynomial $\psi$ satisfies $\psi(1) = 1$ and $\psi(\lambda_i) \neq 1$ for every eigenvalue $\lambda_i \neq 1$, then the PageRank eigen-problem (3) is equivalent to the eigen-problem:
$$\psi(A)\, x = x, \qquad \|x\|_1 = 1, \qquad x > 0. \qquad (11)$$

Proof. It is clear that any solution of $Ax = x$ is also a solution of $\psi(A)x = x$, since $\psi(1) = 1$. Suppose
$$A = S \,\mathrm{diag}(J_1, J_2, \ldots, J_s)\, S^{-1},$$
where each $J_i$ ($i = 1, 2, \ldots, s$) is a Jordan block whose diagonal elements equal an eigenvalue $\lambda_i$ of $A$. Then,
$$\psi(A) = S \,\mathrm{diag}(\psi(J_1), \psi(J_2), \ldots, \psi(J_s))\, S^{-1}.$$
Clearly, each $\psi(J_i)$ ($i = 1, 2, \ldots, s$) is still a lower triangular matrix, and its diagonal elements equal $\psi(\lambda_i)$. Because $\psi(\lambda_i) \neq 1$ whenever $\lambda_i \neq 1$, the algebraic multiplicity of the eigenvalue 1 of $\psi(A)$ equals 1, the same as that of the eigenvalue 1 of $A$. Therefore $\psi(A)x = x$ has the same solution space as $Ax = x$, i.e., problem (3) is equivalent to problem (11). □
The obvious question to address is whether such a polynomial $\psi$, satisfying the condition $\psi(1) = 1$ and $\psi(\lambda_i) \neq 1$ on the eigenvalues $\lambda_i \neq 1$ of $A$, exists, and then how to make $\psi(A)x = x$ easier to solve than $Ax = x$. The simple polynomial $\psi(A) = A^{k}$ (with $k$ a positive integer) clearly satisfies this condition, because $|\lambda_i| \leq \alpha < 1$ for $i \geq 2$. According to Theorem 3, the two eigen-problems $Ax = x$ and $A^{k}x = x$ are therefore equivalent. Besides, in $A^{k}$ the distance between the largest two eigenvalues equals $1 - \alpha^{k}$, which is larger than the gap $1 - \alpha$ of $A$; for instance, with $\alpha = 0.99$ and $k = 4$ the gap grows from $0.01$ to $1 - 0.99^{4} \approx 0.039$. As a result, we can expect that the GG-Arnoldi method will require fewer cycles and operations to converge when it is applied to $A^{k}x = x$ than to the problem $Ax = x$. We sketch the complete preconditioned version of the Golub and Greif variant of the refined Arnoldi method with the polynomial choice $\psi(A) = A^{k}$, hereafter shortly referred to as the PGG-Arnoldi method, in Algorithm 3.
Algorithm 3 The preconditioned version of the Golub and Greif variant of the refined Arnoldi method, with the polynomial choice $\psi(A) = A^{k}$ (shortly, PGG-Arnoldi) for PageRank.
Input: the PageRank coefficient matrix $A$, initial guess $x_0$, parameters $m$, $k$ and $tol$.
- 1: Run Algorithm 1 as $[V_m, V_{m+1}, H_m, \bar{H}_m, \beta] = \mathrm{Arnoldi}(A^{k}, x_0, m)$.
- 2: Compute the SVD factorization $U \Sigma W^{T} = \bar{H}_m - \tilde{I}$.
- 3: Compute the updated approximate solution $x = V_m W(:, m)$.
- 4: Compute the residual 2-norm as the smallest singular value $\sigma_m$.
- 5: if $\sigma_m < tol$ then
- 6: Output $x = x / \|x\|_1$ and end.
- 7: else
- 8: Set $x_0 = x$ and go to step 1.
- 9: end if
- 10: return $x$.
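In code, the only change with respect to the GG-Arnoldi driver sketched in Section 2 is the operator handed to the Arnoldi process: one passes $u \mapsto A^{k}u$ instead of $u \mapsto Au$. Reusing our earlier hypothetical helpers, this amounts to:

```python
def pgg_arnoldi(matvec_A, x0, m, k, tol, max_cycles=200):
    """PGG-Arnoldi (Algorithm 3): the GG-Arnoldi driver applied to u -> A^k u."""
    def matvec_Ak(u):
        for _ in range(k):        # k applications of A per Arnoldi step, cf. Algorithm 4
            u = matvec_A(u)
        return u
    return gg_arnoldi(matvec_Ak, x0, m, tol, max_cycles)
```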
Note that the matrix $A^{k}$ in Algorithm 3 is never formed explicitly. The operations associated with $A^{k}$ in this algorithm are all matrix-vector multiplications; each such operation $y = A^{k}u$ is computed by carrying out $k$ sparse matrix-vector products, without assembling the Google matrix $A$, as shown in Algorithm 4.
Algorithm 4 Implementation of the matrix-vector product $y = A^{k}u$
- 1: for $i = 1\!:\!k$ do
- 2: Compute $u = \alpha P u + (w^{T}u)\, v$, with $w = \alpha d + (1 - \alpha)e$.
- 3: end for
- 4: Set $y = u$.
- 5: return $y$.
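A matrix-free NumPy sketch of Algorithm 4, with the same storage choices assumed in Section 2 (a sparse $P$ and a stored vector $w = \alpha d + (1-\alpha)e$, our notation), is given below.

```python
import numpy as np

def a_power_matvec(P, w, v, alpha, u, k):
    """Algorithm 4: compute y = A^k u with k sparse products, never forming A."""
    y = u.copy()
    for _ in range(k):
        y = alpha * (P @ y) + (w @ y) * v   # one application of the Google matrix
    return y
```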
The overall algorithmic complexity of a cycle of the PGG-Arnoldi method (Algorithm 3) is summarized in
Table 4.
It is clear that one cycle of Algorithm 3 is computationally more expensive than one cycle of the GG-Arnoldi method (Algorithm 2) for the same value of $m$, because the former requires computing the matrix-vector products $A^{k}u$ instead of $Au$. In detail, the PGG-Arnoldi algorithm needs $(k-1)m$ additional sparse matrix-vector products, $(k-1)m$ additional inner products, $(k-1)m$ additional vector scalings and $(k-1)m$ additional vector additions compared with the GG-Arnoldi algorithm per cycle. The computation of an inner product requires $n$ floating-point multiplications and $n-1$ floating-point additions, while scaling a vector only needs $n$ floating-point multiplications. Thus, we will assume that a vector scaling operation has half the cost of an inner product with vectors of the same dimension. For the same reason, the cost of each vector addition, 2-norm and 1-norm of a vector can be counted as 1/2, 1 and 1/2 inner products, respectively. Finally, the algorithmic complexity estimates of one cycle of GG-Arnoldi and PGG-Arnoldi, in terms of sparse matrix-vector multiplications and vector inner products, are presented in Table 5.
We observe from Table 5 that one cycle of the PGG-Arnoldi costs less than $k$ times the cost of one cycle of the GG-Arnoldi. This is because the orthogonalization work of the Arnoldi process does not scale with $k$: only the matrix-vector product part does. Therefore, if the number of cycles required by the PGG-Arnoldi is less than $1/k$ times that of the GG-Arnoldi, the former must be faster. Indeed, it may be faster even when the number of cycles required by the PGG-Arnoldi is larger than $1/k$ times that of the GG-Arnoldi, as will be shown by the numerical experiments. Besides, in real applications the computation of a matrix-vector product is more efficient than that of vector operations in either serial or parallel environments, as the former belongs to the BLAS-2 class while the latter belongs to BLAS-1. Note that it is very desirable to improve the efficiency of parallel computing for solving large problems such as PageRank. Given a good estimate of the convergence rate comparison between these two methods and of the density of the matrix $G$, the results presented in Table 5 can guide the choice between the PGG-Arnoldi and the GG-Arnoldi methods for the problem at hand. However, little is known (in particular, quantitatively) about the convergence rate of the GG-Arnoldi method besides the fact that it improves as the gap $|1 - \lambda_2|$ increases; therefore, we rely on numerical experiments to analyze the performance of the PGG-Arnoldi method.
Finally, we remark that:
- acceleration techniques designed to work with the GG-Arnoldi, such as extrapolation methods [6,7], formulations based on weighted inner products [3] and hybrid solution schemes [16,17,18] that combine stationary iterations with the Arnoldi method, can still be applied to the PGG-Arnoldi;
- the performance of the PGG-Arnoldi can be further improved by using pre-processing algorithms such as the elimination strategy in [24] and the low-rank factorization in [25], which reduce the time and memory cost of computing the sparse matrix-vector product $Pu$, because the sparse matrix-vector products account for a larger proportion of the computation in the PGG-Arnoldi than in the GG-Arnoldi.