Abstract—The scholarly literature is expanding at a rate that necessitates intelligent algorithms for search and navigation. For the most part, the problem of delivering scholarly articles has been solved. If one knows the title of an article, locating it requires little effort and, paywalls permitting, acquiring a digital copy has become trivial. However, the navigational aspect of scientific search – finding relevant, influential articles that one does not know exist – is in its early development. In this paper, we introduce Eigenfactor Recommends – a citation-based method for improving scholarly navigation. The algorithm uses the hierarchical structure of scientific knowledge, making possible multiple scales of relevance for different users. We implement the method and generate more than 300 million recommendations from more than 35 million articles from various bibliographic databases, including the AMiner dataset. We find little overlap with co-citation, another well-known citation recommender, which indicates potential complementarity. In an online A-B comparison using SSRN, we find that our approach performs as well as co-citation, but this new approach offers much larger recommendation coverage. We make the code and recommendations freely available at babel.eigenfactor.org and provide an API for others to use for implementing and comparing the recommendations on their own platforms.
1 INTRODUCTION
find papers that may be more distantly related but represent foundational contributions to the broader area of research (Classic Recommendations) for researchers new to a field. We find that this approach provides recommendations for a much larger set of papers than one can provide using a co-citation approach, and performs at least as well in A-B testing. It also, distinctly, provides recommendations at different scales for different user types.

We apply EFrec to the AMiner data set [4] included in this special issue on 'Big Scholar Data Discovery'. To supplement this analysis, we also generate more than 300 million recommendations for various other bibliographic datasets including the arXiv, JSTOR, Microsoft Academic Search (MAS), PLoS, the Social Science Research Network (SSRN) and PubMed Central (Table 1) and make the code and data available through a public API.

PREVIOUS WORK

The Eigenfactor Recommends algorithm is not the first scholarly recommender. A number of document collections and portals already deploy basic recommendation designs that aim toward this goal. For example, when one views an abstract on PubMed, the system suggests five articles as "related citations" in the sidebar. Elsevier's ScienceDirect system, Highwire Press, and many other content systems present a similar sidebar highlighting related articles. In this section, we explain how EFrec fits within the broad category of scholarly recommender systems.

First, there are collaborative filtering approaches, which rely on finding similar users and using their ratings to provide recommendations. These techniques are extremely powerful and widely explored in the literature [5]; they require no knowledge about the items being recommended, and with sufficient user ratings can infer properties about the items in question [6]. They do, however, have several limitations that make them difficult to use for scholarly article recommendation [7]. First and foremost is the "cold-start" problem – items that don't have user ratings cannot be recommended [8], and such items make up a substantial portion of the literature. Another limitation is that to be effective you need good rating coverage; the number of ratings should dominate the number of items being rated to perform well. In domains with many items and few user ratings, such as scholarly literature, this condition can be difficult to satisfy. Perhaps the largest problem, though, is the difficulty in acquiring user ratings. To collect a sufficient amount of data you need substantial traffic and the infrastructure to serve and gather ratings from a large number of users. Unless one is a big publisher, these resources are difficult to obtain, and since publishers rarely share data, it is difficult to develop methods using usage data.

At the other end of the recommender spectrum are content-based methods, which instead match items to similar items based on features [7]. Content-based methods are ideal for datasets where there are few users; however, automated feature extraction is substantially difficult. Term frequency-inverse document frequency (TF-IDF) is a commonly used technique [7]. However, this approach has significant downsides, especially in scientific literature, where synonymy is an issue [9], [10]. And, like usage data, full text is difficult to obtain from publishers.

When domain-specific features are available, such as links in web pages or citations in academic literature, they can be used to great effect. The approach of using links between documents is often called link analysis. The first instance of it was in the early 1960s, when Kessler invented bibliographic coupling [11]. Bibliographic coupling measured document similarity by the number of shared citations. In his paper, Kessler describes some of the strengths of citation-based approaches over textual analysis, noting that they are language-independent and do not demand that the recommender have any expertise in the subject matter.

The next stage in the evolution of citation analysis came in the early 1970s, when Small introduced co-citation analysis [12], a method that uses the frequency with which papers are cited together as a measure of similarity. Small addressed several of the shortcomings of bibliographic coupling, including the permanence of bibliographic coupling, and noted how co-citation is able to change as intellectual patterns change over time. Co-citation analysis was later applied to authors by White [13] and has been used extensively as a standard citation-based method for recommending related papers.

The 1990s brought a flurry of papers on link analysis, representing the next epoch of techniques. This advancement was not motivated by the aim of providing better tools for scholars, but rather by the need to navigate the new, massive corpus of linked documents: the world wide web. The two most notable entries are HITS [14] by Kleinberg and PageRank [15] by Brin and Page. These methods sought to exploit the links present in web pages to provide a notion of authority or importance1. Both also utilized similar techniques of a random walker (or surfer), following links at random around the network. By tracking where this walker goes we can, based on the frequency of node visits, determine which nodes are the most important.

To take PageRank (or any impact measure) and turn it into a search algorithm you need to partition the network intelligently. Brin et al. did this by using standard information retrieval (IR) techniques, augmenting them slightly for hyperlinks. They then take this IR score and combine it with the PageRank to generate a final score for a given document. Kleinberg did this by collecting a "root set" of the top results from text-based search engines (specifically AltaVista and HotBot). HITS then matches a query to one of these partitions, and then generates impact scores.2 Haveliwala [21] modified PageRank and created a "topic-sensitive" variant, where, much like HITS, an external source is used to generate a partition and then an impact score is generated for that network, resulting in topic-specific PageRank scores.

1. There is a common misconception that PageRank is a search algorithm. In actuality, the function of PageRank is to rank the nodes in a network by importance. These rankings can then be used to determine the order in which search results are presented. Variants of PageRank have been applied to scholarly networks to determine the impact of journals, authors and articles [16], [17], [18], [19], [20].

2. This is a bit of a simplification of both steps. HITS grows out from the initial root set, and does not provide a single score but rather a hub score and an authority score.

A variant of this idea was used by PaperRank [22], a recommender for scholarly literature. In PaperRank's case they used the top-n results from the ACM digital library,
and crawled all citations (up to 2,000 papers) to form a topic-specific network.

What makes EFrec unique among these link analysis techniques is that it can both determine an impact score for papers in a network and partition the network using only the citation graph. We do this by exploiting a domain-specific property of scholarly literature: that scientific literature is hierarchically structured into domains, fields, subfields, sub-subfields, etc. If we can partition the network according to this hierarchy, we can then determine the impact within a specific area and provide more accurate recommendations. And, most importantly, EFrec can provide 'levels of relevance', which can be important when no information is provided about a user's needs. We explain this further in the following Methods section.

2 METHODS

The objective of the EFrec algorithm is to find relevant papers, given a seed paper3. The data required to generate these recommendations is a citation graph. The output consists of a list of paper IDs and a set of recommended papers associated with each paper ID. The number of recommendations to be obtained can be set by the user, and can range from one to hundreds of recommendations. Figure 1 illustrates the four-step process for generating paper recommendations.

(1) Assemble citation network

The first step requires assembling the citation graph of a relatively large corpus, where "relatively large" means at least a few hundred thousand papers and a few million citations – the bigger the better. This method does not perform as well for small graphs. The graph is represented as a link list file where column 1 is the citing paper ID and column 2 is the cited paper ID. For the AMiner dataset, the network consisted of 2,092,356 nodes (papers) and 8,024,869 links (citations). Because each paper cites only a small number of other references, the networks underlying large corpora will be highly sparse, as described in the section Sparseness of Citation Graphs [3].

(2) Rank nodes

The second step consists of ranking the nodes. We rank the nodes according to the article-level Eigenfactor (ALEF4) [25]. The original Eigenfactor algorithm, which is closely related to the original PageRank algorithm [15], performs well on journal-level citation graphs because these are cyclic, in that one can follow directed links (citations) from one journal out to other journals and then back to the originating journal. At the article level, however, citation graphs are acyclic or nearly so: a paper can only cite those articles that preceded it in time. When standard PageRank approaches are applied to acyclic citation graphs, older papers are excessively weighted [26], [27].

ALEF is a modified version of PageRank (PR), tailored specifically for the time-directed acyclic networks associated with article-level citations. We show in a recent study that ALEF performs better than PageRank and degree centrality [28] and have also demonstrated that ALEF better separates papers that contribute to a given theory [29].

As mentioned above, the problem with using a PageRank approach on article-level citation graphs is that the "random walker" on the graph will move inexorably backwards in time, and as a result will over-weight older papers. ALEF addresses these issues with two modifications: (1) it shortens the number of contiguous steps the random walker takes before she teleports to another part of the network, and (2) it teleports to links rather than nodes [30]. These modifications help adjust for the over-weighting without losing the ability to exploit network structure.

The mechanics of ALEF are relatively simple and proceed in five steps. First, the teleportation weight w_i for each node i is calculated by summing the in- and out-citations. Second, the random walker teleports to a random node based on the teleportation weights. Third, the random walker takes a step along the citation graph. Fourth, we compute the asymptotic rate at which each node is visited under this process. And fifth, these rates—which provide our rankings—are normalized so that the average score for all papers equals 1.

Specifically, the article citation network can be represented as an n × n adjacency matrix, Z, where the Z_{ij} entry equals 1 if article i cites article j and 0 otherwise. The matrix is highly sparse since an individual article cites a tiny portion of all articles in the corpus. The teleportation vector, indicating the rate of teleportation to each node i, is calculated in the following way:

    w_i = \sum_{j}^{n} \left( Z_{ij} + Z^{T}_{ij} \right)

The matrix Z is then row normalized so that the sum of each row i equals 1. We call this row-stochastic matrix H:

    H_{ij} = \frac{Z_{ij}}{\sum_{j} Z_{ij}}

The ALEF scores are then calculated by multiplying w by H and normalizing the scores by the number of papers, n, in the corpus:

    \mathrm{ALEF} = n \, \frac{H^{T} \cdot w}{\sum_{j} \left[ H^{T} \cdot w \right]_{j}}

We have chosen the current ranking strategy as a balance between a pure degree-centrality ranking (i.e., counting links) and the original PageRank algorithm5.

3. Instead of a paper, the input could be an author's name. The recommendations would then be related authors. We have developed ranking methods at the author level and applied them to the author-disambiguated SSRN dataset [17]. Because of author disambiguation issues with the other datasets we have not deployed the author-level version of EFrec as of yet.

4. These scores are used to order the top articles in ref. [24]. A manuscript describing the method in detail is in preparation and will be posted on the arXiv. This paper and details of the method can be provided upon request.

5. It should be noted that the basic logic of the Eigenfactor Recommends algorithm could have used a different ranking method, but recent data challenges have shown that ALEF performs better for identifying important papers [28].
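For concreteness, the following is a minimal Python sketch of steps 1 and 2 as written out in the equations above: it loads the two-column link list into a sparse adjacency matrix and applies a single teleport-then-step pass. It is only a sketch of those equations, not the released ALEF code (available at https://github.com/jevinw/ALEF/), and it omits the shortened-walk and teleport-to-links modifications described above; the function name and the toy link list are ours.

import numpy as np
from scipy import sparse

def alef_scores(citing, cited, n):
    """Sketch of the ALEF ranking equations given above.

    citing, cited : equal-length integer arrays; entry k encodes "paper
                    citing[k] cites paper cited[k]" (the two-column link list).
    n             : number of papers (node IDs assumed to be 0..n-1).
    Returns a score vector normalized so the mean score is 1.
    """
    # Adjacency matrix Z: Z[i, j] = 1 if paper i cites paper j
    Z = sparse.csr_matrix((np.ones(len(citing)), (citing, cited)), shape=(n, n))

    # Teleportation weights w_i = sum_j (Z_ij + Z^T_ij), i.e. out-degree + in-degree
    w = np.asarray(Z.sum(axis=1)).ravel() + np.asarray(Z.sum(axis=0)).ravel()

    # Row-stochastic matrix H: each nonzero row of Z divided by its row sum
    out_deg = np.asarray(Z.sum(axis=1)).ravel()
    inv_out = np.divide(1.0, out_deg, out=np.zeros_like(out_deg), where=out_deg > 0)
    H = sparse.diags(inv_out) @ Z

    # One teleport-then-step pass: scores proportional to H^T w,
    # normalized so the average score over all n papers equals 1
    s = H.T @ w
    return n * s / s.sum()

# Toy usage with a hypothetical 4-paper link list
citing = np.array([1, 2, 3, 3])
cited = np.array([0, 0, 0, 1])
print(alef_scores(citing, cited, 4))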
Fig. 1. Eigenfactor Recommends. The process for generating paper recommendations includes the following steps. (1) In the first step, the paper
citation graph is assembled into an adjacency matrix using the citations (links) between papers (nodes). (2) The second step ranks the nodes using
the article-level Eigenfactor algorithm (ALEF). This is a modified, time-directed version of the PageRank algorithm [15]. (3) The third step imports
the citation graph and node rankings and then clusters the citation graph using the hierarchical MapEquation [23]. (4) The last step generates
recommendations given a seed paper (highlighted in yellow). Expert recommendations are drawn from the lowest level of the hierarchical tree,
while Classic recommendations include papers from one level up the hierarchy. A paper can be both an expert recommendation and a classic
recommendation if it is the highest rated paper at the lowest and upper levels.
(3) Hierarchically cluster nodes

The third step is to cluster the network, using the citation graph from step 1 and the node rankings from step 2. Based on these inputs, the hierarchical map equation [23] uncovers the boundaries between domains, fields, subfields, etc.

For simplicity, we begin with a description of the non-hierarchical map equation [31], [32], which uncovers basic modular structure within networks, returning a hard partition in which each node is assigned to a single module. The map equation exploits the duality between compressing data and revealing patterns within the data. The core idea is to compress a description of a random walk on the network. If the network has localized regions such that a random walker has a long persistence time in a small group of nodes, a random walk can be concisely encoded when this structure is exploited in the coding scheme.

Optimally compressed using a two-level description (nodes and modules), the per-step description length of a random walk on a network under partition M is given by the map equation:

    L(M) = q_{\curvearrowright} H(Q) + \sum_{i=1}^{m} p^{i}_{\circlearrowright} H(P^{i})

The term H(Q) represents the description length necessary to transmit the name of the module in which the random walker resides, and this is weighted by q_{\curvearrowright}, the frequency of movement between modules. The term H(P^{i}) represents the description length necessary to transmit the node within module i to which the walker has moved. This term is weighted by the frequency p^{i}_{\circlearrowright} with which the random walker moves within module i. Shannon's source coding theorem [33] states that the minimum code length necessary to describe a random variable X is given by its entropy:

    H(X) = -\sum_{i=1}^{n} p_i \log(p_i)

Using this fact, for any given partition of the network each of the terms can be calculated in straightforward fashion from the citation matrix. Details are provided in refs. [31], [32]. By searching numerically across the space of possible network partitions for the one that allows the shortest description length, we can find an optimal partition of the network. Benchmark studies have revealed that this approach performs extremely well relative to other network clustering algorithms [34], [35].

The hierarchical map equation extends this basic approach to reveal modular structure on multiple levels. This is done by extending the coding scheme of the basic map equation to a hierarchical one. The reader is referred to ref. [23] for details. Performance of this algorithm is extremely good [36] and it readily scales to networks with tens of millions of links and hundreds of millions of nodes [37]. The software for running this part of the algorithm6 can be downloaded at mapequation.org.

6. When using InfoMap, select the "-t" option for undirdir, or undirected, directed. Undirdir is the approach underlying the ALEF ranking.
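To illustrate the quantity being minimized, the sketch below evaluates the two-level map equation for a given hard partition of a small undirected network, in which a walker's stationary visit rate at a node reduces to its degree fraction (so no power iteration is needed). This is an illustrative calculation only; the clustering in our pipeline is done with InfoMap (see footnote 6), and the function and toy network here are ours.

import math
from collections import defaultdict

def map_equation(edges, module_of):
    """Evaluate the two-level map equation L(M) for a hard partition.

    edges     : list of undirected (u, v) pairs
    module_of : dict mapping node -> module id
    For an undirected network the walker's visit rate at a node is its
    degree fraction, which avoids a power iteration in this sketch.
    """
    def entropy(probs):
        # Shannon entropy in bits of a normalized distribution (0 log 0 := 0)
        return -sum(p * math.log2(p) for p in probs if p > 0)

    degree = defaultdict(float)
    exit_weight = defaultdict(float)   # weight of links leaving each module
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if module_of[u] != module_of[v]:
            exit_weight[module_of[u]] += 1
            exit_weight[module_of[v]] += 1

    total_degree = sum(degree.values())                       # = 2 * number of links
    p = {a: degree[a] / total_degree for a in module_of}      # node visit rates
    q = {i: exit_weight[i] / total_degree
         for i in set(module_of.values())}                    # module exit rates
    q_total = sum(q.values())

    # Index codebook: which module the walker enters
    index_term = q_total * entropy([q[i] / q_total for i in q]) if q_total > 0 else 0.0

    # Module codebooks: the exit "word" plus the nodes inside each module
    module_nodes = defaultdict(list)
    for a, i in module_of.items():
        module_nodes[i].append(a)
    module_term = 0.0
    for i, nodes in module_nodes.items():
        p_circ = q[i] + sum(p[a] for a in nodes)   # total use of module i's codebook
        probs = [q[i] / p_circ] + [p[a] / p_circ for a in nodes]
        module_term += p_circ * entropy(probs)

    return index_term + module_term

# Toy example: two triangles joined by a single link
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
two_modules = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_module = {a: 0 for a in range(6)}
print(map_equation(edges, two_modules))  # shorter description length
print(map_equation(edges, one_module))   # longer description length

On this toy network the two-module partition yields a shorter description length than the one-module partition, which is exactly the signal that the numerical search over partitions exploits.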
(4) Recommendation selection

The last step involves selecting nodes for three types of recommendations: (1) Expert, (2) Classic and (3) Serendipity. Each of these types uses no specific information about the user7. However, the different recommendation types offer results that will be most useful to different kinds of users. The Expert recommendations are aimed at researchers familiar with a community of papers and authors. The algorithm aims to select papers that are highly specific to a particular sub-discipline. The Classic recommendations, on the other hand, are geared towards a graduate student or someone new to a specific field of science. The Classic papers are foundational articles in a field that every first-year graduate student in the field should read. Expert and Classic recommendation examples for the famous Kleinberg paper on "Authoritative sources in a hyperlinked environment" can be found in Table 2 and Table 3. The Serendipity recommendations are papers that are randomly chosen for every new user session. However, they are not randomly chosen from any part of the corpus. Rather, they are chosen from within the relevant community of papers as defined by the hierarchical network structure. The size distribution of communities for the different recommendations can be found in Figure 3.

7. Another version of the algorithm could utilize usage characteristics like a reader's bibtex file, reading list or viewing behavior. We are working with researchers at SSRN to build a personalized version of EFrec.
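To make the selection step concrete, the following minimal sketch assumes each paper already carries an ALEF score and a path in the hierarchical tree from step 3. The data structures, function name, and toy values (chosen to echo Figure 2) are ours; this only illustrates the selection logic described above.

import random

def recommend(seed, papers, scores, tree_path, n=10, kind="expert"):
    """Select recommendations for a seed paper from the hierarchical tree.

    papers    : iterable of paper IDs
    scores    : dict paper ID -> ALEF score
    tree_path : dict paper ID -> tuple of module IDs from the root of the
                hierarchical tree down to the paper's leaf module
    kind      : "expert"      -> papers in the same leaf module as the seed
                "classic"     -> the seed's leaf module plus its sibling
                                 modules (one level up the tree)
                "serendipity" -> random papers drawn from the classic pool
                                 (one reading of "the relevant community")
    """
    seed_path = tree_path[seed]
    if kind == "expert":
        prefix = seed_path              # full path: same leaf module
    else:
        prefix = seed_path[:-1]         # drop the last level: zoom out one level

    # Candidate papers share the chosen prefix of the seed's tree path
    pool = [p for p in papers
            if p != seed and tree_path[p][:len(prefix)] == prefix]

    if kind == "serendipity":
        return random.sample(pool, min(n, len(pool)))

    # Order candidates by ALEF score, descending, and take the top n
    return sorted(pool, key=lambda p: scores[p], reverse=True)[:n]

# Toy usage: six papers in two leaf modules under one parent module
papers = ["A", "B", "C", "D", "E", "F"]
scores = {"A": 12, "B": 99, "C": 33, "D": 87, "E": 76, "F": 53}
tree_path = {"A": (1, 1), "B": (1, 1), "C": (1, 1),
             "D": (1, 2), "E": (1, 2), "F": (1, 2)}
print(recommend("A", papers, scores, tree_path, n=2, kind="expert"))   # ['B', 'C']
print(recommend("A", papers, scores, tree_path, n=2, kind="classic"))  # ['B', 'D']

With the toy scores chosen to echo Figure 2, the Expert list stays within the seed's leaf module, while the Classic list pulls in the highest-scoring sibling papers.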
Fig. 2. Recommendation Closeup. A closeup of Fig. 1.4 shows how Classic and Expert recommendations are generated for a given seed paper (highlighted, score of 12). Expert recommendations are drawn from the same leaf node (sea foam green) as the seed paper, while Classic recommendations include sibling nodes as well (aqua). Papers are ordered by their Eigenfactor score, descending. The Classic recommendations provide a more diverse set of papers, while Expert recommendations are more specific.

Figure 2 shows an example of how Eigenfactor Recommends generates Classic and Expert recommendations. In this example the seed paper, or paper we are generating recommendations for, is highlighted in yellow and has an ALEF score of 12. First, we locate the leaf node the seed paper is in—colored sea foam green in Figure 2. Next we generate a list of candidate papers, excluding the seed paper. When generating Expert recommendations we use all of the papers in the seed paper's node (sea foam green), while Classic recommendations also use papers from sibling nodes (aqua). Finally, we order the papers according to their Eigenfactor scores (descending) and select the top-N papers.

In some unusual circumstances the Classic and Expert recommendations can be the same. If, for example, the top-3 papers in the seed node have a larger Eigenfactor score than all the papers in sibling nodes, the Classic algorithm would only select those papers, resulting in the same recommendations as the Expert algorithm provides. Conversely, it is possible that the Classic algorithm would not recommend any papers from the seed node if all papers in the seed node had unusually low Eigenfactor scores.

There are also situations where Eigenfactor Recommends cannot generate recommendations. Occasionally, InfoMap will place a paper into a node by itself, creating a singleton. For the AMiner set, there are approximately 80,000 singletons. Singletons normally arise when a paper has too few citations to be placed in a cluster, either due to errors in the dataset, lack of sufficient coverage (cited papers don't exist in this graph), or because the paper made very few citations. Since one cannot generate recommendations for singletons, we currently do not provide recommendations for these papers. However, there are several ways to deal with these papers. One way is to use cosine similarity [5] between citations from these papers and the citations from clusters in other parts of the tree. Papers could then be placed in clusters where they can receive recommendations.

TABLE 1
Babel Datasets. Recommendations were generated for the following datasets and are available at babel.eigenfactor.org.

PubMed    5,538,322    16,004,596    34,026,854
arXiv       626,441       781,108     5,624,262
DBLP        781,108     4,191,677     2,163,313
MAS      27,352,532   262,554,975   245,796,494

3 RESULTS

Recommendation Generation

We applied the EFrec algorithms to the AMiner dataset, generating recommendations for 1,218,504 papers, 58.2% of the dataset. The recommendations are available at http://babel.eigenfactor.org. We also provide the code8 for calculating these recommendations, which can be used on any citation network. In addition to AMiner, we generated recommendations for several other datasets including papers on the arXiv, Microsoft Academic Search (MAS), the Social Science Research Network (SSRN), JSTOR, PLoS, and PubMed Central. This includes recommendations for more than 35 million papers from more than 300 million citations. Table 1 shows the number of recommendations for each dataset9.

8. https://github.com/jevinw/ALEF/

9. There is overlap in these datasets. For example, certain papers are listed both in MAS and in PubMed Central. However, we have not collated the datasets. This is something we plan to do in the future.

Table 2 provides example Expert recommendations for the well-known Hubs and Authority paper by Jon Kleinberg [14]. "If you are reading the Hubs and Authority paper, you should also consider reading the Google-creating PageRank paper by Larry Page and Sergey Brin." The ALEF score for this paper is 455, which would put it second in this list. The recommendation list also includes the Hubs and Authority version that was presented at an ACM-SIAM symposium on discrete algorithms. Table 2 lists the top 10 Expert papers, but one could list the top N papers.

Table 3 provides the Classic recommendations for the "Hubs and Authority" paper. The recommendation has zoomed out to the broader topic of information retrieval. This includes books and seminal papers in information retrieval and textual analysis.

Properties of hierarchical tree

The right-skewed distribution that we see at the paper level (Figure 4) is also found at the cluster level. Figure 3 shows the number of papers found in the clusters for the Expert and Classic recommendations. For the Expert recommendations, there were 202,437 clusters with at least one paper. The average cluster size is 6.4 papers per cluster with a right-skewed distribution and a variance of 739 papers. The largest cluster includes 2,688 papers, with the top 10 papers in this group shown in Table 4.
TABLE 2
The top 10 Expert recommendations for the paper "Authoritative sources in a hyperlinked environment" [14]. The ALEF score for this paper is 455, which would put it second in this list.

TABLE 3
The top 10 Classic recommendations for the paper "Authoritative sources in a hyperlinked environment" [14]. The ALEF score for this paper is 455, which would put it 9th in this list.
Appropriately for the subject matter of this article, this cluster is about recommendation algorithms, with the top papers being about collaborative filtering. Because this is the largest cluster, the most recommended papers would be those listed in Table 4, if a recommendation was requested for every paper in the corpus. Since the Expert recommendation generates recommendations from the lowest part of the tree, it contains more, smaller clusters than the Classic recommendation, which produces recommendations by zooming out one level on the citation tree. There were 23,831 Classic clusters with at least one paper. There were 174 singletons (0.7% of all clusters) not included in Figure 3.

Singletons are clusters with only one paper. For Figure 3 we did not include singletons. There were 79,242 clusters (39% of all clusters) that were removed. The number of singletons generally decreases with increased citation density. Like most bibliographic datasets, the AMiner network is sparse, which partly explains why 39% of the Expert clusters are singletons. The papers in these singleton clusters tend to be sparsely connected to the larger groups within their discipline. If you only include clusters with more than one paper, the average cluster size increases to 9.8 papers per cluster. There also exist clusters at the highest level of the tree with singletons. Typically, the highest-level clusters as you move up the tree include tens of thousands of papers. There were 122 clusters in the tree that fit this criterion. These are clusters that include papers highly disconnected from the rest of the corpus. These papers receive no recommendations with our algorithm. Other techniques (e.g., semantic similarity) must be used for these kinds of papers.

Sparseness of citation graphs

The AMiner network, just like most citation databases, is sparse. The average out-degree (i.e., the number of citations from an AMiner paper to other AMiner papers) is 3.8 with a standard deviation of 7.4. The in-degree distribution (i.e., the number of citations to AMiner papers from AMiner papers) is right-skewed with a mean of 8.5 and a standard deviation of 34.9.

Although most readers will be familiar with the AMiner dataset provided for this issue, we want to take a moment to describe some attributes pertinent to citation-based recommenders. The most important factor for citation-based recommenders is the density of connections between nodes, that is, the number of citations a paper makes. For this dataset there are a large number of papers that make no citations: 1,233,562 of the total 2,092,356 papers (59%). Part of this is incompleteness in the data set. For example, the paper Rough computational methods for information systems
[Fig. 3: paired bar plots, panels EXPERT and CLASSIC; y-axis: number of clusters.]
Fig. 3. Cluster size distribution. The bar plots show the number of clusters with a given number of papers. For the Expert recommendation, clusters are extracted from the lowest part of the tree; therefore, there are more clusters with a smaller number of papers. For the Classic recommendation, the algorithm zooms out one level of the tree. This produces fewer clusters with more papers in each cluster. We did not include singletons – clusters with only one paper – in this figure. There were approximately 80k singletons in the AMiner tree.
TABLE 4
The top 10 papers for the largest Expert cluster. Germane to this paper, the articles in this group are mostly about recommendation research.
[Figs. 4 and 5: bar plots of Count versus Out Degree and Count versus In Degree.]
Fig. 4. AMiner paper's out-degree distribution. The bar plot shows the frequency of outbound citations made by papers in the AMiner dataset. The largest value, 0, accounts for 1,233,562 papers or 59% of the network.

Fig. 5. AMiner paper's in-degree distribution. The bar plot shows the frequency of inbound citations of papers in the AMiner dataset. Most papers are cited few times, while one paper is cited 8,166 times.
by Guan and Bell is listed as having no citations, but the original paper actually has 14 papers in its bibliography. This incompleteness in data is very common for datasets of scholarly articles, and it represents one of the greatest challenges to large-scale deployment of citation-based recommenders. It is interesting to note that this problem is worse in scholarly literature than on the web. The web benefits substantially from a standardized link format and easy disambiguation between sites. Scholarly literature is inconsistent in its use of bibliographies, many of which are encoded in textual formats and have no semantic markup to indicate what is a citation and what isn't. Even worse, the textual formats available vary widely, from convoluted PDFs and TeX files to proprietary formats that cannot be reliably parsed. Figures 4 and 5 show the citation counts for the entire dataset.

Recommender systems can also be evaluated by the coverage [38] they provide. This metric can be measured in various ways; however, it is commonly described as the percentage of items for which recommendations can be formed. By this metric EFrec performs fairly well, with recommendations available for 58.2% of the AMiner dataset (1,218,504/2,092,356). Co-citation, by contrast, is only able to generate recommendations for 7.79% (163,051/2,092,356) of the corpus.

Recommendation Overlap

Another metric that can be used to compare various recommendation algorithms is how much overlap there is in the recommendations they generate. There are several reasons this metric is useful. First, if you are claiming to produce better recommendations than the competition, they would have to also be different recommendations. If Eigenfactor and co-citation have substantial overlap, it would be unlikely that they generated materially different recommendations. Second, there is a lot of discussion around hybrid or ensemble recommenders. The intuition behind this idea is that recommenders all have strengths and weaknesses, but if we combine many different recommenders we can generate better overall recommendations in more situations. Therefore, one would want recommenders that are "orthogonal" to each other; they should produce substantially different recommendations for the same paper.

We compared the overlap of the Eigenfactor Recommends (EFrec) results to the well-known co-citation recommendation method. To do this we generated recommendations for a set of papers, took the top-n from each algorithm, calculated the intersection and divided by n. The equation below describes this process, where N is the set of papers to generate recommendations for, O_i is the overlap score for the two recommendation lists, C_{i,n} is the top-n recommendations generated by the co-citation method for paper i, R_{i,n} is the top-n recommendations generated by EFrec for paper i, and n is the maximum number of recommendations. If both C_{i,n} and R_{i,n} were empty the paper was thrown out.

    \forall i \in N, \quad O_i = \frac{|C_{i,n} \cap R_{i,n}|}{n}    (1)
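A minimal sketch of this overlap computation, assuming each recommender is given as a mapping from paper ID to its top-n list (the names below are ours), is:

def overlap_scores(cocitation_recs, efrec_recs, n):
    """Per-paper overlap between two recommenders' top-n lists (Eq. 1).

    cocitation_recs, efrec_recs : dict paper ID -> list of recommended IDs
    Papers for which both lists are empty (or missing) are skipped.
    Returns a dict paper ID -> overlap score in [0, 1].
    """
    scores = {}
    for paper in set(cocitation_recs) | set(efrec_recs):
        c = set(cocitation_recs.get(paper, [])[:n])
        r = set(efrec_recs.get(paper, [])[:n])
        if not c and not r:
            continue                      # both empty: throw the paper out
        scores[paper] = len(c & r) / n    # |C intersect R| / n
    return scores

# Toy usage with n = 3
cocite = {"p1": ["a", "b", "c"], "p2": ["d", "e"]}
efrec = {"p1": ["a", "x", "y"], "p2": []}
print(overlap_scores(cocite, efrec, n=3))  # {'p1': 0.333..., 'p2': 0.0}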
When running over the entire AMiner data set (N = AMiner) we compared the recommendations for 1,218,504 papers, with an average overlap of 0.011845 and a standard deviation of 0.064401. One potential issue here is the limited number of recommendations the co-citation method can generate for this dataset; our implementation only generated recommendations for 163,051 papers, 7.79% of the entire corpus. To further validate our overlap findings we ran the experiment again, this time only using papers that the co-citation method had generated a recommendation for as the input set (N). With this smaller dataset we compared recommendations for 163,051 papers, finding an average overlap of 0.088522 and a standard deviation of 0.155588.

DISCUSSION

In this paper, we describe a simple method for recommending scholarly papers at different scales. To do this, we (1) assemble a citation graph, (2) rank the nodes according to a modified form of PageRank, (3) cluster the network hierarchically using the MapEquation framework and (4) then select recommendations for a seed paper given its location in the hierarchical tree. For the last step in the algorithm (step 4), we describe three approaches for selecting articles in this tree at different scales of relevance, which we dub Expert, Classic, and Serendipity. However, one could deploy many other variants. For example, one may select articles within a cluster based on semantic similarity or co-authorship. One could also zoom out multiple levels depending on the depth of a discipline. The authors of this paper have developed similar network-based ranking methods at the author level that could be applied to this ranking system [17].

We found that EFrec performed as well as the widely-used co-citation method on papers that both methods can rank, but EFrec provides far more recommendations (more than twice as many for both Expert and Classic recommendations), and it provides recommendations at different levels of granularity (Expert and Classic). More experiments are needed to further analyze and compare this approach to other citation-based approaches.

We performed some preliminary analysis on live users of the Social Science Research Network (SSRN) to evaluate EFrec's performance against other recommendation algorithms. Using an A-B testing environment developed by SSRN [37], we compared EFrec Classic and Expert to co-citations and co-downloads. Users were randomly assigned one of the four recommenders with equal probability whenever they viewed an article page. As Table 5 shows, Eigenfactor Expert and co-citation had very similar click-through rates (CTR): 0.24% and 0.26% respectively. Eigenfactor Classic's CTR was half co-citation's, at 0.13%. Though Eigenfactor did not outperform co-citation in this experiment, it does provide substantially greater coverage with comparable recommendation performance. Co-citation could only generate recommendations for 94,043 of the 426,412 papers (22%), while EFrec Expert and Classic could provide recommendations for 215,627 (51%) and 227,049 (53%) of papers respectively.

We also had usage data for the current production SSRN recommender: co-downloads. Co-download is a collaborative filtering method that tracks user downloads to generate recommendations.
TABLE 5

Co-Download†         N/A       N/A   272,760  22.45%  1,882  51.63%  556  56.22%  0.69%  29.54%  0.20%
Eigenfactor Classic  227,049   53%   462,653  38.08%    595  16.32%  167  16.89%  0.13%  28.07%  0.04%
Eigenfactor Expert   215,627   51%   410,622  33.80%    988  27.11%  209  21.13%  0.24%  21.15%  0.05%
It is important to keep in mind that co-downloads and citation-based methods are not measuring the same thing: co-downloads is a measure of popularity, citation-based methods of importance/impact. The difference between citation-based methods and collaborative filtering is illustrated by visiting a product page on amazon.com. For most products Amazon provides at least two sets of recommendations: "Customers Who Bought This Item Also Bought" and "Compare to Similar Items". Collaborative filtering methods (like co-download) are the "Customers Who Bought This Item Also Bought" style of recommendations, though for co-download it would more accurately be titled "Users Who Read This Paper Also Read". This is also a measure of popularity, recommending the most downloaded items. EFrec is akin to the "Compare to Similar Items" style: we find similar papers based on the latent hierarchical structure encoded in citation networks. In addition, we note that co-download approaches require detailed information about the browsing and searching activities of each and every individual user. The SSRN platform, with whom we conducted these tests, has this information because user login is required to use their system, but this is the exception rather than the rule among platforms that would benefit from deploying a recommendation system. As a result, a citation-based method may be of great value even if it performs modestly in head-to-head comparisons with co-download approaches.

Although we do not have the exact coverage numbers for co-download, the lower appearance count implies that it is not able to generate recommendations for as many papers as EFrec. However, those recommendations co-download does generate are quite effective, with a 0.69% CTR compared to EFrec Expert's 0.24%. One possible explanation for this impressive performance is that co-download can quickly begin recommending new papers.

There are advantages and disadvantages in using citations as the primary substrate for recommending papers. One clear advantage is the insularity of this method to textual noise; it simply looks at the connections among papers and ignores semantic information. The major disadvantage is due to the permanence of citations (and the lack of semantic information). It takes considerable time to accumulate enough citations to obtain signals of influence and community structure. For some fields, such as economics, ecology and mathematics, the citation lags are much longer [16]. When compared to usage data, citations tend to better locate field-starting, classic papers, while usage-based methods locate recent, popular papers. Given the somewhat complementary nature of co-downloads and EFrec, we plan to integrate both into a hybrid recommender. Ideally, this will combine the strong impact measure and greater coverage of EFrec, while adapting quickly to changing fields and new papers.

One of the strengths of the method described in this paper is the simplicity of the algorithm, conceptually and computationally. It is easy to describe, build upon and compute. In this paper, we simply sort the articles within the clusters using ALEF, based on a modified version of PageRank, and aggregate the clusters at different levels (Expert and Classic). Computationally, the method is relatively fast. We can run the entire AMiner dataset of 2.1 million papers and 8 million citations on a standard desktop machine (2.6 GHz Intel) on one core in less than 30 minutes. The ranking step (2 from Figure 1) and the recommendation step (4 from Figure 1) are especially fast. During the recent 2016 WSDM Cup Challenge we ran the ranking step (Figure 1.2) on the Microsoft Academic Search citation graph, which contained over 49 million papers and 949 million citations, in approximately 30 minutes [28]. The bottleneck is the clustering step (3), but we are continuing to improve the speed of this step by parallelizing the code for multi-core machines and distributed systems [39]. Current details and improvements regarding the computational time can be found at babel.eigenfactor.org.

We would like to close the discussion with a short note about this "Special Issue on Big Scholar Data Discovery and Collaboration." Given the growth of science there is a strong need for better tools for navigating the literature, yet most recommendation research has been developed for commercial uses (Netflix, Amazon, etc.). We hope that scientists will spend more time helping one another discover important, relevant papers and authors. A major reason for this lack of recommender development hasn't been a lack of interest in developing better algorithms and tools, but rather
the difficulty of accessing content. There has been some progress in this area, such as the recent release of the Microsoft Academic Search citation graph [40] and the AMiner graph for this special issue. Our hope is that publishers will continue to work more with researchers in this area so that development continues beyond this special issue.

4 CONCLUSION

In this paper, we describe a simple citation-based method for recommending articles. The method is based on the hierarchical structure of scientific knowledge, allowing for different scales of influence.

The method scales well for large citation networks and is made freely available at babel.eigenfactor.org. We have generated hundreds of millions of recommendations for tens of millions of scholarly articles, and we plan to continually generate more as bibliographic data becomes available.

In this paper, we compare EFrec to other recommendation methods, including citation-based and usage-based ones, on live users of the Social Science Research Network (SSRN). When compared to another citation-based method (co-citation), EFrec produced comparable recommendations as measured by click-through rate, but with substantially higher recommendation coverage: 52% vs 22% of the corpus. However, when EFrec is compared to a collaborative filtering based method (co-downloads), we found that EFrec has a substantially lower click-through rate: 0.24% vs 0.69%. However, since these methods employ different data sources, we see them as being more complementary than in competition.

Finally, we see this as only one way of using the ranking and clustering aspects of the EFrec method. There are many variations and extensions that can be built on the general framework. We hope the method, code, and recommendations will provide fodder for further study in scholarly recommendation.

ACKNOWLEDGMENTS

This work was supported by the Metaknowledge Network, funded by the John Templeton Foundation. We would like to thank Martin Rosvall and Daril Vilhena for early discussions around the topic of recommendation. We would also like to thank Ralph Dandrea, Anatoly Chukhin and Fernando Agostino at SSRN for helpful discussions and for providing data for the preliminary usage experiments that are being developed for a future paper.

REFERENCES

[1] D. J. de Solla Price, "Networks of scientific papers," Science, vol. 169, pp. 510–515, 1965.
[2] L. Bornmann and R. Mutz, "Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references," Journal of the Association for Information Science and Technology, 2015.
[3] A. E. Jinha, "Article 50 million: an estimate of the number of scholarly articles in existence," Learned Publishing, vol. 23, no. 3, pp. 258–263, 2010.
[4] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, "Arnetminer: Extraction and mining of academic social networks," in KDD'08, 2008, pp. 990–998.
[5] P. B. Kantor, L. Rokach, F. Ricci, and B. Shapira, Recommender Systems Handbook. Springer, 2011.
[6] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, pp. 42–49, 2009. [Online]. Available: https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf
[7] L. Lü, M. Medo, C. H. Yeung, Y.-C. Zhang, Z.-K. Zhang, and T. Zhou, "Recommender systems," p. 97, Feb. 2012. [Online]. Available: http://arxiv.org/abs/1202.1112
[8] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock, "Methods and metrics for cold-start recommendations," in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2002, pp. 253–260.
[9] X. Su and T. M. Khoshgoftaar, "A survey of collaborative filtering techniques," Advances in Artificial Intelligence, vol. 2009, pp. 1–19, 2009. [Online]. Available: http://www.hindawi.com/journals/aai/2009/421425/
[10] O. Küçüktunç, E. Saule, K. Kaya, and U. Çatalyürek, "Direction awareness in citation recommendation," pp. 3–8, 2012. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.301.1927
[11] M. M. Kessler, "Bibliographic coupling between scientific papers," American Documentation, vol. 14, no. 1, p. 10, 1963.
[12] H. Small, "Co-citation in scientific literature: A new measure of the relationship between two documents," Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.
[13] H. White and B. Griffith, "Author cocitation: A literature measure of intellectual structure," Journal of the American Society for Information Science, vol. 32, pp. 163–171, 1981.
[14] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM, vol. 46, no. 5, pp. 604–632, 1999.
[15] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," in Seventh International World-Wide Web Conference (WWW 1998), 1998. [Online]. Available: http://ilpubs.stanford.edu:8090/361/
[16] J. West, T. Bergstrom, and C. Bergstrom, "The Eigenfactor metrics: A network approach to assessing scholarly journals," College and Research Libraries, vol. 71, no. 3, pp. 236–244, 2010.
[17] J. D. West, M. C. Jensen, R. J. Dandrea, G. J. Gordon, and C. T. Bergstrom, "Author-level Eigenfactor metrics: Evaluating the influence of authors, institutions, and countries within the social science research network community," Journal of the American Society for Information Science and Technology, vol. 64, no. 4, pp. 787–801, Apr. 2013. [Online]. Available: http://doi.wiley.com/10.1002/asi.22790
[18] J. Bollen, M. A. Rodriquez, and H. Van de Sompel, "Journal status," Scientometrics, vol. 69, no. 3, pp. 669–687, 2006.
[19] S. J. Liebowitz and J. P. Palmer, "Assessing the relative impacts of economics journals," Journal of Economic Literature, pp. 77–88, 1984.
[20] G. Pinski and F. Narin, "Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics," Information Processing & Management, vol. 12, no. 5, pp. 297–312, 1976.
[21] T. H. Haveliwala, "Topic-sensitive PageRank," in Proceedings of the 11th International Conference on World Wide Web. ACM, 2002, pp. 517–526.
[22] M. Gori and A. Pucci, "Research paper recommender systems: A random-walk based approach," in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI'06), 2007, pp. 778–781.
[23] M. Rosvall and C. T. Bergstrom, "Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems," PLoS ONE, vol. 6, no. 4, p. e18209, 2011.
[24] J. D. West, J. Jacquet, M. M. King, S. J. Correll, and C. T. Bergstrom, "The role of gender in scholarly authorship," PLoS ONE, vol. 8, no. 7, p. e66212, 2013.
[25] J. West, M. Rosvall, and C. Bergstrom, "Ranking and mapping article-level citation networks," in preparation, 2016.
[26] D. Walker, H. Xie, K.-K. Yan, and S. Maslov, "Ranking scientific publications using a model of network traffic," Journal of Statistical Mechanics: Theory and Experiment, vol. 2007, no. 06, p. P06010, 2007.
[27] S. Maslov and S. Redner, "Promise and pitfalls of extending Google's PageRank algorithm to citation networks," The Journal of Neuroscience, vol. 28, no. 44, pp. 11103–11105, 2008.
[28] I. Wesley-Smith, C. T. Bergstrom, and J. D. West, "Static ranking of scholarly papers using article-level Eigenfactor (ALEF)," in 9th ACM International Conference on Web Search and Data Mining. ACM, in press.
[29] K. R. Larsen, D. Hovorka, J. West, J. Birt, J. R. Pfaff, T. W. Chambers, Z. R. Sampedro, N. Zager, and B. Vanstone, "Theory identity: A machine-learning approach," in System Sciences (HICSS), 2014 47th Hawaii International Conference on. IEEE, 2014, pp. 4639–4648.
[30] R. Lambiotte and M. Rosvall, "Ranking and clustering of nodes in networks with smart teleportation," Physical Review E, vol. 85, no. 5, p. 056107, 2012.
[31] M. Rosvall and C. T. Bergstrom, "Maps of random walks on complex networks reveal community structure," Proceedings of the National Academy of Sciences, vol. 105, no. 4, pp. 1118–1123, 2008.
[32] M. Rosvall, D. Axelsson, and C. T. Bergstrom, "The map equation," European Physical Journal: Special Topics, vol. 178, no. 1, pp. 13–23, 2009.
[33] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[34] A. Lancichinetti and S. Fortunato, "Community detection algorithms: a comparative analysis," Physical Review E, vol. 80, no. 5, p. 056117, 2009.
[35] G. K. Orman, V. Labatut, and H. Cherifi, "On accuracy of community structure discovery algorithms," arXiv preprint arXiv:1112.4134, 2011.
[36] J. I. Perotti, C. J. Tessone, and G. Caldarelli, "Hierarchical mutual information for the comparison of hierarchical community structures in complex networks," Physical Review E, vol. 92, no. 6, p. 062825, 2015.
[37] I. Wesley-Smith, R. Dandrea, and J. West, "An experimental platform for scholarly article recommendation," in Proc. of the 2nd Workshop on Bibliometric-enhanced Information Retrieval (BIR2015), 2015, pp. 30–39.
[38] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl, "Evaluating collaborative filtering recommender systems," ACM Transactions on Information Systems, vol. 22, no. 1, pp. 5–53, 2004.
[39] S.-H. Bae, D. Halperin, J. West, M. Rosvall, and B. Howe, "Scalable flow-based community detection for large-scale network analysis," in Data Mining Workshops (ICDMW), 2013 IEEE 13th International Conference on. IEEE, 2013, pp. 303–310.
[40] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu, and K. Wang, "An overview of Microsoft Academic Service (MAS) and applications," in Proceedings of the 24th International Conference on World Wide Web, WWW '15 Companion. International World Wide Web Conferences Steering Committee, 2015, pp. 243–246. [Online]. Available: http://dx.doi.org/10.1145/2740908.2742839

AUTHOR BIOGRAPHIES

Ian Wesley-Smith is a PhD student at the University of Washington Information School and a member of the DataLab. His research interests include machine learning, network analysis, recommender systems, distributed computing and cryptography. He is currently working on Babel, a platform to develop, distribute and measure scholarly article recommendations, all as a web service, freely available to everyone. Prior to joining the DataLab, Ian was a developer on Amazon Web Services and spent several years as an engineer on Windows Security. He holds a BS in Computer Science from Louisiana State University.

Carl T. Bergstrom is a Professor in the Department of Biology at the University of Washington. He is a pioneer in applying the methods of network science to citation networks. Together with author Jevin West, he developed and launched the Eigenfactor Metrics, a system for ranking scholarly journals using the full network of citations. Eigenfactor has since become widely adopted as an industry standard. Dr. Bergstrom is also the co-inventor of the InfoMap algorithm, a leading method for mapping the community structure of large networked systems. In addition to his work in scholarly communication, Dr. Bergstrom is the coauthor, with Lee Alan Dugatkin, of a leading evolutionary biology textbook, Evolution [more details here: octavia.zoology.washington.edu].