1. Introduction
Cognitive science, namely the study of the mind and its processes [1,2,3], has recently gained significant momentum, which can be attributed to a number of reasons. It is a major driver of the big data age along with online social networks, the semantic web and computational systems theory, to name a few. Recent sociological and demographic studies conducted across much of the Western world, including the EU [4] and the U.S. [5], reveal that one of the biggest challenges of the next decade will be the healthcare costs associated with cognitive issues. Similar trends at the planetary scale can be found in reports compiled by the UN Population Division [6,7,8]. Thus, the systematic analysis of the cognitive science literature is of immediate interest not only to researchers, but also to healthcare planners, government agencies, hospital administrators, insurance companies, equipment manufacturers and software developers.
With the creation of PubMed (http://www.ncbi.nlm.nih.gov/pubmed) in 1996, the largest public online database under the ultimate administrative oversight of the NIH, a massive collection spanning millions of life science articles became available to researchers. Over the past two decades, it has been substantially enriched and currently contains more than fourteen million abstracts, while it accepts and serves more than seventy million queries per month, averaging six terms each. Indicative of its enormous topic diversity is the fact that the searchable keywords of Entrez (French for "enter"), the PubMed-coupled indexing engine, currently exceed two million. Two issues arising in any information retrieval context, especially in a major digital repository of humanistic data, are precision and recall. The former pertains to the fraction of the retrieved studies, articles and data tables, stored across a variety of online archives and libraries, that are relevant to the query, whereas the latter refers to the fraction of the relevant documents that is retrieved for a given query [9].
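As a concrete illustration of these two metrics, the following minimal Python sketch computes precision and recall for a single query from a set of retrieved document identifiers and a set of relevant ones; the identifiers and sets are hypothetical and serve only to make the definitions explicit.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical PubMed identifiers for one query.
retrieved = {"PMID1", "PMID2", "PMID3", "PMID4"}
relevant = {"PMID2", "PMID3", "PMID7"}
print(precision_recall(retrieved, relevant))  # (0.5, 0.666...)
```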
PubMed cognitive science-related documents range over such diverse types as academic articles, software blueprints, demographic surveys, healthcare personnel training manuals, best practice guides and clinical data reports. It is evident that in such a collection, large in both size and variety, information retrieval and text mining techniques must be applied if any meaningful piece of information is to be extracted. This is especially true given that medical and scientific articles exhibit a high degree of textual structure, including metadata like abstracts and keywords.
The primary contribution of this article is a topic-based clustering strategy for cognitive science documents whose core is a third order keyword-term-document tensor. The latter is one of the many possible direct generalizations of the established term-document matrix, which is essentially a second order tensor from a linear algebraic point of view. As such, this tensor can be queried in a similar manner. The reason for selecting this particular scheme over a number of other possible ones stems from the intuition, corroborated in part by empirical scientometric evidence, that keywords are semantically more important than ordinary terms [10,11].
This journal article is structured as follows. Section 2 summarizes recent work on medical information retrieval and tensor analysis. Section 3 describes the software tools used, and Section 4 outlines the proposed tensor model. Furthermore, Section 5 discusses the performance of the proposed and the baseline method on the TREC (Text REtrieval Conference) genomics dataset, while Section 6 applies the proposed tensor analytics to a custom PubMed dataset. Finally, Section 7 explores future research directions. Tensors are denoted by uppercase calligraphic letters, such as $\mathcal{T}$, matrices by uppercase boldface letters, like $\mathbf{A}$, and vectors by lowercase boldface letters or numbers, like $\mathbf{v}$ and $\mathbf{0}$. Table 1 summarizes the article notation.
2. Previous Work
Document clustering has gained much interest in biomedicine [12]. PubMed abstracts are clustered with frequent words and near terms in [13]. A graph algorithm based on flow simulation is considered in [14], while more advanced techniques are proposed in [15].
Biomedical ontologies in conjunction with the mining of biomedical texts led to the technique of word sense disambiguation (WSD), which helps to map documents to different topics. Ontologies and metadata assist the clustering algorithms in [16]. Event-based text mining systems for biomedicine, together with an annotation scheme, are the focus of [17]. On the other hand, domain-specific information extraction systems targeting event-level information with automatic causality recognition are proposed in [18]. Human gene ontologies are described in [19,20]. U-Compare, an integrated text mining and NLP system based on the Unstructured Information Management Architecture (UIMA: http://uima.apache.org/), is presented in [21].
Using the MeSH ontology for biomedical document clustering is popular in the scientific literature [22,23,24,25]. Various clustering approaches, such as suffix tree clustering, were supplemented with ontological information in [26], whereas the accuracy of similarity metrics is discussed in [27]. A knowledge domain scheme based on bipartite graphs with MeSH is presented in [28]. Two serious limitations faced by approaches relying on the MeSH thesaurus are identified in [29].
Tensor algebra [30,31] and the closely-associated field of multilayer graphs [32,33] are some of the primary algorithmic tools for dealing with higher order data, along with higher order statistics [34,35] and multivariate polynomials [36,37]. A central place in tensor algebra is held by the Tucker and Kruskal tensor forms [38], which allow alternative tensor representations appropriate for certain linear algebraic operations such as tensor-matrix multiplication, tensor compression [39], tensor regularization and factor discovery. Models for tensor data mining have been outlined in [40]. A very recent work combining tensors and semantics for medical information retrieval is [41].
4. Proposed Model
4.1. Representation
Definition 1. A p-th order tensor $\mathcal{T}$ is a p-dimensional array indexed by p integers and coupling simultaneously at most p distinct linear spaces denoted by $\mathbb{F}_k$, $1 \le k \le p$. When each of the p linear spaces is $\mathbb{R}^{n_k}$, then $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_p}$ as a shorthand.
It is obvious from the way tensors are defined that they are direct generalizations of matrices. Indeed, a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is the linear algebraic vehicle for coupling the row space $\mathbb{R}^{n}$ and the column space $\mathbb{R}^{m}$. Of course, for square and invertible matrices, the row and column spaces coincide.
Tensor $\mathcal{T}$, which will contain properly-defined values for the keyword-term-document triplets, is populated by $n_w$ keywords and $n_t$ terms stored in $n_d$ documents, making it a third order tensor $\mathcal{T} \in \mathbb{R}^{n_w \times n_t \times n_d}$. Note that the proposed $\mathcal{T}$ is but one of the ways of extending the established term-document model, which is based on a second order tensor, namely a matrix $\mathbf{A} \in \mathbb{R}^{n_t \times n_d}$. For instance, in [47], a term-author-document tensor is proposed. The latter is based on empirical scientometric evidence in favor of the semantic role authors play in the process of information retrieval [10,11]. A common point with the proposed model, besides both relying on third order tensors, is that they are inspired by OLAP (Online Analytical Processing) cubes [48]. Regarding the tensor dimensions, it should be noted that, although three dimensions are easy to visualize and handle, they are by no means a golden rule.
As stated earlier, $\mathcal{T}$, essentially the algebraic cornerstone of the proposed technique, is a third order, real valued tensor simultaneously coupling the keyword, term and document spaces. The entries of $\mathcal{T}$ are associated with the document retrieval process. Concretely, let $w_i$, $t_j$ and $d_k$ respectively denote the $i$-th keyword, the $j$-th term and the $k$-th document, where $1 \le i \le n_w$, $1 \le j \le n_t$ and $1 \le k \le n_d$. Moreover, let $g(w_i, d_k)$ and $g(t_j, d_k)$ respectively be the number of occurrences of $w_i$ and $t_j$ in $d_k$. Then, $\mathcal{T}[i,j,k]$ contains the normalized occurrences of $w_i$ and $t_j$ in $d_k$ according to the following four factor double tf-idf (term frequency-inverse document frequency) scheme:

$$\mathcal{T}[i,j,k] \;=\; \operatorname{tf}(w_i, d_k)\,\operatorname{idf}(w_i)\,\operatorname{tf}(t_j, d_k)\,\operatorname{idf}(t_j) \tag{1}$$

In Equation (1), the first pair of factors $\operatorname{tf}(t_j, d_k)$ and $\operatorname{idf}(t_j)$ forms a standard tf-idf scheme based only on terms and documents:

$$\operatorname{tf}(t_j, d_k) \;=\; \frac{g(t_j, d_k)}{\max_{j'} g(t_{j'}, d_k)}, \qquad \operatorname{idf}(t_j) \;=\; \log \frac{n_d}{\left| \{ d_k : g(t_j, d_k) > 0 \} \right|}$$

while the second pair of factors $\operatorname{tf}(w_i, d_k)$ and $\operatorname{idf}(w_i)$ constitutes the second tf-idf scheme, defined analogously over the keywords:

$$\operatorname{tf}(w_i, d_k) \;=\; \frac{g(w_i, d_k)}{\max_{i'} g(w_{i'}, d_k)}, \qquad \operatorname{idf}(w_i) \;=\; \log \frac{n_d}{\left| \{ d_k : g(w_i, d_k) > 0 \} \right|}$$
4.2. Analytics
Tensor density, similarly to sparsity for large matrices, is a significant metric which, besides indicating compression potential, may reveal interesting patterns along many dimensions, and its definition is straightforward.
Definition 2. The density $\rho$ of a tensor $\mathcal{T}$ is defined as the ratio of the number of its non-zero elements to its total number of elements, the latter being easily found by multiplying the sizes of its dimensions. Thus:

$$\rho(\mathcal{T}) \;=\; \frac{\operatorname{nnz}(\mathcal{T})}{\prod_{k=1}^{p} n_k}$$

Definition 3. Along similar lines, the log density $\rho_0$ of $\mathcal{T}$ is defined as the ratio of the logarithm of the number of its non-zero elements to the logarithm of its total number of elements, essentially being the ratio of the magnitudes of the respective numbers:

$$\rho_0(\mathcal{T}) \;=\; \frac{\log \operatorname{nnz}(\mathcal{T})}{\log \prod_{k=1}^{p} n_k}$$

Besides its natural interpretation, $\rho_0$ usually leads to larger values, which in turn result in numerically stable computations when it appears in the denominators of formulae.
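A minimal numpy sketch of these two density measures, assuming any real tensor stored as a numpy array (such as the toy tensor from the previous snippet), might read as follows.

```python
import numpy as np

def density(T: np.ndarray) -> float:
    # Ratio of non-zero entries to the total number of entries.
    return np.count_nonzero(T) / T.size

def log_density(T: np.ndarray) -> float:
    # Ratio of the logarithms of the same two quantities.
    return np.log(np.count_nonzero(T)) / np.log(T.size)

T = np.zeros((4, 5, 6))
T[0, 1, 2] = 3.0
T[2, 3, 4] = 1.5
print(density(T), log_density(T))  # 2/120 and log(2)/log(120)
```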
The Frobenius norm of a tensor $\mathcal{T}$, denoted by $\|\mathcal{T}\|_F$, is an algebraic indicator of the overall strength of the tensor entries, which is indirectly tied to compressibility. Recall that the Frobenius norm of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is defined as:

$$\|\mathbf{A}\|_F \;=\; \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} \mathbf{A}[i,j]^2}$$
Both metrics are related in their own way to compression potential, which is critical given the large volume of data typically held in tensors. The former (density) plays the same role as matrix sparsity, whereas the latter (the Frobenius norm) indicates whether there are strong or weak connections between keywords, terms and documents. Since both provide a data summary in the form of a single scalar, they give quick, overall information regarding the tensor status at the expense of aggregating information from every dimension into this single value. Thus, both metrics can be used as building blocks for composite ones, which examine each dimension separately.
Definition 4. The Frobenius norm of a p-th order tensor $\mathcal{T}$ is the square root of the sum, along every dimension, of its elements squared:

$$\|\mathcal{T}\|_F \;=\; \sqrt{\sum_{i_1=1}^{n_1} \cdots \sum_{i_p=1}^{n_p} \mathcal{T}[i_1, \ldots, i_p]^2}$$

Generally, there is no consensus as to which values of $\|\mathcal{T}\|_F$ indicate strong connections on average. In order to derive bounds, probabilistic techniques can be employed by treating the elements of $\mathcal{T}$ as being drawn from a distribution. One way is to observe that $\|\mathcal{T}\|_F^2 / \prod_{k=1}^{p} n_k$ is the sample approximation of the second moment $\operatorname{E}[X^2]$ of that distribution. Then, since the Frobenius norm is always positive, as all-zero tensors are not under consideration, the Markov inequality for a non-negative random variable $X$ and any $\gamma > 0$:

$$\operatorname{Prob}\left[ X \ge \gamma \right] \;\le\; \frac{\operatorname{E}[X]}{\gamma}$$

can be used to derive a bound. For instance, if the elements of $\mathcal{T}$ are drawn from a Gaussian distribution, then $\|\mathcal{T}\|_F^2$ follows a noncentral chi square distribution.
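As an illustration, the tensor Frobenius norm and the corresponding per-element second moment estimate can be computed in a few numpy lines; the Gaussian tensor below is only a synthetic stand-in for an actual keyword-term-document tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(loc=0.2, scale=1.0, size=(40, 200, 100))  # synthetic stand-in tensor

frob = np.sqrt(np.sum(T ** 2))          # Frobenius norm of the tensor
second_moment = frob ** 2 / T.size      # sample estimate of E[X^2]
print(frob, second_moment)              # second_moment ~ 0.2**2 + 1.0**2 = 1.04
```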
4.3. Metric Fusion
Again, in order to take into account information about each dimension separately for a third order tensor, it suffices to fix the last index and create a metric $\rho^*$ that takes into consideration the density of each resulting matrix separately. Thus, if the density of each separate matrix $\mathcal{T}[:,:,k]$ for a fixed value of $k$ is defined as:

$$\rho_k \;=\; \frac{\operatorname{nnz}\left( \mathcal{T}[:,:,k] \right)}{n_w\, n_t}, \qquad 1 \le k \le n_d$$

then the set of tensor densities along the third dimension $\{\rho_k\}_{k=1}^{n_d}$ can be used to build the following aggregative metric:

$$\rho^* \;=\; \left( \frac{1}{n_d} \sum_{k=1}^{n_d} \frac{1}{\rho_k} \right)^{-1}$$

The harmonic mean ensures that $\rho^*$ will tend to be close to the smallest of the $\rho_k$.
For a third order tensor $\mathcal{T}$, a related metric can be constructed by first fixing one of the three indices, treating the remaining two dimensions as a sequence of matrices, computing the Frobenius norm of each such matrix and taking the harmonic mean of these norms. Deciding which index is to be fixed is important, as it essentially determines a tensor partitioning. For the purposes of this article, the last index is fixed, creating thus a metric $\phi^*$ which ranges over the documents:

$$\phi^* \;=\; \left( \frac{1}{n_d} \sum_{k=1}^{n_d} \frac{1}{\left\| \mathcal{T}[:,:,k] \right\|_F} \right)^{-1} \tag{11}$$

Notice that in (11), $\mathcal{T}[:,:,k]$ denotes the matrix created by fixing the last index to $k$ while the two remaining indices stay unaltered, creating thus an $n_w \times n_t$ matrix. If $n_d$ is large, which may well be the case for document collections, then it would also make sense to compute statistical measures over these per-document norms, such as sample versions of the variance, skewness and kurtosis.
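A small numpy sketch of both fused metrics, using harmonic means over per-document slices, could look like the following; the slice order (documents along the last axis) matches the convention above, while the guard against zero-valued slices is an added safeguard not present in the definitions.

```python
import numpy as np

def harmonic_mean(values: np.ndarray) -> float:
    # Harmonic mean; dominated by the smallest entries.
    return len(values) / np.sum(1.0 / values)

def fused_density(T: np.ndarray) -> float:
    # Harmonic mean of the densities of the per-document slices T[:, :, k].
    slices = [T[:, :, k] for k in range(T.shape[2])]
    rho = np.array([max(np.count_nonzero(S) / S.size, 1e-12) for S in slices])
    return harmonic_mean(rho)

def fused_frobenius(T: np.ndarray) -> float:
    # Harmonic mean of the Frobenius norms of the per-document slices.
    norms = np.array([max(np.linalg.norm(T[:, :, k]), 1e-12)
                      for k in range(T.shape[2])])
    return harmonic_mean(norms)

rng = np.random.default_rng(1)
T = rng.random((10, 30, 5)) * (rng.random((10, 30, 5)) > 0.7)  # sparse toy tensor
print(fused_density(T), fused_frobenius(T))
```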
4.4. Tensor Tucker Form
Definition 5. The multiplication along the $k$-th dimension between a $p$-th order tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times \cdots \times n_p}$ and a vector $\mathbf{u} \in \mathbb{R}^{n_k}$ is a $(p-1)$-th order tensor with elements:

$$\left( \mathcal{T} \times_k \mathbf{u} \right)[i_1, \ldots, i_{k-1}, i_{k+1}, \ldots, i_p] \;=\; \sum_{i_k=1}^{n_k} \mathcal{T}[i_1, \ldots, i_p]\, \mathbf{u}[i_k]$$

Definition 6. The multiplication along the $k$-th dimension between a $p$-th order tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times \cdots \times n_p}$ and a matrix $\mathbf{U} \in \mathbb{R}^{m \times n_k}$ is a $p$-th order tensor with elements:

$$\left( \mathcal{T} \times_k \mathbf{U} \right)[i_1, \ldots, i_{k-1}, r, i_{k+1}, \ldots, i_p] \;=\; \sum_{i_k=1}^{n_k} \mathcal{T}[i_1, \ldots, i_p]\, \mathbf{U}[r, i_k]$$

Definition 7. The Tucker tensor factorization of a $p$-th order tensor $\mathcal{T}$ is defined as:

$$\mathcal{T} \;=\; \mathcal{G} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \cdots \times_p \mathbf{U}_p$$

where $\mathcal{G}$ is the core tensor and the $\mathbf{U}_k$ are the basis matrices. The Tucker factorization is one of the possible generalizations of the SVD for matrices:

$$\mathbf{A} \;=\; \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^{T}$$

which is the core of the term-document information retrieval model and the starting point of a number of document clustering schemes. In order to compute the Tucker factorization, the higher order SVD is employed. The latter is based on cyclically updating each of the basis matrices $\mathbf{U}_k$ until they all converge according to a criterion.
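The following numpy sketch illustrates one way to obtain a (truncated) Tucker form via the higher order SVD, computing each factor matrix from the SVD of the corresponding mode unfolding and then forming the core by mode-k products; it is a generic illustration rather than the exact Tensor Toolbox routine used in the experiments.

```python
import numpy as np

def unfold(T: np.ndarray, mode: int) -> np.ndarray:
    # Mode-k unfolding: mode becomes the rows, remaining dimensions the columns.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T: np.ndarray, U: np.ndarray, mode: int) -> np.ndarray:
    # Mode-k product of tensor T with matrix U (U acts on the k-th dimension).
    moved = np.moveaxis(T, mode, 0)                 # bring mode to the front
    flat = moved.reshape(T.shape[mode], -1)         # n_k x (prod of other dims)
    result = (U @ flat).reshape((U.shape[0],) + moved.shape[1:])
    return np.moveaxis(result, 0, mode)

def hosvd(T: np.ndarray, ranks):
    # Truncated HOSVD: leading left singular vectors of each unfolding.
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        core = mode_multiply(core, U.T, mode)       # project onto each basis
    return core, factors

rng = np.random.default_rng(2)
T = rng.random((8, 12, 6))
core, factors = hosvd(T, ranks=(4, 5, 3))
approx = core
for mode, U in enumerate(factors):
    approx = mode_multiply(approx, U, mode)         # reconstruct from the Tucker form
print(np.linalg.norm(T - approx) / np.linalg.norm(T))  # relative approximation error
```

The cyclic updating mentioned above would refine these factor matrices further, with the plain HOSVD output serving as a common initialization.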
4.5. Queries
Similarly to the term-document matrix case, the tensor $\mathcal{T}$ can be queried regarding a set of terms or a set of keywords. Said queries can be cast in terms of linear algebra as tensor-vector multiplications. In addition, $\mathcal{T}$ allows queries about both terms and keywords. To generate a query vector, it suffices to place a one at each position corresponding to a query term and zeros elsewhere. For a more detailed description, see the collection querying algorithm in [47].
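As an illustration of querying via tensor-vector multiplication, the sketch below scores documents against a binary term query vector and a binary keyword query vector by contracting the tensor along the term and keyword dimensions; scoring by the summed weights is an assumed, simple ranking rule rather than the exact algorithm of [47].

```python
import numpy as np

rng = np.random.default_rng(3)
n_w, n_t, n_d = 5, 20, 8
T = rng.random((n_w, n_t, n_d)) * (rng.random((n_w, n_t, n_d)) > 0.8)

# Binary query vectors: ones at the requested keyword/term positions.
q_terms = np.zeros(n_t);    q_terms[[2, 7, 11]] = 1.0
q_keywords = np.zeros(n_w); q_keywords[[0, 3]] = 1.0

# Contract along keywords and terms; what remains is one score per document.
scores = np.einsum('ijk,i,j->k', T, q_keywords, q_terms)
ranking = np.argsort(scores)[::-1]
print(ranking[:3], scores[ranking[:3]])  # top three documents and their scores
```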
4.6. Document Clustering
When partitioning a set of $n$ elements into $k$ subsets, a heuristic approach is necessary, since the number of ways $S(n,k)$ to perform such a partitioning equals [49]:

$$S(n,k) \;=\; \frac{1}{k!} \sum_{j=0}^{k} (-1)^j \binom{k}{j} (k-j)^n$$

The generating function of the recursively defined sequence $S(n,k)$, for fixed $k$, is:

$$\sum_{n \ge k} S(n,k)\, x^{n} \;=\; \frac{x^{k}}{(1-x)(1-2x)\cdots(1-kx)}$$

which can be proven using the property of partial sums for any integer sequence.
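For reference, a short Python sketch computes these partition counts from the standard recurrence S(n, k) = k·S(n-1, k) + S(n-1, k-1), which makes it easy to see how quickly the number of possible clusterings grows and why a heuristic such as k-means is needed.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of ways to partition an n-element set into k non-empty subsets."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # Either the n-th element joins one of the k existing subsets,
    # or it forms a new subset together with a smaller partition.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(10, 3))    # 9330
print(stirling2(100, 5))   # astronomically large: exhaustive search is hopeless
```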
4.7. Baseline Methodology
The processing steps of the baseline methodology are extensively described in previous works [23,24,25]. Initially, for the web documents to be retrieved and later processed, the web document repository (PubMed) is queried. Specifically, we have used the PubMed API (http://www.ncbi.nlm.nih.gov/books/NBK25500/). After the result set D is retrieved in the initial step, each result item in D consists of six different fields: title, author names, abstract, keywords, conference/journal name and publication date.
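For readers who wish to reproduce the retrieval step, a minimal sketch of querying PubMed through the NCBI E-utilities (documented at the link above) is given below; the query string and the small retmax value are placeholders, and production use should respect the NCBI rate limits and supply an API key.

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(query: str, retmax: int = 20) -> list[str]:
    # ESearch returns the PubMed identifiers (PMIDs) matching the query.
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    reply = requests.get(f"{EUTILS}/esearch.fcgi", params=params, timeout=30)
    reply.raise_for_status()
    return reply.json()["esearchresult"]["idlist"]

def fetch_records(pmids: list[str]) -> str:
    # EFetch returns the full records (here as XML) for the given PMIDs.
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    reply = requests.get(f"{EUTILS}/efetch.fcgi", params=params, timeout=30)
    reply.raise_for_status()
    return reply.text

pmids = search_pubmed("cognitive science", retmax=5)
print(pmids)
xml_records = fetch_records(pmids)
print(xml_records[:200])
```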
Next, document representation takes place, where the methodology enriches the corresponding texts with annotations from a specific ontology. Consequently, each document is represented as a term frequency-inverse document frequency (tf-idf) vector, and some terms of the vector are annotated and mapped onto senses identified from MeSH. As a last step, the document vectors are clustered utilizing k-means.
The baseline methodology is outlined in Algorithm 1.
Algorithm 1 Baseline methodology from [25].
Require: Query vector q
Ensure: Produced clusters
1: identify the document set D
2: for each result item d_i in D do
3:    calculate the MeSH vector m_i {using the title, keywords and abstract of d_i as input}
4: end for
5: M ← {m_i : d_i ∈ D}
6: Clusters ← k-Means(M) {where the cosine similarity metric is applied within k-Means}
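A compact sketch of such a pipeline (document vectors clustered with k-means under cosine similarity) is given below using scikit-learn; the toy abstracts are placeholders, MeSH annotation is omitted, and cosine similarity is approximated by L2-normalizing the tf-idf vectors before applying standard k-means.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Placeholder abstracts standing in for the retrieved PubMed records.
abstracts = [
    "fmri study of working memory and attention",
    "eeg correlates of cognitive load during reading",
    "hippocampal activity and episodic memory consolidation",
    "deep learning methods for medical image segmentation",
    "convolutional networks applied to radiology scans",
]

# tf-idf representation of the documents (MeSH enrichment omitted in this sketch).
vectors = TfidfVectorizer(stop_words="english").fit_transform(abstracts)

# L2 normalization makes Euclidean k-means behave like cosine-similarity k-means.
vectors = normalize(vectors)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)  # cluster assignment for each abstract
```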
6. Custom PubMed Dataset
Before analyzing the precision and recall characteristics of the proposed model, it is worth looking at the tensor contents and specifically at the term list. The twenty most and the twenty least common keywords in the collection and their frequencies are shown in Table 10. Additionally, Table 11 contains the corresponding information for the text terms. In these tables, the frequency f for both keywords and terms is computed over the entire document collection.
It is no surprise to see common technical and medical terms at the top of the list of Table 10. For instance, both fMRI and EEG analysis are widespread techniques with many MATLAB implementations. Moreover, older or more specialized terms are less frequent. For example, shell shock or shellshock is a rather negatively-charged WWI-era term, which has now largely been replaced by post-traumatic stress disorder. Notice that closely-associated keywords, such as clinical and evaluation, have similar frequencies, which is expected. Rare keywords also pertain to other physiological conditions, probably originating from papers establishing a connection between brain and body functionality.
The situation is similar in Table 11, where the top twenty and the bottom twenty terms are shown. In comparison to Table 10, the terms are more diversified, covering a broader number of topics including many secondary ones, and thus the gaps between terms are considerably narrower. Obviously, the terms cognitive, cognition and brain are present in literally every document of the collection, which was anticipated. In contrast to Table 10, there hardly appears to be a connection between the least frequent terms and the topic. In fact, the right-hand side of Table 11 could appear in virtually any medical collection about any topic and still make some sense. This implies that there is definitely compression potential in the original collection, as a portion of the documents can be replaced by a combination of eigen-documents or, in the case of redundant information determined by a large number of generic terms, simply discarded.
Notice that the majority of the terms in the second column of Table 11 probably refer to the subjects undergoing some kind of treatment or monitoring. Furthermore, the first column of Table 11 shares many common entries with the corresponding column of Table 10. This can be attributed to the fact that a keyword, which carries significant semantic information, is very likely to also appear in the text of a document. Furthermore, the frequency of the terms fMRI and EEG roughly equals the sum of the frequencies of the academic papers and the clinical data documents of Table 12. A possible explanation is that these types of documents are the most likely to refer to clinical methodology, while the remaining document types address auxiliary topics.
Table 12 contains the frequency of each document type in the collection. The collection consists approximately half of scientific papers, which is consistent with the role of PubMed, supplemented by another half of auxiliary documents of various types.
It is of interest to examine the similarity between the keyword set $S_w$ and the term set $S_t$, as any high relevance between them would mean that the tensor could be reduced to a term-document matrix. Their similarity was assessed with the DTW (Dynamic Time Warping) metric, which operates on two vectors of lengths $p$ and $q$. First, it defines a metric $d(\cdot,\cdot)$ between the members of both vectors and then relies on the recurrence relation:

$$c[i,j] \;=\; d(x_i, y_j) \;+\; \min\left\{ c[i-1, j],\; c[i, j-1],\; c[i-1, j-1] \right\}$$

to compute the shortest transformation and its cost $c[p,q]$ between those two vectors, incrementally creating a shortest cost path in a $p \times q$ tableau. Since $S_w$ and $S_t$ contain words, $d$ was selected to be the Levenshtein distance. One fine point is that DTW requires vectors, which are ordered, whereas sets are by definition unordered. To overcome this, $S_w$ and $S_t$ were sorted in descending order of word frequency and, if needed, lexicographically as well, to break any frequency ties. This preserves not only the words in each set, but also their significance. Another subtlety is that the atomic operations are character and not word oriented. Once $c[p,q]$ was computed, it was expressed as a fraction of the worst case scenario, which is the deletion of each character of $S_w$ followed by the insertion of each character of $S_t$. For the given sets, this ratio was high, which means that there is little overlap between their sorted versions.
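The following self-contained Python sketch reproduces this comparison on a toy scale: a Levenshtein distance between words, a DTW pass over two frequency-sorted word lists, and normalization by the worst case cost; the tiny word lists are placeholders for the actual keyword and term sets.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def dtw(xs: list[str], ys: list[str]) -> float:
    # DTW over two word sequences with Levenshtein as the local metric d.
    INF = float("inf")
    c = [[INF] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    c[0][0] = 0.0
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d = levenshtein(xs[i - 1], ys[j - 1])
            c[i][j] = d + min(c[i - 1][j], c[i][j - 1], c[i - 1][j - 1])
    return c[-1][-1]

# Placeholder sets, already sorted by descending frequency (ties broken lexicographically).
keywords = ["memory", "fmri", "eeg", "attention"]
terms = ["cognitive", "brain", "memory", "patient", "study"]

cost = dtw(keywords, terms)
worst = sum(len(w) for w in keywords) + sum(len(t) for t in terms)
print(cost / worst)  # fraction of the worst case: delete one set entirely, insert the other
```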
Table 13 presents the tensor density as a function of the tensor size. Similarly to the benchmark dataset, it is a decreasing function of that size. Table 14 presents the ratio of the fused Frobenius metric to the overall Frobenius norm. The weakening of the document connections has causes similar to those observed in the benchmark dataset case.
7. Conclusions
This article presented a semantically-aware topic-based document clustering scheme for cognitive science articles that can be further applied to other biomedical ones. The core of this scheme is a keyword-term-document third order tensor, namely a three-dimensional array. The latter is a generalization of the established term-document matrix model, which is widely used in information retrieval, both in research and in industrial-grade systems. A third order keyword-term-document tensor with values derived from a double tf-idf scheme is proposed. The advantage of the proposed representation is the semantic enrichment achieved through the inclusion of keywords. Scientometric research suggests that keywords regularly carry more semantic information than ordinary terms. A variation of this model is to mix keywords from MeSH with keywords retrieved from PubMed.
The proposed methodology has been compared to the term-author-document model outlined in [47] in terms of compression potential, precision and recall. Both tensor models were implemented in MATLAB using the Tensor Toolbox. The experimental results suggest that the inclusion of keywords instead of authors increases precision and, to an extent, recall.
Regarding future work directions, a number of extensions are possible. The sparsity patterns of larger tensors should be analyzed and, if possible, compression techniques such as those proposed in [51] should be applied. Furthermore, effective density patterns should be investigated. Another research point is the addition of update operations to the proposed model, namely insertion and deletion operations, yielding thus a more flexible scheme. A related topic is the development of persistent methodologies for tensors, such as those in [52], in order to support the efficient retrieval of past versions. Finally, real-time analytics are gaining attention with the recent combination of streaming algorithms and tensors.