Just as different metrics have been devised for assessing stability, different techniques have also been developed to improve stability in topic modeling. Some techniques work by modifying different parts of existing topic models, while others work by introducing entirely new models.
This section reviews different approaches to improving the stability of topic modeling, including: different topic models, alternative generative processes (i.e., sampling and/or inference algorithms), combinations of existing models, use of multiple random initializations, and (hyper)parameter adjustments. Although improvements in topic quality are not the main focus of this paper, approaches to quality enhancement that also improved topic stability are discussed to demonstrate how topic stability can relate to modifications intended to improve quality.
6.1 Novel Models
This subsection compares and contrasts novel topic models (different probabilistic approaches, matrix factorization methods, etc.) against previous common models discussed in Section
2. Delving into all alternative models would exceed the scope of this paper. Instead, this section focuses only on models that were specifically intended to improve stability.
An early example comes from De Waal and Barnard [
28], who computed stability for LDA and for a matrix factorization method. They used
Gamma-Poisson (GaP) as the matrix factorization method, which draws a document's topic proportions from a Gamma distribution. GaP updates the topic-term matrix for each word of the current document using a Poisson distribution to reduce the reconstruction error of document terms [
20]. De Waal and Barnard [
28] showed that the LDA model is able to produce more stable results across two runs, while a single run of GaP generates topics that are less similar to other topics within that run (i.e., the topics are more diverse). While this work does not introduce a novel method to improve stability (GaP factorization had already been introduced for topic modeling [
20]), it compares topic modeling techniques from a novel standpoint that may behave differently than other topic metrics. The relationships between stability and other metrics are discussed in more detail in Section
7.
Besides probabilistic topic models (e.g., pLSA, GaP, and LDA) and matrix factorization techniques (e.g., NMF and SVD), which attempt to reduce document-term generation error using an optimization process, there is another type of topic model that clusters terms or documents into topics. These models use a similarity-based metric to cluster topics and represent each topic's descriptors with its most important terms (for example, the highest-ranked TF-IDF terms of each cluster). The rest of this section describes the most recent similarity-based methods and discusses these models' stability performance.
El-Assady et al. [
31] introduced a new hierarchical document/topic clustering method called
Incremental Hierarchical Topic Modeling (IHTM). This method generates topics by clustering documents using document similarity. Document similarity is computed over the documents' terms (e.g., TF-IDF or other term-frequency features). Two or more similar documents then form a topic. The documents' keywords (e.g., the top-N words with the highest TF-IDF) are merged to generate the term representation of a topic. Originally, IHTM was designed to improve the visual representation of hierarchical relationships among topics and to provide parameters (e.g., a threshold for merging topics) for user interaction.
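The core mechanism just described (grouping documents by term-based similarity and merging their keywords into topic descriptors) can be illustrated with a short, hedged sketch. This is not El-Assady et al.'s exact IHTM algorithm; the vectorizer, threshold, and greedy grouping are illustrative assumptions. The greedy, incremental grouping also hints at why such approaches can be sensitive to document order, a point discussed below.

```python
# Illustrative sketch (not El-Assady et al.'s exact algorithm): a similarity-based
# "documents as topics" approach. Documents are represented by TF-IDF vectors,
# greedily grouped when their cosine similarity exceeds a threshold, and each
# group's descriptor is the merged set of top-weighted terms.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_topics(docs, threshold=0.3, top_n=10):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)                      # document-term TF-IDF matrix
    terms = np.array(vec.get_feature_names_out())
    sims = cosine_similarity(X)
    groups = []                                      # each group = list of document indices
    for d in range(len(docs)):
        for g in groups:
            if sims[d, g].mean() >= threshold:       # join a sufficiently similar group
                g.append(d)
                break
        else:
            groups.append([d])                       # otherwise start a new topic
    topics = []
    for g in groups:
        weights = np.asarray(X[g].sum(axis=0)).ravel()   # merge members' keyword weights
        topics.append(terms[np.argsort(weights)[::-1][:top_n]].tolist())
    return topics
```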
However, it has also been shown that IHTM improves stability. El-Assady et al. [
32] used a newly defined metric called
Document-Matched-Frequency (DMF) to compare LDA, IHTM, and NMF. DMF captures the similarity of a topic’s top documents with a set of ground truth documents, which differs from computing stability by comparing results across multiple runs.
IHTM resolves the issue of document order, wherein providing the same input documents in a different order can yield different topic solutions (discussed in Sections 1 and 5). However, IHTM only resolves this issue for document-topic assignment. For topic-term assignment, IHTM remains highly sensitive to document order. This occurs because a topic's descriptors (i.e., the assignment of a topic's top terms) are formed based on incoming documents, and later documents have less influence on the descriptors than earlier ones. Moreover, this model is not able to provide a document-topic distribution or topic-term probabilities. This makes IHTM well suited to certain visualizations, but not a general-purpose replacement for the comprehensive probability distributions generated by other topic models (e.g., LDA).
6.2 Improving Topic Generation and Inference
The inference phase of topic generation is an unsupervised process and one source of instability in topic modeling. For example, in LDA, the optimization procedure depends on drawing latent parameters (e.g., the topic-term distributions \(\beta\)) before the generative process starts. These latent parameters determine the document-topic proportions (\(\theta\)) and the topic assignments of words (\(z_n\)), which are also unobserved. Altering the procedures by which these latent parameters are inferred may improve stability.
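As a small illustration of this source of instability (not tied to any particular remedy discussed below), fitting LDA twice on the same corpus with different random seeds can produce noticeably different topics. The toy corpus and gensim usage here are assumptions for demonstration only.

```python
# Minimal sketch: the same corpus, fit twice with LDA under different random seeds,
# can yield different top terms per topic. Corpus is a toy placeholder.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["topic", "model", "stability"], ["inference", "latent", "parameters"],
        ["gibbs", "sampling", "inference"], ["stability", "metric", "jaccard"]]
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

for seed in (1, 2):
    lda = LdaModel(bow, id2word=dictionary, num_topics=2, random_state=seed, passes=10)
    print(f"seed={seed}:", [lda.show_topic(k, topn=3) for k in range(2)])
```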
Roberts et al. [
98] introduced a new approach to improve LDA optimization by making an initial guess for
\(\theta\) and
\(z_n\) and then optimizing
\(\beta\) with
Variational Expectation Maximization (VEM). This approach is called
Structural Topic Modeling (STM). To make more accurate guesses, Roberts et al. [
97] added two covariates, topical prevalence and topical content. The prevalence covariate is based on an assumed linear relation between a document discussing a topic and metadata variables (e.g., for political blog posts, whether the author is liberal or conservative). The content covariate takes documents' different sources (e.g., blogs vs. news media, or different news media venues) into account in the generative process. In the inference phase, \(\theta\) is drawn from a logistic-normal distribution parameterized by the topical prevalence covariates (X) and the topic covariance (\(\Sigma\)).
STM is built to resolve LDA’s optimization issue in the presence of latent parameters. LDA optimization depends on starting points in finding local minima [
98]. STM resolves this dependency on the starting point by adding covariates, topic covariance, VEM, and an initial guess (discussed in more detail in Section
6.4) to the optimization procedure.
Although Roberts et al. [
98] did not compare the stability of LDA and STM across multiple runs, they compared the stability of the relationship between the covariates and the estimated topic proportions (\(\theta _d\)). They observed that the relationships between generated topics and covariates are more stable for STM than for LDA. To estimate the topic proportions (\(\theta _d\)), MAP estimation was used, which adds priors to the likelihood computation [
98,
100]. The authors used MAP analysis to compare LDA, STM, and the true distribution of a synthesized data set. This comparison showed which model is more stable in capturing covariate-topic relations. Figure
7 shows that STM can capture covariate effects that LDA is not able to integrate. Moreover, running STM multiple times does not move the covariate MAP estimates far from the true distribution of the synthesized data. This suggests that, besides the use of the VEM algorithm, adding covariates provides a stable inference phase and \(\beta\) estimate for STM. However, since LDA does not integrate covariates to initialize and optimize the parameters, this comparison does not necessarily mean one is more stable than the other in generating the final topics.
As another attempt to resolve the topic generation instability issue, Koltcov et al. [
53] introduced a new sampling approach that they term
granulated LDA (gLDA). Generally, topic models use a bag-of-words representation that does not account for the proximity of neighboring terms within a document, either in the generative process or, more specifically, in the sampling procedure. The authors introduced a new sampling method called granulated Gibbs sampling that samples terms randomly, similar to Gibbs sampling, but assigns the same topic to the neighboring terms around anchor words (top terms derived from the previous inference iteration) within a fixed window length. Increasing the length of the neighboring window has been shown to increase the stability of the topics generated by gLDA according to the Jaccard stability measure. Koltcov et al. [
53] ran LDA (with Gibbs sampling and variational Bayes inference), pLSA, and gLDA three times on the same corpus. The results showed that gLDA increased stability according both to Jaccard similarity and to symmetric KL. Interestingly, gLDA also increased topic quality according to coherence [
75] and tf-idf coherence [
82].
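The window-based idea behind granulated sampling can be sketched as follows. This is a deliberately simplified, hypothetical illustration of topic propagation from anchor positions to their neighbors, not Koltcov et al.'s full sampler; the conditional Gibbs draw is abstracted behind `sample_topic`, and the anchor positions are assumed given.

```python
# Simplified sketch of the granulated-sampling idea: after a token at an anchor
# position receives a topic, the same topic is propagated to neighboring tokens
# within a fixed window. Tokens outside every window would be sampled as in
# ordinary Gibbs sampling (omitted here).
import random

def granulated_pass(doc_tokens, anchor_positions, sample_topic, window=2):
    """Return a topic assignment per token of one document (None = not covered)."""
    z = [None] * len(doc_tokens)
    for pos in anchor_positions:
        topic = sample_topic(pos)                      # e.g., a Gibbs draw for the anchor token
        lo, hi = max(0, pos - window), min(len(doc_tokens), pos + window + 1)
        for i in range(lo, hi):
            z[i] = topic                               # neighbors inherit the anchor's topic
    return z

# toy usage with a random "sampler" standing in for the conditional Gibbs draw
tokens = "the court ruled on the election law appeal".split()
print(granulated_pass(tokens, anchor_positions=[1, 5],
                      sample_topic=lambda pos: random.randrange(3)))
```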
Koltcov et al. [
53] and Roberts et al. [
97] used two different inference algorithms, sampling-based (gLDA) and the Variational Expectation Maximization (VEM) algorithm, and both improved stability. However, Yao et al. [
129] showed that different inference algorithms affect topic solutions by comparing variants of Gibbs sampling (from which gLDA was originally derived), VEM (which STM uses for optimization), Max-Entropy, and Naive Bayes. Since the true distribution of topics is not observable, the authors used a version of Gibbs sampling that jointly resamples all topics (too computationally expensive for regular use) to generate document samples; the resulting posterior over topics is assumed to be closer to the true posterior because all topics are resampled jointly. The comparison shows that none of the topic modeling algorithms produces results similar to this closer-to-true distribution, against which the other models were compared using F1, cosine similarity, and KL-distance metrics. Although Yao et al. did not compare the stability of different models across multiple runs, comparing similarity to a base model that can update all parameters at once reveals that altering the inference algorithm affects the solutions/topics a model generates. That is, using different sampling and optimization techniques may improve stability, but the resulting final solutions may also be less similar to a synthesized "true" distribution.
6.3 Combining Multiple Runs’ Solutions
The previous sections reviewed models that alter the topic modeling algorithm and/or the inference phase. Besides changing a model's algorithm, combining multiple runs is another way of improving topic stability; this section discusses the advantages and disadvantages of such combination methods. Rather than altering the underlying model or inference algorithm, these approaches achieve higher stability by combining the topics of multiple runs. With each subsequent run, each topic is either a new topic or a repeat of a topic from a previous run, as determined by similarity measures (introduced in Section 5). Merging repeated topics and adding new, distinct, non-repeated topics is expected to produce a stable solution. Combining multiple runs may end with a different number of emerged topics than the initial number of topics (e.g., caused by different settings, a different initial K, or a change in the similarity threshold for merging). Different approaches, discussed further below, handle this issue differently.
Belford et al. [
11] introduced a new topic modeling ensemble that combines topics of multiple runs. The method proposed by Belford et al. [
11] consists of two steps: generation and integration. In the generation step, a number of NMF models, termed r, are executed, each generating K topics. These \(r \times K\) topics are stacked vertically to form a topic-term matrix M. Then, in the integration step, an NMF decomposition algorithm is applied to matrix M to generate a topic-term matrix H with K components (topics). The topic-term matrix H is multiplied by the original document-term matrix to produce the document-topic matrix D. The authors used NMF with Non-Negative Double Singular Value Decomposition (NNDSVD) initialization in the integration step, which differs from the least-squares NMF used in the generation step. Boutsidis and Gallopoulos [17] found that initializing the factors of an NMF technique (e.g., SVD, least squares) with an SVD has a large impact on the final solutions and reduces sensitivity to random initialization; this initialization method is called NNDSVD.
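A minimal sketch of the generation/integration idea is given below, assuming scikit-learn's NMF. The parameter choices and the final projection used to obtain document-topic scores are illustrative simplifications, not Belford et al.'s exact implementation.

```python
# Sketch of the two-step ensemble: (1) stack topic-term factors from several
# randomly initialized NMF runs, (2) factorize the stacked matrix into K final topics.
import numpy as np
from sklearn.decomposition import NMF

def ensemble_nmf(X, n_topics=10, n_runs=5):
    """X: non-negative document-term matrix. Returns (D, H) of the ensemble."""
    # Generation: r randomly initialized NMF runs, each producing K topic-term rows.
    stacked = np.vstack([
        NMF(n_components=n_topics, init="random", random_state=r, max_iter=300)
        .fit(X).components_                      # K x terms factors of one run
        for r in range(n_runs)
    ])                                           # (r*K) x terms matrix M
    # Integration: factorize the stacked topic-term matrix M into K final topics.
    integrator = NMF(n_components=n_topics, init="nndsvd", max_iter=300)
    integrator.fit(stacked)
    H = integrator.components_                   # final topic-term matrix
    D = X @ H.T                                  # document-topic scores via projection (a simplification)
    return D, H
```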
The authors compared the stability of the LDA and NMF topic models using DSD and average Jaccard similarity (introduced in Section
5). Despite the higher stability of the ensemble model, the document-topic matrix is generated after the model finishes matrix decomposition (not within either of the two steps). This may cause different document-topic solutions each time the ensemble model is executed, since the
r topic models (and thus M matrix) vary at each run. Belford et al. [
11] did not provide any stability comparison at the document level to investigate the possible effects of the ensemble strategy on the generated document-topic matrix.
In short, an ensemble of NMF topic models can achieve higher stability than a single run of an LDA or NMF model. However, running a matrix factorization method on previously obtained factors makes interpretation difficult. Alternatively, using clustering to combine the topics of multiple runs generates more interpretable results.
Mantyla et al. [
69] introduced a new method to cluster the topics generated by N runs of LDA models. This method employs K-medoids clustering, which starts with a selection of cluster centers (either randomly or with an approach that minimizes an objective function) and assigns data points to each center. K-medoids then iteratively searches for the best replacements of the cluster centers; replacements are selected from each cluster's data points to reduce an intra-cluster distance measure. At each stage, centers and assigned data points are updated until no more changes occur [
89,
105]. After obtaining \(t = N \times K\) topics (N runs, each with K topics) and a topic-term matrix of size \(t \times w\), the authors employed the GloVe word embedding technique [91] to project this matrix to a lower-dimensional matrix of size \(t \times v\) (\(v=[200,400]\), \(v \lt w\)).
Running K-medoids with K clusters (i.e., the same as the number of topics) results in K final topics. The proposed method sums the term probabilities of the topics in each cluster and selects the top-10 terms as the descriptors of the cluster. Because of this combination of topics, a topic-term probability matrix is not provided. The authors did not compare the stability of the clustered LDA with that of plain LDA or any other topic model, but used clustering metrics (e.g., Silhouette) and stability metrics (e.g., Jaccard) to show the best, the median, and the worst topic across 20 runs of this model according to these metrics.
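In the spirit of this clustering approach, a hedged sketch is shown below. It assumes the scikit-learn-extra package for K-medoids, and the `embed` function standing in for the GloVe projection is a hypothetical placeholder; the descriptor construction mirrors the summing of member topics' term weights described above.

```python
# Illustrative sketch (simplified, not Mantyla et al.'s implementation): pool
# topic-term rows from N runs, embed them, cluster with K-medoids, and derive
# each cluster's descriptors from the summed term weights of its members.
import numpy as np
from sklearn_extra.cluster import KMedoids   # assumes the scikit-learn-extra package

def cluster_topics(topic_term, terms, embed, n_topics, top_n=10):
    """topic_term: (N*K) x w array of pooled topic-term weights."""
    E = embed(topic_term)                                  # project to (N*K) x v, v << w
    labels = KMedoids(n_clusters=n_topics, metric="cosine",
                      random_state=0).fit_predict(E)
    descriptors = []
    for c in range(n_topics):
        pooled = topic_term[labels == c].sum(axis=0)       # sum member topics' term weights
        descriptors.append([terms[i] for i in np.argsort(pooled)[::-1][:top_n]])
    return descriptors
```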
Mantyla et al. [
69] use word embeddings when clustering topics; thus, topic stability is subject to change with different word embedding training parameters, training corpora, initializations, or document orders [
5]. Moreover, no topic quality assessment or comparison is provided for this method, and it is hard to analyze how this approach to stability affects the usability of the results or their perceived quality.
Clustering topics may combine mixed topics with non-mixed ones. Whether this occurs depends heavily on the pre-defined number of clusters and the distance measure. Hierarchical clustering avoids merging topics that are only partially similar at lower levels and combines them as it moves to higher-level clusters. Miller and McCoy [
72] proposed a new hierarchical clustering model to improve the stability of topic modeling and hierarchical summarization. First, the model executes topic modeling under the same conditions (e.g., the same number of topics, documents, and pre-processing) multiple times. Second, it finds pairwise matched topics across the different models using cosine similarity. Third, it clusters topics by applying group agglomerative clustering [
33] using the computed pairwise similarities. Fourth, for each cluster, it (I) assigns the cluster's members, (II) updates the centroids using hierarchical topic models (ignoring topics with less than a threshold similarity to the centroids), and (III) computes the clusters' stability for the updated centroids. The method repeats these four steps until no changes occur.
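Steps two and three (pairwise cosine matching and agglomerative clustering of pooled topics) could look roughly like the following sketch, assuming scikit-learn; the centroid-update and stability-computation steps of Miller and McCoy's method are omitted.

```python
# Minimal sketch: match topics from multiple runs by cosine similarity, then
# cluster them agglomeratively on the resulting distance matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AgglomerativeClustering

def cluster_runs(topic_term_runs, n_clusters):
    """topic_term_runs: list of K x w topic-term matrices, one per run."""
    all_topics = np.vstack(topic_term_runs)              # pool topics from all runs
    distance = 1.0 - cosine_similarity(all_topics)       # pairwise cosine distance
    clustering = AgglomerativeClustering(                # `metric` is `affinity` in older scikit-learn
        n_clusters=n_clusters, metric="precomputed", linkage="average")
    return clustering.fit_predict(distance)              # cluster label per pooled topic
```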
In this approach, Miller and McCoy [
72] used
Hierarchical Dirichlet process (HDP) with Gibbs sampler [
118] as the flat topic model (first stage), and the nested HDP [88] for hierarchical clustering (stage 4.II). Three stability metrics were used to capture the similarity of topics across multiple runs: cosine similarity, the ratio of aligned topics, and JSD. This hierarchical topic modeling was not compared to any other model, but it has been shown that stability decreases with deeper hierarchical alignment of topic models [
72]. Although this model tends to improve stability, the results are subject to change with different ordering, especially since centroids are shaped based on topics’ order of entry into the clusters.
These last two methods [
69,
72] used top terms’ similarity to merge topics and ignored document similarity. El-Assady et al. [
32] suggested using both term and document similarity to match and merge topics, based on an observational study of manual matching. The authors defined a strong or complete match as one that maximizes both term and document similarity between topic pairs from two different runs.
The Layered Topic Matching Algorithm (LTMA) was introduced to do so. It first computes topic matches and generates matching candidates. Then, it adds candidates: complete matches at higher ranks, and document-only or term-only similarity matches at lower ranks. The authors used DMF, as described in Section 6.1, to capture the similarity of topics at the document level. A new weighted term similarity measure called Ranked and Weighted Penalty Distance (rwpd) was also introduced. The rwpd measure puts more weight on shared higher-ranked top terms and likewise penalizes missing higher-ranked top terms more heavily. A comparison between LTMA and LDA showed that LTMA is more stable according to the DMF metric. However, the authors did not compare the stability of the previously introduced IHTM (a hierarchical matching model) [31] with that of LTMA, although both were shown to be more stable than LDA.
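A hedged sketch of a rank-weighted top-term similarity in the spirit of rwpd is shown below; the exact weighting and penalty scheme of El-Assady et al. is not reproduced here, and the linear rank weights are an illustrative assumption.

```python
# Illustrative rank-weighted term similarity: shared terms contribute a weight that
# grows with their ranks in both lists, and terms missing from the other list incur
# a rank-dependent penalty. Not the exact rwpd definition.
def rank_weighted_similarity(terms_a, terms_b):
    """terms_a, terms_b: ranked top-term lists (rank 0 = most important)."""
    n = max(len(terms_a), len(terms_b))
    weight = lambda rank: n - rank                      # higher-ranked terms weigh more
    score, max_score = 0.0, 0.0
    for rank_a, term in enumerate(terms_a):
        max_score += weight(rank_a)
        if term in terms_b:
            rank_b = terms_b.index(term)
            score += min(weight(rank_a), weight(rank_b))   # reward shared high-ranked terms
        else:
            score -= weight(rank_a)                     # penalize missing high-ranked terms
    return score / max_score if max_score else 0.0

print(rank_weighted_similarity(["tax", "budget", "senate"], ["budget", "tax", "court"]))
```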
Considering both term and document similarity is a plus in an ensemble model. On the other hand, LTMA (like IHTM) is not able to produce topic-term and document-topic matrices. In addition, neither IHTM nor LTMA can be used as an ensemble base model. Moreover, for quality assessment, the authors compared LDA, IHTM, and LTMA according to the number of correct document-theme assignments relative to a defined ground truth. LTMA performed better at assigning similar documents of a theme to a topic, which shows that LTMA was able to improve both the stability and the quality of topics. However, comparing document-topic assignments against pre-defined document themes might not be the best way to compare the stability or quality of topics. Although topic models are supposed to extract latent themes, that does not necessarily mean the extracted themes and the defined themes will match exactly. Furthermore, generating topics that differ from the ground-truth themes does not indicate whether the model is or is not stable.
Sections 6.1 through 6.3 reviewed how changes to a model affect stability, whether those changes replace the whole model, alter the inference phase, or combine multiple solutions. Since most topic models are also sensitive to initialization and to pre-defined parameters, the next two subsections analyze these smaller changes and their influence on topic stability.
6.4 Initialization
Changing the parameters and settings of a topic model changes the generated topics. A topic model initializes some of these parameters randomly (e.g., the topic-term matrix), and different random initializations will yield different topic solutions. Prior work has considered using different initialization approaches to increase stability, as described in this subsection.
As discussed in Section
6.2, STM alters LDA by adding two covariates and other hyperparameters to estimate topic-term and document-topic matrices (
\(\beta\) and
\(\theta\)). However, STM still ends up with different topics when starting with different random initializations. Roberts et al. [
98] showed that the final solutions can become more stable and of higher quality (as measured by the negative log likelihood of the model) if LDA or a spectral algorithm [6] is used for initialization. Unlike LDA, the spectral algorithm extracts topics with provable guarantees of maximizing the likelihood of the objective function. Thus, it is deterministic and stable.
To measure stability, Roberts et al. computed the lower bound of the marginal likelihood across multiple runs. The VEM approach uses Jensen's inequality to derive a lower bound on the marginal likelihood of the unobserved parameters (topics) given the observed data (documents). They used this measure to compare the stability and quality of different initialization techniques. In this comparison, a higher value of the lower bound of the marginal likelihood [116] indicates higher quality, and smaller changes in this value across multiple runs indicate more stable results.
Roberts et al. [
98] compared these two initialization methods – LDA and spectral – and found that using LDA helps STM converge faster. Thus, it was chosen as the default initialization method of STM’s R package [
99]. They also found that using more iterations of LDA initialization does not improve the stability and quality of the generated topics. While an iteration of the spectral algorithm is slower than an iteration of LDA, the spectral algorithm converges after only one iteration.
Roberts et al. also stated that using either the LDA or the spectral initialization method leads STM to perform better with respect to the lower bound of the marginal likelihood (higher quality). It has been shown that STM with pre-initialization achieves higher stability and quality as measured by the lower bound of the marginal likelihood. However, the quality (e.g., perplexity or NPMI) and stability (e.g., average Jaccard) of the STM-generated topics were not compared to the topics of other models. Furthermore, although STM can employ either an LDA or a spectral initialization for a more stable start, this does not necessarily mean that the inference process – including VEM optimization, sampling, and the generative process – ultimately provides more stable (and higher quality) solutions than LDA. In particular, because LDA itself uses random initialization, this may influence the stability of STM results as well.
STM is not the only topic model that employed another topic model to initialize parameters. As previously discussed in Section
6.3, Belford et al. [
11] showed that a single NMF model with the NNDSVD initialization is more stable than an ensemble of least-squares NMF models. The random initialization of the multiple models reduced the quality (according to NPMI and NMI) and stability (according to average Jaccard and average DSD) of the proposed ensemble model [11]. Therefore, Belford et al. [11] designed a new ensemble model that breaks the dataset into f folds of documents. Instead of running r NMF models with random initialization (previously discussed in Section 6.3), they executed multiple NNDSVD-initialized topic models, termed p, on each fold.
In the next step, topics are generated similarly to the process described in Section
6.3 for the approach introduced in [11]. The proposed f-fold ensemble model performs better than LDA, NMF, and ensemble NMF, but performs mostly similarly to NNDSVD with respect to the stability and quality comparisons introduced above. Interestingly, the models with the highest quality (NPMI and PMI) in the experiments were not necessarily the most stable ones. When LDA or a single-model NMF showed higher performance, they were among the two least stable ones. This may indicate that the most stable models are not always the models with the highest quality, and vice versa. Comparing and contrasting stability and quality is discussed in more detail in Section
7.
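The practical effect of a deterministic, SVD-based start can be sketched with scikit-learn's NMF, which exposes both random and NNDSVD initialization; the toy matrix and settings below are illustrative only.

```python
# Sketch: with init="random", two NMF runs can produce different topic-term factors,
# while init="nndsvd" starts from the same SVD-based factors every time.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 50))                       # toy non-negative document-term matrix

def top_terms(model, X, top_n=5):
    model.fit(X)
    return [np.argsort(row)[::-1][:top_n].tolist() for row in model.components_]

print(top_terms(NMF(5, init="random", random_state=1, max_iter=500), X))
print(top_terms(NMF(5, init="random", random_state=2, max_iter=500), X))   # may differ
print(top_terms(NMF(5, init="nndsvd", max_iter=500), X))                   # deterministic start
```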
6.5 (Hyper)Parameter Adjustment
Besides random initialization, (hyper)parameter adjustment can also affect the solutions of a topic model [
25]. Various hyperparameter optimization techniques have been used to achieve higher stability in topic modeling. Chuang et al. [
25] optimized
\(\alpha\) and
\(\beta\) hyperparameters using a parameter exploration technique called grid search. They show that hyperparameter changes affect resolved and repeated topics, and alter fused and fused-repeated topics even more. Agrawal et al. [
1] analyzed topic modeling papers and stated that, as of 2018, only one third of the topic modeling papers that focused on stability used some level of parameter tuning (i.e., manual adjustments or parameter exploration) and less than 10% of those papers mentioned that parameter adjustments had a huge impact on final results.
To explore this gap in the literature, Agrawal et al. [
1] introduced a new
evolutionary algorithm (EA) technique to maximize topic modeling stability through hyperparameter adjustments. They employed the
Differential Evolution (DE) algorithm, which they chose because it has been shown that DE is competitive with the
Genetic Algorithm (GA) and
Particle Swarm Optimization (PSO). LDADE starts by randomly initializing N (the population size) sets of parameters, including the number of topics (K) in [10, 100], \(\alpha\) in [0, 1], and \(\beta\) in [0, 1]. Then, the algorithm iteratively generates new sets of parameters and evaluates the new offspring by running an LDA on each set of parameters; the authors compared both Gibbs-LDA and VEM-LDA within the LDADE evaluation and found that Gibbs-LDA provides more stable topics.
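A loose sketch of this kind of search is given below, assuming SciPy's differential evolution and gensim's LDA; the stability objective (median top-term overlap across seeds), the toy corpus, and the shrunken parameter ranges are simplified stand-ins for Agrawal et al.'s setup, not their implementation.

```python
# Hyperparameter search over (K, alpha, beta) with differential evolution,
# maximizing a simple stability score (median top-term overlap across seeds).
import numpy as np
from scipy.optimize import differential_evolution
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["court", "law", "ruling"], ["election", "vote", "law"],
        ["budget", "tax", "senate"], ["vote", "senate", "election"]]
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

def negative_stability(params, top_n=3, seeds=(1, 2, 3)):
    k, alpha, beta = int(params[0]), params[1], params[2]
    runs = [LdaModel(bow, id2word=dictionary, num_topics=k,
                     alpha=alpha, eta=beta, random_state=s, passes=5) for s in seeds]
    tops = [{w for t in range(k) for w, _ in m.show_topic(t, top_n)} for m in runs]
    overlaps = [len(a & b) / len(a | b) for a in tops for b in tops if a is not b]
    return -np.median(overlaps)                 # minimize negative median overlap

result = differential_evolution(negative_stability,
                                bounds=[(2, 5), (0.01, 1.0), (0.01, 1.0)],
                                maxiter=3, popsize=4, seed=0)
print(result.x)                                 # best (K, alpha, beta) found
```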
Agrawal et al. [
1] then computed the number of similar (overlapping) terms in topic pairs for different numbers of top-N terms. They computed the stability of a model using the median of the overlaps, \(R_n\), across multiple runs. They called this approach the raw score, and then introduced another measure, the Delta score, to compute the difference in \(R_n\) before and after DE optimization; the Delta score measures the stability improvement or change. To reduce the effects of document order on stability, Agrawal et al. [1] evaluated each model over 10 runs of LDA using shuffled data.
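Separating the scoring from the search sketched earlier, the raw score and Delta score might be computed along these lines; the pairing of topics by index and the fixed top-n cutoff are simplifying assumptions rather than Agrawal et al.'s exact definitions.

```python
# Raw score: median top-n term overlap between topic pairs across runs.
# Delta score: difference between the raw scores after and before tuning.
import statistics

def raw_score(runs_top_terms, n):
    """runs_top_terms: list of runs; each run is a list of topics' ranked top-term lists."""
    overlaps = []
    for i, run_a in enumerate(runs_top_terms):
        for run_b in runs_top_terms[i + 1:]:
            for topic_a, topic_b in zip(run_a, run_b):
                overlaps.append(len(set(topic_a[:n]) & set(topic_b[:n])))
    return statistics.median(overlaps)

def delta_score(before_runs, after_runs, n):
    return raw_score(after_runs, n) - raw_score(before_runs, n)
```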
While it has been shown that LDADE improves both the stability (median of term overlaps) and the quality (F1 score for text classification using the generated topics) of LDA topic modeling, it takes almost five times longer than a single LDA run. Moreover, changing the parameters of the DE algorithm (e.g., the number of iterations, cross-over rate, or differential weights) affects stability and the generated topics. Therefore, LDADE itself needs exploration and adjustment, too. Despite the classification improvement, Agrawal et al. [1] did not compare LDA and LDADE using other quality measures (e.g., NPMI or perplexity). Moreover, choosing the best model using LDADE is done by assessing classification via the F1 score. This binds the resulting model to the assessed task (in this case, classification on a specific corpus) rather than to the quality of the topics themselves, including human perception of quality or more general topic quality assessments (e.g., perplexity or NPMI).