Stochastic-CVB0

Code implementing Stochastic CVB0 (SCVB0), from: J. R. Foulds, L. Boyles, C. DuBois, P. Smyth and M. Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2013.

Prerequisites

Julia programming language (tested on v0.6.2)
The Distributions Julia package, which can be added using the package manager
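
As a minimal sketch, the Distributions package can be installed from the Julia prompt as follows. The `using Pkg` line is an assumption about the Julia version in use: it is needed on Julia 1.0 and later, whereas on v0.6.2 (the version the code was tested on) `Pkg` is available without it.

```julia
# Julia 1.0 and later: load the package manager first.
# (On Julia 0.6 the Pkg module is already available, so omit this line.)
using Pkg

# Download and install the Distributions package.
Pkg.add("Distributions")
```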

The code should work on any platform on which Julia runs.

Data format

Routines are provided to read corpora in two formats: a simple format in which each token is encoded individually, and a sparse format. Both formats use a text file with one line per document in the corpus. Word indices are always one-based, since Julia indexes arrays this way. In the simple, non-sparse format (readData.jl), each word in the document is represented by its word index, and the indices are space separated. The words and documents should ideally be shuffled, to improve the performance of the stochastic algorithm.
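
To make the simple format concrete, here is a small illustrative parser. It is a sketch only and is not the code in readData.jl: the function name `parse_simple_corpus` is hypothetical, the corpus is read from a string rather than a file, and skipping blank lines is an assumption of this example.

```julia
# Hypothetical helper, not part of the repository: parse the simple format
# from a string, returning one vector of word indices per document.
function parse_simple_corpus(text::AbstractString)
    docs = Vector{Vector{Int}}()
    for line in split(text, '\n')
        tokens = split(line)           # split on whitespace, dropping empty fields
        isempty(tokens) && continue    # assumption: blank lines are skipped
        push!(docs, [parse(Int, t) for t in tokens])
    end
    return docs
end

# Two documents: the first has four tokens, the second has three.
corpus = "3 1 3 2\n2 2 5\n"
docs = parse_simple_corpus(corpus)
```

Each element of `docs` is one document, and each entry is a one-based word index, so the indices can be used directly to index Julia arrays.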

In the sparse format (readSparseData.jl), each line has one entry for each distinct word type appearing in the document. Each entry is a pair consisting of a word index and the count of that word in the document. All values are space separated, and a newline once again signifies the end of a document. For reporting the top words in a corpus, a dictionary file is also expected, containing one line per word type in plain text.
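
The sparse format can be illustrated with a similar sketch. Again, this is not the code in readSparseData.jl: the names `parse_sparse_corpus` and `expand_document` are hypothetical, the input is a string rather than a file, and the example assumes that each pair is written as the word index followed by its count.

```julia
# Hypothetical helper, not part of the repository: parse the sparse format
# from a string, returning a vector of (word index, count) pairs per document.
function parse_sparse_corpus(text::AbstractString)
    docs = Vector{Vector{Tuple{Int,Int}}}()
    for line in split(text, '\n')
        fields = split(line)
        isempty(fields) && continue    # assumption: blank lines are skipped
        iseven(length(fields)) || error("expected index/count pairs")
        nums = [parse(Int, f) for f in fields]
        # Assumption: the index comes first in each pair, then the count.
        push!(docs, [(nums[i], nums[i + 1]) for i in 1:2:length(nums)])
    end
    return docs
end

# Expand one sparse document into a flat list of word indices.
# The original token order is not recoverable from the sparse format.
function expand_document(pairs)
    tokens = Int[]
    for (w, c) in pairs
        append!(tokens, fill(w, c))
    end
    return tokens
end

# Word 1 once, word 2 once, word 3 twice; then word 2 twice, word 5 once.
sparse_corpus = "1 1 2 1 3 2\n2 2 5 1\n"
sparse_docs = parse_sparse_corpus(sparse_corpus)
first_doc_tokens = expand_document(sparse_docs[1])
```

Here `first_doc_tokens` contains the same word indices as the simple-format line `3 1 3 2`, up to ordering, which shows that the two formats carry the same bag-of-words information.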

The NIPS corpus, due to Sam Roweis, is encoded in both formats in the data folder (NIPS.txt and NIPSsparse.txt), along with a dictionary file (NIPSdict.txt).

Running the code

Try out runNIPS.jl to run the algorithm on the NIPS corpus. A flag switches between the two data formats. This file can be modified to load your particular dataset.

Author

James Foulds

License

Licensed under the Apache License, Version 2.0 (the "License"). You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 .
