DOI: 10.1145/1255175.1255249
Article

Organizing the OCA: learning faceted subjects from a library of digital books

Published: 18 June 2007

Abstract

Large-scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic models provide a complementary function. These models can identify semantically coherent "topics" that are easily recognizable and meaningful to humans, but they have been too computationally intensive to run on library-scale corpora. This paper presents DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions. This model is simultaneously better able to represent observed properties of text and more scalable to extremely large text collections. We train an individual topic model for each book based on the co-occurrence of words within its pages, and then cluster topics across books. The resulting topical clusters can be interpreted as subject facets, allowing readers to browse the topics of a collection quickly, find relevant books using topically expanded keyword searches, and explore topical relationships between books. We demonstrate this method by finding topics in a corpus of 1.49 billion words from 42,000 books in less than 20 hours, and it could easily scale well beyond this.
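As background, the Dirichlet Compound Multinomial (also known as the multivariate Pólya distribution) that gives DCM-LDA its name has the following standard likelihood for a vector of word counts; this is textbook material rather than a formula quoted from the paper.

```latex
% Standard Dirichlet Compound Multinomial (multivariate Polya) likelihood
% for word counts x = (x_1, ..., x_W) with n = \sum_w x_w and
% hyperparameters alpha = (alpha_1, ..., alpha_W).
\[
P(\mathbf{x} \mid \boldsymbol{\alpha})
  = \frac{n!}{\prod_{w} x_w!}
    \cdot \frac{\Gamma\!\left(\sum_{w} \alpha_w\right)}
               {\Gamma\!\left(n + \sum_{w} \alpha_w\right)}
    \cdot \prod_{w} \frac{\Gamma(x_w + \alpha_w)}{\Gamma(\alpha_w)}
\]
```

Because the Dirichlet is integrated out, a word that has already occurred in a document becomes more likely to occur again, which captures the "burstiness" of real text that a plain multinomial misses.

The two-stage pipeline the abstract describes (per-book topic models over pages, then clustering of topics across books) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: scikit-learn's standard LDA stands in for the paper's DCM-LDA, K-means stands in for whatever cross-book topic clustering the authors used, and `books` is a hypothetical mapping from book identifiers to lists of page texts; none of these choices are taken from the paper.

```python
# Illustrative sketch: per-book topic models over pages, then clustering of
# topic-word distributions across books into subject facets.
# Assumptions (not from the paper): scikit-learn LDA replaces DCM-LDA,
# K-means replaces the cross-book topic clustering, and `books` is a
# hypothetical {book_id: [page_text, ...]} dictionary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

TOPICS_PER_BOOK = 10   # assumed setting
N_SUBJECT_FACETS = 50  # assumed number of cross-book topic clusters

# Build one shared vocabulary so topic-word vectors are comparable across books.
all_pages = [page for pages in books.values() for page in pages]
vectorizer = CountVectorizer(stop_words="english", max_features=20000)
vectorizer.fit(all_pages)

topic_vectors = []  # one row per (book, topic) pair
topic_owner = []    # book that each topic came from

for book_id, pages in books.items():
    X = vectorizer.transform(pages)  # pages play the role of documents
    lda = LatentDirichletAllocation(n_components=TOPICS_PER_BOOK, random_state=0)
    lda.fit(X)
    # Normalize topic-word pseudo-counts into distributions over the vocabulary.
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    topic_vectors.append(topic_word)
    topic_owner.extend([book_id] * TOPICS_PER_BOOK)

# Each cluster of per-book topics can be read as a subject facet; books whose
# topics land in the same cluster are topically related.
facets = KMeans(n_clusters=N_SUBJECT_FACETS, random_state=0).fit_predict(
    np.vstack(topic_vectors)
)
```

In this sketch the expensive step, fitting a small model per book, is independent across books, which is why the two-stage design lends itself to very large collections.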

Published In

JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries
June 2007
534 pages
ISBN:9781595936448
DOI:10.1145/1255175
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification
  2. topic models

Qualifiers

  • Article

Conference

JCDL '07: Joint Conference on Digital Libraries
June 18–23, 2007
Vancouver, BC, Canada

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 1

Reflects downloads up to 08 Feb 2025

Cited By

  • (2022) "Analysis of Unboxing Experience by Applying Topic Modeling to YouTube Review Data: The Smartphone Case." Journal of the Korean Institute of Industrial Engineers, 48(6):546–556. DOI: 10.7232/JKIIE.2022.48.6.546. Online: 15 Dec 2022.
  • (2021) "Deep Learning for Medical Anomaly Detection – A Survey." ACM Computing Surveys, 54(7):1–37. DOI: 10.1145/3464423. Online: 18 Jul 2021.
  • (2021) "Beyond Multimedia Authoring." ACM Computing Surveys, 54(7):1–31. DOI: 10.1145/3464422. Online: 18 Jul 2021.
  • (2021) "A Survey of Smart Contract Formal Specification and Verification." ACM Computing Surveys, 54(7):1–38. DOI: 10.1145/3464421. Online: 18 Jul 2021.
  • (2021) "Topic Modeling Using Latent Dirichlet Allocation." ACM Computing Surveys, 54(7):1–35. DOI: 10.1145/3462478. Online: 17 Sep 2021.
  • (2021) "Semantic Information Retrieval on Medical Texts." ACM Computing Surveys, 54(7):1–38. DOI: 10.1145/3462476. Online: 17 Sep 2021.
  • (2021) "Temporal Relation Extraction in Clinical Texts." ACM Computing Surveys, 54(7):1–36. DOI: 10.1145/3462475. Online: 17 Sep 2021.
  • (2021) "More than Privacy." ACM Computing Surveys, 54(7):1–37. DOI: 10.1145/3460771. Online: 18 Jul 2021.
  • (2019) "Structural Topic Modeling for Social Scientists: A Brief Case Study with Social Movement Studies Literature, 2005–2017." Social Currents, 6(4):307–318. DOI: 10.1177/2329496519846505. Online: 2 May 2019.
  • (2016) "N-gram over Context." Proceedings of the 25th International Conference on World Wide Web, pages 1045–1055. DOI: 10.1145/2872427.2882981. Online: 11 Apr 2016.
