DOI: 10.1145/1255175.1255249
Article

Organizing the OCA: learning faceted subjects from a library of digital books

Published: 18 June 2007

Abstract

Large-scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic models provide a complementary function. These models can identify semantically coherent "topics" that are easily recognizable and meaningful to humans, but they have been too computationally intensive to run on library-scale corpora. This paper presents DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions. This model is simultaneously better able to represent observed properties of text and more scalable to extremely large text collections. We train an individual topic model for each book based on the co-occurrence of words within its pages, and then cluster topics across books. The resulting topical clusters can be interpreted as subject facets, allowing readers to browse the topics of a collection quickly, find relevant books using topically expanded keyword searches, and explore topical relationships between books. We demonstrate this method by finding topics in a corpus of 1.49 billion words from 42,000 books in less than 20 hours, and it could easily scale well beyond this.
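As background, the Dirichlet Compound Multinomial (also known as the multivariate Pólya distribution) that gives DCM-LDA its name has the following standard likelihood for a vector of word counts; this is textbook material rather than a formula quoted from the paper.

```latex
% Standard Dirichlet Compound Multinomial (multivariate Polya) likelihood
% for word counts x = (x_1, ..., x_W) with n = \sum_w x_w and
% hyperparameters alpha = (alpha_1, ..., alpha_W).
\[
P(\mathbf{x} \mid \boldsymbol{\alpha})
  = \frac{n!}{\prod_{w} x_w!}
    \cdot \frac{\Gamma\!\left(\sum_{w} \alpha_w\right)}
               {\Gamma\!\left(n + \sum_{w} \alpha_w\right)}
    \cdot \prod_{w} \frac{\Gamma(x_w + \alpha_w)}{\Gamma(\alpha_w)}
\]
```

Because the Dirichlet is integrated out, a word that has already occurred in a document becomes more likely to occur again, which captures the "burstiness" of real text that a plain multinomial misses.

The two-stage pipeline the abstract describes (per-book topic models over pages, then clustering of topics across books) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: scikit-learn's standard LDA stands in for the paper's DCM-LDA, K-means stands in for whatever cross-book topic clustering the authors used, and `books` is a hypothetical mapping from book identifiers to lists of page texts; none of these choices are taken from the paper.

```python
# Illustrative sketch: per-book topic models over pages, then clustering of
# topic-word distributions across books into subject facets.
# Assumptions (not from the paper): scikit-learn LDA replaces DCM-LDA,
# K-means replaces the cross-book topic clustering, and `books` is a
# hypothetical {book_id: [page_text, ...]} dictionary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

TOPICS_PER_BOOK = 10   # assumed setting
N_SUBJECT_FACETS = 50  # assumed number of cross-book topic clusters

# Build one shared vocabulary so topic-word vectors are comparable across books.
all_pages = [page for pages in books.values() for page in pages]
vectorizer = CountVectorizer(stop_words="english", max_features=20000)
vectorizer.fit(all_pages)

topic_vectors = []  # one row per (book, topic) pair
topic_owner = []    # book that each topic came from

for book_id, pages in books.items():
    X = vectorizer.transform(pages)  # pages play the role of documents
    lda = LatentDirichletAllocation(n_components=TOPICS_PER_BOOK, random_state=0)
    lda.fit(X)
    # Normalize topic-word pseudo-counts into distributions over the vocabulary.
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    topic_vectors.append(topic_word)
    topic_owner.extend([book_id] * TOPICS_PER_BOOK)

# Each cluster of per-book topics can be read as a subject facet; books whose
# topics land in the same cluster are topically related.
facets = KMeans(n_clusters=N_SUBJECT_FACETS, random_state=0).fit_predict(
    np.vstack(topic_vectors)
)
```

In this sketch the expensive step, fitting a small model per book, is independent across books, which is why the two-stage design lends itself to very large collections.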

Published In

JCDL '07: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries
June 2007
534 pages
ISBN:9781595936448
DOI:10.1145/1255175
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification
  2. topic models

Qualifiers

  • Article

Conference

JCDL '07: Joint Conference on Digital Libraries
June 18–23, 2007
Vancouver, BC, Canada

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%

Article Metrics

  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 1

Reflects downloads up to 08 Feb 2025

Cited By

  • (2022) "Analysis of Unboxing Experience by Applying Topic Modeling to YouTube Review Data: The Smartphone Case." Journal of the Korean Institute of Industrial Engineers, 48(6):546–556. DOI: 10.7232/JKIIE.2022.48.6.546. Online: 15 Dec 2022.
  • (2021) "Deep Learning for Medical Anomaly Detection – A Survey." ACM Computing Surveys, 54(7):1–37. DOI: 10.1145/3464423. Online: 18 Jul 2021.
  • (2021) "Beyond Multimedia Authoring." ACM Computing Surveys, 54(7):1–31. DOI: 10.1145/3464422. Online: 18 Jul 2021.
  • (2021) "A Survey of Smart Contract Formal Specification and Verification." ACM Computing Surveys, 54(7):1–38. DOI: 10.1145/3464421. Online: 18 Jul 2021.
  • (2021) "Topic Modeling Using Latent Dirichlet Allocation." ACM Computing Surveys, 54(7):1–35. DOI: 10.1145/3462478. Online: 17 Sep 2021.
  • (2021) "Semantic Information Retrieval on Medical Texts." ACM Computing Surveys, 54(7):1–38. DOI: 10.1145/3462476. Online: 17 Sep 2021.
  • (2021) "Temporal Relation Extraction in Clinical Texts." ACM Computing Surveys, 54(7):1–36. DOI: 10.1145/3462475. Online: 17 Sep 2021.
  • (2021) "More than Privacy." ACM Computing Surveys, 54(7):1–37. DOI: 10.1145/3460771. Online: 18 Jul 2021.
  • (2019) "Structural Topic Modeling for Social Scientists: A Brief Case Study with Social Movement Studies Literature, 2005–2017." Social Currents, 6(4):307–318. DOI: 10.1177/2329496519846505. Online: 2 May 2019.
  • (2016) "N-gram over Context." Proceedings of the 25th International Conference on World Wide Web, pages 1045–1055. DOI: 10.1145/2872427.2882981. Online: 11 Apr 2016.
