DOI: 10.1145/2034691.2034733
research-article

Building a topic hierarchy using the bag-of-related-words representation

Published: 19 September 2011

Abstract

A simple and intuitive way to organize a huge document collection is by a topic hierarchy. Generally, two steps are carried out to build a topic hierarchy automatically: 1) hierarchical document clustering and 2) cluster labeling. For both steps, a good textual document representation is essential. The bag-of-words is the common way to represent text collections. In this representation, each document is represented by a vector in which each word in the document collection corresponds to a dimension (feature). This approach has well-known problems, such as the high dimensionality and sparsity of the data. Besides, most concepts are composed of more than one word, such as "document engineering" or "text mining". In this paper, an approach called bag-of-related-words is proposed to generate features composed of sets of related words, with a dimensionality smaller than that of the bag-of-words. The features are extracted from each textual document of a collection using association rules. Different ways to map a document into transactions, so that association rules can be extracted, and different interest measures to prune the number of features are analyzed. To evaluate how much the proposed approach can aid topic hierarchy building, we carried out an objective evaluation of the clustering structure and a subjective evaluation of the topic hierarchies. All results were compared with those of the bag-of-words. The results showed that the proposed representation is better than the bag-of-words for topic hierarchy building.
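
As an illustration only, the idea described in the abstract can be sketched in a few lines of Python. The sketch rests on several assumptions that the abstract does not confirm: each sentence is treated as one transaction, only word pairs are considered, and support and confidence stand in for the interest measures. The function names and threshold values are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def sentences_to_transactions(document):
    """Map one document to transactions (assumption: one word set per sentence)."""
    transactions = []
    for sentence in document.lower().split("."):
        # Drop very short tokens as a crude stand-in for stopword removal.
        words = {w for w in sentence.split() if len(w) > 2}
        if words:
            transactions.append(words)
    return transactions


def related_word_features(document, min_support=0.5, min_confidence=0.8):
    """Return word pairs that pass hypothetical support and confidence thresholds."""
    transactions = sentences_to_transactions(document)
    n = len(transactions)
    if n == 0:
        return set()
    # Count in how many transactions each word and each word pair occurs.
    word_counts = Counter(w for t in transactions for w in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    features = set()
    for (a, b), count in pair_counts.items():
        if count / n < min_support:
            continue
        # Keep the pair if the rule is confident enough in either direction.
        confidence = max(count / word_counts[a], count / word_counts[b])
        if confidence >= min_confidence:
            features.add("_".join((a, b)))
    return features


doc = "Text mining finds patterns. Text mining needs data. Clustering groups data."
print(related_word_features(doc))
```

In this toy document only the pair "mining" and "text" co-occurs in enough sentences to survive both thresholds, so the document is described by one compound feature instead of eight single-word dimensions.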


Information & Contributors

Information

Published In

DocEng '11: Proceedings of the 11th ACM symposium on Document engineering
September 2011
296 pages
ISBN: 9781450308632
DOI: 10.1145/2034691

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document representation
  2. text mining
  3. topic hierarchy

Qualifiers

  • Research-article

Conference

DocEng '11: ACM Symposium on Document Engineering
September 19 - 22, 2011
Mountain View, California, USA

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 19 Sep 2024

Cited By

  • (2019) A Knowledge-Based Semisupervised Hierarchical Online Topic Detection Framework. IEEE Transactions on Cybernetics 49:9 (3307-3321). DOI: 10.1109/TCYB.2018.2841504. Online publication date: Sep-2019
  • (2018) Compact Representation of Documents Using Terms and Termsets. Machine Learning and Data Mining in Pattern Recognition (77-84). DOI: 10.1007/978-3-319-96136-1_7. Online publication date: 8-Jul-2018
  • (2017) Centrality-Based Group Profiling: A Comparative Study in Co-authorship Networks. New Generation Computing 36:1 (59-89). DOI: 10.1007/s00354-017-0028-9. Online publication date: 21-Nov-2017
  • (2016) A systematic review of multi-label feature selection and a new method based on label construction. Neurocomputing 180:C (3-15). DOI: 10.1016/j.neucom.2015.07.118. Online publication date: 5-Mar-2016
  • (2015) Lazy Multi-label Learning Algorithms Based on Mutuality Strategies. Journal of Intelligent and Robotic Systems 80:1 (261-276). DOI: 10.1007/s10846-014-0144-4. Online publication date: 1-Oct-2015
  • (2014) Label Construction for Multi-label Feature Selection. Proceedings of the 2014 Brazilian Conference on Intelligent Systems (247-252). DOI: 10.1109/BRACIS.2014.52. Online publication date: 18-Oct-2014
  • (2012) Measuring media-based social interactions in online civic mobilization against corruption in Brazil. Proceedings of the 18th Brazilian symposium on Multimedia and the web (173-180). DOI: 10.1145/2382636.2382675. Online publication date: 15-Oct-2012
