DOI: 10.1145/2838706.2838708

Context-driven Dimensionality Reduction for Clustering Text Documents

Published: 04 December 2015

Abstract

We investigate clustering documents based on automatically annotated, potentially sensitive information extracted from a large collection of organizational data. In this use case, clustering helps users visualize and navigate groups of documents with related content. However, the effectiveness and efficiency of document clustering are limited mainly by the high dimensionality of the document vectors. To alleviate this problem, we propose a dimensionality reduction approach that selects terms with high tf-idf scores from the context of the automatically annotated sensitive regions of a document. Because real organizational data is unavailable for research purposes, we evaluate our approach on the standard 20 Newsgroups dataset. For evaluation purposes, the only sensitive information that we use from the documents of this dataset is the named entities, e.g. the names of persons and organizations. Experimental results show that our approach achieves an almost perfect clustering with a purity of 0.998, a relative improvement of 22.60% over the purity of 0.814 obtained without dimensionality reduction.
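The abstract gives no implementation details, so the following is only a minimal sketch of the two ideas it names: keeping high tf-idf terms from the context of annotated regions, and scoring a clustering by purity. Everything here is an assumption made for illustration: the function names, the token-window definition of "context", the `window` and `top_k` parameters, the plain tf × log(N/df) weighting, and the toy documents. None of it is taken from the paper.

```python
import math
from collections import Counter


def context_terms(tokens, entity_positions, window=2):
    """Return tokens lying within `window` positions of any annotated token.

    Assumption: "context" is a symmetric token window that includes the
    annotated token itself. The paper may define context differently.
    """
    return [
        tok
        for i, tok in enumerate(tokens)
        if any(abs(i - p) <= window for p in entity_positions)
    ]


def reduce_documents(docs, entity_positions, window=2, top_k=3):
    """Keep, for each document, the top_k context terms ranked by tf-idf."""
    n_docs = len(docs)
    # Document frequency is computed over the full, unreduced collection.
    df = Counter(term for doc in docs for term in set(doc))
    reduced = []
    for doc, positions in zip(docs, entity_positions):
        tf = Counter(context_terms(doc, positions, window))
        scores = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        reduced.append(ranked[:top_k])
    return reduced


def purity(cluster_ids, class_labels):
    """Fraction of documents that carry the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(class_labels)


# Hypothetical toy collection: token lists plus the positions of the
# annotated named entities in each document.
toy_docs = [
    ["alice", "joined", "nasa", "to", "study", "orbits"],
    ["bob", "sold", "his", "car", "to", "acme"],
]
toy_entities = [[0, 2], [0, 5]]
toy_reduced = reduce_documents(toy_docs, toy_entities, window=1, top_k=3)

# Relative improvement reported in the abstract, recomputed from its two
# purity values.
relative_gain_percent = (0.998 - 0.814) / 0.814 * 100
```

In this toy run the term "to" occurs in both documents, so its idf is zero and it ranks last among the context terms. The last line recomputes the abstract's relative improvement from the two reported purity values, which comes to roughly 22.6%.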

Cited By

  • (2019) A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification. Foundations of Science, 25(4):1077-1094. DOI: 10.1007/s10699-019-09592-w. Online publication date: 9-Mar-2019

Published In

FIRE '15: Proceedings of the 7th Annual Meeting of the Forum for Information Retrieval Evaluation
December 2015
57 pages
ISBN: 9781450340045
DOI: 10.1145/2838706

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Dimensionality reduction
  2. Document Clustering

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

FIRE '15: Forum for Information Retrieval Evaluation
December 4 - 6, 2015
Gandhinagar, India

Acceptance Rates

FIRE '15 Paper Acceptance Rate: 12 of 42 submissions, 29%
Overall Acceptance Rate: 19 of 64 submissions, 30%
