DoSO: a document self-organizer

Gerasimos Spanakis¹,²,
Georgios Siolas¹ &
Andreas Stafylopatis¹

Abstract

In this paper, we propose the Document Self Organizer (DoSO), an extension of the classic Self Organizing Map (SOM) model designed to handle document clustering more efficiently. We start from a document representation model that we previously developed to overcome some of the shortcomings of the Bag-of-Words (BOW) model; it represents documents by important “concepts” drawn from Wikipedia knowledge. We demonstrate how SOM’s performance can be boosted by using the most important concepts of the document collection to explicitly initialize the neurons. We also show how a hierarchical approach can be used within the SOM model, and how this can lead to a more comprehensive final clustering result with hierarchical descriptive labels attached to neurons and clusters. Experiments show that DoSO yields promising results in terms of both extrinsic and SOM evaluation measures.
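
The idea of initializing SOM neurons explicitly from the most important concepts can be illustrated with a minimal sketch. It is not the DoSO implementation: the abstract does not give the actual initialization rule, concept weighting, map topology or hierarchy, so the one-hot seeding, the 1-D neuron chain, the toy document vectors and all function names below are assumptions made for illustration only.

```python
import math

def top_concepts(docs, k):
    """Return indices of the k concepts with the largest total weight (assumed importance measure)."""
    n_concepts = len(docs[0])
    totals = [sum(d[j] for d in docs) for j in range(n_concepts)]
    # Ties are broken by the lower concept index to keep the result deterministic.
    return sorted(range(n_concepts), key=lambda j: (-totals[j], j))[:k]

def init_neurons(docs, k):
    """Seed one neuron per top concept: weight 1.0 on that concept, 0.0 elsewhere (assumption)."""
    n_concepts = len(docs[0])
    neurons = []
    for j in top_concepts(docs, k):
        w = [0.0] * n_concepts
        w[j] = 1.0
        neurons.append(w)
    return neurons

def bmu(neurons, x):
    """Index of the best-matching unit (smallest squared Euclidean distance)."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in neurons]
    return dists.index(min(dists))

def train(docs, neurons, epochs=20, lr0=0.5, sigma0=0.5):
    """Classic online SOM updates on a 1-D chain with linearly decaying rate and radius."""
    for t in range(epochs):
        frac = 1.0 - t / epochs
        lr = lr0 * frac
        sigma = max(sigma0 * frac, 1e-3)
        for x in docs:
            b = bmu(neurons, x)
            for i, w in enumerate(neurons):
                # Gaussian neighborhood around the winning neuron.
                h = math.exp(-((i - b) ** 2) / (2 * sigma ** 2))
                for j in range(len(w)):
                    w[j] += lr * h * (x[j] - w[j])
    return neurons

# Toy collection: each row is a document, each column a concept weight (made-up values).
docs = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.0],
    [0.1, 0.0, 0.8, 0.1],
]

neurons = train(docs, init_neurons(docs, k=2))
assignments = [bmu(neurons, d) for d in docs]
print(assignments)
```

Seeding each neuron on a distinct high-weight concept gives every map unit an interpretable starting point, which is also what would make it natural to attach a concept as a descriptive label to the neuron afterwards; random initialization offers no such correspondence.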

Notes

See http://kdd.ics.uci.edu/.
See http://icame.uib.no/.

Author information

Gerasimos Spanakis
Present address: Intelligent Systems Lab, Athens, Greece

Authors and Affiliations

National Technical University of Athens, Athens, Greece
Gerasimos Spanakis, Georgios Siolas & Andreas Stafylopatis

Corresponding author

Correspondence to Gerasimos Spanakis.

About this article

Cite this article

Spanakis, G., Siolas, G. & Stafylopatis, A. DoSO: a document self-organizer. J Intell Inf Syst 39, 577–610 (2012). https://doi.org/10.1007/s10844-012-0204-9

Received: 02 February 2012
Revised: 25 April 2012
Accepted: 26 April 2012
Published: 12 May 2012
Issue Date: December 2012
DOI: https://doi.org/10.1007/s10844-012-0204-9
