Abstract
We propose the Document Self Organizer (DoSO), an extension of the classic Self-Organizing Map (SOM) model designed to handle document clustering more efficiently. We start from a document representation model that we developed previously to overcome some of the shortcomings of the Bag-of-Words (BOW) model; it describes documents by their important “concepts”, identified by exploiting Wikipedia knowledge. Building on this representation, we demonstrate how the SOM’s performance can be boosted by using the most important concepts of the document collection to explicitly initialize the neurons. We also show how a hierarchical approach can be incorporated into the SOM model, leading to a more comprehensive final clustering with hierarchical descriptive labels attached to neurons and clusters. Experiments show that DoSO yields promising results in terms of both extrinsic and SOM evaluation measures.
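To make the idea of explicit neuron initialization concrete, the following is a minimal sketch and not the DoSO algorithm itself. It assumes documents are already given as fixed-length concept-weight vectors, approximates a concept's importance by its summed weight over the collection, and uses a one-dimensional map with a Gaussian neighborhood. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def init_neurons_from_concepts(docs, n_neurons):
    """Seed each neuron with a one-hot vector on one of the top concepts.

    Assumption: "importance" is the summed concept weight over all documents.
    """
    dim = len(docs[0])
    totals = [sum(d[j] for d in docs) for j in range(dim)]
    top = sorted(range(dim), key=lambda j: totals[j], reverse=True)[:n_neurons]
    return [[1.0 if j == c else 0.0 for j in range(dim)] for c in top]


def best_matching_unit(weights, x):
    """Index of the neuron closest to x (squared Euclidean distance)."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def train_som(docs, weights, epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Standard online SOM updates on a 1-D map, starting from given weights."""
    rng = random.Random(seed)
    weights = [w[:] for w in weights]  # do not modify the caller's lists
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # linearly decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 1e-3)  # shrinking neighborhood width
        for x in rng.sample(docs, len(docs)):     # shuffled pass over documents
            bmu = best_matching_unit(weights, x)
            for i, w in enumerate(weights):
                h = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
                for j in range(len(w)):
                    w[j] += lr * h * (x[j] - w[j])
    return weights


# Toy collection: four documents over three concepts.
docs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]
seeds = init_neurons_from_concepts(docs, n_neurons=2)
trained = train_som(docs, seeds)
```

In this toy collection the first and third concepts carry the most weight, so the two neurons start on those axes instead of at random positions. Training then proceeds exactly as in an ordinary SOM.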
Spanakis, G., Siolas, G. & Stafylopatis, A. DoSO: a document self-organizer. J Intell Inf Syst 39, 577–610 (2012). https://doi.org/10.1007/s10844-012-0204-9
DOI: https://doi.org/10.1007/s10844-012-0204-9