Abstract.
The requirements for effective search and management of the WWW are stronger than ever. Currently, Web documents are classified by their content alone, without taking into account that these documents are connected to each other by links. We claim that a page’s classification is enriched by detecting the semantics of its incoming links. This would enable effective browsing and enhance the validity of search results in the WWW context. Another aspect that is underaddressed, and closely related to the tasks of browsing and searching, is the similarity of documents at the semantic level. These observations lead us to adopt a hierarchy of concepts (ontology) and a thesaurus to exploit links and provide a better characterization of Web documents. This richer characterization in turn makes operations such as clustering and labeling particularly attractive. To this end, we devised a system called THESUS. The system starts from an initial set of Web documents, extracts keywords from all pages’ incoming links, and converts them to semantics by mapping them to a domain’s ontology. A clustering algorithm is then applied to discover groups of Web documents. The effectiveness of the clustering process rests on a novel similarity measure between documents characterized by sets of terms. Web documents are organized into thematic subsets based on their semantics. The subsets are then labeled, thereby enabling easier management (browsing, searching, querying) of the Web. In this article, we detail the process of this system and give an experimental analysis of its results.
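To make the idea of comparing documents "characterized by sets of terms" concrete, here is a minimal Python sketch. It is not the similarity measure defined in the article, which the abstract does not specify. As a stand-in it uses the well-known Wu-Palmer concept similarity on a toy hierarchy and averages best matches in both directions. The ontology, the thesaurus, and every function name below are hypothetical and chosen only for illustration.

```python
# Hypothetical toy ontology: child concept -> parent concept (root has None).
PARENT = {
    "entity": None,
    "sport": "entity",
    "ball_sport": "sport",
    "football": "ball_sport",
    "tennis": "ball_sport",
    "music": "entity",
    "jazz": "music",
}

# Hypothetical thesaurus: link keyword -> ontology concept.
THESAURUS = {"soccer": "football", "football": "football",
             "tennis": "tennis", "jazz": "jazz"}


def to_concepts(keywords):
    """Map link keywords to ontology concepts, dropping unknown keywords."""
    return {THESAURUS[k] for k in keywords if k in THESAURUS}


def path_to_root(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth in the hierarchy, counting the root as depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # First ancestor of b that is also an ancestor of a: the deepest common one.
    lcs = next(c for c in path_to_root(b) if c in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def set_similarity(s, t):
    """Assumed set measure: mean best-match similarity, averaged both ways."""
    if not s or not t:
        return 0.0
    best_s = sum(max(wu_palmer(a, b) for b in t) for a in s) / len(s)
    best_t = sum(max(wu_palmer(a, b) for a in s) for b in t) / len(t)
    return (best_s + best_t) / 2


if __name__ == "__main__":
    doc1 = to_concepts(["soccer", "unknown-word"])
    doc2 = to_concepts(["tennis"])
    print(set_similarity(doc1, doc2))  # 0.75
```

A pairwise score of this kind could feed any clustering algorithm that accepts a similarity or distance function. The symmetric averaging is one design choice among several; it keeps the score in [0, 1] and returns 1.0 for identical concept sets.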
Additional information
Received: 16 December 2002, Accepted: 16 April 2003, Published online: 17 September 2003
Cite this article
Halkidi, M., Nguyen, B., Varlamis, I. et al. THESUS: Organizing Web document collections based on link semantics. The VLDB Journal 12, 320–332 (2003). https://doi.org/10.1007/s00778-003-0100-6
DOI: https://doi.org/10.1007/s00778-003-0100-6