THESUS: Organizing Web document collections based on link semantics

Published: 01 November 2003

Abstract

The requirements for effective search and management of the WWW are stronger than ever. Currently, Web documents are classified based on their content, without taking into account the fact that these documents are connected to each other by links. We claim that a page’s classification is enriched by detecting the semantics of its incoming links. This would enable effective browsing and enhance the validity of search results in the WWW context. Another aspect that is underaddressed and closely related to the tasks of browsing and searching is the similarity of documents at the semantic level. These observations lead us to adopt a hierarchy of concepts (ontology) and a thesaurus to exploit links and provide a better characterization of Web documents. This enhanced characterization opens the way to operations such as clustering and labeling. To this end, we devised a system called THESUS. The system takes an initial set of Web documents, extracts keywords from all pages’ incoming links, and converts them to semantics by mapping them to a domain’s ontology. A clustering algorithm is then applied to discover groups of Web documents. The effectiveness of the clustering process rests on a novel similarity measure between documents characterized by sets of terms. Web documents are thus organized into thematic subsets based on their semantics. The subsets are then labeled, thereby enabling easier management (browsing, searching, querying) of the Web. In this article, we detail the process of this system and give an experimental analysis of its results.
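
To make the idea of comparing documents "characterized by sets of terms" more concrete, the sketch below scores two sets of ontology concepts against a small concept hierarchy. It is only an illustration under stated assumptions: the toy hierarchy in `PARENT` is invented, the per-concept score is the well-known Wu-Palmer measure, and the set-level score is a simple symmetric average of best matches. None of this is taken from the abstract, which does not give the formula of the THESUS measure.

```python
# Hypothetical toy ontology, stored as child -> parent (the root maps to None).
PARENT = {
    "entity": None,
    "place": "entity",
    "accommodation": "place",
    "hotel": "accommodation",
    "hostel": "accommodation",
    "transport": "entity",
    "flight": "transport",
    "train": "transport",
}


def path_to_root(concept):
    """Return the list [concept, parent, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first concept shared with a is the deepest
    # common ancestor (a concept counts as its own ancestor here).
    lcs = next(c for c in path_to_root(b) if c in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def set_similarity(terms_a, terms_b):
    """Symmetric average of best matches between two non-empty concept sets.

    Assumed aggregation for illustration only, not the measure of the paper.
    """
    def directed(src, dst):
        return sum(max(wu_palmer(s, d) for d in dst) for s in src) / len(src)

    return 0.5 * (directed(terms_a, terms_b) + directed(terms_b, terms_a))
```

With this toy hierarchy, `wu_palmer("hotel", "hostel")` evaluates to 0.75, because the two concepts share the parent `accommodation`, while a document described by `{"hotel", "hostel"}` scores much lower against one described by `{"flight", "train"}`. A pairwise score of this kind is what a clustering algorithm would consume to group documents into thematic subsets.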

Published In

The VLDB Journal — The International Journal on Very Large Data Bases  Volume 12, Issue 4
November 2003
83 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Document clustering
  2. Link analysis
  3. Link management
  4. Semantics
  5. Similarity measure
  6. World Wide Web
