THESUS: Organizing Web document collections based on link semantics

Authors:

Benjamin Nguyen,

Iraklis Varlamis,

Michalis Vazirgiannis

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 12, Issue 4

Pages 320 - 332

https://doi.org/10.1007/s00778-003-0100-6

Published: 01 November 2003

Abstract

The requirements for effective search and management of the WWW are stronger than ever. Currently, Web documents are classified based on their content, without taking into account the fact that these documents are connected to each other by links. We claim that a page’s classification is enriched by detecting the semantics of its incoming links. This would enable effective browsing and enhance the validity of search results in the WWW context. Another aspect that is underaddressed and closely related to the tasks of browsing and searching is the similarity of documents at the semantic level. These observations lead us to adopt a hierarchy of concepts (ontology) and a thesaurus to exploit links and provide a better characterization of Web documents. This enhanced characterization makes operations such as clustering and labeling especially worthwhile. To this end, we devised a system called THESUS. The system takes an initial set of Web documents, extracts keywords from all pages’ incoming links, and converts them to semantics by mapping them to a domain’s ontology. A clustering algorithm is then applied to discover groups of Web documents. The effectiveness of the clustering process rests on a novel similarity measure between documents characterized by sets of terms. Web documents are thus organized into thematic subsets based on their semantics. The subsets are then labeled, thereby enabling easier management (browsing, searching, querying) of the Web. In this article, we detail the process of this system and give an experimental analysis of its results.
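To make the idea of comparing documents "characterized by sets of terms" over a concept hierarchy more concrete, here is a minimal Python sketch. It is not the similarity measure defined in the article, which the abstract does not spell out. As a stand-in it uses the well-known Wu-Palmer concept similarity and a symmetric best-match average over the two term sets. The toy hierarchy `PARENT` and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical toy concept hierarchy (child -> parent); not taken from the article.
PARENT = {
    "entity": None,
    "science": "entity",
    "art": "entity",
    "computing": "science",
    "biology": "science",
    "database": "computing",
    "network": "computing",
    "painting": "art",
}


def path_to_root(concept):
    """Return the concepts from `concept` up to the root, inclusive."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first concept shared with a is the deepest common one.
    lcs = next(c for c in path_to_root(b) if c in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def set_similarity(terms_a, terms_b):
    """Assumed set measure: symmetric average of best-match concept similarities."""
    if not terms_a or not terms_b:
        return 0.0
    a_to_b = sum(max(wu_palmer(a, b) for b in terms_b) for a in terms_a) / len(terms_a)
    b_to_a = sum(max(wu_palmer(a, b) for a in terms_a) for b in terms_b) / len(terms_b)
    return (a_to_b + b_to_a) / 2


if __name__ == "__main__":
    # Term sets as they might result from mapping link keywords to the ontology.
    page_1 = {"database", "network"}
    page_2 = {"database"}
    page_3 = {"painting"}
    print(set_similarity(page_1, page_2))
    print(set_similarity(page_1, page_3))
```

In this sketch, two pages whose link keywords map to nearby concepts score higher than pages whose concepts meet only at the root, and a pairwise score of this kind is what a clustering step could consume.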



Index Terms

THESUS: Organizing Web document collections based on link semantics
1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
      1. Semantic networks
2. Information systems


Published In

The VLDB Journal — The International Journal on Very Large Data Bases Volume 12, Issue 4

November 2003

83 pages

ISSN: 1066-8888

Publisher

Springer-Verlag

Berlin, Heidelberg
