
Ontology-driven web-based semantic similarity

Authors:

David Sánchez,

Montserrat Batet,

Karina Gibert

Journal of Intelligent Information Systems, Volume 35, Issue 3

Pages 383 - 413

https://doi.org/10.1007/s10844-009-0103-x

Published: 01 December 2010

Abstract

Estimating the degree of semantic similarity/distance between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval and data mining. Many similarity measures have been proposed, exploiting either explicit knowledge (such as the structure of a taxonomy) or implicit knowledge (such as information distribution). In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a purely statistical analysis of occurrences and/or the ambiguity of estimating the statistical distribution of concepts from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, to compute accurate concept appearance probabilities they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities. This limits the applicability of those measures to other ontologies (such as domain-specific ontologies) and to massive corpora (such as the Web). In this paper, several of these issues are analyzed, and modifications of classical similarity measures are proposed. They are based on a contextualized and scalable version of IC computation on the Web that exploits taxonomical knowledge. The goal is to avoid the measures' dependency on corpus pre-processing for achieving reliable results, and to minimize language ambiguity. Our proposals are able to outperform classical approaches when using the Web for estimating concept probabilities.
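
To make the IC-based measures mentioned in the abstract concrete, the sketch below computes the standard Resnik, Lin and Jiang-Conrath formulas, with concept probabilities estimated from page counts. It is only an illustration under stated assumptions: the hit counts, the total page count and the example terms are invented, and `contextualized_ic` (which queries a term together with its taxonomical subsumer) is one plausible reading of "contextualized" IC, not necessarily the exact formulation proposed in the paper.

```python
import math

# Hypothetical page counts, invented for illustration only.
# A real system would obtain them from a web search engine.
TOTAL_PAGES = 1_000_000
HITS = {
    "dog": 40_000,
    "cat": 50_000,
    "mammal": 200_000,
    "dog AND mammal": 8_000,
    "cat AND mammal": 10_000,
}


def information_content(query, hits=HITS, total=TOTAL_PAGES):
    """IC(c) = -log2 p(c), with p(c) estimated as hits(c) / total."""
    return -math.log2(hits[query] / total)


def resnik_similarity(lcs):
    """Resnik: IC of the least common subsumer (LCS) of both concepts."""
    return information_content(lcs)


def lin_similarity(a, b, lcs):
    """Lin: 2 * IC(LCS) / (IC(a) + IC(b))."""
    return 2 * information_content(lcs) / (
        information_content(a) + information_content(b)
    )


def jiang_conrath_distance(a, b, lcs):
    """Jiang-Conrath: IC(a) + IC(b) - 2 * IC(LCS)."""
    return (
        information_content(a)
        + information_content(b)
        - 2 * information_content(lcs)
    )


def contextualized_ic(term, subsumer):
    """Assumed contextualization: count only pages where the term
    co-occurs with its subsumer, to reduce the effect of ambiguity."""
    return information_content(f"{term} AND {subsumer}")


if __name__ == "__main__":
    print("Resnik:", resnik_similarity("mammal"))
    print("Lin:", lin_similarity("dog", "cat", "mammal"))
    print("Jiang-Conrath:", jiang_conrath_distance("dog", "cat", "mammal"))
    print("Contextualized IC(dog):", contextualized_ic("dog", "mammal"))
```

Because the contextualized query is more restrictive than the bare term, it matches no more pages than the term alone, so its IC is at least as high. In the taxonomy, the LCS is the most specific concept that subsumes both terms (here assumed to be "mammal").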

Published In

Journal of Intelligent Information Systems, Volume 35, Issue 3

December 2010

166 pages

ISSN: 0925-9902

Copyright © 2010 Springer Science+Business Media, LLC.

Publisher: Kluwer Academic Publishers, United States
