Ontology-driven web-based semantic similarity

Published: 01 December 2010

Abstract

Estimating the degree of semantic similarity (or distance) between concepts is a common problem in research areas such as natural language processing, knowledge acquisition, information retrieval, and data mining. Many similarity measures have been proposed, exploiting either explicit knowledge, such as the structure of a taxonomy, or implicit knowledge, such as information distribution. In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter, the frequencies of term appearances in a corpus are considered. Classical measures based on these premises have drawbacks: the former depend excessively on the taxonomical/ontological structure; the latter lack semantics, because they rely on a purely statistical analysis of occurrences, and/or are ambiguous, because concept distributions are estimated from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, to compute accurate concept appearance probabilities, they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities. This limits their applicability to other ontologies, such as domain-specific ontologies, and to massive corpora, such as the Web. In this paper, several of these issues are analyzed, and modifications of classical similarity measures are proposed. They are based on a contextualized and scalable version of IC computation on the Web that exploits taxonomical knowledge. The goal is to remove the measures' dependency on corpus pre-processing while still achieving reliable results and minimizing language ambiguity. Our proposals outperform classical approaches when the Web is used to estimate concept probabilities.
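
As a rough illustration of the IC-based family of measures the abstract refers to, the sketch below computes IC as the negative log of a concept's appearance probability and plugs it into the standard Resnik, Lin, and Jiang-Conrath formulas. It is a minimal sketch under stated assumptions, not the contextualized computation proposed in the paper: the hit counts, the total document count, and all function names are hypothetical, and the least common subsumer (LCS) of the two concepts is supplied by hand rather than looked up in a taxonomy.

```python
import math

# Hypothetical toy counts, invented for illustration only.
# In a Web-based setting these would stand in for search-engine hit counts.
TOTAL_DOCS = 1_000_000
HITS = {
    "mammal": 50_000,  # assumed least common subsumer of "dog" and "cat"
    "dog": 20_000,
    "cat": 25_000,
}


def information_content(term, hits=HITS, total=TOTAL_DOCS):
    """IC(c) = -log p(c), with p(c) estimated as hits(c) / total."""
    return -math.log(hits[term] / total)


def resnik_similarity(lcs):
    """Resnik: similarity is the IC of the least common subsumer."""
    return information_content(lcs)


def lin_similarity(a, b, lcs):
    """Lin: 2 * IC(lcs) / (IC(a) + IC(b))."""
    return 2 * information_content(lcs) / (
        information_content(a) + information_content(b)
    )


def jiang_conrath_distance(a, b, lcs):
    """Jiang-Conrath: IC(a) + IC(b) - 2 * IC(lcs), a distance (lower = closer)."""
    return (
        information_content(a)
        + information_content(b)
        - 2 * information_content(lcs)
    )


if __name__ == "__main__":
    print("Resnik:", resnik_similarity("mammal"))
    print("Lin:", lin_similarity("dog", "cat", "mammal"))
    print("Jiang-Conrath distance:", jiang_conrath_distance("dog", "cat", "mammal"))
```

With raw term counts like these, a polysemous term inflates the estimated probability of every concept it can denote, which is the ambiguity problem the abstract describes when a corpus is not pre-tagged and disambiguated.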

Published In

Journal of Intelligent Information Systems, Volume 35, Issue 3
December 2010
166 pages

Publisher

Kluwer Academic Publishers

United States

Author Tags

  1. Information content
  2. Knowledge discovery
  3. Ontologies
  4. Semantic similarity
  5. Web

Qualifiers

  • Article

