Abstract
Ambiguities, which are inherently present in natural languages, make it challenging to determine the actual identities of entities mentioned in a document (e.g., Paris can refer to a city in France, to a small city in Texas, USA, or to the 1984 film Paris, Texas directed by Wim Wenders). Disambiguation is a problem that entity resolution methods can solve successfully.
This paper studies various methods for estimating relatedness between entities for use in collective entity resolution. We define a unified entity resolution approach that can use implicit as well as explicit relatedness to collectively identify in-text entities. As a relatedness measure, we propose a method that expresses relatedness through the heterogeneous relations of a domain ontology. We also experiment with other relatedness measures, such as statistical learning of co-occurrences of two entities or content similarity between them. Evaluation on real data shows that the new methods for relatedness estimation give good results.
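To make the idea of collective resolution with a co-occurrence-based relatedness measure concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a hypothetical toy corpus (DOCS), hypothetical entity identifiers, pointwise mutual information as the relatedness score, and an exhaustive search over candidate assignments, none of which come from the abstract.

```python
import math
from itertools import combinations, product

# Hypothetical toy corpus: each set holds the entity IDs annotated in one document.
DOCS = [
    {"Paris_France", "France"},
    {"Paris_France", "France", "Eiffel_Tower"},
    {"Paris_Texas", "Texas"},
    {"Paris_Texas_film", "Wim_Wenders"},
    {"Texas", "Wim_Wenders"},
]


def pmi(a, b, docs=DOCS):
    """Pointwise mutual information of two entities over document co-occurrence."""
    n = len(docs)
    count_a = sum(a in d for d in docs)
    count_b = sum(b in d for d in docs)
    count_ab = sum(a in d and b in d for d in docs)
    if count_ab == 0:
        return 0.0  # simplifying assumption: unseen pairs count as unrelated
    return math.log((count_ab * n) / (count_a * count_b))


def resolve(candidates, relatedness=pmi):
    """Pick one candidate per mention so that total pairwise relatedness is maximal."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    # Brute-force search over all joint assignments; fine for a toy example only.
    for assignment in product(*(candidates[m] for m in mentions)):
        score = sum(relatedness(a, b) for a, b in combinations(assignment, 2))
        if score > best_score:
            best, best_score = assignment, score
    return dict(zip(mentions, best))


result = resolve({
    "Paris": ["Paris_France", "Paris_Texas", "Paris_Texas_film"],
    "Wenders": ["Wim_Wenders"],
})
print(result)
```

In this toy setting the mention "Paris" is resolved jointly with "Wenders", so the film reading wins because it is the only candidate that co-occurs with Wim_Wenders in the corpus. Any other relatedness function, for example one derived from ontology relations or content similarity, could be passed in through the relatedness parameter.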
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Štajner, T., Mladenić, D. (2009). Entity Resolution in Texts Using Statistical Learning and Ontologies. In: Gómez-Pérez, A., Yu, Y., Ding, Y. (eds) The Semantic Web. ASWC 2009. Lecture Notes in Computer Science, vol 5926. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10871-6_7
DOI: https://doi.org/10.1007/978-3-642-10871-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10870-9
Online ISBN: 978-3-642-10871-6
eBook Packages: Computer Science (R0)