Entity Resolution in Texts Using Statistical Learning and Ontologies

Tadej Štajner & Dunja Mladenić

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5926)

Included in the following conference series:

Asian Semantic Web Conference

Abstract

Ambiguities, which are inherently present in natural languages, make it challenging to determine the actual identities of entities mentioned in a document (e.g., Paris can refer to a city in France, but it can also refer to a small city in Texas, USA, or to the 1984 film Paris, Texas, directed by Wim Wenders). Disambiguation is a problem that entity resolution methods can solve successfully.

This paper studies various methods for estimating relatedness between entities, as used in collective entity resolution. We define a unified entity resolution approach that can use implicit as well as explicit relatedness to collectively identify in-text entities. As a relatedness measure, we propose a method that expresses relatedness using the heterogeneous relations of a domain ontology. We also experiment with other relatedness measures, such as statistical learning of co-occurrences of two entities and content similarity between them. Evaluation on real data shows that the new methods for relatedness estimation give good results.
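The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of collective resolution with a pluggable relatedness measure, not the authors' method. It assumes a toy knowledge base, a hypothetical candidate list per mention, cosine similarity over bag-of-words descriptions as a stand-in for content similarity, and exhaustive search over joint assignments. All names (`DESCRIPTIONS`, `CANDIDATES`, `content_similarity`, `resolve_collectively`) are invented for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Hypothetical toy knowledge base: entity id -> short textual description.
DESCRIPTIONS = {
    "Paris_(France)": "capital city of France on the Seine river",
    "Paris_(Texas)": "small city in Texas USA",
    "Paris,_Texas_(film)": "1984 film directed by Wim Wenders",
    "Wim_Wenders": "German film director of the 1984 film Paris Texas",
    "France": "European country whose capital is Paris on the Seine",
}

# Hypothetical candidate entities for each surface form found in a text.
CANDIDATES = {
    "Paris": ["Paris_(France)", "Paris_(Texas)", "Paris,_Texas_(film)"],
    "Wim Wenders": ["Wim_Wenders"],
    "France": ["France"],
}


def bag_of_words(text):
    """Lowercased term-frequency vector of a description."""
    return Counter(text.lower().split())


def content_similarity(entity_a, entity_b):
    """Cosine similarity between the descriptions of two entities."""
    va = bag_of_words(DESCRIPTIONS[entity_a])
    vb = bag_of_words(DESCRIPTIONS[entity_b])
    dot = sum(count * vb[term] for term, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def resolve_collectively(mentions, relatedness):
    """Pick one candidate per mention so that the summed pairwise
    relatedness of the chosen entities is maximal (exhaustive search)."""
    best_assignment, best_score = None, float("-inf")
    for assignment in product(*(CANDIDATES[m] for m in mentions)):
        score = sum(
            relatedness(a, b)
            for i, a in enumerate(assignment)
            for b in assignment[i + 1:]
        )
        if score > best_score:
            best_assignment, best_score = assignment, score
    return dict(zip(mentions, best_assignment))


# "Paris" is resolved jointly with the other mention in the same text.
print(resolve_collectively(["Paris", "Wim Wenders"], content_similarity))
print(resolve_collectively(["Paris", "France"], content_similarity))
```

Because `relatedness` is passed in as a function, a co-occurrence-based or ontology-based measure could be swapped in without changing the search. Exhaustive search is exponential in the number of mentions and is used here only to keep the sketch short.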

Author information

Authors and Affiliations

Jožef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Tadej Štajner & Dunja Mladenić

Editor information

Editors and Affiliations

Facultad de Informática, Dpto. de Inteligencia Artificial, Ontology Engineering Group, Universidad Politécnica de Madrid, Campus de Montegancedo s/n, 28660, Boadilla del Monte, Madrid
Asunción Gómez-Pérez
Shanghai Jiao Tong University, 200030, Shanghai, China
Yong Yu
Indiana University, USA
Ying Ding

About this paper

Cite this paper

Štajner, T., Mladenić, D. (2009). Entity Resolution in Texts Using Statistical Learning and Ontologies. In: Gómez-Pérez, A., Yu, Y., Ding, Y. (eds) The Semantic Web. ASWC 2009. Lecture Notes in Computer Science, vol 5926. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10871-6_7

DOI: https://doi.org/10.1007/978-3-642-10871-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10870-9
Online ISBN: 978-3-642-10871-6
eBook Packages: Computer Science, Computer Science (R0)
