Abstract
Data provenance can support the understanding and debugging of complex data processing pipelines, which are for instance common in data integration scenarios. One task in data integration is entity resolution (ER), i.e., the identification of multiple representations of a same real world entity. This paper focuses of provenance modeling and capture for typical ER tasks. While our definition of ER provenance is independent of the actual language or technology used to define an ER task, the method we implement as a proof of concept instruments ER rules specified in HIL, a high-level data integration language.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Christen, P.: Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31164-2
Hernández, M.A., Koutrika, G., Krishnamurthy, R., Popa, L., Wisnesky, R.: HIL: a high-level scripting language for entity integration. In: EDBT (2013)
Grumbach, S., Milo, T.: Towards Tractable Algebras for Bags. In: PODS (1993)
Camacho-Rodríguez, J., Colazzo, D., Herschel, M., Manolescu, I., Roy Chowdhury, S.: Reuse-based optimization for pig latin. In: CIKM (2016)
Herschel, M., Diestelkämper, R., Lahmar, H.B.: A survey on provenance: what for? What form? What from? VLDB J. (2017)
Acknowledgements
The authors thank the German Research Foundation (DFG) for financial support within project D03 of SFB/ Transregio 161. This research was also partly funded by an IBM Faculty Award.
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Oppold, S., Herschel, M. (2018). Provenance for Entity Resolution. In: Belhajjame, K., Gehani, A., Alper, P. (eds) Provenance and Annotation of Data and Processes. IPAW 2018. Lecture Notes in Computer Science(), vol 11017. Springer, Cham. https://doi.org/10.1007/978-3-319-98379-0_25
Download citation
DOI: https://doi.org/10.1007/978-3-319-98379-0_25
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98378-3
Online ISBN: 978-3-319-98379-0
eBook Packages: Computer ScienceComputer Science (R0)