research-article

On-the-fly entity-aware query processing in the presence of linkage

Published: 01 September 2010

Abstract

Entity linkage is central to almost every data integration and data cleaning scenario. Traditional techniques use some computed similarity among data structures to perform merges and then answer queries on the merged data. We describe a novel framework for entity linkage with uncertainty. Instead of using the linkage information to merge structures a priori, possible linkages are stored alongside the data with their belief values. A new probabilistic query answering technique is used to take the probabilistic linkage into consideration. The framework introduces a series of novelties: (i) it performs merges at run time based not only on existing linkages but also on the given query; (ii) it allows results that may contain structures not explicitly represented in the data, but generated as a result of reasoning over the linkages; and (iii) it enables an evaluation of the query conditions that spans across linked structures, a functionality not currently supported by traditional probabilistic databases. We formally define the semantics, describe an efficient implementation, and report on the findings of our experimental evaluation.
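
The idea of keeping possible linkages next to the data and merging only at query time can be illustrated with a small Python sketch. It is not the paper's algorithm or its efficient implementation: it brute-forces every possible world, assumes the linkages are independent, and all names (`ENTITIES`, `LINKAGES`, `QUERY`, `answer`) and the sample data are hypothetical.

```python
from itertools import product

# Hypothetical sample data: each entity maps attribute names to value sets.
ENTITIES = {
    "e1": {"name": {"J. Smith"}, "city": {"Trento"}},
    "e2": {"name": {"John Smith"}, "job": {"professor"}},
    "e3": {"name": {"J. Smyth"}, "job": {"engineer"}},
}
# Possible linkages stored alongside the data: (entity, entity, belief value).
LINKAGES = [("e1", "e2", 0.8), ("e1", "e3", 0.3)]
# Conjunctive query: every (attribute, value) pair must hold.
QUERY = {("city", "Trento"), ("job", "professor")}


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def answer(entities, linkages, query):
    """Return {merged entity ids: probability that this merge exists and matches}."""
    results = {}
    # Enumerate every possible world: each linkage is either accepted or not.
    for choice in product([True, False], repeat=len(linkages)):
        prob = 1.0
        parent = {e: e for e in entities}
        for (a, b, belief), accepted in zip(linkages, choice):
            prob *= belief if accepted else 1.0 - belief
            if accepted:
                parent[find(parent, a)] = find(parent, b)
        # Group entities into the merged structures of this world.
        clusters = {}
        for e in entities:
            clusters.setdefault(find(parent, e), set()).add(e)
        # Evaluate the query on each merged structure.
        for members in clusters.values():
            merged = {
                (attr, value)
                for e in members
                for attr, values in entities[e].items()
                for value in values
            }
            if query <= merged:
                key = frozenset(members)
                results[key] = results.get(key, 0.0) + prob
    return results


answers = answer(ENTITIES, LINKAGES, QUERY)
for members, prob in sorted(answers.items(), key=lambda kv: -kv[1]):
    print(sorted(members), round(prob, 2))
```

In this toy data no single stored entity satisfies both conditions. The answers are merged structures such as `{e1, e2}`, which exist only as a consequence of the linkages, and the query conditions are evaluated across the linked entities (`city` comes from `e1`, `job` from `e2`).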

Published In

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages
ISSN: 2150-8097
Editors: Elisa Bertino, Paolo Atzeni, Kian Lee Tan, Yi Chen, Y. C. Tay

Publisher

VLDB Endowment


Cited By

  • (2023) A Randomized Blocking Structure for Streaming Record Linkage. Proceedings of the VLDB Endowment 16:11, 2783-2791. DOI: 10.14778/3611479.3611487. Online publication date: 24-Aug-2023
  • (2023) FlexER: Flexible Entity Resolution for Multiple Intents. Proceedings of the ACM on Management of Data 1:1, 1-27. DOI: 10.1145/3588722. Online publication date: 30-May-2023
  • (2020) An Overview of End-to-End Entity Resolution for Big Data. ACM Computing Surveys 53:6, 1-42. DOI: 10.1145/3418896. Online publication date: 6-Dec-2020
  • (2019) EMBench++. Semantic Web 10:2, 435-450. DOI: 10.3233/SW-180331. Online publication date: 1-Jan-2019
  • (2018) Keyword Search with Real-time Entity Resolution in Relational Databases. Proceedings of the 2018 10th International Conference on Machine Learning and Computing, 134-139. DOI: 10.1145/3195106.3195171. Online publication date: 26-Feb-2018
  • (2017) Multi-source uncertain entity resolution. Information Systems 65:C, 124-136. DOI: 10.5555/3050918.3050953. Online publication date: 1-Apr-2017
  • (2017) Holistic query evaluation over information extraction pipelines. Proceedings of the VLDB Endowment 11:2, 217-229. DOI: 10.14778/3149193.3149201. Online publication date: 1-Oct-2017
  • (2017) QDA. IEEE Transactions on Knowledge and Data Engineering 29:2, 402-417. DOI: 10.1109/TKDE.2016.2623607. Online publication date: 1-Feb-2017
  • (2016) Searching Web 2.0 Data Through Entity-Based Aggregation. Transactions on Computational Collective Intelligence XXI - Volume 9630, 159-174. DOI: 10.5555/3090176.3090183. Online publication date: 1-Jan-2016
  • (2016) Entity-Based Keyword Search in Web Documents. Transactions on Computational Collective Intelligence XXI - Volume 9630, 21-49. DOI: 10.5555/3090176.3090178. Online publication date: 1-Jan-2016
