In data integration efforts, and in portal development in particular, much development time is devoted to entity resolution. Advanced similarity measurement techniques are often used to remove semantic duplicates or resolve other semantic conflicts. It proves impossible, however, to get rid of all semantic problems automatically. An often-used rule of thumb states that about 90% of the development effort is devoted to semi-automatically resolving the remaining 10% of hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that strives for a ‘good enough’ initial integration, which stores any remaining semantic uncertainty and conflicts in a probabilistic database. The remaining cases are to be resolved with user feedback at query time. The main contribution of this paper is an experimental investigation of the effects and sensitivity of rule definition, threshold tuning, and user feedback on the integration quality. We claim that our approach indeed reduces development effort, rather than merely shifting it, by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ initial integration that can be meaningfully used, and that user feedback is effective in gradually improving the integration quality.
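To make the idea of rough safe thresholds concrete, the following is a minimal illustrative sketch, not the system described in the paper. It assumes a stand-in string similarity from Python's standard library, two hypothetical thresholds `low` and `high`, and the simplification that the similarity score itself serves as the probability of a match. The names `Decision`, `integrate`, and `apply_feedback` are invented for this illustration. Pairs scoring outside the thresholds are decided automatically; pairs in between keep their uncertainty until a user resolves them.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Decision:
    a: str
    b: str
    p_same: float  # probability that a and b denote the same real-world entity


def similarity(a: str, b: str) -> float:
    """Stand-in similarity measure (an assumption, not the paper's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def integrate(pairs, low=0.3, high=0.9):
    """Classify candidate pairs with two rough thresholds (hypothetical values)."""
    decisions = []
    for a, b in pairs:
        s = similarity(a, b)
        if s >= high:
            p = 1.0  # safely treated as the same entity
        elif s <= low:
            p = 0.0  # safely treated as distinct entities
        else:
            p = s    # uncertain: keep both alternatives; using s as the
                     # probability is a simplifying assumption
        decisions.append(Decision(a, b, p))
    return decisions


def apply_feedback(decision: Decision, same: bool) -> Decision:
    """User feedback at query time removes the uncertainty for one pair."""
    return Decision(decision.a, decision.b, 1.0 if same else 0.0)


if __name__ == "__main__":
    candidates = [("Jaws", "Jaws"), ("Jaws", "Jaws 2"), ("Jaws", "Alien")]
    for d in integrate(candidates):
        print(d)
```

Widening the gap between `low` and `high` leaves more pairs uncertain, which costs less effort up front and defers more decisions to feedback at query time.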
Cite this article
van Keulen, M., de Keijzer, A. Qualitative effects of knowledge rules and user feedback in probabilistic data integration. The VLDB Journal 18, 1191–1217 (2009). https://doi.org/10.1007/s00778-009-0156-z
DOI: https://doi.org/10.1007/s00778-009-0156-z