A Methodology and Architecture Embedding Quality Assessment in Data Integration

Published: 01 May 2014

Abstract

Data integration aims to combine heterogeneous information sources and to provide interfaces for accessing the integrated resource. It is a collaborative task that may involve many people with different degrees of experience, knowledge of the application domain, and expectations of the integrated resource. These factors can make it difficult to determine and control the quality of an integrated resource. In this article, we propose a data integration methodology that embeds iterative quality assessment and improvement of the integrated resource. We also propose an architecture for the realisation of this methodology. The quality assessment is based on an ontology representation of different users’ quality requirements and of the main elements of the integrated resource. We use description logic as the formal basis for reasoning about users’ quality requirements and for validating that an integrated resource satisfies these requirements. We define quality factors and associated metrics that enable the quality of alternative global schemas for an integrated resource to be assessed quantitatively, and hence allow the improvement resulting from the refinement of a global schema under our methodology to be measured. We evaluate our approach through a large-scale real-life case study in biological data integration, in which an integrated resource is constructed from three autonomous proteomics data sources.

Supplementary Material

PDF File (a17-martin_appendix.pdf)
The proof is given in an electronic appendix, available online in the ACM Digital Library.

Information & Contributors

Information

Published In

Journal of Data and Information Quality  Volume 4, Issue 4
May 2014
97 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/2628135
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2014
Accepted: 01 December 2013
Revised: 01 October 2013
Received: 01 August 2012
Published in JDIQ Volume 4, Issue 4

Author Tags

  1. Data integration
  2. data quality
  3. data quality assessment

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 20
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 27 Nov 2024

Citations

Cited By

  • (2022) Domain-Specific Visual Language for Data Engineering Quality. Proceedings of the 1st ACM SIGPLAN International Workshop on Programming Abstractions and Interactive Notations, Tools, and Environments, 48-56. DOI: 10.1145/3563836.3568727. Online publication date: 29-Nov-2022
  • (2019) Crowdsourced Targeted Feedback Collection for Multicriteria Data Source Selection. Journal of Data and Information Quality 11:1, 1-27. DOI: 10.1145/3284934. Online publication date: 4-Jan-2019
  • (2019) A Framework for the Data Integration of Earthquake Events. IEEE Access 7, 172628-172637. DOI: 10.1109/ACCESS.2019.2957024. Online publication date: 2019
  • (2019) Framework for Construction and Incremental Maintenance of High-Quality Linked Data Mashup. Advances in Conceptual Modeling, 213-221. DOI: 10.1007/978-3-030-34146-6_19. Online publication date: 4-Nov-2019
  • (2018) E-Business-Software – Big-Data-Management mit semantischen Technologien. Media Management, 365-386. DOI: 10.1007/978-3-658-23297-9_21. Online publication date: 8-Aug-2018
  • (2017) Ontology-Based Big Data Management. Systems 5:3, 45. DOI: 10.3390/systems5030045. Online publication date: 6-Jul-2017
  • (2017) Replacing missing values using trustworthy data values from web data sources. Journal of Physics: Conference Series 892, 012009. DOI: 10.1088/1742-6596/892/1/012009. Online publication date: 21-Sep-2017
  • (2016) Quality Evaluation for Big Data: A Scalable Assessment Approach and First Evaluation Results. 2016 Joint Conference of the International Workshop on Software Measurement and the International Conference on Software Process and Product Measurement (IWSM-MENSURA), 115-124. DOI: 10.1109/IWSM-Mensura.2016.026. Online publication date: Oct-2016
  • (2016) Some Future Directions for Business Process Modeling. Quality in Business Process Modeling, 227-239. DOI: 10.1007/978-3-319-42512-2_6. Online publication date: 28-Oct-2016
  • (2015) Document and Corpus Quality Challenges for Knowledge Management in Engineering Enterprises. Journal of Data and Information Quality 6:4, 1-3. DOI: 10.1145/2818379. Online publication date: 19-Oct-2015