A survey of data provenance in e-science

Published: 01 September 2005

Abstract

Data management is growing in complexity as large-scale applications take advantage of the loosely coupled resources brought together by grid middleware and by abundant storage capacity. Metadata describing the data products used in and generated by these applications is essential to disambiguate the data and enable reuse. Data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources. In this paper we create a taxonomy of data provenance characteristics and apply it to current research efforts in e-science, focusing primarily on scientific workflow approaches. The main aspect of our taxonomy categorizes provenance systems based on why they record provenance, what they describe, how they represent and store provenance, and ways to disseminate it. The survey culminates with an identification of open research problems in the field.
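As an illustration of the definition above, the following minimal sketch treats provenance as a derivation history that can be walked back from a data product to its original sources. It is not taken from the paper: the `Derivation` record layout, the `original_sources` helper, and all file and process names are hypothetical, and the history is assumed to be acyclic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    """One recorded derivation step (hypothetical layout, not from the paper)."""
    product: str                   # data product that was generated
    process: str                   # process or workflow step that generated it
    inputs: tuple[str, ...] = ()   # data products the process consumed


def original_sources(product: str, history: dict[str, Derivation]) -> set[str]:
    """Walk the derivation history back to products with no recorded inputs.

    Assumes the history is acyclic; a cycle would recurse without end.
    """
    step = history.get(product)
    if step is None or not step.inputs:
        return {product}
    sources: set[str] = set()
    for parent in step.inputs:
        sources |= original_sources(parent, history)
    return sources


# Hypothetical two-step derivation history, keyed by generated product.
history = {
    d.product: d
    for d in [
        Derivation("forecast.nc", "run_model", ("cleaned.csv", "config.yaml")),
        Derivation("cleaned.csv", "clean", ("raw_sensor.csv",)),
    ]
}

print(sorted(original_sources("forecast.nc", history)))
# → ['config.yaml', 'raw_sensor.csv']
```

Keying the history by product keeps each lookup simple, at the cost of assuming every product has at most one recorded derivation.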

Published In

ACM SIGMOD Record, Volume 34, Issue 3
September 2005
115 pages
ISSN:0163-5808
DOI:10.1145/1084805

Publisher

Association for Computing Machinery

New York, NY, United States
