research-article

RAMP: a system for capturing and tracing provenance in MapReduce workflows

Published: 01 August 2011

Abstract

RAMP (Reduce And Map Provenance) is an extension to Hadoop that supports provenance capture and tracing for workflows of MapReduce jobs. RAMP uses a wrapper-based approach, requiring little, if any, user intervention in most cases, while retaining Hadoop's parallel execution and fault tolerance. We demonstrate RAMP on a real-world MapReduce workflow generated from a Pig script that performs sentiment analysis over Twitter data. We show how RAMP's automatic provenance capture and tracing capabilities provide a convenient and efficient means of drilling down into and verifying output elements.
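The wrapper idea in the abstract can be illustrated with a small, single-process sketch. It is not RAMP's implementation and uses no Hadoop API: the function names (`run_with_provenance`, `trace_back`, `word_mapper`, `sum_reducer`) and the word-count workload are assumptions made for illustration only. The driver wraps unmodified user map and reduce functions, tags each intermediate pair with the ID of the input record that produced it, and records for every output element the set of contributing input IDs, so that an output can later be traced back to its inputs.

```python
from collections import defaultdict


def run_with_provenance(inputs, mapper, reducer):
    """Run one in-memory map/reduce job and capture provenance.

    Returns (outputs, provenance), where provenance maps each output
    index to the sorted list of input indices that contributed to it.
    The user-supplied mapper and reducer are called unchanged.
    """
    # Map phase: the wrapper tags every intermediate (key, value) pair
    # with the index of the input record that produced it.
    groups = defaultdict(list)  # key -> [(value, input_id), ...]
    for input_id, record in enumerate(inputs):
        for key, value in mapper(record):
            groups[key].append((value, input_id))

    # Reduce phase: the wrapper strips the tags before calling the
    # reducer, then attaches the union of input ids to each output.
    outputs = []
    provenance = {}
    for key in sorted(groups):
        values = [value for value, _ in groups[key]]
        input_ids = sorted({input_id for _, input_id in groups[key]})
        for out in reducer(key, values):
            provenance[len(outputs)] = input_ids
            outputs.append(out)
    return outputs, provenance


def trace_back(output_index, provenance, inputs):
    """Backward tracing: return the input records behind one output."""
    return [inputs[i] for i in provenance[output_index]]


# Hypothetical user functions, written with no knowledge of provenance.
def word_mapper(line):
    for word in line.split():
        yield word, 1


def sum_reducer(key, values):
    yield key, sum(values)


docs = ["good day", "bad day", "good movie"]
outputs, provenance = run_with_provenance(docs, word_mapper, sum_reducer)
for index, out in enumerate(outputs):
    print(out, "<-", trace_back(index, provenance, docs))
```

Because capture lives entirely in the driver, the user functions need no changes, which mirrors the "little, if any, user intervention" property described above. A real system would also have to carry these tags across job boundaries and through Hadoop's shuffle, which this sketch does not attempt.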

Published In

Proceedings of the VLDB Endowment, Volume 4, Issue 12
August 2011
303 pages

Publisher

VLDB Endowment
