Abstract
We examine provenance in the context of a distributed job execution system. Capturing provenance information while a job executes in a distributed environment is crucial, because this information is often lost once the job has finished. In this paper we discuss the types of information available within a distributed job execution system, how to capture that information, and what burdens capturing it places on the user and the system. We identify the data that we consider essential to capture and discuss the collection of provenance in the Quill++ project of Condor. We conclude that it is possible to capture important provenance information in a distributed job execution system with relatively little intrusion on the user or the system.
To be published in: Proceedings of the International Provenance and Annotation Workshop, May 3-5, 2006, Chicago, IL. In: Lecture Notes in Computer Science.
© 2006 Springer-Verlag Berlin Heidelberg
Reilly, C.F., Naughton, J.F. (2006). Exploring Provenance in a Distributed Job Execution System. In: Moreau, L., Foster, I. (eds) Provenance and Annotation of Data. IPAW 2006. Lecture Notes in Computer Science, vol 4145. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11890850_24
DOI: https://doi.org/10.1007/11890850_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46302-3
Online ISBN: 978-3-540-46303-0
eBook Packages: Computer Science (R0)