Colt: concept lineage tool for data flow metadata capture and analysis

Published: 01 August 2017

Abstract

Most organizations are becoming increasingly data-driven, often processing data from many different sources to enable critical business operations. Beyond the well-addressed challenge of storing and processing large volumes of data, financial institutions in particular are increasingly subject to federal regulations requiring high levels of accountability for the accuracy and lineage of this data. For companies like GE Capital, which maintain data across a globally interconnected network of thousands of systems, it is becoming increasingly challenging to capture an accurate understanding of the data flowing between those systems. To address this problem, we designed and developed a concept lineage tool that allows organizational data flows to be modeled, visualized, and interactively explored. This tool has novel features that allow a data flow network to be contextualized in terms of business-specific metadata such as the concept, business, and product to which it applies. Key analysis features have been implemented, including the ability to trace the origination of particular datasets and to discover all systems holding data that meets user-defined criteria. This tool has been readily adopted by users at GE Capital and in a short time has become a business-critical application, with over 2,200 data systems and over 1,000 data flows captured.
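The abstract does not describe how the tool is implemented, so the following is only an illustrative sketch of the two analysis features it names: tracing where a dataset originates, and finding all systems whose flows match some criteria. It assumes a data flow network can be treated as a directed graph whose edges carry concept, business, and product tags. Every name here (`Flow`, `FlowNetwork`, `trace_origins`, `systems_matching`, and the sample systems) is hypothetical and not taken from the paper.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One directed data flow tagged with business metadata (hypothetical model)."""
    source: str
    target: str
    concept: str
    business: str
    product: str


class FlowNetwork:
    """Assumed in-memory stand-in for a data flow network; not the paper's design."""

    def __init__(self, flows):
        self.flows = list(flows)
        # Index flows by target so walking upstream is a dictionary lookup.
        self._incoming = defaultdict(list)
        for flow in self.flows:
            self._incoming[flow.target].append(flow)

    def trace_origins(self, system, concept):
        """Return upstream systems that feed `system` and have no inbound flow for `concept`."""
        origins, seen, queue = set(), {system}, deque([system])
        while queue:
            current = queue.popleft()
            upstream = [f for f in self._incoming.get(current, []) if f.concept == concept]
            if not upstream and current != system:
                origins.add(current)
            for flow in upstream:
                if flow.source not in seen:  # guard against cycles
                    seen.add(flow.source)
                    queue.append(flow.source)
        return origins

    def systems_matching(self, **criteria):
        """Return every system on a flow whose metadata equals all given criteria."""
        hits = set()
        for flow in self.flows:
            if all(getattr(flow, key) == value for key, value in criteria.items()):
                hits.update((flow.source, flow.target))
        return hits


# Made-up sample data, for illustration only.
net = FlowNetwork([
    Flow("loan_origination", "staging", "customer_balance", "lending", "auto_loan"),
    Flow("staging", "warehouse", "customer_balance", "lending", "auto_loan"),
    Flow("crm", "warehouse", "customer_address", "lending", "auto_loan"),
    Flow("warehouse", "regulatory_report", "customer_balance", "lending", "auto_loan"),
])
print(net.trace_origins("regulatory_report", "customer_balance"))
print(net.systems_matching(concept="customer_address"))
```

Filtering the upstream walk by concept is one way to read "concept lineage": the same pair of systems may exchange several concepts, and an origin is only meaningful relative to the concept being traced.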

Cited By

  • (2025) Constructing a metadata knowledge graph as an atlas for demystifying AI pipeline optimization. Frontiers in Big Data 7. DOI: 10.3389/fdata.2024.1476506. Online publication date: 7-Jan-2025
  • (2024) Metadata Management in Data Lake Environments: A Survey. Journal of Library Metadata 24:4 (215-274). DOI: 10.1080/19386389.2024.2359310. Online publication date: 15-Jul-2024
  • (2021) Collecting and visualizing data lineage of Spark jobs. Datenbank-Spektrum 21:3 (179-189). DOI: 10.1007/s13222-021-00387-7. Online publication date: 4-Oct-2021

Published In

Proceedings of the VLDB Endowment  Volume 10, Issue 12
August 2017
427 pages
ISSN:2150-8097

Publisher

VLDB Endowment

Qualifiers

  • Research-article
