Lineage tracing for general data warehouse transformations

Published: 01 May 2003

Abstract

Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information. During the integration process, source data typically undergoes a series of transformations, which may range from simple algebraic operations or aggregations to complex “data cleansing” procedures. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. We formally define the lineage tracing problem in the presence of general data warehouse transformations, and we present algorithms for lineage tracing in this environment. Our tracing procedures take advantage of known structure or properties of transformations when present, but also work in the absence of such information. Our results can be used as the basis for a lineage tracing tool in a general warehousing setting, and can also guide the design of data warehouses that enable efficient lineage tracing.
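
As a rough illustration of the tracing problem described above, the sketch below contrasts two situations: a transformation with a known property (assumed here to process each input item independently) and an opaque transformation about which nothing is known. The function names `trace_itemwise` and `trace_opaque`, the `cleanse` step, and the sample records are hypothetical and chosen only for illustration; this is not the paper's algorithm or terminology.

```python
from typing import Callable, Iterable, List, Sequence, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def trace_itemwise(
    per_item: Callable[[In], Iterable[Out]],
    source: Sequence[In],
    target: Out,
) -> List[In]:
    """Lineage under the ASSUMPTION that items are transformed independently.

    An input item is reported when applying the per-item transformation to
    that item alone yields the target output item.
    """
    return [
        item
        for item in source
        if any(out == target for out in per_item(item))
    ]


def trace_opaque(source: Sequence[In]) -> List[In]:
    """Conservative lineage when nothing is known about the transformation.

    Without structural information any input item may have contributed,
    so the entire input is reported.
    """
    return list(source)


# Hypothetical cleansing step: normalize a (name, city) record, drop blanks.
def cleanse(record):
    name, city = record
    if not name.strip():
        return []
    return [(name.strip().title(), city.strip().upper())]


# Hypothetical source data and the warehouse contents derived from it.
orders = [(" alice ", "paris"), ("ALICE", " Paris"), ("bob", "rome"), ("  ", "oslo")]
warehouse = sorted({out for rec in orders for out in cleanse(rec)})

# Trace one warehouse item back to the source records that produce it.
lineage = trace_itemwise(cleanse, orders, ("Alice", "PARIS"))
```

In this toy setting the known property narrows the lineage of `("Alice", "PARIS")` to the two source records that normalize to it, whereas `trace_opaque` can only return all four records.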

Information & Contributors

Information

Published In

The VLDB Journal — The International Journal on Very Large Data Bases  Volume 12, Issue 1
May 2003
85 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Data lineage
  2. Data warehouse
  3. Inverse
  4. Lineage tracing
  5. Transformation

Qualifiers

  • Article
