Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

You say 'what', i hear 'where' and 'why': (mis-)interpreting SQL to derive fine-grained provenance

Published: 01 July 2018 Publication History

Abstract

SQL declaratively specifies what the desired output of a query is. This work shows that a non-standard interpretation of the SQL semantics can, instead, disclose where a piece of the output originated in the input and why that piece found its way into the result. We derive such data provenance for very rich SQL dialects---including recursion, windowed aggregates, and user-defined functions---at the fine-grained level of individual table cells. The approach is non-invasive and implemented as a compositional source-level SQL rewrite: an input SQL query is transformed into its own interpreter that wields data dependencies instead of regular values. We deliberately design this transformation to preserve the shape of both data and query, which allows provenance derivation to scale to complex queries without overwhelming the underlying database system.

References

[1]
M. Armbrust, R. Xin, C. Lian, Y. Huai, D. Liu, J. Bradley, X. Meng, T. Kaftan, M. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational Data Processing in Spark. In Proc. SIGMOD, New York, USA, 2015.
[2]
S. Bahareh, S. Feng, B. Glavic, S. Lee, X. Niu, and Q. Zeng. GProM---A Swiss Army Knife for Your Provenance Needs. IEEE Data Engineering Bulletin, 41(1), 2018.
[3]
O. Benjelloun, A. Sarma, A. Halevy, and J. Widom. ULDBs: Databases with Uncertainty and Lineage. In Proc. VLDB, Seoul, Korea, 2006.
[4]
P. Boncz, T. Neumann, and O. Erling. TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark. In Proc. TPCTC, Riva del Garda, Italy, 2013.
[5]
P. Buneman, J. Cheney, and S. Vansummeren. On the Expressiveness of Implicit Provenance in Query and Update Languages. ACM TODS, 33(4), 2008.
[6]
P. Buneman, S. Khanna, and W.-C. Tan. Why and Where: A Characterization of Data Provenance. In Proc. ICDT, London, UK, 2001.
[7]
S. Chambi, D. Lemire, O. Kaser, and R. Godin. Better Bitmap Performance with Roaring Bitmaps. Software: Practice and Experience, 46(11), 2016.
[8]
A. Chapman and H. Jagadish. Why Not? In Proc. SIGMOD, Providence, RI, USA, 2009.
[9]
J. Cheney. Program Slicing and Data Provenance. IEEE Data Engineering Bulletin, 30(4), 2007.
[10]
J. Cheney, A. Ahmed, and U. Acar. Provenance as Dependency Analysis. Mathematical Structures in Computer Science, 21(6), 2011.
[11]
J. Cheney, A. Ahmed, and U. Acar. Database Queries that Explain their Work. In Proc. PPDP, 2014.
[12]
J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases, 1(4), 2007.
[13]
Z. Chothia, J. Liagouris, F. McSherry, and T. Roscoe. Explaining Outputs in Modern Data Analytics. PVLDB, 9(12):1137--1148, 2016.
[14]
Y. Cui and J. Widom. Lineage Tracing for General Data Warehouse Transformations. In Proc. VLDB, Rome, Italy, 2001.
[15]
Y. Cui, J. Widom, and J. Wiener. Tracing the Lineage of View Data in a Warehousing Environment. ACM TODS, 25(2), 2000.
[16]
B. Dietrich, T. Müller, and T. Grust. The Best Bang for Your Bu(ck)g---When SQL Debugging and Data Provenance Go Hand in Hand. In Proc. EDBT, Bordeaux, France, 2016.
[17]
J. Fan, A. Gerald, S. Raj, and J. Patel. The Case Against Specialized Graph Analytics Engines. In Proc. CIDR, Asilomar, CA, USA, 2015.
[18]
S. Fehrenbach and J. Cheney. Language-Integrated Provenance. Science of Computer Programming, 2017.
[19]
J. Freire, P. Bonnet, and D. Shasha. Computational Reproducibility: State-of-the-Art, Challenges, and Database Research Opportunities. In Proc. ACM SIGMOD, Scottsdale, AZ, USA, 2012.
[20]
B. Glavic. Perm: Efficient Provenance Support for Relational Databases. PhD thesis, University of Zürich, Switzerland, 2010.
[21]
B. Glavic and G. Alonso. Perm: Processing Provenance and Data on the Same Data Model Through Query Rewriting. In Proc. ICDE, Shanghai, China, 2009.
[22]
B. Glavic and G. Alonso. Provenance for Nested Subqueries. In Proc. EDBT, Saint Petersburg, Russia, 2009.
[23]
B. Glavic and G. Alonso. Provenance for Nested Subqueries. In Proc. EDBT, Saint Petersburg, Russia, 2009.
[24]
B. Glavic, R. Miller, and G. Alonso. Using SQL for Efficient Generation and Querying of Provenance Information. Lecture Notes in Computer Science, 8000, 2013.
[25]
T. Green, G. Karvounarakis, and V. Tannen. Provenance Semirings. In Proc. PODS, Beijing, China, 2007.
[26]
T. Grust and J. Rittinger. Observing SQL Queries in their Natural Habitat. ACM TODS, 38(1), 2013.
[27]
J. Hellerstein, C. Ré, F. Schoppmann, D. Wang, E. Fratkin, A. Gorajek, K. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib Analytics Library: Or MAD Skills, the SQL. PVLDB, 5(12):1700--1711, 2012.
[28]
S. Helmer and G. Moerkotte. A Performance Study of Four Index Structures for Set-Valued Attributes of Low Cardinality. VLDB Journal, 12(3), 2003.
[29]
M. Herschel, R. Diestelkämper, and H. Lahmar. A Survey on Provenance: What For? What Form? What From? VLDB Journal, 26(6), 2017.
[30]
Z. Hu, H. Iwasaki, M. Takeichi, and A. Takano. Tupling Calculation Eliminates Multiples Data Traversals. In Proc. ICFP, 1997.
[31]
C. Jay. Shape in Computing. ACM Computing Surveys, 28(2), 1996.
[32]
C. Jay and R. Cockett. Shapely Types and Shape Polymorphism. In Proc. ESOP, Edinburgh, UK, 1994.
[33]
G. Karvounarakis, Z. Ives, and V. Tannen. Querying Data Provenance. In Proc. SIGMOD, Indianapolis, USA, 2010.
[34]
A. Kemper and G. Moerkotte. Access Support Relations: An Indexing Method for Object Bases. Information Systems, 17(2), 1992.
[35]
A. Kemper and T. Neumann. HyPer: A Hybrid OLTP/OLAP Main Memory Database System Based on Virtual Memory Snapshots. In Proc. ICDE, Hannover, Germany, 2011.
[36]
T. Müller and T. Grust. Provenance for SQL Based on Abstract Interpretation: Value-less, but Worthwhile. PVLDB, 8(12):1872--1875, 2015.
[37]
T. Neumann and A. Kemper. Unnesting Arbitrary Queries. In Proc. BTW, Hamburg, Germany, 2015.
[38]
X. Niu, B. Arab, S. Lee, F. Su, X. Zou, D. Gawlick, V. Krishnaswamy, Z. Liu, and B. Glavic. Debugging Transactions and Tracking their Provenance with Reenactment. PVLDB, 10(12):1857--1860, 2017.
[39]
X. Niu, R. Kapoor, B. Glavic, D. Gawlick, Z. Liu, V. Krishnaswamy, and V. Radhakrishnan. Provenance-Aware Query Optimization. In Proc. ICDE, San Diego, CA, USA, 2017.
[40]
X. Niu, R. Kapoor, B. Glavic, D. Gawlick, Z. Liu, V. Krishnaswamy, and V. Radhakrishnan. Heuristic and Cost-Based Optimization for Diverse Provenance Tasks. IEEE TKDE, 2018.
[41]
D. O'Grady, T. Müller, and T. Grust. How 'How' Explains What 'What' Computes---How-Provenance for SQL and Query Compilers. In Proc. TaPP, London, UK, 2018.
[42]
K. Ramasamy, J. Patel, J. Naughton, and R. Kaushik. Set Containment Joins: The Good, the Bad, and the Ugly. In Proc. VLDB, Cairo, Egypt, 2000.
[43]
Database Languages-SQL-Part 2: Foundation. ISO/IEC 9075-2:2016.
[44]
W.-C. Tan. Provenance in Databases: Past, Present, and Future. IEEE Data Engineering Bulletin, 32(4), 2007.
[45]
F. Tip. A Survey of Program Slicing Techniques. Technical report, CWI, Amsterdam, The Netherlands, 1994.
[46]
The TPC Benchmark H. tpc.org/tpch.
[47]
P. Valduriez. Join Indices. ACM TODS, 12(2), 1987.
[48]
P. Wadler. Theorems for Free! In Proc. ICFP, London, UK, 1989.
[49]
Y. Wang and S. Madnick. A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. In Proc. VLDB, Brisbane, Australia, 1990.
[50]
M. Weiser. Program Slicing. IEEE Transactions on Software Engineering, SE-10(4), 1984.

Cited By

View all
  • (2022)Provenance-based data skippingProceedings of the VLDB Endowment10.14778/3494124.349413015:3(451-464)Online publication date: 4-Feb-2022
  • (2021)OptDebugProceedings of the ACM Symposium on Cloud Computing10.1145/3472883.3487016(359-372)Online publication date: 1-Nov-2021
  • (2021)To Not Miss the Forest for the Trees - A Holistic Approach for Explaining Missing Answers over Nested DataProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457249(405-417)Online publication date: 9-Jun-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 11, Issue 11
July 2018
507 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 July 2018
Published in PVLDB Volume 11, Issue 11

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 21 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2022)Provenance-based data skippingProceedings of the VLDB Endowment10.14778/3494124.349413015:3(451-464)Online publication date: 4-Feb-2022
  • (2021)OptDebugProceedings of the ACM Symposium on Cloud Computing10.1145/3472883.3487016(359-372)Online publication date: 1-Nov-2021
  • (2021)To Not Miss the Forest for the Trees - A Holistic Approach for Explaining Missing Answers over Nested DataProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457249(405-417)Online publication date: 9-Jun-2021
  • (2019)Language-integrated provenance by trace analysisProceedings of the 17th ACM SIGPLAN International Symposium on Database Programming Languages10.1145/3315507.3330198(74-84)Online publication date: 23-Jun-2019

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media