Authors:
Matheus Vieira 1; Thiago de Oliveira 2; Leandro Cicco 2; Daniel de Oliveira 1 and Marcos Bedo 1
Affiliations:
1 Institute of Computing, Fluminense Federal University, Brazil; 2 Information Technology Superintendence, Fluminense Federal University, Brazil
Keyword(s):
Data Warehousing, ETL, Provenance, Data Quality, Business Intelligence.
Abstract:
Business intelligence processes running over Data Warehouses (BIDW) rely heavily on high-quality, structured data to support decision-making and prescriptive analytics. In this study, we discuss the coupling of provenance mechanisms into the BIDW Extract-Transform-Load (ETL) stage to provide lineage tracking and data auditing, which (i) enhances the debugging of data transformations and (ii) facilitates issuing data accountability reports and dashboards. These two features are particularly beneficial for BIDWs tailored to assist managers and counselors in universities and other educational institutions, as systematic auditing processes and accountability delineation depend on data quality and tracking. To validate the usefulness of provenance in this domain, we introduce the ProvETL tool, which extends a BIDW with provenance support, enabling the monitoring of user activities and data transformations, along with the compilation of an execution summary for each ETL task. Accordingly, ProvETL offers an additional BIDW analytical layer that allows visualizing data flows through provenance graphs. The exploration of such graphs provides details on data lineage and the execution of transformations, spanning from the insertion of input data into BIDW dimensional tables to the final BIDW fact tables. We showcased ProvETL capabilities in three real-world scenarios using a BIDW from our university: personnel admission, public information in paycheck reports, and staff dismissals. The results indicate that the solution contributed to spotting poor-quality data in each evaluated scenario. ProvETL also promptly pinpointed the transformation summary, elapsed time, and the attending user for every data flow, keeping the provenance collection overhead within milliseconds.
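The kind of provenance capture described above can be illustrated with a minimal Python sketch. It is not ProvETL's implementation: the `ProvenanceLog` class, the `tracked` decorator, and the table and user names are all hypothetical, chosen only to show how an ETL step might record its user, elapsed time, row counts, and a lineage edge from source to target table.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ProvenanceLog:
    """Hypothetical in-memory store of execution summaries and lineage edges."""
    records: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (source, task, target)

    def lineage(self, target):
        """Return the tasks that produced `target`, upstream first."""
        path = []
        for source, task, tgt in self.edges:
            if tgt == target:
                path = self.lineage(source) + [task]
        return path


def tracked(log, user, source, target):
    """Wrap a transformation so each call leaves a provenance record."""
    def decorator(transform):
        def wrapper(rows):
            start = time.perf_counter()
            result = transform(rows)
            elapsed_ms = (time.perf_counter() - start) * 1000
            # Execution summary for this ETL task (illustrative fields).
            log.records.append({
                "task": transform.__name__,
                "user": user,
                "rows_in": len(rows),
                "rows_out": len(result),
                "elapsed_ms": elapsed_ms,
            })
            # Lineage edge: source table -> task -> target table.
            log.edges.append((source, transform.__name__, target))
            return result
        return wrapper
    return decorator


log = ProvenanceLog()


@tracked(log, user="etl_operator", source="staging.admissions", target="dim_employee")
def drop_missing_ids(rows):
    return [r for r in rows if r.get("employee_id") is not None]


@tracked(log, user="etl_operator", source="dim_employee", target="fact_admission")
def to_fact_rows(rows):
    return [{"employee_id": r["employee_id"], "admitted": 1} for r in rows]


staged = [{"employee_id": 1}, {"employee_id": None}, {"employee_id": 2}]
facts = to_fact_rows(drop_missing_ids(staged))
```

In this sketch, `log.records` plays the role of a per-task execution summary, and `log.lineage("fact_admission")` walks the recorded edges back from a fact table to the staging input. A drop between `rows_in` and `rows_out` for a task is one simple signal of poor-quality input that such a record could surface.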