research-article

Easy and effective parallel programmable ETL

Published: 28 October 2011

Abstract

Extract-Transform-Load (ETL) programs are used to load data into data warehouses (DWs). An ETL program must extract data from sources, apply different transformations to it, and use the DW to look up and insert data. Both developing and running an ETL program are time consuming. Typically, however, an ETL program can exploit both task parallelism and data parallelism to run faster. This, on the other hand, lengthens development, since creating a parallel ETL program is complex. To remedy this situation, we propose efficient ways to parallelize typical ETL tasks and implement these new constructs in an ETL framework. The constructs are easy to apply and require only a few modifications to an ETL program to parallelize it. They support both task and data parallelism and give the programmer different possibilities to choose from. An experimental evaluation shows that by using a little more CPU time, the (wall-clock) time to run an ETL program can be greatly reduced.
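
To make the idea of data parallelism in an ETL flow concrete, the following is a minimal sketch using only Python's standard concurrent.futures module. It is not the framework's API or the constructs proposed in the paper: the functions extract, transform, load, and run_etl, the in-memory source rows, and the list standing in for a DW table are all hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def extract():
    """Yield source rows (hypothetical in-memory source, not a real system)."""
    for i in range(8):
        yield {"id": i, "price": "%d.50" % i}


def transform(row):
    """Convert the price string to a float; each row is independent of the others."""
    return {"id": row["id"], "price": float(row["price"])}


def load(rows, warehouse):
    """Append transformed rows to a list standing in for a DW fact table."""
    for row in rows:
        warehouse.append(row)


def run_etl(workers=4):
    """Run extract -> transform -> load, transforming rows on a worker pool."""
    warehouse = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Data parallelism: the same transformation is applied to many rows
        # concurrently. Executor.map returns results in input order.
        transformed = pool.map(transform, extract())
        load(transformed, warehouse)
    return warehouse


if __name__ == "__main__":
    for row in run_etl():
        print(row)
```

Only the transformation step changes compared with a sequential loop, which mirrors the abstract's point that few modifications should be needed. Threads are used here to keep the sketch simple; in CPython, CPU-bound transformations would likely need separate processes to obtain a real speedup.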



Published In

DOLAP '11: Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP
October 2011
112 pages
ISBN:9781450309639
DOI:10.1145/2064676
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. extract-transform-load (etl)
  2. parallelism
  3. python

Qualifiers

  • Research-article

Conference

CIKM '11

Acceptance Rates

Overall Acceptance Rate 29 of 79 submissions, 37%


Cited By

  • (2021) Demand-Driven Data Provisioning in Data Lakes. The 23rd International Conference on Information Integration and Web Intelligence, pp. 187-198. DOI: 10.1145/3487664.3487784. Online publication date: 29-Nov-2021
  • (2021) pygrametl: A Powerful Programming Framework for Easy Creation and Testing of ETL Flows. Transactions on Large-Scale Data- and Knowledge-Centered Systems XLVIII, pp. 45-84. DOI: 10.1007/978-3-662-63519-3_3. Online publication date: 18-May-2021
  • (2019) Parallelizing user-defined functions in the ETL workflow using orchestration style sheets. International Journal of Applied Mathematics and Computer Science 29:1, pp. 69-79. DOI: 10.2478/amcs-2019-0005. Online publication date: 1-Mar-2019
  • (2019) Efficient incremental loading in ETL processing for real-time data integration. Innovations in Systems and Software Engineering. DOI: 10.1007/s11334-019-00344-4. Online publication date: 15-May-2019
  • (2019) Empirical Analysis of Programmable ETL Tools. Computational Intelligence, Communications, and Business Analytics, pp. 267-277. DOI: 10.1007/978-981-13-8581-0_22. Online publication date: 26-Jun-2019
  • (2018) Real-time ETL in Striim. Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics, pp. 1-10. DOI: 10.1145/3242153.3242157. Online publication date: 27-Aug-2018
  • (2018) Programmatic ETL. Business Intelligence and Big Data, pp. 21-50. DOI: 10.1007/978-3-319-96655-7_2. Online publication date: 15-Jul-2018
  • (2018) Extraction, Transformation, and Loading. Encyclopedia of Database Systems, pp. 1432-1440. DOI: 10.1007/978-1-4614-8265-9_158. Online publication date: 7-Dec-2018
  • (2017) ETLator - a scripting ETL framework. 2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1349-1354. DOI: 10.23919/MIPRO.2017.7973632. Online publication date: May-2017
  • (2017) From conceptual design to performance optimization of ETL workflows. The VLDB Journal: The International Journal on Very Large Data Bases 26:6, pp. 777-801. DOI: 10.1007/s00778-017-0477-2. Online publication date: 1-Dec-2017
