research-article

RHEEM: enabling cross-platform data processing: may the big data be with you!

Authors:

Bertty Contreras-Rojas,

Ahmed Elmagarmid,

Sebastian Kruse,

Mourad Ouzzani,

Jorge-Arnulfo Quiané-Ruiz,

Saravanan Thirumuruganathan,

Anis TroudiAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 11, Issue 11

Pages 1414 - 1427

https://doi.org/10.14778/3236187.3236195

Published: 01 July 2018 Publication History

Abstract

Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging: finding the most efficient platform for a given task requires quite good expertise for all the available platforms. We present Rheem, a general-purpose cross-platform data processing system that decouples applications from the underlying platforms. It not only determines the best platform to run an incoming task, but also splits the task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). It features (i) an interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Using different real-world applications with Rheem, we demonstrate how cross-platform data processing can accelerate performance by more than one order of magnitude compared to single-platform data processing.

References

[1]

Apache Beam. https://beam.apache.org.

[2]

Apache Drill. https://drill.apache.org.

[3]

Apache Flink. https://flink.apache.org.

[4]

Apache Flume. https://flume.apache.org.

[5]

Apache HBase. http://hbase.apache.org/.

[6]

Apache Hive: A data warehouse software for distributed storage. http://hive.apache.org.

[7]

Apache Mahout. http://mahout.apache.org.

[8]

Apache Spark: Lightning-Fast Cluster Computing. http://spark.incubator.apache.org/.

[9]

Fortune magazine. http://fortune.com/2014/06/19/big-data-airline-industry/.

[10]

Luigi Project. https://github.com/spotify/luigi.

[11]

PostgreSQL. http://www.postgresql.org/.

[12]

PrestoDB Project. https://prestodb.io.

[13]

Spark MLlib: http://spark.apache.org/mllib.

[14]

Spark SQL programming guide. http://spark.apache.org/docs/latest/sql-programming-guide.html.

[15]

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, pages 265--283, 2016.

Digital Library

[16]

D. Agrawal, L. Ba, L. Berti-Equille, S. Chawla, A. Elmagarmid, H. Hammady, Y. Idris, Z. Kaoudi, Z. Khayyat, S. Kruse, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, N. Tang, and M. Zaki. Rheem: Enabling Multi-Platform Task Execution. In SIGMOD, pages 2069--2072, 2016.

Digital Library

[17]

D. Agrawal, S. Chawla, A. K. Elmagarmid, Z. Kaoudi, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and M. J. Zaki. Road to Freedom in Big Data Analytics. In EDBT, pages 479--484, 2016.

[18]

A. Alexandrov, R. Bergmann, S. Ewen, J. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinländer, M. J. Sax, S. Schelter, M. Höger, K. Tzoumas, and D. Warneke. The Stratosphere platform for big data analytics. VLDB J., 23(6):939--964, 2014.

Digital Library

[19]

A. Baaziz and L. Quoniam. How to use big data technologies to optimize operations in upstream petroleum industry. In 21<sup>st</sup> World Petroleum Congress, 2014.

[20]

M. Boehm, M. Dusenberry, D. Eriksson, A. V. Evfimievski, F. M. Manshadi, N. Pansare, B. Reinwald, F. Reiss, P. Sen, A. Surve, and S. Tatikonda. SystemML: Declarative Machine Learning on Spark. PVLDB, 9(13):1425--1436, 2016.

Digital Library

[21]

O. A. Bukhres, J. Chen, W. Du, A. K. Elmagarmid, and R. Pezzoli. InterBase: An Execution Environment for Heterogeneous Software Systems. IEEE Computer, 26(8):57--69, 1993.

Digital Library

[22]

M. J. Carey et al. Towards Heterogeneous Multimedia Information Systems: The Garlic Approach. In RIDE-DOM, pages 124--131, 1995.

Digital Library

[23]

S. S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. D. Ullman, and J. Widom. The TSIMMIS Project: Integration of Heterogeneous Information Sources. In IPSJ, pages 7--18, 1994.

[24]

H. Chen, R. Kazman, and S. Haziyev. Strategic prototyping for developing big data systems. IEEE Software, 33(2):36--43, 2016.

Digital Library

[25]

M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: a commodity data cleaning system. In SIGMOD, pages 541--552, 2013.

Digital Library

[26]

J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 2008.

Digital Library

[27]

D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang. The Data Civilizer System. In CIDR, 2017.

[28]

D. J. DeWitt, A. Halverson, R. V. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Split query processing in polybase. In SIGMOD, pages 1255--1266, 2013.

Digital Library

[29]

K. Doka, N. Papailiou, V. Giannakouris, D. Tsoumakos, and N. Koziris. Mix 'n' match multi-engine analytics. In IEEE BigData, pages 194--203, 2016.

[30]

A. J. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Çetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska, S. Madden, D. Maier, T. G. Mattson, S. Papadopoulos, J. Parkhurst, N. Tatbul, M. Vartak, and S. Zdonik. A Demonstration of the BigDAWG Polystore System. PVLDB, 8(12):1908--1911, 2015.

Digital Library

[31]

W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional Functional Dependencies for Capturing Data Inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):6:1--6:48, 2008.

Digital Library

[32]

R. C. Fernandez, D. Deng, E. Mansour, A. A. Qahtan, W. Tao, Z. Abedjan, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. A Demo of the Data Civilizer System. In SIGMOD, pages 1639--1642, 2017.

Digital Library

[33]

M. Franklin. Making sense of big data with the berkeley data analytics stack. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM), pages 1--2, 2015.

Digital Library

[34]

I. Gog, M. Schwarzkopf, N. Crooks, M. P. Grosvenor, A. Clement, and S. Hand. Musketeer: all for one, one for all in data processing systems. In EuroSys, 2015.

Digital Library

[35]

B. Haynes, A. Cheung, and M. Balazinska. PipeGen: Data Pipe Generator for Hybrid Analytics. In SoCC, pages 470--483, 2016.

Digital Library

[36]

A. Hems, A. Soofi, and E. Perez. How innovative oil and gas companies are using big data to outmaneuver the competition. Microsoft White Paper, http://goo.gl/2Bn0xq, 2014.

[37]

F. Hueske, M. Peters, M. J. Sax, A. Rheinländer, R. Bergmann, A. Krettek, and K. Tzoumas. Opening the black boxes in data flow optimization. PVLDB, 5(11):1256--1267, 2012.

Digital Library

[38]

IBM. Data-driven healthcare organizations use big data analytics for big gains. White paper, http://goo.gl/AFIHpk.

[39]

Z. Kaoudi and J.-A. Quiané-Ruiz. Cross-Platform Data Processing: Use Cases and Challenges. In ICDE (tutorial), 2018.

[40]

Z. Kaoudi, J.-A. Quiane-Ruiz, S. Thirumuruganathan, S. Chawla, and D. Agrawal. A Cost-based Optimizer for Gradient Descent Optimization. In SIGMOD, 2017.

Digital Library

[41]

Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and S. Yin. BigDansing: A System for Big Data Cleansing. In SIGMOD, pages 1215--1230, 2015.

Digital Library

[42]

Z. Khayyat, W. Lucia, M. Singh, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and P. Kalnis. Lightning Fast and Space Efficient Inequality Joins. PVLDB, 8(13):2074--2085, 2015.

Digital Library

[43]

S. Kruse, Z. Kaoudi, J.-A. Quiané-Ruiz, S. Chawla, F. Naumann, and B. Contreras-Rojas. RHEEMix in the Data Jungle - A Cross-Platform Query Optimizer. arXiv: 1805.03533 https://arxiv.org/abs/1805.03533, 2018.

[44]

J. LeFevre, J. Sankaranarayanan, H. Hacigümüs, J. Tatemura, N. Polyzotis, and M. J. Carey. MISO: souping up big data query processing with a multistore system. In SIGMOD, pages 1591--1602, 2014.

Digital Library

[45]

V. Leis, A. Gubichev, A. Mirchev, P. A. Boncz, A. Kemper, and T. Neumann. How good are query optimizers, really? PVLDB, 9(3):204--215, 2015.

Digital Library

[46]

H. Lim, Y. Han, and S. Babu. How to Fit when No One Size Fits. In CIDR, 2013.

[47]

J. Lucas, Y. Idris, B. Contreras-Rojas, J.-A. Quiané-Ruiz, and S. Chawla. Cross-Platform Data Analytics Made Easy. In ICDE, 2018.

[48]

V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic. Robust query processing through progressive optimization. In SIGMOD, pages 659--670, 2004.

Digital Library

[49]

N. Marz and J. Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning, 2015.

Digital Library

[50]

M. Mitchell. An introduction to genetic algorithms. MIT press, 1998.

Digital Library

[51]

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1099--1110, 2008.

Digital Library

[52]

S. Palkar, J. J. Thomas, A. Shanbhag, M. Schwarzkopt, S. P. Amarasinghe, and M. Zaharia. Weld: A Common Runtime for High Performance Data Analysis. In CIDR, 2017.

[53]

A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In SIGMOD, pages 165--178, 2009.

Digital Library

[54]

A. Rheinländer, A. Heise, F. Hueske, U. Leser, and F. Naumann. SOFA: An extensible logical optimizer for UDF-heavy data flows. Inf. Syst., 52:96--125, 2015.

Digital Library

[55]

P. J. Sadalage and M. Fowler. NoSQL distilled: A brief guide to the emerging world of polyglot persistence. Addison-Wesley Professional, 2012.

Digital Library

[56]

S. Shankar, A. Choi, and J.-P. Dijcks. Integrating Hadoop Data with Oracle Parallel Processing. Oracle White Paper, http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-integrating-hadoop-data-with-or-130063.pdf, 2010.

[57]

A. P. Sheth and J. A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22(3):183--236, 1990.

Digital Library

[58]

A. Simitsis, K. Wilkinson, M. Castellanos, and U. Dayal. Optimizing Analytic Data Flows for Multiple Execution Engines. In SIGMOD, pages 829--840, 2012.

Digital Library

[59]

C. Statchuk, N. Madhavji, A. Miranskyy, and F. Dehne. Taming a tiger: Software engineering in the era of big data & continuous development. In CASCON, 2015.

Digital Library

[60]

M. Stonebraker. The Case for Polystores. http://wp.sigmod.org/?p=1629, 2015.

[61]

D. Tsoumakos and C. Mantas. The Case for Multi-Engine Data Analytics. In Euro-Par, pages 406--415, 2013.

[62]

J. Wang, T. Baker, M. Balazinska, D. Halperin, B. Haynes, B. Howe, D. Hutchison, S. Jain, R. Maas, P. Mehta, D. Moritz, B. Myers, J. Ortiz, D. Suciu, A. Whitaker, and S. Xu. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR, 2017.

[63]

F. Yang, J. Li, and J. Cheng. Husky: Towards a More Efficient and Expressive Distributed Computing Framework. PVLDB, 9(5):420--431, 2016.

Digital Library

Cited By

Yu GWu ZKossmann FLi TMarkakis MNgom AMadden SKraska T(2024)Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRADProceedings of the VLDB Endowment10.14778/3681954.368202617:11(3629-3643)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682026
Kraska TLi TMadden SMarkakis MNgom AWu ZYu G(2023)Check Out the Big Brain on BRAD: Simplifying Cloud Data Processing with Learned Automated Data MeshesProceedings of the VLDB Endowment10.14778/3611479.361152616:11(3293-3301)Online publication date: 24-Aug-2023
https://dl.acm.org/doi/10.14778/3611479.3611526
Park YTak BHan W(2023)QaaD (Query-as-a-Data): Scalable Execution of Massive Number of Small Queries in SparkProceedings of the ACM on Management of Data10.1145/35892791:2(1-26)Online publication date: 20-Jun-2023
https://dl.acm.org/doi/10.1145/3589279
Show More Cited By

RHEEM: enabling cross-platform data processing: may the big data be with you!
1. Information systems
  1. Data management systems

Recommendations

Rheem: Enabling Multi-Platform Task Execution
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Many emerging applications, from domains such as healthcare and oil & gas, require several data processing systems for complex analytics. This demo paper showcases system, a framework that provides multi-platform task execution for such applications. It ...
Pro Smartphone Cross-Platform Development: iPhone, Blackberry, Windows Mobile and Android Development and Distribution
Testing cross-platform mobile app development frameworks
ASE '15: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering

Mobile app developers often wish to make their apps available on a wide variety of platforms, e.g., Android, iOS, and Windows devices. Each of these platforms uses a different programming environment, each with its own language and APIs for app ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 11, Issue 11

July 2018

507 pages

ISSN:2150-8097

Editors:
Sihem Amer-Yahia
University of Grenoble Alpes, CNRS
,
Jian Pei
Simon Fraser University

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 July 2018

Published in PVLDB Volume 11, Issue 11

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

21
Total Citations
View Citations
418
Total Downloads

Downloads (Last 12 months)70
Downloads (Last 6 weeks)24

Reflects downloads up to 16 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Yu GWu ZKossmann FLi TMarkakis MNgom AMadden SKraska T(2024)Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRADProceedings of the VLDB Endowment10.14778/3681954.368202617:11(3629-3643)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682026
Kraska TLi TMadden SMarkakis MNgom AWu ZYu G(2023)Check Out the Big Brain on BRAD: Simplifying Cloud Data Processing with Learned Automated Data MeshesProceedings of the VLDB Endowment10.14778/3611479.361152616:11(3293-3301)Online publication date: 24-Aug-2023
https://dl.acm.org/doi/10.14778/3611479.3611526
Park YTak BHan W(2023)QaaD (Query-as-a-Data): Scalable Execution of Massive Number of Small Queries in SparkProceedings of the ACM on Management of Data10.1145/35892791:2(1-26)Online publication date: 20-Jun-2023
https://dl.acm.org/doi/10.1145/3589279
Giatrakos NAlevizos EDeligiannakis AKlinkenberg RArtikis AFrommholz IHopfgartner FLee MOakes MLalmas MZhang MSantos R(2023)Proactive Streaming Analytics at Scale: A Journey from the State-of-the-art to a Production PlatformProceedings of the 32nd ACM International Conference on Information and Knowledge Management10.1145/3583780.3615293(5204-5207)Online publication date: 21-Oct-2023
https://dl.acm.org/doi/10.1145/3583780.3615293
Kaoudi ZQuiané-Ruiz J(2022)Unified data analyticsProceedings of the VLDB Endowment10.14778/3554821.355489815:12(3778-3781)Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.14778/3554821.3554898
Kiehn FSchmidt MGlake DPanse FWingerath WWollmer BPoppinga MRitter N(2022)Polyglot data managementProceedings of the VLDB Endowment10.14778/3554821.355489115:12(3750-3753)Online publication date: 1-Aug-2022
https://dl.acm.org/doi/10.14778/3554821.3554891
Winter CGiceva JNeumann TKemper A(2022)On-demand state separation for cloud data warehousingProceedings of the VLDB Endowment10.14778/3551793.355184515:11(2966-2979)Online publication date: 29-Sep-2022
https://dl.acm.org/doi/10.14778/3551793.3551845
Grulich PZeuch SMarkl V(2022)BabelfishProceedings of the VLDB Endowment10.14778/3489496.348950115:2(196-210)Online publication date: 4-Feb-2022
https://dl.acm.org/doi/10.14778/3489496.3489501
El-Mahgary SSoisalon-Soininen EOrponen PRönnholm PHyyppä H(2022)OVI-3: A NoSQL visual query system supporting efficient anti-joinsJournal of Intelligent Information Systems10.1007/s10844-022-00742-460:3(777-801)Online publication date: 21-Sep-2022
https://dl.acm.org/doi/10.1007/s10844-022-00742-4
Forresi CFrancia MGallinucci EGolfarelli M(2022)Cost-based Optimization of Multistore Query PlansInformation Systems Frontiers10.1007/s10796-022-10320-225:5(1925-1951)Online publication date: 4-Oct-2022
https://dl.acm.org/doi/10.1007/s10796-022-10320-2
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents