Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

RHEEM: enabling cross-platform data processing: may the big data be with you!

Published: 01 July 2018 Publication History

Abstract

Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging: finding the most efficient platform for a given task requires quite good expertise for all the available platforms. We present Rheem, a general-purpose cross-platform data processing system that decouples applications from the underlying platforms. It not only determines the best platform to run an incoming task, but also splits the task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). It features (i) an interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Using different real-world applications with Rheem, we demonstrate how cross-platform data processing can accelerate performance by more than one order of magnitude compared to single-platform data processing.

References

[1]
Apache Beam. https://beam.apache.org.
[2]
Apache Drill. https://drill.apache.org.
[3]
Apache Flink. https://flink.apache.org.
[4]
Apache Flume. https://flume.apache.org.
[5]
Apache HBase. http://hbase.apache.org/.
[6]
Apache Hive: A data warehouse software for distributed storage. http://hive.apache.org.
[7]
Apache Mahout. http://mahout.apache.org.
[8]
Apache Spark: Lightning-Fast Cluster Computing. http://spark.incubator.apache.org/.
[9]
Fortune magazine. http://fortune.com/2014/06/19/big-data-airline-industry/.
[10]
Luigi Project. https://github.com/spotify/luigi.
[11]
PostgreSQL. http://www.postgresql.org/.
[12]
PrestoDB Project. https://prestodb.io.
[13]
Spark MLlib: http://spark.apache.org/mllib.
[14]
Spark SQL programming guide. http://spark.apache.org/docs/latest/sql-programming-guide.html.
[15]
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, pages 265--283, 2016.
[16]
D. Agrawal, L. Ba, L. Berti-Equille, S. Chawla, A. Elmagarmid, H. Hammady, Y. Idris, Z. Kaoudi, Z. Khayyat, S. Kruse, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, N. Tang, and M. Zaki. Rheem: Enabling Multi-Platform Task Execution. In SIGMOD, pages 2069--2072, 2016.
[17]
D. Agrawal, S. Chawla, A. K. Elmagarmid, Z. Kaoudi, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and M. J. Zaki. Road to Freedom in Big Data Analytics. In EDBT, pages 479--484, 2016.
[18]
A. Alexandrov, R. Bergmann, S. Ewen, J. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinländer, M. J. Sax, S. Schelter, M. Höger, K. Tzoumas, and D. Warneke. The Stratosphere platform for big data analytics. VLDB J., 23(6):939--964, 2014.
[19]
A. Baaziz and L. Quoniam. How to use big data technologies to optimize operations in upstream petroleum industry. In 21<sup>st</sup> World Petroleum Congress, 2014.
[20]
M. Boehm, M. Dusenberry, D. Eriksson, A. V. Evfimievski, F. M. Manshadi, N. Pansare, B. Reinwald, F. Reiss, P. Sen, A. Surve, and S. Tatikonda. SystemML: Declarative Machine Learning on Spark. PVLDB, 9(13):1425--1436, 2016.
[21]
O. A. Bukhres, J. Chen, W. Du, A. K. Elmagarmid, and R. Pezzoli. InterBase: An Execution Environment for Heterogeneous Software Systems. IEEE Computer, 26(8):57--69, 1993.
[22]
M. J. Carey et al. Towards Heterogeneous Multimedia Information Systems: The Garlic Approach. In RIDE-DOM, pages 124--131, 1995.
[23]
S. S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. D. Ullman, and J. Widom. The TSIMMIS Project: Integration of Heterogeneous Information Sources. In IPSJ, pages 7--18, 1994.
[24]
H. Chen, R. Kazman, and S. Haziyev. Strategic prototyping for developing big data systems. IEEE Software, 33(2):36--43, 2016.
[25]
M. Dallachiesa, A. Ebaid, A. Eldawy, A. K. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. NADEEF: a commodity data cleaning system. In SIGMOD, pages 541--552, 2013.
[26]
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 2008.
[27]
D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang. The Data Civilizer System. In CIDR, 2017.
[28]
D. J. DeWitt, A. Halverson, R. V. Nehme, S. Shankar, J. Aguilar-Saborit, A. Avanes, M. Flasza, and J. Gramling. Split query processing in polybase. In SIGMOD, pages 1255--1266, 2013.
[29]
K. Doka, N. Papailiou, V. Giannakouris, D. Tsoumakos, and N. Koziris. Mix 'n' match multi-engine analytics. In IEEE BigData, pages 194--203, 2016.
[30]
A. J. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Çetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska, S. Madden, D. Maier, T. G. Mattson, S. Papadopoulos, J. Parkhurst, N. Tatbul, M. Vartak, and S. Zdonik. A Demonstration of the BigDAWG Polystore System. PVLDB, 8(12):1908--1911, 2015.
[31]
W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional Functional Dependencies for Capturing Data Inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):6:1--6:48, 2008.
[32]
R. C. Fernandez, D. Deng, E. Mansour, A. A. Qahtan, W. Tao, Z. Abedjan, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. A Demo of the Data Civilizer System. In SIGMOD, pages 1639--1642, 2017.
[33]
M. Franklin. Making sense of big data with the berkeley data analytics stack. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM), pages 1--2, 2015.
[34]
I. Gog, M. Schwarzkopf, N. Crooks, M. P. Grosvenor, A. Clement, and S. Hand. Musketeer: all for one, one for all in data processing systems. In EuroSys, 2015.
[35]
B. Haynes, A. Cheung, and M. Balazinska. PipeGen: Data Pipe Generator for Hybrid Analytics. In SoCC, pages 470--483, 2016.
[36]
A. Hems, A. Soofi, and E. Perez. How innovative oil and gas companies are using big data to outmaneuver the competition. Microsoft White Paper, http://goo.gl/2Bn0xq, 2014.
[37]
F. Hueske, M. Peters, M. J. Sax, A. Rheinländer, R. Bergmann, A. Krettek, and K. Tzoumas. Opening the black boxes in data flow optimization. PVLDB, 5(11):1256--1267, 2012.
[38]
IBM. Data-driven healthcare organizations use big data analytics for big gains. White paper, http://goo.gl/AFIHpk.
[39]
Z. Kaoudi and J.-A. Quiané-Ruiz. Cross-Platform Data Processing: Use Cases and Challenges. In ICDE (tutorial), 2018.
[40]
Z. Kaoudi, J.-A. Quiane-Ruiz, S. Thirumuruganathan, S. Chawla, and D. Agrawal. A Cost-based Optimizer for Gradient Descent Optimization. In SIGMOD, 2017.
[41]
Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and S. Yin. BigDansing: A System for Big Data Cleansing. In SIGMOD, pages 1215--1230, 2015.
[42]
Z. Khayyat, W. Lucia, M. Singh, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and P. Kalnis. Lightning Fast and Space Efficient Inequality Joins. PVLDB, 8(13):2074--2085, 2015.
[43]
S. Kruse, Z. Kaoudi, J.-A. Quiané-Ruiz, S. Chawla, F. Naumann, and B. Contreras-Rojas. RHEEMix in the Data Jungle - A Cross-Platform Query Optimizer. arXiv: 1805.03533 https://arxiv.org/abs/1805.03533, 2018.
[44]
J. LeFevre, J. Sankaranarayanan, H. Hacigümüs, J. Tatemura, N. Polyzotis, and M. J. Carey. MISO: souping up big data query processing with a multistore system. In SIGMOD, pages 1591--1602, 2014.
[45]
V. Leis, A. Gubichev, A. Mirchev, P. A. Boncz, A. Kemper, and T. Neumann. How good are query optimizers, really? PVLDB, 9(3):204--215, 2015.
[46]
H. Lim, Y. Han, and S. Babu. How to Fit when No One Size Fits. In CIDR, 2013.
[47]
J. Lucas, Y. Idris, B. Contreras-Rojas, J.-A. Quiané-Ruiz, and S. Chawla. Cross-Platform Data Analytics Made Easy. In ICDE, 2018.
[48]
V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh, and M. Cilimdzic. Robust query processing through progressive optimization. In SIGMOD, pages 659--670, 2004.
[49]
N. Marz and J. Warren. Big Data: Principles and best practices of scalable realtime data systems. Manning, 2015.
[50]
M. Mitchell. An introduction to genetic algorithms. MIT press, 1998.
[51]
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1099--1110, 2008.
[52]
S. Palkar, J. J. Thomas, A. Shanbhag, M. Schwarzkopt, S. P. Amarasinghe, and M. Zaharia. Weld: A Common Runtime for High Performance Data Analysis. In CIDR, 2017.
[53]
A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In SIGMOD, pages 165--178, 2009.
[54]
A. Rheinländer, A. Heise, F. Hueske, U. Leser, and F. Naumann. SOFA: An extensible logical optimizer for UDF-heavy data flows. Inf. Syst., 52:96--125, 2015.
[55]
P. J. Sadalage and M. Fowler. NoSQL distilled: A brief guide to the emerging world of polyglot persistence. Addison-Wesley Professional, 2012.
[56]
S. Shankar, A. Choi, and J.-P. Dijcks. Integrating Hadoop Data with Oracle Parallel Processing. Oracle White Paper, http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-integrating-hadoop-data-with-or-130063.pdf, 2010.
[57]
A. P. Sheth and J. A. Larson. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22(3):183--236, 1990.
[58]
A. Simitsis, K. Wilkinson, M. Castellanos, and U. Dayal. Optimizing Analytic Data Flows for Multiple Execution Engines. In SIGMOD, pages 829--840, 2012.
[59]
C. Statchuk, N. Madhavji, A. Miranskyy, and F. Dehne. Taming a tiger: Software engineering in the era of big data & continuous development. In CASCON, 2015.
[60]
M. Stonebraker. The Case for Polystores. http://wp.sigmod.org/?p=1629, 2015.
[61]
D. Tsoumakos and C. Mantas. The Case for Multi-Engine Data Analytics. In Euro-Par, pages 406--415, 2013.
[62]
J. Wang, T. Baker, M. Balazinska, D. Halperin, B. Haynes, B. Howe, D. Hutchison, S. Jain, R. Maas, P. Mehta, D. Moritz, B. Myers, J. Ortiz, D. Suciu, A. Whitaker, and S. Xu. The Myria Big Data Management and Analytics System and Cloud Services. In CIDR, 2017.
[63]
F. Yang, J. Li, and J. Cheng. Husky: Towards a More Efficient and Expressive Distributed Computing Framework. PVLDB, 9(5):420--431, 2016.

Cited By

View all
  • (2024)Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRADProceedings of the VLDB Endowment10.14778/3681954.368202617:11(3629-3643)Online publication date: 1-Jul-2024
  • (2023)Check Out the Big Brain on BRAD: Simplifying Cloud Data Processing with Learned Automated Data MeshesProceedings of the VLDB Endowment10.14778/3611479.361152616:11(3293-3301)Online publication date: 24-Aug-2023
  • (2023)QaaD (Query-as-a-Data): Scalable Execution of Massive Number of Small Queries in SparkProceedings of the ACM on Management of Data10.1145/35892791:2(1-26)Online publication date: 20-Jun-2023
  • Show More Cited By
  1. RHEEM: enabling cross-platform data processing: may the big data be with you!

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 11, Issue 11
    July 2018
    507 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 July 2018
    Published in PVLDB Volume 11, Issue 11

    Qualifiers

    • Research-article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)70
    • Downloads (Last 6 weeks)7
    Reflects downloads up to 26 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Blueprinting the Cloud: Unifying and Automatically Optimizing Cloud Data Infrastructures with BRADProceedings of the VLDB Endowment10.14778/3681954.368202617:11(3629-3643)Online publication date: 1-Jul-2024
    • (2023)Check Out the Big Brain on BRAD: Simplifying Cloud Data Processing with Learned Automated Data MeshesProceedings of the VLDB Endowment10.14778/3611479.361152616:11(3293-3301)Online publication date: 24-Aug-2023
    • (2023)QaaD (Query-as-a-Data): Scalable Execution of Massive Number of Small Queries in SparkProceedings of the ACM on Management of Data10.1145/35892791:2(1-26)Online publication date: 20-Jun-2023
    • (2023)Proactive Streaming Analytics at Scale: A Journey from the State-of-the-art to a Production PlatformProceedings of the 32nd ACM International Conference on Information and Knowledge Management10.1145/3583780.3615293(5204-5207)Online publication date: 21-Oct-2023
    • (2022)Unified data analyticsProceedings of the VLDB Endowment10.14778/3554821.355489815:12(3778-3781)Online publication date: 1-Aug-2022
    • (2022)Polyglot data managementProceedings of the VLDB Endowment10.14778/3554821.355489115:12(3750-3753)Online publication date: 1-Aug-2022
    • (2022)On-demand state separation for cloud data warehousingProceedings of the VLDB Endowment10.14778/3551793.355184515:11(2966-2979)Online publication date: 29-Sep-2022
    • (2022)BabelfishProceedings of the VLDB Endowment10.14778/3489496.348950115:2(196-210)Online publication date: 4-Feb-2022
    • (2022)OVI-3: A NoSQL visual query system supporting efficient anti-joinsJournal of Intelligent Information Systems10.1007/s10844-022-00742-460:3(777-801)Online publication date: 21-Sep-2022
    • (2022)Cost-based Optimization of Multistore Query PlansInformation Systems Frontiers10.1007/s10796-022-10320-225:5(1925-1951)Online publication date: 4-Oct-2022
    • Show More Cited By

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media