research-article

Using Cloud Functions as Accelerator for Elastic Data Analytics

Published: 20 June 2023

Abstract

Cloud function (CF) services, such as AWS Lambda, have been adopted as a new computing infrastructure for implementing analytical query engines. For bursty and sparse workloads, a CF-based query engine is more elastic than traditional query engines running on servers, i.e., virtual machines (VMs), and might provide a higher performance/price ratio. However, it remains controversial whether CF services are a good fit for general analytical workloads, given the limitations of CFs in storage, network, and lifetime, as well as their much higher resource unit prices compared with VMs.
In this paper, we first present micro-benchmark evaluations of the features of CFs and VMs. We reveal that, for query processing, although CFs are more elastic than VMs, they are less scalable and more expensive for continuous workloads. Then, to get the best of both worlds, we propose Pixels-Turbo, a hybrid query engine that processes queries in a scalable VM cluster by default and invokes CFs to accelerate the processing of unpredictable workload spikes. In the query engine, we propose several optimizations to improve the performance and scalability of the CF-based operators, and a cost-based optimizer that selects the appropriate algorithm and parallelism for the physical query plan. Evaluations on TPC-H and a real-world workload show that our query engine achieves a performance/price ratio one to two orders of magnitude higher than state-of-the-art serverless query engines for sustained workloads, without compromising elasticity for workload spikes.
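
The hybrid dispatch idea described above (VM cluster by default, CFs for spikes) can be illustrated with a toy sketch. This is not the Pixels-Turbo implementation: the class `HybridDispatcher`, the `vm_slots` capacity limit, and the rule "offload when the cluster is saturated" are all assumptions made for illustration, and the abstract does not say how spikes are actually detected.

```python
from dataclasses import dataclass


@dataclass
class HybridDispatcher:
    """Toy router: VM cluster by default, cloud functions for overflow.

    vm_slots is a hypothetical cap on the number of queries the VM
    cluster runs concurrently; it is not a parameter from the paper.
    """
    vm_slots: int
    running_on_vm: int = 0

    def submit(self) -> str:
        # Default path: use the VM cluster while it has spare capacity.
        if self.running_on_vm < self.vm_slots:
            self.running_on_vm += 1
            return "vm"
        # Spike path: the cluster is saturated, so offload to cloud functions.
        return "cf"

    def finish_vm_query(self) -> None:
        # Release a VM slot when a VM-side query completes.
        if self.running_on_vm > 0:
            self.running_on_vm -= 1


dispatcher = HybridDispatcher(vm_slots=2)
routes = [dispatcher.submit() for _ in range(4)]
print(routes)  # ['vm', 'vm', 'cf', 'cf']
```

A real engine would presumably also scale the VM cluster in the background so that sustained load moves back to the cheaper VM path; that part is omitted here.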

Supplemental Material

MP4 File
Presentation video for SIGMOD 2023.
PDF File
Read me
ZIP File
Source Code




Published In

Proceedings of the ACM on Management of Data  Volume 1, Issue 2
PACMMOD
June 2023
2310 pages
EISSN:2836-6573
DOI:10.1145/3605748
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. FAAS
  2. OLAP
  3. QAAS
  4. cloud databases
  5. cloud function
  6. cloud storage
  7. column store
  8. cost efficiency
  9. data lake
  10. data warehouse
  11. elasticity
  12. query optimization
  13. query processing
  14. serverless

Qualifiers

  • Research-article
