Using Cloud Functions as Accelerator for Elastic Data Analytics

Authors:

Anastasia Ailamaki

Proceedings of the ACM on Management of Data, Volume 1, Issue 2

Article No.: 161, Pages 1 - 27

https://doi.org/10.1145/3589306

Published: 20 June 2023

Abstract

Cloud function (CF) services, such as AWS Lambda, have been adopted as a new computing infrastructure for implementing analytical query engines. For bursty and sparse workloads, a CF-based query engine is more elastic than traditional query engines running on servers, i.e., virtual machines (VMs), and might provide a higher performance/price ratio. However, it remains controversial whether CF services are a good fit for general analytical workloads, given the limitations of CFs in storage, network, and lifetime, as well as their much higher resource unit prices compared with VMs.

In this paper, we first present micro-benchmark evaluations of the features of CFs and VMs. We reveal that, for query processing, although CFs are more elastic than VMs, they are less scalable and more expensive for continuous workloads. Then, to get the best of both worlds, we propose Pixels-Turbo, a hybrid query engine that processes queries in a scalable VM cluster by default and invokes CFs to accelerate the processing of unpredictable workload spikes. In the query engine, we propose several optimizations to improve the performance and scalability of the CF-based operators, together with a cost-based optimizer that selects the appropriate algorithm and parallelism for the physical query plan. Evaluations on TPC-H and a real-world workload show that our query engine has a 1-2 orders of magnitude higher performance/price ratio than state-of-the-art serverless query engines for sustained workloads, without compromising elasticity for workload spikes.
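The hybrid idea in the abstract (VM cluster by default, CFs only for spikes) can be illustrated with a minimal sketch. It assumes a simple concurrency threshold as the spike signal and takes performance as the inverse of latency; the abstract does not specify the actual scheduling policy or cost model of Pixels-Turbo, so the names `ClusterState`, `choose_backend`, and `perf_price_ratio` are hypothetical and purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Snapshot of the VM cluster (hypothetical fields, for illustration only)."""
    running_queries: int   # queries currently executing in the VM cluster
    max_concurrency: int   # queries the cluster can run without queuing


def choose_backend(state: ClusterState) -> str:
    """Return "vm" while the cluster has spare capacity, otherwise "cf".

    Assumed policy: a query that would have to wait in the VM queue is
    treated as part of a workload spike and is sent to cloud functions.
    """
    if state.running_queries < state.max_concurrency:
        return "vm"
    return "cf"


def perf_price_ratio(latency_s: float, cost_usd: float) -> float:
    """Performance/price ratio, with performance assumed to be 1 / latency."""
    if latency_s <= 0 or cost_usd <= 0:
        raise ValueError("latency and cost must be positive")
    return 1.0 / (latency_s * cost_usd)
```

Under this assumed policy, sustained load stays on the cheaper VM cluster and only the overflow pays the higher CF unit price; a real engine would likely also account for VM scale-out delay and per-query cost estimates.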

Supplemental Material

MP4 File: Presentation video for SIGMOD 2023 (50.97 MB)

PDF File: Read me (27.79 KB)

ZIP File: Source Code (90.43 MB)

Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 2

PACMMOD

June 2023

2310 pages

EISSN: 2836-6573

DOI: 10.1145/3605748

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States
