research-article

Computation Reuse in Analytics Job Service at Microsoft

Authors:

Konstantinos Karanasos,

Sriram Rao

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 191 - 203

https://doi.org/10.1145/3183713.3190656

Published: 27 May 2018

Abstract

Analytics-as-a-service, or analytics job service, is emerging as a new paradigm for data analytics, be it in a cloud environment or within enterprises. In this setting, users are not required to manage or tune their hardware and software infrastructure, and they pay only for the processing resources consumed per job. However, the shared nature of these job services across several users and teams leads to significant overlaps in partial computations, i.e., parts of the processing are duplicated across multiple jobs, thus generating redundant costs. In this paper, we describe a computation reuse framework, coined CLOUDVIEWS, which we built to address the computation overlap problem in Microsoft's SCOPE job service. We present a detailed analysis from our production workloads to motivate the computation overlap problem and the possible gains from computation reuse. The key aspects of our system are the following: (i) we reuse computations by creating materialized views over recurring workloads, i.e., periodically executing jobs that have the same script templates but process new data each time, (ii) we select the views to materialize using a feedback loop that reconciles the compile-time and run-time statistics and gathers precise measures of the utility and cost of each overlapping computation, and (iii) we create materialized views in an online fashion, without requiring an offline phase to materialize the overlapping computations.
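
To make point (ii) of the abstract concrete, the following is a minimal illustrative sketch of how overlapping computations might be ranked once run-time statistics are available. It is not the CLOUDVIEWS implementation: the `SubexprRun` record, the `signature` field, the utility formula (cost saved by every job after the first), and the greedy storage-budget selection are all assumptions introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SubexprRun:
    """One observed execution of a subexpression (hypothetical record)."""
    signature: str       # assumed hash identifying logically identical subexpressions
    job_id: str          # job in which the subexpression ran
    runtime_cost: float  # observed run-time cost, e.g. CPU-seconds
    output_size: float   # observed output size, i.e. storage needed if materialized


def select_views(runs, storage_budget):
    """Greedily pick subexpression signatures to materialize under a storage budget."""
    by_sig = defaultdict(list)
    for run in runs:
        by_sig[run.signature].append(run)

    candidates = []
    for sig, group in by_sig.items():
        jobs = {run.job_id for run in group}
        if len(jobs) < 2:
            continue  # seen in only one job, so there is nothing to reuse
        avg_cost = sum(run.runtime_cost for run in group) / len(group)
        size = max(run.output_size for run in group)
        # Assumed utility: the first job still computes the result, the rest reuse it.
        utility = (len(jobs) - 1) * avg_cost
        candidates.append((sig, utility, size))

    # Rank by utility per unit of storage; break ties by signature for determinism.
    candidates.sort(key=lambda c: (-c[1] / max(c[2], 1e-9), c[0]))

    chosen, used = [], 0.0
    for sig, _utility, size in candidates:
        if used + size <= storage_budget:
            chosen.append(sig)
            used += size
    return chosen
```

In this sketch the inputs are measured costs rather than optimizer estimates, which mirrors the abstract's emphasis on reconciling compile-time and run-time statistics; a production system would need a more careful selection strategy than this greedy pass.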

Index Terms

Computation Reuse in Analytics Job Service at Microsoft
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
2. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
          1. Query optimization

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

May 2018

1874 pages

ISBN:9781450347037

DOI:10.1145/3183713

General Chairs:
Gautam Das, University of Texas at Arlington, USA
Christopher Jermaine, Rice University, USA
Philip Bernstein, Microsoft Research, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '18

Sponsor:

SIGMOD

SIGMOD/PODS '18: International Conference on Management of Data

June 10 - 15, 2018

Houston, TX, USA

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%
