DOI: 10.1145/3183713.3190656
research-article

Computation Reuse in Analytics Job Service at Microsoft

Published: 27 May 2018

Abstract

Analytics-as-a-service, or analytics job service, is emerging as a new paradigm for data analytics, be it in a cloud environment or within enterprises. In this setting, users are not required to manage or tune their hardware and software infrastructure, and they pay only for the processing resources consumed per job. However, the shared nature of these job services across several users and teams leads to significant overlaps in partial computations, i.e., parts of the processing are duplicated across multiple jobs, thus generating redundant costs. In this paper, we describe a computation reuse framework, coined CLOUDVIEWS, which we built to address the computation overlap problem in Microsoft's SCOPE job service. We present a detailed analysis from our production workloads to motivate the computation overlap problem and the possible gains from computation reuse. The key aspects of our system are the following: (i) we reuse computations by creating materialized views over recurring workloads, i.e., periodically executing jobs that have the same script templates but process new data each time, (ii) we select the views to materialize using a feedback loop that reconciles the compile-time and run-time statistics and gathers precise measures of the utility and cost of each overlapping computation, and (iii) we create materialized views in an online fashion, without requiring an offline phase to materialize the overlapping computations.
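
As a rough illustration of the view selection idea in (ii), the following minimal sketch groups observed subexpressions by a signature, estimates the utility of each overlapping computation from run-time measurements, and picks views under a storage budget. Everything here is an assumption made for illustration: the `Subexpression` record, the `select_views` function, the `storage_budget_bytes` parameter, and the greedy utility-per-byte ranking are hypothetical and are not taken from the CLOUDVIEWS implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Subexpression:
    """One operator subtree observed in a past job run (hypothetical record)."""
    job_id: str
    signature: str          # assumed hash of the subtree, stable across recurring runs
    runtime_seconds: float  # observed run-time cost of computing the subtree
    output_bytes: int       # observed output size, i.e. the cost of storing it


def select_views(observations, storage_budget_bytes):
    """Greedily pick overlapping subexpressions to materialize (illustrative only).

    Utility of a signature is the runtime saved if every job except the
    first reads the materialized result instead of recomputing it.
    """
    by_signature = defaultdict(list)
    for obs in observations:
        by_signature[obs.signature].append(obs)

    candidates = []
    for signature, group in by_signature.items():
        jobs = {obs.job_id for obs in group}
        if len(jobs) < 2:
            continue  # computed by a single job only, so there is nothing to reuse
        avg_runtime = sum(o.runtime_seconds for o in group) / len(group)
        avg_bytes = sum(o.output_bytes for o in group) / len(group)
        utility = (len(jobs) - 1) * avg_runtime
        candidates.append((signature, utility, avg_bytes))

    # Rank by utility per stored byte, highest first (an assumed heuristic).
    candidates.sort(key=lambda c: c[1] / max(c[2], 1), reverse=True)

    selected, used = [], 0
    for signature, _utility, size in candidates:
        if used + size <= storage_budget_bytes:
            selected.append(signature)
            used += size
    return selected
```

Calling `select_views` with statistics gathered from earlier runs of a recurring workload returns the signatures worth materializing; a real system would also need to handle view expiry and concurrent jobs, which this sketch ignores.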

Information & Contributors

Information

Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN: 9781450347037
DOI: 10.1145/3183713
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. computation reuse
  2. materialized views
  3. shared clouds

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '18

Acceptance Rates

SIGMOD '18 Paper Acceptance Rate 90 of 461 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 44
  • Downloads (last 6 weeks): 4
Reflects downloads up to 01 Nov 2024.

Cited By

  • (2024) PilotScope: Steering Databases with Machine Learning Drivers. Proceedings of the VLDB Endowment 17:5 (980-993). DOI: 10.14778/3641204.3641209. Online publication date: 1-Jan-2024
  • (2024) Sibyl: Forecasting Time-Evolving Query Workloads. Proceedings of the ACM on Management of Data 2:1 (1-27). DOI: 10.1145/3639308. Online publication date: 26-Mar-2024
  • (2024) Blaze: Holistic Caching for Iterative Data Processing. Proceedings of the Nineteenth European Conference on Computer Systems (370-386). DOI: 10.1145/3627703.3629558. Online publication date: 22-Apr-2024
  • (2023) PikePlace: Generating Intelligence for Marketplace Datasets. Proceedings of the VLDB Endowment 16:12 (4106-4109). DOI: 10.14778/3611540.3611632. Online publication date: 12-Sep-2023
  • (2023) GEqO: ML-Accelerated Semantic Equivalence Detection. Proceedings of the ACM on Management of Data 1:4 (1-25). DOI: 10.1145/3626710. Online publication date: 12-Dec-2023
  • (2023) Survey on performance optimization for database systems. Science China Information Sciences 66:2. DOI: 10.1007/s11432-021-3578-6. Online publication date: 11-Jan-2023
  • (2022) Hippo. Proceedings of the VLDB Endowment 15:5 (1038-1052). DOI: 10.14778/3510397.3510402. Online publication date: 18-May-2022
  • (2022) Accelerating container-based deep learning hyperparameter optimization workloads. Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning (1-10). DOI: 10.1145/3533028.3533309. Online publication date: 12-Jun-2022
  • (2022) Deploying a Steered Query Optimizer in Production at Microsoft. Proceedings of the 2022 International Conference on Management of Data (2299-2311). DOI: 10.1145/3514221.3526052. Online publication date: 10-Jun-2022
  • (2022) AutoView: An Autonomous Materialized View Management System with Encoder-Reducer. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2022.3163195. Online publication date: 2022
