research-article

JetScope: reliable and interactive analytics at cloud scale

Editors: Chen Li, Volker Markl Authors:

Jaliya Ekanayake,

Jingren ZhouAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 8, Issue 12

Pages 1680 - 1691

https://doi.org/10.14778/2824032.2824066

Published: 01 August 2015 Publication History

Abstract

Interactive, reliable, and rich data analytics at cloud scale is a key capability to support low latency data exploration and experimentation over terabytes of data for a wide range of business scenarios. Besides the challenges in massive scalability and low latency distributed query processing, it is imperative to achieve all these requirements with effective fault tolerance and efficient recovery, as failures and fluctuations are the norm in such a distributed environment.

We present a cloud scale interactive query processing system, called JetScope, developed at Microsoft. The system has a SQL-like declarative scripting language and delivers massive scalability and high performance through advanced optimizations. In order to achieve low latency, the system leverages various access methods, optimizes delivering first rows, and maximizes network and scheduling efficiency. The system also provides a fine-grained fault tolerance mechanism which is able to efficiently detect and mitigate failures without significantly impacting the query latency and user experience. JetScope has been deployed to hundreds of servers in production at Microsoft, serving a few million queries every day.

References

[1]

L. Abraham, J. Allen, O. Barykin, V. R. Borkar, B. Chopra, C. Gerea, D. Merl, J. Metzler, D. Reiss, S. Subramanian, J. L. Wiener, and O. Zed. Scuba: Diving into Data at Facebook. PVLDB, 6(11):1057--1067, 2013.

[2]

A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving Relations for Cache Performance. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 169--180, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[3]

A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, F. Naumann, M. Peters, A. Rheinländer, M. J. Sax, S. Schelter, M. Höger, K. Tzoumas, and D. Warneke. The Stratosphere Platform for Big Data Analytics. The VLDB Journal, 23(6):939--964, Dec. 2014.

[4]

M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1383--1394, New York, NY, USA, 2015. ACM.

[5]

E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou. Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 285--300, Broomfield, CO, Oct. 2014. USENIX Association.

[6]

Y. Chen, S. Alspaugh, and R. H. Katz. Interactive Query Processing in Big Data Systems: A Cross Industry Study of MapReduce Workloads. Technical Report UCB/EECS-2012-37, EECS Department, University of California, Berkeley, Apr 2012.

[7]

T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce online. Technical Report UCB/EECS-2009-136, EECS Department, University of California, Berkeley, Oct 2009.

[8]

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107--113, Jan. 2008.

[9]

M. K. et.al. Impala: A Modern, Open-Source SQL Engine for Hadoop. In Proceedings of the Conference on Innovative Data Systems Research 2015, CIDR '15, 2015.

[10]

A. Floratou, U. F. Minhas, and F. Ozcan. SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. Proceedings of the VLDB Endowment, 7(12), 2014.

[11]

G. Graefe. The Cascades Framework for Query Optimization. IEEE Data Eng. Bull., 18(3):19--29, 1995.

[12]

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In Proceedings of the 2Nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, pages 59--72, New York, NY, USA, 2007. ACM.

[13]

M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4--7, 2015, Online Proceedings, 2015.

[14]

S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. In Proc. of the 36th Int'l Conf on Very Large Data Bases, pages 330--339, 2010.

[15]

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1099--1110, New York, NY, USA, 2008. ACM.

[16]

K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: Distributed, Low Latency Scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 69--84, New York, NY, USA, 2013. ACM.

[17]

M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In SIGOPS European Conference on Computer Systems (EuroSys), pages 351--364, Prague, Czech Republic, 2013.

[18]

J. Shute, R. Vingralek, B. Samwel, B. Handy, C. Whipkey, E. Rollins, M. Oancea, K. Little?eld, D. Menestrina, S. Ellner, J. Cieslewicz, I. Rae, T. Stancescu, and H. Apte. F1: A Distributed SQL Database That Scales. In VLDB, 2013.

[19]

A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. Proceedings of The Vldb Endowment, 2: 1626--1629, 2009.

[20]

R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 13--24, New York, NY, USA, 2013. ACM.

[21]

J. Zhou, N. Bruno, M.-C. Wu, P.-Å. Larson, R. Chaiken, and D. Shakib. SCOPE: Parallel databases meet MapReduce. VLDB J., 21(5):611--636, 2012.

[22]

J. Zhou, P. Larson, and R. Chaiken. Incorporating partitioning and parallel plans into the SCOPE optimizer. In Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1--6, 2010, Long Beach, California, USA, pages 1060--1071, 2010.

Cited By

Zhu YSen RHorton RAgosta J(2023)Runtime Variation in Big Data AnalyticsProceedings of the ACM on Management of Data10.1145/35889211:1(1-20)Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588921
Chen YWang JLu YHan YLv ZMin XCai HZhang WFan HLi CGuan TLin WJia YZhou J(2021)FangornProceedings of the VLDB Endowment10.14778/3476311.347637614:12(2972-2985)Online publication date: 1-Jul-2021
https://dl.acm.org/doi/10.14778/3476311.3476376
Mustard CGoswami SGharavi NNider JBeschastnikh IFedorova AWassermann BMalka MChidambaram VRaz D(2021)JumpgateProceedings of the 14th ACM International Conference on Systems and Storage10.1145/3456727.3463770(1-12)Online publication date: 14-Jun-2021
https://dl.acm.org/doi/10.1145/3456727.3463770
Show More Cited By

Index Terms

JetScope: reliable and interactive analytics at cloud scale

Index terms have been assigned to the content through auto-classification.

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 8, Issue 12

Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii

August 2015

728 pages

ISSN:2150-8097

Editors:
Chen Li
University of California, Irvine
,
Volker Markl
TU Berlin

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 August 2015

Published in PVLDB Volume 8, Issue 12

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

9
Total Citations
View Citations
141
Total Downloads

Downloads (Last 12 months)11
Downloads (Last 6 weeks)1

Reflects downloads up to 26 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhu YSen RHorton RAgosta J(2023)Runtime Variation in Big Data AnalyticsProceedings of the ACM on Management of Data10.1145/35889211:1(1-20)Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588921
Chen YWang JLu YHan YLv ZMin XCai HZhang WFan HLi CGuan TLin WJia YZhou J(2021)FangornProceedings of the VLDB Endowment10.14778/3476311.347637614:12(2972-2985)Online publication date: 1-Jul-2021
https://dl.acm.org/doi/10.14778/3476311.3476376
Mustard CGoswami SGharavi NNider JBeschastnikh IFedorova AWassermann BMalka MChidambaram VRaz D(2021)JumpgateProceedings of the 14th ACM International Conference on Systems and Storage10.1145/3456727.3463770(1-12)Online publication date: 14-Jun-2021
https://dl.acm.org/doi/10.1145/3456727.3463770
Zhu YKrishnan SKaranasos KTarte IPower CModi AKumar MZhang DMuthyala KJurgens NSakalanaga SDarbha SIyer MAgarwal ACurino CLi GLi ZIdreos SSrivastava D(2021)KEA: Tuning an Exabyte-Scale Data InfrastructureProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457569(2667-2680)Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3457569
Moeller JYe ZLin KLang WLi GLi ZIdreos SSrivastava D(2021)Toto Benchmarking the Efficiency of a Cloud ServiceProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457555(2543-2556)Online publication date: 9-Jun-2021
https://dl.acm.org/doi/10.1145/3448016.3457555
Chen RShao ZLiu DFeng ZLi T(2019)Towards Efficient NVDIMM-based Heterogeneous Storage Hierarchy Management for Big Data WorkloadsProceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3352460.3358266(849-860)Online publication date: 12-Oct-2019
https://dl.acm.org/doi/10.1145/3352460.3358266
Yint ZSun JLi MEkanayake JLin HFriedman MBlakeley JSzyperski CDevanur N(2018)Bubble executionProceedings of the VLDB Endowment10.14778/3192965.319296711:7(746-758)Online publication date: 1-Mar-2018
https://dl.acm.org/doi/10.14778/3192965.3192967
Picado JLang WThayer EDas GJermaine CBernstein P(2018)Survivability of Cloud Databases - Factors and PredictionProceedings of the 2018 International Conference on Management of Data10.1145/3183713.3190651(811-823)Online publication date: 27-May-2018
https://dl.acm.org/doi/10.1145/3183713.3190651
Hassani MCuzzocrea ASpaus PSeidl TVarela CCastro HBarrios C(2016)I-HASTREAMProceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing10.1109/CCGrid.2016.102(656-665)Online publication date: 16-May-2016
https://dl.acm.org/doi/10.1109/CCGrid.2016.102

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents