research-article

Procella: unifying serving and analytical data at YouTube

Proceedings of the VLDB Endowment, Volume 12, Issue 12

Pages 2022 - 2034

https://doi.org/10.14778/3352063.3352121

Published: 01 August 2019

Abstract

Large organizations like YouTube are dealing with exploding data volume and increasing demand for data-driven applications. Broadly, these applications can be categorized as: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each of these use cases. This, however, creates silos of data and processing, and results in a complex, expensive, and hard-to-maintain infrastructure.

At YouTube, we solved this problem by building a new SQL query engine, Procella. Procella implements a superset of the capabilities required to address all four use cases above, with high scale and performance, in a single product. Today, Procella serves hundreds of billions of queries per day across all four workloads at YouTube and several other Google product areas.

Published In

Proceedings of the VLDB Endowment Volume 12, Issue 12

August 2019

547 pages

ISSN:2150-8097

Editors:
Lei Chen,
Fatma Özcan

Publisher

VLDB Endowment
