research-article

Procella: unifying serving and analytical data at YouTube

Published: 01 August 2019

Abstract

Large organizations like YouTube are dealing with exploding data volumes and increasing demand for data-driven applications. Broadly, these applications fall into four categories: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each of these use cases. This, however, creates silos of data and processing, and results in a complex, expensive, and hard-to-maintain infrastructure.
At YouTube, we solved this problem by building a new SQL query engine, Procella. Procella implements a superset of the capabilities required to address all four use cases above, with high scale and performance, in a single product. Today, Procella serves hundreds of billions of queries per day across all four workloads at YouTube and several other Google product areas.



Published In

Proceedings of the VLDB Endowment, Volume 12, Issue 12
August 2019
547 pages

Publisher

VLDB Endowment

