Scuba: diving into data at facebook

Published: 01 August 2013

Abstract

Facebook takes performance monitoring seriously. Performance issues can impact over one billion users, so we track thousands of servers, hundreds of PB of daily network traffic, hundreds of daily code changes, and many other metrics. We require latencies of under a minute from events occurring (a client request on a phone, a bug report filed, a code change checked in) to graphs showing those events on developers' monitors.
Scuba is the data management system Facebook uses for most real-time analysis. Scuba is a fast, scalable, distributed, in-memory database built at Facebook. It currently ingests millions of rows (events) per second and expires data at the same rate. Scuba stores data completely in memory on hundreds of servers, each with 144 GB of RAM. To process each query, Scuba aggregates data from all servers. Scuba processes almost a million queries per day. Scuba is used extensively for interactive, ad hoc analysis queries that run in under a second over live data. In addition, Scuba is the workhorse behind Facebook's code regression analysis, bug report monitoring, ads revenue monitoring, and performance debugging.
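The abstract states only that each query aggregates data from all servers; it does not describe the mechanism. As a rough illustration of that general idea, the following is a minimal sketch of scatter-gather aggregation: each server reduces its own rows to a small partial result, and the partial results are then merged. The row shape, the function names (`leaf_aggregate`, `merge_partials`, `averages`), and the sample data are all hypothetical and are not taken from Scuba.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical row shape: (group_key, numeric_value). Not Scuba's schema.
Row = Tuple[str, float]
# Partial aggregate per group: (count, sum). Both parts merge by addition.
Partial = Dict[str, Tuple[int, float]]


def leaf_aggregate(rows: Iterable[Row]) -> Partial:
    """Reduce the rows held by one server to per-group (count, sum)."""
    acc: Dict[str, List[float]] = defaultdict(lambda: [0, 0.0])
    for key, value in rows:
        acc[key][0] += 1
        acc[key][1] += value
    return {key: (int(c), s) for key, (c, s) in acc.items()}


def merge_partials(partials: Iterable[Partial]) -> Partial:
    """Combine the per-server partial results into one result."""
    merged: Dict[str, List[float]] = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for key, (count, total) in partial.items():
            merged[key][0] += count
            merged[key][1] += total
    return {key: (int(c), s) for key, (c, s) in merged.items()}


def averages(result: Partial) -> Dict[str, float]:
    """Derive the mean per group from the merged (count, sum) pairs."""
    return {key: total / count for key, (count, total) in result.items()}


# Three made-up "servers", each holding a slice of the events.
servers = [
    [("page_load", 120.0), ("api_call", 30.0)],
    [("page_load", 80.0)],
    [("api_call", 50.0), ("api_call", 40.0)],
]
merged = merge_partials(leaf_aggregate(rows) for rows in servers)
print(averages(merged))
```

Carrying (count, sum) rather than a per-server average is the design choice that matters here: counts and sums can be added across servers without loss, whereas averages of averages would be wrong whenever servers hold different numbers of rows.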

Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 11
August 2013
237 pages

Publisher

VLDB Endowment