MillWheel: fault-tolerant stream processing at internet scale

Published: 01 August 2013, in Proceedings of the VLDB Endowment (PVLDB), Volume 6, Issue 11. Publisher: VLDB Endowment.

Abstract

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework's fault-tolerance guarantees.
This paper describes MillWheel's programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel's features are used. MillWheel's programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel's unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
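
To make the programming model concrete, here is a minimal sketch in Python of a MillWheel-style computation: a per-key counter that emits one-minute window totals once the system's low watermark passes each window's end. The record-handler/timer split mirrors the API described in the paper, but the class, the in-memory harness, and all names below are hypothetical.

    # Illustrative sketch only: counts records per key in one-minute
    # windows and emits each window's total once the low watermark
    # passes the window's end. The record/timer split mirrors the
    # paper's API; this tiny harness and all names are hypothetical.
    from collections import defaultdict

    WINDOW_SEC = 60


    class WindowedCounter:
        def __init__(self):
            # Per-key persistent state; the real system checkpoints this
            # atomically in a backing store such as Bigtable.
            self.counts = defaultdict(lambda: defaultdict(int))
            self.timers = set()  # (fire_at, key), fired by the watermark

        def process_record(self, key, timestamp):
            window = timestamp - (timestamp % WINDOW_SEC)
            self.counts[key][window] += 1
            self.timers.add((window + WINDOW_SEC, key))

        def advance_low_watermark(self, watermark):
            # The framework fires timers whose deadline the watermark has
            # passed; the corresponding windows are then complete.
            due = sorted(t for t in self.timers if t[0] <= watermark)
            self.timers -= set(due)
            for fire_at, key in due:
                window = fire_at - WINDOW_SEC
                total = self.counts[key].pop(window, 0)
                print(f"produce: key={key} window={window} count={total}")


    node = WindowedCounter()
    for ts in (3, 10, 59, 61):
        node.process_record("query:foo", ts)
    node.advance_low_watermark(60)   # produce: ... window=0 count=3
    node.advance_low_watermark(120)  # produce: ... window=60 count=1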

Reviews

Reviewer: Peng Li

In this paper, the authors introduce MillWheel, a framework for processing real-time streaming data that is used at Google. It is based on the streaming-computation paradigm, in which users specify a computation graph and provide code for its individual nodes. Once the computation starts, the application receives continuous data and processes it as it arrives to keep latency low. MillWheel targets latency-sensitive applications, such as the detection of query spikes and dips, and the delivery of ads.

The MillWheel model features persistent storage, low watermarks, and duplicate prevention. Persistent storage holds each node's state, so that if a node fails, the system can resume the computation from the stored state without losing data. The low-watermark subsystem tracks, per node, a bound on the timestamps of records still in flight, which lets a computation know when a time window is complete. With these features, MillWheel processes continuous streams of records, each carrying a key and a timestamp. A user who wants to trigger an event at some future logical time can register it with the optional timer mechanism.

Fault tolerance is a key feature of MillWheel. Since the framework may run continuously on thousands of nodes, failures are frequent. When a node fails, its computed results must be preserved; if the node keeps state, that state must be restored so that subsequent computations remain correct; and once the node restarts, duplicated output must be suppressed. To achieve these goals, MillWheel guarantees exactly-once delivery with the help of persistent storage and record acknowledgment. For non-idempotent computations (in which reprocessing the same input may produce different results), MillWheel provides a mechanism called "strong productions" that checkpoints produced records before they are delivered downstream. This checkpointing can be unnecessary overhead for idempotent computations (in which the same input always yields the same output), so MillWheel offers the option of turning off strong productions and using "weak productions" instead, improving performance.

Computation state is kept both on disk and in memory: disk provides capacity for large amounts of state, while memory provides speed. To ensure consistency, all writes to state are wrapped in per-key atomic operations. Work migration or node restarts can, however, leave behind "zombie" writers and network remnants that issue stale writes; to prevent this, every write is tagged with a sequencer token, and writes carrying a stale sequencer are rejected. Users can also tune the granularity of state modification for performance, according to the expected failure rates.

MillWheel is implemented on Google clusters. Streams are delivered via remote procedure calls, load distribution is handled by a replicated master, and persistent state is kept in a distributed database such as Bigtable or Spanner. Low-watermark values are aggregated conservatively by a central authority; each interested consumer computes its own low watermark from its oldest pending records and the low watermarks of the senders it subscribes to. Experiments show that with weak productions, the median record delay at a node is a few milliseconds; with strong productions and exactly-once delivery enabled, the median delay rises to tens of milliseconds, still well within human reaction time.
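
The low-watermark aggregation just described amounts to a short recursion over the computation graph: a node's low watermark is the minimum of its own oldest pending work and the low watermarks of every node that sends to it. A minimal sketch, with all names and timestamps illustrative:

    # Minimal sketch of the low-watermark recursion: a node's low
    # watermark is the min of its own oldest pending work and the low
    # watermarks of its senders. All names and values are illustrative.

    def low_watermark(node, oldest_pending, senders, memo=None):
        if memo is None:
            memo = {}
        if node not in memo:
            memo[node] = min(
                [oldest_pending.get(node, float("inf"))]
                + [low_watermark(s, oldest_pending, senders, memo)
                   for s in senders.get(node, [])]
            )
        return memo[node]


    # Example pipeline: injector -> window_counter -> anomaly_notifier
    senders = {"window_counter": ["injector"],
               "anomaly_notifier": ["window_counter"]}
    oldest = {"injector": 300,        # has injected data up to t=300
              "window_counter": 290}  # oldest unfinished record: t=290
    print(low_watermark("anomaly_notifier", oldest, senders))  # -> 290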
A framework-level cache further reduces traffic to the storage layer. MillWheel is used in a variety of internal Google systems, such as ad delivery and image processing for Google Street View. The authors point out, however, that some applications do not suit MillWheel well: for example, if an application cannot be parallelized across different keys, bottleneck stages can slow down the whole computation.

Online Computing Reviews Service
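
As a final illustration of the exactly-once mechanism described in the review: a receiver suppresses duplicates by journaling the unique ID of every record it has already applied, so retried deliveries are acknowledged but not re-applied. The sketch below is a toy version with illustrative names; the real system commits the state change and the ID journal entry in one atomic write and screens IDs with a Bloom filter before consulting storage.

    # Toy sketch of receiver-side duplicate suppression by unique
    # record ID. Illustrative only: this version uses a plain
    # in-memory set in place of a persisted, Bloom-filtered journal.
    import uuid


    class Receiver:
        def __init__(self):
            self.seen_ids = set()  # persisted alongside per-key state
            self.total = 0         # the per-key state being updated

        def deliver(self, record_id, value):
            if record_id in self.seen_ids:
                return "duplicate"        # ack the retry, apply nothing
            self.total += value           # state change ...
            self.seen_ids.add(record_id)  # ... and ID journal, atomically
            return "acked"


    r = Receiver()
    rid = uuid.uuid4().hex             # sender assigns the ID at production
    print(r.deliver(rid, 5), r.total)  # acked 5
    print(r.deliver(rid, 5), r.total)  # duplicate 5 (retry is harmless)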
