Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3225058.3225120acmotherconferencesArticle/Chapter ViewAbstractPublication PagesicppConference Proceedingsconference-collections
research-article

Dual-Paradigm Stream Processing

Published: 13 August 2018 Publication History

Abstract

Existing stream processing frameworks operate either under data stream paradigm processing data record by record to favor low latency, or under operation stream paradigm processing data in micro-batches to desire high throughput. For complex and mutable data processing requirements, this dilemma brings the selection and deployment of stream processing frameworks into an embarrassing situation. Moreover, current data stream or operation stream paradigms cannot handle data burst efficiently, which probably results in noticeable performance degradation. This paper introduces a dual-paradigm stream processing, called DO (Data and Operation) that can adapt to stream data volatility. It enables data to be processed in micro-batches (i.e., operation stream) when data burst occurs to achieve high throughput, while data is processed record by record (i.e., data stream) in the remaining time to sustain low latency. DO embraces a method to detect data bursts, identify the main operations affected by the data burst and switch paradigms accordingly. Our insight behind DO's design is that the trade-off between latency and throughput of stream processing frameworks can be dynamically achieved according to data communication among operations in a fine-grained manner (i.e., operation level) instead of framework level. We implement a prototype stream processing framework that adopts DO. Our experimental results show that our framework with DO can achieve 5x speedup over operation stream under low data stream sizes, and outperforms data stream on throughput by 2.1x to 3.2x under data burst.

References

[1]
Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-tolerant Stream Processing at Internet Scale. Proceedings of the VLDB Endowment. 6, 11 (Aug. 2013), 1033--1044.
[2]
Leonardo Aniello, Roberto Baldoni, and Leonardo Querzoni. 2013. Adaptive Online Scheduling in Storm. In Proceedings of the 7th ACM International Conference on Distributed Event-based Systems (DEBS '13). ACM, 207--218.
[3]
Brian Babcock, Shivnath Babu, Rajeev Motwani, and Mayur Datar. 2003. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03). ACM, 253--264.
[4]
Brian Babcock, Mayur Datar, and Rajeev Motwani. 2003. Load shedding techniques for data stream systems. In Proceedings of the 2003 Management and Processing of Data Streams Workshop (MPDS '03), Vol. 577. ACM.
[5]
Yahoo! Streaming Benchmark. 2015. https://yahooeng.tumblr.com/post/135321837876/
[6]
Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. 2011. Incoop: MapReduce for Incremental Computations. In Proceedings of the 2011 ACM Symposium on Cloud Computing (SOCC '11). ACM, Article 7, 14 pages.
[7]
Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce Online. In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI '10). USENIX, 313--328.
[8]
Tathagata Das, Yuan Zhong, Ion Stoica, and Scott Shenker. 2014. Adaptive Stream Processing Using Dynamic Batch Sizing. In Proceedings of the 2014 ACM Symposium on Cloud Computing (SOCC '14). ACM, Article 16, 13 pages.
[9]
Structured Streaming Programming Guide. 2018. http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
[10]
Martin Hirzel, Robert Soulé, Scott Schneider, Buğra Gedik, and Robert Grimm. 2014. A Catalog of Stream Processing Optimizations. ACM Comput. Surv. 46, 4, Article 46 (March2014), 34 pages.
[11]
JStorm. 2015. http://jstorm.io/
[12]
Asterios Katsifodimos and Sebastian Schelter. 2016. Apache Flink: Stream Analytics at Scale. In Proceedings of the 2016 IEEE International Conference on Cloud Engineering Workshop (IC2EW '16). IEEE, 193--193.
[13]
Wilhelm Kleiminger, Evangelia Kalyvianaki, and Peter Pietzuch. 2011. Balancing load in stream processing with the cloud. In Proceedings of the 27th IEEE International Conference on Data Engineering Workshops (ICDEW '11). IEEE, 16--21.
[14]
Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, 239--250.
[15]
Julie Letchner, Christopher Ré, Magdalena Balazinska, and Matthai Philipose. 2010. Approximation trade-offs in Markovian stream processing: An empirical study. In Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE '10). IEEE, 936--939.
[16]
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13). ACM, 439--455.
[17]
Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. 2010. S4: Distributed Stream Computing Platform. In Proceedings of the 2010 IEEE International Conference on Data Mining Workshops (ICDMW '10). IEEE, 170--177.
[18]
Christopher Olston, Greg Chiou, Laukik Chitnis, Francis Liu, Yiping Han, Mattias Larsson, Andreas Neumann, Vellanki B.N. Rao, Vijayanand Sankarasubramanian, Siddharth Seth, Chao Tian, Topher ZiCornell, and Xiaodan Wang. 2011. Nova: Continuous Pig/Hadoop Workflows. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD '11). ACM, 1081--1090.
[19]
Samza. 2015. http://samza.apache.org/
[20]
Michael Stonebraker, Uğur Çetintemel, and Stan Zdonik. 2005. The 8 Requirements of Real-time Stream Processing. SIGMOD Rec. 34, 4 (Dec. 2005), 42--47.
[21]
Jaspar Subhlok and Gary Vondran. 1996. Optimal Latency-throughput Tradeoffs for Data Parallel Pipelines. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96). ACM, 62--71.
[22]
Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. 2014. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, 147--156.
[23]
Trident. 2015. http://storm.apache.org/releases/1.1.0/Trident-tutorial.html
[24]
Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Michael Armbrust, Ali Ghodsi, Michael J. Franklin, Benjamin Recht, and Ion Stoica. 2017. Drizzle: Fast and Adaptable Stream Processing at Scale. In Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP '17). ACM, 374--389.
[25]
Naga Vydyanathan, Umit Catalyurek, Tahsin Kurc, Ponnuswamy Sadayappan, and Joel Saltz. 2011. Optimizing latency and throughput of application workflows on clusters. Parallel Comput. 37, 10 (2011), 694--712.
[26]
Walter Willinger, Murad S. Taqqu, and Ashok Erramilli. 1996. A Bibliographical Guide to Self-Similar Traffic and Performance Modeling for Modern High-Speed Networks. Stochastic Networks Theory and Applications (1996), 339--366.
[27]
Jielong Xu, Zhenhua Chen, Jian Tang, and Sen Su. 2014. T-Storm: Traffic-Aware Online Scheduling in Storm. In Proceedings of the 34th IEEE International Conference on Distributed Computing Systems (ICDCS '14). IEEE, 535--544.
[28]
Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13). ACM, 423--438.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN:9781450365109
DOI:10.1145/3225058
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • University of Oregon: University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Data burst
  2. Dual-paradigm approach
  3. High throughput
  4. Low latency
  5. Stream processing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2018

Acceptance Rates

ICPP '18 Paper Acceptance Rate 91 of 313 submissions, 29%;
Overall Acceptance Rate 91 of 313 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 187
    Total Downloads
  • Downloads (Last 12 months)11
  • Downloads (Last 6 weeks)4
Reflects downloads up to 10 Nov 2024

Other Metrics

Citations

Cited By

View all

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media