Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3650203.3663333acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open access

Reactive Dataflow for Inflight Error Handling in ML Workflows

Published: 09 June 2024 Publication History

Abstract

Modern data analytics pipelines comprise traditional data transformation operations and pre-trained ML models deployed as user-defined functions (UDFs). Such pipelines, which we call ML workflows, generally produce erroneous results due to data errors inadvertently introduced by ML models. Model errors are one of the main obstacles to improved accuracy of ML workflows. In this paper, we present Popper, a dataflow system---for expressing ML workflows---that natively supports inflight error handling. Users can extend ML workflows expressed in Popper by plugging in error handlers to improve accuracy. We propose reactive dataflow, a novel cyclic graph-based dataflow model that provides convenient abstractions for interleaving dataflow operators with user-defined error handlers for detecting and correcting errors on the fly. We also propose an efficient execution strategy amenable to pipeline parallel execution of reactive dataflow. We discuss open research challenges for making error handling a first-class citizen in dataflow systems and present preliminary evaluation of our prototypical system, which shows the effectiveness and benefits of inflight error handling in ML workflows.

References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, 265--283.
[2]
Martín Abadi, Frank McSherry, and Gordon D. Plotkin. 2015. Foundations of Differential Dataflow. Lecture Notes in Computer Science, Vol. 9034. Springer Berlin Heidelberg, 71--83. https://doi.org/10.1007/978-3-662-46678-0_5
[3]
Ankur Agiwal, Kevin Lai, Gokul Nath Babu Manoharan, Indrajit Roy, Jagan Sankaranarayanan, Hao Zhang, Tao Zou, Min Chen, Jim Chen, Ming Dai, Thanh Do, Haoyu Gao, Haoyan Geng, Raman Grover, Bo Huang, Yanlai Huang, Adam Li, Jianyi Liang, Tao Lin, Li Liu, Yao Liu, Xi Mao, Maya Meng, Prashant Mishra, Jay Patel, Rajesh S R, Vijayshankar Raman, Sourashis Roy, Mayank Singh Shishodia, Tianhang Sun, Justin Tang, Junichi Tatemura, Sagar Trehan, Ramkumar Vadali, Prasanna Venkatasubramanian, Joey Zhang, Kefei Zhang, Yupu Zhang, Zeleng Zhuang, Goetz Graefe, Divyakanth Agrawal, Jeff Naughton, Sujata Sunil Kosalge, and Hakan Hacigümüş. 2021. Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google. Proceedings of the VLDB Endowment (PVLDB) 14 (12) (2021), 2986--2998.
[4]
Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. Millwheel: Fault-tolerant stream processing at internet scale. Proceedings of the VLDB Endowment 6, 11 (2013), 1033--1044.
[5]
Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, out-of-Order Data Processing. Proc. VLDB Endow. 8, 12 (aug 2015), 1792--1803. https://doi.org/10.14778/2824032.2824076
[6]
Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, and Matei Zaharia. 2018. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). Association for Computing Machinery, 601--613. https://doi.org/10.1145/3183713.3190664
[7]
Jaeho Bang, Gaurav Tarlok Kakkar, Pramod Chunduri, Subrata Mitra, and Joy Arulraj. 2023. Seiden: Revisiting Query Processing in Video Database Systems. Proc. VLDB Endow. 16, 9 (may 2023), 2289--2301. https://doi.org/10.14778/3598581.3598599
[8]
Jose A. Blakeley, Per-Ake Larson, and Frank Wm Tompa. 1986. Efficiently Updating Materialized Views. SIGMOD Rec. 15, 2 (jun 1986), 61--71. https://doi.org/10.1145/16856.16861
[9]
Konstantin Bulatov, Ekaterina Emelianova, Daniil Tropin, Natalya Skoryukina, Yulia Chernyshova, Alexander Sheshkus, Sergey Usilin, Zuheng Ming, Jean-Christophe Burie, Muzzamil Luqman, and Vladimir Arlazarov. 2021. MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document Analysis. https://doi.org/10.48550/arXiv.2107.00396
[10]
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. 2020. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11621--11631.
[11]
Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 36, 4 (2015).
[12]
Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. Scope: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment 1, 2 (2008), 1265--1276.
[13]
Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. 2023. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14572--14581.
[14]
Confluent [n.d.]. Confluent's ksqldb. https://www.confluent.io/product/ksqldb/.
[15]
Bertty Contreras-Rojas, Jorge-Arnulfo Quiané-Ruiz, Zoi Kaoudi, and Saravanan Thirumuruganathan. 2019. Tagsniff: Simplified big data debugging for dataflow jobs. In Proceedings of the ACM Symposium on Cloud Computing. 453--464.
[16]
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters (OSDI'04). USENIX Association, 10.
[17]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. IEEE, 248--255.
[18]
EasyOCR [n.d.]. EasyOCR. https://github.com/JaidedAI/EasyOCR.
[19]
Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. 2012. Spinning Fast Iterative Data Flows. Proceedings of the VLDB Endowment 5, 11 (2012).
[20]
Heng Fan and Haibin Ling. 2019. Parallel Tracking and Verifying. IEEE Transactions on Image Processing 28, 8 (aug 2019), 4130--4144. https://doi.org/10.1109/tip.2019.2904789
[21]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering (CVPR'17). 6904--6913.
[22]
Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai Deep Tetali, Tyson Condie, Todd Millstein, and Miryung Kim. 2016. Bigdebug: Debugging primitives for interactive big data processing in spark. In Proceedings of the 38th International Conference on Software Engineering. 784--795.
[23]
Muhammad Ali Gulzar, Siman Wang, and Miryung Kim. 2018. Bigsift: automated debugging of big data analytics in data-intensive scalable computing. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 863--866.
[24]
Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B. Gibbons, and Onur Mutlu. 2018. Focus: Querying Large Video Datasets with Low Latency and Low Cost. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (OSDI'18). USENIX Association, 269--286.
[25]
HuggingFace [n. d.]. HuggingFace Pipelines. https://huggingface.co/docs/transformers/main_classes/pipelines.
[26]
Robert Ikeda, Semih Salihoglu, and Jennifer Widom. 2011. Provenance-Based Refresh in Data-Oriented Workflows. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (Glasgow, Scotland, UK) (CIKM '11). Association for Computing Machinery, 1659--1668. https://doi.org/10.1145/2063576.2063816
[27]
Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, and Tyson Condie. 2015. Titian: Data Provenance Support in Spark. Proc. VLDB Endow. 9, 3 (nov 2015), 216--227. https://doi.org/10.14778/2850583.2850595
[28]
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007. 59--72.
[29]
Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, Antoine Doucet, et al. 2019. Deep statistical analysis of OCR errors for effective post-OCR processing. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, 29--38.
[30]
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for Natural Language Understanding. CoRR abs/1909.10351 (2019). arXiv:1909.10351 http://arxiv.org/abs/1909.10351
[31]
Daniel Kang, Deepti Raghavan, Peter Bailis, and Matei Zaharia. 2020. Model assertions for monitoring and improving ML models. Proceedings of Machine Learning and Systems 2 (2020), 481--496.
[32]
Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning. PMLR, 5583--5594.
[33]
Avinash Kumar, Zuozhi Wang, Shengquan Ni, and Chen Li. 2020. Amber: A Debuggable Dataflow System Based on the Actor Model. Proc. VLDB Endow. 13, 5 (jan 2020), 740--753. https://doi.org/10.14778/3377369.3377381
[34]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning. PMLR, 12888--12900.
[35]
Xi Li, Weiming Hu, Chunhua Shen, Zhongfei Zhang, Anthony Dick, and Anton Van Den Hengel. 2013. A survey of appearance models in visual object tracking. ACM transactions on Intelligent Systems and Technology (TIST) 4, 4 (2013), 1--48.
[36]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692
[37]
Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP 13). 439--455.
[38]
Milos Nikolic, Mohammad Dashti, and Christoph Koch. 2016. How to Win a Hot Dog Eating Contest: Distributed Incremental View Maintenance with Batch Updates. In Proceedings of the 2016 International Conference on Management of Data. ACM, 511--526. https://doi.org/10.1145/2882903.2915246
[39]
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational Linguistics, Melbourne, Australia, 784--789. https://doi.org/10.18653/v1/P18-2124
[40]
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779--788.
[41]
Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 9895--9901. https://aclanthology.org/2021.emnlp-main.779
[42]
Shreya Shankar, Rolando Garcia, Joseph M. Hellerstein, and Aditya G. Parameswaran. 2022. Operationalizing Machine Learning: An Interview Study. arXiv:2209.09125 [cs.SE]
[43]
Ray Smith. 2007. An overview of the Tesseract OCR engine. In Ninth international conference on document analysis and recognition (ICDAR 2007), Vol. 2. IEEE, 629--633.
[44]
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. 2021. Training data-efficient image transformers and distillation through attention. In International Conference on Machine Learning, Vol. 139. 10347--10357.
[45]
Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. Visual Transformers: Token-based Image Representation and Processing for Computer Vision. arXiv:2006.03677 [cs.CV]
[46]
Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. 2015. Object Tracking Benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2015), 1834--1848. https://api.semanticscholar.org/CorpusID:15287463
[47]
Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2019. LayoutLM: Pre-training of Text and Layout for Document Image Understanding.
[48]
Yan Yan, Yuxing Mao, and Bo Li. 2018. Second: Sparsely embedded convolutional detection. Sensors 18, 10 (2018), 3337.
[49]
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for in-Memory Cluster Computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (San Jose, CA) (NSDI'12). USENIX Association, 1 pages.
[50]
Yue Zhuge, Héctor García-Molina, Joachim Hammer, and Jennifer Widom. 1995. View Maintenance in a Warehousing Environment. SIGMOD Rec. 24, 2 (may 1995), 316--327. https://doi.org/10.1145/568271.223848

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
DEEM '24: Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning
June 2024
89 pages
ISBN:9798400706110
DOI:10.1145/3650203
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 June 2024

Check for updates

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS '24
Sponsor:

Acceptance Rates

DEEM '24 Paper Acceptance Rate 12 of 17 submissions, 71%;
Overall Acceptance Rate 44 of 67 submissions, 66%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 119
    Total Downloads
  • Downloads (Last 12 months)119
  • Downloads (Last 6 weeks)34
Reflects downloads up to 03 Nov 2024

Other Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media