research-article

Open access

Reactive Dataflow for Inflight Error Handling in ML Workflows

Authors:

Abhilash Jindal,

Kaustubh Beedkar,

J. Nausheen Mohammed,

Keerti Choudhary

DEEM '24: Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning

Pages 51 - 61

https://doi.org/10.1145/3650203.3663333

Published: 09 June 2024

Abstract

Modern data analytics pipelines comprise traditional data transformation operations and pre-trained ML models deployed as user-defined functions (UDFs). Such pipelines, which we call ML workflows, generally produce erroneous results due to data errors inadvertently introduced by ML models. Model errors are one of the main obstacles to improving the accuracy of ML workflows. In this paper, we present Popper, a dataflow system for expressing ML workflows, with native support for inflight error handling. Users can extend ML workflows expressed in Popper by plugging in error handlers to improve accuracy. We propose reactive dataflow, a novel cyclic graph-based dataflow model that provides convenient abstractions for interleaving dataflow operators with user-defined error handlers that detect and correct errors on the fly. We also propose an efficient execution strategy amenable to pipeline-parallel execution of reactive dataflow. We discuss open research challenges for making error handling a first-class citizen in dataflow systems and present a preliminary evaluation of our prototype system, which shows the effectiveness and benefits of inflight error handling in ML workflows.
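To make the idea of interleaving an operator with an error handler more concrete, the following is a minimal Python sketch of a cyclic dataflow step. It is not Popper's API, which this page does not describe; the names `run_with_handler`, `mock_ocr`, and `not_numeric`, the retry limit, and the retry-based correction strategy are all assumptions chosen for illustration. A flagged output is routed back to the operator over a feedback edge instead of being passed downstream.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Tuple


def run_with_handler(
    inputs: Iterable[str],
    operator: Callable[[str, int], str],
    detect: Callable[[str], bool],
    max_retries: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Apply `operator` to each input and re-queue inputs whose output
    is flagged by `detect`, up to `max_retries` extra attempts."""
    queue = deque((item, 0) for item in inputs)
    while queue:
        item, attempt = queue.popleft()
        output = operator(item, attempt)
        if detect(output) and attempt < max_retries:
            # Feedback edge: send the record back to the operator
            # while other records keep flowing downstream.
            queue.append((item, attempt + 1))
            continue
        yield item, output


def mock_ocr(text: str, attempt: int) -> str:
    # Hypothetical stand-in for a pre-trained model UDF:
    # the first attempt misreads the digit "0" as the letter "O".
    return text.replace("0", "O") if attempt == 0 else text


def not_numeric(output: str) -> bool:
    # Hypothetical error detector: the field is expected to be all digits.
    return not output.isdigit()


results = list(run_with_handler(["2024", "1995"], mock_ocr, not_numeric))
print(results)
```

In this sketch the corrected record for "2024" is emitted after "1995", because the retry re-enters the queue behind records that are already in flight. A real system would have to decide whether downstream operators tolerate such reordering.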



Information & Contributors

Information

Published In

DEEM '24: Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning

June 2024

89 pages

ISBN:9798400706110

DOI:10.1145/3650203

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SIGMOD/PODS '24

Sponsor:

SIGMOD

SIGMOD/PODS '24: International Conference on Management of Data

June 9, 2024

AA, Santiago, Chile

Acceptance Rates

DEEM '24 Paper Acceptance Rate 12 of 17 submissions, 71%;

Overall Acceptance Rate 44 of 67 submissions, 66%
