DOI: 10.1109/ICSE-SEIP52600.2021.00034
On the experiences of adopting automated data validation in an industrial machine learning project

Published: 17 December 2021

Abstract

Background: Data errors are a common challenge in machine learning (ML) projects and generally cause significant performance degradation in ML-enabled software systems. To ensure early detection of erroneous data and to avoid training ML models on bad data, research and industrial practice suggest incorporating a data validation process and tool into the ML system development process.
Aim: The study investigates the adoption of a data validation process and tool in industrial ML projects. The data validation process demands significant engineering resources for tool development and maintenance. It is therefore important to identify best practices for its adoption, especially for development teams in the early phases of deploying ML-enabled software systems.
Method: Action research was conducted at a large, software-intensive organization in telecommunications, specifically within its analytics R&D organization, for an ML use case of classifying faults in returned telecommunications hardware devices.
Results: Based on the evaluation results and learning from our action research, we identified three best practices, three benefits, and two barriers to adopting the data validation process and tool in ML projects. We also propose a data validation framework (DVF) for systematizing the adoption of a data validation process.
Conclusions: The results show that adopting a data validation process and tool in ML projects is an effective approach to testing ML-enabled software systems. It requires an overview of the levels of data (feature, dataset, cross-dataset, data stream) at which particular data quality tests can be applied.
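To make the feature-level vs. dataset-level distinction from the Conclusions concrete, the following is a minimal illustrative sketch of data quality tests at those two levels. It is not the paper's DVF; all function names, thresholds, and the example records are hypothetical.

```python
# Illustrative sketch (not the paper's DVF): data quality tests at two of
# the levels the abstract names, feature level and dataset level.
# Function names, thresholds, and sample records are hypothetical.

def validate_feature(values, lo, hi):
    """Feature-level test: every value falls inside an expected range."""
    return all(lo <= v <= hi for v in values)

def validate_dataset(rows, required_columns, min_rows):
    """Dataset-level test: schema completeness and a minimum row count."""
    if len(rows) < min_rows:
        return False
    return all(required_columns <= set(row) for row in rows)

# Hypothetical records for a hardware-fault classification use case.
rows = [
    {"device_id": "A1", "temperature": 41.5},
    {"device_id": "B2", "temperature": 38.0},
]
print(validate_feature([r["temperature"] for r in rows], lo=-20.0, hi=90.0))
print(validate_dataset(rows, required_columns={"device_id", "temperature"},
                       min_rows=1))
```

In a real pipeline, tests like these would run before training so that erroneous data is rejected or flagged rather than silently degrading the model.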




Published In

ICSE-SEIP '21: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice
May 2021
405 pages
ISBN:9780738146690

In-Cooperation

  • IEEE CS

Publisher

IEEE Press


Author Tags

  1. data errors
  2. data quality
  3. data validation
  4. machine learning
  5. software engineering

Qualifiers

  • Research-article

Conference

ICSE '21


Article Metrics

  • Downloads (last 12 months): 15
  • Downloads (last 6 weeks): 1
Reflects downloads up to 17 Nov 2024.

Cited By
  • (2024) Unsupervised and Supervised Co-learning for Comment-based Codebase Refining and its Application in Code Search. Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 1-12. DOI: 10.1145/3674805.3686664. Online publication date: 24-Oct-2024.
  • (2024) Data Quality Assessment in the Wild: Findings from GitHub. Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pp. 120-129. DOI: 10.1145/3661167.3661213. Online publication date: 18-Jun-2024.
  • (2024) What About the Data? A Mapping Study on Data Engineering for AI Systems. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, pp. 43-52. DOI: 10.1145/3644815.3644954. Online publication date: 14-Apr-2024.
  • (2024) Towards Automatic Translation of Machine Learning Visual Insights to Analytical Assertions. Proceedings of the Third ACM/IEEE International Workshop on NL-based Software Engineering, pp. 29-32. DOI: 10.1145/3643787.3648032. Online publication date: 20-Apr-2024.
  • (2024) Security for Machine Learning-based Software Systems: A Survey of Threats, Practices, and Challenges. ACM Computing Surveys, vol. 56, no. 6, pp. 1-38. DOI: 10.1145/3638531. Online publication date: 23-Feb-2024.
  • (2023) A Case Study on Data Science Processes in an Academia-Industry Collaboration. Proceedings of the XXII Brazilian Symposium on Software Quality, pp. 1-10. DOI: 10.1145/3629479.3629514. Online publication date: 7-Nov-2023.
  • (2023) Automatic and Precise Data Validation for Machine Learning. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 2198-2207. DOI: 10.1145/3583780.3614786. Online publication date: 21-Oct-2023.
  • (2022) Data smells in public datasets. Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, pp. 205-216. DOI: 10.1145/3522664.3528621. Online publication date: 16-May-2022.
  • (2022) Data sovereignty for AI pipelines. Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, pp. 193-204. DOI: 10.1145/3522664.3528593. Online publication date: 16-May-2022.
  • (2022) Data smells. Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, pp. 229-239. DOI: 10.1145/3522664.3528590. Online publication date: 16-May-2022.
