DOI: 10.1109/ICSE-SEIP52600.2021.00034
On the experiences of adopting automated data validation in an industrial machine learning project

Published: 17 December 2021

Abstract

Background: Data errors are a common challenge in machine learning (ML) projects and generally cause significant performance degradation in ML-enabled software systems. To ensure early detection of erroneous data and to avoid training ML models on bad data, research and industrial practice suggest incorporating a data validation process and tool into the ML system development process.
Aim: The study investigates the adoption of a data validation process and tool in industrial ML projects. The data validation process demands significant engineering resources for tool development and maintenance. It is therefore important to identify best practices for its adoption, especially for development teams in the early phases of deploying ML-enabled software systems.
Method: Action research was conducted at a large, software-intensive organization in telecommunications, specifically within its analytics R&D organization, for an ML use case of classifying faults in returned telecommunications hardware devices.
Results: Based on the evaluation results and learning from our action research, we identified three best practices, three benefits, and two barriers to adopting the data validation process and tool in ML projects. We also propose a data validation framework (DVF) for systematizing the adoption of a data validation process.
Conclusions: The results show that adopting a data validation process and tool in ML projects is an effective approach to testing ML-enabled software systems. It requires an overview of the levels of data (feature, dataset, cross-dataset, data stream) at which particular data quality tests can be applied.
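To make the feature-level vs. dataset-level distinction from the Conclusions concrete, the following is a minimal illustrative sketch of data quality tests at those two levels. It is not the paper's DVF; all function names, thresholds, and the example records are hypothetical.

```python
# Illustrative sketch (not the paper's DVF): data quality tests at two of
# the levels the abstract names, feature level and dataset level.
# Function names, thresholds, and sample records are hypothetical.

def validate_feature(values, lo, hi):
    """Feature-level test: every value falls inside an expected range."""
    return all(lo <= v <= hi for v in values)

def validate_dataset(rows, required_columns, min_rows):
    """Dataset-level test: schema completeness and a minimum row count."""
    if len(rows) < min_rows:
        return False
    return all(required_columns <= set(row) for row in rows)

# Hypothetical records for a hardware-fault classification use case.
rows = [
    {"device_id": "A1", "temperature": 41.5},
    {"device_id": "B2", "temperature": 38.0},
]
print(validate_feature([r["temperature"] for r in rows], lo=-20.0, hi=90.0))
print(validate_dataset(rows, required_columns={"device_id", "temperature"},
                       min_rows=1))
```

In a real pipeline, tests like these would run before training so that erroneous data is rejected or flagged rather than silently degrading the model.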




Published In

ICSE-SEIP '21: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice
May 2021
405 pages
ISBN:9780738146690

In-Cooperation

  • IEEE CS

Publisher

IEEE Press


Author Tags

  1. data errors
  2. data quality
  3. data validation
  4. machine learning
  5. software engineering

Qualifiers

  • Research-article

Conference

ICSE '21


Article Metrics

  • Downloads (last 12 months): 15
  • Downloads (last 6 weeks): 1
Reflects downloads up to 17 Nov 2024.

Cited By
  • (2024) Unsupervised and Supervised Co-learning for Comment-based Codebase Refining and its Application in Code Search. Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 1-12. DOI: 10.1145/3674805.3686664. Online publication date: 24-Oct-2024.
  • (2024) Data Quality Assessment in the Wild: Findings from GitHub. Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pp. 120-129. DOI: 10.1145/3661167.3661213. Online publication date: 18-Jun-2024.
  • (2024) What About the Data? A Mapping Study on Data Engineering for AI Systems. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, pp. 43-52. DOI: 10.1145/3644815.3644954. Online publication date: 14-Apr-2024.
  • (2024) Towards Automatic Translation of Machine Learning Visual Insights to Analytical Assertions. Proceedings of the Third ACM/IEEE International Workshop on NL-based Software Engineering, pp. 29-32. DOI: 10.1145/3643787.3648032. Online publication date: 20-Apr-2024.
  • (2024) Security for Machine Learning-based Software Systems: A Survey of Threats, Practices, and Challenges. ACM Computing Surveys, vol. 56, no. 6, pp. 1-38. DOI: 10.1145/3638531. Online publication date: 23-Feb-2024.
  • (2023) A Case Study on Data Science Processes in an Academia-Industry Collaboration. Proceedings of the XXII Brazilian Symposium on Software Quality, pp. 1-10. DOI: 10.1145/3629479.3629514. Online publication date: 7-Nov-2023.
  • (2023) Automatic and Precise Data Validation for Machine Learning. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 2198-2207. DOI: 10.1145/3583780.3614786. Online publication date: 21-Oct-2023.
  • (2022) Data smells in public datasets. Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, pp. 205-216. DOI: 10.1145/3522664.3528621. Online publication date: 16-May-2022.
  • (2022) Data sovereignty for AI pipelines. Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, pp. 193-204. DOI: 10.1145/3522664.3528593. Online publication date: 16-May-2022.
  • (2022) Data smells. Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI, pp. 229-239. DOI: 10.1145/3522664.3528590. Online publication date: 16-May-2022.
