Improving ML-based information retrieval software with user-driven functional testing and defect class analysis

Published: 09 November 2022


Machine Learning (ML) has become the cornerstone of information retrieval (IR) software, as it can drive a better user experience by leveraging information-rich data and complex models. However, evaluating the emergent behavior of ML-based IR software can be challenging with traditional software testing approaches: when developers modify the software, they often cannot extract useful information from individual test instances; rather, they seek to verify holistically whether, and where, their modifications caused significant regressions or improvements at scale. In this paper, we introduce not only such a holistic approach to evaluating the system-level behavior of the software, but also the concept of a defect class, which represents a partition of the input space on which the ML-based software does measurably worse for an existing feature or on which the ML task is more challenging for a new feature. We leverage large volumes of automatically obtained functional test cases to derive these defect classes, and propose new ways to improve the IR software from an end-user’s perspective. Applying our approach to a real production Search-AutoComplete system that contains a query interpretation ML component, we demonstrate that (1) our holistic metrics successfully identified two regressions and one improvement, all three of which were independently verified with retrospective A/B experiments, (2) the automatically obtained defect classes provided actionable insights during early-stage ML development, and (3) we also detected defect classes at the finer sub-component level for which there were significant regressions, which we blocked prior to different releases.
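To make the notion of a defect class concrete, the following is a minimal Python sketch and not the paper's implementation. It assumes each functional test result is a hypothetical `(partition, passed)` pair, that the partition labels (`short_query`, `misspelled`, `long_query`) are invented for illustration, and that "measurably worse" can be approximated by a fixed pass-rate margin rather than a statistical test.

```python
from collections import defaultdict


def pass_rates(results):
    """Map each partition to its pass rate, given (partition, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for partition, passed in results:
        totals[partition] += 1
        passes[partition] += int(passed)
    return {p: passes[p] / totals[p] for p in totals}


def candidate_defect_classes(results, margin=0.1):
    """Partitions whose pass rate trails the overall pass rate by more than `margin`.

    The fixed margin is an assumption of this sketch, not a method from the paper.
    """
    results = list(results)
    overall = sum(int(passed) for _, passed in results) / len(results)
    rates = pass_rates(results)
    return sorted(p for p, rate in rates.items() if overall - rate > margin)


def regressed_partitions(baseline, candidate, margin=0.1):
    """Partitions whose pass rate dropped by more than `margin` after a change."""
    base = pass_rates(baseline)
    cand = pass_rates(candidate)
    return sorted(p for p in base if p in cand and base[p] - cand[p] > margin)


def make(partition, n_pass, n_total):
    """Build synthetic results: the first n_pass of n_total cases pass."""
    return [(partition, i < n_pass) for i in range(n_total)]


# Synthetic, made-up data for illustration only.
baseline = make("short_query", 4, 4) + make("misspelled", 1, 4) + make("long_query", 3, 4)
candidate = make("short_query", 4, 4) + make("misspelled", 2, 4) + make("long_query", 1, 4)

baseline_defects = candidate_defect_classes(baseline)
regressions = regressed_partitions(baseline, candidate)
print(baseline_defects, regressions)
```

On this synthetic data, the `misspelled` partition is flagged in the baseline because its pass rate sits well below the overall rate, and `long_query` is flagged as a regression after the change. A production setup would presumably replace the fixed margin with a significance test over much larger test volumes.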


ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2022


