DOI: 10.1145/3540250.3558941
research-article

Improving ML-based information retrieval software with user-driven functional testing and defect class analysis

Published: 09 November 2022

Abstract

Machine Learning (ML) has become the cornerstone of information retrieval (IR) software, as it can drive better user experience by leveraging information-rich data and complex models. However, evaluating the emergent behavior of ML-based IR software can be challenging with traditional software testing approaches: when developers modify the software, they often cannot extract useful information from individual test instances; rather, they seek to holistically verify whether, and where, their modifications caused significant regressions or improvements at scale. In this paper, we introduce not only such a holistic approach to evaluate the system-level behavior of the software, but also the concept of a defect class, which represents a partition of the input space on which the ML-based software does measurably worse for an existing feature or on which the ML task is more challenging for a new feature. We leverage large volumes of automatically obtained functional test cases to derive these defect classes, and propose new ways to improve the IR software from an end-user’s perspective. Applying our approach to a real production Search-AutoComplete system that contains a query interpretation ML component, we demonstrate that (1) our holistic metrics successfully identified two regressions and one improvement, all three of which were independently verified with retrospective A/B experiments, (2) the automatically obtained defect classes provided actionable insights during early-stage ML development, and (3) we also detected defect classes at the finer sub-component level for which there were significant regressions, which we blocked prior to different releases.
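To make the idea of a defect class concrete, the following is a minimal sketch, not the paper's actual method or metrics. It assumes each functional test case has already been labeled with a partition of the input space (the labels `short` and `long` used in the test are hypothetical) and has a pass/fail outcome for both a baseline and a modified build. It flags partitions whose pass rate dropped, using a pooled two-proportion z-test as a stand-in significance check; the function names, the threshold `z_threshold=2.58`, and the minimum sample size `min_cases=30` are all illustrative assumptions.

```python
from collections import defaultdict
from math import sqrt


def pass_rates(results):
    """Aggregate (partition, passed) pairs into {partition: (passed, total)}."""
    counts = defaultdict(lambda: [0, 0])
    for partition, passed in results:
        counts[partition][0] += int(passed)
        counts[partition][1] += 1
    return {p: (c[0], c[1]) for p, c in counts.items()}


def two_proportion_z(passed_a, total_a, passed_b, total_b):
    """z statistic for the difference of two pass rates (pooled variance)."""
    pooled = (passed_a + passed_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0  # both builds pass (or fail) every case in this partition
    return (passed_a / total_a - passed_b / total_b) / se


def regressed_partitions(baseline, candidate, z_threshold=2.58, min_cases=30):
    """Return partitions where the candidate build does measurably worse.

    Each result is (partition, baseline_rate, candidate_rate, z).
    Thresholds are illustrative assumptions, not values from the paper.
    """
    base, cand = pass_rates(baseline), pass_rates(candidate)
    flagged = []
    for partition in sorted(base.keys() & cand.keys()):
        (bp, bn), (cp, cn) = base[partition], cand[partition]
        if min(bn, cn) < min_cases:
            continue  # too few test cases to judge this partition
        z = two_proportion_z(bp, bn, cp, cn)
        if z > z_threshold:
            flagged.append((partition, bp / bn, cp / cn, z))
    return flagged
```

Aggregating by partition rather than inspecting individual test instances mirrors the holistic view described in the abstract: a single failing case says little, while a significant drop across a whole partition points to where a modification regressed.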


    Published In

    ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
    November 2022
    1822 pages
    ISBN:9781450394130
    DOI:10.1145/3540250

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. AutoComplete Search
    2. Information Retrieval System Testing
    3. Machine Learning Testing
    4. Query Interpretation
    5. Relevance Search

    Qualifiers

    • Research-article

    Conference

    ESEC/FSE '22
    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%

    Article Metrics

    • Total Citations: 0
    • Total Downloads: 237
    • Downloads (last 12 months): 59
    • Downloads (last 6 weeks): 5

    Reflects downloads up to 18 Nov 2024
