Challenges, Best Practices and Pitfalls in Evaluating Results of Online Controlled Experiments

Published: 25 July 2019. DOI: 10.1145/3292500.3332297

Abstract

A/B testing is the gold standard for estimating the causal relationship between a change in a product and its impact on key outcome measures. It is widely used in industry to test changes ranging from simple copy or UI tweaks to more complex changes such as using machine learning models to personalize the user experience. The key aspect of A/B testing is the evaluation of experiment results. Designing the right set of metrics (correct outcome measures, data quality indicators, guardrails that prevent harm to the business, and a comprehensive set of supporting metrics to understand the "why" behind the key movements) is the #1 challenge practitioners face when trying to scale their experimentation program [18, 22]. On the technical side, improving the sensitivity of experiment metrics is a hard problem and an active research area, with large practical implications as more and more small and medium-sized businesses adopt A/B testing and suffer from insufficient statistical power. In this tutorial we discuss challenges, best practices, and pitfalls in evaluating experiment results, focusing on lessons learned and practical guidelines as well as open research questions.
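
To make these evaluation concerns concrete, the sketch below illustrates the most basic readout steps on a finished experiment: a sample ratio mismatch (SRM) check as a data quality guardrail, a two-sample t-test on an outcome metric, and a rough minimum detectable effect (MDE) estimate showing how limited traffic translates into insufficient power. This is a minimal illustration, not part of the tutorial materials; the helper names (srm_check, metric_readout, minimum_detectable_effect), the 50/50 expected split, and the simulated data are assumptions made for this example.

```python
# Minimal sketch (illustrative, not from the tutorial): an SRM data quality
# check, a two-sample t-test readout on a per-user metric, and a rough MDE
# estimate. The 50/50 expected split and the simulated data are assumptions.
import numpy as np
from scipy import stats


def srm_check(n_control, n_treatment, expected_split=0.5, alpha=0.001):
    """Chi-squared test that the observed traffic split matches the design."""
    total = n_control + n_treatment
    expected = [total * (1 - expected_split), total * expected_split]
    _, p_value = stats.chisquare([n_control, n_treatment], f_exp=expected)
    # A tiny p-value means the split is off: the experiment data is suspect.
    return p_value, p_value < alpha


def metric_readout(control, treatment, alpha=0.05):
    """Welch's t-test on a per-user metric for control vs. treatment."""
    delta = treatment.mean() - control.mean()
    _, p_value = stats.ttest_ind(treatment, control, equal_var=False)
    return {
        "delta": delta,
        "delta_pct": 100.0 * delta / control.mean(),
        "p_value": p_value,
        "stat_sig": p_value < alpha,
    }


def minimum_detectable_effect(std, n_per_group, alpha=0.05, power=0.8):
    """Approximate absolute MDE for a two-sided test with equal group sizes."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return (z_alpha + z_power) * np.sqrt(2.0 * std ** 2 / n_per_group)


# Illustrative usage on simulated per-user data (e.g. sessions per user).
rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=4.0, size=100_000)
treatment = rng.normal(loc=10.05, scale=4.0, size=100_300)

srm_p, srm_flagged = srm_check(len(control), len(treatment))
print(f"SRM p-value: {srm_p:.4f}, flagged: {srm_flagged}")
print(metric_readout(control, treatment))
print(f"MDE at this sample size: {minimum_detectable_effect(4.0, 100_000):.4f}")
```

Improving metric sensitivity, for example via variance reduction using pre-experiment data [10], aims to shrink the variance of the delta in exactly this kind of readout and thereby lower the MDE for a fixed amount of traffic.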

Supplementary Material

Part 1 of 4 (p3189-shi_part1.mp4)
Part 2 of 4 (p3189-shi_part2.mp4)
Part 3 of 4 (p3189-shi_part3.mp4)
Part 4 of 4 (p3189-shi_part4.mp4)

References

[1] A/B Testing at Scale Tutorial, Strata 2018: https://exp-platform.com/2018StrataABtutorial/. Accessed: 2019-02-05.
[2] AdvancedTopic_SpeedMatters.docx - Microsoft Word Online: https://onedrive.live.com/view.aspx?resid=8612090E610871E4!286184&ithint=file%2Cdocx&app=Word&authkey=!AGm2aNDnCqKOsYc. Accessed: 2019-02-20.
[3] Fabijan, A., Gupchup, J., Gupta, S., Omhover, J., Qin, W., Vermeer, L. and Dmitriev, P. Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners. Accepted to appear in KDD 2019.
[4] Budylin, R. et al. 2018. Online Evaluation for Effective Web Service Development. WWW 2018 (Sep. 2018).
[5] Chapelle, O. et al. 2012. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems. 30, 1 (Feb. 2012), 1--41.
[6] Chen, A.C. and Fu, X. 2017. Data + Intuition. Proceedings of the 26th International Conference on World Wide Web Companion - WWW '17 Companion (New York, New York, USA, 2017), 617--625.
[7] Deng, A. et al. 2017. A/B testing at scale: Accelerating software innovation. SIGIR 2017 - Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (2017).
[8] Deng, A. et al. 2016. Concise Summarization of Heterogeneous Treatment Effect Using Total Variation Regularized Regression. (Oct. 2016).
[9] Deng, A. et al. 2016. Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing. Proceedings - 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016. 2, 3 (2016), 243--252.
[10] Deng, A. et al. 2013. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining - WSDM '13 (New York, New York, USA, 2013), 123.
[11] Deng, A. 2015. Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments. Proceedings of the 24th International Conference on World Wide Web - WWW '15 Companion (New York, New York, USA, 2015), 923--928.
[12] Deng, A. and Shi, X. 2016. Data-Driven Metric Development for Online Controlled Experiments. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16 (2016), 77--86.
[13] Dmitriev, P. et al. 2017. A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '17 (Halifax, Nova Scotia, Canada, 2017).
[14] Dmitriev, P. et al. 2016. Pitfalls of long-term online controlled experiments. 2016 IEEE International Conference on Big Data (Big Data) (Washington, DC, USA, Dec. 2016), 1367--1376.
[15] Dmitriev, P. and Wu, X. 2016. Measuring Metrics. Proceedings of the 25th ACM International Conference on Information and Knowledge Management - CIKM '16 (2016), 429--437.
[16] Fabijan, A. et al. 2018. Online Controlled Experimentation at Scale: An Empirical Survey on the Current State of A/B Testing. Proceedings of the 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) (Prague, Czechia, 2018).
[17] Fabijan, A. et al. 2018. The Online Controlled Experiment Lifecycle. IEEE Software. (2018), 1--1.
[18] Fabijan, A. et al. 2019. Three Key Checklists and Remedies for Trustworthy Analysis of Online Controlled Experiments at Scale. To appear in the proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), Software Engineering in Practice (SEIP) (Montreal, Canada, 2019).
[19] Fu, X. and Asorey, H. 2015. Data-Driven Product Innovation. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '15 (New York, New York, USA, 2015), 2311--2312.
[20] Gupchup, J. et al. 2018. Trustworthy Experimentation Under Telemetry Loss. To appear in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management - CIKM '18 (Lingotto, Turin, 2018).
[21] Gupta, S. et al. 2019. A/B Testing at Scale: Accelerating Software Innovation. Companion Proceedings of The 2019 World Wide Web Conference - WWW '19 (New York, New York, USA, 2019), 1299--1300.
[22] Gupta, S. et al. 2018. The Anatomy of a Large-Scale Experimentation Platform. 2018 IEEE International Conference on Software Architecture (ICSA) (Seattle, USA, Apr. 2018), 1--109.
[23] Gupta, S. et al. 2019. Top Challenges from the first Practical Online Controlled Experiments Summit. ACM SIGKDD Explorations Newsletter. 21, 1 (May 2019), 20--35.
[24] Hassan, A. et al. 2013. Beyond clicks. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management - CIKM '13 (New York, New York, USA, 2013), 2019--2028.
[25] Hohnhold, H. et al. 2015. Focusing on the Long-term. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '15 (New York, New York, USA, 2015), 1849--1858.
[26] Jiang, J. et al. 2015. Understanding and Predicting Graded Search Satisfaction. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining - WSDM '15 (New York, New York, USA, 2015), 57--66.
[27] Kharitonov, E. et al. 2017. Learning Sensitive Combinations of A/B Test Metrics. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining - WSDM '17 (2017), 651--659.
[28] Kohavi, R. et al. 2012. Trustworthy online controlled experiments: Five Puzzling Outcomes Explained. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '12 (New York, New York, USA, 2012), 786.
[29] Kohavi, R. 2015. Online Controlled Experiments. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '15 (New York, New York, USA, 2015), 1--1.
[30] Machmouchi, W. and Buscher, G. 2016. Principles for the Design of Online A/B Metrics. Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '16 (New York, New York, USA, 2016), 589--590.
[31] Rodden, K. et al. 2010. Measuring the User Experience on a Large Scale: User-Centered Metrics for Web Applications. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2010), 2395--2398.
[32] Song, Y. et al. 2013. Evaluating and predicting user engagement change with degraded search relevance. Proceedings of the 22nd International Conference on World Wide Web - WWW '13 (New York, New York, USA, 2013), 1213--1224.
[33] Xie, Y. et al. 2018. False Discovery Rate Controlled Heterogeneous Treatment Effect Detection for Online Controlled Experiments. (2018), 876--885.
[34] Xu, Y. et al. 2015. From Infrastructure to Culture. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '15 (New York, New York, USA, 2015), 2227--2236.
[35] Zhao, Z. et al. 2016. Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (Oct. 2016), 498--507.

Cited By

  • (2020) Challenges, Best Practices and Pitfalls in Evaluating Results of Online Controlled Experiments. Companion Proceedings of the Web Conference 2020, 317-319. https://doi.org/10.1145/3366424.3383117. Online publication date: 20-Apr-2020.
  • (2020) Challenges, Best Practices and Pitfalls in Evaluating Results of Online Controlled Experiments. Proceedings of the 13th International Conference on Web Search and Data Mining, 877-880. https://doi.org/10.1145/3336191.3371871. Online publication date: 20-Jan-2020.



Published In

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2019
3305 pages
ISBN:9781450362016
DOI:10.1145/3292500
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 July 2019

Author Tags

  1. a/b testing
  2. controlled experiments
  3. online metrics
  4. user experience evaluation

Qualifiers

  • Tutorial

Conference

KDD '19

Acceptance Rates

KDD '19 Paper Acceptance Rate: 110 of 1,200 submissions, 9%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

