
How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments

Published: 08 March 2021
DOI: 10.1145/3437963.3441742

Abstract

Effectively measuring, understanding, and improving mobile app performance is of paramount importance for mobile app developers. Across the mobile Internet landscape, companies run online controlled experiments (A/B tests) with thousands of performance metrics in order to understand how app performance causally impacts user retention and to guard against service or app regressions that degrade user experiences. To capture certain characteristics particular to performance metrics, such as enormous observation volume and high skewness in distribution, an industry-standard practice is to construct a performance metric as a quantile over all performance events in control or treatment buckets in A/B tests. In our experience with thousands of A/B tests provided by Snap, we have discovered some pitfalls in this industry-standard way of calculating performance metrics that can lead to unexplained movements in performance metrics and unexpected misalignment with user engagement metrics. In this paper, we discuss two major pitfalls in this industry-standard practice of measuring performance for mobile apps. One arises from strong heterogeneity in both mobile devices and user engagement, and the other arises from self-selection bias caused by post-treatment user engagement changes. To remedy these two pitfalls, we introduce several scalable methods including user-level performance metric calculation and imputation and matching for missing metric values. We have extensively evaluated these methods on both simulation data and real A/B tests, and have deployed them into Snap's in-house experimentation platform.


Information

Published In

WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining
March 2021
1192 pages
ISBN: 9781450382977
DOI: 10.1145/3437963


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. online controlled experiment
  2. performance metrics
  3. sample ratio mismatch
  4. self-selection bias

Qualifiers

  • Research-article

Conference

WSDM '21

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

