
How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments

Published: 08 March 2021
DOI: 10.1145/3437963.3441742

Abstract

Effectively measuring, understanding, and improving mobile app performance is of paramount importance for mobile app developers. Across the mobile Internet landscape, companies run online controlled experiments (A/B tests) with thousands of performance metrics in order to understand how app performance causally impacts user retention and to guard against service or app regressions that degrade user experiences. To capture certain characteristics particular to performance metrics, such as enormous observation volume and high skewness in distribution, an industry-standard practice is to construct a performance metric as a quantile over all performance events in control or treatment buckets in A/B tests. In our experience with thousands of A/B tests provided by Snap, we have discovered some pitfalls in this industry-standard way of calculating performance metrics that can lead to unexplained movements in performance metrics and unexpected misalignment with user engagement metrics. In this paper, we discuss two major pitfalls in this industry-standard practice of measuring performance for mobile apps. One arises from strong heterogeneity in both mobile devices and user engagement, and the other arises from self-selection bias caused by post-treatment user engagement changes. To remedy these two pitfalls, we introduce several scalable methods including user-level performance metric calculation and imputation and matching for missing metric values. We have extensively evaluated these methods on both simulation data and real A/B tests, and have deployed them into Snap's in-house experimentation platform.


Information

Published In

WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining
March 2021
1192 pages
ISBN: 9781450382977
DOI: 10.1145/3437963


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. online controlled experiment
  2. performance metrics
  3. sample ratio mismatch
  4. self-selection bias

Qualifiers

  • Research-article

Conference

WSDM '21

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%

