Abstract
In many applications involving regression analysis, explanatory variables (or covariates) may be imprecisely measured or may contain missing values. Although there exists a vast literature on measurement error modeling to account for errors-in-variables, and on missing data methodology to handle missingness, very few methods have been developed to simultaneously address both. In this paper, we consider likelihood-based multiple imputation to handle missing data, and combine this with two well-known functional measurement error methods: simulation-extrapolation and corrected score. This unified approach has several appealing characteristics: the model fitting procedure is easy to understand and off-the-shelf software can be incorporated into the modeling framework; no calibration data or a validation subset is required in the model fitting procedure; and the missing data component of the proposed approach is likelihood-based which allows standard likelihood machinery. We demonstrate our methods on simulated datasets and apply them to daily ozone pollution measurements in Los Angeles where observed covariates consist of missing data and imprecise measurements. We conclude that the proposed methods substantially reduce bias and mean squared errors in regression coefficients, in comparison to methods that ignore either measurement error or missingness in covariates.
Similar content being viewed by others
References
Armstrong B (1985) Measurement error in the generalised linear model. Commun Stat Simul Comput 14(3):529–544
Arunajadai SG, Rauh VA (2012) Handling covariates subject to limits of detection in regression. Environ Ecol Stat 19(3):369–391
Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598
Buzas JS, Stefanski LA (1996) A note on corrected-score estimation. Stat Probab Lett 28(1):1–8
Carroll RJ, Freedman L, Pee D (1997) Design aspects of calibration studies in nutrition, with analysis of missing data in linear measurement error models. Biometrics 53(4):1440–1457
Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM (2006) Measurement error in nonlinear models: a modern perspective. Chapman & Hall CRC Press, Boca Raton
Casella G, Moreno E (2006) Objective Bayesian variable selection. J Am Stat Assoc 101(473):157–167
Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat 2:73–82
Chen J, Hanfelt JJ, Huang Y (2015) A simple corrected score for logistic regression with errors-in-covariates. Commun Stat Theory Methods 44(10):2024–2036
Cook JR, Stefanski LA (1994) Simulation-extrapolation estimation in parametric measurement error models. J Am Stat Assoc 89(428):1314–1328
Diebolt J, Celeux G (1993) Asymptotic properties of a Stochastic EM algorithm for estimating mixing proportions. Stoch Models 9:599–613
Diebolt J, Ip EHS (1996) Stochastic EM: method and application. In: Gilks WR, Richardson S, Spiegelhalter DJ (eds) Markov Chain Monte Carlo in practice. Chapman & Hall, London, pp 259–273
Efron B, Tibshirani R (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci 1(1):54–75
Eugster MJ, Leisch F (2011) Weighted and robust archetypal analysis. Comput Stat Data Anal 55(3):1215–1225
Fuller WA (1987) Measurement error models. Wiley, New York
Huang Y, Wang CY (2000) Cox regression with accurate covariates unascertainable: a nonparametric-correction approach. J Am Stat Assoc 95:1209–1219
Huang Y, Wang CY (2001) Consistent functional methods for logistic regression with errors in covariates. J Am Stat Assoc 96:1469–1482
Lederer W, Kuchenhoff H (2006) A short introduction to the SIMEX and MCSIMEX. R News 6(4):26–31
Liang H, Wang S, Carroll RJ (2007) Partially linear models with missing response variables and error-prone covariates. Biometrika 94(1):185–198
Little RJ, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York
Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J R Stat Soc Ser B Stat Methodol 44(2):226–233
Miller A (2002) Subset selection in regression. Chapman & Hall CRC Press, Boca Raton
Nakamura T (1990) Corrected score function for errors-in-variables models: methodology and application to generalized linear models. Biometrika 77(1):127–137
Nicoletti C, Peracchi F, Foliano F (2011) Estimating income poverty in the presence of missing data and measurement error. J Bus Econ Stat 29(1):61–72
Nielsen SF (2000) The stochastic EM algorithm: estimation and asymptotic results. Bernoulli 6(3):457–489
Novick SJ, Stefanski LA (2002) Corrected score estimation via complex variable simulation extrapolation. J Am Stat Assoc 97(458):472–481
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
Schafer DW (1987) Covariate measurement error in generalized linear models. Biometrika 72:385–391
Schafer JL (1999) Multiple imputation: a primer. Stat Methods Med Res 8(1):3–15
Seaman SR, White IR (2011) Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 22(3):278–295
Shen CW, Chen YH (2016) Model selection for marginal regression analysis of longitudinal data with missing observations and covariate measurement error. Biostatistics 16(4):740–753
Song X, Wang CY (2019) GMM nonparametric correction methods for logistic regression with error-contaminated covariates and partially observed instrumental variables. Scand J Stat 46(3):898–919
Stefanski LA (1989) Unbiased estimation of a nonlinear function a normal mean with application to measurement error for models. Commun Stat Theory Methods 18(12):4335–4358
Stefanski LA, Carroll RJ (1985) Covariate measurement error in logistic regression. Ann Stat 13(4):1335–1351
Stefanski LA, Cook JR (1995) Simulation-extrapolation: the measurement error jackknife. J Am Stat Assoc 90(432):1247–1256
Stoklosa J, Daly C, Foster SD, Ashcroft MB, Warton DI (2015) A climate of uncertainty: accounting for error in climate variables for species distribution models. Methods Ecol Evol 6(4):412–423
Stoklosa J, Lee SH, Hwang WH (2019) Closed-population capture-recapture models with measurement error and missing observations in covariates. Stat Sin 29(2):589–610
Van Buuren S (2012) Flexible imputation of missing data. Chapman & Hall CRC Press, Boca Raton
van Buuren S, Groothuis-Oudshoorn K (2011) MICE: Multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67
von Hippel PT (2013) The bias and efficiency of incomplete-data estimators in small univariate normal samples. Social Methods Res 42(4):531–558
Wang N, Robins JM (1998) Large-sample theory for parametric multiple imputation procedures. Biometrika 85:935–948
Wang CY, Huang Y, Chao EC, Jeffcoat MK (2008) Expected estimating equations for missing data, measurement error, and misclassification, with application to longitudinal nonignorable missing data. Biometrics 64(1):85–95
Wang M, Sun X, Lu T (2015) Bayesian structured variable selection in linear regression models. Comput Stat 30(1):205–229
Yi GY, Ma Y, Carroll RJ (2012) A functional generalized method of moments approach for longitudinal studies with missing responses and covariate measurement error. Biometrika 99(1):151–165
Acknowledgements
This work was supported by an Australian Government Research Training Program Scholarship. We thank the associate editor and the three anonymous reviewers for their constructive comments.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material
Supplementary material
The following plots, Figs. 8 and 9, show the results of the simulation study in Sect. 3.1 in terms of the magnitute of bias when \(n=50\) and \(n=1000\). Figures 8, 9, 10 and 11 show the accuracy of comparing methods in parameter estimation when \(X \sim N(0,1)\) and \(X \sim U(-1,1)\), respectively, for each n. Figures 8 and 9 are consistent with Fig. 1 (see Sect. 3.1) when \(n=100\) and \(X \sim N(0,1)\). They show that MI-CS and MI-SIMEX outperform the other comparing methods in terms of accuracy as the missingness and \(\sigma ^2_u\) are increased. Also, Figs. 8 and 9 are comparable with Fig. 2 in Sect. 3.1 when \(n=100\) and \(X \sim U(-1,1)\) in a similar way.
The following plots, Figs. 12 and 13, show the results of the simulation study in Sect. 3.1 in terms of MSE when \(n=50\) and \(n=1000\). Figures 12 and 13 show the predictive performances of the comparing methods when \(n=50\) and \(n=1000\), respectively. Comparing Fig. 3 (see Sect. 3.1) with these two plots suggests that MI-CS had the best predictive performance as the missingness and \(\sigma ^2_u\) increased. MI-SIMEX showed relatively high MSE for \(n=50\), however, the predictive performance of MI-CS improved substantially and equaled MI-SIMEX as n increased. The predictive performances of the comparing methods were similar when \(X \sim N(0,1)\) and \(X \sim U(-1,1)\).
The following plot, Fig. 14, shows the results of the Poisson log-linear simulation study in Sect. 3.2 when \(X \sim U(-1,1)\). Figure 14 is consistent with Fig. 4 (see Sect. 3.2) for when \(X \sim N(0,1)\). It shows that MI-CS and MI-SIMEX outperform the other comparing methods in terms of accuracy as the missingness and \(\sigma ^2_u\) increased.
Rights and permissions
About this article
Cite this article
Noghrehchi, F., Stoklosa, J. & Penev, S. Multiple imputation and functional methods in the presence of measurement error and missingness in explanatory variables. Comput Stat 35, 1291–1317 (2020). https://doi.org/10.1007/s00180-020-00976-2
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00180-020-00976-2