Multiple imputation and functional methods in the presence of measurement error and missingness in explanatory variables

Firouzeh Noghrehchi¹,
Jakub Stoklosa^1,2 &
Spiridon Penev¹

389 Accesses
3 Citations
Explore all metrics

Abstract

In many applications involving regression analysis, explanatory variables (or covariates) may be imprecisely measured or may contain missing values. Although there exists a vast literature on measurement error modeling to account for errors-in-variables, and on missing data methodology to handle missingness, very few methods have been developed to simultaneously address both. In this paper, we consider likelihood-based multiple imputation to handle missing data, and combine this with two well-known functional measurement error methods: simulation-extrapolation and corrected score. This unified approach has several appealing characteristics: the model fitting procedure is easy to understand and off-the-shelf software can be incorporated into the modeling framework; no calibration data or a validation subset is required in the model fitting procedure; and the missing data component of the proposed approach is likelihood-based which allows standard likelihood machinery. We demonstrate our methods on simulated datasets and apply them to daily ozone pollution measurements in Los Angeles where observed covariates consist of missing data and imprecise measurements. We conclude that the proposed methods substantially reduce bias and mean squared errors in regression coefficients, in comparison to methods that ignore either measurement error or missingness in covariates.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Estimation of logistic regression with covariates missing separately or simultaneously via multiple imputation methods

Article 15 July 2022

The effect of high prevalence of missing data on estimation of the coefficients of a logistic regression model when using multiple imputation

Article Open access 18 July 2022

Improving the Robustness of Parametric Imputation

References

Armstrong B (1985) Measurement error in the generalised linear model. Commun Stat Simul Comput 14(3):529–544
MATH Google Scholar
Arunajadai SG, Rauh VA (2012) Handling covariates subject to limits of detection in regression. Environ Ecol Stat 19(3):369–391
MathSciNet Google Scholar
Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598
MathSciNet MATH Google Scholar
Buzas JS, Stefanski LA (1996) A note on corrected-score estimation. Stat Probab Lett 28(1):1–8
MathSciNet MATH Google Scholar
Carroll RJ, Freedman L, Pee D (1997) Design aspects of calibration studies in nutrition, with analysis of missing data in linear measurement error models. Biometrics 53(4):1440–1457
MATH Google Scholar
Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM (2006) Measurement error in nonlinear models: a modern perspective. Chapman & Hall CRC Press, Boca Raton
MATH Google Scholar
Casella G, Moreno E (2006) Objective Bayesian variable selection. J Am Stat Assoc 101(473):157–167
MathSciNet MATH Google Scholar
Celeux G, Diebolt J (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comput Stat 2:73–82
Google Scholar
Chen J, Hanfelt JJ, Huang Y (2015) A simple corrected score for logistic regression with errors-in-covariates. Commun Stat Theory Methods 44(10):2024–2036
MathSciNet MATH Google Scholar
Cook JR, Stefanski LA (1994) Simulation-extrapolation estimation in parametric measurement error models. J Am Stat Assoc 89(428):1314–1328
MATH Google Scholar
Diebolt J, Celeux G (1993) Asymptotic properties of a Stochastic EM algorithm for estimating mixing proportions. Stoch Models 9:599–613
MathSciNet MATH Google Scholar
Diebolt J, Ip EHS (1996) Stochastic EM: method and application. In: Gilks WR, Richardson S, Spiegelhalter DJ (eds) Markov Chain Monte Carlo in practice. Chapman & Hall, London, pp 259–273
Google Scholar
Efron B, Tibshirani R (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci 1(1):54–75
MathSciNet MATH Google Scholar
Eugster MJ, Leisch F (2011) Weighted and robust archetypal analysis. Comput Stat Data Anal 55(3):1215–1225
MathSciNet MATH Google Scholar
Fuller WA (1987) Measurement error models. Wiley, New York
MATH Google Scholar
Huang Y, Wang CY (2000) Cox regression with accurate covariates unascertainable: a nonparametric-correction approach. J Am Stat Assoc 95:1209–1219
MathSciNet MATH Google Scholar
Huang Y, Wang CY (2001) Consistent functional methods for logistic regression with errors in covariates. J Am Stat Assoc 96:1469–1482
MathSciNet MATH Google Scholar
Lederer W, Kuchenhoff H (2006) A short introduction to the SIMEX and MCSIMEX. R News 6(4):26–31
Google Scholar
Liang H, Wang S, Carroll RJ (2007) Partially linear models with missing response variables and error-prone covariates. Biometrika 94(1):185–198
MathSciNet MATH Google Scholar
Little RJ, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York
MATH Google Scholar
Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J R Stat Soc Ser B Stat Methodol 44(2):226–233
MathSciNet MATH Google Scholar
Miller A (2002) Subset selection in regression. Chapman & Hall CRC Press, Boca Raton
MATH Google Scholar
Nakamura T (1990) Corrected score function for errors-in-variables models: methodology and application to generalized linear models. Biometrika 77(1):127–137
MathSciNet MATH Google Scholar
Nicoletti C, Peracchi F, Foliano F (2011) Estimating income poverty in the presence of missing data and measurement error. J Bus Econ Stat 29(1):61–72
MathSciNet MATH Google Scholar
Nielsen SF (2000) The stochastic EM algorithm: estimation and asymptotic results. Bernoulli 6(3):457–489
MathSciNet MATH Google Scholar
Novick SJ, Stefanski LA (2002) Corrected score estimation via complex variable simulation extrapolation. J Am Stat Assoc 97(458):472–481
MathSciNet MATH Google Scholar
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
MATH Google Scholar
Schafer DW (1987) Covariate measurement error in generalized linear models. Biometrika 72:385–391
MathSciNet MATH Google Scholar
Schafer JL (1999) Multiple imputation: a primer. Stat Methods Med Res 8(1):3–15
MathSciNet Google Scholar
Seaman SR, White IR (2011) Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res 22(3):278–295
MathSciNet Google Scholar
Shen CW, Chen YH (2016) Model selection for marginal regression analysis of longitudinal data with missing observations and covariate measurement error. Biostatistics 16(4):740–753
MathSciNet Google Scholar
Song X, Wang CY (2019) GMM nonparametric correction methods for logistic regression with error-contaminated covariates and partially observed instrumental variables. Scand J Stat 46(3):898–919
MathSciNet MATH Google Scholar
Stefanski LA (1989) Unbiased estimation of a nonlinear function a normal mean with application to measurement error for models. Commun Stat Theory Methods 18(12):4335–4358
MathSciNet MATH Google Scholar
Stefanski LA, Carroll RJ (1985) Covariate measurement error in logistic regression. Ann Stat 13(4):1335–1351
MathSciNet MATH Google Scholar
Stefanski LA, Cook JR (1995) Simulation-extrapolation: the measurement error jackknife. J Am Stat Assoc 90(432):1247–1256
MathSciNet MATH Google Scholar
Stoklosa J, Daly C, Foster SD, Ashcroft MB, Warton DI (2015) A climate of uncertainty: accounting for error in climate variables for species distribution models. Methods Ecol Evol 6(4):412–423
Google Scholar
Stoklosa J, Lee SH, Hwang WH (2019) Closed-population capture-recapture models with measurement error and missing observations in covariates. Stat Sin 29(2):589–610
MathSciNet MATH Google Scholar
Van Buuren S (2012) Flexible imputation of missing data. Chapman & Hall CRC Press, Boca Raton
MATH Google Scholar
van Buuren S, Groothuis-Oudshoorn K (2011) MICE: Multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67
Google Scholar
von Hippel PT (2013) The bias and efficiency of incomplete-data estimators in small univariate normal samples. Social Methods Res 42(4):531–558
MathSciNet Google Scholar
Wang N, Robins JM (1998) Large-sample theory for parametric multiple imputation procedures. Biometrika 85:935–948
MathSciNet MATH Google Scholar
Wang CY, Huang Y, Chao EC, Jeffcoat MK (2008) Expected estimating equations for missing data, measurement error, and misclassification, with application to longitudinal nonignorable missing data. Biometrics 64(1):85–95
MathSciNet MATH Google Scholar
Wang M, Sun X, Lu T (2015) Bayesian structured variable selection in linear regression models. Comput Stat 30(1):205–229
MathSciNet MATH Google Scholar
Yi GY, Ma Y, Carroll RJ (2012) A functional generalized method of moments approach for longitudinal studies with missing responses and covariate measurement error. Biometrika 99(1):151–165
MathSciNet MATH Google Scholar

Download references

Acknowledgements

This work was supported by an Australian Government Research Training Program Scholarship. We thank the associate editor and the three anonymous reviewers for their constructive comments.

Author information

Authors and Affiliations

School of Mathematics and Statistics, The University of New South Wales, Sydney, NSW, 2052, Australia
Firouzeh Noghrehchi, Jakub Stoklosa & Spiridon Penev
Ecology Research Centre, The University of New South Wales, Sydney, NSW, 2052, Australia
Jakub Stoklosa

Authors

Firouzeh Noghrehchi
View author publications
You can also search for this author in PubMed Google Scholar
Jakub Stoklosa
View author publications
You can also search for this author in PubMed Google Scholar
Spiridon Penev
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Firouzeh Noghrehchi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary material

The following plots, Figs. 8 and 9, show the results of the simulation study in Sect. 3.1 in terms of the magnitute of bias when $n=50$ and $n=1000$. Figures 8, 9, 10 and 11 show the accuracy of comparing methods in parameter estimation when $X \sim N(0,1)$ and $X \sim U(-1,1)$, respectively, for each n. Figures 8 and 9 are consistent with Fig. 1 (see Sect. 3.1) when $n=100$ and $X \sim N(0,1)$. They show that MI-CS and MI-SIMEX outperform the other comparing methods in terms of accuracy as the missingness and $\sigma ^2_u$ are increased. Also, Figs. 8 and 9 are comparable with Fig. 2 in Sect. 3.1 when $n=100$ and $X \sim U(-1,1)$ in a similar way.

The following plots, Figs. 12 and 13, show the results of the simulation study in Sect. 3.1 in terms of MSE when $n=50$ and $n=1000$. Figures 12 and 13 show the predictive performances of the comparing methods when $n=50$ and $n=1000$, respectively. Comparing Fig. 3 (see Sect. 3.1) with these two plots suggests that MI-CS had the best predictive performance as the missingness and $\sigma ^2_u$ increased. MI-SIMEX showed relatively high MSE for $n=50$, however, the predictive performance of MI-CS improved substantially and equaled MI-SIMEX as n increased. The predictive performances of the comparing methods were similar when $X \sim N(0,1)$ and $X \sim U(-1,1)$.

The following plot, Fig. 14, shows the results of the Poisson log-linear simulation study in Sect. 3.2 when $X \sim U(-1,1)$. Figure 14 is consistent with Fig. 4 (see Sect. 3.2) for when $X \sim N(0,1)$. It shows that MI-CS and MI-SIMEX outperform the other comparing methods in terms of accuracy as the missingness and $\sigma ^2_u$ increased.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Noghrehchi, F., Stoklosa, J. & Penev, S. Multiple imputation and functional methods in the presence of measurement error and missingness in explanatory variables. Comput Stat 35, 1291–1317 (2020). https://doi.org/10.1007/s00180-020-00976-2

Download citation

Received: 13 May 2018
Accepted: 10 March 2020
Published: 18 March 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s00180-020-00976-2