Abstract
Zero-inflated Poisson (ZIP) regression is widely used to model the effects of covariates on a count outcome with excess zeros. In some applications, the covariates in a ZIP regression model are only partially observed. Based on imputed data generated by the multiple imputation (MI) schemes of Wang and Chen (Ann Stat 37:490–517, 2009), two methods are proposed to estimate the parameters of a ZIP regression model with covariates missing at random. The first, based on the approach of Rubin (in: Proceedings of the survey research methods section of the American Statistical Association, 1978), takes the average of the estimates obtained from the individual imputed datasets as a unified estimate. The second, based on the approach of Fay (J Am Stat Assoc 91:490–498, 1996), averages the estimating scores over all imputed datasets and solves the resulting imputed estimating equation. The two proposed estimators are shown to be asymptotically equivalent to the semiparametric inverse probability weighting estimator. A modified formula is proposed to estimate the variances of the MI estimators. An extensive simulation study is conducted to investigate the performance of the estimation methods. The practicality of the methodology is illustrated with data from a motorcycle survey of traffic regulations.
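For readers unfamiliar with the ZIP model, the following minimal sketch shows the probability mass function and a regression log-likelihood. It assumes the common formulation in which the Poisson mean has a log link and the structural-zero probability has a logit link, each with a single covariate; this parameterization and the names `zip_pmf` and `zip_loglik` are illustrative assumptions, not taken from the article.

```python
import math

def zip_pmf(y, lam, p):
    """P(Y = y) for a ZIP variable: a structural zero with probability p,
    otherwise a Poisson(lam) count."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        return p + (1.0 - p) * poisson   # zeros come from both components
    return (1.0 - p) * poisson

def zip_loglik(ys, xs, beta, gamma):
    """Log-likelihood under the assumed links:
    log(lam_i) = beta[0] + beta[1] * x_i, logit(p_i) = gamma[0] + gamma[1] * x_i."""
    total = 0.0
    for y, x in zip(ys, xs):
        lam = math.exp(beta[0] + beta[1] * x)
        p = 1.0 / (1.0 + math.exp(-(gamma[0] + gamma[1] * x)))
        total += math.log(zip_pmf(y, lam, p))
    return total

if __name__ == "__main__":
    # The zero probability exceeds that of a plain Poisson with the same mean.
    print(zip_pmf(0, 2.0, 0.3), math.exp(-2.0))
```

The excess-zero feature is visible directly: for any `p > 0`, `zip_pmf(0, lam, p)` is larger than the Poisson zero probability `exp(-lam)`.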
References
Barry SC, Welsh AH (2002) Generalized additive modeling and zero-inflated count data. Ecol Model 157:179–188
Böhning D, Dietz E, Schlattmann P, Mendonça L, Kirchner U (1999) The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. J R Stat Soc Ser A 162:195–209
Cameron AC, Trivedi PK (2013) Regression analysis of count data, 2nd edn. Cambridge University Press, New York
Clayton D, Spiegelhalter D, Dunn G, Pickles A (1998) Analysis of longitudinal binary data from multiphase sampling (with discussion). J R Stat Soc Ser B 60:71–87
Chen XD, Fu YZ (2011) Model selection for zero-inflated regression with missing covariates. Comput Stat Data Anal 55:765–773
Cheung YB (2002) Zero-inflated models for regression analysis of count data, a study of growth and development. Stat Med 21:1461–1469
Creemers A, Aerts M, Hens N, Molenberghs G (2012) A nonparametric approach to weighted estimating equations for regression analysis with missing covariates. Comput Stat Data Anal 56:100–113
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Deng D, Paul SR (2000) Score tests for zero inflation in generalized linear models. Can J Stat 27:563–570
Deng D, Paul SR (2005) Score tests for zero-inflation and over-dispersion in generalized linear models. Stat Sin 15:257–276
Dietz K, Böhning D (1997) The use of two-component mixture models with one completely or partly known component. Comput Stat 12:219–234
Fay RE (1996) Alternative paradigms for the analysis of imputed survey data. J Am Stat Assoc 91:490–498
Hall DB, Shen J (2010) Robust estimation for zero-inflated Poisson regression. Scand J Stat 37:237–252
Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47:663–685
Hsieh SH, Lee SM, Shen PS (2009) Semiparametric analysis of randomized response data with missing covariates in logistic regression. Comput Stat Data Anal 53:2673–2692
Hsieh SH, Lee SM, Shen PS (2010) Logistic regression analysis of randomized response data with missing covariates. J Stat Plan Inference 140:927–940
Huang L, Zheng D, Zalkikar J, Tiwari R (2017) Zero-inflated Poisson model based likelihood ratio test for drug safety signal detection. Stat Methods Med Res 26:471–488
Jansakul N, Hinde JP (2002) Score tests for zero-inflated Poisson models. Comput Stat Data Anal 40:75–96
Johnson NL, Kemp AW, Kotz S (2005) Univariate discrete distributions, 3rd edn. Wiley, New York
Lambert D (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34:1–14
Lee SM, Gee MJ, Hsieh SH (2011) Semiparametric methods in the proportional odds model for ordinal response data with missing covariates. Biometrics 67:788–798
Lee JH, Han G, Fulp WJ, Giuliano AR (2012a) Analysis of overdispersed count data: application to the human papillomavirus infection in men (HIM) study. Epidemiol Infect 140:1087–1094
Lee SM, Li CS, Hsieh SH, Huang LH (2012b) Semiparametric estimation of logistic regression model with missing covariates and outcome. Metrika 75:621–653
Lee SM, Hwang WH, Tapsoba JD (2016) Estimation in closed capture-recapture models when covariates are missing at random. Biometrics 72:1294–1304
Li CS (2011) A lack-of-fit test for parametric zero-inflated Poisson models. J Stat Comput Simul 81:1081–1098
Li CS (2012) Score test for semiparametric zero-inflated Poisson model. Int J Stat Probab 1:1–7
Little RJA (1992) Regression with missing X’s: a review. J Am Stat Assoc 87:1227–1237
Liu H, Powers DA (2007) Growth curve models for zero-inflated count data: an application to smoking behavior. Struct Equ Model Multidiscip J 14:247–279
Lu SE, Lin Y, Shih WCJ (2004) Analyzing excessive no changes in clinical trials with clustered data. Biometrics 60:257–267
Lukusa TM, Lee SM, Li CS (2016) Semiparametric estimation of a zero-inflated Poisson regression model with missing covariates. Metrika 79:457–483
Lukusa TM, Lee SM, Li CS (2017) Review of zero-inflated models with missing data. Curr Res Biostat 7:1–12
Mullahy J (1986) Specification and testing of some modified count data models. J Econ 33:341–365
Pahel BT, Preisser JS, Stearns SC, Rozier RG (2011) Multiple imputation of dental caries data using a zero-inflated Poisson regression model. J Public Health Dent 71:71–78
Reilly M, Pepe MS (1995) A mean score method for missing and auxiliary covariates data in regression methods. Biometrika 82:299–314
Ridout M, Demetrio CGB, Hinde J (1998) Models for count data with many zeros. In: 19th international biometric conference, Cape Town, pp 179–192
Righi P, Falorsi S, Fasulo A (2014) Methods for variance estimation under random hot deck imputation in business surveys. Rivista Di Statistica Ufficiale N 1–2(2014):45–64
Robins JM, Wang N (2000) Inference for imputation estimators. Biometrika 87:113–124
Robins JM, Rotnitzky A, Zhao LP (1994) Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 89:846–866
Rubin DB (1976) Inference and missing data. Biometrika 63:581–592
Rubin DB (1978) Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. In: Proceedings of the survey research methods section of the American Statistical Association, vol 1. American Statistical Association, Boston, pp 20–28
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
Rubin DB (1996) Multiple imputation after 18+ years. J Am Stat Assoc 91:473–489
Rubin DB, Schenker N (1986) Multiple imputation for interval estimation from simple random samples with ignorable nonresponse. J Am Stat Assoc 81:366–374
Samani EB, Ganjali M, Amirian Y (2012) Zero-inflated power series joint model to analyze count data with missing responses. J Stat Theor Pract 6:334–343
Schafer JL, Graham JW (2002) Missing data: our view of the state of the art. Psychol Methods 7:147–177
Singh S (1963) A note on inflated Poisson distribution. J Indian Stat Assoc 1:140–144
Van den Broek J (1995) A score test for zero inflation in a Poisson distribution. Biometrics 51:738–743
Wang S, Wang CY (2001) A note on kernel assisted estimators in missing covariate regression. Stat Probab Lett 55:439–449
Wang D, Chen SX (2009) Empirical likelihood for estimating equations with missing values. Ann Stat 37:490–517
Wang CY, Wang S, Zhao LP, Ou ST (1997) Weighted semiparametric estimation in regression with missing covariate data. J Am Stat Assoc 92:512–525
Wang CY, Chen JC, Lee SM, Ou ST (2002) Joint conditional likelihood estimator in logistic regression with missing covariate data. Stat Sin 12:555–574
Xiang L, Lee AH, Yau KKW, McLachlan GJ (2007) A score test for overdispersion in zero-inflated Poisson mixed regression model. Stat Med 26:1608–1622
Yau KKW, Lee AH (2001) Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Stat Med 20:2907–2920
Zhao LP, Lipsitz S (1992) Designs and analysis of two-stage studies. Stat Med 11:769–782
Acknowledgements
The authors are very grateful for two referees’ helpful comments and suggestions that improved the presentation. This work was supported by the Ministry of Science and Technology of Taiwan (S.M. Lee).
Appendix
1.1 Proof of Theorem 1
It can be obtained from the empirical CDF \(\hat{F}(x|Y_i,\varvec{V}_i)\) in (10) that for \(i=1,\dots ,n\) and \(v=1,\dots ,M\),
By using the expression of \(U_v({\varvec{\theta }})\) in (11), the expression of \(E_{\hat{F}}(\tilde{S}_{iv}({\varvec{\theta }})|Y_i,\varvec{V}_i)\) in (18), and the fact that
\(v=1,\ldots ,M\), we have \(E_{\hat{F}}(U_v({\varvec{\theta }})|\mathcal {O})=n^{-1/2}\sum _{i=1}^n[\delta _i/\hat{\pi }(Y_i,\varvec{V}_i)]S_i({\varvec{\theta }}) =U_w({\varvec{\theta }},\hat{\varvec{\pi }})\), \(v=1,\ldots ,M\). Similarly, it can be shown that \(E_{\hat{F}}(\partial {U}_v({\varvec{\theta }})/\partial {\varvec{\theta }}|\mathcal {O}) =\partial {U}_w({\varvec{\theta }},\hat{\varvec{\pi }})/\partial {\varvec{\theta }}\) and, hence, \(E(\partial {U}_v({\varvec{\theta }})/\partial {\varvec{\theta }}) =E(\partial {U}_w({\varvec{\theta }},\hat{\varvec{\pi }})/\partial {\varvec{\theta }})\), \(v=1,\ldots ,M\).
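The identity between the conditional expectation of an imputed score and the inverse-probability-weighted score can be checked numerically. The sketch below is an illustration under stated assumptions, not the article's construction: \((Y_i,\varvec{V}_i)\) is taken to be discrete so that observations fall into cells, \(\hat{\pi }\) is taken to be the observed fraction within each cell, the conditional expectation of an imputed score is taken to be the mean of the observed scores in the same cell, every cell is assumed to contain at least one complete case, and the common factor \(n^{-1/2}\) is dropped. The score values are arbitrary numbers, and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def imputed_score_sum(cells, deltas, scores):
    """Sum of delta_i * S_i + (1 - delta_i) * (mean observed score in cell i)."""
    obs_sum = defaultdict(Fraction)   # Fraction() is exactly zero
    obs_cnt = defaultdict(int)
    for c, d, s in zip(cells, deltas, scores):
        if d:
            obs_sum[c] += s
            obs_cnt[c] += 1
    total = Fraction(0)
    for c, d, s in zip(cells, deltas, scores):
        total += s if d else obs_sum[c] / obs_cnt[c]
    return total

def ipw_score_sum(cells, deltas, scores):
    """Sum of delta_i * S_i / pi_hat(cell i), with pi_hat the observed fraction."""
    n_cell = defaultdict(int)
    n_obs = defaultdict(int)
    for c, d in zip(cells, deltas):
        n_cell[c] += 1
        n_obs[c] += d
    total = Fraction(0)
    for c, d, s in zip(cells, deltas, scores):
        if d:
            total += s / Fraction(n_obs[c], n_cell[c])
    return total

# Toy data: cell labels stand for (Y, V); scores are unknown (None) when delta = 0.
CELLS = [(0, "a"), (0, "a"), (0, "a"), (1, "a"), (1, "a"), (1, "b"), (1, "b"), (1, "b")]
DELTAS = [1, 1, 0, 1, 0, 1, 0, 0]
SCORES = [Fraction(1, 2), Fraction(-3, 4), None, Fraction(2), None,
          Fraction(5, 3), None, None]

if __name__ == "__main__":
    print(imputed_score_sum(CELLS, DELTAS, SCORES), ipw_score_sum(CELLS, DELTAS, SCORES))
```

Exact rational arithmetic is used so that the two sums can be compared with `==` rather than within a tolerance. Within a cell of size \(n_c\) with \(m_c\) complete cases, both sums reduce to \((n_c/m_c)\) times the sum of the observed scores, which is why they agree.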
Recall that \(S_i^*({\varvec{\theta }})=E(S_i({\varvec{\theta }})|Y_i,\varvec{V}_i)\), \(i=1,\dots ,n\). As given in (16), \(U_{m2}({\varvec{\theta }})\) can be expressed as follows:
Note that the second term of the expression of \(U_{m2}({\varvec{\theta }})\) in (20) can be reformulated as
where \(\tilde{S}_i({\varvec{\theta }})=\sum _{v=1}^{M}\tilde{S}_{iv}({\varvec{\theta }})/M\). The third term of the expression of \(U_{m2}({\varvec{\theta }})\) in (20) can be rewritten as follows:
Hence, from (21) and (22), \(U_{m2}({\varvec{\theta }})\) can be re-expressed as
Because the first term is the sum of independent and identically distributed random vectors, it can be shown by the multivariate central limit theorem that \(U_{m2}({\varvec{\theta }}){\mathop {\rightarrow }\limits ^{d}}\mathcal {N}(\varvec{0},M({\varvec{\theta }},\varvec{\pi }))\) as \(n,M\rightarrow \infty \), where \(M({\varvec{\theta }},\varvec{\pi })\) is given in (15). In addition, \(U_{m2}({\varvec{\theta }})\) in (23) can be expressed as
Because \(\hat{\varvec{\theta }}_{m2}^{(M)}\) is the solution of \(U_{m2}({\varvec{\theta }})=\varvec{0}\), it follows by a Taylor’s expansion of \(U_{m2}(\hat{\varvec{\theta }}_{m2}^{(M)})\) at \({\varvec{\theta }}\) and the expression of \(U_{m2}({\varvec{\theta }})\) in (23) that
where \(G({\varvec{\theta }},\varvec{\pi })=E[-\partial {U}_w({\varvec{\theta }},\varvec{\pi })/(\sqrt{n}\partial {\varvec{\theta }})]\). Therefore, it can be obtained that
This implies that \(\sqrt{n}(\hat{\varvec{\theta }}_{m2}^{(M)}-{\varvec{\theta }}){\mathop {\rightarrow }\limits ^{d}}\mathcal {N}(\varvec{0},\Delta _m({\varvec{\theta }}))\) as \(n,M\rightarrow \infty \), where \(\Delta _m({\varvec{\theta }})=G^{-1}({\varvec{\theta }},\varvec{\pi })M({\varvec{\theta }},\varvec{\pi })[G^{-1}({\varvec{\theta }},\varvec{\pi })]^T\).
Let \(\hat{\varvec{\theta }}_v\) be the solution to the estimating equations \(U_v({\varvec{\theta }})=\varvec{0}\). We have by a Taylor’s expansion of \(U_v(\hat{\varvec{\theta }}_v)\) at \({\varvec{\theta }}\) that
Hence, it follows that \(\sqrt{n}(\hat{\varvec{\theta }}_v-{\varvec{\theta }})=G^{-1}({\varvec{\theta }},\varvec{\pi })U_v({\varvec{\theta }})+o_p(1)\). Because \(\hat{\varvec{\theta }}_{m1}^{(M)}=\sum _{v=1}^{M}\hat{\varvec{\theta }}_v/M\), using the above result and the expressions for \(U_{m2}({\varvec{\theta }})\) in (13) and (23), we can have
and it follows readily that \(\sqrt{n}(\hat{\varvec{\theta }}_{m1}^{(M)}-{\varvec{\theta }}){\mathop {\rightarrow }\limits ^{d}}\mathcal {N}(\varvec{0},\Delta _{m}({\varvec{\theta }}))\) as \(n,M\rightarrow \infty \).
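The two estimators treated in this proof differ only in the order of averaging and solving. The sketch below contrasts them on a toy problem under loud assumptions: a one-parameter Poisson score with \(\log \lambda _i=\theta x_i\) stands in for the ZIP score, and missing covariates are filled by a simple hot-deck draw from complete cases with the same outcome, which is a simplification rather than the imputation scheme of Wang and Chen (2009). All function names and the data are hypothetical.

```python
import math
import random

def score(theta, ys, xs):
    """Toy Poisson score for log(lambda_i) = theta * x_i (stand-in for the ZIP score)."""
    return sum(x * (y - math.exp(theta * x)) for y, x in zip(ys, xs))

def solve(fn, lo=-5.0, hi=5.0, iters=200):
    """Bisection for a decreasing function with fn(lo) > 0 > fn(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fn(mid) > 0:
            lo = mid          # root lies to the right of mid
        else:
            hi = mid          # root lies at or to the left of mid
    return 0.5 * (lo + hi)

def impute_once(ys, xs, rng):
    """Replace each missing x (None) by a random observed x with the same y."""
    donors = {}
    for y, x in zip(ys, xs):
        if x is not None:
            donors.setdefault(y, []).append(x)
    return [x if x is not None else rng.choice(donors[y]) for y, x in zip(ys, xs)]

def mi_estimators(ys, xs, M, seed=0):
    rng = random.Random(seed)
    imputed = [impute_once(ys, xs, rng) for _ in range(M)]
    # Type 1: solve each imputed score equation, then average the M roots.
    theta_m1 = sum(solve(lambda t, xv=xv: score(t, ys, xv)) for xv in imputed) / M
    # Type 2: average the M scores first, then solve the single averaged equation.
    def avg_score(t):
        return sum(score(t, ys, xv) for xv in imputed) / M
    theta_m2 = solve(avg_score)
    return theta_m1, theta_m2, avg_score

# Toy data; None marks a missing covariate value.
YS = [0, 0, 1, 2, 3, 1, 0, 2, 2, 1]
XS = [0.5, None, 1.0, 1.5, 2.0, None, 0.5, 1.0, None, 1.0]

if __name__ == "__main__":
    t1, t2, _ = mi_estimators(YS, XS, M=20)
    print(t1, t2)
```

Bisection is used because this toy score is strictly decreasing in \(\theta \) when the covariate values are positive, so a bracketed root is unique; a Newton-type solver would be the more usual choice for a multi-parameter model.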
1.2 Proof of Theorem 2
Because \(\hat{\varvec{\theta }}_{m2}^{(M)}\) is the solution of \(U_{m2}({\varvec{\theta }})=M^{-1}\sum _{v=1}^{M}U_v({\varvec{\theta }})=\varvec{0}\), a Taylor’s expansion of \(U_{m2}(\hat{\varvec{\theta }}_{m2}^{(M)})\) at \({\varvec{\theta }}\) can lead to \(\varvec{0}=U_{m2}(\hat{\varvec{\theta }}_{m2}^{(M)}) =M^{-1}\sum _{v=1}^MU_v({\varvec{\theta }})-G({\varvec{\theta }},\varvec{\pi })\sqrt{n}(\hat{\varvec{\theta }}_{m2}^{(M)}-{\varvec{\theta }})+o_p(1)\), which implies that
Thus, it follows from (25) and (26) that \(\sqrt{n}(\hat{\varvec{\theta }}_{m2}^{(M)}-\hat{\varvec{\theta }}_{m1}^{(M)})=o_p(1)\); that is, \(\sqrt{n}(\hat{\varvec{\theta }}_{m2}^{(M)}-\hat{\varvec{\theta }}_{m1}^{(M)})\) converges in probability to \(\varvec{0}\) as \(n,M\rightarrow \infty \), so the two MI estimators are asymptotically equivalent.
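The subtraction behind this conclusion can be written out as follows. This is a sketch assembled from the inline statements above, not a reproduction of the article's numbered displays; it assumes that \(G({\varvec{\theta }},\varvec{\pi })\) is invertible and that the \(o_p(1)\) remainders in the expansions for the individual \(\hat{\varvec{\theta }}_v\) remain negligible after averaging over \(v\).

\[
\begin{aligned}
\sqrt{n}(\hat{\varvec{\theta }}_{m1}^{(M)}-{\varvec{\theta }})
&=\frac{1}{M}\sum _{v=1}^{M}\sqrt{n}(\hat{\varvec{\theta }}_v-{\varvec{\theta }})
=G^{-1}({\varvec{\theta }},\varvec{\pi })\,\frac{1}{M}\sum _{v=1}^{M}U_v({\varvec{\theta }})+o_p(1),\\
\sqrt{n}(\hat{\varvec{\theta }}_{m2}^{(M)}-{\varvec{\theta }})
&=G^{-1}({\varvec{\theta }},\varvec{\pi })\,\frac{1}{M}\sum _{v=1}^{M}U_v({\varvec{\theta }})+o_p(1),\\
\sqrt{n}(\hat{\varvec{\theta }}_{m2}^{(M)}-\hat{\varvec{\theta }}_{m1}^{(M)})
&=\sqrt{n}(\hat{\varvec{\theta }}_{m2}^{(M)}-{\varvec{\theta }})-\sqrt{n}(\hat{\varvec{\theta }}_{m1}^{(M)}-{\varvec{\theta }})=o_p(1).
\end{aligned}
\]

The leading terms of the two expansions are identical, so only the remainders survive the subtraction.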
Next, we show that the semiparametric IPW estimator and the second MI-type estimator are asymptotically equivalent. \(U_{m2}({\varvec{\theta }})\) can be expressed as
Using the fact given in (19), \(n^{-1/2}\sum _{i=1}^n[\delta _iS_i({\varvec{\theta }})+(1-\delta _i)E_{\hat{F}}(\tilde{S}_{i1}({\varvec{\theta }})|Y_i,\varvec{V}_i)]\) can be expressed as
Hence it can be obtained that
Recall \(\mathcal {B}({\varvec{\theta }};Y_i,\varvec{V}_i) =M^{-1/2}\sum _{v=1}^M\left[ \tilde{S}_{iv}({\varvec{\theta }})-E_{\hat{F}}(\tilde{S}_{i1}({\varvec{\theta }})|Y_i,\varvec{V}_i)\right] \), \(i=1,\ldots ,n\). Because
and \((1-\delta _i)\mathcal {B}({\varvec{\theta }};Y_i,\varvec{V}_i)\) are independent and identically distributed random vectors with mean \(\varvec{0}\) and covariance matrix \(E\{(1-\delta _1)[\mathcal {B}({\varvec{\theta }};Y_1,\varvec{V}_1)]^{\otimes 2}\}\), it follows from the multivariate central limit theorem that \(n^{-1/2}\sum _{i=1}^{n}(1-\delta _i)\mathcal {B}({\varvec{\theta }};Y_i,\varvec{V}_i)=O_p(1)\) and, hence,
Let \(\hat{\varvec{\theta }}_{m2}^{(M)}\) be the solution of \(U_{m2}({\varvec{\theta }})=\varvec{0}\). Because by a Taylor’s expansion of \(U_{m2}(\hat{\varvec{\theta }}_{m2}^{(M)})\) at \({\varvec{\theta }}\) and \(U_w(\hat{\varvec{\theta }}_{ws},\hat{\varvec{\pi }})\) at \({\varvec{\theta }}\), respectively, we can have that
and
it can be shown from (27) that
Therefore, it follows that \(\sqrt{n}(\hat{\varvec{\theta }}_{m2}^{(M)}-\hat{\varvec{\theta }}_{ws})\) converges in probability to \(\varvec{0}\) as \(n,M\rightarrow \infty \).
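The chain of steps in this second part of the proof can be summarized as follows. This is a sketch consistent with the inline statements above rather than a reproduction of the article's displays; it assumes that \(U_v({\varvec{\theta }})=n^{-1/2}\sum _{i=1}^n[\delta _iS_i({\varvec{\theta }})+(1-\delta _i)\tilde{S}_{iv}({\varvec{\theta }})]\), a form inferred from the text, and that \(G({\varvec{\theta }},\varvec{\pi })\) is invertible.

\[
\begin{aligned}
U_{m2}({\varvec{\theta }})
&=n^{-1/2}\sum _{i=1}^n\left[ \delta _iS_i({\varvec{\theta }})+(1-\delta _i)E_{\hat{F}}(\tilde{S}_{i1}({\varvec{\theta }})|Y_i,\varvec{V}_i)\right]
+M^{-1/2}\,n^{-1/2}\sum _{i=1}^n(1-\delta _i)\mathcal {B}({\varvec{\theta }};Y_i,\varvec{V}_i)\\
&=U_w({\varvec{\theta }},\hat{\varvec{\pi }})+O_p(M^{-1/2}),\\
\sqrt{n}(\hat{\varvec{\theta }}_{m2}^{(M)}-\hat{\varvec{\theta }}_{ws})
&=G^{-1}({\varvec{\theta }},\varvec{\pi })\left[ U_{m2}({\varvec{\theta }})-U_w({\varvec{\theta }},\hat{\varvec{\pi }})\right] +o_p(1)=o_p(1)
\quad \text{as } n,M\rightarrow \infty .
\end{aligned}
\]

The first equality uses \(M^{-1}\sum _{v=1}^M\tilde{S}_{iv}({\varvec{\theta }})-E_{\hat{F}}(\tilde{S}_{i1}({\varvec{\theta }})|Y_i,\varvec{V}_i)=M^{-1/2}\mathcal {B}({\varvec{\theta }};Y_i,\varvec{V}_i)\), so the imputation noise vanishes at rate \(M^{-1/2}\) and the number of imputations must grow for the equivalence to hold.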
Cite this article
Lee, SM., Lukusa, T.M. & Li, CS. Estimation of a zero-inflated Poisson regression model with missing covariates via nonparametric multiple imputation methods. Comput Stat 35, 725–754 (2020). https://doi.org/10.1007/s00180-019-00930-x
DOI: https://doi.org/10.1007/s00180-019-00930-x