Abstract
Missing values challenge data analysis because many supervised and unsupervised learning methods cannot be applied directly to incomplete data. Matrix completion based on low-rank assumptions are very powerful solution for dealing with missing values. However, existing methods do not consider the case of informative missing values which are widely encountered in practice. This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data. Our first contribution is to suggest a model-based estimation strategy by modelling the missing mechanism distribution. An EM algorithm is then implemented, involving a Fast Iterative Soft-Thresholding Algorithm (FISTA). Our second contribution is to suggest a computationally efficient surrogate estimation by implicitly taking into account the joint distribution of the data and the missing mechanism: the data matrix is concatenated with the mask coding for the missing values; a low-rank structure for exponential family is assumed on this new matrix, in order to encode links between variables and missing mechanisms. The methodology that has the great advantage of handling different missing value mechanisms is robust to model specification errors. The performances of our methods are assessed on the real data collected from a trauma registry (TraumaBase\(^{\textregistered }\)) containing clinical information about over twenty thousand severely traumatized patients in France. The aim is then to predict if the doctors should administrate tranexomic acid to patients with traumatic brain injury, that would limit excessive bleeding.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
Once the stopping criterion is met, \(T=10\) extra iterations are performed to assure the convergence stability.
The tranexomic acid is an antifibrinolyic agent which reduces blood loss.
References
Audigier, V., Husson, F., Josse, J.: A principal component method to impute missing values for mixed data. Adv. Data Anal. Classif. 10(1), 5–26 (2016)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
Cai, T., Zhou, W.-X.: A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res. 14(1), 3619–3647 (2013)
Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
Candes, E.J., Plan, Y.: Matrix completion with noise. Proc. IEEE 98(6), 925–936 (2010)
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717 (2009)
Candès, E.J., Sing-Long, C.A., Trzasko, J.D.: Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Trans. Signal Process. 61(19), 4643–4657 (2013)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. J. R. Statist. Soc. Ser. B (Methodol.) 39(1), 1–22 (1977)
Gavish, M., Donoho, D.L.: Optimal shrinkage of singular values. IEEE Trans. Inf. Theory 63(4), 2137–2152 (2017)
Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-gaussian bayesian state estimation. IEE Proc. F-radar Signal Process. 140, 107–113 (1993)
Harel, O., Schafer, J.L.: Partial and latent ignorability in missing-data problems. Biometrika 96(1), 37–50 (2009)
Hastie, T., Mazumder, R.: softImpute: Matrix Completion via Iterative Soft-Thresholded SVD (2015). https://CRAN.R-project.org/package=softImpute. R package version 1.4
Hastie, T., Mazumder, R., Lee, J.D., Zadeh, R.: Matrix completion and low-rank SVD via fast alternating least squares. J. Mach. Learn. Res. 16(1), 3367–3402 (2015)
Hay, S.I., Abajobir, A.A., Abate, K.H., Abbafati, C., Abbas, K.M., Abd-Allah, F., Abdulkader, R.S., Abdulle, A.M., Abebo, T.A., Abera, S.F., et al.: Global, regional, and national disability-adjusted life-years (dalys) for 333 diseases and injuries and healthy life expectancy (hale) for 195 countries and territories, 1990–2016: a systematic analysis for the global burden of disease study 2016. Lancet 390(10100), 1260–1344 (2017)
Heckman, J.J.: Sample selection bias as a specification error. Econometrica 42, 679–94 (1974)
Ibrahim, J.G., Lipsitz, S.R., Chen, M.-H.: Missing covariates in generalized linear models when the missing data mechanism is non-ignorable. J. R. Statist. Soc. Ser. B (Statist. Methodol.) 61(1), 173–190 (1999)
Josse, J., Prost, N., Scornet, E., Varoquaux, G.: On the consistency of supervised learning with missing values. arXiv:1902.06931 (2019)
Josse, J., Sardy, S., Wager, S.: denoiser: A package for low rank matrix estimation. J. Stat. Softw. (2016)
Josse, J., Husson, F.: Selecting the number of components in principal component analysis using cross-validation approximations. Comput. Statist. Data Anal. 56(6), 1869–1879 (2012)
Kallus, N., Mao, X., Udell, M.: Causal inference with noisy and missing covariates via matrix factorization. arXiv:1806.00811 (2018)
Kishore Kumar, N., Schneider, J.: Literature survey on low rank approximation of matrices. Linear Multilinear Algebra 65(11), 2212–2244 (2017)
Leek, J.T., Storey, J.D.: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3(9), e161 (2007)
Little, R.J.A.: Pattern-mixture models for multivariate incomplete data. J. Am. Statist. Assoc. 88(421), 125–134 (1993)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data, vol. 333. Wiley, New York (2014)
Liu, L.T., Dobriban, E., Singer, A., et al.: \( e \) PCA: high dimensional exponential family PCA. Ann. Appl. Statist. 12(4), 2121–2150 (2018)
Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11(Aug), 2287–2322 (2010)
Miao, W., Tchetgen, E.T.: Identification and inference with nonignorable missing covariate data. Statistica Sinica (2017)
Mohan, K., Pearl, J.: Graphical models for processing missing data. arXiv:1801.03583 (2018)
Mohan, K., Thoemmes, F., Pearl, J.: Estimation with incomplete data: the linear case. In: IJCAI, pp. 5082–5088 (2018)
Morikawa, K., Kim, J.K., Kano, Y.: Semiparametric maximum likelihood estimation with data missing not at random. Canadian J. Statist. 45(4), 393–409 (2017)
Murray, J.S.: Multiple imputation: a review of practical and theoretical findings. arXiv:1801.04058 (2018)
Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., Reich, D.: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38(8), 904–909 (2006)
Robin, G., Klopp, O., Josse, J., Moulines, É., Tibshirani, R.: Main effects and interactions in mixed and incomplete data frames. arXiv:1806.09734 (2018)
Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys, vol. 81. Wiley, New York (2004)
Seaman, S., Galati, J., Jackson, D., Carlin, J.: What is meant by missing at random? Statist. Sci. 28(2), 257–268 (2013)
Tang, F., Ishwaran, H.: Random forest missing data algorithms. Statist. Anal. Data Min. ASA Data Sci. J. 10(6), 363–377 (2017)
Twala, B.E.T.H., Jones, M.C., Hand, D.J.: Good methods for coping with missing data in decision trees. Pattern Recognit. Lett. 29(7), 950–956 (2008)
Udell, M., Townsend, A.: Nice latent variable models have log-rank. aXiv:1705.07474 (2017)
Udell, M., Horn, C., Zadeh, R., Boyd, S., et al.: Generalized low rank models. Found. Trends® Mach. Learn. 9(1), 1–118 (2016)
Verbanck, M., Josse, J., Husson, F.: Regularised pca to denoise and visualise data. Statist. Comput. 25(2), 471–486 (2015)
Yang, C., Akimoto, Y., Kim, D.W., Udell, M.: Oboe: Collaborative filtering for automl initialization. arXiv:1808.03233 (2018)
Acknowledgements
The authors are thankful for fruitful discussion with François Husson, Wei Jiang, Imke Mayer and Geneviève Robin.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Julie Josse was supported by the Data Analytics and Models for Insurance chair and Aude Sportisse was funded by a PEPS project (Projet Exploratoire Premier Soutien) of AMIES (the mathematical agency in interaction with companies and society).
Appendices
The FISTA algorithm
We first present the proximal gradient method. The following optimisation problem is considered:
where \(h_1\) is a convex function, \(h_2\) a differentiable and convex function and L the gradient Lipschitz of \(h_2\).
The main trick of the FISTA algorithm is to add a momentum term to the proximal gradient method, in order to yield smoother trajectory towards the convergence point. In addition, the proximal operator is performed on a specific linear combination of the previous two iterates, rather than on the previous iterate only.
In our specific model, to solve (2), \(h_1(\varTheta )=\Vert \varTheta \Vert _\star \) and \(h_2(\varTheta )=\Vert \varOmega \odot (\varTheta -Y) \Vert ^2_F\). Let us precise that:
Therefore,
and L is equal to 1.
SoftImpute
We start by describing softImpute.
The proximal operator of the nuclear norm of a matrix X consists in a soft-thresholding of its singular values: we perform the SVD of X and we obtain the matrices U, V and D. Then
\(D_{\lambda }\) is the diagonal matrix such that for all i,
, where the \((\sigma _{ii})\)’s are the singular values of X.
1.1 Equivalence between softImpute and the proximal gradient method
By using the same functions \(h_1\) and \(h_2\) as above, one has:
so that softImpute and the proximal gradient method are similar.
1.2 Equivalence between the EM algorithm and iterative SVD in the MAR case
We prove here that in the MAR setting, softImpute is similar to the EM algorithm. Let us recall that in the MAR setting the model of the joint distribution is not needed but only the one of the data distribution, so that the E-step is written as follows:
by using (3) and the independance of \(Y_{ij}, \, \forall i,j\)). Then, by splitting into the observed and the missing elements,
Therefore,
which implies \(Q(\varTheta |\hat{\varTheta }^{(t)})\propto \Vert \varOmega \odot Y + (1-\varOmega ) \odot \hat{\varTheta }^{(t)}-\varTheta \Vert ^2_F\)
The M-step is then written as follows:
The proximal gradient method is applied with
Therefore, the EM algorithm in the MAR case is the same one as softImpute.
The EM algorithm in the MNAR case
For the sake of clarity, we present below the EM algorithm in the MNAR and low dimension case.
We already have given details for the stopping criterium.
We clarify the maximization step given by (8) and (9).
with:
For all \(j \in \{1,\ldots ,p\}\) such that \(\exists i \in \{1,\ldots ,n\}, \, \varOmega _{ij}=0\), estimating the coefficients \(\phi _{1j}\) and \(\phi _{2j}\) remains to fit a generalized linear model with the binomial link function for the matrix \(J^{(j)}\):
1.1 SIR
In the Monte Carlo approximation, the distribution of interest is \(\left[ y_{ij}|\varOmega _{ij};\hat{\phi }_j^{(t)},\hat{\varTheta }_{ij}^{(t)}\right] \). By using the Bayes rules:
Denoting the Gaussian density function of mean \(\varTheta ^{(t)}_{ij}\) and variance \(\sigma ^2\) by \(\varphi _{\varTheta ^{(t)}_{ij},\sigma ^2}\), if \(\sigma >(2\pi )^{-1/2}\), the following condition holds:
For M large, the SIR algorithm to simulate
is described as follows.
Details on the variables in TraumaBase\(^{\textregistered }\)
A description of the variables which are used in Sect. 5 is given. The indications given in parentheses ph (pre-hospital) and h (hospital) mean that the measures have been taken before the arrival at the hospital and at the hospital.
-
SBP.ph, DBP.ph, HR.ph: systolic and diastolic arterial pressure and heart rate during pre-hospital phase. (ph)
-
HemoCue.init: prehospital capillary hemoglobin concentration. (ph)
-
SpO2.min: peripheral oxygen saturation, measured by pulse oxymetry, to estimate oxygen content in the blood. (ph)
-
Cristalloid.volume: total amount of prehospital administered cristalloid fluid resuscitation (volume expansion). (ph)
-
Shock.index.ph: ratio of heart rate and systolic arterial pressure during pre-hospital phase. (ph)
-
Delta.shock.index: Difference of shock index between arrival at the hospital and arrival on the scene. (h)
-
Delta.hemoCue: Difference of hemoglobin level between arrival at the hospital and arrival on the scene. (h)
Rights and permissions
About this article
Cite this article
Sportisse, A., Boyer, C. & Josse, J. Imputation and low-rank estimation with Missing Not At Random data. Stat Comput 30, 1629–1643 (2020). https://doi.org/10.1007/s11222-020-09963-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11222-020-09963-5