Imputation and low-rank estimation with Missing Not At Random data

1303 Accesses
27 Citations
6 Altmetric
Explore all metrics

Abstract

Missing values challenge data analysis because many supervised and unsupervised learning methods cannot be applied directly to incomplete data. Matrix completion based on low-rank assumptions are very powerful solution for dealing with missing values. However, existing methods do not consider the case of informative missing values which are widely encountered in practice. This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data. Our first contribution is to suggest a model-based estimation strategy by modelling the missing mechanism distribution. An EM algorithm is then implemented, involving a Fast Iterative Soft-Thresholding Algorithm (FISTA). Our second contribution is to suggest a computationally efficient surrogate estimation by implicitly taking into account the joint distribution of the data and the missing mechanism: the data matrix is concatenated with the mask coding for the missing values; a low-rank structure for exponential family is assumed on this new matrix, in order to encode links between variables and missing mechanisms. The methodology that has the great advantage of handling different missing value mechanisms is robust to model specification errors. The performances of our methods are assessed on the real data collected from a trauma registry (TraumaBase$^{\textregistered }$) containing clinical information about over twenty thousand severely traumatized patients in France. The aim is then to predict if the doctors should administrate tranexomic acid to patients with traumatic brain injury, that would limit excessive bleeding.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Model-based clustering with missing not at random data

Article 18 June 2024

A sparse linear regression model for incomplete datasets

Article 04 December 2019

Block tensor train decomposition for missing data estimation

Article 06 September 2018

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Notes

Once the stopping criterion is met, $T=10$ extra iterations are performed to assure the convergence stability.
http://www.traumabase.eu/.
The tranexomic acid is an antifibrinolyic agent which reduces blood loss.

References

Audigier, V., Husson, F., Josse, J.: A principal component method to impute missing values for mixed data. Adv. Data Anal. Classif. 10(1), 5–26 (2016)
Article MathSciNet MATH Google Scholar
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)
Article MathSciNet MATH Google Scholar
Cai, T., Zhou, W.-X.: A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res. 14(1), 3619–3647 (2013)
MathSciNet MATH Google Scholar
Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
Article MathSciNet MATH Google Scholar
Candes, E.J., Plan, Y.: Matrix completion with noise. Proc. IEEE 98(6), 925–936 (2010)
Article Google Scholar
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717 (2009)
Article MathSciNet MATH Google Scholar
Candès, E.J., Sing-Long, C.A., Trzasko, J.D.: Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Trans. Signal Process. 61(19), 4643–4657 (2013)
Article MathSciNet MATH Google Scholar
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. J. R. Statist. Soc. Ser. B (Methodol.) 39(1), 1–22 (1977)
MathSciNet MATH Google Scholar
Gavish, M., Donoho, D.L.: Optimal shrinkage of singular values. IEEE Trans. Inf. Theory 63(4), 2137–2152 (2017)
Article MathSciNet MATH Google Scholar
Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-gaussian bayesian state estimation. IEE Proc. F-radar Signal Process. 140, 107–113 (1993)
Article Google Scholar
Harel, O., Schafer, J.L.: Partial and latent ignorability in missing-data problems. Biometrika 96(1), 37–50 (2009)
Article MathSciNet MATH Google Scholar
Hastie, T., Mazumder, R.: softImpute: Matrix Completion via Iterative Soft-Thresholded SVD (2015). https://CRAN.R-project.org/package=softImpute. R package version 1.4
Hastie, T., Mazumder, R., Lee, J.D., Zadeh, R.: Matrix completion and low-rank SVD via fast alternating least squares. J. Mach. Learn. Res. 16(1), 3367–3402 (2015)
MathSciNet MATH Google Scholar
Hay, S.I., Abajobir, A.A., Abate, K.H., Abbafati, C., Abbas, K.M., Abd-Allah, F., Abdulkader, R.S., Abdulle, A.M., Abebo, T.A., Abera, S.F., et al.: Global, regional, and national disability-adjusted life-years (dalys) for 333 diseases and injuries and healthy life expectancy (hale) for 195 countries and territories, 1990–2016: a systematic analysis for the global burden of disease study 2016. Lancet 390(10100), 1260–1344 (2017)
Article Google Scholar
Heckman, J.J.: Sample selection bias as a specification error. Econometrica 42, 679–94 (1974)
Article Google Scholar
Ibrahim, J.G., Lipsitz, S.R., Chen, M.-H.: Missing covariates in generalized linear models when the missing data mechanism is non-ignorable. J. R. Statist. Soc. Ser. B (Statist. Methodol.) 61(1), 173–190 (1999)
Article MathSciNet MATH Google Scholar
Josse, J., Prost, N., Scornet, E., Varoquaux, G.: On the consistency of supervised learning with missing values. arXiv:1902.06931 (2019)
Josse, J., Sardy, S., Wager, S.: denoiser: A package for low rank matrix estimation. J. Stat. Softw. (2016)
Josse, J., Husson, F.: Selecting the number of components in principal component analysis using cross-validation approximations. Comput. Statist. Data Anal. 56(6), 1869–1879 (2012)
Article MathSciNet MATH Google Scholar
Kallus, N., Mao, X., Udell, M.: Causal inference with noisy and missing covariates via matrix factorization. arXiv:1806.00811 (2018)
Kishore Kumar, N., Schneider, J.: Literature survey on low rank approximation of matrices. Linear Multilinear Algebra 65(11), 2212–2244 (2017)
Article MathSciNet MATH Google Scholar
Leek, J.T., Storey, J.D.: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3(9), e161 (2007)
Article Google Scholar
Little, R.J.A.: Pattern-mixture models for multivariate incomplete data. J. Am. Statist. Assoc. 88(421), 125–134 (1993)
MATH Google Scholar
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data, vol. 333. Wiley, New York (2014)
MATH Google Scholar
Liu, L.T., Dobriban, E., Singer, A., et al.: $ e $ PCA: high dimensional exponential family PCA. Ann. Appl. Statist. 12(4), 2121–2150 (2018)
Article MathSciNet MATH Google Scholar
Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11(Aug), 2287–2322 (2010)
MathSciNet MATH Google Scholar
Miao, W., Tchetgen, E.T.: Identification and inference with nonignorable missing covariate data. Statistica Sinica (2017)
Mohan, K., Pearl, J.: Graphical models for processing missing data. arXiv:1801.03583 (2018)
Mohan, K., Thoemmes, F., Pearl, J.: Estimation with incomplete data: the linear case. In: IJCAI, pp. 5082–5088 (2018)
Morikawa, K., Kim, J.K., Kano, Y.: Semiparametric maximum likelihood estimation with data missing not at random. Canadian J. Statist. 45(4), 393–409 (2017)
Article MathSciNet MATH Google Scholar
Murray, J.S.: Multiple imputation: a review of practical and theoretical findings. arXiv:1801.04058 (2018)
Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., Reich, D.: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38(8), 904–909 (2006)
Article Google Scholar
Robin, G., Klopp, O., Josse, J., Moulines, É., Tibshirani, R.: Main effects and interactions in mixed and incomplete data frames. arXiv:1806.09734 (2018)
Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
Article MathSciNet MATH Google Scholar
Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys, vol. 81. Wiley, New York (2004)
MATH Google Scholar
Seaman, S., Galati, J., Jackson, D., Carlin, J.: What is meant by missing at random? Statist. Sci. 28(2), 257–268 (2013)
Article MathSciNet MATH Google Scholar
Tang, F., Ishwaran, H.: Random forest missing data algorithms. Statist. Anal. Data Min. ASA Data Sci. J. 10(6), 363–377 (2017)
Article MathSciNet Google Scholar
Twala, B.E.T.H., Jones, M.C., Hand, D.J.: Good methods for coping with missing data in decision trees. Pattern Recognit. Lett. 29(7), 950–956 (2008)
Article Google Scholar
Udell, M., Townsend, A.: Nice latent variable models have log-rank. aXiv:1705.07474 (2017)
Udell, M., Horn, C., Zadeh, R., Boyd, S., et al.: Generalized low rank models. Found. Trends® Mach. Learn. 9(1), 1–118 (2016)
Article MATH Google Scholar
Verbanck, M., Josse, J., Husson, F.: Regularised pca to denoise and visualise data. Statist. Comput. 25(2), 471–486 (2015)
Article MathSciNet MATH Google Scholar
Yang, C., Akimoto, Y., Kim, D.W., Udell, M.: Oboe: Collaborative filtering for automl initialization. arXiv:1808.03233 (2018)

Download references

Acknowledgements

The authors are thankful for fruitful discussion with François Husson, Wei Jiang, Imke Mayer and Geneviève Robin.

Author information

Authors and Affiliations

Laboratoire de Probabilités Statistique et Modélisation, Sorbonne Université, Paris, France
Aude Sportisse & Claire Boyer
Centre de Mathématiques Appliquées, Ecole Polytechnique, Palaiseau, France
Aude Sportisse & Julie Josse
Département de Mathématiques et applications, Ecole Normale Supérieure, Paris, France
Claire Boyer
XPOP, INRIA Saclay, Palaiseau, France
Julie Josse

Authors

Aude Sportisse
View author publications
You can also search for this author in PubMed Google Scholar
Claire Boyer
View author publications
You can also search for this author in PubMed Google Scholar
Julie Josse
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Aude Sportisse.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Julie Josse was supported by the Data Analytics and Models for Insurance chair and Aude Sportisse was funded by a PEPS project (Projet Exploratoire Premier Soutien) of AMIES (the mathematical agency in interaction with companies and society).

Appendices

The FISTA algorithm

We first present the proximal gradient method. The following optimisation problem is considered:

$$\begin{aligned} \hat{\varTheta } \in \text {argmin}_\varTheta \, h_1(\varTheta ) + h_2(\varTheta ), \end{aligned}$$

where $h_1$ is a convex function, $h_2$ a differentiable and convex function and L the gradient Lipschitz of $h_2$.

The main trick of the FISTA algorithm is to add a momentum term to the proximal gradient method, in order to yield smoother trajectory towards the convergence point. In addition, the proximal operator is performed on a specific linear combination of the previous two iterates, rather than on the previous iterate only.

In our specific model, to solve (2), $h_1(\varTheta )=\Vert \varTheta \Vert _\star $ and $h_2(\varTheta )=\Vert \varOmega \odot (\varTheta -Y) \Vert ^2_F$. Let us precise that:

$$\begin{aligned} \frac{\partial h_2(\varTheta )}{\partial \varTheta _{ij}}=\nabla _{\varTheta _{ij}} h_2(\varTheta )=\varOmega _{ij}\left( \varTheta _{ij} - Y_{ij} \right) . \end{aligned}$$

Therefore,

$$\begin{aligned} \nabla h_2(\varTheta )=\varOmega \odot (\varTheta -Y) \end{aligned}$$

and L is equal to 1.

SoftImpute

We start by describing softImpute.

The proximal operator of the nuclear norm of a matrix X consists in a soft-thresholding of its singular values: we perform the SVD of X and we obtain the matrices U, V and D. Then

$$\begin{aligned} \text {prox}_{\lambda \Vert \cdot \Vert _{\star }}(X)=UD_{\lambda } V. \end{aligned}$$

$D_{\lambda }$ is the diagonal matrix such that for all i,

$$\begin{aligned} D_{\lambda ,ii}=\max ((\sigma _i-\lambda ),0) \end{aligned}$$

, where the $(\sigma _{ii})$’s are the singular values of X.

1.1 Equivalence between softImpute and the proximal gradient method

By using the same functions $h_1$ and $h_2$ as above, one has:

$$\begin{aligned} \hat{\varTheta }^{(t+1)}&=\text {prox}_{\lambda (1/L)h_1}(\hat{\varTheta }^{(t)}-(1/L) \nabla h_2(\hat{\varTheta }^{(t)})) \\&=\text {prox}_{\lambda \Vert \cdot \Vert _\star }(\hat{\varTheta }^{(t)}-\varOmega \odot (\hat{\varTheta }^{(t)} -Y)) \\&=\text {prox}_{\lambda \Vert .\Vert _\star }(\varOmega \odot Y + (1-\varOmega ) \odot \hat{\varTheta }^{(t)}), \end{aligned}$$

so that softImpute and the proximal gradient method are similar.

1.2 Equivalence between the EM algorithm and iterative SVD in the MAR case

We prove here that in the MAR setting, softImpute is similar to the EM algorithm. Let us recall that in the MAR setting the model of the joint distribution is not needed but only the one of the data distribution, so that the E-step is written as follows:

$$\begin{aligned} Q(\varTheta |\hat{\varTheta }^{(t)})&=\mathbb {E}_{Y_\text {mis}}\left[ \log (p(\varTheta ;y))|Y_\text {obs};\varTheta =\hat{\varTheta }^{(t)}\right] \\&\propto -\sum _{i=1}^{n} \sum _{j=1}^{p} \mathbb {E} \left[ \left( \frac{y_{ij}-\varTheta _{ij}}{\sigma }\right) ^2 |\hat{\varTheta _{ij}}^{(t)}\right] , \end{aligned}$$

by using (3) and the independance of $Y_{ij}, \, \forall i,j$). Then, by splitting into the observed and the missing elements,

$$\begin{aligned}&Q(\varTheta |\hat{\varTheta }^{(t)})\propto -\sum _{i=1}^{n} \sum _{j, \, \varOmega _{ij}=0} \mathbb {E}\left[ \left( \frac{y_{ij} -\varTheta _{ij}}{\sigma }\right) ^2 |\hat{\varTheta _{ij}}^{(t)} \right] \\&\quad -\sum _{i=1}^{n}\sum _{j, \, \varOmega _{ij}=1} \left( \frac{y_{ij}-\varTheta _{ij}}{\sigma }\right) ^2 \end{aligned}$$

Therefore,

$$\begin{aligned}&Q(\varTheta |\hat{\varTheta }^{(t)})\propto -\sum _{i=1}^{n} \sum _{j, \, \varOmega _{ij}=0}\mathbb {E}[y_{ij} ^2 -2\varTheta _{ij}y_{ij} +\varTheta _{ij}^2| \hat{\varTheta _{ij}}^{(t)} ]^2 \\&\quad -\sum _{i=1}^{n}\sum _{j, \, \varOmega _{ij}=1} \left( \frac{y_{ij}-\varTheta _{ij}}{\sigma }\right) ^2\\&\quad Q(\varTheta |\hat{\varTheta }^{(t)})\propto -\sum _{i=1}^{n}\\&\quad \sum _{j, \, \varOmega _{ij}=0} \left( \sigma ^2 + \hat{\varTheta _{ij}}^{(t)2}-2\hat{\varTheta _{ij}}^{(t)}\varTheta _{ij}+\varTheta _{ij}^2\right) \\&\quad -\sum _{i=1}^{n}\sum _{j, \, \varOmega _{ij}=1} \left( \frac{y_{ij}-\varTheta _{ij}}{\sigma }\right) ^2 \end{aligned}$$

which implies $Q(\varTheta |\hat{\varTheta }^{(t)})\propto \Vert \varOmega \odot Y + (1-\varOmega ) \odot \hat{\varTheta }^{(t)}-\varTheta \Vert ^2_F$

The M-step is then written as follows:

$$\begin{aligned} \hat{\varTheta }^{(t+1)} \in \text {argmin}_{\varTheta } \Vert \varOmega \odot Y + (1-\varOmega ) \odot \hat{\varTheta }^{(t)}-\varTheta \Vert ^2_F + \lambda \Vert \varTheta \Vert _\star \end{aligned}$$

The proximal gradient method is applied with

$$\begin{aligned} h_1(\varTheta )=\lambda \Vert \varTheta \Vert _\star \text { and } h_2(\varTheta )=\Vert \varOmega \odot Y + (1-\varOmega ) \odot \hat{\varTheta }^{(t)}-\varTheta \Vert ^2_F. \end{aligned}$$

Therefore, the EM algorithm in the MAR case is the same one as softImpute.

The EM algorithm in the MNAR case

For the sake of clarity, we present below the EM algorithm in the MNAR and low dimension case.

We already have given details for the stopping criterium.

We clarify the maximization step given by (8) and (9).

$$\begin{aligned}&\hat{\varTheta } \in \underset{\varTheta }{\text {argmin}} \sum _{i,j} \left( \frac{1}{N_s}\sum _{k=1}^{N_s} \frac{1}{2\sigma ^2}(v_{ij}^{(k)}-\varTheta _{ij})^2 \right) + \lambda \Vert \varTheta \Vert _\star \nonumber \\&\quad \in \underset{\varTheta }{\text {argmin}}\sum _{i,j}\left( \frac{1}{N_s\sigma ^2}\sum _{k=1}^{N_s}-v_{ij}^{(k)}\varTheta _{ij}+\frac{1}{2}\varTheta _{ij}^2 \right) +\lambda \Vert \varTheta \Vert _\star \nonumber \\&\quad \in \underset{\varTheta }{\text {argmin}} \Vert V-\varTheta \Vert _F^2 + \lambda \Vert \varTheta \Vert _\star , \text { where:} \nonumber \\&V=\begin{pmatrix} \frac{1}{N_s}\sum _{k=1}^{Ns}v_{11}^{(k)} \ldots \frac{1}{N_s}\sum _{k=1}^{Ns}v_{1p}^{(k)} \\ \vdots \ddots \vdots \\ \frac{1}{N_s}\sum _{k=1}^{Ns}v_{n1}^{(k)} \ldots \frac{1}{N_s}\sum _{k=1}^{Ns}v_{np}^{(k)} \end{pmatrix} \nonumber \\&\hat{\phi }^{(t+1)} \in \underset{\phi }{\text {argmin}} \sum _{i,j}\frac{1}{N_s} \sum _{k=1}^{N_s} (1-\varOmega _{ij})C_3 - \varOmega _{ij}C_4 \nonumber \\&\quad \in \underset{\phi }{\text {argmin}}\sum _{i,j} \frac{1}{N_s} \sum _{k=1}^{N_s} C_3 + \varOmega _{ij} \phi _{1j} (v_{ij}^{k}-\phi _{2j}) \end{aligned}$$

(17)

with:

$$\begin{aligned} C_3&=\log (1+e^{-\phi _{1j}(v_{ij}^{k}-\phi _{2j})}) \\ C_4&=\log (1-(1+e^{-\phi _{1j}(v_{ij}^{k}-\phi _{2j})})^{-1}) \end{aligned}$$

For all $j \in \{1,\ldots ,p\}$ such that $\exists i \in \{1,\ldots ,n\}, \, \varOmega _{ij}=0$, estimating the coefficients $\phi _{1j}$ and $\phi _{2j}$ remains to fit a generalized linear model with the binomial link function for the matrix $J^{(j)}$:

$$\begin{aligned} J^{(j)}=\begin{pmatrix} \varOmega _{1j} &{} v_{1j}^{(1)} \\ \vdots &{} \vdots \\ \varOmega _{nj} &{} v_{nj}^{(1)} \\ \vdots &{} \vdots \\ \varOmega _{1j} &{} v_{1j}^{(N_s)} \\ \vdots &{} \vdots \\ \varOmega _{nj} &{} v_{nj}^{(N_s)} \\ \end{pmatrix} \end{aligned}$$

(18)

1.1 SIR

In the Monte Carlo approximation, the distribution of interest is $\left[ y_{ij}|\varOmega _{ij};\hat{\phi }_j^{(t)},\hat{\varTheta }_{ij}^{(t)}\right] $. By using the Bayes rules:

$$\begin{aligned}&p \left( y_{ij}|\varOmega _{ij};\hat{\phi }_j^{(t)},\hat{\varTheta }_{ij}^{(t)} \right) =: f(y_{ij}) \\&\quad \propto p\left( y_{ij};\hat{\varTheta }_{ij}^{(t)}\right) p\left( \varOmega _{ij}|y_{ij};\hat{\phi }_j^{(t)}\right) =: g(y_{ij}) \end{aligned}$$

Denoting the Gaussian density function of mean $\varTheta ^{(t)}_{ij}$ and variance $\sigma ^2$ by $\varphi _{\varTheta ^{(t)}_{ij},\sigma ^2}$, if $\sigma >(2\pi )^{-1/2}$, the following condition holds:

$$\begin{aligned}f(y_{ij})=cg(y_{ij})\le \varphi _{\varTheta ^{(t)}_{ij},\sigma ^2}(x).\end{aligned}$$

For M large, the SIR algorithm to simulate

$$\begin{aligned} z \sim \left[ y_{ij}|\varOmega _{ij};\hat{\phi }_j^{(t)},\hat{\varTheta }_{ij}^{(t)}\right] \end{aligned}$$

is described as follows.

Details on the variables in TraumaBase$^{\textregistered }$

A description of the variables which are used in Sect. 5 is given. The indications given in parentheses ph (pre-hospital) and h (hospital) mean that the measures have been taken before the arrival at the hospital and at the hospital.

SBP.ph, DBP.ph, HR.ph: systolic and diastolic arterial pressure and heart rate during pre-hospital phase. (ph)
HemoCue.init: prehospital capillary hemoglobin concentration. (ph)
SpO2.min: peripheral oxygen saturation, measured by pulse oxymetry, to estimate oxygen content in the blood. (ph)
Cristalloid.volume: total amount of prehospital administered cristalloid fluid resuscitation (volume expansion). (ph)
Shock.index.ph: ratio of heart rate and systolic arterial pressure during pre-hospital phase. (ph)
Delta.shock.index: Difference of shock index between arrival at the hospital and arrival on the scene. (h)
Delta.hemoCue: Difference of hemoglobin level between arrival at the hospital and arrival on the scene. (h)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Sportisse, A., Boyer, C. & Josse, J. Imputation and low-rank estimation with Missing Not At Random data. Stat Comput 30, 1629–1643 (2020). https://doi.org/10.1007/s11222-020-09963-5

Download citation

Received: 27 December 2018
Accepted: 07 July 2020
Published: 16 July 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s11222-020-09963-5