Imputation and low-rank estimation with Missing Not At Random data


Abstract

Missing values challenge data analysis because many supervised and unsupervised learning methods cannot be applied directly to incomplete data. Matrix completion based on low-rank assumptions is a very powerful solution for dealing with missing values. However, existing methods do not consider the case of informative missing values, which are widely encountered in practice. This paper proposes matrix completion methods to recover Missing Not At Random (MNAR) data. Our first contribution is to suggest a model-based estimation strategy by modelling the missing mechanism distribution. An EM algorithm is then implemented, involving a Fast Iterative Soft-Thresholding Algorithm (FISTA). Our second contribution is to suggest a computationally efficient surrogate estimation by implicitly taking into account the joint distribution of the data and the missing mechanism: the data matrix is concatenated with the mask coding for the missing values; a low-rank structure for exponential family is assumed on this new matrix, in order to encode links between variables and missing mechanisms. This methodology, which has the great advantage of handling different missing-value mechanisms, is robust to model specification errors. The performance of our methods is assessed on real data collected from a trauma registry (TraumaBase\(^{\textregistered }\)) containing clinical information about over twenty thousand severely traumatized patients in France. The aim is then to predict whether doctors should administer tranexamic acid to patients with traumatic brain injury, in order to limit excessive bleeding.




Notes

  1. Once the stopping criterion is met, \(T=10\) extra iterations are performed to ensure the stability of the convergence.

  2. http://www.traumabase.eu/.

  3. Tranexamic acid is an antifibrinolytic agent which reduces blood loss.

References

  • Audigier, V., Husson, F., Josse, J.: A principal component method to impute missing values for mixed data. Adv. Data Anal. Classif. 10(1), 5–26 (2016)

  • Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)

  • Cai, T., Zhou, W.-X.: A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res. 14(1), 3619–3647 (2013)

  • Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)

  • Candès, E.J., Plan, Y.: Matrix completion with noise. Proc. IEEE 98(6), 925–936 (2010)

  • Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717 (2009)

  • Candès, E.J., Sing-Long, C.A., Trzasko, J.D.: Unbiased risk estimates for singular value thresholding and spectral estimators. IEEE Trans. Signal Process. 61(19), 4643–4657 (2013)

  • Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. Ser. B (Methodol.) 39(1), 1–22 (1977)

  • Gavish, M., Donoho, D.L.: Optimal shrinkage of singular values. IEEE Trans. Inf. Theory 63(4), 2137–2152 (2017)

  • Gordon, N.J., Salmond, D.J., Smith, A.F.M.: Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F Radar Signal Process. 140, 107–113 (1993)

  • Harel, O., Schafer, J.L.: Partial and latent ignorability in missing-data problems. Biometrika 96(1), 37–50 (2009)

  • Hastie, T., Mazumder, R.: softImpute: Matrix Completion via Iterative Soft-Thresholded SVD (2015). https://CRAN.R-project.org/package=softImpute. R package version 1.4

  • Hastie, T., Mazumder, R., Lee, J.D., Zadeh, R.: Matrix completion and low-rank SVD via fast alternating least squares. J. Mach. Learn. Res. 16(1), 3367–3402 (2015)

  • Hay, S.I., Abajobir, A.A., Abate, K.H., Abbafati, C., Abbas, K.M., Abd-Allah, F., Abdulkader, R.S., Abdulle, A.M., Abebo, T.A., Abera, S.F., et al.: Global, regional, and national disability-adjusted life-years (DALYs) for 333 diseases and injuries and healthy life expectancy (HALE) for 195 countries and territories, 1990–2016: a systematic analysis for the Global Burden of Disease Study 2016. Lancet 390(10100), 1260–1344 (2017)

  • Heckman, J.J.: Sample selection bias as a specification error. Econometrica 42, 679–94 (1974)

  • Ibrahim, J.G., Lipsitz, S.R., Chen, M.-H.: Missing covariates in generalized linear models when the missing data mechanism is non-ignorable. J. R. Statist. Soc. Ser. B (Statist. Methodol.) 61(1), 173–190 (1999)

  • Josse, J., Prost, N., Scornet, E., Varoquaux, G.: On the consistency of supervised learning with missing values. arXiv:1902.06931 (2019)

  • Josse, J., Sardy, S., Wager, S.: denoiseR: a package for low rank matrix estimation. J. Stat. Softw. (2016)

  • Josse, J., Husson, F.: Selecting the number of components in principal component analysis using cross-validation approximations. Comput. Statist. Data Anal. 56(6), 1869–1879 (2012)

  • Kallus, N., Mao, X., Udell, M.: Causal inference with noisy and missing covariates via matrix factorization. arXiv:1806.00811 (2018)

  • Kishore Kumar, N., Schneider, J.: Literature survey on low rank approximation of matrices. Linear Multilinear Algebra 65(11), 2212–2244 (2017)

  • Leek, J.T., Storey, J.D.: Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3(9), e161 (2007)

  • Little, R.J.A.: Pattern-mixture models for multivariate incomplete data. J. Am. Statist. Assoc. 88(421), 125–134 (1993)

  • Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data, vol. 333. Wiley, New York (2014)

  • Liu, L.T., Dobriban, E., Singer, A., et al.: \(e\)PCA: high dimensional exponential family PCA. Ann. Appl. Statist. 12(4), 2121–2150 (2018)

  • Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)

  • Miao, W., Tchetgen, E.T.: Identification and inference with nonignorable missing covariate data. Statistica Sinica (2017)

  • Mohan, K., Pearl, J.: Graphical models for processing missing data. arXiv:1801.03583 (2018)

  • Mohan, K., Thoemmes, F., Pearl, J.: Estimation with incomplete data: the linear case. In: IJCAI, pp. 5082–5088 (2018)

  • Morikawa, K., Kim, J.K., Kano, Y.: Semiparametric maximum likelihood estimation with data missing not at random. Canadian J. Statist. 45(4), 393–409 (2017)

  • Murray, J.S.: Multiple imputation: a review of practical and theoretical findings. arXiv:1801.04058 (2018)

  • Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., Reich, D.: Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38(8), 904–909 (2006)

  • Robin, G., Klopp, O., Josse, J., Moulines, É., Tibshirani, R.: Main effects and interactions in mixed and incomplete data frames. arXiv:1806.09734 (2018)

  • Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)

  • Rubin, D.B.: Multiple Imputation for Nonresponse in Surveys, vol. 81. Wiley, New York (2004)

  • Seaman, S., Galati, J., Jackson, D., Carlin, J.: What is meant by missing at random? Statist. Sci. 28(2), 257–268 (2013)

  • Tang, F., Ishwaran, H.: Random forest missing data algorithms. Statist. Anal. Data Min. ASA Data Sci. J. 10(6), 363–377 (2017)

  • Twala, B.E.T.H., Jones, M.C., Hand, D.J.: Good methods for coping with missing data in decision trees. Pattern Recognit. Lett. 29(7), 950–956 (2008)

  • Udell, M., Townsend, A.: Nice latent variable models have log-rank. arXiv:1705.07474 (2017)

  • Udell, M., Horn, C., Zadeh, R., Boyd, S., et al.: Generalized low rank models. Found. Trends Mach. Learn. 9(1), 1–118 (2016)

  • Verbanck, M., Josse, J., Husson, F.: Regularised PCA to denoise and visualise data. Statist. Comput. 25(2), 471–486 (2015)

  • Yang, C., Akimoto, Y., Kim, D.W., Udell, M.: OBOE: collaborative filtering for AutoML initialization. arXiv:1808.03233 (2018)


Acknowledgements

The authors are thankful for fruitful discussion with François Husson, Wei Jiang, Imke Mayer and Geneviève Robin.

Author information


Corresponding author

Correspondence to Aude Sportisse.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Julie Josse was supported by the Data Analytics and Models for Insurance chair and Aude Sportisse was funded by a PEPS project (Projet Exploratoire Premier Soutien) of AMIES (the mathematical agency in interaction with companies and society).

Appendices

The FISTA algorithm

We first present the proximal gradient method. The following optimisation problem is considered:

$$\begin{aligned} \hat{\varTheta } \in \text {argmin}_\varTheta \, h_1(\varTheta ) + h_2(\varTheta ), \end{aligned}$$

where \(h_1\) is a convex function, \(h_2\) is a convex and differentiable function, and \(L\) is the Lipschitz constant of the gradient of \(h_2\).

[Algorithm figure a: the proximal gradient method.]

The main trick of the FISTA algorithm is to add a momentum term to the proximal gradient method, in order to yield a smoother trajectory towards the convergence point. In addition, the proximal operator is applied to a specific linear combination of the previous two iterates, rather than to the previous iterate only.

[Algorithm figure b: FISTA.]

In our specific model, to solve (2), we take \(h_1(\varTheta )=\Vert \varTheta \Vert _\star \) and \(h_2(\varTheta )=\frac{1}{2}\Vert \varOmega \odot (\varTheta -Y) \Vert ^2_F\). More precisely,

$$\begin{aligned} \frac{\partial h_2(\varTheta )}{\partial \varTheta _{ij}}=\nabla _{\varTheta _{ij}} h_2(\varTheta )=\varOmega _{ij}\left( \varTheta _{ij} - Y_{ij} \right) . \end{aligned}$$

Therefore,

$$\begin{aligned} \nabla h_2(\varTheta )=\varOmega \odot (\varTheta -Y) \end{aligned}$$

and L is equal to 1.
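For illustration, here is a minimal NumPy sketch of FISTA applied to the objective \(\frac{1}{2}\Vert \varOmega \odot (\varTheta -Y)\Vert _F^2+\lambda \Vert \varTheta \Vert _\star \); the proximal operator of the nuclear norm used below is the singular value soft-thresholding detailed in the next section. The function names and the toy data are our own, not the authors' implementation.

```python
import numpy as np

def svt(X, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def fista_completion(Y, Omega, lam, n_iter=200):
    """FISTA for  min_Theta  0.5 * ||Omega * (Theta - Y)||_F^2 + lam * ||Theta||_*.
    The gradient of the smooth part is Omega * (Theta - Y), with Lipschitz constant L = 1."""
    Theta = np.zeros_like(Y)
    Z = Theta.copy()     # extrapolated point (linear combination of the two previous iterates)
    t = 1.0              # momentum parameter
    for _ in range(n_iter):
        Theta_new = svt(Z - Omega * (Z - Y), lam)                   # proximal gradient step at Z
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        Z = Theta_new + ((t - 1.0) / t_new) * (Theta_new - Theta)   # momentum step
        Theta, t = Theta_new, t_new
    return Theta

# Toy example: a rank-2 signal with roughly 30% of the entries missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 10))
Omega = (rng.uniform(size=X.shape) > 0.3).astype(float)   # 1 = observed, 0 = missing
Y = np.where(Omega == 1, X, 0.0)                           # values at missing entries are irrelevant
Theta_hat = fista_completion(Y, Omega, lam=1.0)
```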

SoftImpute

We start by describing softImpute.

[Algorithm figure c: softImpute.]

The proximal operator of the nuclear norm of a matrix X consists in soft-thresholding its singular values: we compute the SVD \(X=UDV^\top \) and obtain

$$\begin{aligned} \text {prox}_{\lambda \Vert \cdot \Vert _{\star }}(X)=UD_{\lambda } V^\top , \end{aligned}$$

where \(D_{\lambda }\) is the diagonal matrix such that, for all i,

$$\begin{aligned} D_{\lambda ,ii}=\max (\sigma _i-\lambda ,0), \end{aligned}$$

the \(\sigma _i\)'s being the singular values of X.

1.1 Equivalence between softImpute and the proximal gradient method

By using the same functions \(h_1\) and \(h_2\) as above, one has:

$$\begin{aligned} \hat{\varTheta }^{(t+1)}&=\text {prox}_{\lambda (1/L)h_1}(\hat{\varTheta }^{(t)}-(1/L) \nabla h_2(\hat{\varTheta }^{(t)})) \\&=\text {prox}_{\lambda \Vert \cdot \Vert _\star }(\hat{\varTheta }^{(t)}-\varOmega \odot (\hat{\varTheta }^{(t)} -Y)) \\&=\text {prox}_{\lambda \Vert .\Vert _\star }(\varOmega \odot Y + (1-\varOmega ) \odot \hat{\varTheta }^{(t)}), \end{aligned}$$

so that softImpute and the proximal gradient method coincide.
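To make this identity concrete, here is a minimal NumPy sketch of the softImpute iteration \(\hat{\varTheta }^{(t+1)}=\text {prox}_{\lambda \Vert \cdot \Vert _\star }\big (\varOmega \odot Y + (1-\varOmega ) \odot \hat{\varTheta }^{(t)}\big )\), i.e. the proximal gradient step with \(1/L=1\). The function names are ours; this is an illustration, not the softImpute R package itself.

```python
import numpy as np

def svt(X, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_impute(Y, Omega, lam, n_iter=200, tol=1e-6):
    """Iteratively impute the missing entries with the current low-rank estimate,
    then soft-threshold the singular values of the completed matrix."""
    Theta = np.zeros_like(Y)
    for _ in range(n_iter):
        completed = Omega * Y + (1.0 - Omega) * Theta   # fill missing entries with Theta
        Theta_new = svt(completed, lam)                  # singular value soft-thresholding
        if np.linalg.norm(Theta_new - Theta, "fro") < tol * max(1.0, np.linalg.norm(Theta, "fro")):
            return Theta_new
        Theta = Theta_new
    return Theta
```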

1.2 Equivalence between the EM algorithm and iterative SVD in the MAR case

We show here that, in the MAR setting, softImpute coincides with the EM algorithm. Recall that in the MAR setting only the model of the data distribution is needed, not that of the joint distribution of the data and the missing mechanism, so that the E-step is written as follows:

$$\begin{aligned} Q(\varTheta |\hat{\varTheta }^{(t)})&=\mathbb {E}_{Y_\text {mis}}\left[ \log (p(\varTheta ;y))|Y_\text {obs};\varTheta =\hat{\varTheta }^{(t)}\right] \\&\propto -\sum _{i=1}^{n} \sum _{j=1}^{p} \mathbb {E} \left[ \left( \frac{y_{ij}-\varTheta _{ij}}{\sigma }\right) ^2 |\hat{\varTheta _{ij}}^{(t)}\right] , \end{aligned}$$

by using (3) and the independence of the \(Y_{ij}\)'s. Then, by splitting into the observed and the missing elements,

$$\begin{aligned}&Q(\varTheta |\hat{\varTheta }^{(t)})\propto -\sum _{i=1}^{n} \sum _{j, \, \varOmega _{ij}=0} \mathbb {E}\left[ \left( \frac{y_{ij} -\varTheta _{ij}}{\sigma }\right) ^2 |\hat{\varTheta _{ij}}^{(t)} \right] \\&\quad -\sum _{i=1}^{n}\sum _{j, \, \varOmega _{ij}=1} \left( \frac{y_{ij}-\varTheta _{ij}}{\sigma }\right) ^2 \end{aligned}$$

Therefore,

$$\begin{aligned} Q(\varTheta |\hat{\varTheta }^{(t)})&\propto -\sum _{i=1}^{n} \sum _{j, \, \varOmega _{ij}=0} \frac{1}{\sigma ^2}\,\mathbb {E}\left[ y_{ij}^2 -2\varTheta _{ij}y_{ij} +\varTheta _{ij}^2 \,\Big |\, \hat{\varTheta }_{ij}^{(t)} \right] \\&\quad -\sum _{i=1}^{n}\sum _{j, \, \varOmega _{ij}=1} \left( \frac{y_{ij}-\varTheta _{ij}}{\sigma }\right) ^2\\&\propto -\sum _{i=1}^{n} \sum _{j, \, \varOmega _{ij}=0} \frac{1}{\sigma ^2}\left( \sigma ^2 + \big (\hat{\varTheta }_{ij}^{(t)}\big )^2-2\hat{\varTheta }_{ij}^{(t)}\varTheta _{ij}+\varTheta _{ij}^2\right) \\&\quad -\sum _{i=1}^{n}\sum _{j, \, \varOmega _{ij}=1} \left( \frac{y_{ij}-\varTheta _{ij}}{\sigma }\right) ^2, \end{aligned}$$

which implies, up to an additive constant not depending on \(\varTheta \), that \(Q(\varTheta |\hat{\varTheta }^{(t)})\propto -\Vert \varOmega \odot Y + (1-\varOmega ) \odot \hat{\varTheta }^{(t)}-\varTheta \Vert ^2_F\).

The M-step is then written as follows:

$$\begin{aligned} \hat{\varTheta }^{(t+1)} \in \text {argmin}_{\varTheta } \Vert \varOmega \odot Y + (1-\varOmega ) \odot \hat{\varTheta }^{(t)}-\varTheta \Vert ^2_F + \lambda \Vert \varTheta \Vert _\star \end{aligned}$$

The proximal gradient method is applied with

$$\begin{aligned} h_1(\varTheta )=\lambda \Vert \varTheta \Vert _\star \text { and } h_2(\varTheta )=\Vert \varOmega \odot Y + (1-\varOmega ) \odot \hat{\varTheta }^{(t)}-\varTheta \Vert ^2_F. \end{aligned}$$

Therefore, the EM algorithm in the MAR case is the same as softImpute.

The EM algorithm in the MNAR case

For the sake of clarity, we present below the EM algorithm in the MNAR and low-dimensional case.

[Algorithm figure d: the EM algorithm for MNAR data.]

We have already given details on the stopping criterion.

We now clarify the maximization step given by (8) and (9).

$$\begin{aligned}&\hat{\varTheta } \in \underset{\varTheta }{\text {argmin}} \sum _{i,j} \left( \frac{1}{N_s}\sum _{k=1}^{N_s} \frac{1}{2\sigma ^2}(v_{ij}^{(k)}-\varTheta _{ij})^2 \right) + \lambda \Vert \varTheta \Vert _\star \nonumber \\&\quad \in \underset{\varTheta }{\text {argmin}}\sum _{i,j}\left( \frac{1}{N_s\sigma ^2}\sum _{k=1}^{N_s}-v_{ij}^{(k)}\varTheta _{ij}+\frac{1}{2}\varTheta _{ij}^2 \right) +\lambda \Vert \varTheta \Vert _\star \nonumber \\&\quad \in \underset{\varTheta }{\text {argmin}} \Vert V-\varTheta \Vert _F^2 + \lambda \Vert \varTheta \Vert _\star , \text { where } \nonumber \\&V=\begin{pmatrix} \frac{1}{N_s}\sum _{k=1}^{N_s}v_{11}^{(k)} &{} \ldots &{} \frac{1}{N_s}\sum _{k=1}^{N_s}v_{1p}^{(k)} \\ \vdots &{} \ddots &{} \vdots \\ \frac{1}{N_s}\sum _{k=1}^{N_s}v_{n1}^{(k)} &{} \ldots &{} \frac{1}{N_s}\sum _{k=1}^{N_s}v_{np}^{(k)} \end{pmatrix} \nonumber \\&\hat{\phi }^{(t+1)} \in \underset{\phi }{\text {argmin}} \sum _{i,j}\frac{1}{N_s} \sum _{k=1}^{N_s} (1-\varOmega _{ij})C_3 - \varOmega _{ij}C_4 \nonumber \\&\quad \in \underset{\phi }{\text {argmin}}\sum _{i,j} \frac{1}{N_s} \sum _{k=1}^{N_s} C_3 + \varOmega _{ij} \phi _{1j} (v_{ij}^{(k)}-\phi _{2j}) \end{aligned}$$
(17)

with:

$$\begin{aligned} C_3&=\log \left( 1+e^{-\phi _{1j}(v_{ij}^{(k)}-\phi _{2j})}\right) \\ C_4&=\log \left( 1-\left( 1+e^{-\phi _{1j}(v_{ij}^{(k)}-\phi _{2j})}\right) ^{-1}\right) \end{aligned}$$

For all \(j \in \{1,\ldots ,p\}\) such that \(\exists i \in \{1,\ldots ,n\}, \, \varOmega _{ij}=0\), estimating the coefficients \(\phi _{1j}\) and \(\phi _{2j}\) amounts to fitting a generalized linear model with the binomial link function to the matrix \(J^{(j)}\):

$$\begin{aligned} J^{(j)}=\begin{pmatrix} \varOmega _{1j} &{} v_{1j}^{(1)} \\ \vdots &{} \vdots \\ \varOmega _{nj} &{} v_{nj}^{(1)} \\ \vdots &{} \vdots \\ \varOmega _{1j} &{} v_{1j}^{(N_s)} \\ \vdots &{} \vdots \\ \varOmega _{nj} &{} v_{nj}^{(N_s)} \\ \end{pmatrix} \end{aligned}$$
(18)
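As an illustration, under the logistic self-masking mechanism implied by \(C_3\) and \(C_4\), namely \(\mathbb {P}(\varOmega _{ij}=0\,|\,y_{ij})=\big (1+e^{-\phi _{1j}(y_{ij}-\phi _{2j})}\big )^{-1}\), this fit can be sketched with a standard (essentially unpenalized) logistic regression. The helper name and the use of scikit-learn are our own choices, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_phi_j(Omega_j, V_j):
    """Estimate (phi_1j, phi_2j) for column j from the stacked matrix J^(j).
    Omega_j : (n,) observation indicators (1 = observed, 0 = missing).
    V_j     : (n, N_s) Monte Carlo draws v_ij^(k) of column j."""
    n, N_s = V_j.shape
    v = V_j.reshape(-1, 1, order="F")    # stack the N_s draws column by column: shape (n * N_s, 1)
    m = np.tile(1 - Omega_j, N_s)        # missingness indicator, repeated once per draw
    # Nearly unpenalized logistic regression of the missingness on v:
    #   logit P(missing) = b0 + b1 * v,  with  b1 = phi_1j  and  b0 = -phi_1j * phi_2j.
    # (Regressing Omega itself, as in J^(j), simply flips the signs of b0 and b1.)
    lr = LogisticRegression(C=1e6, solver="lbfgs").fit(v, m)
    phi1 = lr.coef_[0, 0]
    phi2 = -lr.intercept_[0] / phi1
    return phi1, phi2
```

The corresponding update of \(\varTheta \) in (17) then amounts to soft-thresholding the singular values of \(V\), as in the softImpute sketch above.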

1.1 SIR

In the Monte Carlo approximation, the distribution of interest is \(\left[ y_{ij}|\varOmega _{ij};\hat{\phi }_j^{(t)},\hat{\varTheta }_{ij}^{(t)}\right] \). By Bayes' rule,

$$\begin{aligned}&p \left( y_{ij}|\varOmega _{ij};\hat{\phi }_j^{(t)},\hat{\varTheta }_{ij}^{(t)} \right) =: f(y_{ij}) \\&\quad \propto p\left( y_{ij};\hat{\varTheta }_{ij}^{(t)}\right) p\left( \varOmega _{ij}|y_{ij};\hat{\phi }_j^{(t)}\right) =: g(y_{ij}) \end{aligned}$$

Denoting the Gaussian density function of mean \(\varTheta ^{(t)}_{ij}\) and variance \(\sigma ^2\) by \(\varphi _{\varTheta ^{(t)}_{ij},\sigma ^2}\), if \(\sigma >(2\pi )^{-1/2}\), the following condition holds:

$$\begin{aligned}f(y_{ij})=cg(y_{ij})\le \varphi _{\varTheta ^{(t)}_{ij},\sigma ^2}(y_{ij}).\end{aligned}$$

For M large, the SIR algorithm to simulate

$$\begin{aligned} z \sim \left[ y_{ij}|\varOmega _{ij};\hat{\phi }_j^{(t)},\hat{\varTheta }_{ij}^{(t)}\right] \end{aligned}$$

is described as follows.

[Algorithm figure e: the SIR algorithm.]
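A minimal sketch of this sampling importance resampling step, under the Gaussian data model and the logistic self-masking mechanism used above, is given below; all names are ours and the number of proposal draws \(M\) is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sir_draw(theta_ij, sigma, omega_ij, phi1_j, phi2_j, M=10_000, n_draws=1):
    """Sampling Importance Resampling for  y_ij | Omega_ij ; phi, Theta.
    Proposal: the data model N(theta_ij, sigma^2).
    Weights : the mechanism term p(Omega_ij | y ; phi), with
              P(Omega_ij = 0 | y) = sigmoid(phi1_j * (y - phi2_j))."""
    y = rng.normal(theta_ij, sigma, size=M)                    # 1. sample from the proposal
    p_missing = 1.0 / (1.0 + np.exp(-phi1_j * (y - phi2_j)))
    w = p_missing if omega_ij == 0 else 1.0 - p_missing        # 2. importance weights
    w = w / w.sum()
    return rng.choice(y, size=n_draws, replace=True, p=w)      # 3. resample proportionally to w

# e.g. one draw for a missing entry (omega_ij = 0):
z = sir_draw(theta_ij=0.5, sigma=1.0, omega_ij=0, phi1_j=2.0, phi2_j=1.0)
```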

Details on the variables in TraumaBase\(^{\textregistered }\)

A description of the variables used in Sect. 5 is given below. The indications in parentheses, ph (pre-hospital) and h (hospital), mean that the measurements were taken before arrival at the hospital or at the hospital, respectively.

  • SBP.ph, DBP.ph, HR.ph: systolic and diastolic arterial pressure and heart rate during pre-hospital phase. (ph)

  • HemoCue.init: prehospital capillary hemoglobin concentration. (ph)

  • SpO2.min: peripheral oxygen saturation, measured by pulse oximetry, to estimate oxygen content in the blood. (ph)

  • Cristalloid.volume: total amount of crystalloid fluid administered during the pre-hospital phase for resuscitation (volume expansion). (ph)

  • Shock.index.ph: ratio of heart rate and systolic arterial pressure during pre-hospital phase. (ph)

  • Delta.shock.index: Difference of shock index between arrival at the hospital and arrival on the scene. (h)

  • Delta.hemoCue: Difference of hemoglobin level between arrival at the hospital and arrival on the scene. (h)

About this article


Cite this article

Sportisse, A., Boyer, C. & Josse, J. Imputation and low-rank estimation with Missing Not At Random data. Stat Comput 30, 1629–1643 (2020). https://doi.org/10.1007/s11222-020-09963-5

