Abstract
Data fusion involves the integration of multiple related datasets. The statistical file-matching problem is a canonical data fusion problem in multivariate analysis, where the objective is to characterise the joint distribution of a set of variables when only strict subsets of marginal distributions have been observed. Estimation of the covariance matrix of the full set of variables is challenging given the missing-data pattern. Factor analysis models use lower-dimensional latent variables in the data-generating process, and this introduces low-rank components in the complete-data matrix and the population covariance matrix. The low-rank structure of the factor analysis model can be exploited to estimate the full covariance matrix from incomplete data via low-rank matrix completion. We prove the identifiability of the factor analysis model in the statistical file-matching problem under conditions on the number of factors and the number of shared variables over the observed marginal subsets. Additionally, we provide an EM algorithm for parameter estimation. On several real datasets, the factor model gives smaller reconstruction errors in file-matching problems than the common approaches for low-rank matrix completion.
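The low-rank completion idea summarised above can be illustrated numerically. The following sketch (Python/NumPy; the block sizes, variable names, and the pseudoinverse identity are our own simplified illustration, not the EM estimator developed in the paper) completes the never-jointly-observed Cov(Y, Z) block of a q-factor model from the blocks that are identified in the file-matching pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
p_x, p_y, p_z, q = 4, 3, 3, 2                    # X shared; Y only in file A, Z only in file B
Lam = rng.standard_normal((p_x + p_y + p_z, q))  # population factor loadings
Lam_x, Lam_y, Lam_z = Lam[:p_x], Lam[p_x:p_x + p_y], Lam[p_x + p_y:]

# Low-rank covariance blocks that ARE identified from the two files:
C_xx = Lam_x @ Lam_x.T        # from either file
C_yx = Lam_y @ Lam_x.T        # from file A, where (X, Y) are observed
C_xz = Lam_x @ Lam_z.T        # from file B, where (X, Z) are observed

# Rank-q structure gives Lam_y Lam_z^T = C_yx C_xx^+ C_xz when rank(Lam_x) = q.
C_yz_hat = C_yx @ np.linalg.pinv(C_xx) @ C_xz
```

With sample rather than population covariances the identity holds only approximately; the EM algorithm in the paper provides likelihood-based estimates instead.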
References
Abdelaal, T., Höllt, T., van Unen, V., Lelieveldt, B.P.F., Koning, F., Reinders, M.J.T., Mahfouz, A.: CyTOFmerge: integrating mass cytometry data across multiple panels. Bioinformatics 35(20), 4063–4071 (2019)
Ahfock, D., Pyne, S., Lee, S.X., McLachlan, G.J.: Partial identification in the statistical matching problem. Comput. Stat. Data Anal. 104, 79–90 (2016)
Anderson, T.W., Rubin, H.: Statistical inference in factor analysis. In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, pp. 238–246 (1956)
Barry, J.T.: An investigation of statistical matching. J. Appl. Stat. 15(3), 275–283 (1988)
Bekker, P.A., ten Berge, J.M.: Generic global identification in factor analysis. Linear Algebra Appl. 264, 255–263 (1997)
Bishop, W.E., Byron, M.Y.: Deterministic symmetric positive semidefinite matrix completion. In: Advances in Neural Information Processing Systems, pp. 2762–2770 (2014)
Browne, M.W.: Asymptotically distribution-free methods for the analysis of covariance structures. Br. J. Math. Stat. Psychol. 37(1), 62–83 (1984)
Candes, E.J., Plan, Y.: Matrix completion with noise. Proc. IEEE 98(6), 925–936 (2010)
Conti, P.L., Marella, D., Scanu, M.: Uncertainty analysis in statistical matching. J. Off. Stat. 28(1), 69–88 (2012)
Conti, P.L., Marella, D., Scanu, M.: Statistical matching analysis for complex survey data with applications. J. Am. Stat. Assoc. 111(516), 1715–1725 (2016)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Royal Stat. Soc. B 39, 1–38 (1977)
D’Orazio, M.: Statistical learning in official statistics: the case of statistical matching. Stat. J. IAOS 35(3), 435–441 (2019)
D’Orazio, M., Di Zio, M., Scanu, M.: Statistical Matching: Theory and Practice. Wiley, New York (2006a)
D’Orazio, M., Di Zio, M., Scanu, M.: Statistical matching for categorical data: displaying uncertainty and using logical constraints. J. Off. Stat. 22(1), 137 (2006b)
Gustafson, P.: Bayesian Inference for Partially Identified Models: Exploring the Limits of Limited Data. CRC Press, Boca Raton (2015)
Hastie, T., Mazumder, R.: softImpute: Matrix Completion via Iterative Soft-Thresholded SVD (2021). R package version 1.4-1
Ibrahim, J.G., Zhu, H., Tang, N.: Model selection criteria for missing-data problems using the EM algorithm. J. Am. Stat. Assoc. 103(484), 1648–1658 (2008)
Kadane, J.B.: Some statistical problems in merging data files. J. Off. Stat. 17(3), 423 (2001)
Kamakura, W.A., Wedel, M.: Factor analysis and missing data. J. Market. Res. 37(4), 490–498 (2000)
Koltchinskii, V., Lounici, K., Tsybakov, A.B.: Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Stat. 39(5), 2302–2329 (2011)
Ledermann, W.: On the rank of the reduced correlational matrix in multiple-factor analysis. Psychometrika 2(2), 85–93 (1937)
Lee, G., Finn, W., Scott, C.: Statistical file matching of flow cytometry data. J. Biomed. Inform. 44(4), 663–676 (2011)
Li, G., Jung, S.: Incorporating covariates into integrated factor analysis of multi-view data. Biometrics 73(4), 1433–1442 (2017)
Little, R.J.: Missing-data adjustments in large surveys. J. Bus. Econ. Stat. 6(3), 287–296 (1988)
Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data, 2nd edn. Wiley, Hoboken (2002)
Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010)
Moriarity, C., Scheuren, F.: Statistical matching: a paradigm for assessing the uncertainty in the procedure. J. Off. Stat. 17(3), 407 (2001)
O’Connell, M.J., Lock, E.F.: Linked matrix factorization. Biometrics 75(2), 582–592 (2019)
O’Neill, K., Aghaeepour, N., Parker, J., Hogge, D., Karsan, A., Dalal, B., Brinkman, R.R.: Deep profiling of multitube flow cytometry data. Bioinformatics 31(10), 1623–1631 (2015)
Park, J.Y., Lock, E.F.: Integrative factorization of bidimensionally linked matrices. Biometrics 76(1), 61–74 (2020)
Pedreira, C.E., Costa, E.S., Barrena, S., Lecrevisse, Q., Almeida, J., van Dongen, J.J.M., Orfao, A.: Generation of flow cytometry data files with a potentially infinite number of dimensions. Cytom. Part A 73(9), 834–846 (2008)
Preacher, K.J., Zhang, G., Kim, C., Mels, G.: Choosing the optimal number of factors in exploratory factor analysis: a model selection perspective. Multivar. Behav. Res. 48(1), 28–56 (2013)
Rässler, S.: Statistical Matching: A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches. Springer-Verlag, New York (2002)
Rodgers, W.L.: An evaluation of statistical matching. J. Bus. Econ. Stat. 2(1), 91 (1984)
Rubin, D.B., Thayer, D.T.: EM algorithms for ML factor analysis. Psychometrika 47(1), 69–76 (1982)
Sachs, K., Itani, S., Carlisle, J., Nolan, G.P., Pe’er, D., Lauffenburger, D.A.: Learning signaling network structures with sparsely distributed data. J. Comput. Biol. 16(2), 201–212 (2009)
Schönemann, P.H.: A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1), 1–10 (1966)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Shapiro, A.: Identifiability of factor analysis: some results and open problems. Linear Algebra Appl. 70, 1–7 (1985)
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R.B.: Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 520–525 (2001)
Van Buuren, S.: Flexible Imputation of Missing Data. CRC Press, Boca Raton (2018)
You, K.: filling: Matrix Completion, Imputation, and Inpainting Methods (2020). R package version 0.2.1
Acknowledgements
We would like to thank the reviewers for thoughtful suggestions that have helped to shape and clarify the manuscript.
This research was partially funded by the Australian Government through the Australian Research Council (Project Number DP180101192).
Appendix
1.1 Proof of Lemma 1
Due to the rotational invariance of the factor model, we have that
\[
\varvec{\varLambda }_{X}^{A} = \varvec{\varLambda }_{X}\varvec{R}_{1}, \quad
\varvec{\varLambda }_{Y}^{A} = \varvec{\varLambda }_{Y}\varvec{R}_{1}, \quad
\varvec{\varLambda }_{Z}^{A} = \varvec{\varLambda }_{Z}\varvec{R}_{1}, \quad
\varvec{\varLambda }_{X}^{B} = \varvec{\varLambda }_{X}\varvec{R}_{2} = \varvec{\varLambda }_{X}^{A}\varvec{R}_{3}, \quad
\varvec{\varLambda }_{Z}^{B} = \varvec{\varLambda }_{Z}\varvec{R}_{2} = \varvec{\varLambda }_{Z}^{A}\varvec{R}_{3}, \qquad (17)
\]
for orthogonal matrices \(\varvec{R}_{1}\), \(\varvec{R}_{2}\), and \(\varvec{R}_{3} = \varvec{R}_{1}^{\mathsf T}\varvec{R}_{2}\). The alignment of \(\varvec{\varLambda }_{X}^{A}\) and \(\varvec{\varLambda }_{X}^{B}\) is an orthogonal Procrustes problem. Let \(\varvec{R}\) be the solution to the optimisation problem
\[
\varvec{R} = \mathop {\mathrm {arg\,min}}\limits _{\varvec{\varOmega }:\,\varvec{\varOmega }^{\mathsf T}\varvec{\varOmega } = \varvec{I}_{q}} \bigl \Vert \varvec{\varLambda }_{X}^{A} - \varvec{\varLambda }_{X}^{B}\varvec{\varOmega } \bigr \Vert _{F}^{2}. \qquad (18)
\]
Assuming that \(\varvec{\varLambda }_{X}^{A}\) and \(\varvec{\varLambda }_{X}^{B}\) are of full column rank, Schönemann (1966) showed that there is a unique solution to (18). As \(\text {rank}(\varvec{\varLambda }_{X}^{A})=\text {rank}(\varvec{\varLambda }_{X}^{B})=\text {rank}(\varvec{\varLambda }_{X})\), both \(\varvec{\varLambda }_{X}^{A}\) and \(\varvec{\varLambda }_{X}^{B}\) are of rank q under Assumption 1. Define \(\varvec{M} = (\varvec{\varLambda }_{X}^{B})^{\mathsf {T}}\varvec{\varLambda }_{X}^{A}\) and let the singular value decomposition of \(\varvec{M}\) be given by \(\varvec{M} =\varvec{W}\varvec{D}\varvec{Q}^{\mathsf {T}}\). Then, using the result from Schönemann (1966), the unique solution to (18) is given by \(\varvec{R} = \varvec{W}\varvec{Q}^{\mathsf {T}}\). The uniqueness of the solution implies that \(\varvec{R}=\varvec{R}_{3}^{\mathsf {T}}\), as \(\varvec{\varLambda }_{X}^{B}\varvec{R}_{3}^{\mathsf {T}} = \varvec{\varLambda }_{X}^{A}\varvec{R}_{3}\varvec{R}_{3}^{\mathsf {T}} = \varvec{\varLambda }_{X}^{A}\) from (17). Then \(\varvec{\varLambda }_{Z}^{B}\varvec{R} = \varvec{\varLambda }_{Z}^{B}\varvec{R}_{3}^{\mathsf {T}} =\varvec{\varLambda }_{Z}^{A}\varvec{R}_{3}\varvec{R}_{3}^{\mathsf {T}} = \varvec{\varLambda }_{Z}^{A}\), again using (17). Finally, \(\varvec{\varLambda }_{Y}^{A}(\varvec{\varLambda }_{Z}^{B}\varvec{R})^{\mathsf {T}} = \varvec{\varLambda }_{Y}^{A}(\varvec{\varLambda }_{Z}^{A})^{\mathsf {T}} = \varvec{\varLambda }_{Y}\varvec{R}_{1}\varvec{R}_{1}^{\mathsf {T}}\varvec{\varLambda }_{Z}^{\mathsf {T}} = \varvec{\varLambda }_{Y}\varvec{\varLambda }_{Z}^{\mathsf {T}}\).
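The Procrustes alignment in the proof can be checked numerically. In this sketch (Python/NumPy; randomly generated loadings and rotations stand in for the quantities in (17), and the names are illustrative), the SVD-based solution recovers \(\varvec{\varLambda }_{Y}\varvec{\varLambda }_{Z}^{\mathsf T}\) exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
p_x, p_y, p_z, q = 5, 3, 3, 2
Lam_x = rng.standard_normal((p_x, q))   # shared-variable loadings, rank q
Lam_y = rng.standard_normal((p_y, q))
Lam_z = rng.standard_normal((p_z, q))

def rand_orth(q, rng):
    # Random q x q orthogonal matrix via QR.
    Q, _ = np.linalg.qr(rng.standard_normal((q, q)))
    return Q

R1, R3 = rand_orth(q, rng), rand_orth(q, rng)
Lx_A, Ly_A, Lz_A = Lam_x @ R1, Lam_y @ R1, Lam_z @ R1  # file-A rotation
Lx_B, Lz_B = Lx_A @ R3, Lz_A @ R3                      # file-B rotation

# Orthogonal Procrustes alignment of the shared X-blocks (Schonemann 1966):
W, _, Qt = np.linalg.svd(Lx_B.T @ Lx_A)
R = W @ Qt                          # unique minimiser; equals R3^T here
LyLz = Ly_A @ (Lz_B @ R).T          # recovers Lam_y @ Lam_z.T
```

Because \(\varvec{\varLambda }_{X}^{B}\varvec{R}_{3}^{\mathsf T} = \varvec{\varLambda }_{X}^{A}\) attains a zero objective, the unique minimiser returned by the SVD coincides with \(\varvec{R}_{3}^{\mathsf T}\), exactly as argued above.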
1.2 Proof of Theorem 1
Using Theorem 5.1 in Anderson and Rubin (1956), Assumption 2 guarantees that if
\[
\begin{pmatrix}\varvec{\varLambda }_{X}\\ \varvec{\varLambda }_{Y}\end{pmatrix}\begin{pmatrix}\varvec{\varLambda }_{X}\\ \varvec{\varLambda }_{Y}\end{pmatrix}^{\mathsf T}+\begin{pmatrix}\varvec{\varPsi }_{X}&\varvec{0}\\ \varvec{0}&\varvec{\varPsi }_{Y}\end{pmatrix}
= \begin{pmatrix}\varvec{\varLambda }_{X}^{*}\\ \varvec{\varLambda }_{Y}^{*}\end{pmatrix}\begin{pmatrix}\varvec{\varLambda }_{X}^{*}\\ \varvec{\varLambda }_{Y}^{*}\end{pmatrix}^{\mathsf T}+\begin{pmatrix}\varvec{\varPsi }_{X}^{*}&\varvec{0}\\ \varvec{0}&\varvec{\varPsi }_{Y}^{*}\end{pmatrix}, \quad
\begin{pmatrix}\varvec{\varLambda }_{X}\\ \varvec{\varLambda }_{Z}\end{pmatrix}\begin{pmatrix}\varvec{\varLambda }_{X}\\ \varvec{\varLambda }_{Z}\end{pmatrix}^{\mathsf T}+\begin{pmatrix}\varvec{\varPsi }_{X}&\varvec{0}\\ \varvec{0}&\varvec{\varPsi }_{Z}\end{pmatrix}
= \begin{pmatrix}\varvec{\varLambda }_{X}^{*}\\ \varvec{\varLambda }_{Z}^{*}\end{pmatrix}\begin{pmatrix}\varvec{\varLambda }_{X}^{*}\\ \varvec{\varLambda }_{Z}^{*}\end{pmatrix}^{\mathsf T}+\begin{pmatrix}\varvec{\varPsi }_{X}^{*}&\varvec{0}\\ \varvec{0}&\varvec{\varPsi }_{Z}^{*}\end{pmatrix}, \qquad (19)
\]
then the uniquenesses are equal, \(\varvec{\varPsi }_{X}=\varvec{\varPsi }_{X}^{*}\), \(\varvec{\varPsi }_{Y}=\varvec{\varPsi }_{Y}^{*}\), and \(\varvec{\varPsi }_{Z}=\varvec{\varPsi }_{Z}^{*}\), implying
\[
\varvec{\varLambda }_{X}\varvec{\varLambda }_{X}^{\mathsf T} = \varvec{\varLambda }_{X}^{*}\varvec{\varLambda }_{X}^{*\mathsf T}, \quad
\varvec{\varLambda }_{Y}\varvec{\varLambda }_{Y}^{\mathsf T} = \varvec{\varLambda }_{Y}^{*}\varvec{\varLambda }_{Y}^{*\mathsf T}, \quad
\varvec{\varLambda }_{Z}\varvec{\varLambda }_{Z}^{\mathsf T} = \varvec{\varLambda }_{Z}^{*}\varvec{\varLambda }_{Z}^{*\mathsf T}, \quad
\varvec{\varLambda }_{X}\varvec{\varLambda }_{Y}^{\mathsf T} = \varvec{\varLambda }_{X}^{*}\varvec{\varLambda }_{Y}^{*\mathsf T}, \quad
\varvec{\varLambda }_{X}\varvec{\varLambda }_{Z}^{\mathsf T} = \varvec{\varLambda }_{X}^{*}\varvec{\varLambda }_{Z}^{*\mathsf T}. \qquad (20)
\]
Using Lemma 1, \(\varvec{\varLambda }_{Y}\varvec{\varLambda }_{Z}^{\mathsf T}\) can be uniquely recovered given the matrices on the left-hand side of (19) and (20). Likewise, \(\varvec{\varLambda }_{Y}^{*}\varvec{\varLambda }_{Z}^{*\mathsf T}\) can be uniquely recovered given the matrices on the right-hand side of (19) and (20). It remains to show that \(\varvec{\varLambda }_{Y}\varvec{\varLambda }_{Z}^{\mathsf T} = \varvec{\varLambda }_{Y}^{*}\varvec{\varLambda }_{Z}^{*\mathsf T}\). To do so, define the eigendecompositions
\[
\begin{pmatrix}\varvec{\varLambda }_{X}\\ \varvec{\varLambda }_{Y}\end{pmatrix}\begin{pmatrix}\varvec{\varLambda }_{X}\\ \varvec{\varLambda }_{Y}\end{pmatrix}^{\mathsf T} = \varvec{V}_{A}\varvec{D}_{A}\varvec{V}_{A}^{\mathsf T}, \qquad
\begin{pmatrix}\varvec{\varLambda }_{X}\\ \varvec{\varLambda }_{Z}\end{pmatrix}\begin{pmatrix}\varvec{\varLambda }_{X}\\ \varvec{\varLambda }_{Z}\end{pmatrix}^{\mathsf T} = \varvec{V}_{B}\varvec{D}_{B}\varvec{V}_{B}^{\mathsf T},
\]
and the rotated and scaled eigenvectors
\[
\begin{pmatrix}\varvec{\varGamma }_{X}^{A}\\ \varvec{\varGamma }_{Y}^{A}\end{pmatrix} = \varvec{V}_{A}\varvec{D}_{A}^{1/2}, \qquad
\begin{pmatrix}\varvec{\varGamma }_{X}^{B}\\ \varvec{\varGamma }_{Z}^{B}\end{pmatrix} = \varvec{V}_{B}\varvec{D}_{B}^{1/2}.
\]
Using Assumption 1 and Lemma 1, the equality
\[
\varvec{\varGamma }_{Y}^{A}\bigl (\varvec{\varGamma }_{Z}^{B}\varvec{W}\varvec{Q}^{\mathsf T}\bigr )^{\mathsf T} = \varvec{\varLambda }_{Y}\varvec{\varLambda }_{Z}^{\mathsf T} \qquad (21)
\]
must hold, where \(\varvec{W}\) and \(\varvec{Q}\) are the left and right singular vectors of the matrix \(\varvec{M} = (\varvec{\varGamma }_{X}^{B})^{\mathsf {T}}\varvec{\varGamma }_{X}^{A} = \varvec{W}\varvec{D}\varvec{Q}^{\mathsf {T}}\). The equalities in (19) and (20) ensure that the starred parameters generate the same matrices \(\varvec{\varGamma }\), so the identical construction yields \(\varvec{\varGamma }_{Y}^{A}(\varvec{\varGamma }_{Z}^{B}\varvec{W}\varvec{Q}^{\mathsf T})^{\mathsf T} = \varvec{\varLambda }_{Y}^{*}\varvec{\varLambda }_{Z}^{*\mathsf T}\). Combining the equalities in (19), (20) and (21) gives the main result
\[
\begin{pmatrix}\varvec{\varLambda }_{X}\\ \varvec{\varLambda }_{Y}\\ \varvec{\varLambda }_{Z}\end{pmatrix}\begin{pmatrix}\varvec{\varLambda }_{X}\\ \varvec{\varLambda }_{Y}\\ \varvec{\varLambda }_{Z}\end{pmatrix}^{\mathsf T}+\varvec{\varPsi }
= \begin{pmatrix}\varvec{\varLambda }_{X}^{*}\\ \varvec{\varLambda }_{Y}^{*}\\ \varvec{\varLambda }_{Z}^{*}\end{pmatrix}\begin{pmatrix}\varvec{\varLambda }_{X}^{*}\\ \varvec{\varLambda }_{Y}^{*}\\ \varvec{\varLambda }_{Z}^{*}\end{pmatrix}^{\mathsf T}+\varvec{\varPsi }^{*}, \qquad (22)
\]
where \(\varvec{\varPsi } = \mathrm {diag}(\varvec{\varPsi }_{X}, \varvec{\varPsi }_{Y}, \varvec{\varPsi }_{Z})\).
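The eigendecomposition argument can likewise be verified numerically. In the sketch below (Python/NumPy; dimensions and names are illustrative), the scaled eigenvectors \(\varvec{\varGamma }\) are computed directly from the low-rank covariance blocks, so the reconstruction depends only on quantities that both parameterisations share:

```python
import numpy as np

rng = np.random.default_rng(2)
p_x, p_y, p_z, q = 5, 3, 3, 2
Lam = rng.standard_normal((p_x + p_y + p_z, q))
Lam_x, Lam_y, Lam_z = Lam[:p_x], Lam[p_x:p_x + p_y], Lam[p_x + p_y:]

def scaled_eigvecs(C, q):
    # Top-q eigenvectors of C scaled by root eigenvalues, so C = G @ G.T.
    vals, vecs = np.linalg.eigh(C)
    return vecs[:, -q:] * np.sqrt(vals[-q:])

# Low-rank parts of the two observed marginal covariances (uniquenesses removed):
L_A = np.vstack([Lam_x, Lam_y])
L_B = np.vstack([Lam_x, Lam_z])
G_A = scaled_eigvecs(L_A @ L_A.T, q)    # equals L_A up to an orthogonal rotation
G_B = scaled_eigvecs(L_B @ L_B.T, q)    # equals L_B up to another rotation

Gx_A, Gy_A = G_A[:p_x], G_A[p_x:]
Gx_B, Gz_B = G_B[:p_x], G_B[p_x:]

# Align the shared X-blocks by orthogonal Procrustes, then form the product:
W, _, Qt = np.linalg.svd(Gx_B.T @ Gx_A)
LyLz = Gy_A @ (Gz_B @ W @ Qt).T         # equals Lam_y @ Lam_z.T
```

Whatever rotation the eigendecomposition happens to return, the Procrustes step absorbs it, which is why the recovered product is unique.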
Cite this article
Ahfock, D., Pyne, S. & McLachlan, G.J. Data fusion using factor analysis and low-rank matrix completion. Stat Comput 31, 58 (2021). https://doi.org/10.1007/s11222-021-10033-7