Composite likelihood and maximum likelihood methods for joint latent class modeling of disease prevalence and high-dimensional semicontinuous biomarker data

Joint latent class modeling of disease prevalence and high-dimensional semicontinuous biomarker data has been proposed to study the relationship between diseases and their related biomarkers. However, statistical inference for the joint latent class modeling approach has proved very challenging because of the computational complexity of seeking maximum likelihood estimates. In this article, we propose a series of composite likelihoods for maximum composite likelihood estimation, as well as an enhanced Monte Carlo expectation–maximization (MCEM) algorithm for maximum likelihood estimation, in the context of joint latent class models. Theoretically, the maximum composite likelihood estimates are consistent and asymptotically normal. Numerically, we show that, compared with the MCEM algorithm that maximizes the full likelihood, the composite likelihood approach coupled with the quasi-Newton method not only substantially reduces the computational complexity and duration but also retains comparable estimation efficiency.

We sincerely thank the two anonymous reviewers, the Associate Editor, and the Editors for their valuable comments, which have substantially improved this manuscript. The views expressed in this article are those of the authors and do not necessarily represent the views of the US Food and Drug Administration.

Appendix 1: Model selection

In practice, after conducting the MCLE and MLE in joint latent class modeling with K fixed at each candidate value, data analysts need to determine the optimal number of latent classes. In the context of joint latent class modeling, a unified model selection strategy that can be applied to both MCLE and MLE is preferable. Here, we propose to employ the simulated likelihood approach (Geyer and Thompson 1992; Xie et al. 2013), combined with the Akaike information criterion (AIC), to select the best K. Let \(\hat{\varvec{\theta }}\) be the estimates obtained from the MCLE or MLE procedure. Note that the marginal likelihood (11), or equivalently (8), is obtained by summing over the latent classes \(L_i\) and integrating over the random effects \(\mathbf {b}_j\). By Monte Carlo integration, the maximized likelihood \(L(\hat{\varvec{\theta }})\) can be approximated by

$$\begin{aligned} \hat{L}(\hat{\varvec{\theta }})&= \frac{1}{\Lambda } \sum _{\lambda =1}^\Lambda \left[ \prod _{i=1}^I\left\{ \frac{e^{\,y_i\left( \hat{\beta }_0+\hat{\beta }_1L_i^{(\lambda )}+ \mathbf {w}_{i}^{\prime } \hat{\varvec{\gamma }}\right) }}{1+e^{\hat{\beta }_0+\hat{\beta }_1L_i^{(\lambda )}+\mathbf {w}_{i}^{\prime }\hat{\varvec{\gamma }}}} \right. \right. \\&\quad \left. \left. \times \prod _{j=1}^J\frac{f^{\,u_{ij}}_{V_{ij}|L_i^{(\lambda )}, \mathbf {b}_j^{(\lambda )}, \mathbf {z}_{ij}}(v_{ij})\,e^{u_{ij}\left( \hat{\eta }_0+\hat{\eta }_1h\left( \hat{\mu }_{ij}(L_i^{(\lambda )}, \mathbf {b}_j^{(\lambda )}, \mathbf {z}_{ij}), \mathbf {t}_{ij},\hat{\varvec{\zeta }}\right) \right) }}{1+ e^{\hat{\eta }_0+\hat{\eta }_1h\left( \hat{\mu }_{ij}(L_i^{(\lambda )}, \mathbf {b}_j^{(\lambda )}, \mathbf {z}_{ij}), \mathbf {t}_{ij},\hat{\varvec{\zeta }}\right) }}\right\} \right] \end{aligned}$$

where \(\Lambda \) is the total number of sampling realizations (\(\Lambda =10^{6}\) in the case study analysis), \(L_i^{(\lambda )}\) is the \(\lambda \)th simulated realization from the multinomial distribution Multinomial\((1,(\hat{\pi }_0,\ldots , \,\hat{\pi }_{K-1}))\) for the ith subject, and \(\mathbf {b}_j^{(\lambda )}\) is the \(\lambda \)th simulated realization from \(N\left( (0, 0)^\prime , \left( \begin{array}{cc} \hat{\sigma }_0^2 &{} \hat{\rho }\hat{\sigma }_0\hat{\sigma }_1 \\ \hat{\rho }\hat{\sigma }_0\hat{\sigma }_1 &{} \hat{\sigma }_1^2 \\ \end{array} \right) \right) \) for the jth biomarker. Once \(\hat{L}(\hat{\varvec{\theta }})\) is obtained, the AIC values can be calculated accordingly.
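The simulated-likelihood step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `draw_latents`, `simulated_loglik`, and `aic` are hypothetical, and `cond_loglik` stands for a user-supplied function that returns the logarithm of the bracketed product in the approximation for one draw of the latent classes and random effects. The sketch averages on the log scale (log-sum-exp), an added numerical safeguard because a product over all subjects and biomarkers can underflow, and it uses the standard definition AIC = 2p - 2 log L, with p the number of free parameters.

```python
import math
import random


def draw_latents(pi_hat, sigma0, sigma1, rho, n_subjects, n_biomarkers, rng):
    """Draw one realization of (L_1..L_I, b_1..b_J) at the fitted parameters."""
    labels = list(range(len(pi_hat)))
    # L_i ~ Multinomial(1, pi_hat), stored as a class label in 0..K-1.
    L = rng.choices(labels, weights=pi_hat, k=n_subjects)
    b = []
    for _ in range(n_biomarkers):
        z0 = rng.gauss(0.0, 1.0)
        z1 = rng.gauss(0.0, 1.0)
        # Cholesky construction: Var(b0)=sigma0^2, Var(b1)=sigma1^2,
        # Cov(b0, b1)=rho*sigma0*sigma1.
        b0 = sigma0 * z0
        b1 = sigma1 * (rho * z0 + math.sqrt(1.0 - rho * rho) * z1)
        b.append((b0, b1))
    return L, b


def simulated_loglik(cond_loglik, pi_hat, sigma0, sigma1, rho,
                     n_subjects, n_biomarkers, n_draws, seed=0):
    """Monte Carlo estimate of the log maximized likelihood.

    cond_loglik(L, b) is a hypothetical callable returning the log of the
    conditional likelihood of the observed data given one latent draw.
    """
    rng = random.Random(seed)
    logs = []
    for _ in range(n_draws):
        L, b = draw_latents(pi_hat, sigma0, sigma1, rho,
                            n_subjects, n_biomarkers, rng)
        logs.append(cond_loglik(L, b))
    m = max(logs)
    if m == -math.inf:
        return -math.inf
    # log( (1/n) * sum_k exp(logs[k]) ), computed stably around the maximum.
    return m + math.log(sum(math.exp(v - m) for v in logs)) - math.log(n_draws)


def aic(loglik, n_params):
    """Akaike information criterion from a log-likelihood value."""
    return 2.0 * n_params - 2.0 * loglik


# Usage with a placeholder conditional log-likelihood (illustration only;
# a real analysis would evaluate the fitted model's conditional likelihood).
toy_cond_loglik = lambda L, b: -0.5 * sum(b0 * b0 for b0, _ in b)
ll = simulated_loglik(toy_cond_loglik, [0.6, 0.4], 1.0, 0.5, 0.2,
                      n_subjects=5, n_biomarkers=3, n_draws=1000)
aic_value = aic(ll, n_params=4)
```

Repeating this for each candidate K and keeping the K with the smallest AIC gives one selection rule that applies to both the MCLE and the MLE fits. The number of draws is a tuning choice; the case study above used \(10^{6}\).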

Zhang, B., Liu, W., Zhang, H. et al. Composite likelihood and maximum likelihood methods for joint latent class modeling of disease prevalence and high-dimensional semicontinuous biomarker data. Comput Stat 31, 425–449 (2016).

