Abstract
This paper presents a detailed survival analysis for chronic kidney disease (CKD). The analysis is based on EHR data comprising almost two decades of clinical observations collected at NewYork-Presbyterian, a large hospital in New York City with one of the oldest electronic health records in the United States. Our survival analysis approach centers on Bayesian multiresolution hazard modeling, with the objective of capturing the changing hazard of CKD over time, adjusted for patient clinical covariates and kidney-related laboratory tests. Special attention is paid to statistical issues common to all EHR data, such as cohort definition, missing data and censoring, variable selection, and the potential for joint survival and longitudinal modeling, each of which is discussed both in general and within the EHR CKD context.
Keywords: EHR, chronic kidney disease, survival analysis, multiresolution hazard, MRH
1. Introduction
Early detection of disease or its progression is one of the fundamental goals of healthcare. The earlier detection occurs, the more intervention options are available to patients and caregivers. The Electronic Health Record (EHR) contains much data about patient care collected over time. As such, it presents a detailed and ubiquitous source of patient information, which can be leveraged to learn about disease and healthcare processes. While EHR data are not curated like traditional clinical research datasets, there is evidence that the patient record contains valuable information which, if used effectively, could help shorten delays in diagnosis and avoid severe consequences for patients [1, 2]. Furthermore, because EHR-based models operate over variables which are inherent to the tools clinicians use in their practice, translating results into actionable knowledge is not as complex as it is with traditional epidemiological studies [3]. Finally, in addition to containing clinical variables, EHR data also provide information on healthcare-related variables, such as the rate of doctor visits, itself a proxy for patient adherence and access to care, which co-evolve with the clinical variables. There is emerging evidence that documentation patterns in the EHR are indicative of health status [4].
Extracting meaningful information from the EHR is not a trivial step, however. Because the data are collected with care as the primary goal, as opposed to research and clinical discovery, there are several challenges entailed in using the data directly in statistical analyses. In this paper, we explore the potential of using the EHR to extract information for predictive and explanatory statistical modeling of disease progression. As an example, we carry out a detailed survival analysis for chronic kidney disease based on the EHR data comprising up to twenty years of clinical observations collected at NewYork-Presbyterian, a large hospital in New York City with one of the oldest electronic health records in the United States. Our survival analysis approach centers around multiresolution hazard modeling, with an objective to capture the changing hazard of CKD over time, adjusted for clinical covariates and kidney-related laboratory tests.
The key contributions of this paper are: (i) careful data preparation to mitigate some of the characteristics of EHR data, namely cohort definition, missing data, and censoring; (ii) the novel application of a survival analysis technique to EHR longitudinal data; and (iii) the comparison of two types of models: one based on patient demographics and healthcare process variables, and one augmented with laboratory test variables. The healthcare-based model uncovers valuable information for characterizing disease progression, while the laboratory-augmented model identifies key factors of disease progression in line with the clinical literature on chronic kidney disease. Our experiments confirm that the EHR holds information valuable for understanding disease progression and for identifying future directions of research.
2. Statistical Issues in EHR Data
EHR data hold great potential for exploring long-term medical, epidemiological, public health, economic, and demographic questions, and are rapidly becoming a ubiquitous and practical data source for both guiding clinical practice and conducting epidemiological research. However, as with any non-controlled dataset, the processes under which EHR data are collected or generated are rich with structure that has to be carefully considered in statistical analyses. The EHR is especially complex as it represents a mixture of epidemiological (disease related), behavioral (patient related), and institutional (healthcare related) processes, all of which may change over time, and do so depending on the outcome of interest. EHR data require detailed statistical considerations, including discussions of the generalizability of results based on a specific EHR, outcome-based cohort selection within a given EHR, informative missingness, and uncertain censoring. Each of these issues is non-trivial, making the use of EHR data controversial [5–8], especially when used for describing aggregated macroscopic observables [9]. We take a look at several of the major statistical issues in greater detail. Wherever possible, we try to point readers to areas of needed future development.
Generalizability.
An EHR population is often tied to a specific location, via the so-called “catchment area” of the governing institution. The size and socio-demographic makeup of the catchment area depends on the institutional infrastructure, and economic as well as administrative issues. These issues may manifest themselves in practice as differences in distributions of risk factors, such as race, socio-economic status, and institutional and medical protocols, which could result in EHR-specific disease case severity and risk profile mixes. When not accounted for, these differences could lead to analyses and results for the specific EHR population being non-representative of the population at large [10, 11]. In addition, uncertainty in patient socio-economic predictors corresponds to the measurement error problem, which needs to be dealt with statistically as described in [11].
Simple examples of this include differences in EHR populations due to institution’s physical location, such as rural versus urban. Location may be a proxy for access to care, as well as environmental and socio-economic factors. More complex examples include situations where prevalence or other aspects of the disease under study are linked to the EHR population characteristics in a confounded way, where confounders are unmeasured and potentially time-varying [12, 13], such as relating health care patterns and the number of visits a subject has made to the doctor to disease severity. Many of these issues are not new, and have been extensively dealt with in the area of environmental health, health economics, and healthcare policy, especially outcomes research [14] and provider profiling literature [15, 16].
Issues such as these underline the importance of sophisticated statistical modeling as well as replicability for EHR at multiple institutions, and in multiple socio-economic strata. Development of formal methods for comparisons of specific EHR results to other EHRs as well as results published in the literature from alternative types of studies, perhaps using meta-analysis tools and Bayesian techniques [17, 18], could shed additional light on these complex issues.
Cohort definition.
EHR contains information on a variety of patients, differing in socio-economic and health histories, health seeking behavior, as well as types and quantity of data in their records. The act of performing an EHR-based study requires an algorithm-driven extraction of a specific and well-defined EHR subpopulation; including all patients from EHR is neither practical nor advisable. This process requires decisions regarding inclusion and exclusion criteria, length of record, and amount and quality of data in the record. The final cohort of patients is ideally well defined, and representative of the larger specific subpopulation it was chosen from.
Identifying patient cohorts from EHR data is an open research problem [19]. For some conditions, selection of patients can rely on simple laboratory test queries. For instance, patients with chronic kidney disease can be selected by their eGFR (estimated glomerular filtration rate; this variable is computed based on creatinine levels, a routine laboratory test) because eGFR is often the sole variable used to diagnose CKD. For other conditions, however, selecting patients is more complex, as the clues to the presence of disease are spread over heterogeneous data elements in the record (e.g., laboratory test values, mentions of symptoms in the notes, and presence of specific medications). Furthermore, an institution's record may miss relevant information about a patient's history. For instance, a particular medical event which should preclude a patient from being identified as a case might not have been recorded in the EHR because the patient sought care in a different institution. As such, there is often uncertainty associated with the definition of cases and controls among the patients. A careful cohort definition aims to reduce this uncertainty, ultimately increasing the usability of the findings [11]. When that is not possible, misclassification in cohort definition can be statistically incorporated into effect estimates as described in [11].
Missing data.
EHR data are generally collected over long time periods, which means that missing data is one of the most important statistical issues to consider. Missingness in each of the three processes (disease related, patient related and healthcare related), has a potential for biasing the results of statistical analyses [11, 20, 21]. Additionally, uncertainty about the missingness mechanisms themselves presents another layer of complexity for statistical analysis of EHR data.
Amongst the many types of missingness, record gaps (incomplete data) and censoring (left, right, and interval) are some of the most commonly encountered in the EHR. More formally, if we let an individual's true health status be represented by a multivariate continuous-time stochastic process X(t), and his or her interaction with a healthcare system by another stochastic process Y(t, X(t)), then the observed EHR record is a third stochastic process h(X(t), Y(t, X(t))). In this framework the first process is latent (hidden, unobserved), while the second and third are only partially and often irregularly observed.
In many scientific contexts, missingness might be relatively benign and non-informative, and manifested simply as non-uniformly sampled data in a way that is not related to the disease of interest (missing at random). However, in the EHR context, the missingness can be informative and, more specifically, related to the patient’s disease status itself. In those situations, postulating the mechanisms underlying the missingness, as with Bayesian approaches [22], and formally assessing robustness of results with respect to the implicitly defined missingness mechanisms, via subsampling and comparisons with other large EHR populations and clinical studies, is paramount.
Censoring.
Real patients move, change insurance, and are seen by different healthcare centers and institutions that collect data independently. Moreover, real patients vary in healthcare seeking behaviors, both reactively and proactively. This means that measurement pattern (visits as well as lack of visits) is sometimes related to health state. However, it is unknown whether lack of measurement implies absence of the disease, a severe case of an unrelated morbidity or mortality, a severe case of the disease itself treated by a different healthcare provider, or is a reflection of a patient’s attitude towards healthcare. In the time-to-event context, this means that censoring is a complex issue, where different types of censoring (left, right, interval censoring) may occur for each patient, and cannot always be assumed to be a separate, random process. For example, in the study of progression of chronic kidney disease, assuming onset of disease along with subsequent information are indeed present in a patient’s record, lack of a progression event causes the patient’s time to progression to be right censored at the last recorded visit date. Put differently, a lack of measurement does not imply health, and the information in the record only informs the lower bound on the time to progression.
Like in patient cohort selection, one way to alleviate the effect of this challenge is to focus on a patient population for which EHR has enough longitudinal and representative information. This may be an attainable goal for EHRs from institutions with captive populations which do not tend to move (long-term residents), or where patients are observed for all care, as may be the case with inner city populations and community hospitals.
Variable selection.
EHR data exist in the form of laboratory and other physiologic measurements, structured and unstructured notes, orders of various types (e.g., medication orders), billing codes, and images (e.g., CAT scans). Within each of these variable types lie hundreds to thousands of distinct concepts and variables. Beyond the sheer number of variables to choose from, not all variables are measured with the same frequency, on the same severity scale, or with the same quality and certainty. For example, an event is sometimes reported in a clinician's note with qualifiers such as "perhaps mild anxiety" or "patient denies smoking". These issues make the automatic variable selection process in the EHR setting remarkably complex.
Additionally, variables in the record may have different granularities and definitions. For instance, laboratory measurements for the same type of test, such as glucose, are often recorded under different measurement names depending on the healthcare setting where the measurement occurs. In variables extracted from notes, granularity issues are even more pervasive: a patient's "cough" and "productive cough" could be considered the same concept in a study of kidney function, but should probably be considered separate variables when studying pneumonia [23]. Furthermore, some measurements become obsolete over different periods of time. For example, a troponin measurement is representative of a physiological state for a day, while an HbA1c value is valid for six months. The dimensions that capture the human condition, and the healthcare system itself, are continually evolving.
While there exist advanced methods for variable selection catered to large data (for example, see [24] for methods for variable selection for survival analysis), none developed to date are especially equipped to deal with EHR-specific issues. The variable selection step is a critical step to EHR-driven analysis, and research into adaptive and time-varying variable selection methods would be of great interest.
Statistical models: joint survival and longitudinal modeling.
A significant amount of the EHR's appeal lies in its ability to support joint modeling of a patient's time-varying processes and disease progression. As the medical field moves towards personalized medicine, the ability to provide patient-specific counseling based on all available patient data would be a powerful asset to any EHR-based system. Joint modeling of longitudinal and time-to-event data has been an active area at the forefront of biostatistical research, with many notable contributions over the past two decades (see for example [25–28]).
Within a large EHR, considerable diversity can often be observed in the frequency, quality, and type of longitudinal measurements, as alluded to in the earlier missing data discussion. Many patients have a burst of measurements over a short period of time, then go years without being seen by a provider. A fraction of patients have routine regular measurements, punctuated by aperiodic bursts of dense measurements; see Fig. 1 in [29] as an example. Yet another fraction has very sparse, irregular measurements, with no discernible pattern.
One way to deal with irregular measurements would be to follow the aggregation method in [30]. Alternatively, interpolation or latent variable modeling of covariate processes may provide a solution for imputing the missing measurements over time. There is a history of modified interpolation working well in the EHR context [31]. Nevertheless, interpolating many months or years of laboratory test values can lead to serious problems, not the least of which involve drastic changes in the derivatives of a patient's health state. As an example, consider a patient whose blood glucose level is normal for two years, but then rises suddenly at the end of the two-year period. If the patient's glucose is only measured at the beginning and the end of the two-year period, linearly interpolating between the two measurements will obfuscate the sharp change, likely masking (left-censoring) the true onset of disease. Researchers have tried to understand how to deal with this problem in the EHR context, using methods ranging from "half-life" analysis of laboratory measurements [32] to information entropy for establishing measurement rates such that they remain reliable [33]. Joint modeling of observed covariate processes and disease progression in the EHR context promises to be a fruitful area of research.
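To make the interpolation pitfall concrete, the following sketch (with hypothetical glucose values) shows how linearly interpolating between two distant measurements hides a late, sharp change:

```python
def linear_interpolate(t0, v0, t1, v1, t):
    """Linearly interpolate a lab value at time t between two measurements."""
    frac = (t - t0) / (t1 - t0)
    return v0 + frac * (v1 - v0)

# Hypothetical glucose values (mg/dL): normal at month 0, elevated at month 24.
# The true process stayed near 95 until month 23, then jumped.
t0, v0 = 0, 95.0
t1, v1 = 24, 200.0

interpolated_at_12 = linear_interpolate(t0, v0, t1, v1, 12)
print(interpolated_at_12)  # 147.5, far from the true value of roughly 95
```

The interpolated mid-point suggests a gradual rise over the whole period, left-censoring the true onset of the change.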
3. Case Study: Chronic Kidney Disease
3.1. Chronic Kidney Disease
Chronic Kidney Disease (CKD) is a progressive condition in which kidney function decreases over time. When left untreated or poorly managed, CKD can lead to renal failure requiring dialysis and transplantation. In the United States its prevalence is reaching epidemic proportions (13% of the general population have a diagnosis of CKD, and 6.3% have stage 3 or higher CKD), and it costs the nation billions annually, totaling 17% of total Medicare expenditures for patients of age 65 and over [34].
The National Kidney Foundation has established five stages of progression of CKD to reflect levels of kidney damage and ensure that each stage gets appropriate care. Staging is based on the glomerular filtration rate (GFR), which is considered the best estimate of kidney function. In practice, the estimated GFR (eGFR), calculated using routinely collected variables, such as plasma creatinine level, age, and gender, is an acceptable metric for kidney function.
The official definition of CKD stage 3 (or moderate CKD) is kidney damage as reflected by a moderate decrease in eGFR to the range of 30–59 ml/min/1.73m² for at least three months [35]. Patients are recommended to be referred to a nephrologist upon reaching stage 3, so that an appropriate treatment can be devised to slow the decline in kidney function. Stage 4, or severe CKD, is defined as eGFR in the range of 15–29 ml/min/1.73m². Stage 5 (eGFR below 15 ml/min/1.73m²) is considered kidney failure. If CKD is left untreated, eGFR decreases in a roughly linear fashion [35]. In our study of CKD progression, we thus focus on the survival analysis of progression from stage 3 to stage 4.
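The staging logic above can be sketched as follows; the 4-variable MDRD formula used here is the commonly published one, and the exact coefficients should be verified against the study's cited calculation before any reuse:

```python
def egfr_mdrd(creatinine_mg_dl, age_years, female, black=False):
    """4-variable MDRD estimate of GFR, in ml/min/1.73m^2
    (coefficients as commonly published; verify before reuse)."""
    egfr = 175.0 * creatinine_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742
    if black:
        egfr *= 1.212
    return egfr

def ckd_stage(egfr):
    """Map an eGFR value to CKD stages 3-5 using the thresholds
    in the text (None if eGFR >= 60)."""
    if egfr < 15:
        return 5
    if egfr < 30:
        return 4
    if egfr < 60:
        return 3
    return None

# A 60-year-old woman with plasma creatinine of 1.5 mg/dL:
e = egfr_mdrd(1.5, 60, female=True)
print(ckd_stage(e))  # 3: the eGFR falls in the 30-59 range
```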
Many conditions can lead to CKD, including diabetes, cardiovascular disease, and hypertension [35]. The mechanisms of damage vary: a number of factors have been found to be predictive of a decline in function, and some factors may be protective [36–38]. Preventing dire outcomes requires early detection, aggressive management plans, and resources [39]. Overall, CKD is a complex disease because decreased kidney function affects many systems in the body, which in turn damage the kidney further. As such, progression of CKD is still poorly understood [38].
3.2. Dataset and Data Preprocessing
After IRB approval, the dataset was assembled from a single institution, NewYork-Presbyterian (NYP) hospital, in New York City. Part of NYP is an outpatient clinic, which provides primary care. Patients who visit the clinic tend to be local to the hospital. Thus, in addition to routine visits to their primary care providers and consultations with specialists, their records are also more likely to contain information from visits to the emergency department and hospital admissions. Choosing a stable subpopulation like this one reduces censoring and informative missingness, and increases the likelihood of the EHR containing most of the patient's medical record. Furthermore, for the sake of a survival analysis study of a disease such as CKD, for which progression occurs at the resolution of years, our dataset comprises regular patients of the outpatient clinic.
While the EHR contains much data, we focused on three types of variables: demographic variables, healthcare variables, and laboratory test variables. Healthcare variables, such as the number of visits, are particularly attractive in investigating EHR data, as their dynamics reflect a mixture of processes: the disease-related process (more severely sick patients are likely to visit their doctors more than less severe patients), the healthcare process (routine visits and consultation visits often follow schedules suggested by clinical guidelines, but also depend on access to care and types of insurance benefits), and the patient process (some patients are more adherent to care than others, and thus are more likely to visit their doctors). The EHR is the perfect conduit for capturing visit patterns. We hypothesize that healthcare variables, as derived from the EHR, provide valuable insight into disease progression. As for laboratory variables, they have been the primary type of variable investigated in disease progression thus far, and as such it was natural to include them [38]. We leave for future work the use of other EHR variables, such as diagnosis codes and medication orders.
Patient cohort identification.
Our starting pool of patients consisted of all patients who visited the outpatient clinic at NewYork-Presbyterian at least three times between 2006 and 2012. A threshold of at least three clinic visits was chosen to ensure that enough information would be present in the patient records, while not pruning too many patients (the original number of patients with at least one visit to the clinic was 23,751). Patients with a transplant or a diagnosis of HIV-AIDS were filtered out, as their kidney function is well known to decline in a predictable manner. Patients who had any acute kidney failure (i.e., a sudden severe drop of eGFR over a few days) were also filtered out of the cohort. Table 1 shows the successive steps toward the final dataset.
Table 1:
Patient cohort | All n | Males n (%) | Females n (%)
---|---|---|---
Patients with 3+ visits in outpatient clinic | 14,141 | 4,952 (35%) | 9,189 (65%)
AND eGFR<60 at least once in the longitudinal record, excluding patients with an ICD9 code for transplant or HIV-AIDS | 3,745 | 1,234 (33%) | 2,511 (67%)
AND eGFR<60 for at least 3 months | 1,038 | 431 (41.1%) | 617 (58.9%)
AND no acute renal failure | 1,008 | 420 (41.7%) | 588 (58.3%)
Because the definition of stage 3 CKD is based on variables routinely collected in the EHR for all patients, the most accurate way to select patients is based on their eGFR values, as opposed to the presence of a CKD diagnosis in the record [40]. Thus, we relied on eGFR calculations (namely the MDRD calculation [41]) to select patients. The final patient cohort contains 1,008 patients. The average age at the time of stage 3 diagnosis was 62.2 years (SD = 12.4), and the dataset was 41.7% male.
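The persistence criterion "eGFR < 60 for at least 3 months" can be operationalized in several ways; the sketch below assumes one plausible reading, in which a run of consecutive sub-60 measurements must span at least 90 days:

```python
from datetime import date

def has_persistent_low_egfr(measurements, threshold=60.0, min_days=90):
    """
    measurements: list of (date, eGFR) tuples.
    Returns True if some run of consecutive sub-threshold values spans
    at least `min_days` (an assumed reading of the persistence criterion:
    any value at or above the threshold resets the run).
    """
    measurements = sorted(measurements)
    run_start = None
    for day, egfr in measurements:
        if egfr < threshold:
            if run_start is None:
                run_start = day
            if (day - run_start).days >= min_days:
                return True
        else:
            run_start = None
    return False

# Two sub-60 readings roughly four months apart qualify:
patient = [(date(2008, 1, 1), 55.0), (date(2008, 5, 1), 52.0)]
print(has_persistent_low_egfr(patient))  # True
```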
Variables.
Given the patient cohort as specified above, we collected their longitudinal record (including visits and tests outside of the selection window of 2006–2012). The longitudinal record contains much information – time series of laboratory tests, ICD9 diagnosis codes, medication orders, and patient notes authored by clinicians. In this paper, we focused on three types of variables: demographic variables, laboratory variables, and healthcare variables.
Demographics.
We selected gender and age at stage 3 diagnosis as the demographic variables. While race is a potentially useful variable to consider, we decided not to include it in our analysis because it is not well documented in the EHR in general. In our outpatient clinic specifically, race was recorded for only 631 of the 14,141 patients (the label "Other" is used for 95.5% of patients) (see Table 2). Thus, in our calculation of eGFR, we assume all patients are white.
Table 2:
Covariate | All | Males | Females
---|---|---|---
 | % | % | %
Gender | 100% | 41.7% | 58.3%
Asian | 0.1% | 0.2% | 0.0%
 | Range | Range | Range
Age | 16.7–94.7 | 16.8–94.7 | 16.7–90.2
Record span (years) | 0.01–20.73 | 0.01–20.64 | 0.05–20.73
Number of visits (type 1) | 0–277 | 0–277 | 0–76
Number of visits (type 2) | 0–24 | 0–24 | 0–18
Number of labs | 0–5,409 | 0–5,409 | 0–1,203
Laboratory Test Values.
Based on a literature review of CKD progression and the input of the nephrologist on our team, we selected the following 17 laboratory tests to investigate in our models: 25-OH vitamin D, bicarbonate, blood urea nitrogen, calcium, ionized calcium, chloride, high-sensitivity C-reactive protein, hematocrit, hemoglobin, potassium, magnesium, microalbumin creatinine ratio, sodium, phosphorus, parathyroid hormone, triglycerides, and uric acid. These laboratory tests are discussed further in Section 5. Because laboratory test names are not standardized in the EHR [42] and different tests capture similar biological constructs with different instruments, the laboratory tests stored in the EHR are often too granular to be used directly in data mining and statistical analysis [43]. For instance, measurements for the bicarbonate laboratory test can be stored in the EHR under 17 different labels. We hired medical students to identify and map the different low-granularity data elements as stored in the EHR to each laboratory test, resulting in one time series per variable as defined above.
Laboratory Test Indicators.
The laboratory test measurements needed to be synthesized so that subjects with zero to many laboratory measurements could be included in the analyses. To accommodate patients with multiple laboratory measurements, we calculated the average of each test in the year prior to stage 3 diagnosis. To include patients who had no measures of a certain lab (who would typically be removed from an analysis), we used 17 indicator variables, denoting whether the patient had at least one measure for that laboratory test in the year prior to stage 3 diagnosis. The inclusion of these patients is important, because there is information in a missing test [44], as a test is ordered based on a variety of reasons, including how sick the patient is, the clinician’s rationale and decision making process, and even how expensive the test is. The continuous, average laboratory test measurement variable was then log transformed, standardized, and crossed with the indicator variables, so that all patients could be included in the model, regardless of how many measurements were taken for that lab. Therefore, the indicator variables are meant to serve as proxies for the impact of the measurement itself on the hazard of progression to stage 4, and the actual laboratory variables can then reflect the impact on the hazard of progression to stage 4 for the patients who had the measurement taken.
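A minimal sketch of the indicator/crossed coding described above (function and variable names hypothetical): the indicator records whether the lab was measured in the prior year, and the crossed variable carries the standardized log mean for measured patients and 0 otherwise, so that every patient can enter the model:

```python
import math

def lab_design_columns(patient_means):
    """
    patient_means: one entry per patient, either the mean lab value in the
    year prior to stage 3 diagnosis, or None if the lab was never measured.
    Returns (indicator, crossed): indicator is 1 when the lab was measured;
    crossed is the standardized log mean for measured patients, 0 otherwise.
    """
    logs = [math.log(v) for v in patient_means if v is not None]
    mu = sum(logs) / len(logs)
    sd = math.sqrt(sum((x - mu) ** 2 for x in logs) / (len(logs) - 1))
    indicator, crossed = [], []
    for v in patient_means:
        if v is None:
            indicator.append(0)
            crossed.append(0.0)  # unmeasured patients stay in the model
        else:
            indicator.append(1)
            crossed.append((math.log(v) - mu) / sd)
    return indicator, crossed

ind, x = lab_design_columns([1.2, None, 0.9, 1.5])
print(ind)  # [1, 0, 1, 1]
```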
Healthcare Variables.
We experimented with ways to define a visit. A visit of type 1 was defined as any date on which there is at least one laboratory measurement in the longitudinal record; the number of type 1 visits thus corresponds to the number of unique dates on which tests were taken. A visit of type 2 was defined as any consecutive run of measurements within three days of each other; the number of type 2 visits thus corresponds to the number of such clusters of measurement dates. Under the second definition, a four-day admission to the hospital would be counted as one visit, while it would be counted as four visits under the first definition.
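The two visit definitions can be sketched as follows; the exact handling of the three-day boundary is an assumption, as the text leaves it ambiguous:

```python
from datetime import date

def visits_type1(measurement_dates):
    """Type 1: each unique date with at least one measurement is a visit."""
    return len(set(measurement_dates))

def visits_type2(measurement_dates, gap_days=3):
    """
    Type 2: consecutive measurement dates within `gap_days` of each other
    are collapsed into a single visit (assumed reading of the definition).
    """
    days = sorted(set(measurement_dates))
    if not days:
        return 0
    visits = 1
    for prev, cur in zip(days, days[1:]):
        if (cur - prev).days > gap_days:
            visits += 1  # gap too large: a new visit starts
    return visits

admission = [date(2010, 3, d) for d in (1, 2, 3, 4)]  # a 4-day hospital stay
print(visits_type1(admission), visits_type2(admission))  # 4 1
```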
4. Survival Analysis of CKD
Survival analysis, time-to-event data analysis, lifetime data analysis, and reliability theory [45–55] all generically refer to a class of statistical models used to describe how the distribution of time to an event of interest changes as a function of certain factors in a given population. Survival analysis owes its name to the probability of an event occurring after a certain time point t, known as the "survival function", S(t) := P(T > t). An alternative to the survival function is the hazard rate, formally h(t) := f(t)/S(t), where f(t) is the density associated with the distribution of the event time T. Often, the hazard rate h(t) is interpreted as the probability of the event happening in a short time period immediately following t, given that the event has not happened at or before time t. Although more difficult to estimate, hazard functions help visualize finer details in risk patterns than the survival function or the cumulative hazard function, $H(t) := \int_0^t h(s)\,ds = -\log S(t)$.
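As a quick numerical illustration of these relationships, the exponential event-time distribution has a constant hazard equal to its rate parameter, which the sketch below verifies from the definitions h(t) = f(t)/S(t) and H(t) = −log S(t):

```python
import math

def survival(t, rate):
    """S(t) = P(T > t) for an exponential event time with the given rate."""
    return math.exp(-rate * t)

def density(t, rate):
    """f(t) for the exponential distribution."""
    return rate * math.exp(-rate * t)

def hazard(t, rate):
    """h(t) = f(t) / S(t); constant for the exponential distribution."""
    return density(t, rate) / survival(t, rate)

def cumulative_hazard(t, rate):
    """H(t) = -log S(t)."""
    return -math.log(survival(t, rate))

rate = 0.25
print(hazard(0.5, rate), hazard(5.0, rate))  # both equal 0.25
```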
The class of survival analysis models comprises a wide range of approaches. In particular, numerous methods have been studied for hazard rate estimation, from classic parametric methods, to semiparametric methods such as wavelet methods [56] and smoothing methods [57–59], to nonparametric hazard estimation. In this latter group, for example, Beta process prior models [60, 61] and correlated process priors such as those used by [62, 63] have been among the most extensively studied. Many other nonparametric methods, including Bayesian ones, are reviewed in [64] and [65].
Despite the variety of proposed approaches for hazard estimation, challenges in computing and the associated statistical inferential procedures remain an important consideration even today. Fast approaches that balance ease of implementation, inference, and flexibility, especially over longer time scales, are appealing. The Bayesian multi-resolution hazard (MRH) estimator [66–70] offers several advantages in that regard, including a-priori self-consistency at multiple time scales [68, 71], the ability to accommodate a variety of a-priori shapes and smoothness levels of the hazard function, fast implementation, and simple inference procedures. The latter include the ability to provide uncertainty estimates of the hazard function itself via point-wise or curve-wise credible bands, at multiple time scales simultaneously, adjusted for covariate effects.
4.1. Multi-Resolution Hazard Estimator
Bayesian multiresolution models have recently been adapted to hazard functions, and equipped to deal with censoring and truncation [67–69]. Following [66], these models assume a binary partition tree structure for the time-axis partition. A Gamma prior is placed on the total cumulative hazard over a finite time interval, and a set of Beta priors is used to describe the recursively defined partition parameters (also referred to as the "split" parameters). While essentially providing a piece-wise constant prior on the hazard rate, the tree structure allows several interesting properties that are not common to other piece-wise hazard models, including the ability to control the smoothness of the resulting hazard rate via the hyperparameters of the Gamma and Beta priors. The MRH model was used to model breast cancer recurrence risk across multiple clinical centers in [68], prostate cancer in [71], and multivariate hazards for different subpopulations in [69].
In MRH, we define the "final time resolution" comprising J equal-length time intervals $(t_0, t_1], (t_1, t_2], \ldots, (t_{J-1}, t_J]$, covering the time span of the study $(t_0, t_J)$. The final time, $t_J$, is set at, or slightly before, the last observed failure in the sample. The piece-wise approximation to the hazard rate is done via a set of hazard increments $d = (d_1, \ldots, d_J)$, where each $d_j$ represents the aggregated hazard over the interval $(t_{j-1}, t_j]$, i.e. $d_j = \int_{t_{j-1}}^{t_j} h(s)\,ds$. To facilitate the recursive binary partition, we let $J = 2^M$, where M is a positive integer describing the "height" of the partition tree. M can be selected via model selection criteria as in [68], or pre-set using clinical input, as in [70, 71]. From a statistical point of view, M is usually chosen so that each time interval has multiple event occurrences within it.
For simplicity, let H denote the total cumulative hazard H(tJ) over the study interval (t0, tJ), and let HM,p ≡ dp+1 for p = 0, 1, …, 2^M − 1. For each m, m = 1, 2, …, M − 1, we recursively define the “level-m” hazard increments as Hm,p ≡ Hm+1,2p + Hm+1,2p+1, where p = 0, 1, 2, …, 2^m − 1. The corresponding split variables Rm,p are then defined as Rm,p ≡ Hm,2p/Hm−1,p (with H0,0 ≡ H). From here, the MRH tree can be specified by H and the set of splits {Rm,p : m = 1, …, M; p = 0, 1, …, 2^(m−1) − 1}, and any dj can be represented as a product of H and a certain branch of split variables. For example, when M = 3, we have d1 = HR1,0R2,0R3,0, d2 = HR1,0R2,0(1 − R3,0), …, d8 = H(1 − R1,0)(1 − R2,1)(1 − R3,3).
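To make the tree construction concrete, the following sketch (illustrative code, not the authors' implementation) reconstructs the J = 2^M hazard increments from H and the split parameters for M = 3; interval j follows the branch of splits given by the binary expansion of j − 1:

```python
# Sketch: rebuild the hazard increments d_1..d_J from the total cumulative
# hazard H and the split parameters R_{m,p}, for a tree of height M.
def mrh_increments(H, R, M):
    """R maps (m, p) -> R_{m,p}; returns [d_1, ..., d_{2^M}]."""
    d = []
    for j in range(2 ** M):                 # j indexes d_{j+1}
        value, p = H, 0
        for m in range(1, M + 1):
            bit = (j >> (M - m)) & 1        # 0: left child, 1: right child
            value *= (1 - R[(m, p)]) if bit else R[(m, p)]
            p = 2 * p + bit                 # descend the binary partition tree
        d.append(value)
    return d

# Even splits R_{m,p} = 0.5 spread H uniformly over the 2^3 = 8 intervals.
R = {(m, p): 0.5 for m in range(1, 4) for p in range(2 ** (m - 1))}
d = mrh_increments(H=2.0, R=R, M=3)         # each d_j equals 2.0 / 8 = 0.25
```

By construction, the increments always sum to H, whatever the split values.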
We adopt Beta priors for the split parameters Rm,p, and a Gamma prior on H, as in [68]. For example, in an M = 3 (J = 8) model, the priors for H and the Rm,p would be:
H ~ Gamma(a, λ), (1)
R1,0 ~ Beta(2akγ1,0, 2ak(1 − γ1,0)), (2)
R2,p ~ Beta(2ak^2 γ2,p, 2ak^2 (1 − γ2,p)), p = 0, 1, (3)
R3,p ~ Beta(2ak^3 γ3,p, 2ak^3 (1 − γ3,p)), p = 0, 1, 2, 3. (4)
With this parametrization, the prior expectation of each hazard increment is determined by the shape parameters of the Beta priors, and the hyperparameters of H, as each increment dj is a product of independent variables; in the M = 3 example, E(d1) = E(H)E(R1,0)E(R2,0)E(R3,0). The tree hyperparameters are thus directly linked to the distribution of the hazard increment, specifying a-priori expected values of each hazard increment as well as the correlation between the hazard increments [67, 71], thus controlling the smoothness in the multiresolution prior.
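As a small numerical check of this property (with hypothetical hyperparameter values), note that H ~ Gamma(a, λ) gives E(H) = a/λ, and a symmetric Beta split prior gives E(Rm,p) = 1/2, so each of the J = 8 increments is expected a priori to carry E(H)/8:

```python
# Hypothetical hyperparameters, for illustration only.
a, lam = 4.0, 2.0
E_H = a / lam               # prior mean of the total cumulative hazard H
E_R = 0.5                   # prior mean of each symmetric split R_{m,p}
E_d1 = E_H * E_R ** 3       # E(d1) = E(H) E(R1,0) E(R2,0) E(R3,0)
```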
4.2. Estimation
The multiresolution prior for the hazard rate can be used as a prior for the baseline hazard function in any model for time-to-event data. One of the most commonly used models in survival analysis is the Cox proportional hazard model [45], where the hazard rate for patient i is modeled as follows:
hi(t | Xi) = h0(t) exp(Xi′β). (5)
Here, the hazard rate for a specific observation i is obtained via multiplication of the baseline hazard rate h0 (the hazard in the baseline group of patients) by exp(Xi′β). The parameter vector β is known as the vector of log hazard ratios. The log hazard ratios are the effects of the covariates in Xi, a vector of p explanatory variables Xi,1, Xi,2, …, Xi,p for observation i. The hazards in different patient groups are thus assumed to be proportional to each other over the entire study period, giving this model its name.
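The proportional hazards relation is easy to sketch directly (with an illustrative baseline and coefficients, not fitted values):

```python
import math

# The hazard for patient i is the baseline hazard times exp(X_i' beta).
def hazard(t, x, beta, baseline):
    return baseline(t) * math.exp(sum(b * xi for b, xi in zip(beta, x)))

baseline = lambda t: 0.1            # flat baseline hazard, for illustration
beta = [0.5, -0.2]                  # log hazard ratios for two covariates
h = hazard(3.0, [1.0, 2.0], beta, baseline)
```

Note that the ratio of two patients' hazards does not depend on t, which is exactly the proportional hazards assumption.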
Unlike the Cox proportional hazards model [45], which aims to estimate the effects of covariates while treating the hazard function as a nuisance parameter, the MRH model estimates the hazard and all parameters jointly, allowing for joint inference based on the joint posterior distribution estimated using MCMC. The likelihood function of the MRH proportional hazards model based on right-censored data from N patients is:

L(β, H, {Rm,p} | data) = ∏_{i=1}^N [h0(Ti) exp(Xi′β)]^δi exp{−H0(Ti) exp(Xi′β)},

where H0 denotes the cumulative baseline hazard implied by the MRH tree. Here, δi = 0 if Ti is right censored (i.e. if that patient did not have an observed event within the study period), and δi = 1 otherwise.
The Bayesian MRH model is then completed by specifying prior distributions for β, as well as hyperpriors on the MRH tree parameters (a, λ, k, and the set of γm,p parameters). In what follows, a flat prior shape on the baseline hazard function is assumed, by setting k = 0.5 and γm,p = 0.5 for all m, p, making the prior on each split parameter a symmetric beta density, Rm,p ~ Beta(a/2^m, a/2^m), evenly splitting the hazard among the intervals a priori. The Gibbs sampler [72] steps for the remaining variables (H, {Rm,p}, a, λ, and β) are the same as those in [68]. We briefly outline each full conditional distribution below, making up the core of the Gibbs sampler. In what follows, the superscript ‘−’ is used to denote the vector of all other parameters and data except for that parameter:
- If a is given a vague zero-truncated Poisson prior with rate parameter μa, the full conditional distribution for a is:

π(a | a−) ∝ (μa λ H)^a / (a! Γ(a)) ∏_{m=1}^M ∏_{p=0}^{2^(m−1)−1} [Rm,p(1 − Rm,p)]^(a/2^m) / B(a/2^m, a/2^m),

where B(a/2^m, a/2^m) is the Beta function.
- If the scale parameter λ from the Gamma prior for the cumulative hazard function H is given an exponential prior with mean μλ, the resulting full conditional is Gamma:

λ | λ− ~ Gamma(a + 1, H + 1/μλ).
- A normal prior on each log hazard ratio, βj ~ N(μβj, σβj^2), leads to the following full conditional distribution:

π(βj | βj−) ∝ exp{−(βj − μβj)^2/(2σβj^2)} ∏_{i=1}^N exp(Xi′β)^δi exp{−H F(Ti) exp(Xi′β)},

where F(Ti) = H(min(Ti, tJ))/H(tJ). The full conditional for H, π(H | H−), is a Gamma density with shape a + ∑_i δi and rate λ + ∑_i F(Ti) exp(Xi′β), so that its mean is (a + ∑_i δi)/(λ + ∑_i F(Ti) exp(Xi′β)) and its variance is (a + ∑_i δi)/(λ + ∑_i F(Ti) exp(Xi′β))^2.
- The full conditional for each Rm,p is proportional to its (symmetric) Beta prior multiplied by the likelihood contributions of the hazard increments in which Rm,p appears:

π(Rm,p | Rm,p−) ∝ Rm,p^(a/2^m − 1) (1 − Rm,p)^(a/2^m − 1) ∏_{i=1}^N h(Ti | Xi)^δi exp{−H F(Ti) exp(Xi′β)}.
The sampling from the full conditionals is done either directly or via Metropolis–Hastings-type algorithms. All estimation was performed using a custom-made MRH library in R [73]. For each model we consider, statistical inference was based on 5 chains, each with 200,000 iterations; the first 50,000 iterations were discarded as burn-in, and the remaining iterations were thinned to remove autocorrelation. Convergence was determined via trace plots, autocorrelation plots, and the Geweke test statistic [74]. Parameter estimates were calculated using the median of the posterior distribution of each parameter, generated by the Gibbs sampler chains. Uncertainty was communicated via 95% central credible intervals, whose bounds were calculated as the 2.5th and 97.5th percentiles of the simulated posterior distribution.
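A minimal sketch of two of the directly sampled Gibbs steps, using conjugate Gamma updates for λ and H; the shapes and rates below are one plausible form implied by the priors described above and are assumptions of this sketch, not a transcription of the authors' sampler:

```python
import random

random.seed(7)

def gibbs_step(a, lam, H, delta, F, risk, mu_lam):
    """One update of (lam, H), given event indicators delta, cumulative-hazard
    fractions F(T_i), and patient risk scores exp(X_i' beta)."""
    # lam | rest ~ Gamma(shape a + 1, rate H + 1/mu_lam)
    lam = random.gammavariate(a + 1, 1.0 / (H + 1.0 / mu_lam))
    # H | rest ~ Gamma(shape a + sum(delta), rate lam + sum(F_i * risk_i))
    rate = lam + sum(f * r for f, r in zip(F, risk))
    H = random.gammavariate(a + sum(delta), 1.0 / rate)
    return lam, H

delta = [1, 0, 1, 0, 0]                  # toy event indicators
F = [0.2, 1.0, 0.5, 1.0, 0.7]            # toy F(T_i) values
risk = [1.1, 0.9, 1.5, 1.0, 0.8]         # toy exp(X_i' beta) values
lam, H = 1.0, 1.0
for _ in range(100):
    lam, H = gibbs_step(2.0, lam, H, delta, F, risk, mu_lam=1.0)
```

(`random.gammavariate` takes a scale parameter, hence the reciprocals of the rates.)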
5. Results
5.1. Experimental Setup
The goal of the analysis is to identify variables associated with progression of CKD, as derived from our patient cohort. Because of the nature of the EHR dataset, we experimented with different settings for the survival analysis and survival models. Table 3 shows descriptive statistics of the different settings.
Table 3:

| | All | 3m | 6m |
|---|---|---|---|
| Total patients | 100% | 63.4% | 60.4% |
| Males | 41.7% | 38.7% | 38.6% |
| # stage 4 dx | 11.9% | 8.9% | 9.4% |
| Covariate | Range | Range | Range |
| Age | (16.7–94.7) | (16.7–94.7) | (16.7–94.7) |
| Observation time (yrs), UB | (0–20.7) | (0–20.4) | (0–20.0) |
| # of visits (1) | (0–277) | (0–277) | (0–277) |
| # of labs | (0–5409) | (0–5409) | (0–5409) |
Effect of left censoring.
To investigate the effect of left censoring on the survival analysis, we experiment with three sub-cohorts from our dataset: all patients (All, n=1008), patients with at least one laboratory measurement at least 3 months prior to CKD diagnosis (3m group, n=639), and patients with at least one laboratory measurement at least 6 months prior to CKD diagnosis (6m group, n=609).
Effect of right censoring.
Out of the 1008 patients, 11.9% (n=120) had an observed transition from stage 3 to stage 4 CKD, 4.8% (n=49) had an observed death, and the remaining patients (n=839) were right censored. To investigate the effect of right censoring on the survival analysis, we experiment with two alternative observation times for the right-censored patients: an upper bound (UB) observation time with a censoring date as the end of the data collection time for the entire dataset, and a lower bound (LB) observation time with censoring date defined as the time of last laboratory measurement for a given patient.
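The two definitions amount to two different censoring dates per right-censored patient; a sketch with hypothetical dates:

```python
from datetime import date

# Hypothetical end of data collection for the whole dataset.
DATASET_END = date(2010, 12, 31)

def observation_times(stage3_dx, last_lab):
    """Years of observation under the UB and LB censoring definitions."""
    ub = (DATASET_END - stage3_dx).days / 365.25   # UB: dataset end date
    lb = (last_lab - stage3_dx).days / 365.25      # LB: patient's last lab
    return ub, lb

ub, lb = observation_times(date(2004, 6, 1), date(2008, 3, 15))
```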
Effect of visit definition.
We experimented with the two visit definitions, as described in section 3.2. To dampen the effects of outliers, the two number of visits variables were both capped at their respective 95th percentiles (equal to 41 and 10), with observations in the upper 5% set to the values of the caps.
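The capping step is a one-sided winsorization; a sketch using the nearest-rank percentile (the exact percentile convention used in the paper is not specified, so that part is an assumption):

```python
import math

def cap_at_percentile(values, q=0.95):
    """Cap values at the q-th percentile (nearest-rank method)."""
    ranked = sorted(values)
    idx = max(0, math.ceil(q * len(ranked)) - 1)
    cap = ranked[idx]
    return [min(v, cap) for v in values], cap

visits = list(range(19)) + [277]         # one extreme outlier
capped, cap = cap_at_percentile(visits)  # the outlier is pulled down to 18
```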
Basic healthcare vs. clinical variables.
To test the sensitivity of the modeling to the healthcare variables, we selected two types of variables. The basic healthcare variables were age (which was standardized), gender, the number of visits to the doctor in the year before stage 3 diagnosis, and the number of laboratory tests recorded in the year before stage 3 diagnosis. The number of laboratory tests recorded was discretized into 5 categories: none, 1–50, 51–100, 101–200, and more than 200 laboratory tests in the year prior to stage 3 diagnosis. Race was not included in the model as 97.4% of the 1,008 patients were in the “other” category (Table 2).
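The lab-count discretization maps a raw count to one of the five categories named above:

```python
# Discretize the number of laboratory tests in the year prior to stage 3
# diagnosis into the five categories used in the basic healthcare model.
def lab_count_category(n):
    if n == 0:
        return "none"
    if n <= 50:
        return "1-50"
    if n <= 100:
        return "51-100"
    if n <= 200:
        return "101-200"
    return ">200"

cats = [lab_count_category(n) for n in (0, 50, 51, 200, 201)]
```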
We examined clinical variables by adding laboratory values and laboratory indicators to the basic healthcare variables. The average of the laboratory tests within the year prior to stage 3 diagnosis was calculated, and used as a baseline stage 3 diagnosis variable.
MRH vs. traditional survival analysis.
While MRH models are powerful tools for joint multiscale estimation of the hazard rate and covariate effects, we also assessed whether simpler models represent our data accurately. To that end, we compared the MRH results to the results from standard Cox proportional hazards, Weibull, and exponential survival models.
To identify the optimal set of clinical variables to include in the MRH model, we selected the best variable subset (among all healthcare and clinical variables) through a step-wise regression in the Cox proportional hazards model. In all of the MRH models, the study period was capped at 16 years, with the time axis divided into 16 bins where each bin represents one year.
Experiments.
Given the different population subsets, variables, and models, we conducted the following sets of experiments. Overall, six basic healthcare models were explored, where each of the six models was fitted using both the upper and lower bound definitions of the observation time, for a total of 12 models:
Model A1: All patients, with the number of visits defined as type 1
Model A2: All patients, with the number of visits defined as type 2
Model B1: 3m patients, with the number of visits defined as type 1
Model B2: 3m patients, with the number of visits defined as type 2
Model C1: 6m patients, with the number of visits defined as type 1
Model C2: 6m patients, with the number of visits defined as type 2.
5.2. Basic Healthcare Model
Differences between the two definitions of the observation time were minimal, based on the summaries in Table 3 and the MRH model results in Table 4. Summaries of the difference between the lower and upper bound definitions show an average difference of less than 5 months for all groups, with a maximum difference of 3.5 years. In addition, the MRH model results show very minor differences in the parameter estimates of all models, particularly in the comparison of the 95% credible intervals. It does not appear that the different definitions of the observation time have a large impact on the model estimates. For this reason, moving forward our discussion of the results will only focus on the models using the upper bound definition of the observation time.
Table 4:

| | Model | Gender (male) | Age | # visits | # labs |
|---|---|---|---|---|---|
| UB | A1 | 0.52* (0.15, 0.89) | 0.06 (−0.13, 0.24) | −0.12 (−0.04, 0.02) | 0.26* (0.02, 0.51) |
| | A2 | 0.47* (0.11, 0.83) | 0.05 (−0.11, 0.25) | −0.15* (−0.25, −0.07) | 0.36* (0.18, 0.54) |
| LB | A1 | 0.70* (0.26, 1.16) | 0.09 (−0.13, 0.33) | 0.001 (−0.04, 0.03) | 0.22 (−0.09, 0.54) |
| | A2 | 0.68* (0.25, 1.12) | 0.09 (−0.14, 0.32) | −0.04 (−0.15, 0.06) | 0.28* (0.05, 0.53) |

Parameter estimates, with 95% credible intervals in parentheses. * denotes that there is 95% or higher probability that the effect is different from 0. “UB” denotes the upper bound calculation of the observation period, and “LB” the lower bound calculation, as discussed in section 3.2.
Figure 1 shows the smoothed estimated hazard rate for the three different patient subgroups, using the second definition of the number of visits (models A2, B2, and C2). As expected, the hazard rate for all patients is higher throughout the course of the study, and especially so within the first few years since stage 3 diagnosis (as recorded in the EHR), when compared to the 3m and 6m groups (patients who had at least one laboratory test measure at least 3 (6) months prior to stage 3 CKD diagnosis). However, there appears to be almost no difference in the hazard rate between the 3m and 6m subpopulations of patients. For all groups, we observe that the hazard rate is highest in the first year, then flat between years 5 and 10 post stage 3 diagnosis, followed by a slight increase after ten years.
Figure 2 provides a summary of the results in terms of the probability of developing stage 4 CKD within one year of stage 3 CKD diagnosis. Among all patients this probability is centered around 3.0%, approximately twice that for patients with at least one laboratory test measure recorded in the EHR at least 3 (6) months before the diagnosis date. The probability in these two groups, 3m and 6m, is centered around 1.5%, confirming once again that there is little practical difference between 3m and 6m patients. This figure also shows the uncertainty about these probability estimates, indicating that there is almost complete overlap in information about the probability of failure within the first year from stage 3 diagnosis for 3m and 6m patients, while all patients are clearly at an increased risk in spite of some distributional overlap with the other two groups.
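Under the proportional hazards model, such probabilities follow from the estimated hazard increments via P(event by t) = 1 − exp(−H0(t) exp(x′β)). A sketch with hypothetical yearly increments (chosen here so the one-year probability lands near 3%; these are not the fitted values):

```python
import math

def prob_event_by(t_years, d, log_hr):
    """Probability of progressing to stage 4 within t_years, from yearly
    baseline hazard increments d and a patient's log hazard ratio x' beta."""
    H0_t = sum(d[:t_years])                 # cumulative baseline hazard at t
    return 1.0 - math.exp(-H0_t * math.exp(log_hr))

d = [0.030, 0.012, 0.010, 0.010, 0.010]     # hypothetical yearly increments
p1 = prob_event_by(1, d, log_hr=0.0)        # baseline patient, first year
```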
With respect to the effects of age on the hazard rate, among all patients the older ones are found to have a higher risk of stage 4 CKD. However, the opposite seems true when 3m or 6m patients are analyzed. For example, Figure 3 shows that in the model with all patients, the probability of a 40 year old developing stage 4 CKD within 5 years of stage 3 diagnosis is centered around 5.3%, while this probability is centered around 6.3% for an 80 year old. However, for 3m patients, the probability of a 40 year old developing stage 4 CKD within 5 years is centered around 4.8%, and around 2.6% for an 80 year old. In general, censoring rates can have a dramatic effect on estimates of covariate effects and survival times [21, 75]. In this particular CKD case, there are two censoring issues: possible informative left-censoring for patients who were in the EHR for less than 3 months, and right censoring of either older patients or patients who were in the EHR a long time and were censored due to other diseases or death. Thus, one possible explanation for the estimated probabilities for older patients in the “All” group might be more rapid stage 4 onset in that group, so that the effects of age are less impacted by censoring than they are in the 3m patient group.
In terms of the effect of the number of visits, there were some notable differences between all patients and those patients who were in the EHR for at least 3 months. For all patients, each additional visit to the doctor in the year prior to stage 3 diagnosis decreased the risk of stage 4 CKD by 11.3% (definition 1 for the number of visits) or 13.9% (definition 2). However, the impact of the number of visits was more moderate for the patients who were in the EHR system for more than 3 months: for them, each additional visit to the doctor in the year prior to stage 3 CKD was associated with an increased risk of stage 4 CKD of 2% (definition 1), or a decreased risk of 7% (definition 2). This pattern can be observed in Figure 4, which shows that the probability of developing stage 4 CKD within 5 years of stage 3 CKD diagnosis is highest for all subjects who had zero visits to the doctor in the year prior to diagnosis (using definition 1). However, this probability is lowest for the subjects who were in the system at least 3 months prior to diagnosis and who did not visit the doctor (using definition 1). It is important to note that the 95% credible intervals of all these parameter effects included the null effect. All non-exponentiated covariate effect (log hazard ratio) estimates, and the corresponding credible intervals, are shown in Table 4.
In terms of the effect of the number of lab tests, an increase in the number of tests (which can be a proxy for how sick an individual is) was associated with an increase in the hazard of stage 4 CKD. This, combined with the effects discussed in the paragraph above, indicates that while a large number of laboratory tests is associated with an increased hazard of stage 4, frequent visits to the doctor can decrease the hazard of stage 4 CKD.
Finally, in terms of gender effects, male gender was estimated to be positively associated with the hazard of stage 4 diagnosis, across all models, although the only 95% credible intervals which did not include the null effect were obtained in models with all patients.
5.3. Clinical Model
Results from the basic healthcare models allowed us to conclude that the 3m subgroup is the most appropriate set of patients to use in studying the hazard of stage 4 CKD, as it seems that the data set contains a separate subgroup of patients who do not have a true stage 3 diagnosis date recorded. Additionally, as laboratory tests are often performed over the course of several successive days, it seems reasonable to adopt the second definition of the number of visits. Starting from there, we extended this basic model (B2) to include the 17 laboratory test variables defined in section 3, as well as the 17 indicator variables defined in section 3. The laboratory test variables considered were: 25-OH vitamin D, bicarbonate, blood urea nitrogen, calcium, ionized calcium, chloride, high-sensitivity C-reactive protein, hematocrit, hemoglobin, potassium, magnesium, microalbumin creatinine ratio, sodium, phosphorus, parathyroid hormone, triglycerides, and uric acid.
To determine which of the 34 total laboratory variables (17 laboratory test variables and 17 laboratory test indicators) were most important to stage 4 progression, we performed a backward-forward Cox proportional hazards regression, adjusted for mandatory inclusion of non-laboratory variables already in the B2 model. The subset of variables selected this way (blood urea nitrogen, calcium, hematocrit, magnesium, microalbumin creatinine ratio, and sodium as continuous laboratory test covariates, and chloride, and C-reactive protein as indicators), along with gender, age, number of visits (type 2) in the year before stage 3 diagnosis, and the number of laboratory tests in the year prior to stage 3 diagnosis, were all included as predictors in the MRH model, which we refer to as “Optimal model”. We also consider a second model, which includes all laboratory indicators in addition to the selected laboratory variables and basic healthcare variables, as main effects. This second model is referred to as the “Full model”.
Figure 5 shows the hazard rate results from the MRH model: the hazard rate for the baseline group (females, 62.2 years old, with no visits or laboratory tests in the year prior to stage 3 CKD diagnosis) was estimated to be much flatter and lower in magnitude than the one shown for the B2 model. It is interesting to contrast these two models, as the optimal model is in fact model B2 with laboratory tests added as covariates. The baseline hazard rates for the two models look fairly different (Figure 5): the hazard rate of model B2 is lower than that of the optimal model, and the hazard rate of the optimal model shows an increase in the hazard rate after ten years which is not as visible in the hazard rate for model B2. Both models, however, suggest that baseline patients are at low risk for progression to stage 4. The hazard rate is again estimated to be higher for males, but the estimate of the effect is much smaller in this model when compared to the basic B2 model (Table 5). This is likely due to the fact that gender actually carries some of the same metabolic information as the laboratory test measurements, and that the laboratory test values themselves account for some of the differences between males and females. As with the basic healthcare model, the number of visits is estimated to have a negative effect on the risk of developing stage 4 CKD, when adjusted for laboratory variables. However, unlike in B2, in this model the number of laboratory tests is also estimated to decrease this risk, when adjusted for laboratory variables.
Table 5:

| | Optimal model | | | Full model | | |
|---|---|---|---|---|---|---|
| Covariate | Estimate | 95% CI | exp(Est) | Estimate | 95% CI | exp(Est) |
| Male | 0.132 | (−0.464, 0.719) | 1.141 | 0.097 | (−0.506, 0.714) | 1.102 |
| Age | −0.265 | (−0.512, −0.013)* | 0.767 | −0.245 | (−0.5, 0.01) | 0.783 |
| Number of visits | −0.018 | (−0.131, 0.096) | 0.982 | −0.02 | (−0.141, 0.094) | 0.980 |
| Number of lab tests | −0.286 | (−0.629, 0.053) | 0.752 | −0.268 | (−0.665, 0.148) | 0.765 |
| Indicators | | | | | | |
| Chloride | 0.207 | (−0.967, 1.658) | 1.230 | 0.486 | (−1.066, 2.116) | 1.623 |
| C-reactive protein | 1.501 | (0.620, 2.33)* | 4.487 | 1.592 | (0.69, 2.429)* | 4.913 |
| Calcium | - | - | - | −0.494 | (−1.486, 0.648) | 0.610 |
| Hematocrit | - | - | - | 0.251 | (−0.850, 1.448) | 1.285 |
| Magnesium | - | - | - | 0 | (−0.816, 0.808) | 1 |
| Microalbumin creatinine ratio | - | - | - | −0.682 | (−2.549, 0.647) | 0.506 |
| Interaction terms | | | | | | |
| Blood urea nitrogen | 1.133 | (0.816, 1.461)* | 3.104 | 1.138 | (0.816, 1.466)* | 3.121 |
| Calcium | −0.523 | (−0.759, −0.260)* | 0.593 | −0.533 | (−0.776, −0.270)* | 0.587 |
| Hematocrit | −0.376 | (−0.627, −0.105)* | 0.689 | −0.358 | (−0.620, −0.071)* | 0.699 |
| Magnesium | −0.624 | (−1.039, −0.222)* | 0.536 | −0.640 | (−1.068, −0.225)* | 0.527 |
| Microalbumin creatinine ratio | 1.217 | (0.586, 1.727)* | 3.378 | 1.553 | (0.688, 2.572)* | 4.726 |
| Sodium | 0.441 | (0.164, 0.736)* | 1.554 | 0.394 | (0.107, 0.695)* | 1.483 |

* denotes that there is 95% or higher probability that the effect is different from 0.
The optimal and full models show very similar estimates of effects for common variables. However, the effect interpretation in both models is slightly unconventional. The lab test value effects summarized under “Interaction terms” in Table 5 should be interpreted as the log hazard ratios of those lab test values conditional on the patient having had that particular lab test ordered. For example, in the full model the estimate for the hematocrit indicator is 0.251, and the estimate for the hematocrit lab interaction term is −0.358. Thus, for patients who had a hematocrit test done in the year prior to stage 3 diagnosis, the estimated hazard rate is 29% higher than for patients who did not have any hematocrit measures ordered during that year, all else being equal. However, among those patients who did have a hematocrit test in the previous year, larger hematocrit values led to decreased risk (an estimated 30% decrease for every standard deviation increase on the log scale) when compared to patients who had lower hematocrit values.
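The percentages quoted above are just the exponentiated estimates from Table 5:

```python
import math

# Hematocrit in the full model: indicator estimate 0.251, interaction -0.358.
had_test_hr = math.exp(0.251)   # hazard ratio: had a hematocrit test vs. not
value_hr = math.exp(-0.358)     # hazard ratio per 1 SD increase (log scale)

pct_increase = (had_test_hr - 1) * 100   # about 29% higher hazard
pct_decrease = (1 - value_hr) * 100      # about 30% lower hazard per SD
```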
Results from both lab models (Table 5) show that the largest impact on the hazard rate comes from increases in blood urea nitrogen and microalbumin creatinine ratio values. For example, patients who are 1 standard deviation above the mean for each lab (on the log scale) have a hazard rate more than 3 times higher than patients who had average measures for these two labs. This finding is not surprising as microalbumin creatinine ratio is well recognized as the first predictor of progression in the literature [76], and is one of the few labs which reflects kidney function directly rather than symptoms and quantities the kidneys help regulate.
Additionally, patients who had at least one high sensitivity C-reactive protein laboratory measurement during the year prior to stage 3 diagnosis have the hazard rate almost 4.5 times higher than patients who did not have that measurement during the year prior to stage 3. This finding too is not unexpected, as C reactive protein is a marker for inflammation and is typically ordered for patients with heart disease or chronic inflammatory conditions such as lupus, often seen as comorbidities of CKD. In essence, CRP could be an important proxy for a the presence of such comorbidities associated with elevated risk of CKD progression [77].
Furthermore, lower levels of hematocrit, magnesium, and calcium are found associated with increased hazard of stage 4 CKD in both lab models. In addition, both models also estimate that higher levels of sodium are associated with increased hazard of stage 4 CKD, which is not unexpected given that high sodium concentration is generally a proxy for poor hydration. However, there was little evidence that the effects of these two variables were different from 0, as indicated by the fact that both of their 95% credible intervals contained 0, for both the optimal and full model.
5.4. Comparison to Traditional Survival Analysis
So far we have discussed two types of MRH models: six basic healthcare models and two clinical models. While MRH is a powerful class of models that can be used to provide joint estimates of the hazard function and covariate effects, along with their uncertainty, a simple parametric model for the hazard function might still provide a reasonable alternative in some datasets. To test whether that is the case for the different types of healthcare models we consider in this paper, we utilize two parametric models, assuming an exponential and a Weibull distribution, respectively. The hazard rate is flat under the exponential distribution assumption for times of progression to stage 4, which resembles the estimated hazard shape from 18 months post stage 3 CKD diagnosis onward (Figure 5). The Weibull model is similar to the exponential model, but allows for more flexibility in the shape of the hazard rate. Parametric models were fitted using the survreg function in R [73]. The models were compared using the Bayesian Information Criterion (BIC [78]), and a custom-designed goodness-of-fit (GOF) measure defined below:
GOF(t) = (1/nt) ∑_{i not censored before t} | Ii{advance to stage 4 CKD > t} − P(subject i advances to stage 4 CKD > t) |,

where | · | stands for absolute value, Ii{advance to stage 4 CKD > t} is an indicator variable which equals 1 if subject i advances to stage 4 CKD after time t post stage 3 diagnosis and equals 0 otherwise, and P(subject i advances to stage 4 CKD > t) is the model-based probability of subject i advancing to stage 4 after time t post stage 3 diagnosis. This probability is calculated based on the estimated MRH parameters and covariate effects for subject i. Patients who were censored before time t were not included in the GOF calculation at time t. Therefore, nt represents the total number of patients in the cohort minus the number of patients censored before time t. This way, the maximum value the GOF statistic can take is 1, and that maximum can only be achieved at the baseline time t = 0. Lower GOF values indicate a better fitting model.
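A sketch of this GOF computation on hypothetical data (in practice the model probabilities would come from the fitted MRH parameters):

```python
def gof(t, event_times, censor_times, surv_probs):
    """Mean absolute difference between the indicator of advancing after t
    and the model probability of doing so, over patients uncensored at t.
    event_times[i] is None if no stage 4 event was observed; censor_times[i]
    is None if the event was observed."""
    total, n_t = 0.0, 0
    for ev, ce, p in zip(event_times, censor_times, surv_probs):
        if ev is None and ce < t:
            continue                       # censored before t: excluded
        indicator = 1.0 if (ev is None or ev > t) else 0.0
        total += abs(indicator - p)
        n_t += 1
    return total / n_t if n_t else float("nan")

score = gof(5.0,
            event_times=[3.0, None, 7.0, None],
            censor_times=[None, 2.0, None, 10.0],
            surv_probs=[0.4, 0.9, 0.8, 0.7])
```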
Results from the model comparison (Table 6) show that the predictive abilities of the Weibull model and the MRH model are very similar across all time points, with the MRH models performing slightly better, although the Weibull model had lower BIC values. The lower BIC values for the Weibull model were not unexpected, given that the MRH model estimated a fairly flat hazard function, and that the flexibility of MRH models requires them to have a larger number of parameters than Weibull. The exponential model performed worse than the other two models, with very high GOF values. While the MRH model performs best in predicting the advance to stage 4 CKD, it seems that a Weibull parametric model is a reasonable alternative for estimation of the hazard rate from stage 3 to stage 4 CKD. The Weibull model is relatively easy to implement and communicate in applied medical research. The exponential model does not perform well in this analysis. In general, we do not recommend using a flat hazard in the analysis of long-duration studies, such as those based on EHR data or longer clinical trials [70, 71], where the hazard function is likely to change over time due to changes in factors such as population structure, health care protocols, and prevalence of comorbidities.
Table 6:

| Model | | GOF (at four increasing time points) | | | | BIC |
|---|---|---|---|---|---|---|
| Clinical Model | MRH | 0.05 | 0.07 | 0.11 | 0.21 | 650.9 |
| Basic Model A1 | MRH | 0.11 | 0.14 | 0.22 | 0.36 | 1243.0 |
| Basic Model A2 | MRH | 0.11 | 0.13 | 0.22 | 0.35 | 1230.0 |
| Basic Model B1 | MRH | 0.07 | 0.09 | 0.17 | 0.31 | 699.1 |
| Basic Model B2 | MRH | 0.07 | 0.09 | 0.17 | 0.32 | 701.6 |
The GOF table also allows us to compare the MRH models to each other. Not surprisingly, the model that performs best is the healthcare model with the laboratory tests. Despite its increased number of parameters, this model has the lowest BIC of all the MRH models, and performs best in predicting time to stage 4 CKD. Because this model uses more patient-level information, it is better able to predict when each person will advance to stage 4 CKD. It is also not unexpected that the model with all patients predicts less accurately than the models using 3m patients, as its GOF and BIC values are higher.
In addition to comparing the hazard rate estimates, we also observed that the covariate estimates among the different MRH models, the Weibull model, and the Cox proportional hazards model were very similar, including the 95% uncertainty bounds.
6. Conclusions
This paper explores how EHR data, collected primarily for clinical care, can be used for survival analysis in the context of CKD progression. Our findings indicate that survival analysis of EHR data is possible when the data are treated properly, and that it can highlight variables, both clinical and healthcare-related, which are linked to disease progression. Because the data come directly from the EHR without any manual curation step, the findings represent directly actionable knowledge and can be fed back into the EHR easily, thus facilitating the translation of research into patient care. Furthermore, the use of healthcare-related variables along with their temporal patterns is specific to EHR data and provides novel insight into disease progression – such variables do not make sense in a prospective, curated study, where measurements are regular and detached from typical healthcare temporal patterns.
EHR data are complex, however, and present a range of characteristics that need attention: patient cohort specificities and their impact on the generalizability of the analysis; data missing not at random; a large number of dimensions for representing patient information, with unknown inter-relations that make the variable selection process complex; and uncontrolled data collection, including for the time of disease onset, so that left censoring impacts modeling. Finally, it is difficult to predict how these EHR-based complexities affect whether a particular model can provide a consistent, stable, accurate, and complete picture of disease progression.
Generalizability.
EHR data are connected to their physical environment, making generalizability an issue. The Columbia University Medical Center AIM clinic serves patients in lower socio-economic strata living in northern Manhattan. A large proportion of the patients are Hispanic, a population for which CKD progression is poorly understood [79]. Our results could thus be specific to ethnicity, demographics and their cultural implications, and the general environment, such as air quality and housing quality. One way to increase generalizability is through cohort selection: we carefully defined our cohort by narrowing the population by comorbidity. For instance, we removed transplant and HIV-AIDS positive subjects, because these patients have kidney failure for known reasons and because their progression, induced by the treatment of a particular disease, differs from that of canonical CKD patients. Similarly, we removed patients with acute kidney disease, whose kidney failure occurs on a considerably different time scale than CKD. It is worth mentioning that CKD is a disease with a particularly direct method for diagnosis, because eGFR, used to diagnose CKD, is easily calculated from routine laboratory tests. Not all diseases are this easily defined (e.g., congestive heart failure, pneumonia).
Data missingness.
When analyzing the longitudinal records of patients, missing data can occur because a patient seeks care in a different institution or because the patient does not seek care at regular times. To account for this, we selected a cohort which is local and comes somewhat regularly to our institution for primary and acute care. Furthermore, the healthcare variables encapsulate the patterns of data missingness: the number of visits made in the year prior to CKD onset, for instance, reflects the measurement patterns as well as, potentially, the patient’s attitude towards healthcare. While the visitation rate does not fully account for the complex relationship between lifelong healthcare patterns and disease severity, the histograms of the visit data, stratified by length of the record, indicate that changes in patient visitation are a proxy for patient acuity. For example, individuals with shorter record spans and higher average numbers of yearly laboratory tests and visits (figure not shown) are more likely to become ill. Furthermore, the laboratory interaction terms allow us to determine how the hazard rate changes based on whether a patient had a laboratory measurement at all.
Another type of missing data comes from the way information is recorded in the EHR. For instance, while race is a potentially important variable for CKD progression, we were not able to rely on it in the analysis, since race is not recorded for most of the patients in our cohort. This is a clear limitation of EHR data. In our current work, we are exploring ways to infer missing information from additional data sources in the EHR, such as the clinical narrative about the patient, where race is mentioned.
Selection of variables that best predict outcomes.
Our variable selection procedure was governed by three processes: (i) biological rationale (to determine the 17 laboratory tests most relevant to CKD progression); (ii) logical reasoning (to create a proper variable denoting the number of visits to the doctor); and (iii) basic statistical tools used to determine the optimal model. This process allowed us to determine the effects of gender, age, and the number of doctor visits and laboratory tests in the year prior to CKD stage 3 onset on the time to stage 4. While forward-backward stepwise regression is a simplistic statistical tool for variable selection, we were able to identify laboratory tests that significantly affect time to stage 4 CKD, with results that match previously published clinical findings in the CKD medical community. In the future, we want to develop and apply statistical tools for variable selection that would allow us to determine optimal models over a wider range of variables, including a larger set of laboratory tests, comorbidities, medications, and other potentially relevant patient features.
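The forward-backward stepwise procedure can be sketched generically as below. The model score here is a toy stand-in (in the actual analysis the score would come from fitting the survival model, e.g., via BIC), and the variable names and signal values are purely illustrative:

```python
def stepwise_select(candidates, score, tol=1e-9):
    """Forward-backward stepwise selection: greedily add, then drop,
    variables while the score (lower is better) keeps improving."""
    model = set()
    improved = True
    while improved:
        improved = False
        # Forward step: try adding each remaining candidate.
        for v in sorted(candidates - model):
            if score(model | {v}) < score(model) - tol:
                model.add(v)
                improved = True
        # Backward step: try dropping each included variable.
        for v in sorted(model):
            if score(model - {v}) < score(model) - tol:
                model.remove(v)
                improved = True
    return model

# Toy score: pretend "creatinine" and "albumin" carry strong signal and
# "glucose" weak signal; each included variable costs a complexity penalty of 2.
SIGNAL = {"creatinine": -8.0, "albumin": -5.0, "glucose": -0.5}

def toy_bic(model):
    return sum(SIGNAL.get(v, 0.0) for v in model) + 2.0 * len(model)

selected = stepwise_select({"creatinine", "albumin", "glucose", "sodium"}, toy_bic)
print(sorted(selected))  # ['albumin', 'creatinine']
```

Only the variables whose signal outweighs the complexity penalty survive, which is the behavior the stepwise search relies on.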
Censoring.
Because individuals are measured either at routine appointment times or when seeking care for an illness, it can be difficult to identify the exact start time of a disease: the time of clinical diagnosis can be days to months after the actual onset. Thus, left-censoring had an effect on the shape of the hazard rate. We accounted for these left-censoring effects by contrasting different subsets of patients selected based on the length of their observation record. The subgroups did have different hazard rates, which implies that some subjects do not have their true stage 3 CKD diagnosis date recorded. In the future, we want to develop and apply methods that address the uncertainty due to left-censoring within a statistical model, in order to more accurately align patients or account for the varying uncertainty of diagnosis time across patients. We would also like to better infer, impute, or estimate the true time of diagnosis itself.
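For concreteness, the different censoring types enter a parametric survival likelihood through different contributions. The sketch below shows them for a simple exponential model with rate lam; this is our own illustrative notation, not the MRH implementation:

```python
import math

def loglik_contribution(kind, lam, t, t_right=None):
    """Log-likelihood contribution of one subject under an exponential
    survival model with constant hazard rate lam.
      'exact'    : event observed at t            -> log f(t)
      'right'    : still event-free at t          -> log S(t)
      'left'     : event occurred before t        -> log(1 - S(t))
      'interval' : event occurred in (t, t_right) -> log(S(t) - S(t_right))
    """
    S = lambda u: math.exp(-lam * u)  # exponential survival function
    if kind == "exact":
        return math.log(lam) - lam * t
    if kind == "right":
        return -lam * t
    if kind == "left":
        return math.log(1.0 - S(t))
    if kind == "interval":
        return math.log(S(t) - S(t_right))
    raise ValueError(kind)

lam = 0.5
print(loglik_contribution("interval", lam, 1.0, 3.0))
```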
Model selection and stability.
The MRH method is a robust technique for estimating the hazard rate of progression from stage 3 to stage 4 CKD. Moreover, the MRH was flexible enough to accommodate the complex fluctuations in the hazard rate while simultaneously estimating covariate effects. The data were best represented by the parametric Weibull model (the type III generalized extreme value distribution). In contrast, the exponential model was not a suitable representation of the data. This is an important point because it highlights the fact that the statistical distribution used to represent variables collected in the EHR context must be chosen carefully and consciously. Even when a variable's distribution could be governed by a known physiologic process, the healthcare process can intervene through selective measurement, netting a non-physiologic distributional representation of what should have been a physiologic variable.
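The kind of comparison that favors a Weibull over an exponential model can be illustrated on synthetic right-censored data. The data generation, profile-likelihood grid search, and BIC computation below are our own illustrative sketch, not the paper's fitting procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-censored survival times: Weibull(shape=1.7, scale=2.0),
# administratively censored at time c = 3.0.
n, k_true, lam_true, c = 500, 1.7, 2.0, 3.0
t_event = lam_true * rng.weibull(k_true, size=n)
t = np.minimum(t_event, c)   # observed time
e = t_event < c              # event indicator (True = event observed)

def weibull_profile_loglik(k, t, e):
    """Log-likelihood at shape k, with the scale profiled out in closed form."""
    d = e.sum()
    lam = (np.sum(t ** k) / d) ** (1.0 / k)   # MLE of the scale given shape k
    return (d * np.log(k) - d * k * np.log(lam)
            + (k - 1) * np.sum(np.log(t[e])) - np.sum((t / lam) ** k))

# Exponential model = Weibull with shape fixed at 1 (one free parameter).
ll_exp = weibull_profile_loglik(1.0, t, e)

# Weibull model: grid search over the shape (two free parameters).
grid = np.linspace(0.5, 3.0, 251)
lls = np.array([weibull_profile_loglik(k, t, e) for k in grid])
k_hat, ll_wb = grid[lls.argmax()], lls.max()

bic_exp = -2 * ll_exp + 1 * np.log(n)
bic_wb = -2 * ll_wb + 2 * np.log(n)
print(f"shape ~ {k_hat:.2f}; BIC exponential = {bic_exp:.1f}, BIC Weibull = {bic_wb:.1f}")
```

With the true shape well away from 1, the Weibull BIC comes out clearly below the exponential's despite the extra parameter, mirroring the model ranking reported above.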
Finally, the MRH results were consistent with those of the other modeling techniques, leading to the conclusion that the results do not depend strongly on the model. While the MRH was the best performer, the Cox regressions revealed much of what was revealed through the MRH analysis. The MRH has the potential to be much more representative and much more able to account for EHR complexities; nevertheless, when a simpler modeling technology is desired, the Cox regression remains a good choice.
In the future, related specifically to MRH, there is much to be done to properly analyze the process of progressing from stage 3 to stage 4 CKD. Development of a model that can properly account for the uncertainty in left-censoring and interval-censoring of subjects is paramount to analyzing the hazard rate using all subjects available in EHR data sets. Use of all subjects may also provide different insight into how laboratory tests and other covariates impact time to stage 4 CKD.
A main focus of future work from the MRH modeling standpoint is accounting for the longitudinal aspect of the laboratory measurements, both before and after stage 3 CKD diagnosis. This would allow clinicians to check the progress of the disease once a patient has been diagnosed with stage 3 CKD, and to sequentially update the estimates of risk of developing stage 4 CKD as each new laboratory test result arrives in the record. Such a methodological extension would allow for more specific patient risk assessment, in line with recent personalized medicine efforts. The MRH model can be extended to accommodate these types of measures, accounting for both the level of laboratory measurements and changes in them, allowing researchers to identify periods of higher and lower risk based on gradual or sharp changes in clinical measures.
Acknowledgments
The authors thank their colleagues Adler Perotte and Hojjat Salmasian for the helpful discussions. The authors thank the NSF EEID, NIH NIGMS (U01GM087729), NIH NIDA (R21DA027624-01), NIH NLM (RO1LM06910), and NIH NLM (HHSN276201000024C) for partial support.
References
- [1].Brownstein JS, Kleinman KP, and Mandl KD. Identifying pediatric age groups for influenza vaccination using a real-time regional surveillance system. American Journal of Epidemiology, 162(7):686–693, 2005.
- [2].Feldman J, Hoffer EP, Barnett GO, Kim RJ, Famiglietti KT, and Chueh H. Presence of key findings in the medical record prior to a documented high-risk diagnosis. Journal of the American Medical Informatics Association, 19(4):591–596, 2012.
- [3].Friedman CP, Wong AK, and Blumenthal D. Achieving a nationwide learning health system. Science Translational Medicine, 2(57):57cm29, 2010.
- [4].Collins SA and Vawdrey DK. "Reading between the lines" of flow sheet data: nurses' optional documentation associated with cardiac arrest outcomes. Applied Nursing Research, 25(4):251–257, 2012.
- [5].van der Lei J. Use and abuse of computer stored medical records. Methods of Information in Medicine, 30:79, 1991.
- [6].Hogan W and Wagner M. Accuracy of data in computer-based patient records. Journal of the American Medical Informatics Association, 4:342–355, 1997.
- [7].Hripcsak G, Knirsch C, Zhao L, Wilcox A, and Milton GB. Using discordance to improve classification in narrative clinical databases: an application to community-acquired pneumonia. Computation and Mathematical Biomedical Engineering, 37(3):296–304, 2007.
- [8].Sagreiya H and Altman RB. The utility of general purpose versus specialty clinical databases for research: Warfarin dose estimation from extracted clinical variables. Journal of Biomedical Informatics, 43:747–751, 2010.
- [9].Hripcsak G and Albers DJ. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association, 10:1–5, 2012.
- [10].Kleinberg S and Elhadad N. Lessons learned in replicating data-driven experiments in multiple medical systems and patient populations. In Proceedings AMIA (American Medical Informatics Association) Annual Fall Symposium, 2013.
- [11].Rothman KJ and Greenland S. Modern Epidemiology. Wolters Kluwer Health, 2008.
- [12].Dominici F, McDermott A, and Hastie T. Improved semi-parametric time series models of air pollution and mortality. Journal of the American Statistical Association, 99(468):938–948, 2004.
- [13].Dukic V, Hayden M, Forgor A, Hopson T, Akweongo P, Hodgson A, Monaghan A, Wiedinmyer C, Yoksas T, Thomson MC, Trzaska S, and Pandya R. The role of weather in meningitis outbreaks in Navrongo, Ghana: A generalized additive modeling approach. Journal of Agricultural, Biological, and Environmental Statistics, 17(3):442–460, 2012.
- [14].Wennberg J. Tracking Medicine: A Researcher's Quest to Understand Health Care. Oxford University Press, New York, 2010.
- [15].Normand S-L, Glickman M, and Gatsonis C. Statistical methods for profiling providers of medical care: Issues and applications. Journal of the American Statistical Association, 92:803–814, 1997.
- [16].Normand S-L and Shahian DM. Statistical and clinical aspects of hospital outcomes profiling. Statistical Science, 22:206–226, 2007.
- [17].Higgins JPT and Green S, editors. Cochrane Handbook for Systematic Reviews of Interventions. The Cochrane Collaboration, www.cochrane-handbook.org, 2011.
- [18].Dukic V and Gatsonis C. Meta-analysis of diagnostic test accuracy assessment studies with varying number of thresholds. Biometrics, 59(4):936–946, 2003.
- [19].Newton KM, Peissig PL, Kho AN, Bielinski SJ, Berg RL, Choudhary V, Basford M, Chute CG, Kullo IJ, Li R, Pacheco JA, Rasmussen LV, Spangler L, and Denny JC. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. Journal of the American Medical Informatics Association, 20(e1):e147–154, 2013.
- [20].Little RJ. Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association, 88:125–134, 1993.
- [21].Little RJA and Rubin DB. Statistical Analysis with Missing Data. Wiley, New York, 2002.
- [22].Daniels MJ and Hogan JW. Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Chapman and Hall/CRC, 2008.
- [23].Pivovarov R and Elhadad N. A hybrid knowledge-based and data-driven approach to identifying semantically similar concepts. Journal of Biomedical Informatics, 45(3):471–481, 2012.
- [24].Fan J, Feng Y, and Wu Y. High-dimensional variable selection for Cox's proportional hazards model, volume 6, pages 70–86. Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2010.
- [25].Hogan J and Laird N. Model-based approaches to analysing incomplete longitudinal and failure time data. Statistics in Medicine, 16(1–3):259–272, 1997.
- [26].Wu L, Liu W, Yi G, and Huang Y. Analysis of longitudinal and survival data: Joint modeling, inference methods, and issues. Journal of Probability and Statistics, 2012(3):id:640153, 2012.
- [27].Ibrahim J, Chen MH, and Sinha D. Bayesian methods for joint modeling of longitudinal and survival data with applications to cancer vaccine trials. Statistica Sinica, 14(3):863–883, 2004.
- [28].Ding J and Wang J-L. Modeling longitudinal data with nonparametric multiplicative random effects jointly with survival data. Biometrics, 64(2):546–556, 2008.
- [29].Albers DJ and Hripcsak G. Estimation of time-delayed mutual information from sparsely sampled sources. Chaos, Solitons, and Fractals, 45(6):853–860, 2012.
- [30].Albers DJ and Hripcsak G. Using time-delayed mutual information to discover and interpret temporal correlation structure in complex populations. Chaos, 22(1):013111, 2012.
- [31].Hripcsak G, Albers DJ, and Perotte A. Exploiting time in electronic health record correlations. Journal of the American Medical Informatics Association, 18:109–115, 2011.
- [32].Hug CW. Predicting the risk and trajectory of intensive care patients using survival models. PhD thesis, MIT, 2006.
- [33].Weber GM and Kohane IS. Extracting physician group intelligence from electronic health records to support evidence based medicine. PLoS ONE, 8:e64933, 2013.
- [34].United States Renal Data System. 2012 Atlas of chronic kidney disease in the United States. http://www.usrds.org/atlas.aspx, 2012. Last accessed June 24, 2013.
- [35].National Kidney Foundation. KDOQI clinical practice guidelines for chronic kidney disease: Evaluation, classification, and stratification. http://www.kidney.org/professionals/KDOQI/guidelines_ckd/toc.htm, 2002.
- [36].Gorriz JL and Martinez-Castelao A. Proteinuria: detection and role in native renal disease progression. Transplant Review (Orlando), 26(1):3–13, 2012.
- [37].Li YC. Vitamin D: roles in renal and cardiovascular protection. Current Opinion in Nephrology and Hypertension, 21:72–9, 2012.
- [38].Echouffo-Tcheugui JB and Kengne AP. Risk models to predict chronic kidney disease and its progression: A systematic review. PLoS Med, 9:e1001344, 2012.
- [39].Kinchen KS, Sadler J, Fink N, Brookmeyer R, Klag MJ, Levey AS, and Powe NR. The timing of specialist evaluation in chronic kidney disease and mortality. Annals of Internal Medicine, 137(6):479–486, 2002.
- [40].Chase HS, Radhakrishnan J, Shirazian S, Rao MK, and Vawdrey DK. Under-documentation of chronic kidney disease in the electronic health record in outpatients. Journal of the American Medical Informatics Association, 17(5):588–594, 2010.
- [41].Levey AS, Bosch JP, Lewis JB, Greene T, Rogers N, and Roth D. A more accurate method to estimate glomerular filtration rate from serum creatinine: a new prediction equation. Annals of Internal Medicine, 130:461–470, 1999.
- [42].Abhyankar S, Demner-Fushman D, and McDonald C. Standardizing clinical laboratory data for secondary use. Journal of Biomedical Informatics, 45(4):642–650, 2012.
- [43].Cimino JJ, Clayton PD, Hripcsak G, and Johnson SB. Knowledge-based approaches to the maintenance of a large controlled medical terminology. Journal of the American Medical Informatics Association, 1(1):35–50, 1994.
- [44].Lin JH and Haug PJ. Exploiting missing clinical data in Bayesian network modeling for predicting medical problems. Journal of Biomedical Informatics, 41(1):1–14, 2008.
- [45].Cox DR. Regression models and life tables. Journal of the Royal Statistical Society - Series B, 34(2):187–220, 1972.
- [46].Cox DR and Oakes D. Analysis of Survival Data. Chapman and Hall/CRC, 1984.
- [47].Kalbfleisch JD and Prentice RL. The Statistical Analysis of Failure Time Data. John Wiley & Sons, New York, 1980.
- [48].Aalen O, Borgan O, and Gjessing H. Survival and Event History Analysis. Springer-Verlag, New York, 2008.
- [49].Cook RJ and Lawless JF. The Statistical Analysis of Recurrent Events. Springer-Verlag, New York, 2007.
- [50].Thernau TM and Grambsch PM. Modeling Survival Data: Extending the Cox Model. Springer-Verlag, New York, 2000.
- [51].Ibrahim JG, Chen M-H, and Sinha D. Bayesian Survival Analysis. Springer-Verlag, New York, 2001.
- [52].Nelson W. Applied Lifetime Data Analysis. John Wiley & Sons, New York, 1982.
- [53].Marubini E and Valsecchi MG. Analysing Survival Data from Clinical Trials and Observational Studies. John Wiley & Sons, Inc., Chichester, England, 1995.
- [54].Leemis LM. Reliability: Probabilistic Models and Statistical Methods. Lawrence Leemis, 2009.
- [55].Ebeling CE. An Introduction to Reliability and Maintainability Engineering. Waveland Press Inc, 2009.
- [56].Antoniadis A, Gregoire G, and Nason G. Density and hazard rate estimation for right-censored data by using wavelet methods. Journal of the Royal Statistical Society - Series B, 61(1):63–84, 1999.
- [57].Gray RJ. Some diagnostic methods for Cox regression models through hazard smoothing. Biometrics, 46(1):93–102, 1990.
- [58].Gray RJ. Flexible methods for analyzing survival data using splines, with application to breast cancer prognosis. Journal of the American Statistical Association, 87(420):942–951, 1992.
- [59].Gray RJ. Hazard rate regression using ordinary nonparametric regression smoothers. Journal of Computational and Graphical Statistics, 5(2):190–207, 1996.
- [60].Hjort N. Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3):1259–1294, 1990.
- [61].Lee J and Kim Y. A new algorithm to generate beta processes. Technical report, Department of Statistics, Pennsylvania State University, 2002.
- [62].Arjas E and Gasbarra D. Nonparametric Bayesian inference from right censored survival data, using the Gibbs sampler. Statistica Sinica, 4:505–524, 1994.
- [63].Nieto-Barajas L and Walker S. Markov beta and gamma processes for modeling hazard rates. Scandinavian Journal of Statistics, 29:413–424, 2002.
- [64].Sinha D and Dey DK. Semiparametric Bayesian analysis of survival data. Journal of the American Statistical Association, 92:1195–1212, 1997.
- [65].Müller P and Rodriguez A. Nonparametric Bayesian Inference. Institute of Mathematical Statistics and American Statistical Association, Beachwood, Ohio and Alexandria, Virginia, USA, 2013.
- [66].Kolaczyk ED. Bayesian multiscale models for Poisson processes. Journal of the American Statistical Association, 94(477):920–933, 1999.
- [67].Bouman P, Dukic V, and Meng XL. A Bayesian multiresolution hazard model with application to an AIDS reporting delay study. Statistica Sinica, 15:325–357, 2005.
- [68].Bouman P, Dignam J, Dukic V, and Meng XL. A multiresolution hazard model for multi-center survival studies: Application to Tamoxifen treatment in early stage breast cancer. Journal of the American Statistical Association, 102(480):1145–1157, 2007.
- [69].Dukic V and Dignam J. Bayesian hierarchical multiresolution hazard model for the study of time-dependent failure patterns in early stage breast cancer. Bayesian Analysis, 2(3):591–610, 2007.
- [70].Dignam JJ, Dukic V, Anderson SJ, Mamounas EP, Wickerham DL, and Wolmark N. Hazard of recurrence and adjuvant treatment effects over time in lymph node-negative breast cancer. Breast Cancer Research and Treatment, 116:595–602, 2009.
- [71].Chen Y, Dignam J, and Dukic V. Pruned multiresolution hazard (PMRH) models for time-to-event data. Bayesian Analysis (under review), 2013.
- [72].Geman S and Geman D. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
- [73].R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008. ISBN 3-900051-07-0.
- [74].Geweke J. Bayesian Statistics 4. Oxford University Press, 1992.
- [75].Leung K-M, Elashoff RM, and Afifi AA. Censoring issues in survival analysis. Annual Review of Public Health, 18:83–104, 1997.
- [76].Mahmoodi BK, Gansevoort RT, Veeger NJ, Matthews AG, Navis G, Hillege HL, van der Meer J, and Prevention of Renal and Vascular End-stage Disease (PREVEND) Study Group. Microalbuminuria and risk of venous thromboembolism. Journal of the American Medical Association, 301:1790–1797, 2009.
- [77].Halbesma N, Jansen DF, Heymans MW, Stolk RP, de Jong PE, Gansevoort RT, and the PREVEND Study Group. Development and validation of a general population renal risk score. Clinical Journal of the American Society of Nephrology, 6(7):1731–1738, 2011.
- [78].Schwarz G. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
- [79].Fischer MJ, Go AS, Lora CM, Ackerson L, Cohan J, Kusek JW, Mercado A, Ojo A, Ricardo AC, Rosen LK, Tao K, Xie D, Feldman HI, Lash JP, and the CRIC and H-CRIC study groups. CKD in Hispanics: Baseline characteristics from the CRIC (Chronic Renal Insufficiency Cohort) and Hispanic-CRIC studies. American Journal of Kidney Diseases, 58:214–227, 2011.