Using recursive feature elimination in random forest to account for correlated variables in high dimensional data

BMC Genet. 2018 Sep 17;19(Suppl 1):65. doi: 10.1186/s12863-018-0633-8.

Authors

Burcu F Darst¹, Kristen C Malecki¹, Corinne D Engelman²

Affiliations

¹ Department of Population Health Sciences, School of Medicine and Public Health, University of Wisconsin, 610 Walnut Street, 1007 WARF, Madison, WI, 53726, USA.
² Department of Population Health Sciences, School of Medicine and Public Health, University of Wisconsin, 610 Walnut Street, 1007 WARF, Madison, WI, 53726, USA. cengelman@wisc.edu.

Abstract

Background: Random forest (RF) is a machine-learning method that generally works well with high-dimensional problems and allows for nonlinear relationships between predictors; however, the presence of correlated predictors has been shown to impact its ability to identify strong predictors. The Random Forest-Recursive Feature Elimination algorithm (RF-RFE) mitigates this problem in smaller data sets, but this approach has not been tested in high-dimensional omics data sets.

Results: We integrated 202,919 genotypes and 153,422 methylation sites in 680 individuals, and compared the abilities of RF and RF-RFE to detect simulated causal associations, which included simulated genotype-methylation interactions, between these variables and triglyceride levels. RF identified strong causal variables that had only a few highly correlated variables, but it did not detect the other causal variables.

Conclusions: Although RF-RFE decreased the importance of correlated variables, in the presence of many correlated variables, it also decreased the importance of causal variables, making both hard to detect. These findings suggest that RF-RFE may not scale to high-dimensional data.

Keywords: Correlation; Epigenomics; Genetics; Genomics; High-dimensional data; Integration; Machine-learning; Methylation; Omics; Random forest; Recursive feature elimination.
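
The RF-RFE procedure described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn's generic RFE wrapper around a random forest regressor rather than the authors' own implementation. The simulated data, the helper names (simulate, rf_rfe_ranking), the 20% drop fraction, and the use of impurity-based importances are all assumptions made for illustration; none of them comes from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE


def simulate(n_samples=200, n_correlated=5, n_noise=20, seed=0):
    """Toy data (hypothetical): one causal predictor, several noisy
    copies of it (correlated predictors), and independent noise."""
    rng = np.random.default_rng(seed)
    causal = rng.normal(size=(n_samples, 1))
    correlated = causal + 0.3 * rng.normal(size=(n_samples, n_correlated))
    noise = rng.normal(size=(n_samples, n_noise))
    # Column 0 is causal, columns 1..n_correlated are its correlates.
    X = np.hstack([causal, correlated, noise])
    y = 2.0 * causal[:, 0] + 0.5 * rng.normal(size=n_samples)
    return X, y


def rf_rfe_ranking(X, y, drop_fraction=0.2, seed=0):
    """Refit the forest repeatedly, dropping the least important
    features each round, until a single feature remains.

    scikit-learn turns a fractional `step` into a fixed count based on
    the original number of features, so the same number of features is
    removed in every round.
    """
    forest = RandomForestRegressor(n_estimators=50, random_state=seed)
    selector = RFE(forest, n_features_to_select=1, step=drop_fraction)
    selector.fit(X, y)
    # Rank 1 = retained to the end; larger ranks were eliminated earlier.
    return selector.ranking_


if __name__ == "__main__":
    X, y = simulate()
    ranking = rf_rfe_ranking(X, y)
    print("rank of causal feature:", ranking[0])
    print("ranks of correlated features:", ranking[1:6])
```

Comparing the rank of column 0 with the ranks of its correlates gives a small-scale view of how importance is shared among correlated predictors; it does not reproduce the high-dimensional setting studied here.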

Publication types

Research Support, N.I.H., Extramural

MeSH terms

CpG Islands
DNA Methylation
Epigenomics
Genome-Wide Association Study
Genotype
Humans
Hypertriglyceridemia / drug therapy
Hypertriglyceridemia / genetics
Hypoglycemic Agents / therapeutic use
Machine Learning*
Polymorphism, Single Nucleotide
Triglycerides / blood

Substances

Hypoglycemic Agents
Triglycerides

