research-article

How does normalization impact RNA-seq disease diagnosis?

Published: 01 September 2018

Graphical abstract

Figure: d-index comparisons of the deep neural network (DNN), extra-trees (ET), and support vector machine (SVM) models on raw data and on RPKM-, ML-, DESeq-, and TMM-normalized data for the Breast, Kidney, and Prostate datasets. The best diagnosis for each dataset is further marked with an extra plot.

Highlights

d-index comparisons of raw and normalized data under DNN, ET, and SVM.
RPKM-, ML-, DESeq-, and TMM-normalized Breast, Kidney, and Prostate data are included.
The best diagnosis for each dataset is further marked with an extra plot.

Abstract

With the surge of next-generation high-throughput technologies, RNA-seq data is playing an increasingly important role in disease diagnosis, where normalization is assumed to be an essential procedure for producing comparable samples. Recent studies have proposed different normalization methods to remove various technical biases in RNA sequencing. However, no previous study has evaluated the impact of normalization on RNA-seq disease diagnosis.
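As a concrete instance of such a normalization method, RPKM (reads per kilobase of transcript per million mapped reads) rescales each raw count by gene length and by the sample's total read count. The sketch below uses the standard RPKM formula; the function name `rpkm` and the toy counts and gene lengths are illustrative assumptions and are not taken from the study.

```python
def rpkm(counts, gene_lengths_bp):
    """Return RPKM values for one sample.

    counts          -- raw read counts, one per gene (illustrative input)
    gene_lengths_bp -- gene lengths in base pairs, same order as counts
    """
    if len(counts) != len(gene_lengths_bp):
        raise ValueError("counts and gene_lengths_bp must have the same length")
    total_reads = sum(counts)
    if total_reads == 0:
        raise ValueError("sample has no mapped reads")
    # RPKM = count * 1e9 / (gene length in bp * total mapped reads)
    return [
        count * 1e9 / (length * total_reads)
        for count, length in zip(counts, gene_lengths_bp)
    ]


# Toy sample: three genes, 1000 mapped reads in total (made-up numbers).
raw_counts = [100, 200, 700]
lengths = [1000, 2000, 1000]
normalized = rpkm(raw_counts, lengths)
print(normalized)
```

In this toy sample the first two genes receive the same RPKM value although their raw counts differ by a factor of two, because the second gene is twice as long. This is the kind of rescaling whose effect on diagnosis the study examines.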
In this study, we investigate this problem by analyzing structured big data: RNA-seq data acquired from the TCGA portal, chosen for its popularity in RNA-seq disease diagnosis. We propose a novel normalization effect test algorithm, a diagnostic index (d-index), and data entropy to analyze and evaluate the impact of normalization on RNA-seq disease diagnosis using state-of-the-art machine learning models. Furthermore, we present an original visualization analysis to compare the performance of normalized data with that of raw data.
We found that normalized data generally yields an equivalent or even lower level of diagnosis than the corresponding raw data. Moreover, some normalization approaches (e.g., RPKM) even have negative effects on disease diagnosis. Raw data, on the other hand, seems to have the potential to decipher pathological status better than, or at least comparably to, normalized data. Our visualization analysis also shows that some normalization methods even introduce ‘outliers’, which unavoidably decrease sample detectability in diagnosis. More importantly, our data entropy analysis shows that normalized data usually has equivalent or lower entropy values than raw data, and data with high entropy values tends to achieve better diagnosis than data with low entropy values. In addition, we found that high-dimensional imbalance (HDI) data is unaffected by any normalization procedure in diagnosis: almost all machine learning models fail on it, recognizing only the majority types, regardless of whether the data is raw or normalized.
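The abstract does not state how data entropy is computed. As a rough illustration only, the sketch below assumes Shannon entropy over an equal-width histogram of expression values; the function name `shannon_entropy`, the binning scheme, and the default of 10 bins are all assumptions and may differ from the definition used in the paper.

```python
import math
from collections import Counter


def shannon_entropy(values, n_bins=10):
    """Shannon entropy (in bits) of values after equal-width binning.

    This is an assumed, generic definition, not the paper's own.
    """
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values fall in one bin, so there is no uncertainty.
        return 0.0
    width = (hi - lo) / n_bins
    # Assign each value to a bin; the maximum value goes into the last bin.
    bins = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in bins.values())


# Made-up expression values: evenly spread values versus constant values.
spread = shannon_entropy([0, 1, 2, 3], n_bins=4)
flat = shannon_entropy([5, 5, 5, 5], n_bins=4)
print(spread, flat)
```

Under this assumed definition, a transformation that compresses expression values into fewer histogram bins lowers the entropy, which is one way to read the reported observation that normalized data usually has equivalent or lower entropy than raw data.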
Our results suggest that normalized data may not demonstrate statistically significant advantages in disease diagnosis over its raw form. This further implies that normalization, or at least some normalization processes, may not be an indispensable procedure in RNA-seq disease diagnosis. Instead, raw data may perform better because it captures more of the original transcriptome patterns in different pathological conditions.

Cited By

  • (2024) Effect of RNA-Seq data normalization on protein interactome mapping for Alzheimer’s disease. Computational Biology and Chemistry 109:C. DOI: 10.1016/j.compbiolchem.2024.108028. Online publication date: 1-Apr-2024.
  • (2023) PORDE: Explaining Data Poisoning Attacks Through Visual Analytics with Food Delivery App Reviews. Companion Proceedings of the 28th International Conference on Intelligent User Interfaces, pp. 46-50. DOI: 10.1145/3581754.3584128. Online publication date: 27-Mar-2023.

Published In

Journal of Biomedical Informatics, Volume 85, Issue C
Sep 2018
208 pages

Publisher

Elsevier Science, San Diego, CA, United States

Author Tags

1. RNA-seq
2. Normalization
3. Big data
4. Machine learning
