Abstract
Analysis of high-throughput genomic data is challenging and requires specialized knowledge of experimental design, genomic data preprocessing and quality control, high-dimensional data analysis, and machine learning. Each research project involving high-throughput genomic data is unique, and no single recipe for data analysis fits every project and research question. In this chapter, we introduce concepts of designing genomic experiments and basic principles of bioinformatic analysis of high-throughput genomic data, and we discuss best practices for the design and reproducibility of computational analyses. Our main purpose is to introduce the basic concepts of planning successful genomic studies, using the analysis of gene expression data as an example, in order to facilitate communication with the statisticians and bioinformaticians who will be involved in study design and data analysis. We also briefly introduce statistical and bioinformatic methods commonly used for the analysis of high-throughput genomic data to help readers follow computational analyses in the cancer research literature.
Copyright information
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Tyekucheva, S., Parmigiani, G. (2017). Bioinformatic Analysis of Epidemiological and Pathological Data. In: Loda, M., Mucci, L., Mittelstadt, M., Van Hemelrijck, M., Cotter, M. (eds) Pathology and Epidemiology of Cancer. Springer, Cham. https://doi.org/10.1007/978-3-319-35153-7_8
DOI: https://doi.org/10.1007/978-3-319-35153-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-35151-3
Online ISBN: 978-3-319-35153-7
eBook Packages: Medicine, Medicine (R0)