Bioinformatic Analysis of Epidemiological and Pathological Data

Svitlana Tyekucheva &
Giovanni Parmigiani

Abstract

Analysis of high-throughput genomic data is challenging and requires specialized knowledge of experimental design, genomic data preprocessing and quality control, high-dimensional data analysis, and machine learning. Each research project involving high-throughput genomic data is unique, and no single data analysis recipe fits every project and research question. In this chapter, we introduce concepts of designing genomic experiments and basic principles of bioinformatic analysis of high-throughput genomic data, and we discuss best practices for the design and reproducibility of computational analyses. Our main purpose is to introduce the basic concepts of planning successful genomic studies, using the analysis of gene expression data as an example, in order to facilitate communication with the statisticians and bioinformaticians who will be involved in study design and data analysis. We also briefly introduce statistical and bioinformatic methods commonly used for the analysis of high-throughput genomic data to help readers follow computational analyses in the cancer research literature.

Author information

Authors and Affiliations

Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA
Svitlana Tyekucheva & Giovanni Parmigiani
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Svitlana Tyekucheva & Giovanni Parmigiani

Corresponding author

Correspondence to Svitlana Tyekucheva.

Editor information

Editors and Affiliations

Center for Molecular Oncologic Pathology, Dana Farber Cancer Institute, Boston, Massachusetts, USA
Massimo Loda
Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, USA
Lorelei A. Mucci
Department of Genetics, Harvard Medical School, Boston, Massachusetts, USA
Megan L. Mittelstadt
King’s College London, Guy’s Hospital, London, United Kingdom
Mieke Van Hemelrijck
Department of Pathology, Dana-Farber Cancer Institute and Brigham & Women’s Hospital, Boston, Massachusetts, USA
Maura Bríd Cotter

About this chapter

Cite this chapter

Tyekucheva, S., Parmigiani, G. (2017). Bioinformatic Analysis of Epidemiological and Pathological Data. In: Loda, M., Mucci, L., Mittelstadt, M., Van Hemelrijck, M., Cotter, M. (eds) Pathology and Epidemiology of Cancer. Springer, Cham. https://doi.org/10.1007/978-3-319-35153-7_8

DOI: https://doi.org/10.1007/978-3-319-35153-7_8
Published: 02 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-35151-3
Online ISBN: 978-3-319-35153-7
eBook Packages: Medicine, Medicine (R0)
