Bioinformatic Analysis of Epidemiological and Pathological Data

Pathology and Epidemiology of Cancer

Abstract

Analysis of high-throughput genomic data is challenging and requires specialized knowledge of experimental design, genomic data preprocessing and quality control, high-dimensional data analysis, and machine learning. Each research project involving high-throughput genomic data is unique, and no single data analysis recipe fits every project and research question. In this chapter, we introduce concepts of designing genomic experiments and basic principles of bioinformatic analysis of high-throughput genomic data, and we discuss best practices for the design and reproducibility of computational analyses. Our main purpose is to introduce the basic concepts of planning successful genomic studies, using the analysis of gene expression data as an example, in order to facilitate communication with the statisticians and bioinformaticians who will be involved in study design and data analysis. We also briefly introduce statistical and bioinformatic methods commonly used for the analysis of high-throughput genomic data to help readers follow computational analyses in the cancer research literature.

Author information

Corresponding author

Correspondence to Svitlana Tyekucheva.

Copyright information

© 2017 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Tyekucheva, S., Parmigiani, G. (2017). Bioinformatic Analysis of Epidemiological and Pathological Data. In: Loda, M., Mucci, L., Mittelstadt, M., Van Hemelrijck, M., Cotter, M. (eds) Pathology and Epidemiology of Cancer. Springer, Cham. https://doi.org/10.1007/978-3-319-35153-7_8

  • DOI: https://doi.org/10.1007/978-3-319-35153-7_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-35151-3

  • Online ISBN: 978-3-319-35153-7

  • eBook Packages: Medicine, Medicine (R0)
