Genomics is expected to soon overtake astronomy, particle physics, and even YouTube as the biggest creator of digital information (1). Analysis of this information has already led to important and groundbreaking discoveries relevant to our health, but ongoing work will require creative solutions to the multitude of challenges arising from this volume of data. One such challenge is determining what data should be collected and how they should be managed. As cohort sizes in population-based studies grow into the hundreds of thousands, practical issues of collection, storage, and filtering have come into sharper focus. Additionally, frameworks that seamlessly integrate disparate datasets while allowing for flexible analysis will be required. Finally, as technical challenges and limitations arise, new analytical approaches and designs will have to be considered.
This dissertation comprises three projects that address these questions from the perspective of a bioinformatician. These projects describe the development of new software and methods for sample management, data integration and analysis, and design strategies to improve signal in noisy data.
The first chapter of this dissertation consists of background material relating to the projects, including a description of the state of prostate cancer genomics, the development of biomarkers for its detection, and an exploration of a promising new biomarker, cell-free DNA (cfDNA). It also includes a discussion of some of the overarching questions of my PhD.
The second chapter describes a web-based sample management system called Samasy. Born out of necessity, this tool addresses the very practical issue of sample subsetting that is often required in resequencing studies. Samasy was used to facilitate the selection of 16,600 samples from a much larger cohort of 54,000 while preserving ethnicity and age balance among cases and controls. This tool integrates with liquid handling systems and provides a visually intuitive interface for plate/sample management and batch sample transfer execution.
The third chapter details Orchid, a framework designed to make machine learning on cancer variant data easy and extensible. It does so by integrating a variety of biological annotations (or features) with simple somatic tumor data available from large repositories like The Cancer Genome Atlas (TCGA) or the International Cancer Genome Consortium (ICGC). This tool supports an efficient data store, MemSQL, that allows for very fast retrieval and filtering, and extends the popular Python packages pandas and scikit-learn to facilitate machine learning on these data.
Finally, the fourth chapter outlines the creation of a custom targeted sequencing panel for prostate cancer that was designed for screening tumor variants in cfDNA. Building upon the power of Orchid, we detail how machine learning on whole-genome prostate tumor datasets can be used to rank mutations by their likelihood of being found in a patient with few mutations, or in other words, of being involved in early-stage disease. This ranking was used to build a targeted sequencing panel for detection of tumor-derived cfDNA variants. This panel was then validated and applied to a cohort of nine UCSF prostate cancer patients with multiple tumor foci collected at the time of radical prostatectomy (RP).
Taken together, the work described in this dissertation provides tools and methodologies for the analysis of germline and somatic variants in prostate and other cancers. It also aims to advance the technological development of cfDNA as a biomarker for the detection or monitoring of diseases like cancer.