Abstract
The continuous growth of experimental data generated by Next Generation Sequencing (NGS) machines has led to the adoption of advanced techniques to manage them intelligently. The advent of the Big Data era posed new challenges that led to the development of novel methods and tools, which were originally designed to tackle computational science problems but can nowadays be widely applied to biomedical data. In this work, we address two biomedical data management issues: (i) how to reduce the redundancy of genomic and clinical data, and (ii) how to make this large amount of data easily accessible. First, we propose an approach to optimally organize genomic and clinical data that takes data redundancy into account, together with a method that saves as much space as possible by exploiting NoSQL technologies. Then, we propose design principles for organizing biomedical data and making them easily accessible through a collection of Application Programming Interfaces (APIs), which together form a flexible framework that we call OpenOmics. To prove the validity of our approach, we apply it to data extracted from the Genomic Data Commons repository. OpenOmics is free and open source, so that anyone can extend the set of provided APIs with new features to answer specific biological questions. The APIs are hosted on GitHub at https://github.com/fabio-cumbo/open-omics-api/, are publicly queryable at http://bioinformatics.iasi.cnr.it/openomics/api/routes, and are documented at https://openomics.docs.apiary.io/.
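The abstract does not describe the storage schema, so the following is only a minimal sketch of the general idea of reducing redundancy in a document-oriented store: annotation fields that repeat across samples are stored once and referenced by key. The record layout, the field names, the placeholder gene identifiers, and the helper functions `split_redundant_fields` and `rebuild` are all assumptions made for illustration and are not taken from OpenOmics.

```python
from typing import Dict, List, Tuple

# Hypothetical flat records (illustrative values only): the annotation
# fields chrom/start/end are repeated for every sample of the same gene.
FLAT_RECORDS: List[dict] = [
    {"sample": "S1", "gene": "GENE_A", "chrom": "chr1", "start": 100, "end": 900, "value": 12.5},
    {"sample": "S2", "gene": "GENE_A", "chrom": "chr1", "start": 100, "end": 900, "value": 8.1},
    {"sample": "S1", "gene": "GENE_B", "chrom": "chr2", "start": 50, "end": 400, "value": 3.4},
]

# Assumed set of fields that depend only on the gene, not on the sample.
ANNOTATION_FIELDS = ("chrom", "start", "end")


def split_redundant_fields(records: List[dict]) -> Tuple[Dict[str, dict], List[dict]]:
    """Store each gene annotation once; keep only sample-specific values per record."""
    annotations: Dict[str, dict] = {}
    measurements: List[dict] = []
    for record in records:
        gene = record["gene"]
        # The annotation is written only the first time a gene is seen.
        annotations.setdefault(gene, {field: record[field] for field in ANNOTATION_FIELDS})
        measurements.append({"sample": record["sample"], "gene": gene, "value": record["value"]})
    return annotations, measurements


def rebuild(annotations: Dict[str, dict], measurements: List[dict]) -> List[dict]:
    """Rejoin measurements with their shared annotation to recover the flat view."""
    return [{**m, **annotations[m["gene"]]} for m in measurements]


annotations, measurements = split_redundant_fields(FLAT_RECORDS)
restored = rebuild(annotations, measurements)
```

In a document database, `annotations` and `measurements` would correspond to two separate collections; the flat view is recovered on demand by the join in `rebuild`, so no information is lost while the repeated fields are stored only once.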
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Cappelli, E., Weitschek, E., Cumbo, F. (2019). Smart Persistence and Accessibility of Genomic and Clinical Data. In: Anderst-Kotsis, G., et al. Database and Expert Systems Applications. DEXA 2019. Communications in Computer and Information Science, vol 1062. Springer, Cham. https://doi.org/10.1007/978-3-030-27684-3_2
DOI: https://doi.org/10.1007/978-3-030-27684-3_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27683-6
Online ISBN: 978-3-030-27684-3
eBook Packages: Computer Science, Computer Science (R0)