research-article

Data Management for Heterogeneous Genomic Datasets

Published: 01 November 2017

Abstract

Next Generation Sequencing (NGS), a family of technologies for reading DNA and RNA, is changing biological research, and will soon change medical practice, by quickly providing sequencing data and high-level features of numerous individual genomes in different biological and clinical conditions. The availability of millions of whole genome sequences may soon become the biggest and most important "big data" problem of mankind. In this exciting framework, we recently proposed a new paradigm to raise the level of abstraction in NGS data management, by introducing a GenoMetric Query Language (GMQL) and demonstrating its usefulness through several biological query examples. Building on that effort, here we motivate and formalize GMQL operations, focusing especially on the most characteristic and domain-specific ones. Furthermore, we address their efficient implementation and illustrate the architecture of the new software system that we have developed for their execution on big genomic data in a cloud computing environment, and we evaluate its performance. The new system implementation is available for download at the GMQL website http://www.bioinformatics.deib.polimi.it/GMQL/; GMQL can also be tested through a set of predefined queries on ENCODE and Roadmap Epigenomics data at http://www.bioinformatics.deib.polimi.it/GMQL/queries/.

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 14, Issue 6
November 2017
287 pages

Publisher

IEEE Computer Society Press
Washington, DC, United States
