research-article

Data Management for Heterogeneous Genomic Datasets

Authors:

Abdulrahman Kaitoua,

Marco Masseroli,

Francesco Venco

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Volume 14, Issue 6

Pages 1251 - 1264

https://doi.org/10.1109/TCBB.2016.2576447

Published: 01 November 2017

Abstract

Next Generation Sequencing (NGS), a family of technologies for reading DNA and RNA, is changing biological research and will soon change medical practice by quickly providing sequencing data and high-level features of numerous individual genomes in different biological and clinical conditions. The availability of millions of whole genome sequences may soon become the biggest and most important "big data" problem of mankind. In this exciting framework, we recently proposed a new paradigm to raise the level of abstraction in NGS data management by introducing a GenoMetric Query Language (GMQL) and demonstrating its usefulness through several biological query examples. Building on that effort, here we motivate and formalize GMQL operations, focusing especially on the most characteristic and domain-specific ones. Furthermore, we address their efficient implementation, illustrate the architecture of the new software system that we have developed for their execution on big genomic data in a cloud computing environment, and evaluate its performance. The new system implementation is available for download at the GMQL website http://www.bioinformatics.deib.polimi.it/GMQL/; GMQL can also be tested through a set of predefined queries on ENCODE and Roadmap Epigenomics data at http://www.bioinformatics.deib.polimi.it/GMQL/queries/.


Index Terms

Applied computing: Life and medical sciences


Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics Volume 14, Issue 6

November 2017

287 pages

ISSN: 1545-5963

Copyright © 2017.

Publisher

IEEE Computer Society Press

Washington, DC, United States


Bibliometrics

Total Citations: 4
Total Downloads: 42
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 0

Reflects downloads up to 19 Sep 2024

Cited By

Khoshghalbvash F, Gao J, Shi X, Buck M, Ma J, Veltri P (2019) Integrative Feature Ranking by Applying Deep Learning on Multi Source Genomic Data. Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (207-216). DOI: 10.1145/3307339.3342139. Online publication date: 4-Sep-2019
https://dl.acm.org/doi/10.1145/3307339.3342139
Louri A (2019) Editorial from the Incoming Editor-in-Chief. IEEE Transactions on Computers 68:1 (3-3). DOI: 10.1109/TC.2018.2879421. Online publication date: 1-Jan-2019
https://dl.acm.org/doi/10.1109/TC.2018.2879421
Małysiak-Mrozek B, Baron T, Mrozek D (2019) Spark-IDPP. Cluster Computing 22:2 (487-508). DOI: 10.1007/s10586-018-2857-9. Online publication date: 1-Jun-2019
https://dl.acm.org/doi/10.1007/s10586-018-2857-9
Gulino A, Kaitoua A, Ceri S (2018) Optimal Binning for Genomics. IEEE Transactions on Computers 68:1 (125-138). DOI: 10.1109/TC.2018.2854880. Online publication date: 10-Dec-2018
https://dl.acm.org/doi/10.1109/TC.2018.2854880
