article

RACE: a scalable and elastic parallel system for discovering repeats in very long sequences

Authors:

Ashraf AboulnagaAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 6, Issue 10

Pages 865 - 876

https://doi.org/10.14778/2536206.2536214

Published: 01 August 2013 Publication History

Abstract

A wide range of applications, including bioinformatics, time series, and log analysis, depend on the identification of repetitions in very long sequences. The problem of finding maximal pairs subsumes most important types of repetition-finding tasks. Existing solutions require both the input sequence and its index (typically an order of magnitude larger than the input) to fit in memory. Moreover, they are serial algorithms with long execution time. Therefore, they are limited to small datasets, despite the fact that modern applications demand orders of magnitude longer sequences.

In this paper we present RACE, a parallel system for finding maximal pairs in very long sequences. RACE supports parallel execution on stand-alone multicore systems, in addition to scaling to thousands of nodes on clusters or supercomputers. RACE does not require the input or the index to fit in memory; therefore, it supports very long sequences with limited memory. Moreover, it uses a novel array representation that allows for cache-efficient implementation. RACE is particularly suitable for the cloud (e.g., Amazon EC2) because, based on availability, it can scale elastically to more or fewer machines during its execution. Since scaling out introduces overheads, mainly due to load imbalance, we propose a cost model to estimate the expected speedup, based on statistics gathered through sampling. The model allows the user to select the appropriate combination of cloud resources based on the provider's prices and the required deadline. We conducted extensive experimental evaluation with large real datasets and large computing infrastructures. In contrast to existing methods, RACE can handle the entire human genome on a typical desktop computer with 16GB RAM. Moreover, for a problem that takes 10 hours of serial execution, RACE finishes in 28 seconds using 2,048 nodes on an IBM BlueGene/P supercomputer.

References

[1]

M. Abouelhoda, S. Kurtz, and E. Ohlebusch. Enhanced Suffix Arrays and Application, Handbook of Computational Molecular Biology (Chapter 7). Information Science Series. 2006.

[2]

A. Anandasivam. Consumer Preferences and Bid-Price Control for Cloud Services. PhD thesis, Faculty of Economics, Karlsruhe Institute of Technology (KIT), 2010.

[3]

N. Askitis and R. Sinha. RepMaestro: scalable repeat detection on disk-based genome sequences. Bioinformatics, 26(19):2368-2374, 2010.

[4]

T. Beller, K. Berger, and E. Ohlebusch. Space-efficient computation of maximal and supermaximal repeats in genome sequences. In Proc. of the Int. Conf. on String Processing and Information Retrieval (SPIRE), pages 99-110, 2012.

[5]

R. P. J. C. Bose and W. M. P. van der Aalst. Abstractions in process mining: A taxonomy of patterns. In Proc. of the Int. Conf. on Business Process Management (BPM), pages 159-175, 2009.

[6]

G. S. Brodal, R. B. Lyngsø, C. N. S. Pedersen, and J. Stoye. Finding maximal pairs with bounded gap. In Proc. of the Annual Symposium on Combinatorial Pattern Matching (CPM), pages 134-149. 1999.

[7]

D. L. Eager, J. Zahorjan, and E. D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Trans. Comput., 38(3):408-423, 1989.

[8]

A. Ghoting and K. Makarychev. Serial and parallel methods for i/o efficient suffix tree construction. In Proc. of the ACM Int. Conf. on Management of data (SIGMOD), pages 827-840. ACM, 2009.

[9]

D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997.

[10]

W. Haque, A. Aravind, and B. Reddy. Pairwise sequence alignment algorithms: a survey. In Proc. of the Conf. on Information Science, Technology and Applications (ISTA), pages 96-103, 2009.

[11]

E. Hunt, M. P. Atkinson, and R. W. Irving. Database indexing for large DNA and protein sequence collections. The VLDB Journal, 11:256-271, 2002.

[12]

M. O. Kulekci, J. S. Vitter, and B. Xu. Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 9(2):421-429, 2012.

[13]

S. Kurtz. The Vmatch large scale sequence analysis software. Center for Bioinformatics, University of Hamburg, 2011.

[14]

S. Kurtz and C. Schleiermacher. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics, 15(5):426-427, 1999.

[15]

Y. Ledeneva, A. Gelbukh, and R. A. García-Hernández. Terms derived from frequent sequences for extractive text summarization. In Proc. of the Int. Conf. on Computational linguistics and intelligent text processing, pages 593-604, 2008.

[16]

M. Linhard. Data structure for representation of maximal repeats in strings. Master's thesis, Comenius University, 2007.

[17]

E. Mansour, A. Allam, S. Skiadopoulos, and P. Kalnis. ERA: efficient serial and parallel suffix tree construction for very long strings. Proc. VLDB Endow., 5(1):49-60, Sept. 2011.

[18]

S. Maruyama, H. Miyagawa, and H. Sakamoto. Improving time and space complexity for compressed pattern matching. In the Int. Symposium on Algorithms and Computation (ISAAC), pages 484-493, 2006.

[19]

E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of ACM, 23:262-272, 1976.

[20]

B. Phoophakdee and M. J. Zaki. Genome-scale disk-based suffix tree indexing. In Proc. of the ACM Int. Conf. on Management of data (SIGMOD), pages 833-844, 2007.

[21]

V. Saxena, Y. Sabharwal, and P. Bhatotia. Performance evaluation and optimization of random memory access on multicores with high productivity. In Proc. of the Int. Conf. on High Performance Computing (HiPC), pages 1-10, 2010.

[22]

H. Stehr. Constrained matching of large biological sequences. Master's thesis, University Hamburg, 2005.

[23]

A. Udechukwu, K. Barker, and R. Alhajj. Discovering all frequent trends in time series. In Proc. of the winter Int. symposium on Information and communication technologies (WISICT), pages 1-6, 2004.

[24]

N. Whiteford, N. Haslam, G. Weber, A. Prugel-Bennett, J. Essex, and C. Neylon. Visualising the repeat structure of genomic sequences. Complex Systems, 17(4):381-398, April 2008.

RACE: a scalable and elastic parallel system for discovering repeats in very long sequences
1. Computer systems organization
2. General and reference
  1. Cross-computing tools and techniques

Recommendations

Deterministic OpenMP for race-free parallelism
HotPar'11: Proceedings of the 3rd USENIX conference on Hot topic in parallelism

Recent deterministic parallel programming models show promise for their ability to replay computations and reproduce bugs, but they currently require the programmer to adopt restrictive or unfamiliar parallel constructs. Deterministic OpenMP (DOMP) is a ...
Data race: tame the beast

Data races hamper parallel programming and threaten the reliability of future software. This paper proposes the data race prevention scheme View-Oriented Data race Prevention (VODAP), which can prevent data races in the View-Oriented Parallel ...
Provably Good and Practically Efficient Parallel Race Detection for Fork-Join Programs
SPAA '16: Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures

If a parallel program has determinacy race(s), different schedules can result in memory accesses that observe different values --- various race-detection tools have been designed to find such bugs. A key component of race detectors is an algorithm for ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 6, Issue 10

August 2013

180 pages

ISSN:2150-8097

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 August 2013

Published in PVLDB Volume 6, Issue 10

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
130
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 12 Nov 2024

Other Metrics

View Author Metrics

Citations

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents