Nothing Special   »   [go: up one dir, main page]

skip to main content
article

RACE: a scalable and elastic parallel system for discovering repeats in very long sequences

Published: 01 August 2013 Publication History

Abstract

A wide range of applications, including bioinformatics, time series, and log analysis, depend on the identification of repetitions in very long sequences. The problem of finding maximal pairs subsumes most important types of repetition-finding tasks. Existing solutions require both the input sequence and its index (typically an order of magnitude larger than the input) to fit in memory. Moreover, they are serial algorithms with long execution time. Therefore, they are limited to small datasets, despite the fact that modern applications demand orders of magnitude longer sequences.
In this paper we present RACE, a parallel system for finding maximal pairs in very long sequences. RACE supports parallel execution on stand-alone multicore systems, in addition to scaling to thousands of nodes on clusters or supercomputers. RACE does not require the input or the index to fit in memory; therefore, it supports very long sequences with limited memory. Moreover, it uses a novel array representation that allows for cache-efficient implementation. RACE is particularly suitable for the cloud (e.g., Amazon EC2) because, based on availability, it can scale elastically to more or fewer machines during its execution. Since scaling out introduces overheads, mainly due to load imbalance, we propose a cost model to estimate the expected speedup, based on statistics gathered through sampling. The model allows the user to select the appropriate combination of cloud resources based on the provider's prices and the required deadline. We conducted extensive experimental evaluation with large real datasets and large computing infrastructures. In contrast to existing methods, RACE can handle the entire human genome on a typical desktop computer with 16GB RAM. Moreover, for a problem that takes 10 hours of serial execution, RACE finishes in 28 seconds using 2,048 nodes on an IBM BlueGene/P supercomputer.

References

[1]
M. Abouelhoda, S. Kurtz, and E. Ohlebusch. Enhanced Suffix Arrays and Application, Handbook of Computational Molecular Biology (Chapter 7). Information Science Series. 2006.
[2]
A. Anandasivam. Consumer Preferences and Bid-Price Control for Cloud Services. PhD thesis, Faculty of Economics, Karlsruhe Institute of Technology (KIT), 2010.
[3]
N. Askitis and R. Sinha. RepMaestro: scalable repeat detection on disk-based genome sequences. Bioinformatics, 26(19):2368-2374, 2010.
[4]
T. Beller, K. Berger, and E. Ohlebusch. Space-efficient computation of maximal and supermaximal repeats in genome sequences. In Proc. of the Int. Conf. on String Processing and Information Retrieval (SPIRE), pages 99-110, 2012.
[5]
R. P. J. C. Bose and W. M. P. van der Aalst. Abstractions in process mining: A taxonomy of patterns. In Proc. of the Int. Conf. on Business Process Management (BPM), pages 159-175, 2009.
[6]
G. S. Brodal, R. B. Lyngsø, C. N. S. Pedersen, and J. Stoye. Finding maximal pairs with bounded gap. In Proc. of the Annual Symposium on Combinatorial Pattern Matching (CPM), pages 134-149. 1999.
[7]
D. L. Eager, J. Zahorjan, and E. D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Trans. Comput., 38(3):408-423, 1989.
[8]
A. Ghoting and K. Makarychev. Serial and parallel methods for i/o efficient suffix tree construction. In Proc. of the ACM Int. Conf. on Management of data (SIGMOD), pages 827-840. ACM, 2009.
[9]
D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997.
[10]
W. Haque, A. Aravind, and B. Reddy. Pairwise sequence alignment algorithms: a survey. In Proc. of the Conf. on Information Science, Technology and Applications (ISTA), pages 96-103, 2009.
[11]
E. Hunt, M. P. Atkinson, and R. W. Irving. Database indexing for large DNA and protein sequence collections. The VLDB Journal, 11:256-271, 2002.
[12]
M. O. Kulekci, J. S. Vitter, and B. Xu. Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 9(2):421-429, 2012.
[13]
S. Kurtz. The Vmatch large scale sequence analysis software. Center for Bioinformatics, University of Hamburg, 2011.
[14]
S. Kurtz and C. Schleiermacher. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics, 15(5):426-427, 1999.
[15]
Y. Ledeneva, A. Gelbukh, and R. A. García-Hernández. Terms derived from frequent sequences for extractive text summarization. In Proc. of the Int. Conf. on Computational linguistics and intelligent text processing, pages 593-604, 2008.
[16]
M. Linhard. Data structure for representation of maximal repeats in strings. Master's thesis, Comenius University, 2007.
[17]
E. Mansour, A. Allam, S. Skiadopoulos, and P. Kalnis. ERA: efficient serial and parallel suffix tree construction for very long strings. Proc. VLDB Endow., 5(1):49-60, Sept. 2011.
[18]
S. Maruyama, H. Miyagawa, and H. Sakamoto. Improving time and space complexity for compressed pattern matching. In the Int. Symposium on Algorithms and Computation (ISAAC), pages 484-493, 2006.
[19]
E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of ACM, 23:262-272, 1976.
[20]
B. Phoophakdee and M. J. Zaki. Genome-scale disk-based suffix tree indexing. In Proc. of the ACM Int. Conf. on Management of data (SIGMOD), pages 833-844, 2007.
[21]
V. Saxena, Y. Sabharwal, and P. Bhatotia. Performance evaluation and optimization of random memory access on multicores with high productivity. In Proc. of the Int. Conf. on High Performance Computing (HiPC), pages 1-10, 2010.
[22]
H. Stehr. Constrained matching of large biological sequences. Master's thesis, University Hamburg, 2005.
[23]
A. Udechukwu, K. Barker, and R. Alhajj. Discovering all frequent trends in time series. In Proc. of the winter Int. symposium on Information and communication technologies (WISICT), pages 1-6, 2004.
[24]
N. Whiteford, N. Haslam, G. Weber, A. Prugel-Bennett, J. Essex, and C. Neylon. Visualising the repeat structure of genomic sequences. Complex Systems, 17(4):381-398, April 2008.
  1. RACE: a scalable and elastic parallel system for discovering repeats in very long sequences

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image Proceedings of the VLDB Endowment
      Proceedings of the VLDB Endowment  Volume 6, Issue 10
      August 2013
      180 pages

      Publisher

      VLDB Endowment

      Publication History

      Published: 01 August 2013
      Published in PVLDB Volume 6, Issue 10

      Qualifiers

      • Article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • 0
        Total Citations
      • 130
        Total Downloads
      • Downloads (Last 12 months)0
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 12 Nov 2024

      Other Metrics

      Citations

      View Options

      Get Access

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media