Abstract
With the rapid development of genome sequencing technology, sequencing has become faster and more affordable. Consequently, genomic scientists now face an explosive increase in genomic data. Managing, storing, and analyzing this rapidly growing volume of data is challenging, so it is desirable to apply compression techniques to reduce storage and transfer costs. Referential genome compression is one such technique: it exploits the high similarity between genomes of the same or evolutionarily close species (e.g., two randomly selected humans share at least 99% genetic similarity) and stores only the differences between the genome being compressed and a well-known reference genome sequence. In this paper, we port two referential compression algorithms to the Loongson platform and profile their performance. We also use multi-process techniques to improve compression speed.
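The idea of storing only the differences against a reference can be illustrated with a minimal sketch. This is not either of the two algorithms ported in the paper; it is a simple greedy scheme, assuming a hypothetical seed length `K` and hypothetical helper names (`build_index`, `compress`, `decompress`), in which the target is encoded as match entries pointing into the reference plus literal runs for bases with no match.

```python
from typing import Dict, List, Tuple, Union

# An entry is either ("M", ref_pos, length) for a stretch copied from the
# reference, or ("L", text) for literal bases that had no usable match.
Entry = Union[Tuple[str, int, int], Tuple[str, str]]

K = 4  # hypothetical seed (k-mer) length; real tools use larger values


def build_index(ref: str, k: int = K) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it occurs."""
    index: Dict[str, List[int]] = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    return index


def compress(target: str, ref: str, k: int = K) -> List[Entry]:
    """Greedily encode target as matches against ref plus literal runs."""
    index = build_index(ref, k)
    out: List[Entry] = []
    literal: List[str] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the next k-mer and keep
        # the longest extension.
        for pos in index.get(target[i:i + k], []):
            n = 0
            while (i + n < len(target) and pos + n < len(ref)
                   and target[i + n] == ref[pos + n]):
                n += 1
            if n > best_len:
                best_pos, best_len = pos, n
        if best_len >= k:
            if literal:  # flush pending literal bases first
                out.append(("L", "".join(literal)))
                literal = []
            out.append(("M", best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        out.append(("L", "".join(literal)))
    return out


def decompress(entries: List[Entry], ref: str) -> str:
    """Rebuild the target from the entries and the same reference."""
    parts: List[str] = []
    for e in entries:
        if e[0] == "M":
            parts.append(ref[e[1]:e[1] + e[2]])
        else:
            parts.append(e[1])
    return "".join(parts)


if __name__ == "__main__":
    ref = "ACGTTAGCCGATAGGCTAAC"
    target = "ACGTTAGCCGTTAGGCTAAC"  # differs from ref by a single base
    entries = compress(target, ref)
    print(entries)
    print(decompress(entries, ref) == target)
```

In a scheme like this, separate target chunks can be encoded against the same reference independently of one another, which is one plausible way to split the work across processes; the abstract does not state how the ported tools divide the work.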
Acknowledgment
This research was jointly supported by the Shenzhen Science & Technology Foundation (JCYJ20150930105133185), the National Natural Science Foundation of China (NSF/GDU1301252), the State Key Laboratory of Computer Architecture, ICTCA (CARCH 201405), and the Guangdong Province Key Laboratory Project (2012A061400024).
Copyright information
© 2017 Springer Nature Singapore Pte Ltd
About this paper
Cite this paper
Du, Z., Guo, C., Zhang, Y., Luo, Q. (2017). Porting Referential Genome Compression Tool on Loongson Platform. In: Chen, G., Shen, H., Chen, M. (eds) Parallel Architecture, Algorithm and Programming. PAAP 2017. Communications in Computer and Information Science, vol 729. Springer, Singapore. https://doi.org/10.1007/978-981-10-6442-5_43
DOI: https://doi.org/10.1007/978-981-10-6442-5_43
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6441-8
Online ISBN: 978-981-10-6442-5
eBook Packages: Computer Science, Computer Science (R0)