Pattern Recognition in Computational Molecular Biology: Techniques and Approaches

Ebook, 1,329 pages

About this ebook

A comprehensive overview of high-performance pattern recognition techniques and approaches to Computational Molecular Biology

This book surveys the development of pattern recognition techniques and approaches related to Computational Molecular Biology. Providing broad coverage of the field, the authors cover fundamental and technical information on these techniques and approaches, as well as discussing their related problems. The text consists of twenty-nine chapters, organized into seven parts: Pattern Recognition in Sequences, Pattern Recognition in Secondary Structures, Pattern Recognition in Tertiary Structures, Pattern Recognition in Quaternary Structures, Pattern Recognition in Microarrays, Pattern Recognition in Phylogenetic Trees, and Pattern Recognition in Biological Networks.

  • Surveys the development of techniques and approaches on pattern recognition in biomolecular data
  • Discusses pattern recognition in primary, secondary, tertiary and quaternary structures, as well as microarrays, phylogenetic trees and biological networks
  • Includes case studies and examples to further illustrate the concepts discussed in the book
Pattern Recognition in Computational Molecular Biology: Techniques and Approaches is a reference for practitioners and professional researchers in Computer Science, Life Science, and Mathematics. This book also serves as supplementary reading for graduate students and young researchers interested in Computational Molecular Biology.
Language: English
Publisher: Wiley
Release date: Dec 24, 2015
ISBN: 9781119078869

    Pattern Recognition in Computational Molecular Biology - Mourad Elloumi

    List of Contributors

    Andrej Aderhold, School of Biology, University of St Andrews, St Andrews, UK

    Lefteris Angelis, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece

    Miguel Arenas, Bioinformatics Unit, Centre for Molecular Biology Severo Ochoa (CSIC), Madrid, Spain; Institute of Molecular Pathology and Immunology, University of Porto (IPATIMUP), Porto, Portugal

    Dunarel Badescu, Département d'informatique, Université du Québec à Montréal, Succ. Centre-Ville, Montréal, Québec, Canada

    Carl Barton, The Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK

    Sima Behpour, Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA

    Paola Bertolazzi, Institute of Systems Analysis and Computer Science A. Ruberti, National Research Council, Rome, Italy

    Chengpeng Bi, Bioinformatics and Intelligent Computing Lab, Division of Clinical Pharmacology, Children's Mercy Hospitals, Kansas City, MO, USA

    Kevin Byron, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA

    Weiwen Cai, Institute of Applied Genomics, College of Biological Science and Engineering, Fuzhou University, Fuzhou, China

    Virginio Cantoni, Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata, Pavia, Italy

    Kuo-Chen Chou, Gordon Life Science Institute, Belmont, MA, USA; King Abdulaziz University, Jeddah, Saudi Arabia

    Matteo Comin, Department of Information Engineering, University of Padova, Padova, Italy

    David Dao, Karlsruhe Institute of Technology, Institute for Theoretical Informatics, Postfach, Karlsruhe, Germany

    Bhaskar DasGupta, Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA

    Wajdi Dhifli, Clermont Université, Université Blaise Pascal, LIMOS, BP, Clermont-Ferrand, France; CNRS, UMR, LIMOS, Aubière, France; Département d'informatique, Université du Québec à Montréal, Succ. Centre-Ville, Montréal, Québec, Canada

    Carlotta Domeniconi, Department of Computer Science, George Mason University, Fairfax, VA, USA

    Maryam Faridounnia, Bijvoet Center for Biomolecular Research, Utrecht University, Utrecht, The Netherlands

    Giovanni Felici, Institute of Systems Analysis and Computer Science A. Ruberti, National Research Council, Rome, Italy

    Marco Ferretti, Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata, Pavia, Italy

    Giulia Fiscon, Institute of Systems Analysis and Computer Science A. Ruberti, National Research Council, Rome, Italy; Department of Computer, Control and Management Engineering, Sapienza University, Rome, Italy

    Tomáš Flouri, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

    Adelaide Freitas, Department of Mathematics, and Center for Research & Development in Mathematics and Applications (CIDMA), University of Aveiro, Aveiro, Portugal

    Valentina Fustaino, Cellular Biology and Neurobiology Institute, National Research Council, Rome, Italy; Institute of Systems Analysis and Computer Science A. Ruberti, National Research Council, Rome, Italy

    Yann Guermeur, LORIA, Université de Lorraine-CNRS, Nancy, France

    Michael Hall, Department of Computing Science, University of Alberta, Edmonton, Canada

    Robert Harrison, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Dirk Husmeier, School of Mathematics and Statistics, University of Glasgow, Glasgow, UK

    Costas S. Iliopoulos, Department of Informatics, King's College London, London, United Kingdom

    Samad Jahandideh, Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, La Jolla, CA, USA

    Giuseppe Lancia, Dipartimento di Matematica e Informatica, University of Udine, Udine, Italy

    Fabien Lauer, LORIA, Université de Lorraine-CNRS, Nancy, France

    Vladimir Makarenkov, Département d'informatique, Université du Québec à Montréal, Succ. Centre-Ville, Montréal, Québec, Canada

    Christos Makris, Department of Computer Engineering and Informatics, University of Patras, Patras, Greece

    Mina Maleki, School of Computer Science, University of Windsor, Windsor, Canada

    Diego Mallo, Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain

    David A. Morrison, Systematic Biology, Uppsala University, Norbyvägen, Uppsala, Sweden

    Ahmed Moussa, LabTIC Laboratory, ENSA, Abdelmalek Essaadi University, Tangier, Morocco

    Sami Muhaidat, ECE Department, Khalifa University, Abu Dhabi, UAE; EE Department, University of Surrey, Guildford, UK

    Mirto Musci, Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata, Pavia, Italy

    Engelbert Mephu Nguifo, LIMOS–Blaise Pascal University–Clermont University, Clermont-Ferrand, France; LIMOS–CNRS UMR, Aubière, France

    Stavroula Ntoufa, Hematology Department and HCT Unit, G. Papanikolaou Hospital, Thessaloniki, Greece; Institute of Applied Biosciences, C.E.R.TH, Thessaloniki, Greece

    Nahumi Nugrahaningsih, Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 3, 27100 Pavia, Italy

    Hasan Oğul, Department of Computer Engineering, Başkent University, Ankara, Turkey

    Nikos Papakonstantinou, Hematology Department and HCT Unit, G. Papanikolaou Hospital, Thessaloniki, Greece; Institute of Applied Biosciences, C.E.R.TH, Thessaloniki, Greece

    Sheng-Lung Peng, Department of Computer Science and Information Engineering, National Dong Hwa University, Hualien, Taiwan

    Solon P. Pissis, Department of Informatics, King's College London, London, United Kingdom

    Huzefa Rangwala, Department of Computer Science, George Mason University, Fairfax, VA, USA

    Sara Roque, Department of Mathematics, University of Aveiro, Aveiro, Portugal

    Luis Rueda, School of Computer Science, University of Windsor, Windsor, Canada

    Agustín Sánchez-Cobos, Bioinformatics Unit, Centre for Molecular Biology Severo Ochoa (CSIC), Madrid, Spain

    Maad Shatnawi, Higher Colleges of Technology, Abu Dhabi, UAE

    Soheila Shokrollahzade, Department of Medicinal Biotechnology, Iran University of Medical Sciences, Tehran, Iran

    Carina Silva, Lisbon School of Health Technology, Lisbon, Portugal; Center of Statistics and Applications of Lisbon University (CEAUL), Lisbon, Portugal

    Tiratha Raj Singh, Biotechnology and Bioinformatics Department, Jaypee University of Information and Technology (JUIT), Solan, Himachal Pradesh, India

    V Anne Smith, School of Biology, University of St Andrews, St Andrews, UK

    Jiangning Song, National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China; Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, VIC, Australia

    Yang Song, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA

    Lisete Sousa, Department of Statistics and Operations Research, Lisbon University, and Center of Statistics and Applications of Lisbon University (CEAUL), Lisbon, Portugal

    Alexandros Stamatakis, Karlsruhe Institute of Technology, Institute for Theoretical Informatics, Postfach, Karlsruhe, Germany; Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

    Kostas Stamatopoulos, Hematology Department and HCT Unit, G. Papanikolaou Hospital, Thessaloniki, Greece; Institute of Applied Biosciences, C.E.R.TH, Thessaloniki, Greece

    Kamal Taha, ECE Department, Khalifa University, Abu Dhabi, UAE

    Nadia Tahiri, Département d'informatique, Université du Québec à Montréal, Succ. Centre-Ville, Montréal, Québec, Canada

    Li Teng, Department of Internal Medicine, University of Iowa, Iowa City, IA, USA

    Evangelos Theodoridis, Computer Technology Institute and Press Diophantus, Patras, Greece

    Athina Tsanousa, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece

    Yu-Wei Tsay, Institute of Information Science, Academia Sinica, Taipei, Taiwan

    Chinua Umoja, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Brigitte Vannier, Receptors, Regulation and Tumor Cells (2RTC) Laboratory, University of Poitiers, Poitiers, France

    Meghana Vasavada, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA

    Akbar Vaseghi, Department of Genetics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran

    Davide Verzotto, Computational and Systems Biology, Genome Institute of Singapore, Singapore

    Jason T.L. Wang, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA

    Emanuel Weitschek, Department of Engineering, Uninettuno International University, Rome, Italy; Institute of Systems Analysis and Computer Science A. Ruberti, National Research Council, Rome, Italy

    Renxiang Yan, Institute of Applied Genomics, College of Biological Science and Engineering, Fuzhou University, Fuzhou, China

    Paul D. Yoo, Department of Computing and Informatics, Bournemouth University, UK; Centre for Distributed and High Performance Computing, University of Sydney, Sydney, Australia

    Guoxian Yu, College of Computer and Information Science, Southwest University, Chongqing, China

    Xiaxia Yu, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Ziding Zhang, State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, China

    Albert Y. Zomaya, Centre for Distributed and High Performance Computing, University of Sydney, Sydney, Australia

    Preface

    Pattern recognition is the automatic identification of regularities, that is, figures, characters, shapes, and forms, present in data. Pattern recognition is the core process of many scientific discoveries, whereby researchers detect regularities in large amounts of data in fields as diverse as Geology, Physics, Astronomy, and Molecular Biology. Pattern recognition in biomolecular data is at the core of Molecular Biology research and makes a very important contribution to the analysis of such data. By identifying patterns shared by structures related to biological macromolecules, that is, DNA, RNA, and proteins, originating from several different organisms, it can reveal information about the shared biological functions of these macromolecules. These patterns, which have been conserved during evolution, often play an important structural and/or functional role and, consequently, shed light on the mechanisms and the biological processes in which these macromolecules participate. Pattern recognition in biomolecular data is also used in evolutionary studies, in order to analyze the relationships that exist between species, to establish whether two or more biological macromolecules are homologous, that is, have a common biological ancestor, and to reconstruct the phylogenetic tree that links them to this ancestor.

    With the new sequencing technologies, the number of biological sequences in databases is increasing exponentially, and these sequences are long. Hence, the recognition of patterns in such databases requires the development of fast, low-memory, high-performance techniques and approaches. This book provides an up-to-date forum of such techniques and approaches that deal with the most studied, the most important, and/or the newest topics in the field of pattern recognition.
    Some of these techniques and approaches represent improvements on older ones, while others are completely new. Most current books on pattern recognition in biomolecular data either lack technical depth or focus on specific, narrow topics. This book is the first overview of techniques and approaches for pattern recognition in biomolecular data that offers both broad coverage of the field and enough depth to be of practical use to working professionals. It surveys the most recent developments, offering enough fundamental and technical information on these techniques and approaches and the related problems without overloading the reader. This book will thus be invaluable not only for practitioners and professional researchers in Computer Science, Life Science, and Mathematics but also for graduate students and young researchers looking for promising directions in their work. It will certainly point them to new techniques and approaches that may be the key to new and important discoveries in Molecular Biology.

    This book is organized into seven parts: Pattern Recognition in Sequences, Pattern Recognition in Secondary Structures, Pattern Recognition in Tertiary Structures, Pattern Recognition in Quaternary Structures, Pattern Recognition in Microarrays, Pattern Recognition in Phylogenetic Trees, and Pattern Recognition in Biological Networks. The 29 chapters that make up the seven parts of this book were carefully selected to provide wide scope with minimal overlap between the chapters. Each contributor was asked to cover review material as well as current developments in his/her chapter. In addition, the authors were chosen from among the leaders in their respective fields.

    Mourad Elloumi

    Tunis, Tunisia

    Costas S. Iliopoulos

    London, UK

    Jason T. L. Wang

    Newark, USA

    Albert Y. Zomaya

    Sydney, Australia

    November 1, 2015

    Part I

    Pattern Recognition in Sequences

    Chapter 1

    Combinatorial Haplotyping Problems

    Giuseppe Lancia

    Dipartimento di Matematica e Informatica, University of Udine, Udine, Italy

    1.1 Introduction

    A few years back, the process of collecting a vast amount of genomic data culminated with the completion of the Human Genome Project [22]. The outcome of the project has brought the confirmation that the genetic makeup of humans (as well as other species) is remarkably well conserved. Generally speaking, the differences at DNA level between any two individuals amount to less than 5% of their genomic sequences. As a consequence, the differences at the phenotype level (i.e., in the way the organisms look) must be caused by small regions of differences in the genomes. The smallest possible region consists of a single nucleotide and is called a Single Nucleotide Polymorphism (SNP, pronounced snip). SNPs are the predominant form of human genetic variation and their importance can hardly be overestimated; they are used, for example, in medical, drug-design, diagnostic, and forensic applications.

    Broadly speaking, a polymorphism is a trait common to everybody that can take different values, ranging in a limited set of possibilities, called alleles (for simple examples, think of blood group or eye color). In particular, a SNP is a nucleotide site, placed in the middle of a DNA region that is otherwise identical in everybody, at which we observe a statistically significant variability. A SNP is almost always a polymorphism with only two alleles (out of the four possible). For a nucleotide site to be considered a SNP, it must be the case that the less frequent allele is found in the population with some nonnegligible frequency.

    For many living species (e.g., mammals), there exist two sexes and each individual has two parents, one of each sex. The genome of these species is organized in pairs of chromosomes and a single copy of each chromosome pair is inherited from each of the two parents. Organisms in which the genome is organized in pairs of homologous chromosomes are called diploid organisms. For a diploid organism, at each SNP, an individual can either be homozygous (i.e., possess the same allele on both chromosomes) or heterozygous (i.e., possess two different alleles). The values of a set of SNPs on a particular chromosome copy define a haplotype.

    In Figure 1.1, we illustrate a simplified example, showing a specific chromosome in three individuals. For each individual, both of her chromosome copies are reported. There are four SNPs. The alleles for SNP 1, in this example, are C and G, while for SNP 4 they are T and C. Individual 1 is heterozygous for SNPs 1, 2, and 3, and homozygous for SNP 4. Her haplotypes are CCCT and GAGT. The haplotypes of individual 3 are CAGT and GACC.
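    The classification of sites as homozygous or heterozygous follows directly from comparing the two haplotypes position by position. A minimal illustrative sketch in Python (the function name is ours, not from the text):

```python
def classify_sites(hap1: str, hap2: str) -> list:
    """Label each SNP site 'hom' (same allele on both copies) or 'het'."""
    assert len(hap1) == len(hap2)
    return ["hom" if a == b else "het" for a, b in zip(hap1, hap2)]

# Individual 1 from Figure 1.1: haplotypes CCCT and GAGT.
print(classify_sites("CCCT", "GAGT"))  # ['het', 'het', 'het', 'hom']
```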

    Figure 1.1 A chromosome in three individuals. There are four SNPs.

    Haplotyping an individual consists in determining her two haplotypes for a given chromosome. With the increasing availability of SNP genomic data, recent years have seen the birth of many new computational problems related to haplotyping. These problems are motivated by the fact that it can be difficult and/or very expensive to determine the haplotypes experimentally, so ad hoc algorithms have been used to correct data errors or to infer missing data.

    In the remainder of this chapter, we will address the haplotyping problem for both a single individual and a set of individuals (a population). In the former case, described in Section 1.2, the input is haplotype data inconsistent with the existence of exactly two parents for an individual. This inconsistency is due to experimental errors and/or missing data. In the latter case, described in Section 1.3, the input data specify for each SNP and each individual in a population whether her parents have contributed the same or different alleles, but do not specify which parent contributed which allele.

    1.2 Single Individual Haplotyping

    The process of passing from the sequence of nucleotides in a DNA molecule to a string over the DNA alphabet is called sequencing. A sequencer is a machine that is fed some DNA and whose output is a string of As, Ts, Cs, and Gs. To each letter, the sequencer attaches a value (confidence level) that essentially represents the probability that the letter has been correctly read.

    The main problem with sequencing is that, owing to technological limitations, it is not possible to sequence a long DNA molecule at once. What we can do, however, is to sequence short DNA fragments (also called reads), of about 1000 nucleotides each, which provide small windows to look at the target molecule. To sequence a long DNA molecule, the molecule must first be replicated (amplified) by creating many copies of it. These copies are then broken, at random, into several smaller fragments, which are fed to a sequencer that will sequence those of the right size. The amplification phase is necessary so that the reads can have nonempty overlap. From the overlap of two reads, one may infer (through a process called assembly) a longer fragment, and so on, until the original DNA sequence has been reconstructed. This is, in essence, the principle of shotgun sequencing, a method used by Celera Genomics in the late 1990s to complete the sequencing of the human genome faster than the other experimental techniques of the time [62]. In shotgun sequencing, the fragments were read and then assembled back into the original sequence by using sophisticated algorithms and powerful computers.

    In Figure 1.2, we illustrate an example in which the two chromosome copies (a) and (b) have been amplified, and then some fragments (denoted by rectangular boxes) have been sequenced. The goal would then be to retrieve (a) and (b), given the set of sequenced fragments. The major difficulty obstructing this goal is that, during the amplification phase, both the paternal and the maternal chromosome copies are amplified together so that the random reads can belong to either the paternal or the maternal original copy. Of course, it is unknown which fragments are paternal and which are maternal, and one of the problems in reconstructing the haplotypes consists in segregating the paternal fragments from the maternal ones. In some cases, there may be pairs of reads, called mate pairs, for which it is known that they are either both paternal or both maternal. Mate pairs are due to a particular technique that allows sequencing of the two ends of a fragment several thousand nucleotides long. Again, one can only read about 1000 nucleotides at each end of the target, but the result is stronger than reading two individual fragments, as now it is known that the two reads come from the same chromosome copy. Furthermore, the experiment returns a fairly precise estimate of the distance, expressed in bases, between the two reads of a mate pair.


    Figure 1.2 Sequence reads and assembly of the two chromosome copies.

    Even with the best possible technology, sequencing errors are unavoidable. The main errors consist in bases that have been miscalled or skipped altogether. Furthermore, contaminants can be present, that is, DNA coming from organisms other than the one that had to be sequenced. Owing to the experimental errors, the reconstruction of the haplotypes, given the reads coming from sequencing, is not always straightforward and may require correction of the input data. In a general way, the haplotyping problem for an individual can then be informally stated as follows:

    Given inconsistent haplotype data coming from sequencing of an individual's chromosome, find and correct the errors in the data so as to retrieve a consistent pair of haplotypes.

    Depending on what type of errors one is after, there can be many versions of this problem (for surveys on individual haplotyping problems, see, e.g., Schwartz [59] or Zhang et al. [70]). Historically, the formalization of the first haplotyping problems for an individual was given by Lancia et al. [47]. In their work, the Minimum Fragment Removal (MFR) and Minimum SNP Removal (MSR) problems were introduced, which we briefly discuss here.

    Given the fact that at each SNP only two alleles are possible, we can encode them by using a binary alphabet. Hence, in the sequel, the two values that a SNP can take are denoted by 0 and 1. A haplotype is then simply a string over the alphabet {0, 1}.

    The basic framework for a single individual haplotyping problem is as follows. There is a set S = {1, …, n} of SNPs and a set F = {1, …, m} of fragments (i.e., reads coming from sequencing). Each SNP is covered by some of the fragments and can take the values 0 or 1. Since there is a natural ordering of the SNPs, given by their physical location on the chromosome, the data can be represented by an m × n matrix M over the alphabet {0, 1, -}, called the SNP matrix. Each column corresponds to a SNP and each row corresponds to a fragment. If fragment i covers a SNP j, then M[i, j] is the value of the allele for SNP j appearing in fragment i. The symbol "-" is used to represent a SNP not covered by a fragment (see Figure 1.3a and b for an example of a SNP matrix).


    Figure 1.3 SNP matrix and conflict graphs.

    A gapless fragment is one covering a set of consecutive SNPs (i.e., the 0s and 1s appear consecutively in that row). We say that a fragment has k gaps if it covers k + 1 blocks of consecutive SNPs (e.g., the fragment 00--101---01 has two gaps). Gaps are mainly due to two reasons: (i) thresholding of low-quality reads (when the sequencer cannot call a SNP 0 or 1 with enough confidence, the SNP is marked with a "-"); (ii) mate-pairing in shotgun sequencing (in this case k = 1, and the situation is equivalent to having two gapless fragments coming from the same chromosome copy).
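    Counting gaps amounts to counting the maximal runs of "-" that lie strictly between covered positions; leading and trailing "-" symbols denote SNPs outside the fragment's window and are not gaps. A small sketch (assuming fragments are given as strings over 0/1/-; the function name is ours):

```python
def count_gaps(fragment: str) -> int:
    """Count maximal runs of '-' between the covered SNPs of a fragment."""
    core = fragment.strip("-")  # ignore positions outside the covered window
    gaps, in_gap = 0, False
    for ch in core:
        if ch == "-":
            if not in_gap:
                gaps, in_gap = gaps + 1, True
        else:
            in_gap = False
    return gaps

print(count_gaps("00--101---01"))  # 2 (the fragment from the example)
```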

    Two fragments i and j are said to be in conflict if there exists a SNP k such that M[i, k] ≠ -, M[j, k] ≠ -, and M[i, k] ≠ M[j, k]. The conflict of two fragments implies that either the fragments do not come from the same chromosome copy or there are errors in the data. Given a SNP matrix M, the fragment conflict graph is the graph G_F(M) = (F, E_F), with an edge for each pair of fragments in conflict (see Figure 1.3c). Two SNPs k and l are said to be in conflict if there exist two fragments i and j such that the 2 × 2 submatrix defined by rows i and j and columns k and l contains three 0s and one 1, or three 1s and one 0. Given a SNP matrix M, the SNP conflict graph is the graph G_S(M) = (S, E_S), with an edge for each pair of SNPs in conflict (see Figure 1.3d).
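    Both conflict relations can be computed directly from the SNP matrix. A minimal sketch (function names are ours), representing the matrix as a list of strings over {'0', '1', '-'}:

```python
from itertools import combinations

def fragment_conflicts(M):
    """Edges of the fragment conflict graph: row pairs that disagree
    on some SNP covered by both."""
    return [(i, j) for i, j in combinations(range(len(M)), 2)
            if any(a != "-" and b != "-" and a != b
                   for a, b in zip(M[i], M[j]))]

def snp_conflicts(M):
    """Edges of the SNP conflict graph: column pairs whose 2x2 submatrix
    on some row pair has three equal entries and one different (no '-')."""
    edges = []
    for k, l in combinations(range(len(M[0])), 2):
        for i, j in combinations(range(len(M)), 2):
            cells = [M[i][k], M[i][l], M[j][k], M[j][l]]
            if "-" not in cells and cells.count("1") in (1, 3):
                edges.append((k, l))
                break
    return edges

M = ["010", "001", "110"]
print(fragment_conflicts(M))  # [(0, 1), (0, 2), (1, 2)]
```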

    If G_F(M) is a bipartite graph, F can be segregated into two sets F1 and F2 of pairwise compatible fragments. From each set, one can infer one haplotype by fragment overlap. Let h1 and h2 be the haplotypes thus obtained. Since h1 and h2 correspond to the assembly of the fragments, the single individual haplotyping problems are sometimes also referred to as fragment assembly haplotyping problems.

    We call a SNP matrix M feasible if G_F(M) is bipartite and infeasible otherwise. Note that a SNP matrix for error-free data must be feasible. Hence, the optimization problems to be defined aim at correcting an infeasible SNP matrix so that it becomes feasible.
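    Feasibility is thus a bipartiteness test, and for error-free data the two colour classes of the fragment conflict graph assemble directly into the two haplotypes. A minimal sketch under that error-free assumption (names are ours; note that when the conflict graph has several connected components, the pairing of colour classes across components is arbitrary):

```python
from collections import deque

def is_feasible(M):
    """2-colour the fragment conflict graph of M; return the colouring,
    or None if an odd cycle makes the matrix infeasible."""
    def conflict(r, s):
        return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))
    colour = [None] * len(M)
    for start in range(len(M)):
        if colour[start] is not None:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(len(M)):
                if i != j and conflict(M[i], M[j]):
                    if colour[j] is None:
                        colour[j] = 1 - colour[i]
                        queue.append(j)
                    elif colour[j] == colour[i]:
                        return None  # odd cycle: infeasible
    return colour

def assemble(M, colour, side):
    """Overlap the fragments of one colour class into a haplotype."""
    hap = ["-"] * len(M[0])
    for row, c in zip(M, colour):
        if c == side:
            for k, a in enumerate(row):
                if a != "-":
                    hap[k] = a
    return "".join(hap)

M = ["01--", "--01", "10--", "--10"]             # error-free reads
col = is_feasible(M)
print(assemble(M, col, 0), assemble(M, col, 1))  # 0101 1010
```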

    The following are the first optimization problems defined with the above goal [47]. They arose at Celera Genomics in the context of sequencing the human genome.

    MFR: Given a SNP matrix, remove the minimum number of fragments (rows) so that the resulting matrix is feasible.

    MSR: Given a SNP matrix, remove the minimum number of SNPs (columns) so that the resulting matrix is feasible.

    The first problem is mainly suited for a situation in which, more than sequencing errors, one is worried about the presence of contaminants. The second problem is more suited in the presence of sequencing errors only, when all the fragments are to be retained. These problems were shown to be NP-hard in general [47] so that exact algorithms for their solution are expected to be exponential branch-and-bound procedures. Lippert et al. [53] described a combinatorial branch-and-bound algorithm for MFR. They also described an Integer Linear Programming (ILP) formulation of the problem, based on the correspondence between MFR and the maximum node-induced bipartite subgraph problem.
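    For intuition, both problems can be solved by exhaustive search on toy instances; the sketch below does this for MFR (our own illustrative code, exponential in the number of fragments, and nothing like the branch-and-bound or ILP approaches cited above):

```python
from itertools import combinations

def bipartite(M):
    """True iff the fragment conflict graph of SNP matrix M is bipartite."""
    def conflict(r, s):
        return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))
    colour = {}
    for start in range(len(M)):
        if start in colour:
            continue
        colour[start] = 0
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(M)):
                if i != j and conflict(M[i], M[j]):
                    if j not in colour:
                        colour[j] = 1 - colour[i]
                        stack.append(j)
                    elif colour[j] == colour[i]:
                        return False
    return True

def mfr_bruteforce(M):
    """Smallest set of rows whose removal leaves a feasible matrix."""
    for k in range(len(M) + 1):
        for drop in combinations(range(len(M)), k):
            if bipartite([M[i] for i in range(len(M)) if i not in drop]):
                return list(drop)

# Three pairwise-conflicting fragments form an odd cycle; removing any
# single one restores feasibility.
print(mfr_bruteforce(["011", "100", "0-0"]))  # [0]
```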

    While the general case for MFR and MSR is NP-hard, these problems were shown to be polynomial for gapless data [47]. Let us call M a gapless matrix if each row of M is a gapless fragment. The main connection between MFR and MSR is given by the following theorem.

    Theorem 1.1

    [47] Let M be a gapless SNP matrix. Then, G_F(M) is a bipartite graph if and only if the vertex set of G_S(M) is a stable set (i.e., G_S(M) has no edges).

    Later theoretical improvements extended these results to fragments with gaps of bounded total length, giving dynamic programming algorithms for MFR and MSR whose running times are polynomial in the input size but exponential in the total gap length (Rizzi et al. [57]; Bafna et al. [58]). These algorithms are hardly practical, however, on instances in which the gaps can be rather large. To overcome this problem, Xie and Wang [67] (see also Xie et al. [68]) proposed an algorithm for MFR based on two new parameters: k1, the maximum number of SNPs covered by any fragment, and k2, the maximum number of fragments covering any SNP site (also called the coverage). Their running time grows exponentially with k2 rather than with the gap length. Since k2 is generally at most 10, this method should be more suitable for mate-paired data, where the gaps can be quite large.

    1.2.1 The Minimum Error Correction Model

    Owing to the nature of the sequencing process, most data errors are due to miscalling or missing a base in a read. As a consequence, a particular version of the single individual haplotyping problem was proposed in Reference [53] and has received great attention because of its practical relevance. This version is called Minimum Error Correction (MEC) problem (the problem is also known as Minimum Letter Flip (MLF), see Reference [32]):

    MEC: Given a SNP matrix, change the minimum number of entries (0 into 1, or 1 into 0) so that the resulting matrix is feasible.
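    Since a feasible matrix is one whose rows can each be explained by one of two haplotypes, MEC can equivalently be stated as finding a haplotype pair that minimizes the total number of mismatches between each fragment and its nearer haplotype. The exhaustive sketch below (our own illustration, exponential in the number of SNPs, for toy instances only) uses that formulation:

```python
from itertools import product

def mec_bruteforce(M):
    """Minimum-error-correction cost and an optimal haplotype pair."""
    n = len(M[0])
    def dist(frag, hap):  # mismatches on covered positions only
        return sum(1 for a, b in zip(frag, hap) if a != "-" and a != b)
    best = None
    for h1 in product("01", repeat=n):
        for h2 in product("01", repeat=n):
            cost = sum(min(dist(f, h1), dist(f, h2)) for f in M)
            if best is None or cost < best[0]:
                best = (cost, "".join(h1), "".join(h2))
    return best

# One miscalled base in the third read forces a single correction.
print(mec_bruteforce(["010", "101", "110", "101"])[0])  # 1
```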

    Particularly interesting is a weighted version of MEC in which each entry is associated with a nonnegative weight proportional to the confidence level with which the base corresponding to the entry has been called. The solution seeks to flip a set of entries with minimum total weight so as to make the matrix feasible.

    The MEC problem was shown to be NP-hard, both for general [53] and for gapless matrices [20]. Many approaches were proposed in the literature for the solution of MEC. For a comparative survey of the various procedures, the reader is referred to Geraci [29] and to Geraci and Pellegrini [30].

    One of the first heuristics for MEC was FastHare by Panconesi and Sozio [56]. FastHare first sorts the fragments according to their leftmost ends and then reconstructs the two final haplotypes by correcting the sorted fragments in a greedy manner. Despite its simplicity, this heuristic is very fast and, in general, provides quite accurate solutions, especially when the error rate in the data is not too high. For high error-rate data, Genovese et al. proposed SpeedHap [28], an effective heuristic procedure organized in phases. In each phase, it performs three tasks: (i) detect likely positions of errors, (ii) allocate fragments to the two partially built final haplotypes, and (iii) decide the final alleles in the two haplotypes via a majority rule on ambiguous SNPs.
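    In the same spirit, though not a faithful implementation of FastHare or SpeedHap, a left-to-right greedy assembler can be sketched as follows: fragments are processed in order of their starting SNP, each is assigned to the partial haplotype class it agrees with more, and the final alleles are called by majority vote. All names and encodings here are illustrative.

```python
def greedy_assemble(fragments):
    """Greedy two-class fragment partitioning with majority-vote calling."""
    def start(f):
        return next((j for j, a in enumerate(f) if a != '-'), len(f))

    m = max(len(f) for f in fragments)
    # votes[k][j] = [count of 0s, count of 1s] assigned to haplotype k at SNP j
    votes = [[[0, 0] for _ in range(m)] for _ in range(2)]

    def agreement(f, k):
        # positive if the fragment agrees with the current consensus of class k
        s = 0
        for j, a in enumerate(f):
            if a == '-':
                continue
            zeros, ones = votes[k][j]
            s += (ones - zeros) if a == '1' else (zeros - ones)
        return s

    for f in sorted(fragments, key=start):
        k = 0 if agreement(f, 0) >= agreement(f, 1) else 1
        for j, a in enumerate(f):
            if a != '-':
                votes[k][j][int(a)] += 1

    # majority call per SNP; uncovered sites default to '0'
    return [''.join('1' if ones > zeros else '0' for zeros, ones in votes[k])
            for k in range(2)]
```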

    FAHR (Fast and Accurate Haplotype Reconstruction, by Wu and Liang [6]) is a somewhat similar greedy heuristic, which builds the final haplotypes by partitioning the fragments at each SNP in a greedy manner. Yet another heuristic for MEC is HMEC, proposed by Bayzid et al. [7]. HMEC is a steepest-descent local search procedure in which each iteration takes time c01-math-0058 . Because of its time complexity, HMEC may provide results slowly on instances with a large number of fragments.

    Zhao et al. [72] considered the weighted version of MLF (WMLF), for which they proposed a dynamic clustering heuristic. They also introduced an extended version of the problem to deal with the presence of contaminants: in this version, denoted Complete Weighted Minimum Letter Flip (CWMLF), one is allowed not only to flip entries of the SNP matrix but also to remove some of its rows. CWMLF, too, was attacked via dynamic clustering. Wu et al. [66] proposed a heuristic procedure for WMLF with performance similar to that of the best previous methods.

    Among the various types of approaches to the solution of MEC, there is the use of evolutionary algorithms such as Genetic Algorithms (GA) and Particle Swarm Optimization (PSO). Wang et al. [63] proposed both a branch-and-bound and a GA for MEC. Being exponential, the branch-and-bound is only applicable to instances of very small size, but the GA can be used for large instances (e.g., more than 50 fragments over more than 50 SNPs). A PSO heuristic for MEC was proposed by Kargar et al. [46]. PSO turns out to be fast and suitable for instances with a low error rate in the data.

    One of the most successful heuristics for MEC is HapCUT, proposed by Bansal and Bafna [3]. The algorithm makes use of a suitable graph obtained from the fragments, where each SNP corresponds to a node in the graph and two nodes are joined by an edge if there exists a fragment that covers both SNPs. The procedure then tries to minimize the MEC cost of the reconstructed haplotypes by iteratively finding max-cuts in the associated graph.
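    The graph underlying HapCUT is straightforward to build from the fragments. The sketch below constructs its edge set; the encoding (fragments as strings over '0', '1', '-') and the function name are illustrative, not code from HapCUT.

```python
from itertools import combinations

def snp_graph(fragments):
    """Nodes are SNP indices; an edge joins two SNPs whenever some
    fragment covers both. Returns the set of edges (u, v) with u < v."""
    edges = set()
    for frag in fragments:
        covered = [j for j, a in enumerate(frag) if a != '-']
        for u, v in combinations(covered, 2):
            edges.add((u, v))
    return edges
```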

    He et al. [41] proposed a dynamic programming algorithm for MEC with time complexity c01-math-0059 , where c01-math-0060 is the maximum length of a fragment. The algorithm can be used for values of c01-math-0061 up to about 15, but beyond that it becomes impractical. For larger values of c01-math-0062 , the authors suggest modeling the problem as a MaxSAT instance, to be handled by a MaxSAT solver. The quality of the solutions obtained through this approach is shown to be quite high, but the solving process is still slow.

    Another dynamic programming algorithm, this time parameterized by the maximum coverage c01-math-0063 of each SNP site, was proposed by Deng et al. [23]. The complexity of the dynamic program is c01-math-0064 , which can be quite high when c01-math-0065 is large. For large values of c01-math-0066 , the authors propose a heuristic procedure based on the dynamic programming algorithm. Experiments showed that the heuristic returns very accurate solutions on average.

    Chen et al. [18] proposed an exact algorithm for MEC based on an ILP formulation of the problem. The solving process is considerably sped up by a preprocessing phase, in which the input SNP matrix is decomposed into smaller, independent blocks that are then individually solved by the ILP solver.

    1.2.2 Probabilistic Approaches and Alternative Models

    Some of the solutions to the assembly haplotyping problems have employed a probabilistic approach, either because of the type of algorithm used or because of the type of model and problem definition.

    An example of a probabilistic algorithm is HASH (Haplotype Assembly for Single Human), an MCMC (Markov Chain Monte Carlo) algorithm proposed by Bansal et al. [4] for assembling haplotype fragments under the MEC objective function. In HASH, the transitions of the Markov chain are generated using min-cut computations on graphs derived from the reads. The method was applied to infer the haplotypes from real data consisting of whole-genome shotgun sequence fragments of a human individual. The results showed the haplotypes inferred by HASH to be significantly more accurate than those estimated by the best heuristics available at the time.

    Another type of Markov chain approach was proposed by Wang et al. [64]. In this approach, the assignment of fragments to haplotypes is regarded as a Markov process in which successive allele pairs are generated on the basis of the value of a small number of preceding SNP sites in the sequence. In order to find the most probable haplotypes for the set of fragments, the authors propose a Viterbi-like dynamic programming procedure.

    Chen et al. [19] described a probabilistic model for MEC in which they considered three error parameters, called c01-math-0067 , c01-math-0068 , and c01-math-0069 . The parameter c01-math-0070 is the error rate with which entries of the SNP matrix have been miscalled (i.e., a 0 in place of a 1, or vice versa). The value c01-math-0071 is the error rate with which a "-" appears in the SNP matrix where, in fact, a SNP is covered by a fragment and therefore an allele 0 or 1 should be present. Finally, c01-math-0072 is a measure of the expected dissimilarity between the haplotypes to be reconstructed. The authors gave a linear-time (in the size of c01-math-0073 ) probabilistic algorithm that reconstructs the correct haplotypes with high probability when the parameters are known. Even for the case when some of the parameters are unknown, they provided a probabilistic algorithm that outputs the correct haplotypes with probability depending on c01-math-0074 , c01-math-0075 , and c01-math-0076 .

    Li et al. [52] proposed a probabilistic model for the haplotype inference problem. In their model, they studied the conditional probability c01-math-0077 of c01-math-0078 and c01-math-0079 being the correct haplotypes given that c01-math-0080 was the SNP matrix measured. They then pursued the objective of determining the most likely pair of correct haplotypes, that is, a pair c01-math-0081 maximizing the above conditional probability. In order to solve this problem, they used Gibbs sampling and an Expectation Maximization (EM) algorithm.

    Among the works proposed for the solution of the MEC model, some papers introduced variants to the original framework in order to deal with specific data problems. For instance, Zhang et al. [71] proposed a model called Minimum Conflict Individual Haplotyping (MCIH), suitable for data sets with a particularly high error rate. The problem was shown to be NP-hard, and a dynamic programming procedure for its solution was described.

    Duitama et al. [25] proposed a model called Maximum Fragments Cut (MCF) whose objective is to identify a set of fragments (i.e., rows of the SNP matrix) maximizing a score proportional to their conflict with respect to the remaining rows. This set of fragments can be interpreted as the shore of a cut in a suitable graph, so that the problem can be reduced to a max-cut problem. The authors utilized a local optimization heuristic, called ReFHap, for the problem solution.

    Xie et al. [69] introduced a model called Balanced Optimal Partition (BOP), which generalizes both MEC and MCF. The model is, in a way, a weighted combination of MEC and MCF, and by setting some model parameters to proper values, BOP degenerates into pure MEC or pure MCF. To solve the model, the authors proposed a dynamic programming algorithm called H-BOP.

    Aguiar and Istrail [1] proposed HapCompass, an algorithm for haplotype assembly of densely sequenced human genome data. HapCompass operates on a graph where each SNP is a node and the edges are associated with the fragments. Each edge is weighted, and its weight indicates the best phasing of the alleles at its endpoints, given the fragments in the data set that cover the corresponding SNPs. In this model, the reconstruction of the underlying haplotypes corresponds to removing a minimum-weight set of edges so as to obtain a special type of subgraph, called a happy graph by the authors. HapCompass is a heuristic with a local reoptimization step. Computational results showed the effectiveness of the procedure and the good quality of its solution also with respect to the MEC objective.

    1.3 Population Haplotyping

    Haplotype data are particularly sought after in the study of complex diseases (those affected by more than one gene) because they convey precious information about which sets of gene alleles are inherited together. These polymorphism screenings are usually conducted on a population, that is, on a set of related individuals.

    We have discussed at length the problem of haplotyping a single individual, which consists in determining her two haplotypes for a given chromosome. In a similar way, haplotyping a population consists in haplotyping each individual of the given population. Of course, one way to do this would be to solve a set of single individual haplotyping problems, one for each element of the population, but there might be better ways to proceed. In particular, with the increased availability of SNP data, recent years have seen the birth of a set of new computational problems related to population haplotyping. Most of these problems are motivated by the fact that, while it is economically inconvenient to determine the haplotypes by sequencing and assembling, there is a cheap experiment that can determine a less informative and often ambiguous type of data, the genotypes, from which the haplotypes can then be retrieved computationally.

    A genotype of an individual contains the information about the two (possibly identical) alleles at each SNP, but without specifying their paternal or maternal origin. There may be many possible pairs of haplotypes that are consistent with a given genotype. For example, assume we only know that an individual is heterozygous for the alleles c01-math-0082 at SNP 1 and for the alleles c01-math-0083 at SNP 2. Then, either one of these alternatives may be true:

    One parent has contributed the alleles C and A and the other the alleles T and G.

    One parent has contributed the alleles C and G and the other the alleles T and A.

    Both possibilities are plausible. Associating the alleles with the parents is called phasing the alleles. For c01-math-0084 heterozygous SNPs, there are c01-math-0085 possible phasings, which makes choosing the correct one a difficult problem. Once the alleles have been phased, the two haplotypes are immediately inferred: phasing and haplotyping are in fact the same problem. The two haplotypes obtained by phasing the alleles are said to resolve, or explain, the genotype.
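    As a small illustration of the counting argument, one can enumerate the distinct phasings of a genotype encoded as a string over {'0', '1', '2'}, with '2' marking a heterozygous site; for k ≥ 1 heterozygous sites, the enumeration yields exactly 2**(k-1) distinct unordered haplotype pairs. The encoding and function name are illustrative conventions.

```python
from itertools import product

def phasings(genotype):
    """All unordered haplotype pairs resolving a genotype over {'0','1','2'}."""
    amb = [j for j, a in enumerate(genotype) if a == '2']
    seen = set()
    for bits in product('01', repeat=len(amb)):
        h1 = list(genotype)
        for j, b in zip(amb, bits):
            h1[j] = b
        # the second haplotype takes the complementary allele at ambiguous sites
        h2 = ''.join('1' if (a == '2' and c == '0') else
                     ('0' if (a == '2' and c == '1') else c)
                     for a, c in zip(genotype, h1))
        seen.add(tuple(sorted((''.join(h1), h2))))
    return sorted(seen)
```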

    The most general population haplotyping problem can be stated as follows:

    Given a set c01-math-0086 of genotypes, corresponding to an existing, unknown, set c01-math-0087 of haplotypes, retrieve c01-math-0088 .

    In other words, the goal is to compute a set c01-math-0089 of haplotypes that contains, for each genotype c01-math-0090 , the two haplotypes c01-math-0091 and c01-math-0092 obtained by the correct phasing of c01-math-0093 . As could be expected, on the basis of the knowledge of c01-math-0094 alone, it is not easy to describe constraints that define precisely which of the exponentially many phasings of a genotype is the correct one. Biologists have therefore described several sensible criteria for good phasings. For instance, under a widely accepted parsimony principle (inspired by Occam's razor), a good solution may be one that minimizes the number of distinct haplotypes inferred.

    Once it has been mathematically modeled, haplotyping gives rise to several nice and challenging combinatorial problems (for surveys on population haplotyping problems, see, e.g., References [11, 38, 39]). These problems have been extensively studied in the last few years. Some of them have been proven NP-hard and solved by exponential-time algorithms, while for others polynomial-time algorithms have been designed.

    In this chapter, we address some of the most interesting combinatorial haplotyping models proposed in the literature. Each model and objective function has specific biological motivations, which are discussed in the following sections. While our focus will be on the combinatorial approach to haplotyping problems, it should be remarked that there is also a very important statistical approach to population haplotyping problems, which does not fall within the scope of this survey. The statistical approach has led to widely used software tools for haplotyping, such as the program PHASE [60].

    Given a set of c01-math-0095 SNPs, fix arbitrarily a binary encoding of the two alleles for each SNP (i.e., call one of the two alleles 0 and the other 1). Once the encoding has been fixed, each haplotype becomes a binary vector of length c01-math-0096 .

    A haplotype c01-math-0097 is denoted by c01-math-0098 , the value of its c01-math-0099 th component, with c01-math-0100 . Given two haplotypes c01-math-0101 and c01-math-0102 , we define a special sum whose result is a vector c01-math-0103 . The vector c01-math-0104 has length c01-math-0105 , and its components take values in c01-math-0106 , according to the following rule:

    (h1 ⊕ h2)[j] = h1[j] if h1[j] = h2[j], and (h1 ⊕ h2)[j] = 2 otherwise.

    We call a vector c01-math-0108 with entries in c01-math-0109 a genotype. Each position c01-math-0110 such that c01-math-0111 is called an ambiguous position (or ambiguous site). We denote by c01-math-0112 the set of ambiguous positions of c01-math-0113 . Biologically, genotype entries of value 0 or 1 correspond to homozygous SNP sites, while entries of value 2 correspond to heterozygous sites. In Figure 1.4, we illustrate a case of three individuals, showing their haplotypes and genotypes.
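    Under the binary encoding, the special sum defining a genotype from two haplotypes is a one-liner (the function name is illustrative):

```python
def genotype_of(h1, h2):
    """Special sum of two haplotypes: the common allele at homozygous
    sites, 2 at heterozygous sites."""
    return ''.join(a if a == b else '2' for a, b in zip(h1, h2))
```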


    Figure 1.4 Haplotypes and corresponding genotypes.

    A resolution of a genotype c01-math-0114 is given by a pair of haplotypes c01-math-0115 such that c01-math-0116 . The haplotypes c01-math-0117 and c01-math-0118 are said to resolve c01-math-0119 . A genotype is ambiguous if it has more than one possible resolution, that is, if it has at least two ambiguous positions. A haplotype c01-math-0120 is said to be compatible with a genotype c01-math-0121 if c01-math-0122 can be used in a resolution of c01-math-0123 . It can immediately be seen that c01-math-0124 is compatible with c01-math-0125 if and only if c01-math-0126 at each position where c01-math-0127 . Two genotypes c01-math-0128 and c01-math-0129 are compatible if there exists at least one haplotype compatible with both of them, otherwise, they are incompatible. It can immediately be seen that c01-math-0130 and c01-math-0131 are compatible if and only if c01-math-0132 at each position c01-math-0133 where they are both nonambiguous.
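    The two compatibility criteria just stated translate directly into code (an illustrative sketch under the same {'0','1','2'} string encoding):

```python
def compatible_hg(h, g):
    """A haplotype is compatible with a genotype iff they agree at every
    nonambiguous (non-'2') site of the genotype."""
    return all(a == b for a, b in zip(h, g) if b != '2')

def compatible_gg(g1, g2):
    """Two genotypes are compatible iff they agree at every site where
    both are nonambiguous."""
    return all(a == b for a, b in zip(g1, g2) if a != '2' and b != '2')
```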

    As previously discussed, the experiment yielding each genotype is such that, at each SNP, it is known whether an individual is homozygous for allele 0 (in which case, we may set c01-math-0134 ), homozygous for allele 1 (in which case, we may set c01-math-0135 ), or heterozygous (in which case, we may set c01-math-0136 ). Therefore, for the haplotyping problems described in this section, the input data consist of a set c01-math-0137 of c01-math-0138 genotypes c01-math-0139 , corresponding to c01-math-0140 individuals in a population. The output is a set c01-math-0141 of haplotypes such that, for each c01-math-0142 , there is at least one pair of haplotypes c01-math-0143 with c01-math-0144 . Such a set c01-math-0145 of haplotypes is said to explain c01-math-0146 . In addition to explaining c01-math-0147 , the set c01-math-0148 is also required to satisfy some particular constraints. These constraints differ across the specific types of haplotyping problems. For each problem described in this survey, the particular constraints are given in the corresponding section.

    1.3.1 Clark's Rule

    The geneticist Clark proposed in Reference [21] a rule (today known as Clark's rule) to derive new haplotypes by inference from known ones as follows:

    Clark's Inference Rule: Given a genotype c01-math-0149 and a compatible haplotype c01-math-0150 , obtain a new haplotype c01-math-0151 by setting c01-math-0152 at all positions c01-math-0153 and c01-math-0154 at the remaining positions.

    Note that c01-math-0155 and c01-math-0156 resolve c01-math-0157 . In order to resolve all genotypes of c01-math-0158 , Clark suggested the use of successive applications of his inference rule. The procedure requires a bootstrap set of haplotypes used to derive new haplotypes at the very beginning. These starting haplotypes are obtained by resolving, in the unique possible way, the unambiguous genotypes in c01-math-0159 (of which it is assumed there is always at least one).

    The following is the procedure that Clark proposed for haplotyping a population. He supported the validity of the approach by arguments from theoretical population genetics:

    Clark's Algorithm: Let c01-math-0160 be the set of nonambiguous genotypes and let c01-math-0161 be the set of haplotypes obtained from c01-math-0162 . Reset c01-math-0163 . Then, repeat the following. If they exist, take a c01-math-0164 and a compatible c01-math-0165 and apply the inference rule, obtaining c01-math-0166 . Set c01-math-0167 , c01-math-0168 , and iterate. When no such c01-math-0169 and c01-math-0170 exist, the algorithm succeeds if c01-math-0171 and fails otherwise.

    Note that the procedure is nondeterministic, as it does not specify how to choose c01-math-0172 and c01-math-0173 whenever there is more than one candidate for the application of the rule. For example, suppose c01-math-0174 . The algorithm starts by setting c01-math-0175 and c01-math-0176 . The inference rule can be used to resolve c01-math-0177 from c01-math-0178 , obtaining c01-math-0179 , which can, in turn, be used to resolve c01-math-0180 , obtaining c01-math-0181 . However, one could have started by using c01-math-0182 to resolve c01-math-0183 , obtaining c01-math-0184 . At that point, there would be no way to resolve c01-math-0185 . The nondeterminism in the choice of the genotype c01-math-0186 and haplotype c01-math-0187 to which the inference rule is applied can be settled by fixing a deterministic rule based on the initial sorting of the data. In Reference [21], a large number of random sortings are used to run the algorithm, and the best solution overall is reported. Tests on real and simulated data sets showed that, although the algorithm could resolve all genotypes most of the time, it also failed fairly often.
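    A minimal sketch of Clark's algorithm with one fixed deterministic choice rule (scan the pending genotypes in input order and apply the rule with the first compatible known haplotype) might look as follows. This illustrates the procedure's logic, not Clark's original implementation; names and encodings are illustrative.

```python
def clark_resolve(genotypes):
    """Seed with the haplotypes of the unambiguous genotypes, then apply
    Clark's inference rule greedily until no ambiguous genotype can be
    resolved. Returns (known haplotypes, unresolved genotypes)."""
    def is_ambiguous(g):
        return '2' in g

    def infer(g, h):
        # Clark's inference rule: the mate of h with respect to g
        return ''.join(b if b != '2' else ('1' if a == '0' else '0')
                       for a, b in zip(h, g))

    def compatible(h, g):
        return all(a == b for a, b in zip(h, g) if b != '2')

    haplos = {g for g in genotypes if not is_ambiguous(g)}
    pending = [g for g in genotypes if is_ambiguous(g)]
    progress = True
    while progress and pending:
        progress = False
        for g in list(pending):
            h = next((h for h in haplos if compatible(h, g)), None)
            if h is not None:
                haplos.add(infer(g, h))
                pending.remove(g)
                progress = True
    return haplos, pending
```

    The procedure succeeds when the returned list of unresolved genotypes is empty; different scan orders can succeed or fail on the same input, which is exactly the nondeterminism discussed above.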

    The problem of finding an ordering of application of Clark's inference rule that leaves the fewest number of unresolved genotypes in the end was first defined and studied by Gusfield [33], who proved it is NP-hard and APX-hard (i.e., there is a value c01-math-0188 for which it is NP-hard even to give a c01-math-0189 -approximation algorithm). As for practical algorithms, Gusfield [34, 35] proposed an integer programming approach for a graph-theoretic reformulation of the problem. The problem is first transformed, by an exponential-time reduction, into a problem on a digraph c01-math-0190 defined as follows. Let c01-math-0191 , where

    c01-math-0192 . Let c01-math-0193 be the subset of haplotypes determined from the set c01-math-0194 of unambiguous genotypes. For each c01-math-0195 and c01-math-0196 in c01-math-0197 , there is an arc c01-math-0198 if c01-math-0199 is ambiguous, c01-math-0200 , and c01-math-0201 (i.e., c01-math-0202 can be inferred from c01-math-0203 via c01-math-0204 ). Then, any directed tree rooted at a node c01-math-0205 specifies a feasible history of successive applications of the inference rule starting at node c01-math-0206 . The problem can then be stated as follows: find the largest number of nodes that can be reached by a set of node-disjoint directed trees, where each tree is rooted at a node in c01-math-0207 and, for every ambiguous genotype c01-math-0208 , at most one node in c01-math-0209 is reached.

    The above graph problem was shown to be NP-hard [34]. For its solution, Gusfield proposed an ILP formulation and noticed that the solution of the LP relaxation was very often integer for the real-life instances tested. The model was applied to real data as well as random instances, with up to 80 genotypes, of which 60 were ambiguous, over 15 SNPs.

    1.3.2 Pure Parsimony

    Clark's algorithm tries to reuse existing haplotypes as much as possible and it introduces new haplotypes only when strictly needed. As a result, the procedure usually tends to produce solutions of small size. Note that the maximum size for a set c01-math-0210 resolving c01-math-0211 is c01-math-0212 , while the smallest possible size is c01-math-0213 .

    Pure Parsimony Haplotyping (PPH) has the explicit objective of minimizing the size of c01-math-0214 . This objective has several biological motivations. For one, the number of distinct haplotypes observed in nature is vastly smaller than the number of possible haplotypes. Furthermore, as we all descend from a small number of ancestors, their haplotypes should be the same as those we possess today (were it not for some recombination events and mutations). Finally, pure parsimony follows a general principle according to which, of many possible explanations of an observed phenomenon, one should always favor the simplest.

    The PPH problem is NP-hard, as first shown by Hubbell [43]. Lancia et al. [48] showed that, in fact, the problem is APX-hard. This result holds even if each genotype is restricted to possess at most three ambiguous sites. Note that, although the problem is APX-hard, there exist constant-ratio approximation algorithms under the restriction that each genotype has at most c01-math-0215 ambiguous sites, for each constant c01-math-0216 [48].
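    Since PPH is NP-hard, exact solutions on anything but tiny instances require the sophisticated methods surveyed in the following sections. Still, a brute-force solver that tries candidate haplotype sets of increasing cardinality makes the objective concrete. The sketch below is illustrative and exponential in time; it reuses the {'0','1','2'} string encoding.

```python
from itertools import combinations, product

def compatible_haplotypes(g):
    """All haplotypes compatible with genotype g."""
    amb = [j for j, a in enumerate(g) if a == '2']
    for bits in product('01', repeat=len(amb)):
        h = list(g)
        for j, b in zip(amb, bits):
            h[j] = b
        yield ''.join(h)

def resolves(h1, h2, g):
    """True iff the pair (h1, h2) resolves genotype g."""
    return all((a == b == c) if c != '2' else (a != b)
               for a, b, c in zip(h1, h2, g))

def pph_brute_force(genotypes):
    """Smallest haplotype set explaining all genotypes; tiny inputs only."""
    candidates = sorted({h for g in genotypes for h in compatible_haplotypes(g)})
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if all(any(resolves(h1, h2, g)
                       for h1 in subset for h2 in subset)
                   for g in genotypes):
                return set(subset)
    return None
```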

    Several algorithmic approaches, many of them employing mathematical programming techniques, have been used in PPH (for a survey of models and approaches for PPH, see Reference [17]). In particular, we recall the following works.

    1.3.2.1 Integer Programming Formulations of Exponential Size

    Let us denote by c01-math-0217 the set of all haplotypes compatible with at least one genotype of c01-math-0218 and let c01-math-0219 . The first ILP formulation for PPH, called TIP, was provided by Gusfield [37]. TIP has c01-math-0220 variables and c01-math-0221 constraints. There is a binary variable c01-math-0222 associated with every c01-math-0223 ( c01-math-0224 means that c01-math-0225 is included in the solution, whereas c01-math-0226 means that c01-math-0227 is not included). After fixing a total ordering on c01-math-0228 , denote by c01-math-0229 the set of those ordered pairs c01-math-0230 with c01-math-0231 , c01-math-0232 . For every c01-math-0233 , let c01-math-0234 . For every c01-math-0235 and c01-math-0236 , there is a binary variable c01-math-0237 used to select an ordered pair c01-math-0238 to resolve the genotype c01-math-0239 . The following is then a valid ILP formulation of the PPH problem [37, 48]:

    1.1   minimize Σ_{h ∈ H} x_h

    subject to

    1.2   Σ_{(h,k) ∈ P_g} y_{h,k} ≥ 1,   for each genotype g ∈ G

    1.3   y_{h,k} ≤ x_h,   for each ordered pair (h,k) ∈ P

    1.4   y_{h,k} ≤ x_k,   for each ordered pair (h,k) ∈ P

    1.5   x_h ∈ {0,1}, y_{h,k} ∈ {0,1},   for each h ∈ H and (h,k) ∈ P

    Constraints (1.2) impose that each genotype is resolved by at least one pair of haplotypes. Constraints (1.3) and (1.4) make sure that c01-math-0245 can be used to resolve a genotype only if both c01-math-0246 and c01-math-0247 are included in the solution.

    Gusfield managed to employ some preprocessing rules to get rid of variables that can be proved to be nonessential in the formulation. The resulting model is called RTIP (Reduced TIP). Although practically useful, the preprocessing still leaves an exponential model whose size grows quite quickly with respect to the instance size. The experimental results of Reference [37] showed that RTIP can be used to tackle problems with up to 50 genotypes over 30 SNPs, provided the input genotypes have relatively low levels of heterozygosity.

    Lancia and Rizzi [50] showed that when each genotype has at most two ambiguous sites, the above ILP is totally unimodular. As a consequence, the ILP solution is always naturally integer and hence the problem can be solved in polynomial time.

    Lancia and Serafini [51] proposed a new exponential ILP model for PPH. The model exploits an interpretation of PPH as a particular type of Set Covering (SC) based on the following observation: if c01-math-0248 is a set of haplotypes resolving c01-math-0249 , then for each genotype c01-math-0250 , ambiguous position c01-math-0251 , and value c01-math-0252 , there is a haplotype c01-math-0253 such that c01-math-0254 .

    This condition is a covering condition because it represents the covering constraint for a set cover problem in which the universe is the set of all triples c01-math-0255 , for c01-math-0256 , c01-math-0257 , and c01-math-0258 , and each haplotype c01-math-0259 represents a subset of the universe, namely, c01-math-0260 . The condition is only necessary, but not sufficient, for c01-math-0261 to be a feasible solution of PPH. Consider, for example, the following diagonal instance c01-math-0262 . The set c01-math-0263 satisfies the covering condition but does not resolve c01-math-0264 .

    The SC associated with PPH seeks to minimize the number of haplotypes needed to satisfy all covering conditions for the given set of genotypes c01-math-0265 . As we have seen, this SC is in fact a relaxation of PPH. The formulation of the SC model has an exponential number of variables and constraints, and can be solved, as described in Reference [51], by branch-and-cut-and-price, that is, via the generation of variables and constraints at run-time. It is possible that the optimal solution of SC resolves c01-math-0266 , in which case it is an optimal PPH solution as well. If not, one can try to obtain a good feasible PPH solution from the optimal cover by adding only a small number of haplotypes. This idea is exploited by an effective heuristic presented in Reference [51]. The computational results show that the SC approach can be orders of magnitude faster than RTIP and can be applied to instances with c01-math-0267 and c01-math-0268 up to c01-math-0269 . These are among the largest-size instances for which provably optimal solutions have been obtained in the literature.

    1.3.2.2 Integer Programming Formulations of Polynomial Size and Hybrid Formulations

    Several authors (Brown and Harrower [13], Bertolazzi et al. [8], Lancia et al. [48], Halldorsson et al. [40]) have independently proposed polynomial-size integer programming formulations. More or less, all of these formulations employ variables to represent the bits of the haplotypes in the solution (so there are at most c01-math-0270 such variables). The basic idea is that for each genotype c01-math-0271 one must determine two haplotypes c01-math-0272 and c01-math-0273 such that c01-math-0274 . This implies that, for each position c01-math-0275 such that c01-math-0276 , one needs to decide a variable c01-math-0277 , and then set c01-math-0278 and c01-math-0279 . The polynomial formulations for PPH consist in expressing the PPH objective function and constraints in terms of the c01-math-0280 variables and possibly a (polynomial-size) set of additional variables.
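    The core decision in these polynomial-size formulations, one binary variable per heterozygous site determining which haplotype receives which allele, can be illustrated as follows (names and the string encoding are illustrative):

```python
def haplotypes_from_bits(genotype, bits):
    """Given one decision bit per heterozygous ('2') site of a genotype,
    build the resolving pair: h1 takes the bit, h2 its complement."""
    h1, h2, it = [], [], iter(bits)
    for a in genotype:
        if a != '2':
            h1.append(a)
            h2.append(a)
        else:
            b = next(it)
            h1.append(b)
            h2.append('1' if b == '0' else '0')
    return ''.join(h1), ''.join(h2)
```

    Fixing all such bits for all genotypes fully determines the solving haplotypes; the formulations then count (via additional variables) how many distinct haplotypes are induced.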

    The LP relaxation of these formulations is generally quite weak. The use of some valid cuts [8, 13] improves the quality of the bound, but the integrality gap between the integer optimum and the LP relaxation remains large. Brown and Harrower [14] also proposed a hybrid model in which variables for a fixed subset of haplotypes are explicitly present, while the rest of the haplotypes are implicitly represented by polynomially many variables and constraints. These polynomial/hybrid formulations were successfully used for the solution of problems of similar size as those solvable by using TIP. Furthermore, some tests were conducted on larger problems (30 genotypes over up to 75 SNPs), on which the exponential formulation could not be applied successfully owing to the IP size.

    The last polynomial model for PPH was proposed by Catanzaro et al. [16] and is based on the class representatives with smallest index (a technique adopted by Campelo et al. for the vertex coloring problem [15]). The authors observed that a feasible solution to PPH can be represented by means of a bipartite graph whose two shores correspond to haplotypes and genotypes. Each vertex c01-math-0281 has degree 2, and there are two vertices, say c01-math-0282 and c01-math-0283 , adjacent to c01-math-0284 such that c01-math-0285 . The bipartite graph representation of a solution suggests that, in a feasible solution to PPH, the haplotypes induce a family of subsets of genotypes satisfying the following three properties: (i) each subset of genotypes shares one haplotype, (ii) each genotype belongs to exactly two subsets, and (iii) every pair of subsets intersects in at most one genotype. Catanzaro et al. [16] exploited this bipartite graph representation to provide a polynomial-size ILP model that turns out to be quite effective and to outperform the other polynomial models for the PPH problem.

    1.3.2.3 Quadratic, Semidefinite Programming Approaches

    A quadratic formulation solved by semidefinite programming was proposed by Kalpakis and Namjoshi [45]. Similarly to the RTIP model, the formulation has a variable for each possible haplotype and hence it cannot be used to tackle instances for which c01-math-0286 is too large. According to the computational results, the size of the problems solved is comparable to that of RTIP and of the best polynomial ILP models. On the basis of a similar formulation, an (exponential-time) approximation algorithm was presented in Reference [42].

    1.3.2.4 Combinatorial Branch-and-Bound Approaches

    Wang and Xu [65] proposed a simple combinatorial branch-and-bound approach. The solution is built by enumerating all possible resolutions for each of the genotypes in turn. The lower bound is the number of haplotypes used so far. Since the search space is exponential and the bound is weak, the method is not able to solve instances of size comparable to the other approaches. Even the solution for 20 genotypes over 20 SNPs can sometimes take an extremely long time to be found.

    1.3.2.5 Boolean Optimization

    One of the approaches applied to PPH is Pseudo Boolean Optimization (PBO). PBO is a technique by which a problem can be modeled via integer linear constraints over a set of boolean variables. The goal is to find an assignment of the variables that satisfies all constraints and minimizes a given objective function. The model is solved by a SAT solver (much in the same way as an ILP model is solved by an ILP solver), which employs solution-searching techniques specific for boolean models.

    The PBO models for PPH are mostly based on the following feasibility question: given a tentative cardinality k, does there exist a set H of k haplotypes that resolves G? The feasibility question is expressed in terms of Boolean variables similar to the binary variables of the polynomial ILP models. Starting at a lower bound, k is increased until the feasibility question has a positive answer. At that point, the corresponding set H represents an optimal solution.
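    The overall strategy can be sketched as follows, with a brute-force `feasible` check standing in for the PBO/SAT solver used in practice (all function names here are illustrative assumptions, not the authors' code):

```python
from itertools import combinations, product

def compatible_haplotypes(genotypes):
    """All haplotypes that can take part in resolving some genotype."""
    cands = set()
    for g in genotypes:
        het = [i for i, s in enumerate(g) if s == 2]
        for bits in product((0, 1), repeat=len(het)):
            h = list(g)
            for i, b in zip(het, bits):
                h[i] = b
            cands.add(tuple(h))
    return cands

def feasible(genotypes, H):
    """Can every genotype be resolved by a pair of haplotypes from H?"""
    def ok(h1, h2, g):
        return all((a == b == s) if s in (0, 1) else (a != b)
                   for a, b, s in zip(h1, h2, g))
    return all(any(ok(h1, h2, g) for h1 in H for h2 in H)
               for g in genotypes)

def parsimony_by_feasibility(genotypes):
    """Increase the tentative cardinality k until the feasibility
    question has a positive answer; brute force over candidate sets
    stands in for the SAT/PBO solver."""
    cands = compatible_haplotypes(genotypes)
    for k in range(1, len(cands) + 1):   # starting from a trivial lower bound
        for H in combinations(cands, k):
            if feasible(genotypes, H):
                return k
```

    The real algorithms replace the inner brute force with a single solver call per value of k, and start from much tighter lower bounds.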

    The first PBO algorithm for PPH was presented by Lynce and Marques-Silva [54]. Later works aimed at breaking the symmetries in the original PBO model and at computing tight lower and upper bounds to be used for pruning the search space (Graca et al. [31], Marques-Silva et al. [55]). The SAT-based approaches proved to be competitive with the best mathematical programming approaches for PPH.

    1.3.2.6 Heuristics

    In general, all the exact models mentioned so far run into trouble when trying to solve large instances (where the most critical parameter is the number of ambiguous positions per genotype). The exponential models require the creation of too many variables and/or constraints to obtain a solution within a reasonable time. The polynomial and combinatorial models, on the other hand, employ rather weak lower bounds, so that closing the gap and terminating the branch-and-bound search is again impossible within a reasonable time. In order to solve large instances of PPH, one then needs to resort to effective heuristics that can find near-optimal solutions to instances with hundreds of genotypes over hundreds of SNPs with high levels of heterozygosity.

    One such heuristic procedure is CollHaps, proposed by Tininini et al. [61]. CollHaps is based on the representation of the solution as a matrix M whose rows, two per genotype, are the solving haplotypes (also called symbolic haplotypes because they may contain variables), and in which some entries are fixed while others have to be determined. The entries to be determined correspond to heterozygous positions in the input genotypes. To each such entry, CollHaps associates a 0–1 variable, which is then set to a fixed value in the course of the procedure. At each step, CollHaps tries to maintain as few distinct symbolic haplotypes as possible, which it does by trying to make some rows of M identical via a greedy setting of the variables. The idea is to eventually end up with as few distinct actual haplotypes as possible. Experimental results have shown that CollHaps is a very effective and accurate heuristic for PPH, ranking among the best available procedures for this problem.
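    CollHaps itself works on the matrix of symbolic haplotypes and is considerably more sophisticated; the sketch below only illustrates the underlying greedy idea of reusing haplotypes across genotypes (a simplified stand-in, not the CollHaps algorithm):

```python
from itertools import product

def resolutions(g):
    """All haplotype pairs resolving genotype g (0/1/2 encoding)."""
    het = [i for i, s in enumerate(g) if s == 2]
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        yield tuple(h1), tuple(h2)

def greedy_parsimony(genotypes):
    """Resolve each genotype in turn by the pair that introduces the
    fewest haplotypes not already in use, mimicking the drive toward
    few distinct haplotypes."""
    used = set()
    for g in genotypes:
        h1, h2 = min(resolutions(g), key=lambda p: len(set(p) - used))
        used.update((h1, h2))
    return used
```

    Like most greedy procedures, the result depends on the order in which genotypes are processed, which is one reason the actual heuristic works globally on the symbolic matrix instead.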

    In another heuristic approach for PPH, Di Gaspero and Roli [27] proposed the use of stochastic local search, which they considered to yield a reasonable compromise between the quality of the solutions and running time of the procedure. In their work, Di Gaspero and Roli utilized a family of local search strategies, such as Best Improvement (BI), Stochastic First Improvement (SFI), Simulated Annealing (SA), and Tabu Search (TS).
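    A minimal Best Improvement skeleton over this search space might look as follows (a simplified sketch: the move re-resolves a single genotype, the objective is the number of distinct haplotypes, and none of this is the authors' implementation):

```python
from itertools import product
import random

def resolutions(g):
    """All haplotype pairs resolving genotype g (0/1/2 encoding)."""
    het = [i for i, s in enumerate(g) if s == 2]
    out = []
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        out.append((tuple(h1), tuple(h2)))
    return out

def cost(assign):
    """Objective: number of distinct haplotypes in the current solution."""
    return len({h for pair in assign for h in pair})

def best_improvement(genotypes, seed=0):
    """Best Improvement local search: repeatedly apply the move that
    most reduces the objective; stop at a local optimum."""
    rng = random.Random(seed)
    assign = [rng.choice(resolutions(g)) for g in genotypes]
    improved = True
    while improved:
        improved = False
        for i, g in enumerate(genotypes):
            for pair in resolutions(g):
                cand = assign[:i] + [pair] + assign[i + 1:]
                if cost(cand) < cost(assign):
                    assign, improved = cand, True
    return cost(assign)
```

    The SA and TS variants differ only in the acceptance rule (occasionally accepting worsening moves, or forbidding recently reversed ones) so as to escape such local optima.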

    1.3.3 Perfect Phylogeny

    One limitation of the PPH model is that it does not take into account the fact that haplotypes may evolve over time. Haplotypes evolve by mutation and recombination, and an individual can possess haplotypes that are possessed by neither of her parents. The haplotype regions in which no recombination and/or mutation events have ever happened in a population over time are called blocks. For reconstructing haplotype blocks, starting from genotype data of a block in a population, the parsimony model is the most appropriate [37]. However, when genotypes span several blocks, different models of haplotyping have been considered. One of these is haplotyping for perfect phylogeny.

    The perfect phylogeny model is used under the hypothesis that no recombination events happened, but only mutations. It is assumed that at the beginning there existed only one ancestral haplotype, and new haplotypes were derived over time from existing haplotypes as follows. If at some point there existed a haplotype h in the population and then a mutation of h at some site j happened, which was then passed on to the new generation, a new haplotype h' started to exist, with h'_i = h_i for every site i ≠ j, and h'_j = 1 − h_j. We say that h is the father of h' in the tree of haplotype evolution. In the infinite-sites coalescent model
