Pattern Recognition in Computational Molecular Biology: Techniques and Approaches

Ebook, 1,329 pages

About this ebook

A comprehensive overview of high-performance pattern recognition techniques and approaches to Computational Molecular Biology

This book surveys the development of pattern recognition techniques and approaches related to Computational Molecular Biology. Providing broad coverage of the field, the authors cover fundamental and technical information on these techniques and approaches, as well as discussing their related problems. The text consists of twenty-nine chapters, organized into seven parts: Pattern Recognition in Sequences, Pattern Recognition in Secondary Structures, Pattern Recognition in Tertiary Structures, Pattern Recognition in Quaternary Structures, Pattern Recognition in Microarrays, Pattern Recognition in Phylogenetic Trees, and Pattern Recognition in Biological Networks.

  • Surveys the development of techniques and approaches on pattern recognition in biomolecular data
  • Discusses pattern recognition in primary, secondary, tertiary and quaternary structures, as well as microarrays, phylogenetic trees and biological networks
  • Includes case studies and examples to further illustrate the concepts discussed in the book
Pattern Recognition in Computational Molecular Biology: Techniques and Approaches is a reference for practitioners and professional researchers in Computer Science, Life Science, and Mathematics. This book also serves as supplementary reading for graduate students and young researchers interested in Computational Molecular Biology.
Language: English
Publisher: Wiley
Release date: Dec 24, 2015
ISBN: 9781119078869

    Pattern Recognition in Computational Molecular Biology - Mourad Elloumi

    List of Contributors

    Andrej Aderhold, School of Biology, University of St Andrews, St Andrews, UK

    Lefteris Angelis, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece

    Miguel Arenas, Bioinformatics Unit, Centre for Molecular Biology Severo Ochoa (CSIC), Madrid, Spain; Institute of Molecular Pathology and Immunology, University of Porto (IPATIMUP), Porto, Portugal

    Dunarel Badescu, Département d'informatique, Université du Québec à Montréal, Succ. Centre-Ville, Montréal, Québec, Canada

    Carl Barton, The Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK

    Sima Behpour, Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA

    Paola Bertolazzi, Institute of Systems Analysis and Computer Science A. Ruberti, National Research Council, Rome, Italy

    Chengpeng Bi, Bioinformatics and Intelligent Computing Lab, Division of Clinical Pharmacology, Children's Mercy Hospitals, Kansas City, MO, USA

    Kevin Byron, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA

    Weiwen Cai, Institute of Applied Genomics, College of Biological Science and Engineering, Fuzhou University, Fuzhou, China

    Virginio Cantoni, Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata, Pavia, Italy

    Kuo-Chen Chou, Gordon Life Science Institute, Belmont, MA, USA; King Abdulaziz University, Jeddah, Saudi Arabia

    Matteo Comin, Department of Information Engineering, University of Padova, Padova, Italy

    David Dao, Karlsruhe Institute of Technology, Institute for Theoretical Informatics, Postfach, Karlsruhe, Germany

    Bhaskar DasGupta, Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA

    Wajdi Dhifli, Clermont Université, Université Blaise Pascal, LIMOS, BP, Clermont-Ferrand, France; CNRS, UMR, LIMOS, Aubière, France; Département d'informatique, Université du Québec à Montréal, Succ. Centre-Ville, Montréal, Québec, Canada

    Carlotta Domeniconi, Department of Computer Science, George Mason University, Fairfax, VA, USA

    Maryam Faridounnia, Bijvoet Center for Biomolecular Research, Utrecht University, Utrecht, The Netherlands

    Giovanni Felici, Institute of Systems Analysis and Computer Science A. Ruberti, National Research Council, Rome, Italy

    Marco Ferretti, Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata, Pavia, Italy

    Giulia Fiscon, Institute of Systems Analysis and Computer Science A. Ruberti, National Research Council, Rome, Italy; Department of Computer, Control and Management Engineering, Sapienza University, Rome, Italy

    Tomáš Flouri, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

    Adelaide Freitas, Department of Mathematics, and Center for Research & Development in Mathematics and Applications (CIDMA), University of Aveiro, Aveiro, Portugal

    Valentina Fustaino, Cellular Biology and Neurobiology Institute, National Research Council, Rome, Italy; Institute of Systems Analysis and Computer Science A. Ruberti, National Research Council, Rome, Italy

    Yann Guermeur, LORIA, Université de Lorraine-CNRS, Nancy, France

    Michael Hall, Department of Computing Science, University of Alberta, Edmonton, Canada

    Robert Harrison, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Dirk Husmeier, School of Mathematics and Statistics, University of Glasgow, Glasgow, UK

    Costas S. Iliopoulos, Department of Informatics, King's College London, London, United Kingdom

    Samad Jahandideh, Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, La Jolla, CA, USA

    Giuseppe Lancia, Dipartimento di Matematica e Informatica, University of Udine, Udine, Italy

    Fabien Lauer, LORIA, Université de Lorraine-CNRS, Nancy, France

    Vladimir Makarenkov, Département d'informatique, Université du Québec à Montréal, Succ. Centre-Ville, Montréal, Québec, Canada

    Christos Makris, Department of Computer Engineering and Informatics, University of Patras, Patras, Greece

    Mina Maleki, School of Computer Science, University of Windsor, Windsor, Canada

    Diego Mallo, Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain

    David A. Morrison, Systematic Biology, Uppsala University, Norbyvägen, Uppsala, Sweden

    Ahmed Moussa, LabTIC Laboratory, ENSA, Abdelmalek Essaadi University, Tangier, Morocco

    Sami Muhaidat, ECE Department, Khalifa University, Abu Dhabi, UAE; EE Department, University of Surrey, Guildford, UK

    Mirto Musci, Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata, Pavia, Italy

    Engelbert Mephu Nguifo, LIMOS–Blaise Pascal University–Clermont University, Clermont-Ferrand, France; LIMOS–CNRS UMR, Aubière, France

    Stavroula Ntoufa, Hematology Department and HCT Unit, G. Papanikolaou Hospital, Thessaloniki, Greece; Institute of Applied Biosciences, C.E.R.TH, Thessaloniki, Greece

    Nahumi Nugrahaningsih, Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 3, 27100 Pavia, Italy

    Hasan Oğul, Department of Computer Engineering, Başkent University, Ankara, Turkey

    Nikos Papakonstantinou, Hematology Department and HCT Unit, G. Papanikolaou Hospital, Thessaloniki, Greece; Institute of Applied Biosciences, C.E.R.TH, Thessaloniki, Greece

    Sheng-Lung Peng, Department of Computer Science and Information Engineering, National Dong Hwa University, Hualien, Taiwan

    Solon P. Pissis, Department of Informatics, King's College London, London, United Kingdom

    Huzefa Rangwala, Department of Computer Science, George Mason University, Fairfax, VA, USA

    Sara Roque, Department of Mathematics, University of Aveiro, Aveiro, Portugal

    Luis Rueda, School of Computer Science, University of Windsor, Windsor, Canada

    Agustín Sánchez-Cobos, Bioinformatics Unit, Centre for Molecular Biology Severo Ochoa (CSIC), Madrid, Spain

    Maad Shatnawi, Higher Colleges of Technology, Abu Dhabi, UAE

    Soheila Shokrollahzade, Department of Medicinal Biotechnology, Iran University of Medical Sciences, Tehran, Iran

    Carina Silva, Lisbon School of Health Technology, Lisbon, Portugal; Center of Statistics and Applications of Lisbon University (CEAUL), Lisbon, Portugal

    Tiratha Raj Singh, Biotechnology and Bioinformatics Department, Jaypee University of Information and Technology (JUIT), Solan, Himachal Pradesh, India

    V Anne Smith, School of Biology, University of St Andrews, St Andrews, UK

    Jiangning Song, National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China; Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, VIC, Australia

    Yang Song, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA

    Lisete Sousa, Department of Statistics and Operations Research, Lisbon University, and Center of Statistics and Applications of Lisbon University (CEAUL), Lisbon, Portugal

    Alexandros Stamatakis, Karlsruhe Institute of Technology, Institute for Theoretical Informatics, Postfach, Karlsruhe, Germany; Heidelberg Institute for Theoretical Studies, Heidelberg, Germany

    Kostas Stamatopoulos, Hematology Department and HCT Unit, G. Papanikolaou Hospital, Thessaloniki, Greece; Institute of Applied Biosciences, C.E.R.TH, Thessaloniki, Greece

    Kamal Taha, ECE Department, Khalifa University, Abu Dhabi, UAE

    Nadia Tahiri, Département d'informatique, Université du Québec à Montréal, Succ. Centre-Ville, Montréal, Québec, Canada

    Li Teng, Department of Internal Medicine, University of Iowa, Iowa City, IA, USA

    Evangelos Theodoridis, Computer Technology Institute and Press Diophantus, Patras, Greece

    Athina Tsanousa, Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece

    Yu-Wei Tsay, Institute of Information Science, Academia Sinica, Taipei, Taiwan

    Chinua Umoja, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Brigitte Vannier, Receptors, Regulation and Tumor Cells (2RTC) Laboratory, University of Poitiers, Poitiers, France

    Meghana Vasavada, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA

    Akbar Vaseghi, Department of Genetics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran

    Davide Verzotto, Computational and Systems Biology, Genome Institute of Singapore, Singapore

    Jason T.L. Wang, Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA

    Emanuel Weitschek, Department of Engineering, Uninettuno International University, Rome, Italy; Institute of Systems Analysis and Computer Science A. Ruberti, National Research Council, Rome, Italy

    Renxiang Yan, Institute of Applied Genomics, College of Biological Science and Engineering, Fuzhou University, Fuzhou, China

    Paul D. Yoo, Department of Computing and Informatics, Bournemouth University, UK; Centre for Distributed and High Performance Computing, University of Sydney, Sydney, Australia

    Guoxian Yu, College of Computer and Information Science, Southwest University, Chongqing, China

    Xiaxia Yu, Department of Computer Science, Georgia State University, Atlanta, GA, USA

    Ziding Zhang, State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, China

    Albert Y. Zomaya, Centre for Distributed and High Performance Computing, University of Sydney, Sydney, Australia

    Preface

    Pattern recognition is the automatic identification of regularities, that is, figures, characters, shapes, and forms, present in data. Pattern recognition is the core process of many scientific discoveries, whereby researchers detect regularities in large amounts of data in fields as diverse as Geology, Physics, Astronomy, and Molecular Biology. Pattern recognition in biomolecular data is at the core of Molecular Biology research and makes a very important contribution to the analysis of such data. By identifying patterns shared by structures related to biological macromolecules, that is, DNA, RNA, and proteins, originating from several different organisms, it can reveal information about the shared biological functions of these macromolecules. These patterns, which have been conserved during evolution, often play an important structural and/or functional role and, consequently, shed light on the mechanisms and the biological processes in which these macromolecules participate. Pattern recognition in biomolecular data is also used in evolutionary studies, in order to analyze the relationships that exist between species, to establish whether two or more biological macromolecules are homologous, that is, have a common biological ancestor, and to reconstruct the phylogenetic tree that links them to this ancestor.

    With the new sequencing technologies, the number of biological sequences in databases is increasing exponentially, and these sequences are long. Hence, the recognition of patterns in such databases requires the development of fast, low-memory, high-performance techniques and approaches. This book provides an up-to-date forum of such techniques and approaches that deal with the most studied, the most important, and/or the newest topics in the field of pattern recognition.
    Some of these techniques and approaches represent improvements on older ones, while others are completely new. Most current books on pattern recognition in biomolecular data either lack technical depth or focus on specific, narrow topics. This book is the first overview of techniques and approaches for pattern recognition in biomolecular data that offers both broad coverage of the field and enough depth to be of practical use to working professionals. It surveys the most recent developments, offering enough fundamental and technical information on these techniques and approaches and the related problems without overloading the reader. This book will thus be invaluable not only for practitioners and professional researchers in Computer Science, Life Science, and Mathematics but also for graduate students and young researchers looking for promising directions in their work. It will certainly point them to new techniques and approaches that may be the key to new and important discoveries in Molecular Biology.

    This book is organized into seven parts: Pattern Recognition in Sequences, Pattern Recognition in Secondary Structures, Pattern Recognition in Tertiary Structures, Pattern Recognition in Quaternary Structures, Pattern Recognition in Microarrays, Pattern Recognition in Phylogenetic Trees, and Pattern Recognition in Biological Networks. The 29 chapters that make up the seven parts of this book were carefully selected to provide wide scope with minimal overlap between the chapters. Each contributor was asked to cover review material as well as current developments in his/her chapter. In addition, the authors were chosen from among the leaders in their respective fields.

    Mourad Elloumi

    Tunis, Tunisia

    Costas S. Iliopoulos

    London, UK

    Jason T. L. Wang

    Newark, USA

    Albert Y. Zomaya

    Sydney, Australia

    November 1, 2015

    Part I

    Pattern Recognition in Sequences

    Chapter 1

    Combinatorial Haplotyping Problems

    Giuseppe Lancia

    Dipartimento di Matematica e Informatica, University of Udine, Udine, Italy

    1.1 Introduction

    A few years back, the process of collecting a vast amount of genomic data culminated with the completion of the Human Genome Project [22]. The outcome of the project has brought the confirmation that the genetic makeup of humans (as well as other species) is remarkably well conserved. Generally speaking, the differences at DNA level between any two individuals amount to less than 5% of their genomic sequences. As a consequence, the differences at the phenotype level (i.e., in the way the organisms look) must be caused by small regions of differences in the genomes. The smallest possible region consists of a single nucleotide and is called a Single Nucleotide Polymorphism (SNP, pronounced snip). SNPs are the predominant form of human genetic variation and their importance can hardly be overestimated; they are used, for example, in medical, drug-design, diagnostic, and forensic applications.

    Broadly speaking, a polymorphism is a trait common to everybody that can take different values, ranging in a limited set of possibilities, called alleles (for simple examples, think of blood group or eye color). In particular, a SNP is a nucleotide site, placed in the middle of a DNA region that is otherwise identical in everybody, at which we observe a statistically significant variability. A SNP is almost always a polymorphism with only two alleles (out of the four possible). For a nucleotide site to be considered a SNP, it must be the case that the less frequent allele is found in the population with some nonnegligible frequency.

    For many living species (e.g., mammals), there exist two sexes and each individual has two parents, one of each sex. The genome of these species is organized in pairs of chromosomes and a single copy of each chromosome pair is inherited from each of the two parents. Organisms in which the genome is organized in pairs of homologous chromosomes are called diploid organisms. For a diploid organism, at each SNP, an individual can either be homozygous (i.e., possess the same allele on both chromosomes) or heterozygous (i.e., possess two different alleles). The values of a set of SNPs on a particular chromosome copy define a haplotype.

    In Figure 1.1, we illustrate a simplified example, showing a specific chromosome in three individuals. For each individual, both of her chromosome copies are reported. There are four SNPs. The alleles for SNP 1, in this example, are C and G, while for SNP 4 they are T and C. Individual 1 is heterozygous for SNPs 1, 2, and 3, and homozygous for SNP 4. Her haplotypes are CCCT and GAGT. The haplotypes of individual 3 are CAGT and GACC.
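    The classification of sites as homozygous or heterozygous follows directly from comparing the two haplotypes position by position. A minimal illustrative sketch in Python (the function name is ours, not from the text):

```python
def classify_sites(hap1: str, hap2: str) -> list:
    """Label each SNP site 'hom' (same allele on both copies) or 'het'."""
    assert len(hap1) == len(hap2)
    return ["hom" if a == b else "het" for a, b in zip(hap1, hap2)]

# Individual 1 from Figure 1.1: haplotypes CCCT and GAGT.
print(classify_sites("CCCT", "GAGT"))  # ['het', 'het', 'het', 'hom']
```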

    Figure 1.1 A chromosome in three individuals. There are four SNPs.

    Haplotyping an individual consists in determining her two haplotypes for a given chromosome. With the increasing availability of SNP genomic data, recent years have seen the birth of many new computational problems related to haplotyping. These problems are motivated by the fact that it can be difficult and/or very expensive to determine the haplotypes experimentally, so ad hoc algorithms have been used to correct data errors or to infer missing data.

    In the remainder of this chapter, we will address the haplotyping problem for both a single individual and a set of individuals (a population). In the former case, described in Section 1.2, the input is haplotype data inconsistent with the existence of exactly two parents for an individual. This inconsistency is due to experimental errors and/or missing data. In the latter case, described in Section 1.3, the input data specify for each SNP and each individual in a population whether her parents have contributed the same or different alleles, but do not specify which parent contributed which allele.

    1.2 Single Individual Haplotyping

    The process of passing from the sequence of nucleotides in a DNA molecule to a string over the DNA alphabet is called sequencing. A sequencer is a machine that is fed some DNA and whose output is a string of As, Ts, Cs, and Gs. To each letter, the sequencer attaches a value (confidence level) that essentially represents the probability that the letter has been correctly read.

    The main problem with sequencing is that, owing to technological limitations, it is not possible to sequence a long DNA molecule at once. What we can do, however, is to sequence short DNA fragments (also called reads), of about 1000 nucleotides each, which provide small windows to look at the target molecule. To sequence a long DNA molecule, the molecule must first be replicated (amplified) by creating many copies of it. These copies are then broken, at random, into several smaller fragments, which are fed to a sequencer that will sequence those of the right size. The amplification phase is necessary so that the reads can have nonempty overlap. From the overlap of two reads, one may infer (through a process called assembly) a longer fragment, and so on, until the original DNA sequence has been reconstructed. This is, in essence, the principle of shotgun sequencing, a method used by Celera Genomics in the late 1990s to complete the sequencing of the human genome faster than the other experimental techniques of the time [62]. In shotgun sequencing, the fragments were read and then assembled back into the original sequence by using sophisticated algorithms and powerful computers.

    In Figure 1.2, we illustrate an example in which the two chromosome copies (a) and (b) have been amplified, and then some fragments (denoted by rectangular boxes) have been sequenced. The goal would then be to retrieve (a) and (b), given the set of sequenced fragments. The major difficulty obstructing this goal is that, during the amplification phase, both the paternal and the maternal chromosome copies are amplified together so that the random reads can belong to either the paternal or the maternal original copy. Of course, it is unknown which fragments are paternal and which are maternal, and one of the problems in reconstructing the haplotypes consists in segregating the paternal fragments from the maternal ones. In some cases, there may be pairs of reads, called mate pairs, for which it is known that they are either both paternal or both maternal. Mate pairs are due to a particular technique that allows sequencing of the two ends of a fragment several thousand nucleotides long. Again, one can only read about 1000 nucleotides at each end of the target, but the result is stronger than reading two individual fragments, as now it is known that the two reads come from the same chromosome copy. Furthermore, the experiment returns a fairly precise estimate of the distance, expressed in bases, between the two reads of a mate pair.


    Figure 1.2 Sequence reads and assembly of the two chromosome copies.

    Even with the best possible technology, sequencing errors are unavoidable. The main errors consist in bases that have been miscalled or skipped altogether. Furthermore, contaminants can be present, that is, DNA coming from organisms other than the one that had to be sequenced. Owing to the experimental errors, the reconstruction of the haplotypes, given the reads coming from sequencing, is not always straightforward and may require correction of the input data. In a general way, the haplotyping problem for an individual can then be informally stated as follows:

    Given inconsistent haplotype data coming from sequencing of an individual's chromosome, find and correct the errors in the data so as to retrieve a consistent pair of haplotypes.

    Depending on what type of errors one is after, there can be many versions of this problem (for surveys on individual haplotyping problems, see, e.g., Schwartz [59] or Zhang et al. [70]). Historically, the formalization of the first haplotyping problems for an individual was given by Lancia et al. [47]. In their work, the Minimum Fragment Removal (MFR) and Minimum SNP Removal (MSR) problems were introduced, which we briefly discuss here.

    Given the fact that at each SNP only two alleles are possible, we can encode them by using a binary alphabet. Hence, in the sequel, the two values that a SNP can take are denoted by 0 and 1. A haplotype is then simply a string over the alphabet {0, 1}.

    The basic framework for a single individual haplotyping problem is as follows. There is a set S = {1, …, n} of SNPs and a set F = {1, …, m} of fragments (i.e., reads coming from sequencing). Each SNP is covered by some of the fragments and can take the values 0 or 1. Since there is a natural ordering of the SNPs, given by their physical location on the chromosome, the data can be represented by an m × n matrix M over the alphabet {0, 1, -}, called the SNP matrix. Each column corresponds to a SNP and each row corresponds to a fragment. If fragment i covers a SNP j, then M[i, j] is the value of the allele for SNP j appearing in fragment i. The symbol "-" is used to represent a SNP not covered by a fragment (see Figure 1.3a and b for an example of a SNP matrix).


    Figure 1.3 SNP matrix and conflict graphs.

    A gapless fragment is one covering a set of consecutive SNPs (i.e., the 0s and 1s appear consecutively in that row). We say that a fragment has k gaps if it covers k + 1 blocks of consecutive SNPs (e.g., the fragment 00--101---01 has two gaps). Gaps are mainly due to two reasons: (i) thresholding of low-quality reads (when the sequencer cannot call a SNP 0 or 1 with enough confidence, the SNP is marked with a "-"); (ii) mate-pairing in shotgun sequencing (in this case k = 1, and the situation is equivalent to having two gapless fragments coming from the same chromosome copy).
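    Counting gaps amounts to counting the maximal runs of "-" that lie strictly between covered positions; leading and trailing "-" symbols denote SNPs outside the fragment's window and are not gaps. A small sketch (assuming fragments are given as strings over 0/1/-; the function name is ours):

```python
def count_gaps(fragment: str) -> int:
    """Count maximal runs of '-' between the covered SNPs of a fragment."""
    core = fragment.strip("-")  # ignore positions outside the covered window
    gaps, in_gap = 0, False
    for ch in core:
        if ch == "-":
            if not in_gap:
                gaps, in_gap = gaps + 1, True
        else:
            in_gap = False
    return gaps

print(count_gaps("00--101---01"))  # 2 (the fragment from the example)
```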

    Two fragments i and j are said to be in conflict if there exists a SNP k such that M[i, k] ≠ -, M[j, k] ≠ -, and M[i, k] ≠ M[j, k]. The conflict of two fragments implies that either the fragments do not come from the same chromosome copy or there are errors in the data. Given a SNP matrix M, the fragment conflict graph is the graph G_F(M) = (F, E_F), with an edge for each pair of fragments in conflict (see Figure 1.3c). Two SNPs k and l are said to be in conflict if there exist two fragments i and j such that the 2 × 2 submatrix defined by rows i and j and columns k and l contains three 0s and one 1, or three 1s and one 0. Given a SNP matrix M, the SNP conflict graph is the graph G_S(M) = (S, E_S), with an edge for each pair of SNPs in conflict (see Figure 1.3d).
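    Both conflict relations can be computed directly from the SNP matrix. A minimal sketch (function names are ours), representing the matrix as a list of strings over {'0', '1', '-'}:

```python
from itertools import combinations

def fragment_conflicts(M):
    """Edges of the fragment conflict graph: row pairs that disagree
    on some SNP covered by both."""
    return [(i, j) for i, j in combinations(range(len(M)), 2)
            if any(a != "-" and b != "-" and a != b
                   for a, b in zip(M[i], M[j]))]

def snp_conflicts(M):
    """Edges of the SNP conflict graph: column pairs whose 2x2 submatrix
    on some row pair has three equal entries and one different (no '-')."""
    edges = []
    for k, l in combinations(range(len(M[0])), 2):
        for i, j in combinations(range(len(M)), 2):
            cells = [M[i][k], M[i][l], M[j][k], M[j][l]]
            if "-" not in cells and cells.count("1") in (1, 3):
                edges.append((k, l))
                break
    return edges

M = ["010", "001", "110"]
print(fragment_conflicts(M))  # [(0, 1), (0, 2), (1, 2)]
```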

    If G_F(M) is a bipartite graph, F can be segregated into two sets F1 and F2 of pairwise compatible fragments. From each set, one can infer one haplotype by fragment overlap. Let h1 and h2 be the haplotypes thus obtained. Since h1 and h2 correspond to the assembly of the fragments, the single individual haplotyping problems are sometimes also referred to as fragment assembly haplotyping problems.

    We call a SNP matrix M feasible if G_F(M) is bipartite and infeasible otherwise. Note that a SNP matrix for error-free data must be feasible. Hence, the optimization problems to be defined aim at correcting an infeasible SNP matrix so that it becomes feasible.
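    Feasibility is thus a bipartiteness test, and for error-free data the two colour classes of the fragment conflict graph assemble directly into the two haplotypes. A minimal sketch under that error-free assumption (names are ours; note that when the conflict graph has several connected components, the pairing of colour classes across components is arbitrary):

```python
from collections import deque

def is_feasible(M):
    """2-colour the fragment conflict graph of M; return the colouring,
    or None if an odd cycle makes the matrix infeasible."""
    def conflict(r, s):
        return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))
    colour = [None] * len(M)
    for start in range(len(M)):
        if colour[start] is not None:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(len(M)):
                if i != j and conflict(M[i], M[j]):
                    if colour[j] is None:
                        colour[j] = 1 - colour[i]
                        queue.append(j)
                    elif colour[j] == colour[i]:
                        return None  # odd cycle: infeasible
    return colour

def assemble(M, colour, side):
    """Overlap the fragments of one colour class into a haplotype."""
    hap = ["-"] * len(M[0])
    for row, c in zip(M, colour):
        if c == side:
            for k, a in enumerate(row):
                if a != "-":
                    hap[k] = a
    return "".join(hap)

M = ["01--", "--01", "10--", "--10"]             # error-free reads
col = is_feasible(M)
print(assemble(M, col, 0), assemble(M, col, 1))  # 0101 1010
```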

    The following are the first optimization problems defined with the above goal [47]. They arose at Celera Genomics in the context of sequencing the human genome.

    MFR: Given a SNP matrix, remove the minimum number of fragments (rows) so that the resulting matrix is feasible.

    MSR: Given a SNP matrix, remove the minimum number of SNPs (columns) so that the resulting matrix is feasible.

    The first problem is mainly suited for a situation in which, more than sequencing errors, one is worried about the presence of contaminants. The second problem is more suited in the presence of sequencing errors only, when all the fragments are to be retained. These problems were shown to be NP-hard in general [47] so that exact algorithms for their solution are expected to be exponential branch-and-bound procedures. Lippert et al. [53] described a combinatorial branch-and-bound algorithm for MFR. They also described an Integer Linear Programming (ILP) formulation of the problem, based on the correspondence between MFR and the maximum node-induced bipartite subgraph problem.
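    For intuition, both problems can be solved by exhaustive search on toy instances; the sketch below does this for MFR (our own illustrative code, exponential in the number of fragments, and nothing like the branch-and-bound or ILP approaches cited above):

```python
from itertools import combinations

def bipartite(M):
    """True iff the fragment conflict graph of SNP matrix M is bipartite."""
    def conflict(r, s):
        return any(a != "-" and b != "-" and a != b for a, b in zip(r, s))
    colour = {}
    for start in range(len(M)):
        if start in colour:
            continue
        colour[start] = 0
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(M)):
                if i != j and conflict(M[i], M[j]):
                    if j not in colour:
                        colour[j] = 1 - colour[i]
                        stack.append(j)
                    elif colour[j] == colour[i]:
                        return False
    return True

def mfr_bruteforce(M):
    """Smallest set of rows whose removal leaves a feasible matrix."""
    for k in range(len(M) + 1):
        for drop in combinations(range(len(M)), k):
            if bipartite([M[i] for i in range(len(M)) if i not in drop]):
                return list(drop)

# Three pairwise-conflicting fragments form an odd cycle; removing any
# single one restores feasibility.
print(mfr_bruteforce(["011", "100", "0-0"]))  # [0]
```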

    While the general case for MFR and MSR is NP-hard, these problems were shown to be polynomial for gapless data [47]. Let us call M a gapless matrix if each row of M is a gapless fragment. The main connection between MFR and MSR is given by the following theorem.

    Theorem 1.1

    [47] Let M be a gapless SNP matrix. Then, G_F(M) is a bipartite graph if and only if the vertex set of G_S(M) is a stable set (i.e., G_S(M) has no edges).

    Later theoretical improvements extended these results to fragments with gaps of bounded total length, giving dynamic programming algorithms for MFR and MSR whose running times are polynomial in the input size but exponential in the total gap length (Rizzi et al. [57]; Bafna et al. [58]). These algorithms are hardly practical, however, on instances in which the gaps can be rather large. To overcome this problem, Xie and Wang [67] (see also Xie et al. [68]) proposed an algorithm for MFR based on two new parameters: k1, the maximum number of SNPs covered by any fragment, and k2, the maximum number of fragments covering any SNP site (also called the coverage). Their running time grows exponentially with k2 rather than with the gap length. Since k2 is generally at most 10, this method should be more suitable for mate-paired data, where the gaps can be quite large.

    1.2.1 The Minimum Error Correction Model

    Owing to the nature of the sequencing process, most data errors are due to miscalling or missing a base in a read. As a consequence, a particular version of the single individual haplotyping problem was proposed in Reference [53] and has received great attention because of its practical relevance. This version is called Minimum Error Correction (MEC) problem (the problem is also known as Minimum Letter Flip (MLF), see Reference [32]):

    MEC: Given a SNP matrix, change the minimum number of entries (0 into 1, or 1 into 0) so that the resulting matrix is feasible.
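    Since a feasible matrix is one whose rows can each be explained by one of two haplotypes, MEC can equivalently be stated as finding a haplotype pair that minimizes the total number of mismatches between each fragment and its nearer haplotype. The exhaustive sketch below (our own illustration, exponential in the number of SNPs, for toy instances only) uses that formulation:

```python
from itertools import product

def mec_bruteforce(M):
    """Minimum-error-correction cost and an optimal haplotype pair."""
    n = len(M[0])
    def dist(frag, hap):  # mismatches on covered positions only
        return sum(1 for a, b in zip(frag, hap) if a != "-" and a != b)
    best = None
    for h1 in product("01", repeat=n):
        for h2 in product("01", repeat=n):
            cost = sum(min(dist(f, h1), dist(f, h2)) for f in M)
            if best is None or cost < best[0]:
                best = (cost, "".join(h1), "".join(h2))
    return best

# One miscalled base in the third read forces a single correction.
print(mec_bruteforce(["010", "101", "110", "101"])[0])  # 1
```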

    Particularly interesting is a weighted version of MEC in which each entry is associated with a nonnegative weight proportional to the confidence level with which the base corresponding to the entry has been called. The solution seeks to flip a set of entries with minimum total weight so as to make the matrix feasible.

    The MEC problem was shown to be NP-hard, both for general [53] and for gapless matrices [20]. Many approaches were proposed in the literature for the solution of MEC. For a comparative survey of the various procedures, the reader is referred to Geraci [29] and to Geraci and Pellegrini [30].

    One of the first heuristics for MEC was FastHare by Panconesi and Sozio [56]. FastHare first sorts the fragments according to their leftmost ends and then reconstructs the two final haplotypes by correcting the sorted fragments in a greedy manner. Despite its simplicity, this heuristic is very fast and, in general, provides quite accurate solutions, especially when the error rate in the data is not too high. For high error-rate data, Genovese et al. proposed SpeedHap [28], an effective heuristic procedure organized in phases. In each phase, it performs three tasks: (i) detect likely positions of errors, (ii) allocate fragments to the two partially built final haplotypes, and (iii) decide the final alleles in the two haplotypes via a majority rule on ambiguous SNPs.
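    In the same spirit, though not a faithful implementation of FastHare or SpeedHap, a left-to-right greedy assembler can be sketched as follows: fragments are processed in order of their starting SNP, each is assigned to the partial haplotype class it agrees with more, and the final alleles are called by majority vote. All names and encodings here are illustrative.

```python
def greedy_assemble(fragments):
    """Greedy two-class fragment partitioning with majority-vote calling."""
    def start(f):
        return next((j for j, a in enumerate(f) if a != '-'), len(f))

    m = max(len(f) for f in fragments)
    # votes[k][j] = [count of 0s, count of 1s] assigned to haplotype k at SNP j
    votes = [[[0, 0] for _ in range(m)] for _ in range(2)]

    def agreement(f, k):
        # positive if the fragment agrees with the current consensus of class k
        s = 0
        for j, a in enumerate(f):
            if a == '-':
                continue
            zeros, ones = votes[k][j]
            s += (ones - zeros) if a == '1' else (zeros - ones)
        return s

    for f in sorted(fragments, key=start):
        k = 0 if agreement(f, 0) >= agreement(f, 1) else 1
        for j, a in enumerate(f):
            if a != '-':
                votes[k][j][int(a)] += 1

    # majority call per SNP; uncovered sites default to '0'
    return [''.join('1' if ones > zeros else '0' for zeros, ones in votes[k])
            for k in range(2)]
```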

    FAHR (Fast and Accurate Haplotype Reconstruction, by Wu and Liang [6]) is a somewhat similar greedy heuristic, which builds the final haplotypes by partitioning the fragments at each SNP in a greedy manner. Yet another heuristic for MEC is HMEC, proposed by Bayzid et al. [7]. HMEC is a steepest-descent local search procedure in which each iteration takes time c01-math-0058 . Because of its time complexity, HMEC may provide results slowly on instances with a large number of fragments.

    Zhao et al. [72] considered the weighted version of MLF (WMLF), for which they proposed a dynamic clustering heuristic. They also introduced an extended version of the problem to deal with the presence of contaminants: in this version, denoted Complete Weighted Minimum Letter Flip (CWMLF), one is allowed not only to flip entries of the SNP matrix but also to remove some of its rows. CWMLF, too, was attacked via dynamic clustering. Wu et al. [66] proposed a heuristic procedure for WMLF with performance similar to that of the best previous methods.

    Among the various types of approaches to the solution of MEC, there is the use of evolutionary algorithms such as Genetic Algorithms (GA) and Particle Swarm Optimization (PSO). Wang et al. [63] proposed both a branch-and-bound and a GA for MEC. Being exponential, the branch-and-bound is only applicable to instances of very small size, but the GA can be used for large instances (e.g., more than 50 fragments over more than 50 SNPs). A PSO heuristic for MEC was proposed by Kargar et al. [46]. PSO turns out to be fast and suitable for instances with a low error rate in the data.

    One of the most successful heuristics for MEC is HapCUT, proposed by Bansal and Bafna [3]. The algorithm makes use of a suitable graph obtained from the fragments, where each SNP corresponds to a node in the graph and two nodes are joined by an edge if there exists a fragment that covers both SNPs. The procedure then tries to minimize the MEC cost of the reconstructed haplotypes by iteratively finding max-cuts in the associated graph.
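    The graph underlying HapCUT is straightforward to build from the fragments. The sketch below constructs its edge set; the encoding (fragments as strings over '0', '1', '-') and the function name are illustrative, not code from HapCUT.

```python
from itertools import combinations

def snp_graph(fragments):
    """Nodes are SNP indices; an edge joins two SNPs whenever some
    fragment covers both. Returns the set of edges (u, v) with u < v."""
    edges = set()
    for frag in fragments:
        covered = [j for j, a in enumerate(frag) if a != '-']
        for u, v in combinations(covered, 2):
            edges.add((u, v))
    return edges
```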

    He et al. [41] proposed a dynamic programming algorithm for MEC with time complexity c01-math-0059 , where c01-math-0060 is the maximum length of a fragment. The algorithm can be used for values of c01-math-0061 up to about 15, but beyond that it becomes impractical. For larger values of c01-math-0062 , the authors suggest modeling the problem as a MaxSAT instance, to be handled by a MaxSAT solver. The quality of the solutions obtained through this approach is shown to be quite high, but the solving process is still slow.

    Another dynamic programming algorithm, this time parameterized by the maximum coverage c01-math-0063 of each SNP site, was proposed by Deng et al. [23]. The complexity of the dynamic program is c01-math-0064 , which can be quite high when c01-math-0065 is large. For large values of c01-math-0066 , the authors propose a heuristic procedure based on the dynamic programming algorithm. Experiments showed that the heuristic returns very accurate solutions on average.

    Chen et al. [18] proposed an exact algorithm for MEC based on an ILP formulation of the problem. The solving process is considerably sped up by a preprocessing phase, in which the input SNP matrix is decomposed into smaller, independent blocks that are then individually solved by the ILP solver.

    1.2.2 Probabilistic Approaches and Alternative Models

    Some of the solutions to the assembly haplotyping problems have employed a probabilistic approach, either because of the type of algorithm used or because of the type of model and problem definition.

    An example of a probabilistic algorithm is HASH (Haplotype Assembly for Single Human), an MCMC (Markov Chain Monte Carlo) algorithm proposed by Bansal et al. [4] for assembling haplotype fragments under the MEC objective function. In HASH, the transitions of the Markov chain are generated using min-cut computations on graphs derived from the reads. The method was applied to infer the haplotypes from real data consisting of whole-genome shotgun sequence fragments of a human individual. The results showed the haplotypes inferred by HASH to be significantly more accurate than those estimated by the best heuristics available at the time.

    Another type of Markov chain approach was proposed by Wang et al. [64]. In this approach, the assignment of fragments to haplotypes is regarded as a Markov process in which successive allele pairs are generated on the basis of the value of a small number of preceding SNP sites in the sequence. In order to find the most probable haplotypes for the set of fragments, the authors propose a Viterbi-like dynamic programming procedure.

    Chen et al. [19] described a probabilistic model for MEC in which they considered three error parameters, called c01-math-0067 , c01-math-0068 , and c01-math-0069 . The parameter c01-math-0070 is the error rate with which entries of the SNP matrix have been miscalled (i.e., a 0 in place of a 1, or vice versa). The value c01-math-0071 is the error rate with which a "-" appears in the SNP matrix where, in fact, a SNP is covered by a fragment and therefore an allele 0 or 1 should be present. Finally, c01-math-0072 is a measure of the expected dissimilarity between the haplotypes to be reconstructed. The authors gave a linear-time (in the size of c01-math-0073 ) probabilistic algorithm that reconstructs the correct haplotypes with high probability when the parameters are known. Even for the case when some of the parameters are unknown, they provided a probabilistic algorithm that outputs the correct haplotypes with probability depending on c01-math-0074 , c01-math-0075 , and c01-math-0076 .

    Li et al. [52] proposed a probabilistic model for the haplotype inference problem. In their model, they studied the conditional probability c01-math-0077 of c01-math-0078 and c01-math-0079 being the correct haplotypes given that c01-math-0080 was the SNP matrix measured. They then pursued the objective of determining the most likely pair of correct haplotypes, that is, a pair c01-math-0081 maximizing the above conditional probability. In order to solve this problem, they used Gibbs sampling and an Expectation Maximization (EM) algorithm.

    Among the works proposed for the solution of the MEC model, some papers introduced variants to the original framework in order to deal with specific data problems. For instance, Zhang et al. [71] proposed a model called Minimum Conflict Individual Haplotyping (MCIH), suitable for data sets with a particularly high error rate. The problem was shown to be NP-hard, and a dynamic programming procedure for its solution was described.

    Duitama et al. [25] proposed a model called Maximum Fragments Cut (MCF) whose objective is to identify a set of fragments (i.e., rows of the SNP matrix) maximizing a score proportional to their conflict with respect to the remaining rows. This set of fragments can be interpreted as the shore of a cut in a suitable graph, so that the problem can be reduced to a max-cut problem. The authors utilized a local optimization heuristic, called ReFHap, for the problem solution.

    Xie et al. [69] introduced a model called Balanced Optimal Partition (BOP), which generalizes both MEC and MCF. The model is, in a way, a weighted combination of MEC and MCF, and by setting some model parameters to proper values, BOP degenerates into pure MEC or pure MCF. To solve the model, the authors proposed a dynamic programming algorithm called H-BOP.

    Aguiar and Istrail [1] proposed HapCompass, an algorithm for haplotype assembly of densely sequenced human genome data. HapCompass operates on a graph where each SNP is a node and the edges are associated with the fragments. Each edge is weighted, and its weight indicates the best phasing of the alleles at its endpoints, given the fragments in the data set that cover the corresponding SNPs. In this model, the reconstruction of the underlying haplotypes corresponds to removing a minimum-weight set of edges so as to obtain a special type of subgraph, called a happy graph by the authors. HapCompass is a heuristic with a local reoptimization step. Computational results showed the effectiveness of the procedure and the good quality of its solution also with respect to the MEC objective.

    1.3 Population Haplotyping

    Haplotype data are particularly sought after in the study of complex diseases (those affected by more than one gene) because they convey precious information about which sets of gene alleles are inherited together. These polymorphism screenings are usually conducted on a population, that is, on a set of related individuals.

    We have discussed at length the problem of haplotyping a single individual, which consists in determining her two haplotypes for a given chromosome. In a similar way, haplotyping a population consists in haplotyping each individual of the given population. Of course, one way to do this would be to solve a set of single individual haplotyping problems, one for each element of the population, but there might be better ways to proceed. In particular, with the increased availability of SNP data, recent years have seen the birth of a set of new computational problems related to population haplotyping. Most of these problems are motivated by the fact that, while it is economically inconvenient to determine the haplotypes by sequencing and assembling, there is a cheap experiment that can determine a less informative and often ambiguous type of data, the genotypes, from which the haplotypes can then be retrieved computationally.

    A genotype of an individual contains the information about the two (possibly identical) alleles at each SNP, but without specifying their paternal or maternal origin. There may be many possible pairs of haplotypes that are consistent with a given genotype. For example, assume we only know that an individual is heterozygous for the alleles c01-math-0082 at SNP 1 and for the alleles c01-math-0083 at SNP 2. Then, either one of these alternatives may be true:

    One parent has contributed the alleles C and A and the other the alleles T and G.

    One parent has contributed the alleles C and G and the other the alleles T and A.

    Both possibilities are plausible. Associating the alleles with the parents is called phasing the alleles. For c01-math-0084 heterozygous SNPs, there are c01-math-0085 possible phasings, which makes choosing the correct one a difficult problem. Once the alleles have been phased, the two haplotypes are immediately inferred: phasing and haplotyping are in fact the same problem. The two haplotypes obtained by phasing the alleles are said to resolve, or explain, the genotype.
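    As a small illustration of the counting argument, one can enumerate the distinct phasings of a genotype encoded as a string over {'0', '1', '2'}, with '2' marking a heterozygous site; for k ≥ 1 heterozygous sites, the enumeration yields exactly 2**(k-1) distinct unordered haplotype pairs. The encoding and function name are illustrative conventions.

```python
from itertools import product

def phasings(genotype):
    """All unordered haplotype pairs resolving a genotype over {'0','1','2'}."""
    amb = [j for j, a in enumerate(genotype) if a == '2']
    seen = set()
    for bits in product('01', repeat=len(amb)):
        h1 = list(genotype)
        for j, b in zip(amb, bits):
            h1[j] = b
        # the second haplotype takes the complementary allele at ambiguous sites
        h2 = ''.join('1' if (a == '2' and c == '0') else
                     ('0' if (a == '2' and c == '1') else c)
                     for a, c in zip(genotype, h1))
        seen.add(tuple(sorted((''.join(h1), h2))))
    return sorted(seen)
```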

    The most general population haplotyping problem can be stated as follows:

    Given a set c01-math-0086 of genotypes, corresponding to an existing, unknown, set c01-math-0087 of haplotypes, retrieve c01-math-0088 .

    In other words, the goal is to compute a set c01-math-0089 of haplotypes that contains, for each genotype c01-math-0090 , the two haplotypes c01-math-0091 and c01-math-0092 obtained by the correct phasing of c01-math-0093 . As could be expected, on the basis of the knowledge of c01-math-0094 alone, it is not easy to describe constraints that define precisely which of the exponentially many phasings of a genotype is the correct one. Biologists have therefore described several sensible criteria for good phasings. For instance, under a widely accepted parsimony principle (inspired by Occam's razor), a good solution may be one that minimizes the number of distinct haplotypes inferred.

    Once it has been mathematically modeled, haplotyping gives rise to several nice and challenging combinatorial problems (for surveys on population haplotyping problems, see, e.g., References [11, 38, 39]). These problems have been extensively studied in the last few years. Some of them have been proven NP-hard and solved by exponential-time algorithms, while for others polynomial-time algorithms have been designed.

    In this chapter, we address some of the most interesting combinatorial haplotyping models proposed in the literature. Each model and objective function has specific biological motivations, which are discussed in the following sections. While our focus will be on the combinatorial approach to haplotyping problems, it should be remarked that there is also a very important statistical approach to population haplotyping problems, which does not fall within the scope of this survey. The statistical approach has led to widely used software tools for haplotyping, such as the program PHASE [60].

    Given a set of c01-math-0095 SNPs, fix arbitrarily a binary encoding of the two alleles for each SNP (i.e., call one of the two alleles 0 and the other 1). Once the encoding has been fixed, each haplotype becomes a binary vector of length c01-math-0096 .

    A haplotype c01-math-0097 is denoted by c01-math-0098 , the value of its c01-math-0099 th component, with c01-math-0100 . Given two haplotypes c01-math-0101 and c01-math-0102 , we define a special sum whose result is a vector c01-math-0103 . The vector c01-math-0104 has length c01-math-0105 , and its components take values in c01-math-0106 , according to the following rule:

    (h1 ⊕ h2)[j] = h1[j] if h1[j] = h2[j], and (h1 ⊕ h2)[j] = 2 otherwise.

    We call a vector c01-math-0108 with entries in c01-math-0109 a genotype. Each position c01-math-0110 such that c01-math-0111 is called an ambiguous position (or ambiguous site). We denote by c01-math-0112 the set of ambiguous positions of c01-math-0113 . Biologically, genotype entries of value 0 or 1 correspond to homozygous SNP sites, while entries of value 2 correspond to heterozygous sites. In Figure 1.4, we illustrate a case of three individuals, showing their haplotypes and genotypes.
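    Under the binary encoding, the special sum defining a genotype from two haplotypes is a one-liner (the function name is illustrative):

```python
def genotype_of(h1, h2):
    """Special sum of two haplotypes: the common allele at homozygous
    sites, 2 at heterozygous sites."""
    return ''.join(a if a == b else '2' for a, b in zip(h1, h2))
```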


    Figure 1.4 Haplotypes and corresponding genotypes.

    A resolution of a genotype c01-math-0114 is given by a pair of haplotypes c01-math-0115 such that c01-math-0116 . The haplotypes c01-math-0117 and c01-math-0118 are said to resolve c01-math-0119 . A genotype is ambiguous if it has more than one possible resolution, that is, if it has at least two ambiguous positions. A haplotype c01-math-0120 is said to be compatible with a genotype c01-math-0121 if c01-math-0122 can be used in a resolution of c01-math-0123 . It can immediately be seen that c01-math-0124 is compatible with c01-math-0125 if and only if c01-math-0126 at each position where c01-math-0127 . Two genotypes c01-math-0128 and c01-math-0129 are compatible if there exists at least one haplotype compatible with both of them, otherwise, they are incompatible. It can immediately be seen that c01-math-0130 and c01-math-0131 are compatible if and only if c01-math-0132 at each position c01-math-0133 where they are both nonambiguous.
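    The two compatibility criteria just stated translate directly into code (an illustrative sketch under the same {'0','1','2'} string encoding):

```python
def compatible_hg(h, g):
    """A haplotype is compatible with a genotype iff they agree at every
    nonambiguous (non-'2') site of the genotype."""
    return all(a == b for a, b in zip(h, g) if b != '2')

def compatible_gg(g1, g2):
    """Two genotypes are compatible iff they agree at every site where
    both are nonambiguous."""
    return all(a == b for a, b in zip(g1, g2) if a != '2' and b != '2')
```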

    As previously discussed, the experiment yielding each genotype is such that, at each SNP, it is known whether an individual is homozygous for allele 0 (in which case, we may set c01-math-0134 ), homozygous for allele 1 (in which case, we may set c01-math-0135 ), or heterozygous (in which case, we may set c01-math-0136 ). Therefore, for the haplotyping problems described in this section, the input data consist of a set c01-math-0137 of c01-math-0138 genotypes c01-math-0139 , corresponding to c01-math-0140 individuals in a population. The output is a set c01-math-0141 of haplotypes such that, for each c01-math-0142 , there is at least one pair of haplotypes c01-math-0143 with c01-math-0144 . Such a set c01-math-0145 of haplotypes is said to explain c01-math-0146 . In addition to explaining c01-math-0147 , the set c01-math-0148 is also required to satisfy some particular constraints. These constraints differ across the specific types of haplotyping problems. For each problem described in this survey, the particular constraints are given in the corresponding section.

    1.3.1 Clark's Rule

    The geneticist Clark proposed in Reference [21] a rule (today known as Clark's rule) to derive new haplotypes by inference from known ones as follows:

    Clark's Inference Rule: Given a genotype c01-math-0149 and a compatible haplotype c01-math-0150 , obtain a new haplotype c01-math-0151 by setting c01-math-0152 at all positions c01-math-0153 and c01-math-0154 at the remaining positions.

    Note that c01-math-0155 and c01-math-0156 resolve c01-math-0157 . In order to resolve all genotypes of c01-math-0158 , Clark suggested the use of successive applications of his inference rule. The procedure requires a bootstrap set of haplotypes used to derive new haplotypes at the very beginning. These starting haplotypes are obtained by resolving, in the unique possible way, the unambiguous genotypes in c01-math-0159 (of which it is assumed there is always at least one).

    The following is the procedure that Clark proposed for haplotyping a population. He supported the validity of the approach by arguments from theoretical population genetics:

    Clark's Algorithm: Let c01-math-0160 be the set of nonambiguous genotypes and let c01-math-0161 be the set of haplotypes obtained from c01-math-0162 . Reset c01-math-0163 . Then, repeat the following. If they exist, take a c01-math-0164 and a compatible c01-math-0165 and apply the inference rule, obtaining c01-math-0166 . Set c01-math-0167 , c01-math-0168 , and iterate. When no such c01-math-0169 and c01-math-0170 exist, the algorithm succeeds if c01-math-0171 and fails otherwise.

    Note that the procedure is nondeterministic, as it does not specify how to choose c01-math-0172 and c01-math-0173 whenever there is more than one candidate for the application of the rule. For example, suppose c01-math-0174 . The algorithm starts by setting c01-math-0175 and c01-math-0176 . The inference rule can be used to resolve c01-math-0177 from c01-math-0178 , obtaining c01-math-0179 , which can, in turn, be used to resolve c01-math-0180 , obtaining c01-math-0181 . However, one could have started by using c01-math-0182 to resolve c01-math-0183 , obtaining c01-math-0184 . At that point, there would be no way to resolve c01-math-0185 . The nondeterminism in the choice of the genotype c01-math-0186 and haplotype c01-math-0187 to which the inference rule is applied can be settled by fixing a deterministic rule based on the initial sorting of the data. In Reference [21], a large number of random sortings are used to run the algorithm, and the best solution overall is reported. Tests on real and simulated data sets showed that, although the algorithm could resolve all genotypes most of the time, it also failed fairly often.
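    A minimal sketch of Clark's algorithm with one fixed deterministic choice rule (scan the pending genotypes in input order and apply the rule with the first compatible known haplotype) might look as follows. This illustrates the procedure's logic, not Clark's original implementation; names and encodings are illustrative.

```python
def clark_resolve(genotypes):
    """Seed with the haplotypes of the unambiguous genotypes, then apply
    Clark's inference rule greedily until no ambiguous genotype can be
    resolved. Returns (known haplotypes, unresolved genotypes)."""
    def is_ambiguous(g):
        return '2' in g

    def infer(g, h):
        # Clark's inference rule: the mate of h with respect to g
        return ''.join(b if b != '2' else ('1' if a == '0' else '0')
                       for a, b in zip(h, g))

    def compatible(h, g):
        return all(a == b for a, b in zip(h, g) if b != '2')

    haplos = {g for g in genotypes if not is_ambiguous(g)}
    pending = [g for g in genotypes if is_ambiguous(g)]
    progress = True
    while progress and pending:
        progress = False
        for g in list(pending):
            h = next((h for h in haplos if compatible(h, g)), None)
            if h is not None:
                haplos.add(infer(g, h))
                pending.remove(g)
                progress = True
    return haplos, pending
```

    The procedure succeeds when the returned list of unresolved genotypes is empty; different scan orders can succeed or fail on the same input, which is exactly the nondeterminism discussed above.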

    The problem of finding an ordering of application of Clark's inference rule that leaves the fewest number of unresolved genotypes in the end was first defined and studied by Gusfield [33], who proved it is NP-hard and APX-hard (i.e., there is a value c01-math-0188 for which it is NP-hard even to give a c01-math-0189 -approximation algorithm). As for practical algorithms, Gusfield [34, 35] proposed an integer programming approach for a graph-theoretic reformulation of the problem. The problem is first transformed, by an exponential-time reduction, into a problem on a digraph c01-math-0190 defined as follows. Let c01-math-0191 , where

    c01-math-0192 . Let c01-math-0193 be the subset of haplotypes determined from the set c01-math-0194 of unambiguous genotypes. For each c01-math-0195 and c01-math-0196 in c01-math-0197 , there is an arc c01-math-0198 if c01-math-0199 is ambiguous, c01-math-0200 , and c01-math-0201 (i.e., c01-math-0202 can be inferred from c01-math-0203 via c01-math-0204 ). Then, any directed tree rooted at a node c01-math-0205 specifies a feasible history of successive applications of the inference rule starting at node c01-math-0206 . The problem can then be stated as follows: find the largest number of nodes that can be reached by a set of node-disjoint directed trees, where each tree is rooted at a node in c01-math-0207 and, for every ambiguous genotype c01-math-0208 , at most one node in c01-math-0209 is reached.

    The above graph problem was shown to be NP-hard [34]. For its solution, Gusfield proposed an ILP formulation and noticed that the solution of the LP relaxation was very often integer for the real-life instances tested. The model was applied to real data as well as random instances, with up to 80 genotypes, of which 60 were ambiguous, over 15 SNPs.

    1.3.2 Pure Parsimony

    Clark's algorithm tries to reuse existing haplotypes as much as possible and it introduces new haplotypes only when strictly needed. As a result, the procedure usually tends to produce solutions of small size. Note that the maximum size for a set c01-math-0210 resolving c01-math-0211 is c01-math-0212 , while the smallest possible size is c01-math-0213 .

    Pure Parsimony Haplotyping (PPH) has the explicit objective of minimizing the size of c01-math-0214 . This objective has several biological motivations. For one, the number of distinct haplotypes observed in nature is vastly smaller than the number of possible haplotypes. Furthermore, as we all descend from a small number of ancestors, their haplotypes should be the same as those we possess today (were it not for some recombination events and mutations). Finally, pure parsimony follows a general principle according to which, of many possible explanations of an observed phenomenon, one should always favor the simplest.

    The PPH problem is NP-hard, as first shown by Hubbell [43]. Lancia et al. [48] showed that, in fact, the problem is APX-hard. This result holds even if each genotype is restricted to possess at most three ambiguous sites. Note that, although the problem is APX-hard, there exist constant-ratio approximation algorithms under the restriction that each genotype has at most c01-math-0215 ambiguous sites, for each constant c01-math-0216 [48].
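    Since PPH is NP-hard, exact solutions on anything but tiny instances require the sophisticated methods surveyed in the following sections. Still, a brute-force solver that tries candidate haplotype sets of increasing cardinality makes the objective concrete. The sketch below is illustrative and exponential in time; it reuses the {'0','1','2'} string encoding.

```python
from itertools import combinations, product

def compatible_haplotypes(g):
    """All haplotypes compatible with genotype g."""
    amb = [j for j, a in enumerate(g) if a == '2']
    for bits in product('01', repeat=len(amb)):
        h = list(g)
        for j, b in zip(amb, bits):
            h[j] = b
        yield ''.join(h)

def resolves(h1, h2, g):
    """True iff the pair (h1, h2) resolves genotype g."""
    return all((a == b == c) if c != '2' else (a != b)
               for a, b, c in zip(h1, h2, g))

def pph_brute_force(genotypes):
    """Smallest haplotype set explaining all genotypes; tiny inputs only."""
    candidates = sorted({h for g in genotypes for h in compatible_haplotypes(g)})
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if all(any(resolves(h1, h2, g)
                       for h1 in subset for h2 in subset)
                   for g in genotypes):
                return set(subset)
    return None
```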

    Several algorithmic approaches, many of them employing mathematical programming techniques, have been used in PPH (for a survey of models and approaches for PPH, see Reference [17]). In particular, we recall the following works.

    1.3.2.1 Integer Programming Formulations of Exponential Size

    Let us denote by c01-math-0217 the set of all haplotypes compatible with at least one genotype of c01-math-0218 and let c01-math-0219 . The first ILP formulation for PPH, called TIP, was provided by Gusfield [37]. TIP has c01-math-0220 variables and c01-math-0221 constraints. There is a binary variable c01-math-0222 associated with every c01-math-0223 ( c01-math-0224 means that c01-math-0225 is included in the solution, whereas c01-math-0226 means that c01-math-0227 is not included). After fixing a total ordering on c01-math-0228 , denote by c01-math-0229 the set of those ordered pairs c01-math-0230 with c01-math-0231 , c01-math-0232 . For every c01-math-0233 , let c01-math-0234 . For every c01-math-0235 and c01-math-0236 , there is a binary variable c01-math-0237 used to select an ordered pair c01-math-0238 to resolve the genotype c01-math-0239 . The following is then a valid ILP formulation of the PPH problem [37, 48]:

    1.1   minimize Σ_{h ∈ H} x_h

    subject to

    1.2   Σ_{(h,k) ∈ P_g} y_{h,k} ≥ 1,   for each genotype g ∈ G

    1.3   y_{h,k} ≤ x_h,   for each ordered pair (h,k) ∈ P

    1.4   y_{h,k} ≤ x_k,   for each ordered pair (h,k) ∈ P

    1.5   x_h ∈ {0,1}, y_{h,k} ∈ {0,1},   for each h ∈ H and (h,k) ∈ P

    Constraints (1.2) impose that each genotype is resolved by at least one pair of haplotypes. Constraints (1.3) and (1.4) make sure that c01-math-0245 can be used to resolve a genotype only if both c01-math-0246 and c01-math-0247 are included in the solution.

    Gusfield managed to employ some preprocessing rules to get rid of variables that can be proved to be nonessential in the formulation. The resulting model is called RTIP (Reduced TIP). Although practically useful, the preprocessing still leaves an exponential model whose size grows quite quickly with respect to the instance size. The experimental results of Reference [37] showed that RTIP can be used to tackle problems with up to 50 genotypes over 30 SNPs, provided the input genotypes have relatively low levels of heterozygosity.

    Lancia and Rizzi [50] showed that when each genotype has at most two ambiguous sites, the above ILP is totally unimodular. As a consequence, the ILP solution is always naturally integer and hence the problem can be solved in polynomial time.

    Lancia and Serafini [51] proposed a new exponential ILP model for PPH. The model exploits an interpretation of PPH as a particular type of Set Covering (SC) based on the following observation: if c01-math-0248 is a set of haplotypes resolving c01-math-0249 , then for each genotype c01-math-0250 , ambiguous position c01-math-0251 , and value c01-math-0252 , there is a haplotype c01-math-0253 such that c01-math-0254 .

    This condition is a covering condition because it represents the covering constraint for a set cover problem in which the universe is the set of all triples c01-math-0255 , for c01-math-0256 , c01-math-0257 , and c01-math-0258 , and each haplotype c01-math-0259 represents a subset of the universe, namely, c01-math-0260 . The condition is only necessary, but not sufficient, for c01-math-0261 to be a feasible solution of PPH. Consider, for example, the following diagonal instance c01-math-0262 . The set c01-math-0263 satisfies the covering condition but does not resolve c01-math-0264 .

    The SC associated with PPH seeks to minimize the number of haplotypes needed to satisfy all covering conditions for the given set of genotypes c01-math-0265 . As we have seen, this SC is in fact a relaxation of PPH. The formulation of the SC model has an exponential number of variables and constraints, and can be solved, as described in Reference [51], by branch-and-cut-and-price, that is, via the generation of variables and constraints at run-time. It is possible that the optimal solution of SC resolves c01-math-0266 , in which case it is an optimal PPH solution as well. If not, one can try to obtain a good feasible PPH solution from the optimal cover by adding only a small number of haplotypes. This idea is exploited by an effective heuristic presented in Reference [51]. The computational results show that the SC approach can be orders of magnitude faster than RTIP and can be applied to instances with c01-math-0267 and c01-math-0268 up to c01-math-0269 . These are among the largest-size instances for which provably optimal solutions have been obtained in the literature.

    1.3.2.2 Integer Programming Formulations of Polynomial Size and Hybrid Formulations

    Several authors (Brown and Harrower [13], Bertolazzi et al. [8], Lancia et al. [48], Halldorsson et al. [40]) have independently proposed polynomial-size integer programming formulations. More or less, all of these formulations employ variables to represent the bits of the haplotypes in the solution (so there are at most c01-math-0270 such variables). The basic idea is that for each genotype c01-math-0271 one must determine two haplotypes c01-math-0272 and c01-math-0273 such that c01-math-0274 . This implies that, for each position c01-math-0275 such that c01-math-0276 , one needs to decide a variable c01-math-0277 , and then set c01-math-0278 and c01-math-0279 . The polynomial formulations for PPH consist in expressing the PPH objective function and constraints in terms of the c01-math-0280 variables and possibly a (polynomial-size) set of additional variables.
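    The core decision in these polynomial-size formulations, one binary variable per heterozygous site determining which haplotype receives which allele, can be illustrated as follows (names and the string encoding are illustrative):

```python
def haplotypes_from_bits(genotype, bits):
    """Given one decision bit per heterozygous ('2') site of a genotype,
    build the resolving pair: h1 takes the bit, h2 its complement."""
    h1, h2, it = [], [], iter(bits)
    for a in genotype:
        if a != '2':
            h1.append(a)
            h2.append(a)
        else:
            b = next(it)
            h1.append(b)
            h2.append('1' if b == '0' else '0')
    return ''.join(h1), ''.join(h2)
```

    Fixing all such bits for all genotypes fully determines the solving haplotypes; the formulations then count (via additional variables) how many distinct haplotypes are induced.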

    The LP relaxation of these formulations is generally quite weak. The use of some valid cuts [8, 13] improves the quality of the bound, but the integrality gap between the integer optimum and the LP relaxation remains large. Brown and Harrower [14] also proposed a hybrid model in which variables for a fixed subset of haplotypes are explicitly present, while the rest of the haplotypes are implicitly represented by polynomially many variables and constraints. These polynomial/hybrid formulations were successfully used for the solution of problems of similar size as those solvable by using TIP. Furthermore, some tests were conducted on larger problems (30 genotypes over up to 75 SNPs), on which the exponential formulation could not be applied successfully owing to the IP size.

    The last polynomial model for PPH was proposed by Catanzaro et al. [16] and is based on the class representatives with smallest index (a technique adopted by Campelo et al. for the vertex coloring problem [15]). The authors observed that a feasible solution to PPH can be represented by means of a bipartite graph whose two shores correspond to haplotypes and genotypes. Each vertex c01-math-0281 has degree 2, and there are two vertices, say c01-math-0282 and c01-math-0283 , adjacent to c01-math-0284 such that c01-math-0285 . The bipartite graph representation of a solution suggests that, in a feasible solution to PPH, the haplotypes induce a family of subsets of genotypes satisfying the following three properties: (i) each subset of genotypes shares one haplotype, (ii) each genotype belongs to exactly two subsets, and (iii) every pair of subsets intersects in at most one genotype. Catanzaro et al. [16] exploited this bipartite graph representation to provide a polynomial-size ILP model that turns out to be quite effective and to outperform the other polynomial models for the PPH problem.

    1.3.2.3 Quadratic, Semidefinite Programming Approaches

    A quadratic formulation solved by semidefinite programming was proposed by Kalpakis and Namjoshi [45]. Similarly to the RTIP model, the formulation has a variable for each possible haplotype and hence it cannot be used to tackle instances for which c01-math-0286 is too large. According to the computational results, the size of the problems solved is comparable to that of RTIP and of the best polynomial ILP models. On the basis of a similar formulation, an (exponential-time) approximation algorithm was presented in Reference [42].

    1.3.2.4 Combinatorial Branch-and-Bound Approaches

    Wang and Xu [65] proposed a simple combinatorial branch-and-bound approach. The solution is built by enumerating all possible resolutions for each of the genotypes in turn. The lower bound is the number of haplotypes used so far. Since the search space is exponential and the bound is weak, the method is not able to solve instances of size comparable to the other approaches. Even the solution for 20 genotypes over 20 SNPs can sometimes take an extremely long time to be found.

    1.3.2.5 Boolean Optimization

    One of the approaches applied to PPH is Pseudo Boolean Optimization (PBO). PBO is a technique by which a problem can be modeled via integer linear constraints over a set of boolean variables. The goal is to find an assignment of the variables that satisfies all constraints and minimizes a given objective function. The model is solved by a SAT solver (much in the same way as an ILP model is solved by an ILP solver), which employs solution-searching techniques specific for boolean models.

    The PBO models for PPH are mostly based on the following feasibility question: given a tentative cardinality k, does there exist a set H of k haplotypes that resolves G? The feasibility question is expressed in terms of Boolean variables similar to the binary variables of the polynomial ILP models. Starting at a lower bound, k is increased until the feasibility question has a positive answer. At that point, the corresponding set H represents an optimal solution.
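    The overall strategy can be sketched as follows, with a brute-force `feasible` check standing in for the PBO/SAT solver used in practice (all function names here are illustrative assumptions, not the authors' code):

```python
from itertools import combinations, product

def compatible_haplotypes(genotypes):
    """All haplotypes that can take part in resolving some genotype."""
    cands = set()
    for g in genotypes:
        het = [i for i, s in enumerate(g) if s == 2]
        for bits in product((0, 1), repeat=len(het)):
            h = list(g)
            for i, b in zip(het, bits):
                h[i] = b
            cands.add(tuple(h))
    return cands

def feasible(genotypes, H):
    """Can every genotype be resolved by a pair of haplotypes from H?"""
    def ok(h1, h2, g):
        return all((a == b == s) if s in (0, 1) else (a != b)
                   for a, b, s in zip(h1, h2, g))
    return all(any(ok(h1, h2, g) for h1 in H for h2 in H)
               for g in genotypes)

def parsimony_by_feasibility(genotypes):
    """Increase the tentative cardinality k until the feasibility
    question has a positive answer; brute force over candidate sets
    stands in for the SAT/PBO solver."""
    cands = compatible_haplotypes(genotypes)
    for k in range(1, len(cands) + 1):   # starting from a trivial lower bound
        for H in combinations(cands, k):
            if feasible(genotypes, H):
                return k
```

    The real algorithms replace the inner brute force with a single solver call per value of k, and start from much tighter lower bounds.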

    The first PBO algorithm for PPH was presented by Lynce and Marques-Silva [54]. Later works aimed at breaking the symmetries in the original PBO model and at computing tight lower and upper bounds to be used for pruning the search space (Graca et al. [31], Marques-Silva et al. [55]). The SAT-based approaches proved to be competitive with the best mathematical programming approaches for PPH.

    1.3.2.6 Heuristics

    In general, all the exact models mentioned so far run into trouble when trying to solve large instances (where the most critical parameter is the number of ambiguous positions per genotype). The exponential models require the creation of too many variables and/or constraints to obtain a solution within a reasonable time. The polynomial and combinatorial models, on the other hand, employ rather weak lower bounds, so that closing the gap and terminating the branch-and-bound search is again impossible within a reasonable time. In order to solve large instances of PPH, one then needs to resort to effective heuristics that can find near-optimal solutions to instances with hundreds of genotypes over hundreds of SNPs with high levels of heterozygosity.

    One such heuristic procedure is CollHaps, proposed by Tininini et al. [61]. CollHaps is based on the representation of the solution as a matrix M whose rows, two per genotype, are the solving haplotypes (also called symbolic haplotypes because they may contain variables), and in which some entries are fixed while others have to be determined. The entries to be determined correspond to heterozygous positions in the input genotypes. To each such entry, CollHaps associates a 0–1 variable, which is then set to a fixed value in the course of the procedure. At each step, CollHaps tries to maintain as few distinct symbolic haplotypes as possible, which it does by trying to make some rows of M identical via a greedy setting of the variables. The idea is to eventually end up with as few distinct actual haplotypes as possible. Experimental results have shown that CollHaps is a very effective and accurate heuristic for PPH, ranking among the best available procedures for this problem.
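    CollHaps itself works on the matrix of symbolic haplotypes and is considerably more sophisticated; the sketch below only illustrates the underlying greedy idea of reusing haplotypes across genotypes (a simplified stand-in, not the CollHaps algorithm):

```python
from itertools import product

def resolutions(g):
    """All haplotype pairs resolving genotype g (0/1/2 encoding)."""
    het = [i for i, s in enumerate(g) if s == 2]
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        yield tuple(h1), tuple(h2)

def greedy_parsimony(genotypes):
    """Resolve each genotype in turn by the pair that introduces the
    fewest haplotypes not already in use, mimicking the drive toward
    few distinct haplotypes."""
    used = set()
    for g in genotypes:
        h1, h2 = min(resolutions(g), key=lambda p: len(set(p) - used))
        used.update((h1, h2))
    return used
```

    Like most greedy procedures, the result depends on the order in which genotypes are processed, which is one reason the actual heuristic works globally on the symbolic matrix instead.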

    In another heuristic approach for PPH, Di Gaspero and Roli [27] proposed the use of stochastic local search, which they considered to yield a reasonable compromise between the quality of the solutions and running time of the procedure. In their work, Di Gaspero and Roli utilized a family of local search strategies, such as Best Improvement (BI), Stochastic First Improvement (SFI), Simulated Annealing (SA), and Tabu Search (TS).
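    A minimal Best Improvement skeleton over this search space might look as follows (a simplified sketch: the move re-resolves a single genotype, the objective is the number of distinct haplotypes, and none of this is the authors' implementation):

```python
from itertools import product
import random

def resolutions(g):
    """All haplotype pairs resolving genotype g (0/1/2 encoding)."""
    het = [i for i, s in enumerate(g) if s == 2]
    out = []
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        out.append((tuple(h1), tuple(h2)))
    return out

def cost(assign):
    """Objective: number of distinct haplotypes in the current solution."""
    return len({h for pair in assign for h in pair})

def best_improvement(genotypes, seed=0):
    """Best Improvement local search: repeatedly apply the move that
    most reduces the objective; stop at a local optimum."""
    rng = random.Random(seed)
    assign = [rng.choice(resolutions(g)) for g in genotypes]
    improved = True
    while improved:
        improved = False
        for i, g in enumerate(genotypes):
            for pair in resolutions(g):
                cand = assign[:i] + [pair] + assign[i + 1:]
                if cost(cand) < cost(assign):
                    assign, improved = cand, True
    return cost(assign)
```

    The SA and TS variants differ only in the acceptance rule (occasionally accepting worsening moves, or forbidding recently reversed ones) so as to escape such local optima.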

    1.3.3 Perfect Phylogeny

    One limitation of the PPH model is that it does not take into account the fact that haplotypes may evolve over time. Haplotypes evolve by mutation and recombination, and an individual can possess haplotypes that are possessed by neither of her parents. The haplotype regions in which no recombination and/or mutation events have ever happened in a population over time are called blocks. For reconstructing haplotype blocks, starting from genotype data of a block in a population, the parsimony model is the most appropriate [37]. However, when genotypes span several blocks, different models of haplotyping have been considered. One of these is haplotyping for perfect phylogeny.

    The perfect phylogeny model is used under the hypothesis that no recombination events happened, but only mutations. It is assumed that at the beginning there existed only one ancestral haplotype, and new haplotypes were derived over time from existing haplotypes as follows. If at some point there existed a haplotype h in the population and then a mutation of h at some site j happened, which was then passed on to the new generation, a new haplotype h' started to exist, with h'_i = h_i for every site i ≠ j, and h'_j = 1 − h_j. We say that h is the father of h' in the tree of haplotype evolution. In the infinite-sites coalescent model
