Statistical Bioinformatics: For Biomedical and Life Science Researchers
By Jae K. Lee
About this ebook
- Clearly explains the use of bioinformatics tools in life sciences research without requiring an advanced background in math/statistics
- Enables biomedical and life sciences researchers to successfully evaluate the validity of their results and make inferences
- Enables statistical and quantitative researchers to rapidly learn novel statistical concepts and techniques appropriate for large biological data analysis
- Carefully revisits frequently used statistical approaches and highlights their limitations in large biological data analysis
- Offers programming examples and datasets
- Includes chapter problem sets, a glossary, a list of statistical notations, and appendices with references to background mathematical and technical material
- Features supplementary materials, including datasets, links, and a statistical package available online
Statistical Bioinformatics is an ideal textbook for students in medicine, life sciences, and bioengineering and for researchers who use computational tools to analyze genomic, proteomic, and other emerging high-throughput molecular data. It can also serve as a rapid introduction to bioinformatics for statistical and computational students and audiences who have not encountered such analysis tasks before.
CHAPTER 1
ROAD TO STATISTICAL BIOINFORMATICS
Jae K. Lee
Department of Public Health Science, University of Virginia,
Charlottesville, Virginia, USA
There has been a great explosion of biological data and information in recent years, largely due to advances in high-throughput biotechnologies such as genome-wide SNP profiling, RNA expression microarrays, protein mass spectrometry, high-throughput sequencing, and many other recent techniques (Weinstein et al., 2002). Furthermore, powerful computing systems and fast Internet connections to large worldwide biological databases enable individual laboratory researchers to easily access an unprecedented amount of biological data. Such enormous data are often too overwhelming for researchers to digest and to extract the information most relevant to their investigation goals. In fact, these large biological data are information rich and often contain much more information than the researchers who generated them may have anticipated. This is why many major biomedical research institutes have made significant efforts to share such data freely with the general research community. Bioinformatics is the emerging scientific field concerned with developing analysis methods and tools for investigating such large biological data efficiently and rigorously. This development requires many different components: powerful computer systems to archive and process the data, effective database designs to extract and integrate information from heterogeneous biological databases, and efficient analysis techniques to investigate these large databases. In particular, analysis of these massive biological data is extremely challenging for the following reasons.
CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE
Analysis techniques for high-throughput biological data must carefully handle an astronomical number of candidate targets and possible mechanisms, most of which are false positives (Tusher et al., 2001). For example, a traditional statistical testing criterion that allows a 5% false-positive error rate (significance level) would identify about 500 false positives from 10,000-gene microarray data compared between two biological conditions of interest, even if no genes are truly differentially regulated between the two. If a small number of genes, say 100, are actually differentially regulated, these real differential expression patterns will be mixed with the 500 false positives, with no a priori information to discriminate the true positives from the false positives. Confidence in the ~600 targets identified by such statistical testing may therefore not be high. Simply tightening the statistical criterion results in a high false-negative error rate, missing many important real biological targets. This pitfall, the so-called multiple-comparisons issue, becomes even more serious when biological mechanisms such as signal transduction and regulation pathways involving multiple targets are searched for in such data; the number of candidate pathway mechanisms grows explosively, for example, 10! (about 3.6 million) orderings for a 10-gene sequential pathway. Thus, no matter how powerful a computer system is, it is prohibitive to tackle such problems by exhaustive computational search and comparison.
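The arithmetic above can be checked with a short simulation (an illustrative Python sketch, not from the book; all variable names are ours): with 10,000 null genes tested at the 5% level, roughly 500 pass by chance alone, while a Bonferroni-corrected threshold controls the family-wise error rate at the price of severe conservatism.

```python
import random

random.seed(0)

m = 10_000     # number of genes tested
alpha = 0.05   # per-test significance level

# Under the global null (no gene is truly differentially expressed),
# p-values are uniform on [0, 1]; simulate one such experiment.
p_values = [random.random() for _ in range(m)]

false_positives = sum(p < alpha for p in p_values)
print(f"expected false positives: {alpha * m:.0f}")   # 500
print(f"observed in simulation:   {false_positives}")

# Bonferroni correction: test each gene at alpha / m instead of alpha.
bonferroni = alpha / m
print(f"Bonferroni threshold: {bonferroni:g}")
print("genes passing Bonferroni:", sum(p < bonferroni for p in p_values))
```

Under the null, very few (usually zero) genes survive the Bonferroni threshold, which is exactly the false-negative risk the text warns about when real signals are present.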
Many current biological problems have been theoretically proven to be NP-hard in computer science (NP: nondeterministic polynomial time), implying that no polynomial-time algorithm is known that can search all possible solutions once the number of biological targets involved becomes large. More importantly, such an exhaustive search is simply prone to discovering numerous false positives. In fact, this is one of the most difficult challenges in investigating current large biological databases and is why heuristic algorithms, which tightly control the false-positive error rate while investigating only a very small portion of all possible solutions, are often sought for many biological problems. Thus, the success of many bioinformatics studies critically depends on the construction and use of effective and efficient heuristic algorithms, most of which are based on probabilistic modeling and statistical inference techniques that maximize the statistical power of identifying true positives while rigorously controlling false-positive error rates.
CHALLENGE 2: HIGH-DIMENSIONAL BIOLOGICAL DATA
The second challenge is the high-dimensional nature of biological data in many bioinformatics studies. When biological data are simultaneously generated for many gene targets, the data points become dramatically sparse in the corresponding high-dimensional data space. It is well known that mathematical and computational approaches often fail to capture such high-dimensional phenomena accurately (Tamayo et al., 1999). For example, many statistical algorithms cannot easily move between local maxima in a high-dimensional space. Also, inference that combines several disjoint lower-dimensional phenomena may not provide a correct understanding of the real phenomena in their joint, high-dimensional space. It is therefore important to understand statistical dimension reduction techniques that can reduce high-dimensional data problems to lower-dimensional ones while preserving the important variation of interest in the biological data.
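The sparsity described above can be seen in a small numerical experiment (an illustrative Python sketch, not from the book): as the dimension grows, pairwise distances between random points "concentrate", so the relative spread between the closest and farthest pairs shrinks and distance-based methods lose discriminating power.

```python
import math
import random

random.seed(1)

def pairwise_distance_spread(n_points, dim):
    """Ratio (max - min) / min over all pairwise Euclidean distances for
    random points in the unit hypercube; small ratios mean distances
    concentrate and 'near' vs. 'far' neighbors become indistinguishable."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(pts[i], pts[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative spread={pairwise_distance_spread(50, dim):.2f}")
```

In two dimensions the closest pair is far closer than the farthest (large spread); in 1000 dimensions all pairs are nearly equidistant, which is one concrete face of the "curse of dimensionality" that motivates dimension reduction.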
CHALLENGE 3: SMALL-n AND LARGE-p PROBLEM
The third challenge is the so-called "small-n and large-p" problem. Conventional statistical methods achieve their desired performance when the sample size n of the data, that is, the number of independent observations, is much larger than the number of parameters p that need to be estimated by statistical inference (Jain et al., 2003). In many bioinformatics problems, this situation is completely reversed. For example, in a microarray study, tens of thousands of gene transcripts' expression patterns may become candidate prediction factors for a biological phenomenon of interest (e.g., tumor sensitivity vs. resistance to a chemotherapeutic compound), but the number of independent observations (e.g., different patient biopsy samples) is often at most a few dozen. Due to experimental costs and limited biological materials, the number of independent replicated samples can sometimes be extremely small, for example, two or three, or unavailable. In these cases, most traditional statistical approaches perform very poorly. Thus, it is also important to select statistical analysis tools that provide both high specificity and high sensitivity under these circumstances.
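The small-n, large-p pitfall can be made concrete with a toy simulation (an illustrative Python sketch, not from the book): with only 5 samples per class but 10,000 pure-noise "genes", dozens of genes will perfectly separate the two classes purely by chance, so apparent predictors are plentiful even when no real signal exists.

```python
import random
from math import comb

random.seed(2)

n_per_class = 5   # samples per class (small n)
p = 10_000        # genes measured (large p)

def perfectly_separates(values_a, values_b):
    """True if every value in one class lies below every value in the other."""
    return max(values_a) < min(values_b) or max(values_b) < min(values_a)

# Pure-noise expression values: no gene is truly associated with class.
count = 0
for _ in range(p):
    a = [random.random() for _ in range(n_per_class)]
    b = [random.random() for _ in range(n_per_class)]
    if perfectly_separates(a, b):
        count += 1

# Analytic expectation: each null gene perfectly separates the classes
# with probability 2 / C(10, 5), so ~79 of 10,000 do so by chance.
expected = p * 2 / comb(2 * n_per_class, n_per_class)
print(f"expected perfect separators by chance: {expected:.1f}")
print(f"observed in simulation:               {count}")
```

This is why, with tiny sample sizes, a gene that "perfectly predicts" the phenotype in the training data carries very little evidence on its own.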
CHALLENGE 4: NOISY HIGH-THROUGHPUT BIOLOGICAL DATA
The fourth challenge is that high-throughput biotechnical data and large biological databases are inevitably noisy: biological information and signals of interest are observed together with many random or biased factors that may obscure the main signals of interest (Cho and Lee, 2004). Therefore, investigations of large biological data cannot succeed unless rigorous statistical algorithms are developed and effectively utilized to reduce and decompose the various sources of error. Also, careful assessment and quality control of the initial data sets are critical for all subsequent bioinformatics analyses.
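As a toy illustration of removing one simple systematic error source (a per-array additive offset, e.g., from labeling efficiency), the following hedged Python sketch median-centers simulated log-intensity arrays; real microarray normalization, covered later in the book, is considerably more sophisticated.

```python
import random
import statistics

random.seed(3)

# Simulate log-intensities for 3 arrays measuring the same 1,000 genes,
# each array carrying its own systematic additive offset plus random noise.
true_signal = [random.gauss(8.0, 1.0) for _ in range(1000)]
offsets = [0.0, 0.7, -0.4]
arrays = [
    [x + off + random.gauss(0.0, 0.2) for x in true_signal]
    for off in offsets
]

# Median-centering: subtract each array's median so arrays are comparable.
centered = [
    [x - statistics.median(arr) for x in arr]
    for arr in arrays
]

for i, arr in enumerate(centered):
    print(f"array {i}: median after centering = {statistics.median(arr):.3f}")
```

After centering, every array's median is (numerically) zero, so the systematic between-array shifts no longer masquerade as differential expression; random per-gene noise, of course, remains and needs statistical error modeling.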
CHALLENGE 5: INTEGRATION OF MULTIPLE, HETEROGENEOUS BIOLOGICAL DATA INFORMATION
The last challenge is the integration of information from multiple heterogeneous biological and clinical data sets, such as large gene functional and annotation databases, biological subjects' phenotypes, and patient clinical information. One of the main goals in performing high-throughput biological experiments is to identify the biological targets and mechanisms most highly associated with biological subjects' phenotypes, such as patients' prognosis and therapeutic response (Pittman et al., 2004). In these cases, multiple large heterogeneous datasets need to be combined in order to discover the most relevant molecular targets. This requires combining datasets with very different characteristics and formats, some of which cannot easily be integrated by standard statistical inference techniques, for example, genomic and proteomic expression data and pathway mechanisms reported in the literature. It will be extremely important to develop and use efficient yet rigorous analysis tools for integrative inference on such complex biological data, beyond the individual researcher's manual and subjective integration.
In this book, we introduce the statistical concepts and techniques that can overcome these challenges in studying various large biological datasets. Researchers with biological or biomedical backgrounds may not be able, or may not need, to learn advanced mathematical and statistical techniques beyond the intuitive understanding of such topics for their practical applications. Thus, we have organized this book for life science researchers to efficiently learn the most relevant statistical concepts and techniques for their specific biological problems. We believe that this composition of the book will help nonstatistical researchers to minimize unnecessary efforts in learning statistical topics that are less relevant to their specific biological questions, yet help them learn and utilize rigorous statistical methods directly relevant to those problems. Thus, while this book can serve as a general reference for various concepts and methods in statistical bioinformatics, it is also designed to be effectively used as a textbook for a semester or shorter length course as below. In particular, the chapters are divided into four blocks of different statistical issues in analyzing large biological datasets (Fig. 1.1):
I. Statistical Foundation: probability theories (Chapter 2), statistical quality control (Chapter 3), statistical tests (Chapter 4)
II. High-Dimensional Analysis: clustering analysis (Chapter 5), classification analysis (Chapter 6), multidimensional visualization (Chapter 7)
III. Advanced Analysis Topics: statistical modeling (Chapter 8), experimental design (Chapter 9), statistical resampling methods (Chapter 10)
IV. Multigene Analysis in Systems Biology: genetic network analysis (Chapter 11), genetic association analysis (Chapter 12), R Bioconductor tools in systems biology (Chapter 13)
Figure 1.1 Possible course structure.
The first block of chapters will be important, especially for students who do not have a strong statistical background. These chapters will provide general backgrounds and terminologies to initiate rigorous statistical analysis on large biological datasets and to understand more advanced analysis topics later. Students with a good statistical understanding may also quickly review these chapters since there are certain key concepts and techniques (especially in Chapters 3 and 4) that are relatively new and specialized for analyzing large biological datasets.
The second block consists of analysis topics frequently used in investigating high-dimensional biological data. In particular, clustering and classification techniques, by far, are most commonly used in many practical applications of high-throughput data analysis. Various multidimensional visualization tools discussed in Chapter 7 will also be quite handy in such investigations.
The third block deals with more advanced topics in large biological data analysis, including advanced statistical modeling for complex biological problems, statistical resampling techniques that can be conveniently used with the combination of classification (Chapter 6) and statistical modeling (Chapter 8) techniques, and experimental design issues in high-throughput microarray studies.
The final block contains concise descriptions of the analysis topics in several active research areas of multigene network and genetic association analysis, as well as the R Bioconductor software in systems biology analysis. These will be quite useful for performing challenging gene network and multigene investigations in the fast-growing systems biology field.
These four blocks of chapters can be followed in the current order for a full semester-length course. However, the last three blocks are relatively independent of each other and, after the first block, can be covered (or skipped for specific needs and foci under a time constraint) in any order, as depicted in Figure 1.1. We hope that life science researchers who must deal with challenging analysis issues in overwhelmingly large biological data can effectively meet their learning goals in this way.
REFERENCES
Cho, H., and Lee, J. K. (2004). Bayesian hierarchical error model for analysis of gene expression data. Bioinformatics, 20(13): 2016–2025.
Jain, N., et al. (2003). Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics, 19(15): 1945–1951.
Pittman, J., et al. (2004). Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proc. Natl. Acad. Sci. U.S.A., 101(22): 8431–8436.
Tamayo, P., et al. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A., 96(6): 2907–2912.
Tusher, V. G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A., 98(9): 5116–5121.
Weinstein, J. N., et al. (2002). The bioinformatics of microarray gene expression profiling. Cytometry, 47(1): 46–49.
CHAPTER 2
PROBABILITY CONCEPTS AND DISTRIBUTIONS FOR ANALYZING LARGE BIOLOGICAL DATA
Sooyoung Cheon
KU Industry-Academy Cooperation Group Team of Economics and Statistics, Korea University, Jochiwon 339–700, Korea
2.1 INTRODUCTION
In general, results vary when measurements are taken repeatedly. Because of this variability, many decisions must be made under uncertainty. For example, in medical research, interest may center on the effectiveness of a new vaccine for mumps or the estimation of the cure rate of a treatment for breast cancer. In these situations, a probabilistic foundation can help people make decisions that account for uncertainty.
Experiments may be generally considered for events where the outcomes cannot be predicted with certainty; that is, each experiment ends in an outcome that cannot be determined with certainty before the performance of the experiment. Such experiments are called random experiments. Thus we consider probability in the study of randomness and uncertainty.
A physician tells a patient with breast cancer that there is a 90% cure rate with treatment. A sportscaster gives a baseball team a 20% chance of winning the next World Series. These two situations represent two different approaches to probability, the frequentist and subjective approaches, respectively. The frequentist approach interprets a probability as a long-term relative frequency. The cure rate is based on the observed outcomes of a large number of treated cases. When an experiment is repeated a large number of times under identical conditions, a regular pattern may emerge: the relative frequencies with which different outcomes occur settle down to fixed proportions. These fixed proportions are defined as the probabilities of the corresponding outcomes. The subjective approach is a personal assessment of the chance that a given outcome will occur. The situations in this case are one-time phenomena that are not repeatable. The accuracy of this probability depends on the information available about the particular situation and a person's interpretation of that information. A probability can be assigned, but it may vary from person to person.
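The long-run stabilization of relative frequencies underlying the frequentist view can be demonstrated with a short simulation (an illustrative Python sketch, not part of the book): the observed frequency of heads in repeated fair-coin tosses settles toward the true probability 0.5.

```python
import random

random.seed(4)

# Track the relative frequency of heads at increasing numbers of tosses.
checkpoints = (10, 100, 1_000, 10_000, 100_000)
heads = 0
tosses = 0
for n in checkpoints:
    while tosses < n:
        heads += random.random() < 0.5  # True counts as 1
        tosses += 1
    print(f"after {n:>7} tosses: relative frequency = {heads / n:.4f}")
```

Early frequencies fluctuate noticeably, but by 100,000 tosses the proportion is very close to 0.5, which is exactly the "fixed proportion" the frequentist definition appeals to.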
This chapter reviews basic probability concepts and common distributions used for analyzing large biological data, in particular in bioinformatics. First, basic concepts and some basic rules of probability are introduced. In many cases we need to update our knowledge based on new information; for that we need the concept of conditional probability. General definitions of conditional probability and Bayes's theorem are introduced; the theorem is the basis of the statistical methodology called Bayesian statistics. Then random variables are discussed, along with the expected value and variance used to summarize their behavior. Six discrete and five continuous probability distributions that occur most frequently in bioinformatics and computational biology are described. We conclude with a description of the empirical distribution and sampling strategies related to resampling and bootstrapping techniques.
2.2 BASIC CONCEPTS
We begin with the notion of a random experiment, which is a procedure or an operation whose outcome is uncertain, and consider some aspects of events themselves before considering the probability theory associated with events.
2.2.1 Sample Spaces and Events
The collection of all possible outcomes of a particular experiment is called a sample space. We will use the letter S to denote a sample space. An event is any collection of possible outcomes of an experiment, that is, any subset of S. For example, if we plan to roll a die twice, the experiment is the actual rolling of the die two times, and an event such as "the first roll shows a six" may or may not have occurred after the experiment is carried out. In the first five examples of Example 2.1 below, the sample spaces are discrete; in the last two, they are continuous.
Example 2.1 Sample Space and Events from Random Experiments and Real Data. The following are some examples of random experiments and the associated sample spaces and events:
1. Assume you toss a coin once. The sample space is S = {H, T}, where H = head and T = tail and the event of a head is {H}.
2. Assume you toss a coin twice. The sample space is S = {(H, H), (H, T), (T, H), (T, T)}, and the event of obtaining exactly one head is {(H, T), (T, H)}.
3. Assume you roll a single die. The sample space is S = {1, 2, 3, 4, 5, 6}, and the event that the outcome is even is {2, 4, 6}.
4. Assume you roll two dice. The sample space is S = {(1, 1), (1, 2),..., (6, 6)}. The event that the sum of the numbers on the two dice equals 5 is {(1, 4), (2, 3), (3, 2), (4, 1)}.
5. Assume you count the number of defective welds in a car body. The sample space is S = {0, 1, ..., N}, where N = total number of welds. The event that the number of defective welds is no more than two is {0, 1, 2}.
6. Assume you measure the relative humidity of air. The sample space is S = [0, 100]. The event that the relative humidity is at least 90% is [90, 100].
7. Assume you observe the lifetime of a car battery. The sample space is S = [0, ∞). The event that the car battery fails before 12 months is [0, 12).
Event algebra is a mathematical language to express relationships among events and combinations of events. Below are the most common terms and ones that will be used in the remainder of this chapter.
Union The union of two events A and B, denoted by A ∪ B, is the event consisting of all outcomes that belong to A or B or both.
Intersection The intersection of events A and B, denoted by A ∩ B or simply by AB, is the event consisting of all outcomes common to both A and B.
Complementation The complement of an event A, denoted by Ac, is the event consisting of all outcomes not in A.
Disjoint If A and B have no outcomes in common, that is, A ∩ B = Ø, where Ø is the empty set, then A and B are called disjoint or mutually exclusive events.
Example 2.2 Event Algebra from Random Experiments. Consider rolling two fair dice and the following events defined on the sum of the two numbers:
A = {Sum of two dice is a multiple of 3} = {3, 6, 9, 12}
B = {Sum of two dice is a multiple of 4} = {4, 8, 12}
C = {Sum of two dice is even} = {2, 4, 6, 8, 10, 12}
D = {Sum of two dice is odd} = {3, 5, 7, 9, 11}
1. The union of A and B is A ∪ B = {3, 4, 6, 8, 9, 12}.
2. The intersection of A and B is A ∩ B = {12}.
3. The complementation of A is Ac = {2, 4, 5, 7, 8, 10, 11}.
4. Since C ∩ D = Ø, C and D are mutually exclusive events.
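The set operations in Example 2.2 map directly onto set types in most programming languages; here is a small illustrative Python sketch (not from the book) reproducing the example on the sample space of possible sums:

```python
# Events from Example 2.2 as Python sets over the possible sums of two dice.
S = set(range(2, 13))              # sample space of sums: {2, ..., 12}
A = {s for s in S if s % 3 == 0}   # sum is a multiple of 3 -> {3, 6, 9, 12}
B = {s for s in S if s % 4 == 0}   # sum is a multiple of 4 -> {4, 8, 12}
C = {s for s in S if s % 2 == 0}   # sum is even
D = {s for s in S if s % 2 == 1}   # sum is odd

print("A ∪ B =", sorted(A | B))          # union
print("A ∩ B =", sorted(A & B))          # intersection
print("Ac    =", sorted(S - A))          # complement of A within S
print("C, D disjoint?", C & D == set())  # mutually exclusive events
```

Running it reproduces items 1 through 4 of the example: A ∪ B = {3, 4, 6, 8, 9, 12}, A ∩ B = {12}, Ac = {2, 4, 5, 7, 8, 10, 11}, and C ∩ D = Ø.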
2.2.2 Probability
When an experiment is performed, an outcome from the sample space occurs. If the same experiment is performed a number of times, different outcomes may occur each time or some outcomes may repeat. The relative frequency of occurrence of an outcome in a large number of identical experiments can be thought of as a probability. More probable outcomes are those that occur more frequently. If the outcomes of an experiment can be described probabilistically, the data from an experiment can be analyzed statistically.
A probability assigns a real number to each event. Let S be the sample space of an experiment. Each event A in S has a number, P(A), called the probability of A, which satisfies the following three axioms:
Axiom 1 P(A) ≥ 0.
Axiom 2 P(S) = 1.
Axiom 3 If A and B are mutually exclusive events, then P(A ∪ B) = P(A) + P(B).
These three axioms form the basis of all probability theory. Any function P that satisfies the axioms of probability is called a probability function. For any sample space, many different probability functions can be defined. The following basic results can be derived using event algebra:
P(Ac) = 1 – P(A).
For any two events A and B, P(A ∪ B) = P(A) + P(B) – P(A ∩ B).
For any two events A and B, P(A) = P(A ∩ B) + P(A ∩ Bc).
If B is included in A, then A ∩ B = B. Therefore P(A) – P(B) = P(A ∩ Bc) and P(A) ≥ P(B).
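These axioms and derived rules can be verified mechanically for the equally likely probability function on a fair die (an illustrative Python sketch using exact fractions; not from the book):

```python
from fractions import Fraction

# Equally likely probability function on a fair die, S = {1, ..., 6}.
S = set(range(1, 7))

def P(event):
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}   # even
B = {4, 5, 6}   # at least 4

# The three axioms
assert P(A) >= 0                                     # Axiom 1
assert P(S) == 1                                     # Axiom 2
assert P({1, 2} | {5, 6}) == P({1, 2}) + P({5, 6})   # Axiom 3 (disjoint)

# Derived rules from event algebra
assert P(S - A) == 1 - P(A)                   # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)     # inclusion-exclusion
assert P(A) == P(A & B) + P(A & (S - B))      # decomposition by B and Bc
print("all probability rules hold on the fair-die example")
```

Exact `Fraction` arithmetic avoids the floating-point round-off that would otherwise make equality checks like these fragile.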
Example 2.3 Probabilities of Events from Random Experiments. The following are some examples of the probability of an outcome from experiments given in Example 2.1.
1. P({H}) = ½.
2. P({(T, T)}) = ¼.
3. P({1, 2}) = 2/6 = ⅓.
4. P({(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}) = 6/36 = 1/6.
2.3 CONDITIONAL PROBABILITY AND INDEPENDENCE
A sample space is generally defined and all probabilities are calculated with respect to that sample space. In many cases, however, we are in a position to update the sample space based on new information. For example, as in the fourth case of Example 2.3, if we consider only outcomes where the two rolls of a die are the same, the size of the sample space is reduced from 36 to 6. General definitions of conditional probability and independence are introduced below. Bayes's theorem is also introduced, which is the basis of a statistical methodology called Bayesian statistics.
2.3.1 Conditional Probability
Consider two events, A and B. Suppose that an event B has occurred. This occurrence may change the probability of A. We denote this by P(A | B), the conditional probability of event A given that B has occurred.
As an example, suppose a card is randomly dealt from a well-shuffled deck of 52 cards. The probability of this card being a heart is ¼, since there are 13 hearts in the deck. However, if you are told that the dealt card is red, then the probability of a heart changes to ½, because the sample space is reduced to just 26 red cards. Likewise, if you are told that the dealt card is black, then P(Heart | Black) = 0.
If A and B are events in S and P(B) > 0, then P(A | B) is called the conditional probability of A given B if the following axiom is satisfied:
Axiom P(A | B) = P(A ∩ B)/P(B).
Example 2.4 Conditional Probability from Tossing Two Dice. An experiment consists of tossing two fair dice with a sample space of 6 x 6 = 36 outcomes. Consider two events:
A = {difference of the numbers on the dice is 3}
B = {sum of the numbers on the dice is a multiple of 3}
What is the conditional probability of A given B?
Solution
A = {(1,4), (2,5), (3,6), (4,1), (5,2), (6, 3)},
B = {(1,2), (1,5), (2,1), (2,4), (3,3), (3,6), (4,2), (4,5), (5,1), (5,4), (6,3), (6,6)}
and thus A ∩ B = {(3, 6), (6, 3)}.
Thus B and A ∩ B consist of 12 and 2 outcomes, respectively. Assuming that all outcomes are equally likely, the conditional probability of A given B is
P(A | B) = P(A ∩ B)/P(B) = (2/36)/(12/36) = 2/12 = 1/6.
That is, within the reduced sample space of the 12 outcomes whose sum is a multiple of 3, the event that the difference of the two numbers is 3 contains 2 outcomes, giving probability 1/6.
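Example 2.4 can be checked by enumerating all 36 equally likely outcomes (an illustrative Python sketch, not from the book):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of tossing two fair dice.
S = list(product(range(1, 7), repeat=2))
A = [(i, j) for i, j in S if abs(i - j) == 3]   # difference of numbers is 3
B = [(i, j) for i, j in S if (i + j) % 3 == 0]  # sum is a multiple of 3

A_and_B = [o for o in A if o in B]
P_A_given_B = Fraction(len(A_and_B), len(B))    # counts suffice: outcomes equally likely
print(len(A), len(B), len(A_and_B))             # 6 12 2
print("P(A | B) =", P_A_given_B)                # 1/6
```

Because all outcomes are equally likely, the conditional probability reduces to a ratio of counts, |A ∩ B| / |B| = 2/12 = 1/6, matching the solution above.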
2.3.2 Independence
Sometimes the knowledge of an event B gives us no more information about A than we already had. In such a situation the event A is called independent of the event B if P(A | B) = P(A). In this case, we have the following axiom:
Axiom P(A ∩ B) = P(A | B)P(B) = P(A)P(B).
Since P(A ∩ B) can also be expressed as P(B | A) P(A), this shows that B is independent of A. Thus A and B are mutually independent.
Independent events are not the same as disjoint events. In fact, there is a strong dependence between disjoint events: if A ∩ B = Ø and A occurs, then B cannot occur. Similarly, if B occurs, then A cannot occur.
The advantage of the above axiom is that it treats the events symmetrically and will be easier to generalize to more than two events. Many gambling games provide models of independent events. The spins of a roulette wheel and the tosses of a pair of dice are both series of independent events.
Example 2.5 Independence from Tossing One Die. Suppose that a fair die is to be rolled. Let the event A be that the number is even and the event B that the number is greater than or equal to 3. Is event A independent of B?
Here A = {2, 4, 6} and B = {3, 4, 5, 6}. Thus P(A) = ½ and P(B) = ⅔ since S has six outcomes all equally likely. Given the event A, we know that it is 2, 4, or 6, and two of these three possibilities are greater than or equal to 3. Therefore, the probability the number is greater than or equal to 3 given that it is even is ⅔:
P(B | A) = P(A ∩ B)/P(A) = (2/6)/(3/6) = 2/3 = P(B)

These events are independent of each other. Thus the probability that the number is greater than or equal to 3 is the same whether or not we know that the number is even.
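Independence in Example 2.5 can likewise be verified by enumeration; a small Python sketch checking both P(B | A) = P(B) and the product rule:

```python
from fractions import Fraction

S = set(range(1, 7))      # sample space of one fair die
A = {2, 4, 6}             # number is even
B = {3, 4, 5, 6}          # number is >= 3

def p(E):
    # Probability of an event under equally likely outcomes.
    return Fraction(len(E), len(S))

p_B_given_A = Fraction(len(A & B), len(A))
print(p_B_given_A, p(B))          # both 2/3
print(p(A & B) == p(A) * p(B))    # True: the product rule holds
```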
2.3.3 Bayes’s Theorem
For any two events A and B in a sample space, we can write
P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A)

This leads to a simple formula:
P(B | A) = P(A | B)P(B)/P(A)

This is known as Bayes’s theorem or the inverse probability law. It forms the basis of a statistical methodology called Bayesian statistics. In Bayesian statistics, P(B) is called the prior probability of B, which refers to the probability of B prior to the knowledge of the occurrence of A. We call P(B | A) the posterior probability of B, which refers to the probability of B after observing A. Thus Bayes’s theorem can be viewed as a way of updating the probability of B in light of the knowledge about A.
We can write the above formula in a more useful form.
Bayes’s rule: Let B1, B2, ... be a partition of the sample space. In other words, B1, B2, ... are mutually disjoint and B1 ∪ B2 ∪ · · · = S. Let A be any event. Then, for each i = 1, 2, ...,
P(Bi | A) = P(A | Bi)P(Bi)/[P(A | B1)P(B1) + P(A | B2)P(B2) + · · ·]

This result can be used to update the prior probabilities of mutually exclusive events B1, B2, ... in light of the new information that A has occurred. The following example illustrates an interesting application of Bayes’s theorem.
Example 2.6 Bayes’s Rule. A novel biomarker (or a combination of biomarkers) diagnosis assay is 95% effective in detecting a certain disease when it is present. The test also yields 1% false-positive result. If 0.5% of the population has the disease, what is the probability a person with a positive test result actually has the disease?
Solution Let A = {a person’s biomarker assay test result is positive} and B = {a person has the disease}. Then P(B) = 0.005, P(A | B) = 0.95, and P(A | Bc) = 0.01:
P(B | A) = P(A | B)P(B)/[P(A | B)P(B) + P(A | Bc)P(Bc)] = (0.95)(0.005)/[(0.95)(0.005) + (0.01)(0.995)] = 0.00475/0.0147 ≈ 0.323

Thus, the probability that a person really has the disease is only 32.3%!
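The posterior probability in Example 2.6 is easy to reproduce numerically; a minimal sketch in Python (the book's own code examples are in R):

```python
# Bayes' rule for the biomarker assay of Example 2.6.
sens = 0.95    # P(positive | disease), the assay's sensitivity
fpr = 0.01     # P(positive | no disease), the false-positive rate
prev = 0.005   # P(disease), the prior (prevalence)

# Total probability of a positive result, over diseased and healthy people.
p_pos = sens * prev + fpr * (1 - prev)

# Posterior probability of disease given a positive result.
posterior = sens * prev / p_pos
print(round(posterior, 3))  # 0.323
```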
In Example 2.6, how can we improve the odds of detecting real positives? One way is to use multiple independent diagnostic assays, each of which alone yields the 32.3% posterior probability computed above. Then the probability that a person with positive results from all three assays has the disease will be
P(B | all three assays positive) = (0.95)³(0.005)/[(0.95)³(0.005) + (0.01)³(0.995)] ≈ 0.9998

2.4 RANDOM VARIABLES
A random variable (abbreviated as r.v.) associates a unique numerical value with each outcome in the sample space. Formally, a r.v. is a real-valued function from a sample space S into the real numbers. We denote a r.v. by an uppercase letter (e.g., X or Y) and a particular value taken by a r.v. by the corresponding lowercase letter (e.g., x or y).
Example 2.7 Random Variables. Here are some examples of random variables:
1. Run an assay test. Then a r.v. is X = 1 if the test result is positive and X = 0 if it is negative.
2. Toss two dice. Then a r.v. is X = sum of the numbers on the dice.
3. Observe a transistor until it fails. Then a r.v. is X = lifetime of the transistor in days.
A r.v. X may be discrete or continuous. The r.v.’s in the first two examples above are discrete, while the third one is continuous.
2.4.1 Discrete Random Variables
A r.v. is discrete if the number of possible values it can take is finite (e.g., {0, 1}) or countably infinite (e.g., all nonnegative integers {0, 1, 2, ...}). Thus the possible values of a discrete r.v. can be listed as x1, x2, .... Suppose that we can calculate P(X = x) for every value x. The collection of these probabilities can be viewed as a function of x. The probability mass function (p.m.f.) of a discrete r.v. X is given by
f(x) = P(X = x)

Example 2.8 Distribution of Discrete r.v.’s from Tossing Two Dice. Assume you toss two dice. What is the distribution of the sum?
Solution Let X be the sum of the numbers on two fair tossed dice. The p.m.f. of X can be derived by listing all 36 possible outcomes, which are equally likely, and counting the outcomes that result in X = x for x = 2, 3,..., 12. Then
f(x) = P(X = x) = (number of outcomes with sum x)/36,  x = 2, 3, ..., 12

For example, there are 4 outcomes that result in X = 5: (1, 4), (2, 3), (3, 2), (4, 1). Therefore,

P(X = 5) = 4/36 = 1/9

The p.m.f. is tabulated in Table 2.1 and plotted in Figure 2.1.

TABLE 2.1 The p.m.f. of X, Sum of Numbers on Two Fair Dice

x      2     3     4     5     6     7     8     9     10    11    12
f(x)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Figure 2.1 Histogram of X, the sum of the numbers on two fair dice. [figure omitted]
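The p.m.f. of Table 2.1 can be tabulated programmatically by counting outcomes; a minimal Python sketch (the book's examples otherwise use R):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 outcomes give each possible sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Convert counts to the p.m.f. f(x) = (number of outcomes with sum x)/36.
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}
for x, fx in pmf.items():
    print(x, fx)
print(pmf[5])  # 1/9, from the four outcomes (1,4), (2,3), (3,2), (4,1)
```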
2.4.2 Continuous Random Variables
A r.v. X is continuous if it can take any value from one or more intervals of real numbers. We cannot use a p.m.f. to describe the probability distribution of X, because its possible values are uncountably infinite. Instead we use a new notion, the probability density function (p.d.f.) f(x), such that areas under the f(x) curve represent probabilities.
The p.d.f. of a continuous r.v. X is the function that satisfies
FX(x) = ∫−∞^x f(t) dt for all x

where FX(x) is the cumulative distribution function (c.d.f.) of a r.v. X defined by
FX(x) = P(X ≤ x)

Example 2.9 Probability Calculation from Exponential Distribution. The simplest distribution used to model the times to failure (lifetimes) of items or survival times of patients is the exponential distribution. The p.d.f. of the exponential distribution is given by
f(x) = λe^(−λx),  x ≥ 0

where λ is the failure rate. Suppose a certain type of computer chip has a failure rate of once every 15 years (λ = 1/15), and the time to failure is exponentially distributed. What is the probability that a chip would last 5–10 years?
Solution Let X be the lifetime of a chip. The desired probability is
P(5 ≤ X ≤ 10) = FX(10) − FX(5) = (1 − e^(−10/15)) − (1 − e^(−5/15)) = 0.4866 − 0.2835 = 0.2031

A simple R code example is:
R command: pexp(10, 1/15) - pexp(5, 1/15)
Output: 0.2031
2.5 EXPECTED VALUE AND VARIANCE
The p.m.f. or p.d.f. completely describes the probabilistic behavior of a r.v. However, certain numerical measures computed from the distribution provide useful summaries. Two common measures that summarize the behavior of a r.v., called parameters, are its expected value (or mean) and its variance. The expected value of a r.v. is its average value weighted according to the probability distribution, and it is a measure of the center of the distribution: the long-run average outcome over numerous repetitions of the experiment. The variance gives a measure of the degree of spread of a distribution around its mean.
2.5.1 Expected Value
The expected value or the mean of a discrete r.v. X, denoted by E(X), μX, or simply μ, is defined as
E(X) = Σi xi P(X = xi)

This is a sum of the possible values x1, x2, ..., taken by the r.v. X, weighted by their probabilities.
The expected value of a continuous r.v. X is defined as
E(X) = ∫−∞^∞ x f(x) dx

E(X) can be thought of as the center of gravity of the distribution of X.
If the probability distribution of a r.v. X is known, then the expected value of a function of X, say g(X) [e.g., g(X) = X²], equals
E[g(X)] = Σi g(xi) P(X = xi) in the discrete case, or E[g(X)] = ∫−∞^∞ g(x) f(x) dx in the continuous case,

provided that the sum or integral exists. If E|g(X)| = ∞, we say that E[g(X)] does not exist.
Example 2.10 Expectation of Discrete Random Variable. Suppose two dice are tossed (Example 2.8). Let X1 be the sum of the two dice and X2 and X3 be the values of the first and second tosses, respectively. What are the expectations of X1 and of X2 + X3?
Solution
E(X2) = E(X3) = (1 + 2 + · · · + 6)/6 = 3.5, so E(X2 + X3) = 3.5 + 3.5 = 7; direct calculation from the p.m.f. of Table 2.1 likewise gives E(X1) = 7.

Example 2.11 Expectation of Continuous Random Variable. Let the density of X be f(x) = 1/2 for 0 < x < 2. What is the expectation of X?
Solution
E(X) = ∫0^2 x(1/2) dx = [x²/4]0^2 = 1

REMARK Changing the order of summation (or subtraction) and expectation does not affect the result; that is, E(X + Y) = E(X) + E(Y) and E(X − Y) = E(X) − E(Y).
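Both forms of the definition can be checked numerically; a Python sketch computing E(X1) for the two-dice sum of Example 2.10 and approximating the integral of Example 2.11 with a midpoint Riemann sum:

```python
from fractions import Fraction
from itertools import product

# Discrete case: E(X1) for the sum of two fair dice, directly from the definition.
E_sum = sum(Fraction(a + b, 36) for a, b in product(range(1, 7), repeat=2))
E_one = Fraction(sum(range(1, 7)), 6)   # E(X2) = E(X3) = 7/2
print(E_sum, E_one + E_one)             # both equal 7

# Continuous case: E(X) = integral of x * f(x) with f(x) = 1/2 on (0, 2),
# approximated by a midpoint Riemann sum.
n = 100_000
h = 2 / n
E_cont = sum(((i + 0.5) * h) * 0.5 * h for i in range(n))
print(round(E_cont, 6))                 # 1.0
```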
2.5.2 Variance and Standard Deviation
The variance of a r.v. X, denoted by Var(X), c02_image030.jpg , or simply σ², is defined as
Var(X) = E[(X − μ)²]

The variance is the sum of the squared differences between each possible value and the mean μ, weighted by their probabilities; it is a measure of the dispersion of a r.v. about its mean.
The variance of a constant is zero.
An alternative expression for Var(X ) is given by
Var(X) = E(X²) − μ²

The standard deviation (SD) is the positive square root of the variance:
SD(X) = σ = +√Var(X)

A small variance means that X is very likely to be close to E(X), while a larger variance means that X is more spread out. If Var(X) = 0, then X is equal to E(X) with probability 1, and there is no variation in X. The standard deviation has the same qualitative interpretation but is easier to interpret, because it is measured in the same unit as the original variable X; the variance is measured in the square of that unit.
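The defining formula for the variance and the alternative formula Var(X) = E(X²) − μ² can be checked against each other; a Python sketch using the two-dice sum from Example 2.8:

```python
from fractions import Fraction
from itertools import product

# All 36 values of X = sum of the numbers on two fair dice.
values = [a + b for a, b in product(range(1, 7), repeat=2)]

mu = Fraction(sum(values), 36)                                 # E(X) = 7
var_def = sum((x - mu) ** 2 for x in values) / 36              # E[(X - mu)^2]
var_alt = Fraction(sum(x * x for x in values), 36) - mu ** 2   # E(X^2) - mu^2
print(mu, var_def, var_alt)  # 7 35/6 35/6
```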
Example 2.12 Variance from Tossing Two Dice. Let X be the sum of the values from tossing two dice. What is the variance of X?
Solution
Var(X) = E(X²) − [E(X)]² = 329/6 − 7² = 35/6 ≈ 5.83

2.5.3 Covariance and Correlation
In earlier sections we discussed independence as a relationship between two r.v.’s. If there is a relationship, it may be strong or weak. In this section we discuss two numerical measures of the strength of a relationship between two r.v.’s: the covariance and the correlation.
For instance, consider an experiment in which r.v.’s X and Y are measured, where X is the weight of a sample of water and Y is the volume of the same sample. Clearly there is a strong relationship between X and Y. If (X, Y) pairs with such a strong relationship are measured on many samples and plotted, the points should fall on a straight line. Now consider another experiment where X is a person’s body weight and Y is the same person’s height. These variables are clearly related, but the relationship is not nearly as strong, and we would not expect the observed data points to fall on a straight line. The covariance and correlation quantify the strength of a relationship between two r.v.’s.
The covariance of two r.v.’s, X and Y, measures the joint dispersion from their respective means, given by
Cov(X, Y) = E[(X − μX)(Y − μY)] = E(XY) − E(X)E(Y)

Note that Cov(X, Y) can be positive or negative. Positive covariance implies that large (small) values of X are associated with large (small) values of Y; negative covariance implies that large (small) values of X are associated with small (large) values of Y.
If X and Y are independent, then E(XY) = E(X)E(Y) and hence Cov(X, Y) = 0. However, the converse is not true in general: X and Y may be dependent and yet have zero covariance (an important exception is the bivariate normal distribution, for which zero covariance does imply independence). This is illustrated by the following example.
Example 2.13 Example of Dependence with Zero Covariance. Define Y in terms of X such that
c02_image036.jpg

Are X and Y independent?
Solution Obviously, Y depends on X; however, Cov(X, Y) = 0, which can be verified as follows:
c02_image037.jpg

Hence, Cov(X, Y) = E(XY) − E(X)E(Y) = 0.
Example 2.14 Covariance from Probability and Statistics Grades. The joint distribution of the probability and statistics grades, X and Y, is given later in Table 2.6 and their marginal distributions are given also later in Table 2.7. What is the covariance of X and Y?
Solution
c02_image038.jpgThen,
c02_image039.jpg

To judge the extent of dependence or association between two r.v.’s, we need to standardize their covariance. The correlation (or correlation coefficient) is simply the covariance standardized so that its range is [−1, 1]. The correlation between X and Y is defined by
ρXY = Cov(X, Y)/(σX σY)

Note that ρXY is a unitless quantity, while Cov(X, Y) is not. It follows from the covariance relationship that if X and Y are independent, then ρXY = 0; however, ρXY = 0 does not imply that they are independent.
Example 2.15 Correlation Coefficient from Probability and Statistics Grades. We saw that the covariance between the probability and statistics grades is 0.352 in Example 2.14. Although this tells us that two grades are positively associated, it does not tell us the strength of linear association, since the covariance is not a standardized measure. For this purpose, we calculate the correlation coefficient. Check that Var(X) = 0.858 and Var(Y) = 0.678. Then
ρXY = 0.352/√((0.858)(0.678)) = 0.352/0.763 ≈ 0.462

This correlation is not very close to 1, which implies that there is not a strong linear relationship between X and Y.
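The standardization in Example 2.15 is a one-line computation; a Python sketch using the covariance and variances quoted there:

```python
import math

# Values taken from Examples 2.14 and 2.15.
cov_xy = 0.352             # Cov(X, Y)
var_x, var_y = 0.858, 0.678

# Correlation = covariance divided by the product of the standard deviations.
rho = cov_xy / math.sqrt(var_x * var_y)
print(round(rho, 3))  # 0.462
```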
2.6 DISTRIBUTIONS OF RANDOM VARIABLES
In this section we describe the six discrete probability distributions and five continuous probability distributions that occur most frequently in bioinformatics and computational biology; these are called univariate models. In the last three sections, we discuss probability models that involve more than one random variable, called multivariate models.
TABLE 2.2 R Function Names and Parameters for Standard Probability Distributions
Table 2.2 describes the R function names and parameters for a number of standard probability distributions for general use. The first letter of the function name indicates which of the probability functions it describes. For example:
dnorm is the density function (p.d.f.) of the normal distribution
pnorm is the cumulative distribution function (c.d.f.) of the normal distribution
qnorm is the quantile function of the normal distribution
These functions can be used to replace statistical tables. For example, the 5% critical value for a (two-sided) t test on 11 degrees of freedom is given by qt(0.975, 11), and the P value