Statistical Bioinformatics: For Biomedical and Life Science Researchers

About this ebook

This book provides an essential understanding of statistical concepts necessary for the analysis of genomic and proteomic data using computational techniques. The author presents both basic and advanced topics, focusing on those that are relevant to the computational analysis of large data sets in biology. Chapters begin with a description of a statistical concept and a current example from biomedical research, followed by more detailed presentation, discussion of limitations, and problems. The book starts with an introduction to probability and statistics for genome-wide data, and moves into topics such as clustering, classification, multi-dimensional visualization, experimental design, statistical resampling, and statistical network analysis.
  • Clearly explains the use of bioinformatics tools in life sciences research without requiring an advanced background in math/statistics
  • Enables biomedical and life sciences researchers to successfully evaluate the validity of their results and make inferences
  • Enables statistical and quantitative researchers to rapidly learn novel statistical concepts and techniques appropriate for large biological data analysis
  • Carefully revisits frequently used statistical approaches and highlights their limitations in large biological data analysis
  • Offers programming examples and datasets
  • Includes chapter problem sets, a glossary, a list of statistical notations, and appendices with references to background mathematical and technical material
  • Features supplementary materials, including datasets, links, and a statistical package available online

Statistical Bioinformatics is an ideal textbook for students in medicine, life sciences, and bioengineering, and for researchers who utilize computational tools for the analysis of genomic, proteomic, and many other emerging high-throughput molecular data. It may also serve as a rapid introduction to bioinformatics for statistical and computational students and audiences who have not experienced such analysis tasks before.

Language: English
Publisher: Wiley
Release date: Sep 20, 2011
ISBN: 9781118211526


    Statistical Bioinformatics - Jae K. Lee

    CHAPTER 1

    ROAD TO STATISTICAL BIOINFORMATICS

    Jae K. Lee

    Department of Public Health Science, University of Virginia,

    Charlottesville, Virginia, USA

    There has been a great explosion of biological data and information in recent years, largely due to advances in high-throughput biotechnologies such as genome-wide SNP profiling, RNA gene expression microarrays, protein mass spectrometry, high-throughput sequencing, and many other recent techniques (Weinstein et al., 2002). Furthermore, powerful computing systems and fast Internet connections to large worldwide biological databases enable individual laboratory researchers to easily access an unprecedented amount of biological data. Such enormous data are often overwhelming: it is difficult to digest them and to extract the information most relevant to each researcher’s investigation goals. In fact, these large biological data are information rich and often contain much more information than the researchers who generated them may have anticipated. This is why many major biomedical research institutes have made significant efforts to freely share such data with the general research community. Bioinformatics is the emerging science field concerned with the development of analysis methods and tools for investigating such large biological data efficiently and rigorously. This development requires many different components: powerful computer systems to archive and process such data, effective database designs to extract and integrate information from various heterogeneous biological databases, and efficient analysis techniques to investigate these large databases. In particular, analysis of these massive biological data is extremely challenging for the following reasons.

    CHALLENGE 1: MULTIPLE-COMPARISONS ISSUE

    Analysis techniques for high-throughput biological data must carefully handle an astronomical number of candidate targets and possible mechanisms, most of which are false positives (Tusher et al., 2001). For example, a traditional statistical testing criterion that allows a 5% false-positive error rate (or significance level) would identify ~500 false positives from 10K microarray data between two biological conditions of interest even if no truly differentially regulated genes exist between the two. If a small number of genes, say 100, are actually differentially regulated, such real differential expression patterns will be mixed with the above 500 false positives, with no a priori information to discriminate the true positives from the false positives. Confidence in the 600 targets identified by such statistical testing will therefore not be high. Simply tightening the statistical criterion will result in a high false-negative error rate, so that many important real biological targets are missed. This pitfall, the so-called multiple-comparisons issue, becomes even more serious when biological mechanisms such as signal transduction and regulation pathways that involve multiple targets are searched from such data; the number of candidate pathway mechanisms grows exponentially, for example, 10! for 10-gene sequential pathway mechanisms. Thus, no matter how powerful the computer system, it is prohibitive to tackle such problems by exhaustive computational search and comparison. Many current biological problems have been theoretically proven to be NP-hard (NP: nondeterministic polynomial time) in computer science, implying that no polynomial-time algorithm is known that can search all possible solutions once the number of biological targets involved in a solution becomes large. More importantly, such exhaustive search is simply prone to discovering numerous false positives. In fact, this is one of the most difficult challenges in investigating current large biological databases and is why heuristic algorithms that tightly control the false-positive error rate while investigating only a small portion of all possible solutions are often sought for many biological problems. Thus, the success of many bioinformatics studies critically depends on the construction and use of effective and efficient heuristic algorithms, most of which are based on probabilistic modeling and statistical inference techniques that maximize the statistical power of identifying true positives while rigorously controlling the false-positive error rate.
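    As a rough illustration of this pitfall (a sketch, not from the book; the gene count, group sizes, and random seed are arbitrary), the following R code simulates a 10,000-gene experiment in which no gene is truly differentially expressed and counts how many genes an unadjusted 5% significance level flags by chance, before and after standard multiplicity adjustments:

        ## pure-noise "expression" data: 10,000 genes, 5 samples per condition
        set.seed(1)
        n.genes <- 10000
        group1 <- matrix(rnorm(n.genes * 5), nrow = n.genes)
        group2 <- matrix(rnorm(n.genes * 5), nrow = n.genes)

        ## gene-wise two-sample t-test p-values
        p.values <- sapply(1:n.genes, function(i)
          t.test(group1[i, ], group2[i, ])$p.value)

        sum(p.values < 0.05)                          # ~500 false positives
        sum(p.adjust(p.values, "bonferroni") < 0.05)  # ~0 after Bonferroni adjustment
        sum(p.adjust(p.values, "BH") < 0.05)          # ~0 after FDR (Benjamini-Hochberg) control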

    CHALLENGE 2: HIGH-DIMENSIONAL BIOLOGICAL DATA

    The second challenge is the high-dimensional nature of biological data in many bioinformatics studies. When biological data are simultaneously generated for many gene targets, the data points become dramatically sparse in the corresponding high-dimensional data space. It is well known that mathematical and computational approaches often fail to capture such high-dimensional phenomena accurately (Tamayo et al., 1999). For example, many statistical algorithms cannot easily move between local maxima in a high-dimensional space. Also, inference that combines several disjoint lower dimensional views may not provide a correct understanding of the real phenomena in their joint, high-dimensional space. It is therefore important to understand statistical dimension reduction techniques that can reduce high-dimensional data problems to lower dimensional ones while preserving the important variation of interest in the biological data.
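    A small numerical sketch in R (not from the book; the point counts and dimensions are arbitrary) of this sparsity effect: as the dimension p grows, the nearest and farthest neighbors of a data point become nearly equidistant, so local structure becomes increasingly hard to detect.

        ## relative contrast between farthest and nearest neighbor of one point
        set.seed(1)
        contrast.ratio <- function(p, n = 100) {
          x <- matrix(runif(n * p), nrow = n)   # n random points in the unit cube [0, 1]^p
          d <- as.matrix(dist(x))[1, -1]        # Euclidean distances from the first point
          (max(d) - min(d)) / min(d)            # large when "near" and "far" still differ
        }
        sapply(c(2, 10, 100, 1000), contrast.ratio)   # shrinks toward 0 as p grows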

    CHALLENGE 3: SMALL-n AND LARGE-p PROBLEM

    The third challenge is the so-called "small-n and large-p" problem. Conventional statistical methods achieve their desired performance when the sample size n of the data, the number of independent observations, is much larger than the number of parameters p that need to be estimated by statistical inference (Jain et al., 2003). In many bioinformatics problems, this situation is completely reversed. For example, in a microarray study, tens of thousands of gene transcripts’ expression patterns may become candidate prediction factors for a biological phenomenon of interest (e.g., tumor sensitivity vs. resistance to a chemotherapeutic compound), but the number of independent observations (e.g., different patient biopsy samples) is often at most a few tens or smaller. Due to experimental costs and limited biological materials, the number of independent replicated samples can sometimes be extremely small, for example, two or three, or replicates may be unavailable altogether. In these cases, most traditional statistical approaches perform very poorly. Thus, it is also important to select statistical analysis tools that can provide both high specificity and high sensitivity under these circumstances.
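    A brief R sketch (not from the book; the numbers of genes and replicates are arbitrary) of why small n is so damaging: with only three replicates per gene, gene-wise standard deviation estimates vary several-fold even when every gene truly has the same variability, so genes with accidentally tiny variance estimates produce huge, spurious t-statistics.

        ## 5,000 genes, each measured in only n = 3 replicates, all with true SD = 1
        set.seed(1)
        x <- matrix(rnorm(5000 * 3), nrow = 5000)
        gene.sd <- apply(x, 1, sd)                       # gene-wise SD estimates
        round(quantile(gene.sd, c(0.05, 0.5, 0.95)), 2)
        ## the 5th and 95th percentiles differ several-fold, which motivates
        ## variance-shrinkage approaches such as the local-pooled-error test
        ## (Jain et al., 2003) cited above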

    CHALLENGE 4: NOISY HIGH-THROUGHPUT BIOLOGICAL DATA

    The fourth challenge is due to the fact that high-throughput biotechnical data and large biological databases are inevitably noisy, because the biological signals of interest are observed together with many random or systematically biased factors that may obscure them (Cho and Lee, 2004). Therefore, investigations of large biological data cannot be successfully performed unless rigorous statistical algorithms are developed and effectively utilized to reduce and decompose the various sources of error. Also, careful assessment and quality control of the initial data sets are critical for all subsequent bioinformatics analyses.

    CHALLENGE 5: INTEGRATION OF MULTIPLE, HETEROGENEOUS BIOLOGICAL DATA INFORMATION

    The last challenge is the integration of information from multiple heterogeneous biological and clinical data sets, such as large gene functional and annotation databases, biological subjects’ phenotypes, and patient clinical information. One of the main goals in performing high-throughput biological experiments is to identify the critical biological targets and mechanisms most highly associated with biological subjects’ phenotypes, such as patients’ prognosis and therapeutic response (Pittman et al., 2004). In these cases, multiple large heterogeneous datasets need to be combined in order to discover the most relevant molecular targets. This requires combining datasets with very different characteristics and formats, some of which cannot easily be integrated by standard statistical inference techniques, for example, genomic and proteomic expression data and pathway mechanisms reported in the literature. It will be extremely important to develop and use efficient yet rigorous analysis tools for integrative inference on such complex biological data, beyond the individual researcher’s manual and subjective integration.

    In this book, we introduce the statistical concepts and techniques that can overcome these challenges in studying various large biological datasets. Researchers with biological or biomedical backgrounds may not be able, or may not need, to learn advanced mathematical and statistical techniques beyond an intuitive understanding of such topics for their practical applications. Thus, we have organized this book so that life science researchers can efficiently learn the statistical concepts and techniques most relevant to their specific biological problems. We believe that this composition will help nonstatistical researchers minimize unnecessary effort in learning statistical topics that are less relevant to their specific biological questions, yet help them learn and utilize rigorous statistical methods directly relevant to those problems. Thus, while this book can serve as a general reference for various concepts and methods in statistical bioinformatics, it is also designed to be used effectively as a textbook for a semester-length or shorter course, as outlined below. In particular, the chapters are divided into four blocks addressing different statistical issues in analyzing large biological datasets (Fig. 1.1):

    I. Statistical Foundation: Probability theories (Chapter 2), statistical quality control (Chapter 3), statistical tests (Chapter 4)

    II. High-Dimensional Analysis: Clustering analysis (Chapter 5), classification analysis (Chapter 6), multidimensional visualization (Chapter 7)

    III. Advanced Analysis Topics: Statistical modeling (Chapter 8), experimental design (Chapter 9), statistical resampling methods (Chapter 10)

    IV. Multigene Analysis in Systems Biology: Genetic network analysis (Chapter 11), genetic association analysis (Chapter 12), R Bioconductor tools in systems biology (Chapter 13)

    Figure 1.1 Possible course structure.

    c01_image001.jpg

    The first block of chapters will be important, especially for students who do not have a strong statistical background. These chapters will provide general backgrounds and terminologies to initiate rigorous statistical analysis on large biological datasets and to understand more advanced analysis topics later. Students with a good statistical understanding may also quickly review these chapters since there are certain key concepts and techniques (especially in Chapters 3 and 4) that are relatively new and specialized for analyzing large biological datasets.

    The second block consists of analysis topics frequently used in investigating high-dimensional biological data. In particular, clustering and classification techniques are by far the most commonly used in practical applications of high-throughput data analysis. The various multidimensional visualization tools discussed in Chapter 7 will also be quite handy in such investigations.

    The third block deals with more advanced topics in large biological data analysis, including advanced statistical modeling for complex biological problems, statistical resampling techniques that can be conveniently used with the combination of classification (Chapter 6) and statistical modeling (Chapter 8) techniques, and experimental design issues in high-throughput microarray studies.

    The final block contains a concise description of analysis topics in several active research areas of multigene network and genetic association analysis, as well as the R Bioconductor software for systems biology analysis. These will be quite useful for performing challenging gene network and multigene investigations in the fast-growing field of systems biology.

    These four blocks of chapters can be followed in the current order for a full semester-length course. However, beyond the first block, the remaining three blocks are relatively independent of each other and can be covered (or skipped, depending on specific needs and time constraints) in any order, as depicted in Figure 1.1. We hope that life science researchers who need to deal with challenging analysis issues in overwhelmingly large biological data can effectively meet their learning goals in this way.

    REFERENCES

    Cho, H., and Lee, J. K. (2004). Bayesian hierarchical error model for analysis of gene expression data. Bioinformatics, 20(13): 2016–2025.

    Jain, N., et al. (2003). Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics, 19(15): 1945–1951.

    Pittman, J., et al. (2004). Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proc. Natl. Acad. Sci. U.S.A., 101(22): 8431–8436.

    Tamayo, P., et al. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A., 96(6): 2907–2912.

    Tusher, V. G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A., 98(9): 5116–5121.

    Weinstein, J. N., et al. (2002). The bioinformatics of microarray gene expression profiling. Cytometry, 47(1): 46–49.

    CHAPTER 2

    PROBABILITY CONCEPTS AND DISTRIBUTIONS FOR ANALYZING LARGE BIOLOGICAL DATA

    Sooyoung Cheon

    KU Industry-Academy Cooperation Group Team of Economics and Statistics, Korea University, Jochiwon 339–700, Korea

    2.1 INTRODUCTION

    In general, results may vary when measurements are taken. Because of this variability, many decisions have to be made under uncertainty. For example, in medical research, interest may center on the effectiveness of a new vaccine for mumps or the estimation of the cure rate of a treatment for breast cancer. In these situations, a probabilistic foundation can help people make decisions in the face of uncertainty.

    In general, we consider experiments whose outcomes cannot be predicted with certainty; that is, each experiment ends in an outcome that cannot be determined with certainty before the experiment is performed. Such experiments are called random experiments. Thus we consider probability in the study of randomness and uncertainty.

    A physician tells a patient with breast cancer that there is a 90% cure rate with treatment. A sportscaster gives a baseball team a 20% chance of winning the next World Series. These two situations represent two different approaches to probability, the frequentist and subjective approaches, respectively. The frequentist approach interprets a probability as a long-run relative frequency. The cure rate is based on the observed outcomes of a large number of treated cases. When an experiment is repeated a large number of times under identical conditions, a regular pattern may emerge: the relative frequencies with which different outcomes occur settle down to fixed proportions. These fixed proportions are defined as the probabilities of the corresponding outcomes. The subjective approach is a personal assessment of the chance that a given outcome will occur. The situations in this case are one-time phenomena that are not repeatable. The accuracy of such a probability depends on the information available about the particular situation and a person’s interpretation of that information. A probability can be assigned, but it may vary from person to person.

    This chapter reviews basic probability concepts and common distributions used for analyzing large biological data, in particular in bioinformatics. First, basic concepts and some laws and rules of probability are introduced. In many cases, we need to update our knowledge based on new information; for this we need the concept of conditional probability. General definitions of conditional probability and the Bayes theorem are introduced. The theorem is the basis of the statistical methodology called Bayesian statistics. Then random variables are discussed, along with the expected value and variance that are used to summarize their behavior. Six discrete and five continuous probability distributions are described that occur most frequently in bioinformatics and computational biology. We conclude with a description of the empirical distribution and sampling strategies related to resampling and bootstrapping techniques.

    2.2 BASIC CONCEPTS

    We begin with the notion of a random experiment, which is a procedure or an operation whose outcome is uncertain, and consider some aspects of events themselves before considering the probability theory associated with events.

    2.2.1 Sample Spaces and Events

    The collection of all possible outcomes of a particular experiment is called a sample space. We will use the letter S to denote a sample space. An event is any collection of possible outcomes of an experiment, that is, any subset of S. For example, if we plan to roll a die twice, the experiment is the actual rolling of the die two times, and whether a particular event has occurred can be determined only after the experiment is carried out. In Example 2.1 below, the first five cases have discrete sample spaces; the last two have continuous sample spaces.

    Example 2.1 Sample Space and Events from Random Experiments and Real Data. The following are some examples of random experiments and the associated sample spaces and events:

    1. Assume you toss a coin once. The sample space is S = {H, T}, where H = head and T = tail and the event of a head is {H}.

    2. Assume you toss a coin twice. The sample space is S = {(H, H), (H, T), (T, H), (T, T)}, and the event of obtaining exactly one head is {(H, T), (T, H)}.

    3. Assume you roll a single die. The sample space is S = {1,2, 3, 4, 5, 6}, and the event that the outcome is even is {2, 4, 6}.

    4. Assume you roll two dice. The sample space is S = {(1, 1), (1, 2),..., (6, 6)}. The event that the sum of the numbers on the two dice equals 5 is {(1, 4), (2, 3), (3, 2), (4, 1)}.

    5. Assume you count the number of defective welds in a car body. The sample space is S = {0, 1,..., N}, where N = total number of welds. The event that the number of defective welds is no more than two is {0, 1,2}.

    6. Assume you measure the relative humidity of air. The sample space is S = [0, 100]. The event that the relative humidity is at least 90% is [90, 100].

    7. Assume you observe the lifetime of a car battery. The sample space is S = [0, +∞). The event that the car battery fails before 12 months is [0, 12).

    Event algebra is a mathematical language to express relationships among events and combinations of events. Below are the most common terms and ones that will be used in the remainder of this chapter.

    Union The union of two events A and B, denoted by A ∪ B, is the event consisting of all outcomes that belong to A or B or both:

    A ∪ B = {x : x ∈ A or x ∈ B}

    Intersection The intersection of events A and B, denoted by A ∩ B or simply by AB, is the event consisting of all outcomes common to both A and B:

    A ∩ B = {x : x ∈ A and x ∈ B}

    Complementation The complement of an event A, denoted by Ac, is the event consisting of all outcomes not in A.

    Ac = {x ∈ S : x ∉ A}

    Disjoint If A and B have no outcomes in common, that is, A ∩ B = Ø, where Ø is an empty set, then A and B are called disjoint or mutually exclusive events.

    Example 2.2 Event Algebra from Random Experiments

    A = {Sum of two dice is a multiple of 3} = {3, 6, 9, 12}

    B = {Sum of two dice is a multiple of 4} = {4, 8, 12}

    C = {Sum of two dice is even} = {2, 4, 6, 8, 10, 12}

    D = {Sum of two dice is odd} = {3, 5, 7, 9, 11}

    1. The union of A and B is A ∪ B = {3, 4, 6, 8, 9, 12}.

    2. The intersection of A and B is A ∩ B = {12}.

    3. The complementation of A is Ac = {2, 4, 5, 7, 8, 10, 11}.

    4. Since C ∩ D = Ø, C and D are mutually exclusive events.

    2.2.2 Probability

    When an experiment is performed, an outcome from the sample space occurs. If the same experiment is performed a number of times, different outcomes may occur each time or some outcomes may repeat. The relative frequency of occurrence of an outcome in a large number of identical experiments can be thought of as a probability. More probable outcomes are those that occur more frequently. If the outcomes of an experiment can be described probabilistically, the data from an experiment can be analyzed statistically.

    A probability assigns a real number to each event. Let S be the sample space of an experiment. Each event A in S has a number, P(A), called the probability of A, which satisfies the following three axioms:

    Axiom 1 P(A) ≥ 0.

    Axiom 2 P(S) = 1.

    Axiom 3 If A and B are mutually exclusive events, then P(A ∪ B) = P(A) + P(B).

    These three axioms form the basis of all probability theory. Any function P that satisfies the axioms of probability is called a probability function. For any sample space many different probability functions can be defined. The following basic results can be derived using event algebra:

    P(Ac) = 1 – P(A).

    For any two events A and B, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

    For any two events A and B, P(A) = P(A ∩ B) + P(A ∩ Bc).

    If B is included in A, then A ∩ B = B.

    Therefore P(A) − P(B) = P(A ∩ Bc) and P(A) ≥ P(B).

    Example 2.3 Probabilities of Events from Random Experiments. The following are some examples of the probability of an outcome from experiments given in Example 2.1.

    1. P({H}) = ½.

    2. P({(T, T)}) = ¼.

    3. P({1, 2}) = 2/6 = 1/3.

    4. P({(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}) = 6/36 = 1/6.
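    The equally-likely-outcome probabilities above can be verified by direct enumeration; a small R sketch (not from the book):

        S <- expand.grid(die1 = 1:6, die2 = 1:6)   # the 36 outcomes of rolling two dice
        mean(S$die1 == S$die2)                     # P(both dice equal) = 6/36 = 1/6
        mean(1:6 %in% c(1, 2))                     # P({1, 2}) for a single die = 2/6 = 1/3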

    2.3 CONDITIONAL PROBABILITY AND INDEPENDENCE

    A sample space is generally defined and all probabilities are calculated with respect to that sample space. In many cases, however, we are in a position to update the sample space based on new information. For example, as in the fourth case of Example 2.3, if we know that the two outcomes from rolling a die twice are the same, the size of the sample space is reduced from 36 to 6. General definitions of conditional probability and independence are introduced below. The Bayes theorem is also introduced, which is the basis of a statistical methodology called Bayesian statistics.

    2.3.1 Conditional Probability

    Consider two events, A and B. Suppose that an event B has occurred. This occurrence may change the probability of A. We denote this by P(A | B), the conditional probability of event A given that B has occurred.

    As an example, suppose a card is randomly dealt from a well-shuffled deck of 52 cards. The probability of this card being a heart is 1/4, since there are 13 hearts in the deck. However, if you are told that the dealt card is red, then the probability of a heart changes to 1/2, because the sample space is reduced to just 26 cards. Likewise, if you are told that the dealt card is black, then P(Heart | Black) = 0.

    If A and B are events in S and P(B) > 0, then P(A | B) is called the conditional probability of A given B if the following axiom is satisfied:

    Axiom P(A | B) = P(A ∩ B)/P(B).

    Example 2.4 Conditional Probability from Tossing Two Dice. An experiment consists of tossing two fair dice with a sample space of 6 x 6 = 36 outcomes. Consider two events:

    A = {difference of the numbers on the dice is 3}

    B = {sum of the numbers on the dice is a multiple of 3}

    What is the conditional probability of A given B?

    Solution

    A = {(1,4), (2,5), (3,6), (4,1), (5,2), (6, 3)},

    B = {(1,2), (1,5), (2,1), (2,4), (3,3), (3,6), (4,2), (4,5), (5,1), (5,4), (6,3), (6,6)}

    and thus A ∩ B = {(3, 6), (6, 3)}.

    Thus B and A ∩ B consist of 12 and 2 outcomes, respectively. Assuming that all outcomes are equally likely, the conditional probability of A given B is

    P(A | B) = P(A ∩ B)/P(B) = (2/36)/(12/36) = 2/12 = 1/6

    That is, within the reduced sample space of 12 outcomes whose sum is a multiple of 3, the event that the difference of the two numbers is 3 contains 2 outcomes, so its conditional probability is 2/12 = 1/6.
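    The same answer can be obtained by enumerating the 36 outcomes in R (a sketch, not from the book):

        S <- expand.grid(die1 = 1:6, die2 = 1:6)
        A <- abs(S$die1 - S$die2) == 3             # difference of the numbers is 3
        B <- (S$die1 + S$die2) %% 3 == 0           # sum of the numbers is a multiple of 3
        sum(A & B) / sum(B)                        # P(A | B) = 2/12 = 1/6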

    2.3.2 Independence

    There are situations in which knowledge that an event B has occurred gives us no more information about A than we already had. In such a situation the event A is called independent of the event B, that is, P(A | B) = P(A). In this case, we have the following axiom:

    Axiom P(A ∩ B) = P(A | B)P(B) = P(A)P(B).

    Since P(A ∩ B) can also be expressed as P(B | A) P(A), this shows that B is independent of A. Thus A and B are mutually independent.

    Independent events are not the same as disjoint events. In fact, there is a strong dependence between disjoint events. If A ∩ B = Ø and if A occurs, then B cannot occur. Similarly, if B occurs, then A cannot occur.

    The advantage of the above axiom is that it treats the events symmetrically and will be easier to generalize to more than two events. Many gambling games provide models of independent events. The spins of a roulette wheel and the tosses of a pair of dice are both series of independent events.

    Example 2.5 Independence from Tossing One Die. Suppose that a fair die is to be rolled. Let the event A be that the number is even and the event B that the number is greater than or equal to 3. Is event A independent of B?

    Here A = {2, 4, 6} and B = {3, 4, 5, 6}. Thus P(A) = ½ and P(B) = ⅔ since S has six outcomes all equally likely. Given the event A, we know that it is 2, 4, or 6, and two of these three possibilities are greater than or equal to 3. Therefore, the probability the number is greater than or equal to 3 given that it is even is ⅔:

    P(B | A) = P(A ∩ B)/P(A) = (2/6)/(3/6) = 2/3 = P(B)

    These events are independent of each other. Thus the probability that it is greater than or equal to 3 is the same whether or not we know that the number is even.

    2.3.3 Bayes’s Theorem

    For any two events A and B in a sample space, we can write

    P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A)

    This leads to a simple formula:

    P(B | A) = P(A | B)P(B)/P(A)

    This is known as Bayes’s theorem or the inverse probability law. It forms the basis of a statistical methodology called Bayesian statistics. In Bayesian statistics, P(B) is called the prior probability of B, which refers to the probability of B prior to the knowledge of the occurrence of A. We call P(B | A) the posterior probability of B, which refers to the probability of B after observing A. Thus Bayes’s theorem can be viewed as a way of updating the probability of B in light of the knowledge about A.

    We can write the above formula in a more useful form.

    Bayes’s rule: Let B1, B2, ... be a partition of the sample space. In other words, B1, B2, ... are mutually disjoint and B1 ∪ B2 ∪ ··· = S. Let A be any set. Then, for each i = 1, 2, ...,

    P(Bi | A) = P(A | Bi)P(Bi) / Σ_{j} P(A | Bj)P(Bj)

    This result can be used to update the prior probabilities of mutually exclusive events B1, B2,... in light of the new information that A has occurred. The following example illustrates an interesting application of Bayes’s theorem.

    Example 2.6 Bayes’s Rule. A novel biomarker (or a combination of biomarkers) diagnosis assay is 95% effective in detecting a certain disease when it is present. The test also yields a 1% false-positive rate. If 0.5% of the population has the disease, what is the probability that a person with a positive test result actually has the disease?

    Solution Let A = {a person’s biomarker assay test result is positive} and B = {a person has the disease}. Then P(B) = 0.005, P(A | B) = 0.95, and P(A | Bc) = 0.01:

    P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | Bc)P(Bc)] = (0.95)(0.005) / [(0.95)(0.005) + (0.01)(0.995)] = 0.00475/0.01470 ≈ 0.323

    Thus, the probability that a person really has the disease is only 32.3%!
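    The calculation in Example 2.6 can be reproduced in a few lines of R (a sketch, not from the book):

        prior <- 0.005     # P(disease)
        sens  <- 0.95      # P(positive | disease)
        fpr   <- 0.01      # P(positive | no disease)
        sens * prior / (sens * prior + fpr * (1 - prior))   # posterior, about 0.323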

    In Example 2.6, how can we improve the odds of detecting real positives? We can improve by using multiple independent diagnosis assays, for example, A, B, and C, all with 32.3% detection probabilities. Then, the probability that a person with positive results from all three assays has the disease will be

    c02_image013.jpg

    2.4 RANDOM VARIABLES

    A random variable (abbreviated as r.v.) associates a unique numerical value with each outcome in the sample space. Formally, a r.v. is a real-valued function from a sample space S into the real numbers. We denote a r.v. by an uppercase letter (e.g., X or Y) and a particular value taken by a r.v. by the corresponding lowercase letter (e.g., x or y).

    Example 2.7 Random Variables. Here are some examples of random variables:

    1. Run an assay test. Then a r.v. is X = 1 if the assay result is positive and X = 0 otherwise.

    2. Toss two dice. Then a r.v. is X = sum of the numbers on the dice.

    3. Observe a transistor until it fails. Then a r.v. is X = lifetime of the transistor in days.

    A r.v. X may be discrete or continuous. The r.v.’s in the first two examples above are discrete, while the third one is continuous.

    2.4.1 Discrete Random Variables

    A r.v. is discrete if the number of possible values it can take is finite (e.g., {0, 1}) or countably infinite (e.g., all nonnegative integers {0, 1, 2, ...}). Thus the possible values of a discrete r.v. can be listed as x1, x2, .... Suppose that we can calculate P(X = x) for every value x. The collection of these probabilities can be viewed as a function of X. The probability mass function (p.m.f.) of a discrete r.v. X is given by

    f(x) = P(X = x) for each possible value x of X

    Example 2.8 Distribution of Discrete r.v.’s from Tossing Two Dice. Assume you toss two dice. What is the distribution of the sum?

    Solution Let X be the sum of the numbers on two fair tossed dice. The p.m.f. of X can be derived by listing all 36 possible outcomes, which are equally likely, and counting the outcomes that result in X = x for x = 2, 3,..., 12. Then

    P(X = x) = (number of outcomes whose sum is x)/36, x = 2, 3, ..., 12

    TABLE 2.1 The p.m.f. of X, Sum of Numbers on Two Fair Dice

    x         2     3     4     5     6     7     8     9     10    11    12
    P(X = x)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

    Figure 2.1 Histogram of X, the sum of the numbers on two fair dice.

    c02_image018.jpg

    For example, there are 4 outcomes that result in X = 5: (1, 4), (2, 3), (3, 2), (4, 1). Therefore,

    P(X = 5) = 4/36 = 1/9

    The p.m.f. is tabulated in Table 2.1 and plotted in Figure 2.1.
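    Table 2.1 and Figure 2.1 can be reproduced by enumeration in R (a sketch, not from the book):

        sums <- outer(1:6, 1:6, "+")                   # all 36 equally likely sums
        pmf <- table(sums) / 36                        # p.m.f. of X, the sum of two dice
        pmf["5"]                                       # P(X = 5) = 4/36
        barplot(pmf, xlab = "x", ylab = "P(X = x)")    # histogram as in Figure 2.1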

    2.4.2 Continuous Random Variables

    A r.v. X is continuous if it can take any value from one or more intervals of real numbers. We cannot use a p.m.f. to describe the probability distribution of X, because its possible values are uncountably infinite. We use a new notion called the probability density function (p.d.f.) such that areas under the f(x) curve represent probabilities.

    The p.d.f. of a continuous r.v. X is the function that satisfies

    FX(x) = ∫_{−∞}^{x} fX(t) dt for all x

    where FX(x) is the cumulative distribution function (c.d.f.) of a r.v. X defined by

    FX(x) = P(X ≤ x)

    Example 2.9 Probability Calculation from Exponential Distribution. The simplest distribution used to model the times to failure (lifetime) of items or survival times of patients is the exponential distribution. The p.d.f. of the exponential distribution is given by

    f(x) = λe^{−λx} for x ≥ 0, and f(x) = 0 for x < 0

    where λ is the failure rate. Suppose a certain type of computer chip has a failure rate of once every 15 years (λ = 1/15 per year), and the time to failure is exponentially distributed. What is the probability that a chip would last 5–10 years?

    Solution Let X be the lifetime of a chip. The desired probability is

    P(5 ≤ X ≤ 10) = ∫_{5}^{10} (1/15)e^{−x/15} dx = e^{−5/15} − e^{−10/15} = 0.7165 − 0.5134 ≈ 0.203

    A simple R code example is:

    R command: pexp(10, 1/15) - pexp(5, 1/15)

    Output: 0.4866 - 0.2835 = 0.2031

    2.5 EXPECTED VALUE AND VARIANCE

    The p.m.f. or p.d.f. completely describes the probabilistic behavior of a r.v. However, certain numerical measures computed from the distribution provide useful summaries. Two common measures that summarize the behavior of a r.v., called parameters, are its expected value (or mean) and its variance. The expected value of a r.v. is merely its average value weighted according to the probability distribution. The expected value of a distribution is a measure of center. By weighting the values of the r.v. according to the probability distribution, we are finding the number that is the average from numerous experiments. The variance gives a measure of the degree of spread of a distribution around its mean.

    2.5.1 Expected Value

    The expected value or the mean of a discrete r.v. X, denoted by E(X), μX, or simply μ, is defined as

    E(X) = Σ_{i} xi P(X = xi) = Σ_{i} xi f(xi)

    This is a sum of possible values, x1, x2,..., taken by the r.v. X weighted by their probabilities.

    The expected value of a continuous r.v. X is defined as

    E(X) = ∫_{−∞}^{∞} x f(x) dx

    E(X) can be thought of as the center of gravity of the distribution of X.

    If the probability distribution of a r.v. X is known, then the expected value of a function of X, say g(X) [e.g., g(X) = X²], equals

    E[g(X)] = Σ_{i} g(xi) f(xi) in the discrete case, or E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx in the continuous case,

    provided that the sum or integral exists. If E|g(X)| = ∞, we say that E[g(X)] does not exist.

    Example 2.10 Expectation of Discrete Random Variable. Suppose two dice are tossed (Example 2.8). Let X1 be the sum of the two dice and X2 and X3 be the values of the first and second tosses, respectively. What are the expectations of X1 and X2 + X3?

    Solution

    E(X1) = Σ_{x=2}^{12} x P(X1 = x) = 2(1/36) + 3(2/36) + ··· + 12(1/36) = 7, and E(X2 + X3) = E(X2) + E(X3) = 3.5 + 3.5 = 7

    Example 2.11 Expectation of Continuous Random Variable. Let the density of X be f(x) = 1/2 for 0 < x < 2. What is the expectation of X?

    Solution

    E(X) = ∫_{0}^{2} x (1/2) dx = [x²/4]_{0}^{2} = 1
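    Both expectations can be checked in R (a sketch, not from the book): the discrete one by enumeration and the continuous one by numerical integration.

        sums <- outer(1:6, 1:6, "+")                   # all 36 equally likely sums of two dice
        mean(sums)                                     # E(X1) = 7
        integrate(function(x) x * 1/2, 0, 2)$value     # E(X) = 1 for f(x) = 1/2 on (0, 2)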

    REMARK Changing the order of summation (or subtraction) and expectation does not affect the result: E(X ± Y) = E(X) ± E(Y). This is why E(X2 + X3) = E(X2) + E(X3) in Example 2.10.

    2.5.2 Variance and Standard Deviation

    The variance of a r.v. X, denoted by Var(X), σX², or simply σ², is defined as

    Var(X) = E[(X − μ)²], where μ = E(X)

    The variance is the sum of the squared differences between each possible value and the mean μ, weighted by their probabilities, and is a measure of the dispersion of a r.v. about its mean.

    The variance of a constant is zero.

    An alternative expression for Var(X ) is given by

    Var(X) = E(X²) − [E(X)]²

    The standard deviation (SD) is the positive square root of the variance:

    SD(X) = σX = √Var(X)

    The interpretation attached to the variance is that a small value of the variance means that all Xs are very likely to be close to E(X) and a larger value of variance means that all Xs are more spread out. If Var(X) = 0, then X is equal to E(X), with probability 1, and there is no variation in X. The standard deviation has the same qualitative interpretation. The standard deviation is easier to interpret in that the measurement unit on the standard deviation is the same as that for the original variable X. The measurement unit on the variance is the square of the original unit.

    Example 2.12 Variance from Tossing Two Dice. Let X be the sum of the values from tossing two dice. What is the variance of X?

    Solution

    Var(X) = E(X²) − [E(X)]² = 1974/36 − 7² = 35/6 ≈ 5.83
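    A quick R check of this variance from the p.m.f. in Table 2.1 (a sketch, not from the book):

        x <- 2:12
        p <- c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1) / 36   # p.m.f. of the sum of two dice
        mu <- sum(x * p)                               # E(X) = 7
        sum((x - mu)^2 * p)                            # Var(X) = 35/6, about 5.83
        sum(x^2 * p) - mu^2                            # same value via E(X^2) - [E(X)]^2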

    2.5.3 Covariance and Correlation

    In earlier sections, we discussed independence as one kind of relationship between two r.v.’s. If there is a relationship, it may be strong or weak. In this section we discuss two numerical measures of the strength of a relationship between two r.v.’s, the covariance and correlation.

    For instance, consider an experiment in which r.v.’s X and Y are measured, where X is the weight of a sample of water and Y is the volume of the same sample of water. Clearly there is a strong relationship between X and Y. If (X, Y) pairs with such a strong relationship are measured on many samples and plotted, they should fall on a straight line. Now consider another experiment where X is the body weight of a person and Y is the same person’s height. There is clearly a relationship, but it is not nearly as strong, and we would not expect the observed data points to fall on a straight line. These relationships are measured by the covariance and correlation, which quantify the strength of a relationship between two r.v.’s.

    The covariance of two r.v.’s, X and Y, measures the joint dispersion from their respective means, given by

    Cov(X, Y) = E[(X − μX)(Y − μY)] = E(XY) − E(X)E(Y)

    Note that Cov(X, Y) can be positive or negative. Positive covariance implies that large (small) values of X are associated with large (small) values of Y; negative covariance implies that large (small) values of X are associated with small (large) values of Y.

    If X and Y are independent, then E(XY) = E(X)E(Y) and hence Cov(X, Y) = 0. However, the converse is not true in general (a notable exception is the bivariate normal distribution, for which zero covariance does imply independence). In other words, X and Y may be dependent and yet their covariance may be zero. This is illustrated by the following example.

    Example 2.13 Example of Dependence with Zero Covariance. Define Y in terms of X such that

    c02_image036.jpg

    Are X and Y independent?

    Solution Obviously, Y depends on X; however, Cov(X, Y) = 0, which can be verified as follows:

    c02_image037.jpg

    Hence, Cov(X, Y) = E(XY) – E(X)E(Y) = 0.

    Example 2.14 Covariance from Probability and Statistics Grades. The joint distribution of the probability and statistics grades, X and Y, is given later in Table 2.6, and their marginal distributions are given in Table 2.7. What is the covariance of X and Y?

    Solution

    c02_image038.jpg

    Then,

    c02_image039.jpg

    To judge the extent of dependence or association between two r.v.’s, we need to standardize their covariance. The correlation (or correlation coefficient) is simply the covariance standardized so that its range is [−1, 1]. The correlation between X and Y is defined by

    ρXY = Corr(X, Y) = Cov(X, Y)/(σX σY)

    Note that ρXY is a unitless quantity, while Cov(X, Y) is not. It follows from the covariance relationship that if X and Y are independent, then ρXY = 0; however, ρXY = 0 does not imply that they are independent.

    Example 2.15 Correlation Coefficient from Probability and Statistics Grades. We saw that the covariance between the probability and statistics grades is 0.352 in Example 2.14. Although this tells us that two grades are positively associated, it does not tell us the strength of linear association, since the covariance is not a standardized measure. For this purpose, we calculate the correlation coefficient. Check that Var(X) = 0.858 and Var(Y) = 0.678. Then

    ρXY = 0.352/√((0.858)(0.678)) = 0.352/0.763 ≈ 0.46

    This correlation is not very close to 1, which implies that there is not a strong linear relationship between X and Y.
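    The standardization in Example 2.15 is a one-line computation in R (a sketch, not from the book):

        0.352 / sqrt(0.858 * 0.678)   # correlation of the two grades, about 0.46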

    2.6 DISTRIBUTIONS OF RANDOM VARIABLES

    In this section we describe the six discrete probability distributions and five continuous probability distributions that occur most frequently in bioinformatics and computational biology. These are called univariate models. In the last three sections, we discuss probability models that involve more than one random variable, called multivariate models.

    TABLE 2.2 R Function Names and Parameters for Standard Probability Distributions

    Table 2.2 describes the R function names and parameters for a number of standard probability distributions for general use. The first letter of the function name indicates which of the probability functions it describes. For example:

    dnorm is the density function (p.d.f.) of the normal distribution

    pnorm is the cumulative distribution function (c.d.f.) of the normal distribution

    qnorm is the quantile function of the normal distribution

    These functions can be used to replace statistical tables. For example, the 5% critical value for a (two-sided) t test on 11 degrees of freedom is given by qt(0.975, 11), and the P value
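    As a brief illustration of the d/p/q prefix convention described above (a sketch, not from the book; any standard R installation provides these functions):

        dnorm(0)             # standard normal density at 0
        pnorm(1.96)          # P(Z <= 1.96), about 0.975
        qnorm(0.975)         # 97.5% quantile of the standard normal, about 1.96
        qt(0.975, df = 11)   # 5% two-sided critical value for a t test on 11 df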
