CHAPTER 8

Tests of Hypotheses
OBJECTIVE
To introduce the basic concepts of a statistical test of a
hypothesis; to present statistical tests for several common
population parameters and to illustrate their use in practical
sampling situations

CONTENTS
8.1 The Relationship Between Statistical Tests of
Hypotheses and Confidence Intervals
8.2 Elements and Properties of a Statistical Test
8.3 Finding Statistical Tests: Classical Methods
8.4 Choosing the Null and Alternative Hypotheses
8.5 The Observed Significance Level for a Test: p-values
8.6 Testing a Population Mean
8.7 Testing the Difference Between Two Population
Means: Independent Samples
8.8 Testing the Difference Between Two Population
Means: Matched Pairs
8.9 Testing a Population Proportion
8.10 Testing the Difference Between Two Population
Proportions
8.11 Testing a Population Variance
8.12 Testing the Ratio of Two Population Variances
8.13 Alternative Testing Procedures: Bootstrapping
and Bayesian Methods (Optional)

• STATISTICS IN ACTION

• Comparing Methods for Dissolving Drug Tablets—
Dissolution Method Equivalence Testing

• STATISTICS IN ACTION

• Comparing Methods for Dissolving Drug Tablets—Dissolution Method Equivalence Testing

In the pharmaceutical industry, quality engineers are responsible for maintaining the quality of drug products produced in the manufacturing process. The key to quality is an assessment of product characteristics through repeated measurements of the variable of interest. When the variable is the concentration of a particular constituent in a mixture, the process is called an assay. For this “Statistics in Action”, we focus on a chemical assay to determine how fast a solid-dosage pharmaceutical product (e.g., an aspirin tablet or capsule) dissolves. Since variation in the dissolution of the drug can have harmful side effects on the patient, quality inspectors require a test that accurately measures dissolution.

In “Dissolution Method Equivalence” (Chapter 4, Statistical Case Studies: A Collaboration between Academe and Industry, ASA-SIAM Series on Statistics and Applied Probability, 1998), statisticians Russell Reeve and Francis Giesbrecht explored the dissolution characteristics of a new immediate-release drug product manufactured by a well-known pharmaceutical company. An immediate-release product is designed to dissolve and enter the bloodstream as fast as possible. To test for dissolution of the solid-dosage drug, the company uses an apparatus with six vessels or tubes, each tube containing a dissolving solution. Drug tablets or capsules are dropped in the tubes. Then, at predetermined times, a small amount of the solution is withdrawn from each tube and analyzed using high-performance liquid chromatography (HPLC). The HPLC device quantifies how much of the drug is in the solution; this value is expressed as percent of label strength (%LS).

Initially, the process described above is typically performed in a laboratory at the company’s research and development center. Once dissolution of the drug has been deemed satisfactory, the process is transferred to the manufacturing facility. However, federal regulations require that quality engineers at the manufacturing site produce results equivalent to those at the R&D center. In fact, the company must provide documentation that verifies that any two sites using the dissolution test produce equivalent assay results.

Dissolution test data for an analgesic in tablet form conducted at two manufacturing sites (New Jersey and Puerto Rico) are listed in Table SIA8.1. (These data are saved in the DISSOLVE file.) Note that %LS values were obtained at four different points in time (after 20 minutes, after 40 minutes, after 60 minutes, and after 120 minutes) for each of the six vessels. Based on the sample data, do the two sites produce equivalent assay results?

In the Statistics in Action Revisited section later in this chapter, we demonstrate how the methods outlined in this chapter can be used to make the comparison required by the quality control engineers at the pharmaceutical company.

TABLE SIA8.1 Dissolution Test Data (Percent Label Strength); data file: DISSOLVE

Site         Time (min)  Vessel 1  Vessel 2  Vessel 3  Vessel 4  Vessel 5  Vessel 6
New Jersey           20         5        10         2         7         6         0
                     40        72        79        81        70        72        73
                     60        96        99        93        95        96        99
                    120        99        99        96       100        98       100
Puerto Rico          20        10        12         7         3         5        14
                     40        65        66        71        70        74        69
                     60        95        99        98        94        90        92
                    120       100       102        98        99        97       100
Source: Reeve, R., and Giesbrecht, F. “Dissolution Method Equivalence.” Statistical Case Studies: A Collaboration between Academe and Industry,
(editors: R. Peck, L. Haugh, and A. Goodman), ASA-SIAM Series on Statistics and Applied Probability, 1998 (Chapter 4, Table 4).

8.1 The Relationship Between Statistical Tests of Hypotheses and Confidence Intervals

As stated in Chapter 7, there are two general methods available for making inferences about population parameters. We can estimate their values using confidence intervals (the subject of Chapter 7) or we can make decisions about them. Making decisions about specific values of the population parameters, that is, testing hypotheses about these values, is the topic of this chapter.

Confidence intervals and hypothesis tests are related and can be used to make decisions about parameters. For example, suppose an investigator for the Environmental Protection Agency (EPA) wants to determine whether the mean level μ of a certain type of pollutant released into the atmosphere by a chemical company meets the EPA guidelines. If 3 parts per million is the upper limit allowed by the EPA, the investigator would want to use sample data (daily pollution measurements) to decide whether the company is violating the law, i.e., to decide whether μ > 3. If, say, a 99% confidence interval for μ contained only numbers greater than 3, then the EPA would be confident that the mean exceeds the established limit.

As a second example, consider a manufacturer that purchases electric light fuses in lots of 10,000, and suppose that the supplier of the fuses guarantees that no more than 1% of the fuses in any given lot are defective. Since the manufacturer cannot test each of the 10,000 fuses in a lot, he must decide whether to accept or reject a lot based on an examination of a sample of fuses selected from the lot. If the number Y of defective fuses in a sample of, say, n = 100, is large, he will reject the lot and send it back to the supplier. Thus, he wants to decide whether the proportion p of defectives in the lot exceeds .01, based on information contained in a sample. If a confidence interval for p falls below .01, then the manufacturer will accept the lot and be confident that the proportion of defectives is less than 1%; otherwise, he will reject it.

The examples in the preceding paragraphs illustrate how a confidence interval can be used to make a decision about a parameter. Note that both applications are one-directional; the EPA wants to determine whether μ > 3 and the manufacturer wants to know if p > .01. (In contrast, if the manufacturer is interested in determining whether p > .01 or p < .01, the inference would be two-directional.)

Recall, from Chapter 7, that to find the value of z (or t) used in a (1 - α)100% confidence interval, the value of α is divided in half and α/2 is placed in both the upper and lower tails of the Z (or T) distribution. Consequently, confidence intervals are designed to be two-directional. Use of a two-directional technique in a situation where a one-directional method is desired will lead the researcher (e.g., the EPA or the manufacturer) to understate the level of confidence associated with the method. As we will explain in this chapter, hypothesis tests are appropriate for either one- or two-directional decisions about a population parameter.
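
The point about understated confidence can be checked numerically. The following is a minimal sketch, assuming a normal-based interval and using Python's standard `statistics.NormalDist`; the variable names are illustrative and not part of the text. It compares the z value used by a two-directional 99% interval with the one a one-directional 99% statement would need, and reports the confidence that a single limit of the two-directional interval actually carries.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

confidence = 0.99
alpha = 1 - confidence

# Two-directional interval: alpha/2 is placed in each tail.
z_two_sided = std_normal.inv_cdf(1 - alpha / 2)
# One-directional statement: all of alpha is placed in one tail.
z_one_sided = std_normal.inv_cdf(1 - alpha)

# Confidence attached to a single limit of the two-directional interval.
one_directional_confidence = std_normal.cdf(z_two_sided)

print(round(z_two_sided, 3))                 # 2.576
print(round(z_one_sided, 3))                 # 2.326
print(round(one_directional_confidence, 3))  # 0.995
```

Under these assumptions, a lower limit taken from a 99% two-directional interval behaves like a 99.5% one-directional bound, so quoting 99% understates the confidence.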

8.2 Elements and Properties of a Statistical Test


We now return to the EPA example to introduce the concepts involved in a test of a hypothesis. We will use a method analogous to proof by contradiction. The theory the EPA wants to support, called the alternative (or research) hypothesis, is that μ > 3, where μ is the true mean level of pollution in parts per million. The alternative hypothesis is denoted by the symbol Ha. The theory contradictory to the alternative hypothesis, that μ is at most equal to 3, say, μ = 3, is called the null hypothesis and is denoted by the symbol H0. Thus, the EPA hopes to show support for the alternative hypothesis, μ > 3, by obtaining sample evidence indicating that the null hypothesis, μ = 3, is false. That is, the EPA wants to test

H0: μ = 3
Ha: μ > 3

The decision whether to reject the null hypothesis is based on a statistic, called a test statistic, computed from sample data. For example, suppose the EPA plans to base its decision on a sample of n = 30 daily pollution readings. If the sample mean ȳ of the 30 pollution measurements is much larger than 3, the EPA would tend to reject the null hypothesis and conclude that μ > 3. However, if ȳ is smaller than 3, say, ȳ = 2.8 parts per million, there is insufficient evidence to refute the null hypothesis. Thus, the sample mean ȳ serves as a test statistic.

The values that the test statistic ȳ can assume will be divided into two sets. Those larger than some specified value, say, ȳ ≥ 3.1, will imply rejection of the null hypothesis and acceptance of the alternative hypothesis. This set of values of the test statistic is known as the rejection region for the test. A test of the null hypothesis, H0: μ = 3, against the alternative hypothesis, Ha: μ > 3, employing the sample mean ȳ as a test statistic and ȳ ≥ 3.1 as a rejection region, represents one particular test that possesses specific properties. If we change the rejection region to ȳ ≥ 3.2, we obtain a different test with different properties.

The preceding discussion indicates that a statistical test consists of the five elements summarized in the box.

Elements of a Statistical Test


1. Null hypothesis, H0, about one or more population parameters
2. Alternative hypothesis, Ha, that we will accept if we decide to reject the null
hypothesis
3. Test statistic, computed from sample data
4. Rejection region, indicating the values of the test statistic that will imply rejection of the null hypothesis
5. Conclusion, the decision made on whether to accept or reject the null hypothesis

Since a statistical test can result in one of only two outcomes (rejecting or not rejecting the null hypothesis), the test conclusion is subject to only two types of error. In the preceding example, the EPA wants to test H0: μ = 3 against Ha: μ > 3. If the EPA investigator concludes that Ha is true (i.e., if he rejects H0), then the EPA will charge the company with violating its pollution standards. The two errors that the EPA can make are shown in Table 8.1.

TABLE 8.1 Conclusions and Consequences for the EPA’s Test of Hypothesis

                                        True State of Nature
EPA Decision                            Company Not in Violation (H0 true)   Company in Violation (Ha true)
Company in Violation (Reject H0)        Type I error                         Correct decision
Company Not in Violation (Accept H0)    Correct decision                     Type II error

The EPA might reject the null hypothesis if, in fact, it is true. That is, the EPA might charge the company with violating its standards, when, in fact, the company is innocent (Type I error). Or the EPA might decide to accept the null hypothesis if, in fact, it is false. That is, the EPA may conclude that the company is not in violation of the pollution standards when, in fact, the company is in violation (Type II error). The probabilities of making these two types of errors measure the risks of making incorrect decisions when we perform a test of hypothesis and, consequently, provide measures of the goodness of this inferential decision-making procedure.

Definition 8.1
Rejecting the null hypothesis if it is true is a Type I error. The probability of making a Type I error is denoted by the symbol α.

Definition 8.2
Accepting the null hypothesis if it is false is a Type II error. The probability of making a Type II error is denoted by the symbol β.

Which of the two errors, Type I or Type II, is more serious? From the EPA’s perspective, the Type I error is the more serious error. If the EPA falsely accuses the company of violating the pollution limits, a costly lawsuit will likely occur. On the other hand, the residents who live near the chemical company would probably view the Type II error as more serious; if this error occurs, the EPA is failing to charge the company when it is, in fact, polluting the surrounding air. In either case, it is important to compute the probabilities, α and β, to assess the reliability of inferences derived from the hypothesis test. The next four examples illustrate how to compute these probabilities.

Example 8.1 Elements of a Statistical Test: Proportion of Software Purchasers
A manufacturer of notebook computers believes that it can sell a particular software package to more than 20% of the buyers of its computers. Ten prospective purchasers of the notebook computer were randomly selected and questioned about their interest in the software package. Of these, four indicated that they planned to buy the package. Does this sample provide sufficient evidence to indicate that more than 20% of the computer purchasers will buy the software package?

Solution Let p be the true proportion of all prospective notebook computer buyers who will purchase the software package. Since we want to show that p > .2, we choose Ha: p > .2 for the alternative hypothesis and H0: p = .2 for the null hypothesis. We will use the binomial random variable Y, the number of prospective purchasers in the sample who plan to buy the software, as the test statistic and will reject H0: p = .2 if Y is large. A graph of p(y) for n = 10 and p = .2 is shown in Figure 8.1.

FIGURE 8.1
Graph of p(y) for n = 10 and p = .2, i.e., if the null hypothesis is true (probability histogram over y = 0, 1, ..., 10 with the rejection region marked; α = .121)


Large values of Y will support the alternative hypothesis, Ha: p > .2, but what values of Y should we include in the rejection region? Suppose that we select values of Y ≥ 4 as the rejection region. Then the elements of the test are

H0: p = .2
Ha: p > .2
Test statistic: Y = y
Rejection region: y ≥ 4

To conduct the test, we note that the observed value of Y, y = 4, falls in the rejection region. Thus, for this test procedure, we reject the null hypothesis, H0: p = .2, and conclude that the manufacturer is correct, i.e., p > .2.

Example 8.2 Computing the Type I Error Rate
What is the probability that the statistical test procedure of Example 8.1 would lead us to an incorrect decision if, in fact, the null hypothesis is true?

Solution We will calculate the probability α that the test procedure would lead us to make a Type I error, i.e., to reject H0 if, in fact, H0 is true. This is the probability that y falls in the rejection region if in fact p = .2:

\[
\alpha = P(Y \ge 4 \mid p = .2) = 1 - \sum_{y=0}^{3} p(y)
\]

The partial sum of p(y) over y = 0, 1, 2, 3 for a binomial random variable with n = 10 and p = .2 is given in Table 2 of Appendix B as .879. Therefore,

\[
\alpha = 1 - \sum_{y=0}^{3} p(y) = 1 - .879 = .121
\]

The probability that the test procedure would lead us to conclude that p > .2, if in fact it is not, is .121. This probability corresponds to the area of the shaded region in Figure 8.1.
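
The table lookup can be reproduced directly from the binomial formula. The sketch below is one way to do this with Python's standard library; the function names `binom_pmf` and `upper_tail` are illustrative choices, not notation from the text.

```python
from math import comb

def binom_pmf(y, n, p):
    """P(Y = y) for a binomial random variable with n trials and success probability p."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

def upper_tail(cutoff, n, p):
    """P(Y >= cutoff), computed as 1 minus the partial sum over y = 0, ..., cutoff - 1."""
    return 1 - sum(binom_pmf(y, n, p) for y in range(cutoff))

# Type I error rate for the rejection region Y >= 4 when H0: p = .2 is true
alpha = upper_tail(4, n=10, p=0.2)
print(round(alpha, 3))  # 0.121
```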

In Example 8.2, we computed the probability α of committing a Type I error. The probability β of making a Type II error, i.e., failing to detect a value of p greater than .2, depends on the value of p. For example, if p = .20001, it will be very difficult to detect this small deviation from the null hypothesized value of p = .2. In contrast, if p = 1.0, then every prospective purchaser of the notebook computer will want to buy the software package, and in such a case it will be very evident from the sample information that p > .2. We will illustrate the procedure for calculating β in Example 8.3.

Example 8.3 Computing the Type II Error Rate
Refer to Example 8.2 and suppose that p is actually equal to .60. What is the probability β that the test procedure will fail to reject H0: p = .2 if, in fact, p = .6?

Solution The binomial probability distribution p(y) for n = 10 and p = .6 is shown in Figure 8.2. The probability that we will fail to reject H0 is equal to the probability that Y = 0, 1, 2, or 3, i.e., the probability that Y does not fall in the rejection region. This probability, β, corresponds to the shaded area under the probability histogram in the figure. Therefore,

\[
\beta = P(Y \le 3 \mid p = .6) = \sum_{y=0}^{3} p(y) \quad \text{for } n = 10 \text{ and } p = .6
\]

FIGURE 8.2
Graph of p(y) for n = 10 and p = .6, i.e., if the alternative hypothesis is true (probability histogram over y = 0, 1, ..., 10 with the rejection region marked)

This partial sum, given in Table 2 of Appendix B for a binomial random variable with n = 10 and p = .6, is .055. Therefore, the probability that we will fail to reject H0: p = .2 if p is as large as .6 is β = .055.

Another important property of a statistical test is its ability to detect departures from the null hypothesis when they exist. This is measured by the probability of rejecting H0 when, in fact, H0 is false. Note that this probability is simply (1 - β):

\[
\begin{aligned}
P(\text{Reject } H_0 \text{ when } H_0 \text{ is false}) &= 1 - P(\text{Accept } H_0 \text{ when } H_0 \text{ is false}) \\
&= 1 - P(\text{Type II error}) \\
&= 1 - \beta
\end{aligned}
\]

The probability (1 - β) is called the power of the test. The higher the power, the greater the probability of detecting departures from H0 when they exist.

Definition 8.3
The power of a statistical test, (1 - β), is the probability of rejecting the null hypothesis H0 when, in fact, H0 is false.

Example 8.4 Computing the Power of a Test
Refer to the test of hypothesis in Example 8.1. Find the power of the test if in fact p = .3.

Solution From Definition 8.3, the power of the test is the probability (1 - β). The probability of making a Type II error, i.e., failing to reject H0: p = .2, if in fact p = .3, will be larger than the value of β calculated in Example 8.3 because p = .3 is much closer to the hypothesized value of p = .2. Thus,

\[
\beta = P(Y \le 3 \mid p = .3) = \sum_{y=0}^{3} p(y) \quad \text{for } n = 10 \text{ and } p = .3
\]

The value of this partial sum, given in Table 2 of Appendix B for a binomial random variable with n = 10 and p = .3, is .650. Therefore, the probability that we will fail to reject H0: p = .2 if in fact p = .3 is β = .650 and the power of the test is (1 - β) = (1 - .650) = .350. You can see that the closer the actual value of p is to the hypothesized null value, the more unlikely it is that we will reject H0: p = .2.
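
To see how the power changes with the true value of p, the calculation in Examples 8.3 and 8.4 can be repeated over a grid of alternatives. This is a minimal sketch using only Python's standard library; the function name `power` and the particular grid of p values are assumptions made for illustration.

```python
from math import comb

def power(p, n=10, cutoff=4):
    """P(Y >= cutoff) when the true success probability is p (rejection region Y >= cutoff)."""
    # beta is the probability that Y lands outside the rejection region
    beta = sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(cutoff))
    return 1 - beta

for p in (0.3, 0.4, 0.6, 0.8):
    print(f"p = {p:.1f}: beta = {1 - power(p):.3f}, power = {power(p):.3f}")
```

For p = .3 and p = .6 the printed β values agree with the tabled values .650 and .055 used in the examples, and the power rises as p moves away from the null value .2.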

The preceding examples indicate how we can calculate α and β for a simple statistical test and thereby measure the risks of making Type I and Type II errors. These probabilities describe the properties of this inferential decision-making procedure and enable us to compare one test with another. For two tests, each with a rejection region selected so that α is equal to some specified value, say, .10, we would select the test that, for a specified alternative, has the smaller risk of making a Type II error, i.e., one that has the smaller value of β. This is equivalent to choosing the test with the higher power.

We will present a number of statistical tests in the following sections. In each case, the probability α of making a Type I error is known, i.e., α is selected by the experimenter and the rejection region is determined accordingly. In contrast, the value of β for a specific alternative is often difficult to calculate. This explains why we attempt to show that Ha is true by showing that the data do not support H0. We hope that the sample evidence will support the alternative (or research) hypothesis. If it does, we will be concerned only about making a Type I error, i.e., rejecting H0 if it is true. The probability α of committing such an error will be known.
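
Because the rejection region determines both error probabilities, it can help to tabulate them side by side. The sketch below does this for the setting of Example 8.1 (n = 10, H0: p = .2), evaluating β at the alternative p = .6 from Example 8.3. The helper names and the set of cutoffs examined are illustrative assumptions, not part of the text.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(Y <= k) for a binomial(n, p) random variable."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(k + 1))

def error_rates(cutoff, n, p_null, p_alt):
    """Return (alpha, beta) for the rejection region Y >= cutoff."""
    alpha = 1 - binom_cdf(cutoff - 1, n, p_null)  # reject H0 although p = p_null
    beta = binom_cdf(cutoff - 1, n, p_alt)        # fail to reject although p = p_alt
    return alpha, beta

for cutoff in (3, 4, 5, 6):
    a, b = error_rates(cutoff, n=10, p_null=0.2, p_alt=0.6)
    print(f"Y >= {cutoff}: alpha = {a:.3f}, beta = {b:.3f}")
```

With the sample size held fixed, moving the cutoff upward lowers α and raises β, which is why the rejection region is fixed by the chosen α before the data are examined.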

Applied Exercises

8.1 Miscellaneous. Define α and β for a statistical test of hypothesis.

8.2 Miscellaneous. Explain why each of the following statements is incorrect:
a. The probability that the null hypothesis is correct is equal to α.
b. If the null hypothesis is rejected, then the test proves that the alternative hypothesis is correct.
c. In all statistical tests of hypothesis, α + β = 1.

8.3 Screening new drugs. Pharmaceutical companies are continually searching for new drugs. Testing the thousands of compounds for the few that might be effective is known in the pharmaceutical industry as drug screening. Dunnett (1978) views the drug-screening procedure in its preliminary stage in terms of a statistical decision problem: “In drug screening, two actions are possible: (1) to ‘reject’ the drug, meaning to conclude that the tested drug has little or no effect, in which case it will be set aside and a new drug selected for screening; and (2) to ‘accept’ the drug provisionally, in which case it will be subjected to further, more refined experimentation.”* Since it is the goal of the researcher to find a drug that effects a cure, the null and alternative hypotheses in a statistical test would take the following form:
H0: Drug is ineffective in treating a particular disease
Ha: Drug is effective in treating a particular disease
Dunnett comments on the possible errors associated with the drug-screening procedure: “To abandon a drug when in fact it is a useful one (a false negative) is clearly undesirable, yet there is always some risk in that. On the other hand, to go ahead with further, more expensive testing of a drug that is in fact useless (a false positive) wastes time and money that could have been spent on testing other compounds.”
a. A false negative corresponds to which type of error, Type I or Type II?
b. A false positive corresponds to which type of error, Type I or Type II?
c. Which of the two errors is more serious? Explain.

*From Tanur, J. M., et al., eds. Statistics: A Guide to the Unknown. San Francisco: Holden-Day, 1978.

8.4 Pascal array variables. Pascal is a high-level programming language used frequently in microprocessors. An experiment was conducted to investigate the proportion of Pascal variables that are array variables (in contrast to scalar variables, which are less efficient in terms of execution time). Twenty variables are randomly selected from a set of Pascal programs and Y, the number of array variables, is recorded. Suppose we want to test the hypothesis that Pascal is a more efficient language than Algol, in which 20% of the variables are array variables. That is, we will test H0: p = .20 against Ha: p > .20, where p is the probability of observing an array variable on each trial. (Assume that the 20 trials are independent.)
a. Find α for the rejection region Y ≥ 8.
b. Find α for the rejection region Y ≥ 5.
c. Find β for the rejection region Y ≥ 8 if p = .5. (Note: Past experience has shown that approximately half the variables in most Pascal programs are array variables.)
d. Find β for the rejection region Y ≥ 5 if p = .5.
e. Which of the rejection regions, Y ≥ 8 or Y ≥ 5, is more desirable if you want to minimize the probability of a Type I error? Type II error?
f. Find the rejection region of the form Y ≥ a so that α is approximately equal to .01.
g. For the rejection region determined in part f, find the power of the test, if in fact p = .4.
h. For the rejection region determined in part f, find the power of the test, if in fact p = .7.

8.5 Defective power meters. A manufacturer of power meters, which are used to regulate energy thresholds of a data-communications system, claims that when its production process is operating correctly, only 10% of the power meters will be defective. A vendor has just received a shipment of 25 power meters from the manufacturer. Suppose the vendor wants to test H0: p = .10 against Ha: p > .10, where p is the true proportion of power meters that are defective. Use Y ≥ 6 as the rejection region.
a. Determine the value of α for this test procedure.
b. Find β if in fact p = .2. What is the power of the test for this value of p?
c. Find β if in fact p = .4. What is the power of the test for this value of p?

8.6 Authorizing computer users. At high-technology industries, computer security is achieved by using a password, a collection of symbols (usually letters and numbers) that must be supplied by the user before the computer permits access to the account. The problem is that persistent hackers can create programs that enter millions of combinations of symbols into a target system until the correct password is found. The newest systems solve this problem by requiring authorized users to identify themselves by unique body characteristics. For example, a system developed by Palmguard, Inc. tests the hypothesis
H0: The proposed user is authorized
Ha: The proposed user is unauthorized
by checking characteristics of the proposed user’s palm against those stored in the authorized users’ data bank.
a. Define a Type I error and Type II error for this test. Which is the more serious error? Why?
b. Palmguard reports that the Type I error rate for its system is less than 1%, whereas the Type II error rate is .00025%. Interpret these error rates.
c. Another successful security system, the EyeDentifyer, “spots authorized computer users by reading the one-of-a-kind patterns formed by the network of minute blood vessels across the retina at the back of the eye.” The EyeDentifyer reports Type I and II error rates of .01% (1 in 10,000) and .005% (5 in 100,000), respectively. Interpret these rates.

Theoretical Exercise

8.7 Show that for a fixed sample size n, α increases as β decreases, and vice versa.

8.3 Finding Statistical Tests: Classical Methods


To find a statistical test about one or more population parameters, we must (1) find a suitable test statistic and (2) specify a rejection region. Classical statisticians use a method proposed by R. A. Fisher for finding a reasonable test statistic for testing a hypothesis. For example, suppose we want to test a hypothesis about the sole parameter θ of a probability function p(y) or density function f(y), and let L represent the likelihood function of the sample. Then to test the null hypothesis, H0: θ = θ0, Fisher’s likelihood ratio test statistic is

\[
\lambda = \frac{\text{Likelihood assuming } \theta = \theta_0}{\text{Likelihood assuming } \theta = \hat{\theta}} = \frac{L(\theta_0)}{L(\hat{\theta})}
\]

where θ̂ is the maximum likelihood estimator of θ. Fisher reasoned that if θ differs from θ0, then the value of the likelihood L when θ = θ̂ will be larger than when θ = θ0. Thus, the rejection region for the test contains values of λ that are small, say, smaller than some value λR.
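
As a concrete illustration of the ratio, the sketch below evaluates λ for binomial data such as those of Example 8.1 (y = 4 successes in n = 10 trials, H0: p = .2). It relies on the standard fact that the maximum likelihood estimator of a binomial proportion is y/n; the function names are hypothetical, and the example is only meant to show how the statistic behaves, not how a rejection value λR would be chosen.

```python
def binomial_likelihood(p, y, n):
    """Likelihood of p given y successes in n trials.

    The binomial coefficient C(n, y) is omitted because it cancels in the ratio.
    """
    return p**y * (1 - p)**(n - y)

def likelihood_ratio(y, n, p0):
    """lambda = L(p0) / L(p_hat), where p_hat = y / n is the maximum likelihood estimate."""
    p_hat = y / n
    return binomial_likelihood(p0, y, n) / binomial_likelihood(p_hat, y, n)

lam = likelihood_ratio(y=4, n=10, p0=0.2)
print(round(lam, 3))  # 0.351
```

The ratio never exceeds 1, equals 1 when the estimate coincides with the hypothesized value, and shrinks as the data move away from H0, which matches the reasoning that small values of λ form the rejection region.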
If you are interested in learning more about Fisher’s likelihood ratio test, consult the references at the end of this chapter. Fortunately, most of the statistics that we would choose intuitively for test statistics are functions of the corresponding likelihood ratio statistic λ. These are the pivotal statistics used to construct confidence intervals in Chapter 7.
Recall that most of the pivotal statistics in Chapter 7 have approximately normal sampling distributions for large samples. This fact allows us to easily derive a large-sample statistical test of hypothesis. To illustrate, suppose that we want to test a hypothesis, H0: θ = θ0, about a parameter θ and that the estimator θ̂ possesses a normal sampling distribution with mean θ and standard deviation σ_θ̂. We will further assume that σ_θ̂ is known or that we can obtain a good approximation for it when the sample size(s) is (are) large. It can be shown (proof omitted) that the likelihood ratio test statistic λ reduces to the standard normal variable Z:

\[
Z = \frac{\hat{\theta} - \theta_0}{\sigma_{\hat{\theta}}}
\]

The location of the rejection region for this test can be deduced by examining the formula for the test statistic Z. The farther θ̂ departs from θ0, i.e., the larger the absolute value of the deviation |θ̂ - θ0|, the greater will be the weight of evidence to indicate that θ is not equal to θ0. If we want to detect values of θ larger than θ0, i.e., Ha: θ > θ0, we locate the rejection region in the upper tail of the sampling distribution of the standard normal z test statistic (see Figure 8.3a). If we want to detect only values of θ less than θ0, i.e., Ha: θ < θ0, we locate the rejection region in the lower tail of the z distribution (see Figure 8.3b). These two tests are called one-tailed statistical tests because the entire rejection region is located in only one tail of the Z distribution. However, if we want to detect either a value of θ larger than θ0 or a value smaller than θ0, i.e., Ha: θ ≠ θ0, we locate the rejection region in both the upper and the lower tails of the z distribution (see Figure 8.3c). This is called a two-tailed statistical test.

FIGURE 8.3
Rejection regions for one- and two-tailed tests
a. One-tailed test; Ha: θ > θ0 (area α in the upper tail, beyond zα)
b. One-tailed test; Ha: θ < θ0 (area α in the lower tail, below -zα)
c. Two-tailed test; Ha: θ ≠ θ0 (area α/2 in each tail, beyond -zα/2 and zα/2)

The large-sample statistical test that we have described is summarized in the following box. Many of the population parameters and test statistics discussed in the remaining sections of this chapter satisfy the assumptions of this test. We will illustrate the use of the test with a practical example on the population mean μ.

A Large-Sample Test Based on the Standard Normal z Test Statistic

                    One-Tailed Test                  Two-Tailed Test
Null hypothesis:    H0: θ = θ0                       H0: θ = θ0
Alternative:        Ha: θ > θ0 (or Ha: θ < θ0)       Ha: θ ≠ θ0
Test statistic:     Z = (θ̂ - θ0)/σ_θ̂                 Z = (θ̂ - θ0)/σ_θ̂
Rejection region:   Z > zα (or Z < -zα)              |Z| > zα/2
                    where P(Z > zα) = α              where P(Z > zα/2) = α/2
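
The critical values zα and zα/2 in the box are usually read from a normal table, but they can also be computed. The following is a minimal sketch using `statistics.NormalDist` from Python's standard library; the function names and the string labels for the alternative hypothesis are hypothetical conveniences, not notation from the text.

```python
from statistics import NormalDist

def critical_value(alpha, two_tailed=False):
    """Return z_alpha for a one-tailed test or z_(alpha/2) for a two-tailed test."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

def in_rejection_region(z, alpha, alternative):
    """Apply the rejection rule from the box; alternative is 'greater', 'less', or 'two-sided'."""
    if alternative == "greater":
        return z > critical_value(alpha)
    if alternative == "less":
        return z < -critical_value(alpha)
    if alternative == "two-sided":
        return abs(z) > critical_value(alpha, two_tailed=True)
    raise ValueError(f"unknown alternative: {alternative!r}")

print(round(critical_value(0.10), 2))                   # 1.28
print(round(critical_value(0.05, two_tailed=True), 2))  # 1.96
```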

Example 8.5 Testing μ: Mean Number of Heavy Freight Trucks Traveling per Hour
The Department of Highway Improvements, responsible for repairing a 25-mile stretch of interstate highway, wants to design a surface that will be structurally efficient. One important consideration is the volume of heavy freight traffic on the interstate. State weigh stations report that the average number of heavy-duty trailers traveling on a 25-mile segment of the interstate is 72 per hour. However, the section of highway to be repaired is located in an urban area and the department engineers believe that the volume of heavy freight traffic for this particular section is greater than the average reported for the entire interstate. To validate this theory, the department monitors the highway for 50 1-hour periods randomly selected throughout the month. Suppose the sample mean and standard deviation of the heavy freight traffic for the 50 sampled hours are

ȳ = 74.1    s = 13.3

Do the data support the department’s theory? Use α = .10.

Solution For this example, the parameter of interest is μ, the average number of heavy-duty
trailers traveling on the 25-mile stretch of interstate highway. Recall that the sample
mean ȳ is used to estimate μ and that for large n, ȳ has an approximately normal
sampling distribution. Thus, we can apply the large-sample test outlined in the box.
The elements of the test are
H0: μ = 72
Ha: μ > 72
Test statistic: $Z = \dfrac{\bar{y} - 72}{\sigma_{\bar{y}}} = \dfrac{\bar{y} - 72}{\sigma/\sqrt{n}} \approx \dfrac{\bar{y} - 72}{s/\sqrt{n}}$
Rejection region: Z > 1.28 (since $z_{.10} = 1.28$, from Table 5 of Appendix B)
We now substitute the sample statistics into the test statistic to obtain
$$Z \approx \frac{74.1 - 72}{13.3/\sqrt{50}} = 1.12$$
Thus, although the average number of heavy freight trucks per hour in the sample
exceeds the state’s average by more than 2, the Z value of 1.12 does not fall in the
rejection region (see Figure 8.4). Therefore, this sample does not provide sufficient
evidence at α = .10 to support the Department of Highway Improvements’ theory.
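
The arithmetic of Example 8.5 can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` for the standard normal distribution; the function name `large_sample_z_upper` is illustrative and does not come from the text.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_z_upper(ybar, s, n, mu0, alpha):
    """Upper-tailed large-sample z test of H0: mu = mu0 versus Ha: mu > mu0.

    Returns (z, critical value, reject?). The sample standard deviation s
    stands in for sigma, which is the large-sample approximation.
    """
    z = (ybar - mu0) / (s / sqrt(n))
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper-tail critical value
    return z, z_alpha, z > z_alpha

z, z_alpha, reject = large_sample_z_upper(ybar=74.1, s=13.3, n=50, mu0=72, alpha=0.10)
print(f"Z = {z:.2f}, z_alpha = {z_alpha:.2f}, reject H0: {reject}")
# prints: Z = 1.12, z_alpha = 1.28, reject H0: False
```

The critical value is computed from the inverse normal CDF rather than read from Table 5, so it agrees with the tabled 1.28 only after rounding.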

What is the risk of making an incorrect decision in Example 8.5? If we reject the
null hypothesis, then we know that the probability of making a Type I error (rejecting
H0 if it is true) is α = .10. However, we failed to reject the null hypothesis in
Example 8.5 and, consequently, we must be concerned about the possibility of making
a Type II error (accepting H0 if, in fact, it is false). We will evaluate the risk of
making a Type II error in Example 8.6.

FIGURE 8.4 Location of the test statistic for Example 8.5

Example 8.6 Calculating β for the Traveling Trucks Test
Refer to the one-tailed test for μ, Example 8.5. If the mean number μ of heavy freight trucks traveling
a particular 25-mile stretch of interstate highway is in fact 78 per hour, what is the probability that the
test procedure of Example 8.5 would fail to detect it? That is, what is the probability β that we would
fail to reject H0: μ = 72 in this one-tailed test if μ is actually equal to 78?

Solution To calculate β for the large-sample Z test, we need to specify the rejection region in
terms of the point estimator θ̂, where, for this example, θ̂ = ȳ. From Figure 8.4, you
can see that the rejection region consists of values of Z ≥ 1.28. To determine the
value of ȳ corresponding to z = 1.28, we substitute into the equation

$$Z = \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} \approx \frac{\bar{y} - \mu_0}{s/\sqrt{n}} \quad\text{or}\quad 1.28 = \frac{\bar{y} - 72}{13.3/\sqrt{50}}$$

Solving for ȳ, we obtain ȳ = 74.41. Therefore, the rejection region for the test is
Z ≥ 1.28 or, equivalently, ȳ ≥ 74.41.
The dotted curve in Figure 8.5 is the sampling distribution for ȳ if H0: μ = 72 is
true. This curve was used to locate the rejection region for ȳ (and, equivalently, z),
i.e., values of ȳ contradictory to H0: μ = 72. The solid curve is the sampling
distribution for ȳ if μ = 78. Since we want to find β if H0 is in fact false and μ = 78, we
want to find the probability that ȳ does not fall in the rejection region if μ = 78. This
probability corresponds to the shaded area under the solid curve for values of
ȳ < 74.41. To find this area under the normal curve, we need to find the area A
corresponding to

$$Z = \frac{\bar{y} - 78}{\sigma/\sqrt{n}} \approx \frac{74.41 - 78}{13.3/\sqrt{50}} = -1.91$$

FIGURE 8.5 The probability β of making a Type II error if μ = 78 in Example 8.6

The value of A, given in Table 5 of Appendix B, is .4719. Then from Figure 8.5, it can
be seen that

β = .5 − A = .5 − .4719 = .0281

Therefore, the probability of failing to reject H0: μ = 72 if μ is, in fact, as large as
μ = 78 is only .0281.
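
The two steps of Example 8.6 (locate the border of the rejection region on the ȳ scale, then find the area below it under the alternative) can be sketched as follows. This is an illustrative calculation that assumes `statistics.NormalDist` in place of Table 5; the variable names are not from the text.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
mu0, mu_a = 72, 78            # null value and the alternative of interest
s, n, alpha = 13.3, 50, 0.10
se = s / sqrt(n)              # estimated standard error of the sample mean

# Step 1: border of the rejection region expressed in terms of ybar.
ybar_border = mu0 + std_normal.inv_cdf(1 - alpha) * se

# Step 2: beta = P(ybar < border) when the true mean is mu_a.
beta = std_normal.cdf((ybar_border - mu_a) / se)
print(f"border = {ybar_border:.2f}, beta = {beta:.4f}")
```

Because the sketch does not round z to two decimals before looking up the area, its β can differ from the table-based .0281 in the fourth decimal place.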

Example 8.6 illustrates that it is not too difficult to calculate β for various
alternatives for the large-sample Z test (see box). However, it may be extremely difficult to
calculate β for other tests. Although sophisticated techniques are available for
evaluating the risk of making a Type II error when the exact value of β is unavailable or is
difficult to calculate, they are beyond the scope of this text. Consult the references at the
end of this chapter if you are interested in learning about these methods.

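Calculating β for a Large-Sample Z Test

Consider a large-sample test of H0: θ = θ0 at significance level α. The value of β for
a specific value of the alternative θ = θa is calculated as follows:

Upper-tailed test: $\beta = P\left(Z < \dfrac{\hat{\theta}_0 - \theta_a}{\sigma_{\hat{\theta}}}\right)$
where $\hat{\theta}_0 = \theta_0 + z_{\alpha}\sigma_{\hat{\theta}}$ is the value of the estimator corresponding to the border of
the rejection region

Lower-tailed test: $\beta = P\left(Z > \dfrac{\hat{\theta}_0 - \theta_a}{\sigma_{\hat{\theta}}}\right)$
where $\hat{\theta}_0 = \theta_0 - z_{\alpha}\sigma_{\hat{\theta}}$ is the value of the estimator corresponding to the border of
the rejection region

Two-tailed test: $\beta = P\left(\dfrac{\hat{\theta}_{0,L} - \theta_a}{\sigma_{\hat{\theta}}} < Z < \dfrac{\hat{\theta}_{0,U} - \theta_a}{\sigma_{\hat{\theta}}}\right)$
where $\hat{\theta}_{0,U} = \theta_0 + z_{\alpha/2}\sigma_{\hat{\theta}}$ and $\hat{\theta}_{0,L} = \theta_0 - z_{\alpha/2}\sigma_{\hat{\theta}}$ are the values of the estimator
corresponding to the borders of the rejection region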
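
The three cases in the box can be collected into one small function. The sketch below is illustrative only: the name `beta_large_sample` and its `tail` argument are assumptions, and the two-tailed branch splits α equally between the tails, consistent with the two-tailed rejection region |Z| > z_{α/2}.

```python
from statistics import NormalDist

def beta_large_sample(theta0, theta_a, se, alpha, tail):
    """Type II error probability for a large-sample Z test of H0: theta = theta0.

    tail is "upper", "lower", or "two"; se is the standard error of the estimator.
    """
    std_normal = NormalDist()
    if tail == "upper":
        border = theta0 + std_normal.inv_cdf(1 - alpha) * se
        return std_normal.cdf((border - theta_a) / se)
    if tail == "lower":
        border = theta0 - std_normal.inv_cdf(1 - alpha) * se
        return 1 - std_normal.cdf((border - theta_a) / se)
    if tail == "two":
        half_width = std_normal.inv_cdf(1 - alpha / 2) * se
        lower, upper = theta0 - half_width, theta0 + half_width
        return std_normal.cdf((upper - theta_a) / se) - std_normal.cdf((lower - theta_a) / se)
    raise ValueError("tail must be 'upper', 'lower', or 'two'")

# Numbers from Examples 8.5 and 8.6: mu0 = 72, mu_a = 78, s = 13.3, n = 50.
se = 13.3 / 50 ** 0.5
beta_one = beta_large_sample(72, 78, se, 0.10, "upper")
beta_two = beta_large_sample(72, 78, se, 0.10, "two")
print(f"one-tailed beta = {beta_one:.3f}, two-tailed beta = {beta_two:.3f}")
```

With the same α, the two-tailed test gives a larger β at μ = 78 than the upper-tailed test, because part of α is spent on a tail that cannot detect this alternative.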

Theoretical Exercises
8.8 Suppose y1, y2, …, yn is a random sample from a normal distribution with unknown mean μ
and variance σ² = 1, i.e.,
$$f(y) = \frac{1}{\sqrt{2\pi}}\, e^{-(y-\mu)^2/2}$$
Show that the likelihood L of the sample is
$$L(\mu) = \left(\frac{1}{\sqrt{2\pi}}\right)^{n} e^{-\sum_{i=1}^{n}(y_i-\mu)^2/2}$$
8.9 Refer to Exercise 8.8. Suppose we want to test H0: μ = 0 against the alternative Ha: μ > 0.
Since the estimator of μ is μ̂ = ȳ, the likelihood ratio test statistic is
$$\lambda = \frac{L(\mu_0)}{L(\hat{\mu})} = \frac{L(0)}{L(\bar{y})}$$
Show that
$$\lambda = e^{-n(\bar{y})^2/2}$$
[Hint: Use the fact that $\sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$.]
8.10 Refer to Exercises 8.8 and 8.9. Show that the rejection region $\lambda \le \lambda_{\alpha}$ is equivalent to the
rejection region $\bar{y} \ge \bar{y}_{\alpha}$, where $P(\lambda \le \lambda_{\alpha}) = \alpha$ and $P(\bar{y} \ge \bar{y}_{\alpha}) = \alpha$. (Hint: Use
the fact that $e^{-a^2} \to 0$ as $|a| \to \infty$.)

8.4 Choosing the Null and Alternative Hypotheses


Now that you have conducted a large-sample statistical test of hypothesis and have
seen how to calculate the value of β (the probability of failing to reject H0: θ = θ0 if
θ is in fact equal to some alternative value, θ = θa), the logic for choosing the null
and alternative hypotheses may make more sense to you. The theory that we want to
support (or detect if true) is usually chosen as the alternative hypothesis because, if the
data support Ha (i.e., if we reject H0), we immediately know the value of α, the
probability of incorrectly rejecting H0 if it is true. For example, in Example 8.5, the
Department of Highway Improvements theorized that the mean number of heavy-duty
vehicles traveling a certain segment of interstate exceeds 72 per hour. Consequently,
the department set up the alternative hypothesis as Ha: μ > 72. In contrast, if we
choose the null hypothesis as the theory that we want to support, and if the data
support this theory, i.e., the test leads to nonrejection of H0, then we would have to
investigate the values of β for some specific alternatives. Clearly, we want to avoid
this tedious and sometimes extremely difficult task, if possible.
Another issue that arises in a practical situation is whether to conduct a one- or a
two-tailed test. The decision depends on what you want to detect. For example,
suppose you operate a chemical plant that produces a variable amount Y of product per
day and that if E(Y) = μ is less than 100 tons per day, you will eventually be
bankrupt. If μ exceeds 100 tons per day, you are financially safe. To determine whether your
process is leading to financial disaster, you will want to detect whether μ < 100 tons,
and you will conduct a one-tailed test of H0: μ = 100 versus Ha: μ < 100. If you were
to conduct a two-tailed test for this situation, you would reduce your chance of
detecting values of μ less than 100 tons, i.e., you would increase the values of β for
alternative values of μ < 100 tons.
As a different example, suppose you have designed a new drug so that its mean
potency is some specific level, say, 10%. As the mean potency tends to exceed 10%,
you lose money. If it is less than 10% by some specified amount, the drug becomes
ineffective as a pharmaceutical (and you lose money). To conduct a test of the mean
potency μ for this situation, you would want to detect values of μ either larger than or
smaller than μ = 10. Consequently, you would select Ha: μ ≠ 10 and conduct a
two-tailed statistical test (or alternatively, construct a confidence interval).
These examples demonstrate that a statistical test is an attempt to detect departures
from H0; the key to the test is to define the specific alternatives that you want to detect.
We must stress, however, that H0 and Ha should be constructed prior to obtaining and
observing the sample data. If you use information in the sample data to aid in selecting H0
and Ha, the prior information gained from the sample biases the test results; specifically,
the true probability of a Type I error will be larger than the preselected value of α.
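
The warning about choosing hypotheses after seeing the data can be illustrated with a small simulation. The sketch below is a hypothetical scenario, not one from the text: the analyst looks at the sign of the observed z statistic and then runs a one-tailed test in that direction at nominal α = .10. The function name and the simulation settings are assumptions.

```python
import random
from statistics import NormalDist

def type1_rate_when_direction_chosen_from_data(alpha=0.10, n_sims=20000, seed=1):
    """Estimate the true Type I error rate when the tail of a one-tailed z test
    is picked after looking at the sign of the observed statistic."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(n_sims):
        z = rng.gauss(0.0, 1.0)   # test statistic generated with H0 true
        # Data-driven choice: test "greater" if z > 0 and "less" otherwise,
        # which amounts to rejecting whenever |z| > z_alpha.
        if abs(z) > z_alpha:
            rejections += 1
    return rejections / n_sims

rate = type1_rate_when_direction_chosen_from_data()
print(f"nominal alpha = 0.10, simulated Type I error rate = {rate:.3f}")
```

Under these assumptions the simulated rate comes out near .20, about twice the nominal α, since the data-driven rule rejects in both tails while using the one-tailed critical value.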

Example 8.7 Choosing H0 and Ha for Testing the Mean Diameter of Bearings
A metal lathe is checked periodically by quality control inspectors to determine whether it is
producing machine bearings with a mean diameter of .5 inch. If the mean diameter of the bearings is larger
or smaller than .5 inch, then the process is out of control and needs to be adjusted. Formulate the null
and alternative hypotheses that could be used to test whether the bearing production process is out
of control.

Solution The hypotheses must be stated in terms of a population parameter. Thus, we define
μ = true mean diameter (in inches) of all bearings produced by the lathe
If either μ > .5 or μ < .5, then the metal lathe’s production process is out of control.
Since we wish to be able to detect either possibility, the null and alternative
hypotheses would be
H0: μ = .5 (i.e., the process is in control)
Ha: μ ≠ .5 (i.e., the process is out of control)

In Sections 8.5–8.12, we will present applications of the hypothesis-testing logic
developed in this chapter. The cases to be considered are those for which we developed
estimation procedures in Chapter 7. Since the theory and reasoning involved are based on
the developments of Chapter 7 and Sections 8.1–8.4, we will present only a summary of
the hypothesis-testing procedure for one-tailed and two-tailed tests in each situation.

Applied Exercises
In Exercises 8.11–8.17, formulate the appropriate null and alternative hypotheses.
8.11 Strength of natural fiber composites. An article in ACS Sustainable Chemistry & Engineering
(Vol. 1, 2013) investigated the use of natural fiber composites produced from switchgrass.
Researchers want to know if the mean tensile strength of this fiber composite exceeds
20 megapascals.
8.12 Egg-hatching rate of frogs. A herpetologist wants to determine whether the egg-hatching rate
for a certain species of frog exceeds .5 when the eggs are exposed to ultraviolet radiation.
8.13 Testing fishing line. A manufacturer of fishing line wants to show that the mean breaking
strength of a competitor’s 22-pound line is really less than 22 pounds.
8.14 Loaded casino dice. A craps player who has experienced a long run of bad luck at the craps
table wants to test whether the casino dice are “loaded,” i.e., whether the proportion of “sevens”
occurring in many tosses of the two dice is different from 1/6 (if the dice are fair, the probability
of tossing a “seven” is 1/6).
8.15 Software vendor ratings. Each year, Computerworld magazine reports the Datapro ratings of
all computer software vendors. Vendors are rated on a scale from 1 to 4 (1 = poor, 4 = excellent)
in such areas as reliability, efficiency, ease of installation, and ease of use by a random sample of
software users. A software vendor wants to determine whether its product has a higher mean
Datapro rating than a rival vendor’s product.
8.16 Radium in soil. The Environmental Protection Agency wishes to test whether the mean amount
of radium-226 in soil in a Florida county exceeds the maximum allowable amount, 4 pCi/L.
8.17 Real-time scheduling. Industrial engineers want to compare two methods of real-time
scheduling in a manufacturing operation. Specifically, they want to determine whether the mean
number of items produced differs for the two methods.

8.5 The Observed Significance Level for a Test


According to the statistical test procedures described in the preceding sections, the
rejection region and the corresponding value of α are selected prior to conducting the
test and the conclusion is stated in terms of rejecting or not rejecting the null
hypothesis. A second method of presenting the result of a statistical test is one that reports the
extent to which the test statistic disagrees with the null hypothesis and leaves the reader
the task of deciding whether to reject the null hypothesis. This measure of
disagreement is called the observed significance level (or p-value) for the test.*

Definition 8.4
The observed significance level, or p-value, for a specific statistical test is the probability
(assuming H0 is true) of observing a value of the test statistic that is at least as contradictory to
the null hypothesis, and supportive of the alternative hypothesis, as the one computed from the
sample data.

When publishing the results of a statistical test of hypothesis in journals, case
studies, reports, etc., many researchers make use of p-values. Instead of selecting α a
priori and then conducting a test as outlined in this chapter, the researcher may
compute and report the value of the appropriate test statistic and its associated p-value. It is
left to the reader of the report to judge the significance of the result, i.e., the reader
must determine whether to reject the null hypothesis in favor of the alternative
hypothesis, based on the reported p-value. Usually, the null hypothesis will be rejected
only if the observed significance level is less than the fixed significance level α chosen
by the reader. There are two inherent advantages of reporting test results in this
manner: (1) Readers are permitted to select the maximum value of α that they would be
willing to tolerate if they actually carried out a standard test of hypothesis in the
manner outlined in this chapter, and (2) it is an easy way to present the results of test
calculations performed by a computer. Most statistical software packages perform the
calculations for a test, give the observed value of the test statistic, and leave it to
the reader to formulate a conclusion. Others give the observed significance level for
the test, a procedure that makes it easy for the user to decide whether to reject the null
hypothesis.

Interpreting p-Values
1. Choose the maximum value of α you are willing to tolerate.
2. Find the observed significance level (p-value) of the test.
3. Reject the null hypothesis if α > p-value.

Example 8.8 Finding a One-Tailed p-value
Find the observed significance level for the statistical test of Example 8.5 and interpret the result.

Solution In Example 8.5, we tested a hypothesis about the mean μ of the number of heavy
freight trucks per hour using a particular 25-mile stretch of interstate highway. Since
we wanted to detect values of μ larger than μ0 = 72, we conducted a one-tailed test,
rejecting H0 for large values of ȳ, or equivalently, large values of Z. The observed
value of Z, computed from the sample of n = 50 randomly selected 1-hour periods,
was Z = 1.12. Since any value of Z larger than Z = 1.12 would be even more
contradictory to H0, the observed significance level for the test is

p-value = P(Z ≥ 1.12)

This value corresponds to the shaded area in the upper tail of the z distribution
shown in Figure 8.6. The area A corresponding to z = 1.12, given in Table 5 of
Appendix B, is .3686. Therefore, the observed significance level is

p-value = P(Z ≥ 1.12) = .5 − A = .5 − .3686 = .1314

*The term p-value or probability value was coined by users of statistical methods. The p in the expression
p-value should not be confused with the binomial parameter p.

FIGURE 8.6 Finding the p-value for an upper-tailed test when z = 1.12

This result indicates that the probability of observing a z value at least as contradictory
to H0 as the one observed in this sample (if H0 is in fact true) is .1314. Therefore, we will
reject H0 only for preselected values of α greater than .1314. Recall that the Department
of Highway Improvements selected a Type I error probability of α = .10. Since
p-value = .1314 exceeds α = .10, the department has insufficient evidence to reject
H0. Note that this conclusion agrees with that of Example 8.5, as shown in
Figure 8.6.

Example 8.9 Finding a Two-Tailed p-value
Suppose that the test of Example 8.5 had been a two-tailed test, i.e., suppose that the alternative of
interest had been Ha: μ ≠ 72. Find the observed significance level for the test and interpret the result.
Assume that α = .10, as in Example 8.5.

Solution If the test were two-tailed, either very large or very small values of Z would be
contradictory to the null hypothesis H0: μ = 72. Consequently, values of Z ≥ 1.12 or
Z ≤ −1.12 would be at least as contradictory to H0 as the observed value of Z = 1.12.
Therefore, the observed significance level for the test (shaded in Figure 8.7) is
p-value = P(Z ≥ 1.12) + P(Z ≤ −1.12)
= 2(.1314) = .2628
Since we want to conduct the two-tailed test at α = .10, the rejection region is
|Z| > $z_{.05}$ = 1.645. Note that the p-value exceeds α; we again have
insufficient evidence to reject H0.
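
The p-values of Examples 8.8 and 8.9 can be reproduced without a normal table. This is a minimal sketch that assumes `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

z_obs = 1.12
upper_tail = 1 - NormalDist().cdf(z_obs)   # P(Z >= 1.12)

p_one_tailed = upper_tail                  # Ha: mu > 72 (Example 8.8)
p_two_tailed = 2 * upper_tail              # Ha: mu != 72 (Example 8.9)
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")

alpha = 0.10
print("reject H0 (one-tailed):", alpha > p_one_tailed)
print("reject H0 (two-tailed):", alpha > p_two_tailed)
```

The two-tailed value may differ from .2628 in the fourth decimal place, because the hand calculation doubles a tail area that has already been rounded to .1314.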

FIGURE 8.7 Finding the p-value for a two-tailed test when z = 1.12

Observed significance levels are more easily obtained using statistical software.
The exact p-value for the one-tailed test of Example 8.8 is shown (shaded) on the
MINITAB printout, Figure 8.8. Typically, a researcher will utilize statistical software,
rather than probability tables or formulas, to find p-values.

FIGURE 8.8
MINITAB Output for One-Tailed
Test of a Population Mean

Note: Some statistical software packages (e.g., SPSS) will conduct only two-tailed
tests of hypothesis. For these packages, you obtain the p-value for a one-tailed test as
shown in the next box.

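Converting a Two-Tailed p-Value from a Printout to a One-Tailed p-Value

$$p = \frac{\text{Reported } p\text{-value}}{2} \quad \text{if} \quad \begin{cases} H_a \text{ is of form } > \text{ and } z \text{ is positive} \\ H_a \text{ is of form } < \text{ and } z \text{ is negative} \end{cases}$$

$$p = 1 - \left(\frac{\text{Reported } p\text{-value}}{2}\right) \quad \text{if} \quad \begin{cases} H_a \text{ is of form } > \text{ and } z \text{ is negative} \\ H_a \text{ is of form } < \text{ and } z \text{ is positive} \end{cases}$$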
Applied Exercises
8.18 One-tailed p-value. For a large-sample test of H0: θ = θ0 versus Ha: θ > θ0, compute the
p-value associated with each of the following test statistic values:
a. z = 1.96
b. z = 1.645
c. z = 2.67
d. z = 1.25
8.19 Two-tailed p-value. For a large-sample test of H0: θ = θ0 versus Ha: θ ≠ θ0, compute the
p-value associated with each of the following test statistic values:
a. z = −1.01
b. z = −2.37
c. z = 4.66
d. z = 1.45
8.20 Comparing α to p-value. For each α and observed significance level (p-value) pair, indicate
whether the null hypothesis would be rejected.
a. α = .05, p-value = .10
b. α = .10, p-value = .05
c. α = .01, p-value = .001
d. α = .025, p-value = .05
e. α = .10, p-value = .45
8.21 Converting a two-tailed p-value. In a test of H0: μ = 75 performed using the computer, SPSS
reports a two-tailed p-value of .1032. Make the appropriate conclusion for each of the following
situations:
a. Ha: μ < 75, z = −1.63, α = .05
b. Ha: μ < 75, z = 1.63, α = .10
c. Ha: μ < 75, z = −1.63, α = .10
d. Ha: μ < 75, z = −1.63, α = .01
8.22 p-value interpretation. An analyst tested the null hypothesis μ ≥ 20 against the alternative
hypothesis that μ < 20. The analyst reported a p-value of .06. What is the smallest value of α for
which the null hypothesis would be rejected?

8.6 Testing a Population Mean


In Example 8.5, we developed a large-sample test for a population mean based on the
standard normal z statistic. The elements of this test are summarized in the box.

Large-Sample (n ≥ 30) Test of Hypothesis About a Population Mean μ

One-Tailed Test
H0: μ = μ0
Ha: μ > μ0 (or Ha: μ < μ0)
Test statistic: $Z = \dfrac{\bar{y} - \mu_0}{\sigma_{\bar{y}}} \approx \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$
Rejection region: $Z > z_{\alpha}$ (or $Z < -z_{\alpha}$)
p-value = $P(Z > z_c)$ [or $P(Z < z_c)$]

Two-Tailed Test
H0: μ = μ0
Ha: μ ≠ μ0
Test statistic: $Z = \dfrac{\bar{y} - \mu_0}{\sigma_{\bar{y}}} \approx \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$
Rejection region: $|Z| > z_{\alpha/2}$
p-value = $2P(Z > |z_c|)$

where $P(Z > z_{\alpha}) = \alpha$, $P(Z > z_{\alpha/2}) = \alpha/2$, μ0 is our symbol for the particular
numerical value specified for μ in the null hypothesis, and $z_c$ is the computed value
of the test statistic.

Assumptions: None (since the central limit theorem guarantees that ȳ is
approximately normal regardless of the distribution of the sampled population)

Example 8.10 Large-Sample Test of μ: Mean Length-to-Width Ratio of Bones
Humerus bones from the same species of animal tend to have approximately the same length-to-width
ratios. When fossils of humerus bones are discovered, archeologists can often determine the species of
animal by examining the length-to-width ratios of the bones. It is known that species A has a mean ratio
of 8.5. Suppose 41 fossils of humerus bones were unearthed at an archeological site in East Africa,
where species A is believed to have lived. (Assume that the unearthed bones are all from the same
unknown species.) The length-to-width ratios of the bones were measured and are listed in Table 8.2.
We wish to test the hypothesis that μ, the population mean ratio of all bones of this particular
species, is equal to 8.5 against the alternative that it is different from 8.5, i.e., we wish to test whether
the unearthed bones are from species A.
a. Suppose we want a very small chance of rejecting H0, if, in fact, μ is equal to 8.5. That is, it
is important that we avoid making a Type I error. Select an appropriate value of the
significance level, α.
b. Test whether μ, the population mean length-to-width ratio, is different from 8.5, using the
significance level selected in part a.

TABLE 8.2 Length-to-Width Ratios of a Sample of Humerus Bones (data file: BONES)

10.73  8.89  9.07  9.20 10.33  9.98  9.84  9.59
 8.48  8.71  9.57  9.29  9.94  8.07  8.37  6.85
 8.52  8.87  6.23  9.41  6.66  9.35  8.86  9.93
 8.91 11.77 10.48 10.39  9.39  9.17  9.89  8.17
 8.93  8.80 10.02  8.38 11.67  8.30  9.17 12.00
 9.38

Solution a. The hypothesis-testing procedure that we have developed gives us the advantage of
being able to choose any significance level that we desire. Since the significance level,
α, is also the probability of a Type I error, we will choose α to be very small. In general,
researchers who consider a Type I error to have very serious practical consequences
should perform the test at a very low α value, say, α = .01. Other researchers may
be willing to tolerate an α value as high as .10 if a Type I error is not deemed a serious
error to make in practice. For this example, we will test at α = .01.
b. We formulate the following hypotheses:
H0: μ = 8.5
Ha: μ ≠ 8.5
Note that this is a two-tailed test, since we want to detect departures from μ = 8.5 in
either direction. The sample size is large (n = 41); thus, we may proceed with the
large-sample test about μ.
At significance level α = .01, we will reject the null hypothesis for this
two-tailed test if

$|Z| > z_{\alpha/2} = z_{.005}$

i.e., if Z < −2.58 or if Z > 2.58. This rejection region is shown in Figure 8.9.

After entering the data of Table 8.2 into a computer, we obtained the summary
statistics shown in the SAS printout, Figure 8.10. The values ȳ = 9.257 and s = 1.203
(shaded in the printout) are used to compute the test statistic

$$Z \approx \frac{\bar{y} - \mu_0}{s/\sqrt{n}} = \frac{9.257 - 8.5}{1.203/\sqrt{41}} = 4.03$$

This test statistic value is also shaded on Figure 8.10, as well as the p-value of the test,
p-value = .002.

FIGURE 8.9 Rejection region for Example 8.10

Note that the test statistic lies within the rejection region (see Figure 8.9), and
α = .01 exceeds the p-value. Consequently, we reject H0 and conclude that the mean
length-to-width ratio of all humerus bones of this particular species is significantly
different from 8.5. If the null hypothesis is in fact true (i.e., if μ = 8.5), then the
probability that we have incorrectly rejected it is equal to α = .01.
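
The summary statistics and test statistic of Example 8.10 can be recomputed directly from the ratios in Table 8.2. The sketch below uses only the Python standard library and takes the critical value from the normal distribution; it is an illustration, not the SAS analysis referred to in the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Length-to-width ratios from Table 8.2
ratios = [
    10.73, 8.89, 9.07, 9.20, 10.33, 9.98, 9.84, 9.59,
    8.48, 8.71, 9.57, 9.29, 9.94, 8.07, 8.37, 6.85,
    8.52, 8.87, 6.23, 9.41, 6.66, 9.35, 8.86, 9.93,
    8.91, 11.77, 10.48, 10.39, 9.39, 9.17, 9.89, 8.17,
    8.93, 8.80, 10.02, 8.38, 11.67, 8.30, 9.17, 12.00,
    9.38,
]
mu0, alpha = 8.5, 0.01
n = len(ratios)
ybar, s = mean(ratios), stdev(ratios)          # stdev uses the n - 1 divisor
z = (ybar - mu0) / (s / sqrt(n))
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value

print(f"n = {n}, ybar = {ybar:.3f}, s = {s:.3f}, Z = {z:.2f}, z_crit = {z_crit:.2f}")
print("reject H0" if abs(z) > z_crit else "do not reject H0")
```

The computed Z exceeds the two-tailed critical value, which agrees with the decision to reject H0 at α = .01.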

FIGURE 8.10
SAS printout for Example 8.10

The practical implications of the result obtained in Example 8.10 remain to be stud-
ied further. Perhaps the animal discovered at the archeological site is of some species
other than A. Alternatively, the unearthed humerus bones may have larger than normal
length-to-width ratios because of unusual feeding habits of species A. It is not always
the case that a statistically significant result implies a practically significant result.
The researcher must retain objectivity and judge the practical significance using, among
other criteria, knowledge of the subject matter and the phenomenon under investigation.
A small-sample statistical test for making inferences about a population mean is
(like its associated confidence interval of Section 7.4) based on the assumption that
the sample data are independent observations on a normally distributed random vari-
able. The test statistic is based on the T distribution given in Section 7.4.
The elements of the statistical test are listed in the accompanying box. As we sug-
gested in Chapter 7, the small-sample test will possess the properties specified in the
box even if the sampled population is moderately nonnormal. However, for data that
depart greatly from normality (i.e., highly skewed data), we must resort to one of the
nonparametric techniques discussed in Chapter 15.

Small-Sample Test of Hypothesis About a Population Mean μ

One-Tailed Test
H0: μ = μ0
Ha: μ > μ0 (or Ha: μ < μ0)
Test statistic: $T = \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$
Rejection region: $T > t_{\alpha}$ (or $T < -t_{\alpha}$)
p-value = $P(T \ge t_c)$ [or $P(T \le t_c)$]

Two-Tailed Test
H0: μ = μ0
Ha: μ ≠ μ0
Test statistic: $T = \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$
Rejection region: $|T| > t_{\alpha/2}$
p-value = $2P(T \ge |t_c|)$

where the distribution of T is based on (n − 1) degrees of freedom; $P(T > t_{\alpha}) = \alpha$;
$P(T > t_{\alpha/2}) = \alpha/2$; and $t_c$ is the computed value of the test statistic.
Assumption: The relative frequency distribution of the population from which the
sample was selected is approximately normal.
Warning: If the data depart greatly from normality, this small-sample test may
lead to erroneous inferences. In this case, use the nonparametric sign test that is
discussed in Section 15.2.

Example 8.11 Small-Sample Test of μ: Mean Benzene Content
Scientists have labeled benzene, a chemical solvent commonly used to synthesize plastics, as a
possible cancer-causing agent. Studies have shown that people who work with benzene more than
5 years have 20 times the incidence of leukemia of the general population. As a result, the federal
government has lowered the maximum allowable level of benzene in the workplace from 10 parts per
million (ppm) to 1 ppm. Suppose a steel manufacturing plant, which exposes its workers to benzene
daily, is under investigation by the Occupational Safety and Health Administration (OSHA). Twenty air
samples, collected over a period of 1 month and examined for benzene content, yielded the data in
Table 8.3. Is the steel manufacturing plant in violation of the new government standards? Test the
hypothesis that the mean level of benzene at the steel manufacturing plant is greater than 1 ppm, using
α = .05.

BENZENE TABLE 8.3 Benzene Content for 20 Air Samples

0.21 1.44 2.54 2.97 0.00 3.91 2.24 2.41 4.50 0.15
0.30 0.36 4.50 5.03 0.00 2.89 4.71 0.85 2.60 1.26

Solution The OSHA wants to establish the research hypothesis that the mean level of benzene,
μ, at the steel manufacturing plant exceeds 1 ppm. The elements of this small-sample
one-tailed test are
H0: μ = 1
Ha: μ > 1
Test statistic: $T = \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$
Assumption: The relative frequency distribution of the population of benzene levels for
all air samples at the steel manufacturing plant is approximately normal.
Rejection region: For α = .05 and df = (n − 1) = 19, reject H0 if $T > t_{.05} = 1.729$
(see Figure 8.11)
Summary statistics for the sample data are shown on the MINITAB printout, Figure 8.12.
The values of ȳ and s (highlighted) are ȳ = 2.143 and s = 1.736.
We now calculate the test statistic:
$$T = \frac{\bar{y} - 1}{s/\sqrt{n}} = \frac{2.143 - 1}{1.736/\sqrt{20}} = 2.95$$
The value of T is also shown (highlighted) on Figure 8.12 as well as the p-value of the
test, .004.
Note that the test statistic value falls into the rejection region (see Figure 8.11),
and α = .05 exceeds the p-value of the test. Therefore the OSHA concludes that
μ > 1 part per million and the plant is in violation of the new government standards.
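
The test statistic of Example 8.11 can be recomputed from the 20 benzene readings in Table 8.3. The sketch below uses only the Python standard library, which has no t distribution, so the critical value 1.729 is simply the table value quoted in the example rather than something the code derives.

```python
from math import sqrt
from statistics import mean, stdev

# Benzene content (ppm) for the 20 air samples in Table 8.3
benzene = [0.21, 1.44, 2.54, 2.97, 0.00, 3.91, 2.24, 2.41, 4.50, 0.15,
           0.30, 0.36, 4.50, 5.03, 0.00, 2.89, 4.71, 0.85, 2.60, 1.26]
mu0 = 1.0
t_crit = 1.729   # t_.05 with 19 df, taken from the example (not computed here)

n = len(benzene)
ybar, s = mean(benzene), stdev(benzene)   # stdev uses the n - 1 divisor
t = (ybar - mu0) / (s / sqrt(n))

print(f"n = {n}, ybar = {ybar:.3f}, s = {s:.3f}, T = {t:.2f}")
print("reject H0" if t > t_crit else "do not reject H0")
```

A statistical package would also report the exact one-tailed p-value from the t distribution with 19 degrees of freedom, which this sketch deliberately leaves out.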

FIGURE 8.11 Rejection region for Example 8.11

FIGURE 8.12
MINITAB printout for
Example 8.11

The reliability associated with this inference is α = .05. This implies that if the
testing procedure were applied repeatedly to random samples of data collected at the plant,
the OSHA would falsely reject H0 for only 5% of the tests. Consequently, the OSHA
is highly confident (95% confident) that the plant is violating the new standards.

Applied Exercises
8.23 Stability of compounds in new drugs. Refer to the ACS Medicinal Chemistry Letters (Vol. 1,
2010) study of the metabolic stability of drugs, Exercise 2.16 (p. 36). Recall that two important
values computed from the testing phase are the fraction of compound unbound to plasma (fup) and
the fraction of compound unbound to microsomes (fumic). A key formula for assessing stability
assumes that the fup/fumic ratio is 1. Pharmacologists at Pfizer Global Research and Development
tested 416 drugs and reported the fup/fumic ratio for each. These data are saved in the FUP
file and summary statistics are provided in the MINITAB printout shown below. Suppose the
pharmacologists want to determine if the true mean ratio, μ, differs from 1.
a. Specify the null and alternative hypothesis for this test.
b. Descriptive statistics for the sample ratios are provided in the accompanying MINITAB
printout. Note that the sample mean, ȳ = .327, is less than 1. Consequently, a pharmacologist
wants to reject the null hypothesis. What are the problems with using such a decision rule?
c. Locate values of the test statistic and corresponding p-value on the printout.
d. Select a value of α, the probability of a Type I error. Interpret this value in the words of the
problem.
e. Give the appropriate conclusion, based on the results of parts c and d.
f. What conditions must be satisfied for the test results to be valid?

MINITAB Output for Exercise 8.23

8.24 Surface roughness of pipe. Refer to the Anti-corrosion Methods and Materials (Vol. 50, 2003)
study of the surface roughness of coated interior pipe used in oil fields, Exercise 7.26 (p. 311). The
data (in micrometers) for 20 sampled pipe sections are reproduced in the ROUGHPIPE table that
follows.
a. Give the null and alternative hypotheses for testing whether the mean surface roughness of
coated interior pipe, μ, differs from 2 micrometers.
b. The results of the test, part a, are shown in the MINITAB printout. Locate the test statistic and
p-value on the printout.

MINITAB Output for Exercise 8.24


8.6 Testing a Population Mean 391

ROUGHPIPE

1.72 2.50 2.16 2.13 1.06 2.24 2.31 2.03 1.09 1.40
2.57 2.64 1.26 2.05 1.19 2.13 1.27 1.51 2.41 1.95

Source: Farshad, F., and Pesacreta, T. “Coated pipe interior surface roughness as measured by three scanning probe instruments.” Anti-corrosion Methods and Materials, Vol. 50, No. 1, 2003 (Table III).

c. Give the rejection region for the hypothesis test, using $\alpha = .05$.
d. State the appropriate conclusion for the hypothesis test.
e. In Exercise 7.26 you found a 95% confidence interval for $\mu$. Explain why the confidence interval and test statistic lead to the same conclusion about $\mu$.

DISTILL
8.25 Water distillation with solar energy. In countries with a water shortage, converting salt water to potable water is a critical problem. The standard method of water distillation is with a single slope solar still. Several enhanced solar energy water distillation systems were investigated in Applied Solar Energy (Vol. 46, 2010). One new system employs a sun tracking meter and a step-wise basin. The new system was tested over three randomly selected days at a location in Amman, Jordan. The daily amounts of distilled water collected by the new system over the three days were 5.07, 5.45, and 5.21 liters per square meter (l/m$^2$). Suppose it is known that the mean daily amount of distilled water collected by the standard method at the same location in Jordan is $\mu = 1.4$ l/m$^2$.
a. Set up the null and alternative hypotheses for determining whether the mean daily amount of distilled water collected by the new system is greater than 1.4.
b. For this test, give a practical interpretation of the value $\alpha = .10$.
c. Find the mean and standard deviation of the distilled water amounts for the sample of three days. (The data are saved in the DISTILL file.)
d. Use the information from part c to calculate the test statistic.
e. Find the observed significance level (p-value) of the test.
f. State, practically, the appropriate conclusion.
g. Find the value of $\beta$ for $\mu_a = 5$ l/m$^2$. Interpret this value.
h. Find the power of the test for $\mu_a = 5$ l/m$^2$. Interpret this value.

YIELD
8.26 Yield strength of steel connecting bars. To protect against earthquake damage, steel beams are typically fitted and connected with plastic hinges. However, these plastic hinges are prone to deformations and are difficult to inspect and repair. An alternative method of connecting steel beams, one that uses high strength steel bars with clamps, was investigated in Engineering Structures (July 2013). Mathematical models for predicting the performance of these steel connecting bars assume the bars have a mean yield strength of 300 megapascals (MPa). To verify this assumption, the researchers conducted material property tests on the steel connecting bars. In a sample of three tests, the yield strengths were 354, 370, and 359 MPa. (These data are saved in the YIELD file.) Do the data indicate that the true mean yield strength of the steel bars exceeds 300 MPa? Test using $\alpha = .01$.

CHEEKTEETH
8.27 Cheek teeth of extinct primates. Refer to the American Journal of Physical Anthropology (Vol. 142, 2010) study of the characteristics of cheek teeth (e.g., molars) in an extinct primate species, Exercise 2.14 (p. 35). Recall that the researchers recorded the dentary depth of molars (in millimeters) for a sample of 18 cheek teeth extracted from skulls. These depth measurements are reproduced in the accompanying table. Anthropologists know that the mean dentary depth of molars in an extinct primate species, called Species “A”, is 15 millimeters. Is there evidence to indicate that the sample of 18 cheek teeth come from some other extinct primate species (i.e., some species other than Species “A”)? Use the accompanying SPSS printout to answer the question.

Data on Dentary Depth (mm) of Molars

18.12 16.55
19.48 15.70
19.36 17.83
15.94 13.25
15.83 16.12
19.70 18.13
15.76 14.02
17.00 14.04
13.96 16.20

Source: Boyer, D.M., Evans, A.R., and Jernvall, J. “Evidence of Dietary Differentiation Among Late Paleocene-Early Eocene Plesiadapids (Mammalia, Primates)”, American Journal of Physical Anthropology, Vol. 142, 2010. (Table A3.)

SPSS Output for Exercise 8.27
WISCLAKES
8.28 Dissolved organic compound in lakes. The level of dissolved oxygen in the surface water of a lake is vital to maintaining the lake’s ecosystem. Environmentalists from the University of Wisconsin monitored the dissolved oxygen levels over time for a sample of 25 lakes in the state (Aquatic Biology, May 2010). To ensure a representative sample, the environmentalists focused on several lake characteristics, including dissolved organic compound (DOC). The DOC data (measured in grams per cubic meter) for the 25 lakes are listed in the accompanying table. The population of Wisconsin lakes has a mean DOC value of 15 grams/m$^3$.
a. Use a hypothesis test (at $\alpha = .10$) to make an inference about whether the sample is representative of all Wisconsin lakes for the characteristic, dissolved organic compound.
b. What is the likelihood that the test, part a, will detect a mean that differs from 15 grams/m$^3$ if, in fact, $\mu_a = 14$ grams/m$^3$?

LAKE                 DOC     LAKE              DOC
Allequash            9.6     Muskellunge       18.4
Big Muskellunge      4.5     Northgate Bog     2.7
Brown                13.2    Paul              4.2
Crampton             4.1     Peter             30.2
Cranberry Bog        22.6    Plum              10.3
Crystal              2.7     Reddington Bog    17.6
EastLong             14.7    Sparkling         2.4
Helmet               3.5     Tenderfoot        17.3
Hiawatha             13.6    Trout Bog         38.8
Hummingbird          19.8    Trout Lake        3.0
Kickapoo             14.3    Ward              5.8
Little Arbor Vitae   56.9    West Long         7.6
Mary                 25.1

Source: Langman, O.C., et al. “Control of dissolved oxygen in northern temperate lakes over scales ranging from minutes to days”, Aquatic Biology, Vol. 9, May 2010 (Table 1).

GASTURBINE
8.29 Cooling method for gas turbines. During periods of high electricity demand, especially during the hot summer months, the power output from a gas turbine engine can drop dramatically. One way to counter this drop in power is by cooling the inlet air to the gas turbine. An increasingly popular cooling method uses high-pressure inlet fogging. The performance of a sample of 67 gas turbines augmented with high-pressure inlet fogging was investigated in the Journal of Engineering for Gas Turbines and Power (Jan. 2005). One measure of performance is heat rate (kilojoules per kilowatt per hour). Heat rates for the 67 gas turbines are listed in the accompanying table. Suppose that a standard gas turbine has, on average, a heat rate of 10,000 kJ/kWh. Conduct a test to determine if the mean heat rate of gas turbines augmented with high-pressure inlet fogging exceeds 10,000 kJ/kWh. Use $\alpha = .05$.

14622 13196 11948 11289 11964 10526 10387 10592 10460 10086
14628 13396 11726 11252 12449 11030 10787 10603 10144 11674
11510 10946 10508 10604 10270 10529 10360 14796 12913 12270
11842 10656 11360 11136 10814 13523 11289 11183 10951 9722
10481 9812 9669 9643 9115 9115 11588 10888 9738 9295
9421 9105 10233 10186 9918 9209 9532 9933 9152 9295
16243 14628 12766 8714 9469 11948 12414

8.30 Alkalinity of river water. In Exercise 5.36 (p. 205) you learned that the mean alkalinity level of water specimens collected from the Han River in Seoul, Korea, is 50 milligrams per liter. (Environmental Science & Engineering, September 1, 2000.) Consider a random sample of 100 water specimens collected from a tributary of the Han River. Suppose the mean and standard deviation of the alkalinity levels for the sample are $\bar{y} = 67.8$ mg/L and $s = 14.4$ mg/L. Is there sufficient evidence (at $\alpha = .01$) to indicate that the population mean alkalinity level of water in the tributary exceeds 50 mg/L?

CIRCLES
8.31 Walking straight into circles. When people get lost in unfamiliar terrain, do they really walk in circles, as is commonly believed? To answer this question, researchers conducted a field experiment and reported the results in Current Biology (September 29, 2009). Fifteen volunteers were blindfolded and asked to walk as straight as possible in a certain direction in a large field. Walking trajectories were monitored every second for 50 minutes using GPS and the average directional bias (degrees per second) recorded for each walker. The data are shown in the accompanying table. A strong tendency to veer consistently in the same direction will cause walking in circles. A mean directional bias of 0 indicates that walking trajectories were random. Consequently, the researchers tested whether the true mean bias differed significantly from 0. A SAS printout of the analysis is shown below.

-4.50 -1.00 -0.50 -0.15 0.00 0.01 0.02 0.05 0.15
0.20 0.50 0.50 1.00 2.00 3.00

Source: Souman, J.L., Frissen, I., Sreenivasa, M.N., & Ernst, M.O. “Walking straight into circles”, Current Biology, Vol. 19, No. 18, Sep. 29, 2009 (Figure 2).

a. Interpret the results of the hypothesis test for the researchers. Use $\alpha = .10$.
b. Although most volunteers showed little overall bias, the researchers produced maps of the walking paths showing that each occasionally made several small circles during the walk. Ultimately, the researchers supported the “walking into circles” theory. Explain why the data in the table are insufficient for testing whether an individual walks into circles.

SAS Output for Exercise 8.31

8.32 Deep hole drilling. “Deep hole” drilling is a family of drilling processes used when the ratio of hole depth to hole diameter exceeds 10. Successful deep hole drilling depends on the satisfactory discharge of the drill chip. An experiment was conducted to investigate the performance of deep hole drilling when chip congestion exists (Journal of Engineering for Industry, May 1993). The length (in millimeters) of 50 drill chips resulted in the following summary statistics: $\bar{y} = 81.2$ mm, $s = 50.2$ mm. Conduct a test to determine whether the true mean drill chip length, $\mu$, differs from 75 mm. Use a significance level of $\alpha = .01$.

Theoretical Exercise
8.33 Refer to Exercises 8.8–8.10 (p. 380, 381). Show that the rejection region for the likelihood ratio test is given by $Z > z_\alpha$, where $P(Z > z_\alpha) = \alpha$. (Hint: Under the assumption that $H_0$: $\mu = 0$ is true, show that $\sqrt{n}(\bar{y})$ is a standard normal random variable.)

8.7 Testing the Difference Between Two Population Means: Independent Samples
Consider independent random samples from two populations with means $\mu_1$ and $\mu_2$, respectively. When the sample sizes are large (i.e., $n_1 \ge 30$ and $n_2 \ge 30$), a test of hypothesis for the difference between the population means $(\mu_1 - \mu_2)$ is based on the pivotal z statistic given in Section 7.5. A summary of the large-sample test is provided in the box.

Large-Sample Test of Hypothesis About $(\mu_1 - \mu_2)$: Independent Samples

One-Tailed Test
$H_0$: $(\mu_1 - \mu_2) = D_0$
$H_a$: $(\mu_1 - \mu_2) > D_0$ [or $H_a$: $(\mu_1 - \mu_2) < D_0$]
Rejection region: $Z > z_\alpha$ [or $Z < -z_\alpha$]
p-value = $P(Z > z_c)$ [or $P(Z < z_c)$]

Two-Tailed Test
$H_0$: $(\mu_1 - \mu_2) = D_0$
$H_a$: $(\mu_1 - \mu_2) \ne D_0$
Rejection region: $|Z| > z_{\alpha/2}$
p-value = $2P(Z > |z_c|)$

Test statistic (both cases):
$$Z = \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sigma_{(\bar{y}_1 - \bar{y}_2)}} \approx \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $P(Z > z_\alpha) = \alpha$, $P(Z > z_{\alpha/2}) = \alpha/2$, and $z_c$ is the computed value of the test statistic.
(Note: $D_0$ is our symbol for the particular numerical value specified for $(\mu_1 - \mu_2)$ in the null hypothesis. In many practical applications, we wish to hypothesize that there is no difference between the population means; in such cases, $D_0 = 0$.)
Assumptions: 1. The sample sizes $n_1$ and $n_2$ are sufficiently large (say, $n_1 \ge 30$ and $n_2 \ge 30$).
2. The two samples are selected randomly and independently from the target populations.

Example 8.12 Testing $\mu_1 - \mu_2$: Comparing Two Leavening Processes
To reduce costs, a bakery has implemented a new leavening process for preparing commercial bread loaves. Loaves of bread were randomly sampled and analyzed for calorie content both before and after implementation of the new process. A summary of the results of the two samples is shown in Table 8.4. Do these samples provide sufficient evidence to conclude that the mean number of calories per loaf has decreased since the new leavening process was implemented? Test using $\alpha = .05$.

TABLE 8.4 Summary of Calories per Loaf of Bread, Example 8.12

New Process                     Old Process
$n_1 = 50$                      $n_2 = 30$
$\bar{y}_1 = 1,255$ calories    $\bar{y}_2 = 1,330$ calories
$s_1 = 215$ calories            $s_2 = 238$ calories

Solution We can best answer this question by performing a test of a hypothesis. Defining $\mu_1$ as the mean calorie content per loaf manufactured by the new process and $\mu_2$ as the mean calorie content per loaf manufactured by the old process, we will attempt to support the research (alternative) hypothesis that $\mu_2 > \mu_1$ [i.e., that $(\mu_1 - \mu_2) < 0$]. Thus, we will test the null hypothesis that $(\mu_1 - \mu_2) = 0$, rejecting this hypothesis if $(\bar{y}_1 - \bar{y}_2)$ equals a large negative value. The elements of the test are as follows:
$H_0$: $(\mu_1 - \mu_2) = 0$ (i.e., $D_0 = 0$)
$H_a$: $(\mu_1 - \mu_2) < 0$ (i.e., $\mu_1 < \mu_2$)
Test statistic:
$$Z = \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sigma_{(\bar{y}_1 - \bar{y}_2)}} = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{\sigma_{(\bar{y}_1 - \bar{y}_2)}}$$
(since both $n_1$ and $n_2$ are greater than or equal to 30)
Rejection region: $Z < -z_\alpha = -1.645$ (see Figure 8.13)
Assumptions: The two samples of bread loaves are independently selected.
We now calculate the test statistic:
$$Z = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{\sigma_{(\bar{y}_1 - \bar{y}_2)}} = \frac{(1,255 - 1,330)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \approx \frac{-75}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{-75}{\sqrt{\dfrac{(215)^2}{50} + \dfrac{(238)^2}{30}}} = \frac{-75}{53.03} = -1.41$$
FIGURE 8.13 Rejection region for Example 8.12 [standard normal curve f(z); the lower-tail rejection region ($\alpha = .05$) lies below z = -1.645, and the computed value z = -1.41 falls outside it]

FIGURE 8.14 MINITAB Test to Compare Means, Example 8.12

This value is shaded on the MINITAB printout of the analysis, Figure 8.14. Note that the p-value of the test (also shaded) is .081.
As you can see in Figure 8.13, the calculated z value does not fall in the rejection region. Also, $\alpha = .05$ is less than p-value = .081. Consequently, we fail to reject $H_0$; the samples do not provide sufficient evidence (at $\alpha = .05$) to conclude that the new process yields a loaf with fewer mean calories.
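
The arithmetic of Example 8.12 can be checked with a short script. The sketch below is illustrative only: the function name `large_sample_z` is a hypothetical helper, not something from the text, and it uses the sample standard deviations in place of the unknown population values, as the large-sample procedure does. It relies on `statistics.NormalDist` from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_z(ybar1, s1, n1, ybar2, s2, n2, d0=0.0):
    """Approximate z statistic for H0: mu1 - mu2 = d0 (hypothetical helper).

    The sample standard deviations s1 and s2 stand in for the unknown
    population standard deviations, which is reasonable for large samples.
    """
    standard_error = sqrt(s1**2 / n1 + s2**2 / n2)
    return ((ybar1 - ybar2) - d0) / standard_error

# Summary statistics from Table 8.4 (new process = sample 1, old = sample 2).
z = large_sample_z(1255, 215, 50, 1330, 238, 30)

alpha = 0.05
z_crit = NormalDist().inv_cdf(alpha)  # lower-tail critical value, about -1.645
p_lower = NormalDist().cdf(z)         # lower-tail p-value under the normal curve
reject = z < z_crit                   # one-tailed test with Ha: mu1 - mu2 < 0

print(f"z = {z:.2f}, critical value = {z_crit:.3f}, reject H0: {reject}")
print(f"normal-approximation p-value = {p_lower:.3f}")
```

The normal-approximation p-value comes out near .079, slightly below the .081 quoted from the printout. A plausible explanation (an assumption, since the printout is not reproduced here) is that the software refers the statistic to a t distribution rather than the standard normal. Either way the decision at $\alpha = .05$ is the same.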

When the sample sizes $n_1$ and $n_2$ are inadequate to permit use of the large-sample procedure of Example 8.12, modifications may be made to perform a small-sample test of hypothesis about the difference between two population means. The test procedure is based on assumptions that are, again, more restrictive than in the large-sample case. The elements of the hypothesis test and the assumptions required are listed in the box. Reminder: When the assumption of normal populations is grossly violated, the small-sample test outlined here will be invalid. In this case, we must resort to a nonparametric method (Chapter 15).

Small-Sample Test of Hypothesis About $(\mu_1 - \mu_2)$: Independent Samples

One-Tailed Test
$H_0$: $(\mu_1 - \mu_2) = D_0$
$H_a$: $(\mu_1 - \mu_2) > D_0$ [or $H_a$: $(\mu_1 - \mu_2) < D_0$]
Rejection region: $T > t_\alpha$ [or $T < -t_\alpha$]
p-value = $P(T \ge t_c)$ [or $P(T \le t_c)$]

Two-Tailed Test
$H_0$: $(\mu_1 - \mu_2) = D_0$
$H_a$: $(\mu_1 - \mu_2) \ne D_0$
Rejection region: $|T| > t_{\alpha/2}$
p-value = $2P(T > |t_c|)$

Test statistic (both cases):
$$T = \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},$$
the distribution of $T$ is based on $n_1 + n_2 - 2$ df, and $t_c$ is the computed value of the test statistic.
Assumptions: 1. The populations from which the samples are selected both have approximately normal relative frequency distributions.
2. The variances of the two populations are equal, i.e., $\sigma_1^2 = \sigma_2^2$.
3. The random samples are selected in an independent manner from the two populations.
Warning: When the assumption of normal populations is violated, the test may lead to erroneous inferences. In this case, use the nonparametric Wilcoxon test described in Section 15.3.

Example 8.13 Testing $\mu_1 - \mu_2$: Comparing Gas and Electric Energy
An industrial plant wants to determine which of two types of fuel, gas or electric, will produce more useful energy at the lower cost. One measure of economical energy production, called the plant investment per delivered quad, is calculated by taking the amount of money (in dollars) invested in the particular utility by the plant and dividing by the delivered amount of energy (in quadrillion British thermal units). The smaller this ratio, the less an industrial plant pays for its delivered energy. Independent random samples of 11 plants using electrical utilities and 16 plants using gas utilities were taken, and the plant investment/quad was calculated for each. The data are listed in Table 8.5. Do these data provide sufficient evidence at $\alpha = .05$ to indicate a difference in the average investment/quad between all plants using gas and all those using electric utilities?

INVQUAD
TABLE 8.5 Data on Plant Investment/Quad, Example 8.13

Electric: 204.15  0.57  62.76  89.72  0.35  85.46
          0.78    0.65  44.38  9.28   78.60
Gas:      0.78   16.66  74.94  0.01   0.54  23.59  88.79   0.64
          0.82   91.84  7.20   66.64  0.74  64.67  165.60  0.36

Solution Let $\mu_1$ represent the mean investment/quad for all plants with electric utilities and let $\mu_2$ represent the mean investment/quad for all plants with gas utilities. Then, we want to conduct the test:
$H_0$: $(\mu_1 - \mu_2) = 0$ (i.e., $\mu_1 = \mu_2$)
$H_a$: $(\mu_1 - \mu_2) \ne 0$ (i.e., $\mu_1 > \mu_2$ or $\mu_1 < \mu_2$)
Summary statistics for the two samples were produced using SPSS. The resulting SPSS printout is shown in Figure 8.15. Note that $\bar{y}_1 = 52.43$, $\bar{y}_2 = 37.74$, $s_1 = 62.43$, and $s_2 = 49.05$.
To obtain the test statistic, we first calculate
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(11 - 1)(62.43)^2 + (16 - 1)(49.05)^2}{11 + 16 - 2} = \frac{75,051.31}{25} = 3002.05$$
Then, if we can assume that the distributions of the investment/quad data for the two plant types are both approximately normal with equal variances, the test statistic is
$$T = \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{52.43 - 37.74}{\sqrt{3002.05\left(\dfrac{1}{11} + \dfrac{1}{16}\right)}} = \frac{14.69}{21.46} = .68$$
Note that this test statistic and the corresponding p-value for the test are both shaded on the SPSS printout in Figure 8.15. Since the two-tailed p-value (for the equal variances case), .500, exceeds $\alpha = .05$, there is insufficient evidence to reject $H_0$. That is, we cannot conclude (at $\alpha = .05$) that the mean investment/quad levels for those plants with electric and gas utilities are different.
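
The pooled-variance calculation in Example 8.13 can be reproduced directly from the Table 8.5 data. The following is a minimal sketch using only the Python standard library; the helper name `pooled_t` is hypothetical, and the critical value 2.060 is the usual tabled value of $t_{.025}$ with 25 df, hard-coded here because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

# Table 8.5 data (plant investment per delivered quad).
electric = [204.15, 0.57, 62.76, 89.72, 0.35, 85.46,
            0.78, 0.65, 44.38, 9.28, 78.60]
gas = [0.78, 16.66, 74.94, 0.01, 0.54, 23.59, 88.79, 0.64,
       0.82, 91.84, 7.20, 66.64, 0.74, 64.67, 165.60, 0.36]

def pooled_t(sample1, sample2, d0=0.0):
    """Pooled-variance t statistic, its df, and s_p^2 (hypothetical helper)."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # statistics.variance is the sample variance (divisor n - 1).
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    standard_error = sqrt(sp2 * (1 / n1 + 1 / n2))
    t = ((mean(sample1) - mean(sample2)) - d0) / standard_error
    return t, df, sp2

t_stat, df, sp2 = pooled_t(electric, gas)

T_CRIT_025_25DF = 2.060  # tabled t_{.025} for 25 df (two-tailed alpha = .05)
reject = abs(t_stat) > T_CRIT_025_25DF

print(f"s_p^2 = {sp2:.2f}, T = {t_stat:.2f}, df = {df}, reject H0: {reject}")
```

Because $|T|$ is far below the critical value, the script reaches the same "fail to reject" decision as the p-value comparison in the example.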

FIGURE 8.15
SPSS printout for Example 8.13

Recall from Section 7.5 that valid small-sample inferences about $(\mu_1 - \mu_2)$ can still be made when the assumption of equal variances is violated. We conclude this section by giving the modifications required to obtain approximate small-sample tests about $(\mu_1 - \mu_2)$ when $\sigma_1^2 \ne \sigma_2^2$ for the two cases described in Section 7.5: $n_1 = n_2$ and $n_1 \ne n_2$.

Modifications to Small-Sample Tests About $(\mu_1 - \mu_2)$ When $\sigma_1^2 \ne \sigma_2^2$: Independent Samples

$n_1 = n_2 = n$

Test statistic:
$$T = \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{\dfrac{1}{n}\left(s_1^2 + s_2^2\right)}}$$
Degrees of freedom: $\nu = n_1 + n_2 - 2 = 2(n - 1)$

$n_1 \ne n_2$

Test statistic:
$$T = \frac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
Degrees of freedom:
$$\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\left[\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}\right]}$$

[Note: The value of $\nu$ will generally not be an integer. Round down to the nearest integer to use the T table (Table 7 of Appendix B).]
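
As a rough illustration of the $n_1 \ne n_2$ case, the sketch below evaluates the test statistic and the approximate degrees of freedom from summary statistics. The function name `unequal_variance_t` is hypothetical. Applying it to the Example 8.13 summary statistics is purely for illustration; the text itself analyzes that example under the equal-variance assumption.

```python
from math import floor, sqrt

def unequal_variance_t(ybar1, s1, n1, ybar2, s2, n2, d0=0.0):
    """T statistic and approximate df for the n1 != n2, unequal-variance case.

    Hypothetical helper: implements the two formulas in the box above
    and returns the unrounded degrees of freedom.
    """
    v1 = s1**2 / n1  # estimated variance of ybar1
    v2 = s2**2 / n2  # estimated variance of ybar2
    t = ((ybar1 - ybar2) - d0) / sqrt(v1 + v2)
    nu = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, nu

# Illustration only: summary statistics quoted in Example 8.13.
t_stat, nu = unequal_variance_t(52.43, 62.43, 11, 37.74, 49.05, 16)
df = floor(nu)  # round down before consulting the T table

print(f"T = {t_stat:.3f}, nu = {nu:.2f}, table df = {df}")
```

Rounding the degrees of freedom down, as the note in the box directs, gives a slightly larger critical value and therefore a somewhat more conservative test.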

Applied Exercises
DRUGCON
8.34 Drug content assessment. Refer to Exercise 7.39 (p. 319) and the Analytical Chemistry (Dec. 15, 2009) study in which scientists used high-performance liquid chromatography to determine the amount of drug in a tablet. Recall that 25 tablets were produced at each of two different, independent sites, and drug concentration (measured as a percentage) was determined for each tablet. These data are reproduced in the accompanying table. In Exercise 7.39 you used a 95% confidence interval to determine whether there is any difference between the mean drug concentration in tablets produced at the two sites. Now analyze the data using a statistical test of hypothesis at $\alpha = .05$. (See the accompanying MINITAB printout.) Do the inferences drawn from the test of hypothesis and confidence interval agree?

Site 1
91.28 92.83 89.35 91.90 82.85 94.83 89.83 89.00 84.62
86.96 88.32 91.17 83.86 89.74 92.24 92.59 84.21 89.36
90.96 92.85 89.39 89.82 89.91 92.16 88.67

Site 2
89.35 86.51 89.04 91.82 93.02 88.32 88.76 89.26 90.36
87.16 91.74 86.12 92.10 83.33 87.61 88.20 92.78 86.35
93.84 91.20 93.44 86.77 83.77 93.19 81.79

Source: Borman, P.J., Marion, J.C., Damjanov, I., & Jackson, P. “Design and analysis of method equivalence studies”, Analytical Chemistry, Vol. 81, No. 24, December 15, 2009 (Table 3).

MINITAB Output for Exercise 8.34


8.35 Time required to complete a task. When asked, “How much time will you require to complete this task”, cognitive theory posits that people (e.g., an electrical engineer) will typically underestimate the time required. Would the opposite theory hold if the question was phrased in terms of how much work could be completed in a given amount of time? This was the question of interest to researchers writing in Applied Cognitive Psychology (Vol. 25, 2011). For one study conducted by the researchers, each in a sample of forty University of Oslo students was asked how many minutes it would take to read a 32-page technical report. In a second study, forty-two students were asked how many pages of a lengthy technical report they could read in 48 minutes. (The students in either study did not actually read the report.) Numerical descriptive statistics (based on summary information published in the article) for both studies are provided in the accompanying table.

                                 Estimated Time (minutes)   Estimated Number of Pages
Sample size, $n$                 40                         42
Sample mean, $\bar{x}$           60                         28
Sample standard deviation, $s$   41                         14

a. The researchers determined that the actual mean time it takes to read the report is $\mu = 48$ minutes. Is there evidence to support the theory that the students, on average, will overestimate the time it takes to read the report? Test using $\alpha = .10$.
b. The researchers also determined that the actual mean number of pages of the report that are read within the allotted time is $\mu = 32$ pages. Is there evidence to support the theory that the students, on average, will underestimate the number of report pages that can be read? Test using $\alpha = .10$.
c. The researchers noted that the distribution of both estimated time and estimated number of pages is highly skewed (i.e., not normally distributed). Does this fact impact the inferences derived in parts a and b? Explain.

8.36 Do video game players have superior visual attention skills? Researchers at Griffin University (Australia) conducted a study to determine whether video game players have superior visual attention skills than non-video game players. (Journal of Articles in Support of the Null Hypothesis, Vol. 6, 2009.) Two groups of male psychology students, 32 video game players (VGP group) and 28 non-players (NVGP group), were subjected to a series of visual attention tasks that included the attentional blink test. A test for the difference between two means yielded $t = -.93$ and p-value = .358. Consequently, the researchers reported that “no statistically significant differences in the mean test performances of the two groups were found”. Summary statistics for the comparison are provided in the next table. Do you agree with the researchers’ conclusion?

                      VGP     NVGP
Sample size:          32      28
Mean score:           84.81   82.64
Standard deviation:   9.56    8.43

Source: Murphy, K. and Spencer, A. “Playing video games does not make for better visual attention skills”, Journal of Articles in Support of the Null Hypothesis, Vol. 6, No. 1, 2009.

8.37 Index of Biotic Integrity. Refer to the Journal of Agricultural, Biological, and Environmental Sciences (June 2005) analysis of the Index of Biotic Integrity (IBI), Exercise 7.42 (p. 320). Recall that the IBI measures the biological condition or health of an aquatic region. Summary data on the IBI for sites located in two Ohio river basins, Muskingum and Hocking, are reproduced in the next table. Conduct a test of hypothesis (at $\alpha = .10$) to compare the mean IBI values of the two river basins. Explain why the result will agree with the inference derived from the 90% confidence interval, Exercise 7.42.

River Basin   Sample Size   Mean   Standard Deviation
Muskingum     53            .035   1.046
Hocking       51            .340   .960

Source: Boone, E. L., Keying, Y., and Smith, E. P. “Evaluating the relationship between ecological and habitat conditions using hierarchical models.” Journal of Agricultural, Biological, and Environmental Sciences, Vol. 10, No. 2, June 2005 (Table 1).

8.38 Mineral flotation in water study. Refer to the Minerals Engineering (Vol. 46-47, 2013) study of the impact of calcium and gypsum on the flotation properties of silica in water, Exercise 2.23 (p. 38). Fifty solutions of deionized water were prepared both with and without calcium/gypsum, and the level of flotation of silica in the solution was measured using a variable called zeta potential (measured in millivolts, mV). The data (simulated, based on information provided in the journal article) are reproduced in the accompanying table. Conduct a test of hypothesis to compare the mean zeta potential values of the two types of solutions. Can you conclude that the addition of calcium/gypsum to the solution impacts silica flotation level?

GASTURBINE
8.39 Cooling method for gas turbines. Refer to the Journal of Engineering for Gas Turbines and Power (Jan. 2005) study of gas turbines augmented with high-pressure inlet fogging, Exercise 8.29 (p. 392). The researchers classified gas turbines into three categories: traditional, advanced, and aeroderivative. Summary statistics on heat rate (kilojoules per kilowatt per hour) for each of the three types of gas turbines in the sample are shown in the accompanying MINITAB printout.
Data for Exercise 8.38

SILICA
Without calcium/gypsum
-47.1 -53.0 -50.8 -54.4 -57.4 -49.2 -51.5 -50.2 -46.4 -49.7
-53.8 -53.8 -53.5 -52.2 -49.9 -51.8 -53.7 -54.8 -54.5 -53.3
-50.6 -52.9 -51.2 -54.5 -49.7 -50.2 -53.2 -52.9 -52.8 -52.1
-50.2 -50.8 -56.1 -51.0 -55.6 -50.3 -57.6 -50.1 -54.2 -50.7
-55.7 -55.0 -47.4 -47.5 -52.8 -50.6 -55.6 -53.2 -52.3 -45.7

With calcium/gypsum
-9.2 -11.6 -10.6 -8.0 -10.9 -10.0 -11.0 -10.7 -13.1 -11.5
-11.3 -9.9 -11.8 -12.6 -8.9 -13.1 -10.7 -12.1 -11.2 -10.9
-9.1 -12.1 -6.8 -11.5 -10.4 -11.5 -12.1 -11.3 -10.7 -12.4
-11.5 -11.0 -7.1 -12.4 -11.4 -9.9 -8.6 -13.6 -10.1 -11.3
-13.0 -11.9 -8.6 -11.3 -13.0 -12.2 -11.3 -10.5 -8.8 -13.4

MINITAB Output for Exercise 8.39

a. Is there sufficient evidence of a difference between the mean heat rates of traditional augmented gas turbines and aeroderivative augmented gas turbines? Test using $\alpha = .05$.
b. Is there sufficient evidence of a difference between the mean heat rates of advanced augmented gas turbines and aeroderivative augmented gas turbines? Test using $\alpha = .05$.

VOLTAGE
8.40 Process voltage readings. Refer to the Harris Corporation/University of Florida comparison of the mean process voltage readings at two locations, Exercise 7.46 (p. 321). The data for 30 production runs at both the old and new locations are saved in the VOLTAGE file. The SAS printout of the analysis is reproduced below. Find and interpret the p-value for the test to compare the mean process voltage readings. What do you conclude? Does your answer agree with Exercise 7.46?

SAS Output for Exercise 8.40

8.41 Shopping vehicle and judgment. Refer to the Journal of Marketing Research (Dec., 2011) study of shopping cart design, Exercise 2.43 (p. 50). Design engineers want to know whether you may be more likely to purchase a vice product (e.g., a candy bar) when your arm is flexed (as
8.7 Testing the Difference Between Two Population Means: Independent Samples 401

when carrying a shopping basket) than when your arm is 8.43 Wastewater treatment study. In Ecological Engineering
extended (as when pushing a shopping cart). To test this (Feb. 2004), the potential of floating aquatic plants to treat
theory, the researchers recruited 22 consumers and had dairy manure wastewater was investigated. For one part of
each push their hand against a table while they were asked the study, 16 treated wastewater samples were randomly
a series of shopping questions. Half of the consumers were divided into two groups—a control algal was cultured in
told to put their arm in a flex position (similar to a shop- half the samples and water hyacinth was cultured in the
ping basket) and the other half were told to put their arm in other half. The rate of increase in the amount of total phos-
an extended position (similar to a shopping cart). Partici- phorus was measured in each water sample; a summary of
pants were offered several choices between a vice and a the results is given in the accompanying table. Conduct a
virtue (e.g., a movie ticket vs. a shopping coupon, pay test to determine if there is a difference in mean rates of
later with a larger amount vs. pay now) and a choice score increase of total phosphorus for the two aquatic plants.
(on a scale of 0 to 100) was determined for each. (Higher Use a = .05.
scores indicate a greater preference for vice options.) The
average choice score for consumers with a flexed arm was Control Algal Water Hyacinth
59, while the average for consumers with an extended arm Number of Water Samples 8 8
was 43. Sample Mean .036 .026
a. Suppose the standard deviations of the choice scores
Standard Deviation .008 .006
for the flexed arm and extended arm conditions are 4
and 2, respectively. In Exercise 2.43a you were asked Source: Sooknah, R., and Wilkie, A. “Nutrient removal by floating
whether this information supports the researchers’ the- aquatic macrophytes cultured in anaerobically digested flushed dairy
ory. Now answer the question by conducting a hypoth- manure wastewater.” Ecological Engineering, Vol. 22, No. 1, Feb. 2004
esis test. Use a = .05. (Table 5).
b. Suppose the standard deviations of the choice scores 8.44 Insecticides used in orchards. Environmental Science &
for the flexed arm and extended arm conditions are 10 Technology (Oct. 1993) reported on a study of insecticides
and 15, respectively. In Exercise 2.43b you were asked used on dormant orchards in the San Joaquin Valley,
whether this information supports the researchers’ the- California. Ambient air samples were collected and analyzed
ory. Now answer the question by conducting a hypothesis test. Use $\alpha = .05$.

8.42 Computer-mediated communication study. Computer-mediated communication (CMC) is a form of interaction that heavily involves technology (e.g., instant messaging, email). A study was conducted to compare relational intimacy in people interacting via CMC to people meeting face-to-face (FTF). (Journal of Computer-Mediated Communication, Apr. 2004.) Participants were 48 undergraduate students, of which half were randomly assigned to the CMC group and half assigned to the FTF group. Each group was given a task that required communication with their group members. Those in the CMC group communicated using the “chat” mode of instant-messaging software; those in the FTF group met in a conference room. The variable of interest, relational intimacy score, was measured (on a 7-point scale) for each participant after each of three different meeting sessions. Summary statistics for the first meeting session are given here. The researchers hypothesized that, after the first meeting, the mean relational intimacy score for participants in the CMC group would be lower than the mean relational intimacy score for participants in the FTF group. Test the researchers’ hypothesis using $\alpha = .10$.

                          CMC    FTF
Number of Participants    24     24
Sample Mean               3.54   3.53
Standard Deviation        .49    .38

daily at an orchard site during the most intensive period of spraying. The thion and oxon levels (in ng/m³) in the air samples are recorded in the table, as well as the oxon/thion ratios. Compare the mean oxon/thion ratios of foggy and clear/cloudy conditions at the orchard using a test of hypothesis. Use $\alpha = .05$.

ORCHARD
Date      Condition   Thion   Oxon                Oxon/Thion Ratio
Jan. 15   Fog         38.2    10.3                .270
17        Fog         28.6    6.9                 .241
18        Fog         30.2    6.2                 .205
19        Fog         23.7    12.4                .523
20        Fog         62.3    (Air sample lost)   -
20        Clear       74.1    45.8                .618
21        Fog         88.2    9.9                 .112
21        Clear       46.4    27.4                .591
22        Fog         135.9   44.8                .330
23        Fog         102.9   27.8                .270
23        Cloudy      28.9    6.5                 .225
25        Fog         46.9    11.2                .239
25        Clear       44.3    16.6                .375
Source: Selber, J. N., et al. “Air and fog deposition residues of four organophosphate insecticides used on dormant orchards in the San Joaquin Valley, California.” Environmental Science & Technology, Vol. 27, No. 10, Oct. 1993, p. 2240 (Table V).

8.8 Testing the Difference Between Two Population Means: Matched Pairs
It may be possible to acquire more information on the difference between two population means by using data collected in matched pairs instead of independent samples. Consider, for example, an experiment to investigate the effectiveness of cloud seeding in the artificial production of rainfall. Two farming areas with similar past meteorological records were selected for the experiment. One is seeded regularly; the other is left unseeded. The monthly precipitation at the farms will be recorded for 6 randomly selected months. The resulting data, matched on months, can be used to test a hypothesis about the difference between the mean monthly precipitation in the seeded and unseeded areas. The appropriate procedures are summarized in the boxes.

Large-Sample Test of Hypothesis About $(\mu_1 - \mu_2)$: Matched Pairs

One-Tailed Test
$H_0$: $(\mu_1 - \mu_2) = D_0$
$H_a$: $(\mu_1 - \mu_2) > D_0$ [or $H_a$: $(\mu_1 - \mu_2) < D_0$]
Rejection region: $Z > z_\alpha$ [or $Z < -z_\alpha$]
p-value = $P(Z > z_c)$ [or $P(Z < z_c)$]

Two-Tailed Test
$H_0$: $(\mu_1 - \mu_2) = D_0$
$H_a$: $(\mu_1 - \mu_2) \neq D_0$
Rejection region: $|Z| > z_{\alpha/2}$
p-value = $2P(Z > |z_c|)$

Test statistic (both tests):
$$Z = \frac{\bar{d} - D_0}{\sigma_d/\sqrt{n}} \approx \frac{\bar{d} - D_0}{s_d/\sqrt{n}}$$
where $\bar{d}$ and $s_d$ represent the mean and standard deviation of the sample of differences, $P(Z > z_\alpha) = \alpha$, $P(Z > z_{\alpha/2}) = \alpha/2$, and $z_c$ is the computed value of the test statistic.
[Note: $D_0$ is our symbol for the particular numerical value specified for $(\mu_1 - \mu_2)$ in $H_0$. In many applications, we want to hypothesize that there is no difference between the population means; in such cases, $D_0 = 0$.]

Small-Sample Test of Hypothesis About $(\mu_1 - \mu_2)$: Matched Pairs

One-Tailed Test
$H_0$: $(\mu_1 - \mu_2) = D_0$
$H_a$: $(\mu_1 - \mu_2) > D_0$ [or $H_a$: $(\mu_1 - \mu_2) < D_0$]
Rejection region: $T > t_\alpha$ [or $T < -t_\alpha$]
p-value = $P(T \geq t_c)$ [or $P(T \leq t_c)$]

Two-Tailed Test
$H_0$: $(\mu_1 - \mu_2) = D_0$
$H_a$: $(\mu_1 - \mu_2) \neq D_0$
Rejection region: $|T| > t_{\alpha/2}$
p-value = $2P(T \geq |t_c|)$

Test statistic (both tests):
$$T = \frac{\bar{d} - D_0}{s_d/\sqrt{n}}$$
where $\bar{d}$ and $s_d$ represent the mean and standard deviation of the sample of differences, the T-distribution is based on $(n - 1)$ degrees of freedom, $P(T > t_\alpha) = \alpha$, $P(T > t_{\alpha/2}) = \alpha/2$, and $t_c$ is the computed value of the test statistic.
[Note: $D_0$ is our symbol for the particular numerical value specified for $(\mu_1 - \mu_2)$ in the null hypothesis. In many practical applications, we want to hypothesize that there is no difference between the population means; in such cases, $D_0 = 0$.]

Assumptions:
1. The relative frequency distribution of the population of differences is approximately normal.
2. The paired differences are randomly selected from the population of differences.
Warning: When the assumption of normality is grossly violated, the t test may lead to erroneous inferences. In this case, use the nonparametric Wilcoxon test described in Section 15.4.

Example 8.14 Testing $\mu_d$: Cloud Seeding
Consider the cloud seeding experiment to compare monthly precipitation at the two farm areas. Do the data given in Table 8.6 provide sufficient evidence to indicate that the mean monthly precipitation at the seeded farm area exceeds the corresponding mean for the unseeded farm area? Test using $\alpha = .05$.

TABLE 8.6 Monthly Precipitation Data (in Inches) for Example 8.14 (data file: CLOUDSEED)
Month       1      2      3      4      5      6
Seeded      1.75   2.12   1.53   1.10   1.70   2.42
Unseeded    1.62   1.83   1.40   .75    1.71   2.33
d           .13    .29    .13    .35    -.01   .09

Solution Let $\mu_1$ and $\mu_2$ represent the mean monthly precipitation values for the seeded and unseeded farm areas, respectively. Since we want to be able to detect $\mu_1 > \mu_2$, we will conduct the one-tailed test:
$H_0$: $(\mu_1 - \mu_2) = 0$
$H_a$: $(\mu_1 - \mu_2) > 0$
Assuming the differences in monthly precipitation values for the two areas are from an approximately normal distribution, the test statistic will have a t distribution based on $(n - 1) = (6 - 1) = 5$ degrees of freedom. We will reject the null hypothesis if
$T > t_{.05} = 2.015$ (see Figure 8.16)
To conduct the test by hand, we must first calculate the difference $d$ in monthly precipitation at the two farm areas for each month. These differences (where the observation for the unseeded farm area is subtracted from the observation for the seeded area within each pair) are shown in the last row of Table 8.6. Next, we would calculate the mean $\bar{d}$ and standard deviation $s_d$ for this sample of $n = 6$ differences to obtain the test statistic.

FIGURE 8.16 Rejection region for Example 8.14: t distribution with 5 degrees of freedom, $\alpha = .05$. $H_0$ is rejected for $T > t_{.05} = 2.015$; the observed value of the test statistic, $T = 3.01$, falls in the rejection region.

FIGURE 8.17
MINITAB printout for Example 8.14

Rather than perform these calculations, we will rely on the output from a computer.
The MINITAB printout for the analysis is shown in Figure 8.17. The test statistic,
shaded in Figure 8.17, is T = 3.01.
Since this value of the test statistic exceeds the critical value $t_{.05} = 2.015$, there is sufficient evidence (at $\alpha = .05$) to indicate that the mean monthly precipitation at the seeded farm area exceeds the mean for the unseeded farm area.
The same conclusion can be reached by examining the p-value of the test. The one-tailed p-value, shaded on the MINITAB printout, is .015. Since this value is less than the chosen $\alpha$ level (.05), we reject $H_0$. In fact, we will reject $H_0$ for any $\alpha$ larger than p-value = .015.
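
The hand calculation that the example delegates to MINITAB can be sketched in a few lines of Python. This is an illustrative sketch only, not the MINITAB procedure: the function name `paired_t_statistic` is hypothetical, the data are the Table 8.6 values, and the critical value 2.015 is taken from the text rather than computed.

```python
import math
import statistics

# Monthly precipitation (inches) from Table 8.6
seeded = [1.75, 2.12, 1.53, 1.10, 1.70, 2.42]
unseeded = [1.62, 1.83, 1.40, 0.75, 1.71, 2.33]


def paired_t_statistic(x, y, d0=0.0):
    """Return (d_bar, s_d, t) for the paired differences x - y.

    Hypothetical helper: t = (d_bar - d0) / (s_d / sqrt(n)).
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 divisor)
    t = (d_bar - d0) / (s_d / math.sqrt(n))
    return d_bar, s_d, t


d_bar, s_d, t = paired_t_statistic(seeded, unseeded)
T_CRITICAL = 2.015  # t.05 with 5 degrees of freedom, as quoted in the text

print(f"d_bar = {d_bar:.3f}, s_d = {s_d:.3f}, T = {t:.2f}")
print("Reject H0" if t > T_CRITICAL else "Do not reject H0")
```

The sketch compares the statistic with the tabled critical value instead of computing a p-value, since the standard library has no t-distribution function.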

In the experiment of Example 8.14, why did we collect the data in matched pairs
rather than use independent random samples of months, with some assigned to only
the seeded area and others to only the unseeded area? The answer is that we expected
some months to have more rain than others. To cancel out this variation from month to
month, the experiment was designed so that precipitation at both farm areas would be
recorded during the same months. Then both farm areas would be subjected to the
same weather pattern in a given month. By comparing precipitation within each
month, we were able to obtain more information on the difference in mean monthly
precipitation than we could have obtained by independent random sampling.
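
The gain from pairing can be illustrated numerically. The sketch below, under stated assumptions, computes the paired statistic for the Table 8.6 data and, purely for comparison, the usual pooled-variance two-sample statistic that would result if the same twelve values were (incorrectly) treated as independent samples. The function names are hypothetical, and the pooled formula is assumed to match the small-sample independent-samples procedure of Section 8.7.

```python
import math
import statistics

# Monthly precipitation (inches) from Table 8.6
seeded = [1.75, 2.12, 1.53, 1.10, 1.70, 2.42]
unseeded = [1.62, 1.83, 1.40, 0.75, 1.71, 2.33]


def paired_t(x, y):
    """Paired-difference t statistic for H0: mean difference = 0."""
    diffs = [a - b for a, b in zip(x, y)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def pooled_t(x, y):
    """Pooled-variance two-sample t statistic (independent-samples formula)."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * statistics.variance(x)
           + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))


t_paired = paired_t(seeded, unseeded)
t_pooled = pooled_t(seeded, unseeded)  # for illustration only; the data are paired
print(f"paired T = {t_paired:.2f}, pooled T = {t_pooled:.2f}")
```

Both statistics share the same numerator (the difference in sample means); only the standard error changes. Month-to-month variation inflates the pooled standard error, while differencing within months removes it.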

Applied Exercises

8.45 Estimating well scale deposits. Scale deposits can cause a serious reduction in the flow performance of a well. A study published in the Journal of Petroleum and Gas Engineering (April 2013) compared two methods of estimating the damage from scale deposits (called skin factor). One method of estimating the well skin factor uses a series of Excel spreadsheets, while the second method employs EPS computer software. Skin factor data were obtained from applying both methods to 10 randomly selected oil wells: 5 vertical wells and 5 horizontal wells. The results are supplied in the accompanying table.
a. Compare the mean skin factor values for the two estimation methods using all 10 sampled wells. Test at $\alpha = .05$. What do you conclude?
b. Repeat part a, but analyze the data for the 5 horizontal wells only.
c. Repeat part a, but analyze the data for the 5 vertical wells only.

SKIN
Well (Type)        Excel Spreadsheet   EPS Software
1 (Horizontal)     44.48               37.77
2 (Horizontal)     18.34               13.31
3 (Horizontal)     19.21               7.02
4 (Horizontal)     11.70               4.77
5 (Horizontal)     9.25                1.96
6 (Vertical)       317.40              281.74
7 (Vertical)       181.44              192.16
8 (Vertical)       154.65              140.84
9 (Vertical)       77.43               56.86
10 (Vertical)      49.37               45.01
Source: Rahuma, K.M., et al. “Comparison between spreadsheet and specialized programs in calculating the effect of scale deposition on the well flow performance”, Journal of Petroleum and Gas Engineering, Vol. 4, No. 4, April 2013 (Table 2).

8.46 Computer-mediated communication study. Refer to the Journal of Computer-Mediated Communication (Apr. 2004) study to compare relational intimacy in people interacting via computer-mediated communication (CMC) to people meeting face-to-face (FTF), Exercise 8.42 (p. 401). Recall that a relational intimacy score was measured (on a 7-point scale) for each participant after each of three different meeting sessions. The researchers also hypothesized that the mean relational intimacy score for participants in the CMC group will significantly increase between the first and third meetings, but the difference between the first and third meetings will not significantly change for participants in the FTF group.
a. For the CMC group comparison, give the null and alternative hypotheses of interest.
b. The researchers made the comparison, part a, using a paired t test. Explain why the data should be analyzed as matched pairs.
c. For the CMC group comparison, the reported test statistic was t = 3.04 with p-value = .003. Interpret these results. Is the researchers’ hypothesis supported?
d. For the FTF group comparison, give the null and alternative hypotheses of interest.
e. For the FTF group comparison, the reported test statistic was t = .39 with p-value = .70. Interpret these results. Is the researchers’ hypothesis supported?

8.47 Twinned drill holes. Refer to the Exploration and Mining Geology (Vol. 18, 2009) study of drilling twin holes, Exercise 7.49 (p. 326). Recall that geologists use data collected at both holes to estimate the total amount of heavy minerals (THM) present at the drilling site. Data (THM percentages) for a sample of 15 twinned holes drilled at a diamond mine in Africa are repeated in the accompanying table. The geologists want to know if there is any evidence of a difference in the true THM means of all original holes and their twin holes drilled at the mine.
a. Conduct the appropriate test of hypothesis for the geologists. Use $\alpha = .10$.
b. In Exercise 7.49d, you formed a 90% confidence interval for the true mean difference (“1st hole” minus “2nd hole”) in THM measurements and used this interval to answer the question of interest to the geologists. Do the inferences derived from the hypothesis test and confidence interval agree? Is this a surprising result? Explain.

TWINHOLE
Location   1st Hole   2nd Hole
1          5.5        5.7
2          11.0       11.2
3          5.9        6.0
4          8.2        5.6
5          10.0       9.3
6          7.9        7.0
7          10.1       8.4
8          7.4        9.0
9          7.0        6.0
10         9.2        8.1
11         8.3        10.0
12         8.6        8.1
13         10.5       10.4
14         5.5        7.0
15         10.0       11.2

8.48 Settlement of shallow foundations. Refer to the Environmental & Engineering Geoscience (Nov. 2012) study of methods for predicting settlement of shallow foundations on cohesive soil, Exercise 7.50 (p. 326). Actual settlement values for a sample of 13 structures built on a shallow foundation were determined, and these values compared to settlement predictions made using a formula that accounts for dimension, rigidity, and embedment depth of the foundation. The data (in millimeters) are reproduced in the table below. Use the SAS printout on the next page to test the hypothesis of no difference between the mean actual and mean predicted settlement values. Test using $\alpha = .05$.

SHALLOW
Structure   Actual   Predicted
1           11       11
2           11       11
3           10       12
4           8        6
5           11       9
6           9        10
7           9        9
8           39       51
9           23       24
10          269      252
11          4        3
12          82       68
13          250      264
Source: Ozur, M. “Comparing Methods for Predicting Immediate Settlement of Shallow Foundations on Cohesive Soils Based on Hypothetical and Real Cases”, Environmental & Engineering Geoscience, Vol. 18, No. 4, November 2012 (from Table 4).

SAS Output for Exercise 8.48

8.49 Light to dark transition of genes. Synechocystis, a type of cyanobacterium that can grow and survive in a wide range of conditions, is used by scientists to model DNA behavior. In the Journal of Bacteriology (July 2002), scientists isolated genes of the bacterium responsible for photosynthesis and respiration and investigated the sensitivity of the genes to light. Each gene sample was grown to midexponential phase in a growth incubator in “full light.” The lights were extinguished and growth measured after 24 hours in the dark (“full dark”). The lights were then turned back on for 90 minutes (“transient light”) followed immediately by an additional 90 minutes in the dark (“transient dark”). Standardized growth measurements in each light/dark condition were obtained for 103 genes. The complete data set is saved in the GENEDARK file. Data for the first 10 genes are shown in the accompanying table.

GENEDARK (First 10 Observations Shown)
Gene ID    Full-Dark   Tr-Light   Tr-Dark
SLR2067    -0.00562    1.40989    -1.28569
SLR1986    -0.68372    1.83097    -0.68723
SSR3383    -0.25468    -0.79794   -0.39719
SLL0928    -0.18712    -1.20901   -1.18618
SLR0335    -0.20620    1.71404    -0.73029
SLR1459    -0.53477    2.14156    -0.33174
SLL1326    -0.06291    1.03623    0.30392
SLR1329    -0.85178    -0.21490   0.44545
SLL1327    0.63588     1.42608    -0.13664
SLL1325    -0.69866    1.93104    -0.24820
Source: Gill, R. T., et al. “Genome-wide dynamic transcriptional profiling of the light to dark transition in Synechocystis Sp. PCC6803.” Journal of Bacteriology, Vol. 184, No. 13, July 2002.

a. Treat the data for the first 10 genes as a random sample collected from the population of 103 genes and test the hypothesis of no difference between the mean standardized growth of genes in the full-dark condition and genes in the transient-light condition. Use $\alpha = .01$.
b. Use a statistical software package to compute the mean difference in standardized growth of the 103 genes in the full-dark condition and the transient-light condition. Did the test, part a, detect this difference?
c. Repeat parts a and b for a comparison of the mean standardized growth of genes in the full-dark condition and genes in the transient-dark condition.
d. Repeat parts a and b for a comparison of the mean standardized growth of genes in the transient-light condition and genes in the transient-dark condition.

8.50 Testing electronic circuits. Refer to the IEICE Transactions on Information & Systems (Jan. 2005) comparison of two methods of testing electronic circuits, Exercise 7.52 (p. 327). Each of 11 circuits was tested using the standard compression/depression method and the new Huffman-based coding method, and the compression ratio recorded. The data are reproduced in the accompanying table. In theory, the Huffman coding method will yield a smaller mean compression ratio.
a. Test the theory using $\alpha = .05$.
b. Does your conclusion, part a, agree with the inference derived from the 95% confidence interval found in Exercise 7.52?

CIRCUITS
Circuit   Standard Method   Huffman Coding Method
1         .80               .78
2         .80               .80
3         .83               .86
4         .53               .53
5         .50               .51
6         .96               .68
7         .99               .82
8         .98               .72
9         .81               .45
10        .95               .79
11        .99               .77
Source: Ichihara, H., Shintani, M., and Inoue, T. “Huffman-based test response coding.” IEICE Transactions on Information & Systems, Vol. E88-D, No. 1, Jan. 2005 (Table 3).

8.51 Concrete pavement response to temperature. Civil engineers at West Virginia University have developed a 3D model to predict the response of jointed concrete pavement to temperature variations. (The International Journal of Pavement Engineering, Sept. 2004.) To validate the model, model predictions were compared to field measurements on key concrete stress variables taken at a newly constructed highway. One variable measured was slab top

transverse strain (i.e., change in length per unit length per unit time) at a distance of 1 meter from the longitudinal joint. The 5-hour changes (8:20 P.M. to 1:20 A.M.) in slab top transverse strain for 6 days are listed in the next table. Is there a significant difference between the mean daily transverse strain changes from field measurements and the 3D model? Test using $\alpha = .05$.

SLABSTRAIN
                                    Change in Transverse Strain
Day       Change in                 Field Measurement   3D Model
          Temperature (°C)
Oct. 24   -6.3                      -58                 -52
Dec. 3    13.2                      69                  59
Dec. 15   3.3                       35                  32
Feb. 2    -14.8                     -32                 -24
Mar. 25   1.7                       -40                 -39
May 24    -.2                       -83                 -71
Source: Shoukry, S., William, G., and Riad, M. “Validation of 3DFE model of jointed concrete pavement response to temperature variations.” The International Journal of Pavement Engineering, Vol. 5, No. 3, Sept. 2004 (Table IV).

8.52 Solar energy generation along highways. The potential of using solar panels constructed above national highways to generate energy was explored in the International Journal of Energy and Environmental Engineering (Dec. 2013). Two-layer solar panels (with 1 meter separating the panels) were constructed above sections of both east-west and north-south highways in India. The amount of energy (kilowatt-hours) supplied to the country’s grid by the solar panels above the two types of highways was determined each month. The data for several randomly selected months are provided in the table. The researchers concluded that the “two-layer solar panel energy generation is more viable for the north-south oriented highways as compared to east-west oriented roadways”. Do you agree?

SOLAR
Month       East-West   North-South
February    8658        8921
April       7930        8317
July        5120        5274
September   6862        7148
October     8608        8936
Source: Sharma, P. and Harinarayana, T. “Solar energy generation potential along national highways”, International Journal of Energy and Environmental Engineering, Vol. 49, No. 1, December 2013 (Table 3).

8.53 Modeling transport of gases. In AIChE Journal (Jan. 2005), chemical engineers published a new method for modeling multicomponent transport of gases. Twelve gas mixtures consisting of neon, argon, and helium were prepared at different ratios and at different temperatures. The viscosity of each mixture ($10^{-5}$ Pa·s) was measured experimentally and was calculated with the new model. The results are shown in the table below. The chemical engineers concluded that there is “an excellent agreement between our new calculation and experiments.” Do you agree? Your answer should include a discussion of practical versus statistical significance.

VISCOSITY
           Viscosity Measurements
Mixture    Experimental   New Method
1          2.740          2.736
2          2.569          2.575
3          2.411          2.432
4          2.504          2.512
5          3.237          3.233
6          3.044          3.050
7          2.886          2.910
8          2.957          2.965
9          3.790          3.792
10         3.574          3.582
11         3.415          3.439
12         3.470          3.476
Source: Kerkhof, P., and Geboers, M. “Toward a unified theory of isotropic molecular transport phenomena.” AIChE Journal, Vol. 51, No. 1, January 2005 (Table 2).

8.9 Testing a Population Proportion


In Section 8.2, we gave several examples of a statistical test of hypothesis for a population proportion $p$ (e.g., the proportion of PC notebook purchasers who buy a particular software package). When the sample size is large, the sample proportion of successes $\hat{p}$ is approximately normal and the general formulas for conducting a large-sample z test (given in Section 8.2) can be applied.
The procedure for testing a hypothesis about a population proportion $p$ based on a large sample from the target population is described in the box. (Recall that $p$ represents the probability of success in a binomial experiment.) For the procedure to be valid, the sample size must be sufficiently large to guarantee approximate normality of the sampling distribution of the sample proportion, $\hat{p}$. As with confidence intervals, a general rule of thumb for determining whether $n$ is “sufficiently large” is that both $n\hat{p}$ and $n\hat{q}$ are greater than or equal to 4.

Large-Sample Test of Hypothesis About a Population Proportion

One-Tailed Test
$H_0$: $p = p_0$
$H_a$: $p > p_0$ [or $H_a$: $p < p_0$]
Rejection region: $Z > z_\alpha$ [or $Z < -z_\alpha$]
p-value = $P(Z > z_c)$ [or $P(Z < z_c)$]

Two-Tailed Test
$H_0$: $p = p_0$
$H_a$: $p \neq p_0$
Rejection region: $|Z| > z_{\alpha/2}$
p-value = $2P(Z > |z_c|)$

Test statistic (both tests):
$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0/n}}$$
where $q_0 = 1 - p_0$, $P(Z > z_\alpha) = \alpha$, $P(Z > z_{\alpha/2}) = \alpha/2$, $p_0$ is our symbol for the particular numerical value specified for $p$ in the null hypothesis, and $z_c$ is the computed value of the test statistic.
Assumption: The sample size $n$ is sufficiently large so that the approximation is valid. As a rule of thumb, the condition of “sufficiently large” will be satisfied when $n\hat{p} \geq 4$ and $n\hat{q} \geq 4$.

Example 8.15 Testing a Proportion: Steel Highway Bridges
Controversy surrounds the use of weathering steel in the construction of highway bridges. Critics have recently cited serious corrosive problems with weathering steel and are currently urging states to prohibit its use in bridge construction. On the other hand, the steel corporations claim that these charges are exaggerated and report that 95% of all weathering steel bridges in operation show “good” performance, with no major corrosive damage. To test this claim, a team of engineers and steel industry experts evaluated 60 randomly selected weathering steel bridges and found 54 of them showing “good” performance. Is there evidence, at $\alpha = .05$, that the true proportion of weathering steel highway bridges that show “good” performance is less than .95, the figure quoted by the steel corporations?

Solution The parameter of interest is a population proportion, $p$. We want to test
$H_0$: $p = .95$
$H_a$: $p < .95$
where $p$ is the true proportion of all weathering steel highway bridges that show “good” performance.
At significance level $\alpha = .05$, the null hypothesis will be rejected if
$Z < -z_{.05}$
that is, $H_0$ will be rejected if
$Z < -1.645$ (see Figure 8.18)

FIGURE 8.18 Rejection region for Example 8.15: standard normal distribution, $\alpha = .05$. $H_0$ is rejected for $Z < -z_{.05} = -1.645$; the observed value of the test statistic, $Z = -1.78$, falls in the rejection region.

The sample proportion of bridges that show “good” performance is
$$\hat{p} = \frac{54}{60} = .90$$
Thus, the test statistic has the value
$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0/n}} = \frac{.90 - .95}{\sqrt{(.95)(.05)/60}} = -1.78$$
This value of the test statistic is shown (shaded) on a MINITAB printout of the analysis, Figure 8.19. The p-value of the test (also shaded on the printout) is .038. The test can be conducted using either the rejection region or the p-value approach. The observed value $Z = -1.78$ is less than $-1.645$ and so falls in the rejection region; equivalently, since $\alpha = .05$ exceeds p-value = .038, the null hypothesis can be rejected. There is sufficient evidence to support the hypothesis that the proportion of weathering steel highway bridges that show “good” performance is less than .95. [Note that both $n\hat{p} = 60(.90) = 54$ and $n\hat{q} = 60(.10) = 6$ exceed 4. Thus, the sample size is clearly large enough to guarantee the validity of the hypothesis test.]
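
The calculation in Example 8.15 can be reproduced with a short Python sketch. This is one possible implementation under stated assumptions, not the MINITAB routine: the function name `one_proportion_z_test` and its `alternative` argument are hypothetical, the p-value uses the standard normal distribution from Python's `statistics.NormalDist`, and the sample-size check applies the rule of thumb ($n\hat{p} \geq 4$ and $n\hat{q} \geq 4$) given in the box.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0, alternative="less"):
    """Large-sample z test for a proportion (hypothetical helper).

    alternative: "less", "greater", or "two-sided".
    Returns (z, p_value, large_enough).
    """
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    cdf = NormalDist().cdf(z)  # P(Z < z) for a standard normal Z
    if alternative == "less":
        p_value = cdf
    elif alternative == "greater":
        p_value = 1 - cdf
    elif alternative == "two-sided":
        p_value = 2 * min(cdf, 1 - cdf)
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")
    # Rule of thumb from the text: both n*p_hat and n*q_hat at least 4
    large_enough = n * p_hat >= 4 and n * (1 - p_hat) >= 4
    return z, p_value, large_enough


# Example 8.15: 54 of 60 bridges show "good" performance; H0: p = .95, Ha: p < .95
z, p_value, large_enough = one_proportion_z_test(54, 60, 0.95, "less")
print(f"Z = {z:.2f}, p-value = {p_value:.3f}, sample size adequate: {large_enough}")
```

Note that the standard error uses the hypothesized value $p_0$, not $\hat{p}$, because the test statistic is evaluated under the null hypothesis.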

FIGURE 8.19
MINITAB Test of a Population
Proportion, Example 8.15

Although small-sample procedures are available for testing hypotheses about a population proportion, the details are omitted from our discussion. It is our experience that they are of limited utility, since most surveys of binomial populations (for example, opinion polls) performed in the real world use samples that are large enough to employ the techniques of this section.

Applied Exercises

8.54 Annual survey of computer crimes. The Computer Security Institute (CSI) conducts an annual survey of computer crime at United States businesses. CSI sends survey questionnaires to computer security personnel at all U.S. corporations and government agencies. A total of 351 organizations responded to the 2010 CSI survey. Of these, 144 admitted unauthorized use of computer systems at their firms during the year. (CSI Computer Crime and Security Survey, 2010/2011.) Let $p$ represent the true proportion of U.S. organizations that experience unauthorized use of computer systems at their firms.
a. Calculate a point estimate for $p$.
b. Set up the null and alternative hypothesis to test whether the value of $p$ differs from .35.
c. Calculate the test statistic for the test, part b.
d. Find the rejection region for the test if $\alpha = .05$.
e. Use the results of parts c and d to make the appropriate conclusion.
f. Find the p-value of the test and confirm that the conclusion based on the p-value agrees with the conclusion in part e.

8.55 Toxic chemical incidents. Refer to the Process Safety Progress (Sept. 2004) study of an emergency response system for incidents involving toxic chemicals in Taiwan, Exercise 3.5 (p. 86). In a sample of 250 toxic chemical incidents logged since the system was implemented, 15 occurred in a school laboratory. Suppose you want to conduct a test of hypothesis to determine if the true percentage of toxic chemical incidents in Taiwan that occur in a school laboratory is less than 10%.
a. Set up the null and alternative hypothesis for the test.
b. Give the rejection region for $\alpha = .01$.
c. Compute the value of the test statistic.
d. Give the appropriate conclusion for the test.

8.56 Underwater acoustic communication. Refer to the IEEE Journal of Oceanic Engineering (April 2013) study of the characteristics of subcarriers (telecommunication signals carried on top of one another) for underwater acoustic communications, Exercise 4.43 (p. 158). Recall that a subcarrier can be classified as either a data subcarrier (used for data transmissions), a pilot subcarrier (used for channel estimation and synchronization), or a null subcarrier (used for direct current and guard banks transmitting no signal). In a sample of 1,024 subcarrier signals transmitted off the coast of Martha’s Vineyard, 672 were determined to be data subcarriers, 256 pilot subcarriers, and 96 null subcarriers. Suppose a communications engineer who works near Martha’s Vineyard believes that fewer than 70% of all subcarrier signals transmitted in the area are data subcarriers. Is there evidence to support this belief? Test using $\alpha = .05$.

8.57 Wiki usage in engineering education. A wiki is a web information depository with content that can be updated and edited through a web browser. Engineering faculty at a university in Portugal investigated the degree to which wiki tools are accepted in an academic environment (Computer Applications in Engineering Education, Vol. 20, 2012). An online survey was made available to both professors and students that were involved in engineering courses that make use of a wiki-based tool. A total of 136 students responded to the survey. One of the survey questions asked, “Have you ever edited content in a wiki-based tool?” Of the 136 respondents, 72 answered “yes”. Do the survey results support the claim that more than half of engineering students edit content in wiki-based tools? Test using $\alpha = .10$.

8.58 Killing insects with low oxygen. A group of Australian entomological toxicologists investigated the impact of exposure to low oxygen on the mortality of insects. (Journal of Agricultural, Biological, and Environmental Statistics, Sept. 2000.) Thousands of adult rice weevils were placed in a chamber filled with wheat grain and the chamber was exposed to nitrogen gas for 4 days. Insects were assessed as dead or alive 24 hours after exposure. The results: 31,386 dead weevils and 35 weevils found alive. Previous studies have shown a 99% mortality rate in adult rice weevils exposed to carbon dioxide for 4 days. Is the mortality rate for adult rice weevils exposed to nitrogen higher than 99%? Test using $\alpha = .10$.

8.59 Friction in paper-feeding process. Researchers at the University of Rochester studied the friction that occurs in the paper-feeding process of a photocopier (Journal of Engineering for Industry, May 1993). The experiment involved monitoring the displacement of individual sheets of paper in a stack fed through the copier. If no sheet except the top one moved more than 25% of the total stroke distance, the feed was considered successful. In a stack of 100 sheets of paper, the feeding process was successful 94 times. The success rate of the feeder is designed to be .90. Test to determine whether the true success rate of the feeder exceeds .90. Use $\alpha = .10$.

8.60 Dehorning of dairy calves. For safety reasons, calf dehorning has become a routine practice at dairy farms. A 2009 report by Europe’s Standing Committee on the Food Chain and Animal Health (SANKO) stated that 80% of European dairy farms carry out calf dehorning. A later study, published in the Journal of Dairy Science (Vol. 94, 2011), found that in a sample of 639 Italian dairy farms, 515 dehorn calves. Does the Journal of Dairy Science study support or refute the figure reported by SANKO? Explain.

8.61 Identifying organisms using a computer. National Science Education Standards recommend that all life science students be exposed to methods of identifying unknown biological specimens. Due to certain limitations of traditional identification methods, biology professors at Slippery Rock University (SRU) developed a computer-aided system for identifying common conifers (deciduous trees)

called Confir ID. (The American Biology Teacher, May 2010.) A sample of 171 life science students were exposed to both a traditional method of identifying conifers and Confir ID and then asked which method they preferred. The results: 138 students indicated their preference for Confir ID. In order to change the life sciences curriculum at SRU to include Confir ID, the biology department requires that more than 70% of the students prefer the new, computerized method. Should Confir ID be added to the curriculum at SRU? Explain your reasoning.

8.62 Study of lunar soil. Meteoritics (March 1995) reported the results of a study of lunar soil evolution. Data were obtained from the Apollo 16 mission to the moon, during which a 62-cm core was extracted from the soil near the landing site. Monomineralic grains of lunar soil were separated out and examined for coating with dust and glass fragments. Each grain was then classified as coated or uncoated. Of interest is the “coat index,” that is, the proportion of grains that are coated. According to soil evolution theory, the coat index will exceed .5 at the top of the core, equal .5 in the middle of the core, and fall below .5 at the bottom of the core. Use the summary data in the accompanying table to test each part of the three-part theory. Use $\alpha = .05$ for each test.

                             Location (depth)
                             Top         Middle      Bottom
                             (4.25 cm)   (28.1 cm)   (54.5 cm)
Number of Grains Sampled     84          73          81
Number Coated                64          35          29
Source: Basu, A., and McKay, D.S. “Lunar soil evolution processes and Apollo 16 core 60013/60014.” Meteoritics, Vol. 30, No. 2, Mar. 1995, p. 166 (Table 2).

8.10 Testing the Difference Between Two Population Proportions


Consider a transportation engineer who wants to compare the proportion of cars traveling with two or more people prior to adding a car-pool only lane on a major highway to the proportion a month after the car-pool lane was added. Let $p_1$ and $p_2$ represent the proportions prior to and after adding the car-pool lane, respectively. The method for performing a large-sample test of hypothesis about $(p_1 - p_2)$, the difference between two binomial proportions, is outlined in the box (p. 412).
When testing the null hypothesis that $(p_1 - p_2)$ equals some specified difference, say $D_0$, we make a distinction between the case $D_0 = 0$ and the case $D_0 \neq 0$. For the special case $D_0 = 0$, i.e., when we are testing $H_0$: $(p_1 - p_2) = 0$ or, equivalently, $H_0$: $p_1 = p_2$, the best estimate of $p_1 = p_2 = p$ is found by dividing the total number of successes in the combined samples by the total number of observations in the two samples. That is, if $y_1$ is the number of successes in sample 1 and $y_2$ is the number of successes in sample 2, then
$$\hat{p} = \frac{y_1 + y_2}{n_1 + n_2}$$
In this case, the best estimate of the standard deviation of the sampling distribution of $(\hat{p}_1 - \hat{p}_2)$ is found by substituting $\hat{p}$ for both $p_1$ and $p_2$:
$$\sigma_{(\hat{p}_1 - \hat{p}_2)} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \approx \sqrt{\frac{\hat{p}\hat{q}}{n_1} + \frac{\hat{p}\hat{q}}{n_2}} = \sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where $\hat{q} = 1 - \hat{p}$.
For all cases in which $D_0 \neq 0$ [for example, when testing $H_0$: $(p_1 - p_2) = .2$], we use $\hat{p}_1$ and $\hat{p}_2$ in the formula for $\sigma_{(\hat{p}_1 - \hat{p}_2)}$. However, in most practical situations, we will want to test for a difference between proportions; that is, we will want to test $H_0$: $(p_1 - p_2) = 0$.

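Large-Sample Test of Hypothesis About $(p_1 - p_2)$: Independent Samples

One-Tailed Test
    $H_0\colon (p_1 - p_2) = D_0$
    $H_a\colon (p_1 - p_2) > D_0$  [or $H_a\colon (p_1 - p_2) < D_0$]

Two-Tailed Test
    $H_0\colon (p_1 - p_2) = D_0$
    $H_a\colon (p_1 - p_2) \neq D_0$

Test statistic (both cases):

\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sigma_{(\hat{p}_1 - \hat{p}_2)}} \]

Rejection region (one-tailed): $Z > z_\alpha$  [or $Z < -z_\alpha$]
Rejection region (two-tailed): $|Z| > z_{\alpha/2}$

p-value (one-tailed): $P(Z > z_c)$  [or $P(Z < z_c)$]
p-value (two-tailed): $2 \cdot P(Z > |z_c|)$

where $P(Z > z_\alpha) = \alpha$, $P(Z > z_{\alpha/2}) = \alpha/2$, and $z_c$ is the computed value of the test statistic.

When $D_0 \neq 0$,

\[ \sigma_{(\hat{p}_1 - \hat{p}_2)} \approx \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} \]

where $\hat{q}_1 = 1 - \hat{p}_1$ and $\hat{q}_2 = 1 - \hat{p}_2$.

When $D_0 = 0$,

\[ \sigma_{(\hat{p}_1 - \hat{p}_2)} \approx \sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]

where the total number of successes in the combined sample is $(y_1 + y_2)$ and

\[ \hat{p}_1 = \hat{p}_2 = \hat{p} = \frac{y_1 + y_2}{n_1 + n_2} \]

Assumption: The sample sizes, $n_1$ and $n_2$, are sufficiently large. This will be satisfied if $n_1\hat{p}_1 \geq 4$, $n_1\hat{q}_1 \geq 4$, and $n_2\hat{p}_2 \geq 4$, $n_2\hat{q}_2 \geq 4$.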

The sample sizes $n_1$ and $n_2$ must be sufficiently large to ensure that the sampling distributions of $\hat{p}_1$ and $\hat{p}_2$, and hence of the difference $(\hat{p}_1 - \hat{p}_2)$, are approximately normal. The rule of thumb used to determine if the sample sizes are “sufficiently large” is the same as that given in Section 7.8, namely, that the quantities $n_1\hat{p}_1$, $n_2\hat{p}_2$, $n_1\hat{q}_1$, and $n_2\hat{q}_2$ are all greater than or equal to 4. (Note: If the sample sizes are not sufficiently large, $p_1$ and $p_2$ can be compared using a technique to be discussed in Chapter 9.)
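The procedure in the box can be sketched in a few lines of Python. This is a minimal illustration, not part of the text: the function name `two_proportion_z_test`, the `alternative` labels, and the counts in the usage line are all hypothetical, and only the standard library is assumed. The standard error is pooled when D0 = 0 and unpooled otherwise, and the function also reports whether the "at least 4" rule of thumb holds.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(y1, n1, y2, n2, d0=0.0, alternative="two-sided"):
    """Large-sample Z test of H0: (p1 - p2) = d0 for independent samples.

    `alternative` is "less", "greater", or "two-sided" (labels chosen for
    this sketch only). Returns (z, p_value, large_enough).
    """
    p1_hat, p2_hat = y1 / n1, y2 / n2
    q1_hat, q2_hat = 1 - p1_hat, 1 - p2_hat

    # Rule of thumb from the text: all four quantities should be at least 4.
    large_enough = min(n1 * p1_hat, n1 * q1_hat, n2 * p2_hat, n2 * q2_hat) >= 4

    if d0 == 0:
        # Pooled estimate of the common proportion under H0: p1 = p2.
        p_hat = (y1 + y2) / (n1 + n2)
        se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    else:
        # Unpooled standard error when the hypothesized difference is nonzero.
        se = math.sqrt(p1_hat * q1_hat / n1 + p2_hat * q2_hat / n2)

    z = ((p1_hat - p2_hat) - d0) / se
    std_normal = NormalDist()
    if alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    else:
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value, large_enough


# Hypothetical counts (not from the text): test H0: (p1 - p2) = .05
# against Ha: (p1 - p2) > .05.
z, p_value, ok = two_proportion_z_test(60, 200, 45, 250, d0=0.05, alternative="greater")
print(f"Z = {z:.2f}, p-value = {p_value:.3f}, sample sizes adequate: {ok}")
```

Returning the sample-size check alongside the statistic, rather than raising an error, is a design choice of this sketch; it leaves the decision about small samples to the analyst.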

Example 8.16 Testing $p_1 - p_2$: Carpooling Study

Recently there have been intensive campaigns encouraging people to save energy by carpooling to work. Some cities have created an incentive for carpooling by designating certain highway traffic lanes as “car-pool only” (i.e., only cars with two or more passengers can use these lanes). To evaluate the effectiveness of this plan, toll booth personnel in one city monitored 2,000 randomly selected cars prior to establishing car-pool-only lanes and 1,500 cars after the car-pool-only lanes were established. The results of the study are shown in Table 8.7, where $y_1$ and $y_2$ represent the numbers of cars with two or more passengers (i.e., car-pool riders) in the “before” and “after” samples, respectively. Do the data indicate that the fraction of cars with car-pool riders has increased over this period? Use $\alpha = .05$.

TABLE 8.7 Results of Carpooling Study, Example 8.16

                    Before Car-Pool       After Car-Pool
                    Lanes Established     Lanes Established
Sample Size         n1 = 2,000            n2 = 1,500
Car-Pool Riders     y1 = 652              y2 = 576

Solution If we define $p_1$ and $p_2$ as the true proportions of cars with car-pool riders before and after establishing car-pool lanes, respectively, the elements of our test are

$H_0\colon (p_1 - p_2) = 0$
$H_a\colon (p_1 - p_2) < 0$

(The test is one-tailed since we are interested only in determining whether the proportion of cars with car-pool riders has increased, i.e., whether $p_2 > p_1$.)

Test statistic:

\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sigma_{(\hat{p}_1 - \hat{p}_2)}} \]

Rejection region: For $\alpha = .05$, reject $H_0$ if $Z < -z_\alpha = -z_{.05} = -1.645$ (see Figure 8.20).

FIGURE 8.20 Rejection region for Example 8.16 [standard normal density f(z); lower-tail rejection region with area α = .05 to the left of −z.05 = −1.645; the computed value Z = −3.56 falls inside it]

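We now calculate the sample proportions of cars with car-pool riders:

\[ \hat{p}_1 = \frac{652}{2{,}000} = .326 \qquad \hat{p}_2 = \frac{576}{1{,}500} = .384 \]

The test statistic is

\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sigma_{(\hat{p}_1 - \hat{p}_2)}} \approx \frac{(\hat{p}_1 - \hat{p}_2)}{\sqrt{\hat{p}\hat{q}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \]

where

\[ \hat{p} = \frac{y_1 + y_2}{n_1 + n_2} = \frac{652 + 576}{2{,}000 + 1{,}500} = .351 \]

Thus,

\[ Z = \frac{.326 - .384}{\sqrt{(.351)(.649)\left(\dfrac{1}{2{,}000} + \dfrac{1}{1{,}500}\right)}} = \frac{-.058}{.0163} = -3.56 \]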

The test statistic value is also shown (shaded) on the MINITAB printout of the analysis, Figure 8.21. The p-value of the test (also highlighted on the printout) is approximately 0. Note that $Z = -3.56$ falls in the rejection region and $\alpha = .05$ exceeds the p-value. Thus, there is sufficient evidence at $\alpha = .05$ to conclude that the proportion of all cars with car-pool riders has increased after establishing car-pool lanes. We could place a confidence interval on $(p_1 - p_2)$ if we were interested in estimating the extent of the increase.

FIGURE 8.21 MINITAB Test of Difference between Population Proportions, Example 8.16
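The arithmetic of Example 8.16 can be checked with a short script. The sketch below assumes only the Python standard library; the variable names are illustrative. It recomputes the pooled test statistic and lower-tail p-value, and then forms the large-sample 95% confidence interval for $(p_1 - p_2)$ mentioned in the last sentence, using the unpooled standard error (the usual choice for an interval, since no null value is being assumed).

```python
import math
from statistics import NormalDist

# Data from Table 8.7.
n1, y1 = 2000, 652   # before car-pool lanes
n2, y2 = 1500, 576   # after car-pool lanes

p1_hat, p2_hat = y1 / n1, y2 / n2
diff = p1_hat - p2_hat

# Pooled standard error for the test of H0: (p1 - p2) = 0.
p_hat = (y1 + y2) / (n1 + n2)
se_pooled = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
z = diff / se_pooled

# Lower-tailed test (Ha: p1 - p2 < 0), so the p-value is P(Z < z).
p_value = NormalDist().cdf(z)

# Unpooled standard error for a 95% confidence interval on (p1 - p2).
se_unpooled = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z_crit = NormalDist().inv_cdf(0.975)
ci_lower = diff - z_crit * se_unpooled
ci_upper = diff + z_crit * se_unpooled

print(f"Z = {z:.2f}, p-value = {p_value:.5f}")
print(f"95% CI for (p1 - p2): ({ci_lower:.3f}, {ci_upper:.3f})")
```

The statistic agrees with the hand calculation (about −3.56), and the interval lies entirely below zero (roughly −.090 to −.026), which is consistent with the conclusion that the proportion of cars with car-pool riders increased.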

Applied Exercises

8.63 Producer willingness to supply biomass. Refer to the Biomass and Energy (Vol. 36, 2012) study of the willingness of producers to supply biomass products such as surplus hay, Exercise 7.67 (p. 334). Recall that independent samples of Missouri producers and Illinois producers were surveyed and the number of producers willing to supply windrowing (mowing and piling) of hay was determined for each sample. Of the 558 Missouri producers surveyed, 187 were willing to offer windrowing services; of the 940 Illinois producers surveyed, 380 were willing to offer windrowing services. In Exercise 7.67, you obtained a 99% confidence interval for the difference between the proportions of producers who are willing to offer windrowing services in Missouri and Illinois from a MINITAB printout (reproduced below). Now, use the information on the printout to conduct a statistical test to determine if the proportion of producers who are willing to offer windrowing services to the biomass market differs for the two areas. For what value of α will the inferences derived from the test and confidence interval agree? Carry out the test of hypothesis and make the appropriate inference.

MINITAB Output for Exercise 8.63

MTBE
8.64 Groundwater contamination in wells. Refer to the Environmental Science & Technology (Jan. 2005) study of methyl tert-butyl ether (MTBE) contamination in New Hampshire wells, Exercise 7.66 (p. 334). Recall that 223 wells were classified according to well class (public or private) and detectable level of MTBE (below limit or detect). The SPSS printout below gives the number of wells in the sample with a detectable level of MTBE for both the 120 public wells and the 103 private wells.
a. Conduct a two-tailed test of hypothesis to compare the true proportion of public wells with a detectable level of MTBE to the true proportion of private wells with a detectable level of MTBE. Use α = .05.
b. In Exercise 7.66, you compared the two proportions with a 95% confidence interval. Explain why the inference derived from the two-tailed test, part a, will agree with the inference derived from the confidence interval.

SPSS Output for Exercise 8.64


8.65 Study of armyworm pheromones. Refer to the Journal of Chemical Ecology (March 2013) study to determine the effectiveness of pheromones produced by two different strains of fall armyworms, Exercise 7.68 (p. 334). Recall that both corn-strain and rice-strain male armyworms were released into a field containing a synthetic pheromone made from a corn-strain blend. A count of the number of males trapped by the pheromone was then determined. The experiment was conducted once in a corn field, then again in a grass field. The results are repeated in the accompanying table. In Exercise 7.78 you compared the proportions of corn-strain and rice-strain males trapped by the pheromone.
a. Now, the researchers want to compare the proportion of corn-strain males trapped in the corn field to the proportion of corn-strain males trapped in the grass field. Carry out this comparison using a hypothesis test (at α = .10). What inference can you draw from the data?
b. Repeat part a for the proportions of rice-strain males trapped by the pheromone.

                                          Corn Field    Grass Field
Number of corn-strain males released         112            215
Number trapped                                86            164
Number of rice-strain males released         150            669
Number trapped                                92            375

8.66 Fluoride toxicity in Pakistan drinking water. The results of an evaluation of the drinking water quality in Pakistan were reported in Drinking Water Engineering and Science (Vol. 6, 2013). Due to high levels of fluoride in the drinking water, Pakistanis are susceptible to fluoride toxicity (fluorosis), which occurs when the fluoride level exceeds 1.5 milligrams per liter of water (mg/l). Water specimens were collected from various surface or groundwater sources (e.g., hand pumps, wells, springs, dams, etc.) of major cities of the country. The table gives the results for two cities, Lahore and Faisalabad. Is there evidence to indicate that the fraction of water specimens that exceed 1.5 mg/l of fluoride differs for the two cities? Test using α = .10.

                                           Lahore    Faisalabad
Number of water specimens sampled            79          30
Number exceeding 1.5 mg/l of fluoride        21           4

8.67 Traffic sign maintenance. The Federal Highway Administration (FHWA) recently issued new guidelines for maintaining and replacing traffic signs. Civil engineers at North Carolina State University conducted a study of the effectiveness of various sign maintenance practices developed to adhere to the new guidelines and published the results in the Journal of Transportation Engineering (June 2013). One portion of the study focused on the proportion of traffic signs that fail the minimum FHWA retroreflectivity requirements. Of 1,000 signs maintained by the North Carolina Department of Transportation (NCDOT), 512 were deemed failures. Of 1,000 signs maintained by county-owned roads in North Carolina, 328 were deemed failures. Conduct a test of hypothesis to determine whether the true proportions of traffic signs that fail the minimum FHWA retroreflectivity requirements differ depending on whether the signs are maintained by the NCDOT or by the county. Test using α = .05.

8.68 Inactive oil and gas structures. Refer to the Oil & Gas Journal (Jan. 3, 2005) study of 3,400 oil and gas structures in the Gulf of Mexico, Exercise 3.19 (p. 93). The accompanying table breaks down these structures by type (caisson, well protector, or fixed platform) and status (active or inactive). Assume the 3,400 structures are a representative sample of all oil and gas structures worldwide.

                          Structure Type
            Caisson    Well Protector    Fixed Platform    Totals
Active        503           225              1,447          2,175
Inactive      598           177                450          1,225
Totals      1,101           402              1,897          3,400

Source: Kaiser, M., and Mesyanzhinov, D. “Study tabulates idle Gulf of Mexico structures.” Oil & Gas Journal, Vol. 103, No. 1, Jan. 3, 2005 (Table 2).

a. Conduct a test (at α = .10) to determine if the proportion of caisson structures that are inactive exceeds the proportion of well protector structures that are inactive.
b. Conduct a test (at α = .10) to determine if the proportion of caisson structures that are inactive exceeds the proportion of fixed platform structures that are inactive.
c. Conduct a test (at α = .10) to determine if the proportion of well protector structures that are inactive differs from the proportion of fixed platform structures that are inactive.

8.69 Killing insects with low oxygen. Refer to the Journal of Agricultural, Biological, and Environmental Statistics (Sept. 2000) study of the mortality of rice weevils exposed to low oxygen, Exercise 8.58 (p. 410). Recall that 31,386 of 31,421 rice weevils were found dead after exposure to nitrogen gas for 4 days. In a second experiment, 23,516 of 23,676 rice weevils were found dead after exposure to nitrogen gas for 3.5 days. Conduct a test of hypothesis to compare the mortality rates of adult rice weevils exposed to nitrogen at the two exposure times. Is there a significant difference (at α = .10) in the mortality rates?

8.70 Vulnerability of relying party websites. When you sign on to your Facebook account, you are granted access to more than 1 million relying party (RP) websites. This single sign-on (SSO) scheme is enabled by OAuth 2.0, an open and standardized web resource authorization protocol. Although the protocol claims to be secure, there is anecdotal evidence of critical vulnerabilities that allow an attacker to gain unauthorized access to the user’s profile and allow the attacker to impersonate the victim on the RP website. Computer and systems engineers at the University of British Columbia investigated the vulnerability of relying party websites and presented their results at the Proceedings of the 5th ACM Workshop on Computers & Communication Security (Oct. 2012). RP websites were categorized as server-flow or client-flow websites. Of the 40 server-flow sites studied, 20 were found to be vulnerable to impersonation attacks. Of the 54 client-flow sites examined, 41 were found to be vulnerable to impersonation attacks. Do these results indicate that a client-flow website is more likely to be vulnerable to an impersonation attack than a server-flow website? Test using α = .01.

8.71 Engineering vs. technology degrees. In addition to the traditional bachelor of engineering (BE) degree, many universities worldwide offer a bachelor of technology (BTech) degree for students who wish to work as an engineering technician. There is a perception that BTech students are not as “academically strong” as BE students. This issue was addressed in the International Journal of Continuing Engineering Education and Lifelong Learning (Vol. 13, 2003). The researchers compared BE and BTech students at an Australian university on a variety of academic-related outcomes. The following table gives the percentages of BE and BTech students who withdrew from two traditionally rigorous courses, engineering mathematics and engineering graphics/CAD.

Engineering Mathematics       BE Students    BTech Students
Number Enrolled                   537             117
Percentage Withdrawn             27.8%           19.7%

Engineering Graphics/CAD      BE Students    BTech Students
Number Enrolled                   727             374
Percentage Withdrawn             39.5%           52.1%

Source: Palmer, S., and Bray, S. “Comparative academic performance of engineering and technology students at Deakin University, Australia.” International Journal of Continuing Engineering Education and Lifelong Learning, Vol. 13, No. 1–2, 2003 (Tables 5 and 8).

a. Is there sufficient evidence of a difference between the percentage of BE students and percentage of BTech students who withdraw from engineering mathematics? Test using α = .05.
b. Is there sufficient evidence of a difference between the percentage of BE students and percentage of BTech students who withdraw from engineering graphics/CAD? Test using α = .05.

8.11 Testing a Population Variance


In this section we consider a hypothesis test for a population variance, $\sigma^2$ (e.g., the variation in daily amount of rainfall). Recall from Section 7.9 that the pivotal statistic for estimating a population variance $\sigma^2$ does not possess a standard normal (Z) distribution. Therefore, we cannot apply the procedure outlined in Section 8.3 when testing hypotheses about $\sigma^2$.

When the sample is selected from a normal population, however, the pivotal statistic possesses a chi-square ($\chi^2$) distribution and the test can be conducted as outlined in the box. Note that the assumption of normality is required regardless of whether the sample size n is large or small.

Test of Hypothesis About a Population Variance $\sigma^2$

One-Tailed Test
    $H_0\colon \sigma^2 = \sigma_0^2$
    $H_a\colon \sigma^2 > \sigma_0^2$  [or $H_a\colon \sigma^2 < \sigma_0^2$]

Two-Tailed Test
    $H_0\colon \sigma^2 = \sigma_0^2$
    $H_a\colon \sigma^2 \neq \sigma_0^2$

Test statistic (both cases):

\[ \chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} \]

Rejection region (one-tailed): $\chi^2 > \chi^2_\alpha$  [or $\chi^2 < \chi^2_{1-\alpha}$]
Rejection region (two-tailed): $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$

p-value (one-tailed): $P(\chi^2 > \chi^2_c)$  [or $P(\chi^2 < \chi^2_c)$]
p-value (two-tailed): $2\min\{P(\chi^2 > \chi^2_c),\ P(\chi^2 < \chi^2_c)\}$

where $\chi^2_\alpha$ and $\chi^2_{1-\alpha}$ are values of $\chi^2$ that locate an area of $\alpha$ to the right and $\alpha$ to the left, respectively, of a chi-square distribution based on $(n - 1)$ degrees of freedom, and $\chi^2_c$ is the calculated value of the test statistic.

(Note: $\sigma_0^2$ is our symbol for the particular numerical value specified for $\sigma^2$ in the null hypothesis.)

Assumption: The population from which the random sample is selected has an approximately normal distribution.

Example 8.17 Testing $\sigma^2$: Variation in Fill Measurements

Refer to Example 7.15 (p. 337) concerning the variability of the amount of fill at a cannery. Suppose regulatory agencies specify that the standard deviation of the amount of fill should be less than .1 ounce. The quality control supervisor sampled n = 10 cans and measured the amount of fill in each. The data are reproduced in Table 8.8. Does this information provide sufficient evidence to indicate that the standard deviation $\sigma$ of the fill measurements is less than .1 ounce?

FILLWTS
TABLE 8.8 Fill Weights of Cans

7.96 7.90 7.98 8.01 7.97 7.96 8.03 8.02 8.04 8.02

Solution Since the null and alternative hypotheses must be stated in terms of $\sigma^2$ (rather than $\sigma$), we will want to test the null hypothesis that $\sigma^2 = .01$ against the alternative that $\sigma^2 < .01$. Therefore, the elements of the test are

$H_0\colon \sigma^2 = .01$ (i.e., $\sigma = .1$)
$H_a\colon \sigma^2 < .01$ (i.e., $\sigma < .1$)

Assumption: The population of fill amounts is approximately normal.

Test statistic:

\[ \chi^2 = \frac{(n - 1)s^2}{\sigma_0^2} \]

Rejection region: The smaller the value of $s^2$ we observe, the stronger the evidence in favor of $H_a$. Thus, we reject $H_0$ for “small values” of the test statistic. With $\alpha = .05$ and 9 df, the $\chi^2$ value for rejection is found in Table 8 of Appendix B and pictured in Figure 8.22. We will reject $H_0$ if $\chi^2 < 3.32511$. (Remember that the area given in Table 8 of Appendix B is the area to the right of the numerical value in the table. Thus, to determine the lower-tail value that has $\alpha = .05$ to its left, we use the $\chi^2_{.95}$ column in Table 8.)

FIGURE 8.22 Rejection region for Example 8.17 [chi-square density f(χ²) on an axis from 0 to 18; lower-tail rejection region with area α = .05 to the left of 3.325 and area 1 − α = .95 to its right; the value 1.664 is marked inside the rejection region]

To compute the test statistic, we need to find the sample standard deviation, $s$. Numerical descriptive statistics for the sample data are provided in the MINITAB printout shown in Figure 8.23. The value of $s$ (shaded on the printout) is $s = .043$. Substituting $s = .043$, $n = 10$, and $\sigma_0^2 = .01$ into the formula for the test statistic, we obtain

\[ \chi^2 = \frac{(10 - 1)(.043)^2}{.01} \approx 1.67 \]

(Carrying more decimal places, $s \approx .04306$ gives $\chi^2 \approx 1.669$; with $s$ rounded to .043 the arithmetic gives 1.664.)

Note that this test statistic and the corresponding p-value of the test (.004) are both given (shaded) at the bottom of the MINITAB printout, Figure 8.23.

Conclusion: Since the test statistic, $\chi^2 = 1.67$, is less than 3.32511 (or, since $\alpha = .05 >$ p-value $= .004$), the supervisor can conclude (at $\alpha = .05$) that the variance of the population of all amounts of fill is less than .01 ($\sigma < .1$). If this procedure is repeatedly used, it will incorrectly reject $H_0$ only 5% of the time. Thus, the quality control supervisor is confident in the decision that the cannery is operating within the desired limits of variability.
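The calculation in Example 8.17 can be reproduced directly from the Table 8.8 data. The sketch below is illustrative only and assumes nothing beyond the Python standard library. Because the standard library has no chi-square distribution, the helper `chi2_cdf` (a name introduced here) evaluates the chi-square CDF through the standard series expansion of the regularized lower incomplete gamma function; a statistics package would normally supply this.

```python
import math
from statistics import variance

# Fill weights from Table 8.8.
fill_weights = [7.96, 7.90, 7.98, 8.01, 7.97, 7.96, 8.03, 8.02, 8.04, 8.02]
sigma0_sq = 0.01   # hypothesized variance under H0
alpha = 0.05


def chi2_cdf(x, df):
    """P(chi-square with `df` degrees of freedom <= x).

    Uses the series for the regularized lower incomplete gamma function
    P(a, t) with a = df/2 and t = x/2 (helper written for this sketch).
    """
    a, t = df / 2.0, x / 2.0
    if t <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= t / (a + n)
        total += term
        n += 1
    return total * math.exp(-t + a * math.log(t) - math.lgamma(a))


n = len(fill_weights)
s_sq = variance(fill_weights)             # sample variance s^2
chi2_stat = (n - 1) * s_sq / sigma0_sq    # test statistic
p_value = chi2_cdf(chi2_stat, n - 1)      # lower-tailed test: P(chi2 < chi2_c)
reject_h0 = p_value < alpha

print(f"s = {math.sqrt(s_sq):.4f}, chi-square = {chi2_stat:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Working from the raw data avoids rounding $s$ before squaring it, which is why the statistic comes out near 1.67 rather than the 1.664 obtained from $s = .043$.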

FIGURE 8.23
MINITAB Test of a Population
Variance, Example 8.17

Applied Exercises

8.72 Characteristics of a rock fall. Refer to the Environmental Geology (Vol. 58, 2009) simulation study of how far a block from a collapsing rock wall will bounce down a soil slope, Exercise 2.29 (p. 43). Rebound lengths (in meters) were estimated for 13 rock bounces. The data are repeated in the table. Descriptive statistics for the rebound lengths are shown on the accompanying SAS printout. Consider a test of hypothesis for the variation in rebound lengths for the theoretical population of rock bounces from the collapsing rock wall. In particular, a geologist wants to determine if the variance differs from 10 m².

ROCKFALL

10.94 13.71 11.38 7.26 17.83 11.92
11.87 5.44 13.35 4.90 5.85 5.10 6.77

Source: Paronuzzi, P. “Rockfall-induced block propagation on a soil slope, northern Italy”, Environmental Geology, Vol. 58, 2009. (Table 2.)

a. Define the parameter of interest.
b. Specify the null and alternative hypotheses.
c. Compute the value of the test statistic.
d. Determine the rejection region for the test using α = .10.
e. Make the appropriate conclusion.
f. What condition must be satisfied in order for the inference, part e, to be valid?

SAS Output for Exercise 8.72

PONDICE
8.73 Albedo of ice meltponds. Refer to the National Snow and Ice Data Center (NSIDC) collection of data on the albedo of ice meltponds, Exercise 7.80 (p. 340). The visible albedo values for a sample of 504 ice meltponds located in the Canadian Arctic are saved in the PONDICE file.
a. Conduct a test (at α = .10) to determine if the true variance of the visible albedo values of all Canadian Arctic ice ponds differs from .0225. (Note: For 503 df, $\chi^2_{.95} = 451.991$ and $\chi^2_{.05} = 556.283$.)
b. Discuss the practical significance of the test in part a. (Hint: Use the 90% confidence interval you found in Exercise 7.80.)

8.74 Oil content of fried sweet potato chips. Refer to the Journal of Food Engineering (Sep., 2013) study of the characteristics of sweet potato chips fried at different temperatures, Exercise 7.75 (p. 338). Recall that a sample of 6 sweet potato slices were fried at 130º using a vacuum fryer and the internal oil content (gigagrams) was measured for each slice. The results were: $\bar{y} = .178$ g/g and $s = .011$ g/g.
a. Conduct a test of hypothesis to determine if the standard deviation, $\sigma$, of the population of internal oil contents for sweet potato slices fried at 130º differs from .1. Use α = .05.
b. In Exercise 7.75 you formed a 95% confidence interval for the true standard deviation of the internal oil content distribution for the sweet potato chips. Use this interval to make an inference about whether $\sigma = .1$. Does the result agree with the test, part a?

8.75 Strand bond performance of pre-stressed concrete. An experiment was carried out to investigate the strength of pre-stressed, bonded concrete after anchorage failure has occurred and the results published in Engineering Structures (June 2013). The maximum strand force, measured in kiloNewtons (kN), achieved after anchorage failure for 8 pre-stressed concrete strands is given in the accompanying table. Conduct a test of hypothesis to determine if the true standard deviation of the population of maximum strand forces is less than 5 kN. Test using α = .10.

FORCE

158.2 161.5 166.5 158.4 159.9 161.9
162.8 161.2 160.1 175.6 168.8 163.7

8.76 Deep-hole drilling. Refer to the Journal for Engineering for Industry (May 1993) study of deep hole drilling under drill chip congestion, Exercise 8.32 (p. 393). Test to determine whether the true standard deviation of drill chip lengths differs from 75 mm. Recall that for n = 50 drill chips, s = 50.2.

8.77 Electrical signal theory. Recording electrical activity of the brain is important in clinical problems as well as in neurophysiological research. To improve the signal-to-noise ratio (SNR) in the electrical activity, it is necessary to repeatedly stimulate subjects and average the responses, a procedure that assumes that single responses are homogeneous. A study was conducted to test the homogeneous signal theory (IEEE Engineering in Medicine and Biology Magazine, Mar. 1990). The null hypothesis is that the variance of the SNR readings of subjects equals the “expected” level under the homogeneous signal theory. For this study, the “expected” level was assumed to be .54. If the SNR variance exceeds this level, the researchers will conclude that the signals are nonhomogeneous.
a. Set up the null and alternative hypotheses for the researchers.
b. SNRs recorded for a sample of 41 normal children ranged from .03 to 3.0. Use this information to obtain an estimate of the sample standard deviation. (Hint: Assume that the distribution of SNRs is normal and that most of the SNRs in the population will fall within $\mu \pm 2\sigma$, i.e., from $\mu - 2\sigma$ to $\mu + 2\sigma$. Note that the range of the interval equals $4\sigma$.)
c. Use the estimate of s in part b to conduct the test of part a. Test using α = .10.

8.78 Measuring PCBs. Polychlorinated biphenyls (PCBs), used in the manufacture of large electrical transformers and capacitors, are extremely hazardous contaminants when released into the environment. The Environmental Protection Agency (EPA) is experimenting with a new device for measuring PCB concentration in fish. To check the precision of the new instrument, seven PCB readings were taken on the same fish sample. The data are recorded here (in parts per million):

PCBFISH

6.2 5.8 5.7 6.3 5.9 5.8 6.0

Suppose the EPA requires an instrument that yields PCB readings with a variance of less than .1. Does the new instrument meet the EPA’s specifications? Test at α = .05.

8.79 Rubber cement canning. A company produces a fast-drying rubber cement in 32-ounce aluminum cans. A quality control inspector is interested in testing whether the variance of the amount of rubber cement dispensed into the cans is more than .3. If so, the dispensing machine is in need of adjustment. Since inspection of the canning process requires that the dispensing machines be shut down, and shutdowns for any lengthy period of time cost the company thousands of dollars in lost revenue, the inspector is able to obtain a random sample of only 10 cans for testing. After measuring the weights of their contents, the inspector computes the following summary statistics:

$\bar{y} = 31.55$ ounces    $s = .48$ ounce

a. Does the sample evidence indicate that the dispensing machines are in need of adjustment? Test at significance level α = .05.
b. What assumption is necessary for the hypothesis test of part a to be valid?

8.12 Testing the Ratio of Two Population Variances


As in the one-sample case, the pivotal statistic for comparing two population variances, $\sigma_1^2$ and $\sigma_2^2$, has a nonnormal sampling distribution. Recall from Section 7.10 that the ratio of the sample variances $s_1^2/s_2^2$ possesses, under certain conditions, an F distribution.

The elements of the hypothesis test for the ratio of two population variances, $\sigma_1^2/\sigma_2^2$, are given in the box.

Test of Hypothesis for the Ratio of Two Population Variances $\sigma_1^2/\sigma_2^2$: Independent Samples

One-Tailed Test
    $H_0\colon \dfrac{\sigma_1^2}{\sigma_2^2} = 1$
    $H_a\colon \dfrac{\sigma_1^2}{\sigma_2^2} > 1$  [or $H_a\colon \dfrac{\sigma_1^2}{\sigma_2^2} < 1$]
    Test statistic: $F = \dfrac{s_1^2}{s_2^2}$  [or $F = \dfrac{s_2^2}{s_1^2}$]
    Rejection region: $F > F_\alpha$
    p-value $= P(F > F_c)$

Two-Tailed Test
    $H_0\colon \dfrac{\sigma_1^2}{\sigma_2^2} = 1$
    $H_a\colon \dfrac{\sigma_1^2}{\sigma_2^2} \neq 1$
    Test statistic:

\[ F = \frac{\text{Larger sample variance}}{\text{Smaller sample variance}} = \begin{cases} \dfrac{s_1^2}{s_2^2} & \text{when } s_1^2 > s_2^2 \\[2ex] \dfrac{s_2^2}{s_1^2} & \text{when } s_2^2 > s_1^2 \end{cases} \]

    Rejection region: $F > F_{\alpha/2}$
    p-value $= 2 \cdot P(F > F_c)$

where $F_\alpha$ and $F_{\alpha/2}$ are values that locate area $\alpha$ and $\alpha/2$, respectively, in the upper tail of the F distribution with $\nu_1$ = numerator degrees of freedom (i.e., the df for the sample variance in the numerator) and $\nu_2$ = denominator degrees of freedom (i.e., the df for the sample variance in the denominator) and $F_c$ is the computed value of the test statistic.

Assumptions:
1. Both of the populations from which the samples are selected have relative frequency distributions that are approximately normal.
2. The random samples are selected in an independent manner from the two populations.

Example 8.18 A Test to Compare Variances: Hospital Sterilization

Heavy doses of ethylene oxide (ETO) in rabbits have been shown to alter significantly the DNA structure of cells. Although it is a known mutagen and suspected carcinogen, ETO is used quite frequently in sterilizing hospital supplies. A study was conducted to investigate the effect of ETO on hospital personnel involved with the sterilization process. Thirty-one subjects were randomly selected and assigned to one of two tasks. Thirteen subjects were assigned the task of opening and unloading a sterilizer gun filled with ETO (task 1). The remaining 18 subjects were assigned the task of opening a sterilization package containing ETO (task 2). After the tasks were performed, researchers measured the amount of ETO (in milligrams) present in the bloodstream of each subject. A summary of the results appears in Table 8.9. Do the data provide sufficient evidence to indicate a difference in the variability of the ETO levels in subjects assigned to the two tasks? Test using $\alpha = .10$.

TABLE 8.9 Summary Data for Example 8.18

                      Task 1    Task 2
Sample Size             13        18
Mean                   5.60      5.90
Standard Deviation     3.10      1.93

Solution Let

$\sigma_1^2$ = Population variance of ETO levels in subjects assigned task 1
$\sigma_2^2$ = Population variance of ETO levels in subjects assigned task 2

For this test to yield valid results, we must assume that both samples of ETO levels come from normal populations and that the samples are independent.

The hypotheses of interest are, then,

$H_0\colon \dfrac{\sigma_1^2}{\sigma_2^2} = 1$  ($\sigma_1^2 = \sigma_2^2$)
$H_a\colon \dfrac{\sigma_1^2}{\sigma_2^2} \neq 1$  ($\sigma_1^2 \neq \sigma_2^2$)
The nature of the F tables given in Appendix B affects the form of the test statistic. To form the rejection region for a two-tailed F test, we want to make certain that the upper tail is used, because only the upper-tail values of F are shown in Tables 9–12 of Appendix B. To accomplish this, we will always place the larger sample variance in the numerator of the F test statistic. This has the effect of doubling the tabulated value for $\alpha$, since we double the probability that the F ratio will fall in the upper tail by always placing the larger sample variance in the numerator. That is, we make the test two-tailed by putting the larger variance in the numerator rather than establishing rejection regions in both tails.

FIGURE 8.24 Rejection region for Example 8.18 [F density f(F) on an axis from 0 to 6; upper-tail rejection region with area α/2 = .05 beginning at the tabulated value 2.38; the computed value F = 2.58 falls inside it]
Thus, for our example, we have a numerator $s_1^2$ with df $= n_1 - 1 = 12$ and a denominator $s_2^2$ with df $= n_2 - 1 = 17$. Therefore, the test statistic will be

\[ F = \frac{\text{Larger sample variance}}{\text{Smaller sample variance}} = \frac{s_1^2}{s_2^2} \]

and we will reject $H_0\colon \sigma_1^2 = \sigma_2^2$ for $\alpha = .10$ when the calculated value of F exceeds the tabulated value:

\[ F_{\alpha/2} = F_{.05} = 2.38 \]

We can now calculate the value of the test statistic and complete the analysis:

\[ F = \frac{s_1^2}{s_2^2} = \frac{(3.10)^2}{(1.93)^2} = \frac{9.61}{3.72} = 2.58 \]

When we compare this to the rejection region shown in Figure 8.24, we see that $F = 2.58$ falls in the rejection region. Therefore, the data provide sufficient evidence to indicate that the population variances differ. It appears that hospital personnel involved with opening the sterilization package (task 2) have less variable ETO levels than those involved with opening and unloading the sterilizer gun (task 1).

[Note: You can also use the p-value of the test to make the appropriate conclusion. The p-value for this two-tailed F test is shown (shaded) on the MINITAB printout, Figure 8.25. Since p-value = .073 is less than $\alpha = .10$, there is sufficient evidence to reject $H_0$.]
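The two-tailed F test of Example 8.18 can be sketched in Python from the summary statistics in Table 8.9. This is an illustration under stated assumptions, not the MINITAB procedure: the standard library has no F distribution, so the helper functions `f_pdf` and `f_upper_tail` (names introduced here) integrate the F density numerically with Simpson's rule. The integration assumes more than 2 numerator degrees of freedom, so that the density is finite at zero.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F distribution with d1 numerator and d2 denominator df."""
    log_const = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2)
                 - math.lgamma(d2 / 2) + (d1 / 2) * math.log(d1 / d2))
    return (math.exp(log_const) * x ** (d1 / 2 - 1)
            * (1 + d1 * x / d2) ** (-(d1 + d2) / 2))


def f_upper_tail(f_value, d1, d2, steps=2000):
    """P(F > f_value), computed as 1 minus a Simpson's-rule integral on [0, f_value].

    `steps` must be even; assumes d1 > 2 (helper written for this sketch).
    """
    h = f_value / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(f_value, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1 - total * h / 3


# Summary data from Table 8.9.
s1, n1 = 3.10, 13   # task 1
s2, n2 = 1.93, 18   # task 2
alpha = 0.10

# Two-tailed test: place the larger sample variance in the numerator.
var1, var2 = s1 ** 2, s2 ** 2
if var1 >= var2:
    f_stat, df_num, df_den = var1 / var2, n1 - 1, n2 - 1
else:
    f_stat, df_num, df_den = var2 / var1, n2 - 1, n1 - 1

p_value = 2 * f_upper_tail(f_stat, df_num, df_den)
reject_h0 = p_value < alpha

print(f"F = {f_stat:.2f} with ({df_num}, {df_den}) df, two-tailed p-value = {p_value:.3f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

The script reproduces $F = 2.58$ with 12 and 17 degrees of freedom; its two-tailed p-value should be close to the .073 reported on the printout and leads to the same decision at $\alpha = .10$.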

FIGURE 8.25
MINITAB printout for
Example 8.18

What would you have concluded in Example 8.18 if the value of F calculated from the samples had not fallen in the rejection region? Would you conclude that the null hypothesis of equal variances is true? No, because then you risk the possibility of a Type II error (failing to reject $H_0$ if $H_a$ is true) without knowing the value of $\beta$, the probability of failing to reject $H_0: \sigma_1^2 = \sigma_2^2$ if in fact it is false. Since we will not consider the calculation of $\beta$ for specific alternatives, when the F statistic does not fall in the rejection region, we simply conclude that insufficient sample evidence exists to refute the null hypothesis that $\sigma_1^2 = \sigma_2^2$.

Example 8.18 illustrates the technique for calculating the test statistic and rejection region for a two-tailed test to avoid the problem of locating an F value in the lower tail of the F distribution. In a one-tailed test this is much easier to accomplish since we can control how we specify the ratio of the population variances in $H_0$ and $H_a$. That is, we can always make a one-tailed test an upper-tailed test. For example, if we want to test whether $\sigma_1^2$ is greater than $\sigma_2^2$, then we write the alternative hypothesis as

$$H_a: \frac{\sigma_1^2}{\sigma_2^2} > 1 \quad (\text{i.e., } \sigma_1^2 > \sigma_2^2)$$

and the appropriate test statistic is $F = s_1^2/s_2^2$. Conversely, if we want to test whether $\sigma_1^2$ is less than $\sigma_2^2$ (i.e., whether $\sigma_2^2$ is greater than $\sigma_1^2$), we write

$$H_a: \frac{\sigma_2^2}{\sigma_1^2} > 1 \quad (\text{i.e., } \sigma_2^2 > \sigma_1^2)$$

and the corresponding test statistic is $F = s_2^2/s_1^2$.
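The convention described above (always arrange the ratio so that the test is upper-tailed) can be sketched in a few lines of Python. This is only an illustration: the function name `upper_tailed_f` and its `alternative` argument are hypothetical, and the standard deviations 3.10 and 1.93 are the values that appear in the Example 8.18 computation. The critical value or p-value must still be obtained from an F table or statistical software.

```python
def upper_tailed_f(s1, s2, alternative="two-sided"):
    """Return (F, numerator_sample) arranged so that large F contradicts H0.

    s1, s2      -- sample standard deviations of samples 1 and 2
    alternative -- "two-sided" (variances differ),
                   "greater"   (sigma1^2 > sigma2^2), or
                   "less"      (sigma1^2 < sigma2^2)
    """
    var1, var2 = s1 ** 2, s2 ** 2
    if alternative == "greater":
        numerator = 1                      # F = s1^2 / s2^2
    elif alternative == "less":
        numerator = 2                      # F = s2^2 / s1^2
    elif alternative == "two-sided":
        # Put the larger sample variance in the numerator.
        numerator = 1 if var1 >= var2 else 2
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    f = var1 / var2 if numerator == 1 else var2 / var1
    return f, numerator


# Example 8.18 values: s1 = 3.10, s2 = 1.93
f, numerator = upper_tailed_f(3.10, 1.93)
print(round(f, 2), numerator)
```

The numerator degrees of freedom are then (sample size of the numerator sample) - 1, and the denominator degrees of freedom are (sample size of the denominator sample) - 1.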

Applied Exercises

8.80 Drug content assessment. Refer to Exercise 7.84 (p. 344) and the Analytical Chemistry (Dec. 15, 2009) study in which scientists used high-performance liquid chromatography to determine the amount of drug in a tablet. Recall that 25 tablets were produced at each of two different, independent sites. In Exercise 7.84 you used a 95% confidence interval to determine if the two sites produce drug concentrations with different variances. Now make the inference with a test of hypothesis at $\alpha = .05$. Use the information provided in the MINITAB printout on p. 424. (Data file: DRUGCON)

8.81 Attributes of forest access roads. Refer to the International Journal of Forest Engineering (July 1999) study of the attributes of forest access roads in Ireland, Exercise 7.110 (p. 363). Recall that the transient surface deflection (millimeters) was measured for independent random samples of 32 mineral subgrade access roads and 40 peat subgrade access roads. The results are reproduced in the accompanying table.
a. Compare the surface deflection variances of the two pavement types with a two-tailed test of hypothesis using $\alpha = .05$.
b. In Exercise 7.110, you used a 95% confidence interval to compare the surface deflection variances. Demonstrate that the inferences derived from the test and confidence interval are identical. Will this always be the case? Explain.

                                  Pavement Subgrade
                                  Mineral     Peat
Number of Roads                   32          40
Mean Surface Deflection (mm)      1.53        3.80
Standard Deviation                3.39        14.3

Source: Martin, A. M., et al. “Estimation of the serviceability of forest access roads.” International Journal of Forest Engineering, Vol. 10, No. 2, July 1999 (adapted from Table 3).

8.82 Hippo grazing patterns in Kenya. Refer to the Landscape & Ecology Engineering (Jan., 2013) study of hippopotamus grazing patterns in Kenya, Exercise 7.85 (p. 344). Recall that plots of land were sampled in two areas (a national reserve and a pastoral ranch) and the number of hippo trails from a water source was determined for each plot. Sample statistics are reproduced in the table on p. 424. In Exercise 7.85 you found a 90% confidence interval for $\sigma_1^2/\sigma_2^2$, the ratio of the variances associated with the two areas, and used it to determine if the variability in number of hippo trails from a water source in the national reserve differs from the variability in number of hippo trails from a water source in the pastoral ranch. Explain why a test of hypothesis at $\alpha = .10$ will result in the same inference, then carry out the test to verify your results.
MINITAB Output for Exercise 8.80
[MINITAB printout not reproduced]

Table for Exercise 8.82

                          National Reserve    Pastoral Ranch
Sample size:              406                 230
Mean number of trails:    .31                 .13
Standard deviation:       .40                 .30

Source: Kanga, E.M., et al. “Hippopotamus and livestock grazing: influences on riparian vegetation and facilitation of other herbivores in the Mara Region of Kenya”, Landscape & Ecology Engineering, Vol. 9, No. 1, January 2013.

8.83 Analyzing human inspection errors. Tests of product quality using human inspectors can lead to serious inspection error problems (Journal of Quality Technology). To evaluate the performance of inspectors in a new company, a quality manager had a sample of 12 novice inspectors evaluate 200 finished products. The same 200 items were evaluated by 12 experienced inspectors. The quality of each item (whether defective or nondefective) was known to the manager. The next table lists the number of inspection errors (classifying a defective item as nondefective or vice versa) made by each inspector.
a. Prior to conducting this experiment, the manager believed the variance in inspection errors was lower for experienced inspectors than for novice inspectors. Do the sample data support her belief? Test using $\alpha = .05$.
b. What is the appropriate p-value of the test you conducted in part a?

ERRORS

Novice Inspectors      Experienced Inspectors
30  35  26  40         31  15  25  19
36  20  45  31         28  17  19  18
33  29  21  48         24  10  20  21

8.84 Cooling method for gas turbines. Refer to the Journal of Engineering for Gas Turbines and Power (Jan. 2005) study of gas turbines augmented with high-pressure inlet fogging, Exercise 8.39 (p. 399). Heat rate data (kilojoules per kilowatt per hour) for each of three types of gas turbines (advanced, aeroderivative, traditional) are saved in the GASTURBINE file. In order to compare the mean heat rates of two types of gas turbines, you assumed that the heat rate variances were equal.
a. Conduct a test (at $\alpha = .05$) for equality of heat rate variances for traditional and aeroderivative augmented gas turbines. Use the result to make a statement about the validity of the inference derived in Exercise 8.33 a.
b. Conduct a test (at $\alpha = .05$) for equality of heat rate variances for advanced and aeroderivative augmented gas turbines. Use the result to make a statement about the validity of the inference derived in Exercise 8.39 b.

8.85 Insecticides used in orchards. Refer to Exercise 8.44 (p. 401). Recall that an Environmental Science & Technology study was conducted to compare the mean oxon/thion ratios at a California orchard under two weather conditions: foggy and clear/cloudy. The data are saved in the ORCHARD file. Test the assumption of equal variances required for the comparison of means to be valid. Use $\alpha = .05$.

8.86 Oil content of fried sweet potato chips. Refer to the Journal of Food Engineering (Sep. 2013) study of the characteristics of fried sweet potato chips, Exercise 8.74 (p. 419). Recall that a sample of 6 sweet potato slices fried at 130º using a vacuum fryer yielded the following statistics on internal oil content (measured in gigagrams): $\bar{y}_1 = .178$ g/g and $s_1 = .011$ g/g. A second sample of 6 sweet potato slices was obtained, only these were subjected to a two-stage frying process (again, at 130º) in an attempt to improve texture and appearance. Summary statistics on internal oil content for this second sample follow: $\bar{y}_2 = .140$ g/g and $s_2 = .002$ g/g. The researchers want to compare the mean internal oil contents of sweet potato chips fried with the two methods using a t-test. Do you recommend the researchers carry out this analysis? Explain. (Recall your answer to Exercise 7.86.)

8.87 Cracking torsion of T-beams. An experiment was conducted to study the effect of reinforced flanges on the torsional capacity of reinforced concrete T-beams (Journal of the American Concrete Institute, Jan.–Feb. 1986). Several different types of T-beams were used in the experiment, each type having a different flange width. The beams were tested under combined torsion and bending until failure (cracking). One variable of interest is the cracking torsion moment at the top of the flange of the T-beam. Cracking torsion moments for eight beams with 70-cm slab widths and eight beams with 100-cm slab widths follow:

TBEAMS

70-cm Slab Width:  6.00, 7.20, 10.20, 13.20, 11.40, 13.60, 9.20, 11.20
100-cm Slab Width: 6.80, 9.20, 8.80, 13.20, 11.20, 14.90, 10.20, 11.80

a. Is there evidence of a difference in the variation in the cracking torsion moments of the two types of T-beams? Use $\alpha = .10$.
b. What assumptions are required for the test to be valid?

8.88 Shopping vehicle and judgment. Refer to the Journal of Marketing Research (Dec., 2011) study of shopping cart design, Exercise 8.41 (p. 400). Recall that design engineers want to know whether the mean choice of vice-over-virtue score is higher when a consumer’s arm is flexed (as when carrying a shopping basket) than when the consumer’s arm is extended (as when pushing a shopping cart). The average choice score for the $n_1 = 11$ consumers with a flexed arm was $\bar{y}_1 = 59$, while the average for the $n_2 = 11$ consumers with an extended arm was $\bar{y}_2 = 43$. In which scenario is the assumption required for a t-test to compare means more likely to be violated, $s_1 = 4$ and $s_2 = 2$, or, $s_1 = 10$ and $s_2 = 15$? Explain.

Theoretical Exercises

8.89 Suppose we want to test $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_a: \sigma_1^2 \neq \sigma_2^2$. Show that the rejection region given by

$$\frac{s_1^2}{s_2^2} > F_{\alpha/2} \quad \text{or} \quad \frac{s_1^2}{s_2^2} < F_{(1-\alpha/2)}$$

where F depends on $\nu_1 = (n_1 - 1)$ df and $\nu_2 = (n_2 - 1)$ df, is equivalent to the rejection region given by

$$\frac{s_1^2}{s_2^2} > F_{\alpha/2}$$

where F depends on $\nu_1$ numerator df and $\nu_2$ denominator df, or

$$\frac{s_2^2}{s_1^2} > F^*_{\alpha/2}$$

where $F^*$ depends on $\nu_2$ numerator df and $\nu_1$ denominator df. [Hint: Use the fact (proof omitted) that

$$F_{(1-\alpha/2)} = \frac{1}{F^*_{\alpha/2}}$$

where F depends on $\nu_1$ numerator df and $\nu_2$ denominator df and $F^*$ depends on $\nu_2$ numerator df and $\nu_1$ denominator df.]

8.90 Use the results of Exercise 8.89 to show that

$$P\left(\frac{\text{Larger sample variance}}{\text{Smaller sample variance}} > F_{\alpha/2}\right) = \alpha$$

where F depends on numerator df = [(Sample size for numerator sample variance) - 1] and denominator df = [(Sample size for denominator sample variance) - 1]. [Hint: First write

$$P\left(\frac{\text{Larger sample variance}}{\text{Smaller sample variance}} > F_{\alpha/2}\right) = P\left(\frac{s_1^2}{s_2^2} > F_{\alpha/2} \ \text{ or } \ \frac{s_2^2}{s_1^2} > F_{\alpha/2}\right)$$

Then use the fact that $P(F > F_{\alpha/2}) = \alpha/2$.]
8.13 Alternative Testing Procedures: Bootstrapping and Bayesian Methods (Optional)
In optional Section 7.14, we introduced two alternative methods for finding confi-
dence intervals: the bootstrapping method and a Bayesian method. These procedures
can also be used to conduct a statistical test of hypothesis. In certain sampling situa-
tions, the conclusions drawn from one or both of these methods may be more valid
than those produced using the classical tests of Sections 8.4–8.12, especially when the
data do not adhere to the underlying assumptions.

Bootstrap Hypothesis Tests


Recall that the bootstrap is a Monte Carlo method that involves resampling—that is,
taking repeated samples of size n (with replacement) from the original sample data
set. The bootstrap testing procedure uses resampling to find an approximation for the
observed significance level (p-value) of the test. The steps required to obtain the boot-
strap p-value estimate for a test on a population mean are listed in the box.

Bootstrap p-Value for Testing a Population Mean, $H_0: \mu = \mu_0$

Let $y_1, y_2, y_3, \ldots, y_n$ represent a random sample of size n from a population with mean $E(Y) = \mu$.

Step 1 Calculate the value of the test statistic for the sample: $t_c = (\bar{y} - \mu_0)/(s/\sqrt{n})$, where $\bar{y}$ is the sample mean and s is the sample standard deviation.
Step 2 Select j, where j is the number of times you will resample. (Usually, j is a very large number, say, j = 1,000 or j = 3,000.)
Step 3 Transform each of the sample y values as follows: $x_i = y_i - \bar{y} + \mu_0$. That is, take each sample y value, subtract the sample mean, then add $\mu_0$. (This step will generate sample values with a mean equal to the hypothesized mean in $H_0$.)
Step 4 Randomly sample, with replacement, n values of X from the transformed sample data set $x_1, x_2, x_3, \ldots, x_n$.
Step 5 Repeat step 4 a total of j times.
Step 6 For each bootstrap sample, compute the test statistic: $t_j = (\bar{x}_j - \mu_0)/(s_j/\sqrt{n})$, where $\bar{x}_j$ and $s_j$ are the mean and standard deviation, respectively, of bootstrap sample j.
Step 7 Find the bootstrap estimated p-value, called the achieved significance level (ASL), as follows:
Upper-tailed test ($H_a: \mu > \mu_0$): ASL = (Number of times $t_j > t_c$)/j
Lower-tailed test ($H_a: \mu < \mu_0$): ASL = (Number of times $t_j < t_c$)/j
Two-tailed test ($H_a: \mu \neq \mu_0$): ASL = [(Number of times $t_j > |t_c|$) + (Number of times $t_j < -|t_c|$)]/j
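The seven steps in the box translate almost line for line into code. The following is a minimal Python sketch using only the standard library, not the SAS program used in the text. The function names, the `alternative` and `seed` arguments, and the sample data in the usage lines are assumptions made for illustration. Resamples whose standard deviation is zero are skipped, a guard that the box does not mention.

```python
import random
import statistics
from math import sqrt


def t_statistic(values, mu0):
    """One-sample t statistic: (mean - mu0) / (s / sqrt(n))."""
    n = len(values)
    return (statistics.mean(values) - mu0) / (statistics.stdev(values) / sqrt(n))


def bootstrap_asl_mean(y, mu0, alternative="greater", resamples=1000, seed=0):
    """Bootstrap achieved significance level (ASL) for H0: mu = mu0."""
    if alternative not in ("greater", "less", "two-sided"):
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    rng = random.Random(seed)
    n = len(y)
    t_c = t_statistic(y, mu0)                    # Step 1
    y_bar = statistics.mean(y)
    x = [yi - y_bar + mu0 for yi in y]           # Step 3: recenter at mu0
    extreme = 0
    for _ in range(resamples):                   # Steps 4-5 (Step 2: resamples = j)
        sample = rng.choices(x, k=n)             # sampling with replacement
        if statistics.stdev(sample) == 0:
            continue                             # degenerate resample: t undefined
        t_j = t_statistic(sample, mu0)           # Step 6
        if alternative == "greater":             # Step 7
            extreme += t_j > t_c
        elif alternative == "less":
            extreme += t_j < t_c
        else:
            extreme += abs(t_j) > abs(t_c)
    return extreme / resamples


# Hypothetical data, for illustration only
data = [2.1, 2.5, 1.9, 2.8, 2.2, 2.4, 2.0, 2.6]
print(bootstrap_asl_mean(data, mu0=1.0, alternative="greater"))
```

Because the ASL is a Monte Carlo estimate, it changes slightly with the seed and with the number of resamples; fixing the seed simply makes a run reproducible.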

The bootstrap ASL in step 7 is based on the definition of a p-value given in Section 8.6 (Definition 8.4): The p-value is the probability of observing a value of the test statistic that is more contradictory to $H_0$ than the value calculated in the sample. In the case of an upper-tailed test, more contradictory to $H_0$ implies a test statistic value that is greater than the calculated value in the sample. We illustrate the bootstrap procedure in the next example.

Example 8.19 Bootstrap Test for $\mu$: Benzene Contamination

Refer to Example 8.11 and the investigation of benzene contamination at a steel manufacturing plant. The benzene level (parts per million) was determined for each in a random sample of 20 air samples. (The data are saved in the BENZENE file.) Recall that the OSHA wants to test $H_0: \mu = 1$ against $H_a: \mu > 1$. Find the bootstrap ASL for this upper-tailed test. Make the appropriate conclusion using $\alpha = .05$.

Solution To find the bootstrap ASL, we follow the steps outlined above.
Step 1 From Example 8.11, the calculated value of the test statistic is $t_c = 2.95$.
Step 2 We chose j = 1,000 for resampling.
Step 3 Now $\bar{y} = 2.14$ (see Example 8.11) and $\mu_0 = 1$. Thus, we transform each of the 20 sampled benzene levels as follows: $x_i = y_i - \bar{y} + \mu_0 = y_i - 2.14 + 1$. The original sample data and the transformed values are shown in the MINITAB worksheet, Figure 8.26.
Steps 4–5 SAS was programmed to generate 1,000 random samples of size n = 20 (selecting observations with replacement) from the transformed sample data in Figure 8.26. The data for the first three resamples are shown in Table 8.10.

FIGURE 8.26
MINITAB worksheet with
transformed benzene levels
TABLE 8.10 Bootstrap Resampling from Transformed Data in Figure 8.26 (First 3 Samples)
Sample 1: -1.14 3.89 1.75 0.12 0.12 3.36 3.57 1.46 3.57 1.4
0.3 1.1 -1.14 1.83 3.89 -0.29 0.12 1.4 -1.14 3.36

Sample 2: 0.12 1.75 -0.93 1.4 3.36 0.3 -0.29 -0.99 -0.29 -0.93
0.3 -0.93 -1.14 -0.84 3.36 -0.29 1.4 1.27 1.1 1.46

Sample 3: 3.57 0.3 -0.93 0.12 1.1 1.75 3.57 3.36 1.27 1.27
-0.78 -1.14 1.83 -0.78 1.46 1.27 2.77 -0.29 -0.29 3.57

Step 6 Next, we used SAS to obtain the mean and standard deviation for each of the 1,000 samples. Then, we programmed SAS to compute the test statistic from these values as follows: $t_j = (\bar{x}_j - 1)/(s_j/\sqrt{20})$, j = 1, 2, 3, ..., 1,000.
Step 7 Each of the t values in step 6 was compared to the calculated test statistic, $t_c = 2.95$. Only three t values (those associated with samples 126, 962, and 966) exceeded 2.95. Consequently, the bootstrap ASL value is ASL = 3/1,000 = .003.

The bootstrap-achieved significance level provides an estimate of the true p-value of the test. (Note: The p-value obtained in Example 8.11 was .004.) Since $\alpha = .05$ exceeds the ASL value, we have sufficient evidence to reject the null hypothesis and to conclude that $\mu > 1$.

The general procedure for obtaining a bootstrap p-value for a test on any population parameter $\theta$ is beyond the scope of this text. Consult the references if you wish to learn about these methods. The procedure for testing a difference between two means, $(\mu_1 - \mu_2)$, however, is very similar to the procedure for a single mean, $\mu$. We list the steps in the box.

Bootstrap p-Value for Testing Equality of Population Means, $H_0: (\mu_1 - \mu_2) = 0$

Let $\bar{y}_1$ and $s_1$ represent the mean and standard deviation of a random sample of size $n_1$ from a population with mean $\mu_1$. Let $\bar{y}_2$ and $s_2$ represent the mean and standard deviation of a random sample of size $n_2$ from a population with mean $\mu_2$.

Step 1 Calculate the value of the test statistic for the sample,

$$t_c = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{(s_1^2/n_1) + (s_2^2/n_2)}}$$

Step 2 Select j, where j is the number of times you will resample.
Step 3 Find the mean $\bar{y}$ of the combined samples, then transform each of the sample values as follows:

Sample 1: $x_i = y_i - \bar{y}_1 + \bar{y}$          Sample 2: $x_i = y_i - \bar{y}_2 + \bar{y}$

(That is, take each sample value, subtract its sample mean, then add $\bar{y}$.)
Step 4 Randomly sample, with replacement, $n_1$ transformed values from the first sample. Randomly sample, with replacement, $n_2$ transformed values from the second sample.
Step 5 Repeat step 4 a total of j times.
Step 6 For each bootstrap sample, compute the test statistic:

$$t_j = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(s_1^2/n_1) + (s_2^2/n_2)}}$$

where $\bar{x}_1$ and $s_1$ are the mean and standard deviation, respectively, of bootstrap sample j for sample 1, and $\bar{x}_2$ and $s_2$ are the mean and standard deviation, respectively, of bootstrap sample j for sample 2.
Step 7 Find the bootstrap estimated p-value, called the achieved significance level (ASL), as follows:
Upper-tailed test ($H_a: \mu_1 - \mu_2 > 0$): ASL = (Number of times $t_j > t_c$)/j
Lower-tailed test ($H_a: \mu_1 - \mu_2 < 0$): ASL = (Number of times $t_j < t_c$)/j
Two-tailed test ($H_a: \mu_1 - \mu_2 \neq 0$): ASL = [(Number of times $t_j > |t_c|$) + (Number of times $t_j < -|t_c|$)]/j
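The two-sample procedure in the box can be sketched the same way. Again this is an illustrative standard-library Python version under stated assumptions: the function names, the `alternative` and `seed` arguments, and the two small data sets are hypothetical. A resample in which both bootstrap samples have zero variance is skipped, which the box does not address.

```python
import random
import statistics
from math import sqrt


def two_sample_t(a, b):
    """t = (mean_a - mean_b) / sqrt(s_a^2/n_a + s_b^2/n_b)."""
    se = sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def bootstrap_asl_two_means(y1, y2, alternative="two-sided", resamples=1000, seed=0):
    """Bootstrap achieved significance level (ASL) for H0: mu1 - mu2 = 0."""
    if alternative not in ("greater", "less", "two-sided"):
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    rng = random.Random(seed)
    t_c = two_sample_t(y1, y2)                        # Step 1
    grand_mean = statistics.mean(list(y1) + list(y2))
    m1, m2 = statistics.mean(y1), statistics.mean(y2)
    x1 = [v - m1 + grand_mean for v in y1]            # Step 3: both samples now
    x2 = [v - m2 + grand_mean for v in y2]            # share the combined mean
    extreme = 0
    for _ in range(resamples):                        # Steps 4-5
        b1 = rng.choices(x1, k=len(x1))
        b2 = rng.choices(x2, k=len(x2))
        if statistics.variance(b1) == 0 and statistics.variance(b2) == 0:
            continue                                  # t undefined for this resample
        t_j = two_sample_t(b1, b2)                    # Step 6
        if alternative == "greater":                  # Step 7
            extreme += t_j > t_c
        elif alternative == "less":
            extreme += t_j < t_c
        else:
            extreme += abs(t_j) > abs(t_c)
    return extreme / resamples


# Hypothetical data, for illustration only
site_a = [10.1, 9.8, 10.4, 10.0, 9.9, 10.3]
site_b = [12.2, 11.9, 12.5, 12.1, 12.4, 11.8]
print(bootstrap_asl_two_means(site_a, site_b))
```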

Bayesian Testing Procedures

Let $y_1, y_2, y_3, \ldots, y_n$ represent a random sample of size n selected from a population with unknown population parameter $\theta$. The Bayesian approach to testing a hypothesis about $\theta$ considers $\theta$ as a random variable with a known prior distribution, $h(\theta)$. As with interval estimation, we need to find the posterior distribution, $g(\theta \mid y_1, y_2, y_3, \ldots, y_n)$. As shown in optional Section 7.14, the posterior distribution is

$$g(\theta \mid y_1, y_2, y_3, \ldots, y_n) = \frac{f(y_1, y_2, y_3, \ldots, y_n \mid \theta) \cdot h(\theta)}{f(y_1, y_2, y_3, \ldots, y_n)}$$

where $f(y_1, y_2, y_3, \ldots, y_n) = \int f(y_1, y_2, y_3, \ldots, y_n \mid \theta) \cdot h(\theta)\, d\theta$

Suppose you want to test $H_0: \theta \le \theta_0$ versus $H_a: \theta > \theta_0$. The simplest Bayesian test uses the posterior distribution $g(\theta \mid y_1, y_2, y_3, \ldots, y_n)$ to find the following conditional probabilities:

$P(\theta \le \theta_0 \mid y_1, y_2, y_3, \ldots, y_n)$ and $P(\theta > \theta_0 \mid y_1, y_2, y_3, \ldots, y_n)$

In other words, the posterior distribution is used to find the likelihoods of $H_0$ and $H_a$ occurring. A simple rule is to accept the hypothesis that is associated with the largest conditional probability. That is,

Accept $H_0$ if $P(\theta \le \theta_0 \mid y_1, y_2, y_3, \ldots, y_n) \ge P(\theta > \theta_0 \mid y_1, y_2, y_3, \ldots, y_n)$
Reject $H_0$ (i.e., accept $H_a$) if $P(\theta \le \theta_0 \mid y_1, y_2, y_3, \ldots, y_n) < P(\theta > \theta_0 \mid y_1, y_2, y_3, \ldots, y_n)$

We illustrate the Bayesian testing method in the next example.

Example 8.20 Bayesian Test of $p$

Consider a random sample of size 20 selected from a Bernoulli probability distribution with unknown probability of success p. The data (measured as zeros and ones) are shown in Table 8.11. Assume that the prior distribution for p is a beta probability distribution with parameters $\alpha = 1$ and $\beta = 2$. Use the sum of the Bernoulli values to conduct a Bayesian test of $H_0: p \le .5$ versus $H_a: p > .5$.

TABLE 8.11 Sample of 20 Values from a Bernoulli Distribution

1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 0 1 1 0 1

Solution We know from Example 7.20 (p. 307), that X, the sum of the Bernoulli random variables, has a binomial distribution with n = 20 and probability of success p. We also know that p has a prior beta distribution with $\alpha = 1$ and $\beta = 2$. In Example 7.20, we showed that the posterior distribution of p, $g(p \mid x)$, has a beta distribution with parameters $\alpha = (X + 1)$ and $\beta = (n - X + 2)$. Summing the sample Bernoulli values in Table 8.11, we obtain X = 15. Therefore, the posterior distribution of p is a beta distribution with $\alpha = (X + 1) = 16$ and $\beta = (n - X + 2) = 7$.
Since the null and alternative hypotheses are $H_0: p \le .5$ and $H_a: p > .5$, we need to find the conditional probabilities, $P(p \le .5 \mid X = 15)$ and $P(p > .5 \mid X = 15)$. Most statistical software packages have routines for computing probabilities for a wide variety of probability distributions. We use MINITAB to find $P(p \le .5)$ for a beta distribution with $\alpha = 16$ and $\beta = 7$. The result (highlighted) is shown in Figure 8.27. You can see that $P(p \le .5 \mid X = 15) = .026$. Hence, $P(p > .5 \mid X = 15) = 1 - .026 = .974$. Since the conditional probability associated with $H_a: p > .5$ is larger, we reject $H_0$ in favor of $H_a$ and conclude that the probability of success, p, exceeds .5.
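The posterior probability in this example can be checked without MINITAB. The sketch below is an illustration in standard-library Python, not the routine used in the text. It relies on a known identity that holds when both beta parameters are positive integers: the beta cumulative probability equals an upper-tail binomial sum. The function names `beta_cdf_int` and `bayes_decision` are hypothetical.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """P(Beta(a, b) <= x) for positive INTEGER a and b.

    Uses the identity I_x(a, b) = P(Bin(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k) for k in range(a, n + 1))


def bayes_decision(p0, a, b):
    """Compare P(p <= p0 | data) with P(p > p0 | data) for H0: p <= p0."""
    prob_h0 = beta_cdf_int(p0, a, b)
    prob_ha = 1 - prob_h0
    decision = "accept H0" if prob_h0 >= prob_ha else "reject H0"
    return prob_h0, prob_ha, decision


# Table 8.11 data; prior beta(1, 2) gives posterior beta(X + 1, n - X + 2)
sample = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
x_sum, n = sum(sample), len(sample)
prob_h0, prob_ha, decision = bayes_decision(0.5, x_sum + 1, n - x_sum + 2)
print(round(prob_h0, 3), round(prob_ha, 3), decision)
```

For non-integer beta parameters the identity does not apply, and a numerical incomplete beta routine from a statistics package would be needed instead.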

FIGURE 8.27
MINITAB calculation of $P(p \le .5)$ using beta ($\alpha = 16$, $\beta = 7$) probability function

Another approach to Bayesian testing is to use the posterior distribution to find a $(1 - \alpha)100\%$ credible interval for the parameter being tested. (See optional Section 7.14.) For example, a 90% credible interval for the probability of success p in Example 8.20 is $P(L < p < U) = .90$, where L and U are the 5th and 95th percentiles of a beta ($\alpha = 16$, $\beta = 7$) distribution. Using the inverse beta function of MINITAB, we find that the 90% credible interval for p is (.53, .84). Note the interval does not contain the null hypothesized value of .5. All values of p in the credible interval exceed .5, supporting the alternative hypothesis.
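The percentiles that define the credible interval can be approximated by inverting the beta cumulative probability numerically. The following standard-library Python sketch uses bisection; it is an illustration rather than the MINITAB inverse beta function, the function names are hypothetical, and the cumulative probability formula assumes integer beta parameters.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """P(Beta(a, b) <= x) for positive INTEGER a and b (binomial-sum identity)."""
    n = a + b - 1
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k) for k in range(a, n + 1))


def beta_quantile_int(q, a, b, tol=1e-10):
    """Find x with P(Beta(a, b) <= x) = q by bisection on [0, 1]."""
    low, high = 0.0, 1.0
    while high - low > tol:
        mid = (low + high) / 2
        if beta_cdf_int(mid, a, b) < q:
            low = mid       # quantile lies to the right of mid
        else:
            high = mid      # quantile lies at or to the left of mid
    return (low + high) / 2


# 90% credible interval: 5th and 95th percentiles of beta(16, 7)
lower = beta_quantile_int(0.05, 16, 7)
upper = beta_quantile_int(0.95, 16, 7)
print(round(lower, 2), round(upper, 2))
print("interval excludes .5:", not (lower <= 0.5 <= upper))
```

Bisection works here because the cumulative probability is increasing in x, so each step halves the interval that must contain the percentile.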

Applied Exercises

8.91 Bearing strength of concrete FRP strips. Refer to the Composites Fabrication Magazine (Sept. 2004) study of the strength of fiber-reinforced polymer (FRP) composite materials, Exercise 7.100 (p. 354). Recall that 10 specimens of pultruded FRP strips were mechanically fastened to highway bridges and tested for bearing strength. The strength measurements (recorded in mega pascal units, MPa) are reproduced in the table. Use the bootstrap procedure to test (at $\alpha = .10$) whether the true mean strength of mechanically fastened FRP strips exceeds 230 MPa.

FRP

240.9 248.8 215.7 233.6 231.4 230.9 225.3 247.3 235.5 238.0

Source: Data are simulated from summary information provided in Composites Fabrication Magazine, Sept. 2004, p. 32 (Table 1).

8.92 Surface roughness of pipe. Refer to the Anti-corrosion Methods and Materials (Vol. 50, 2003) study of the surface roughness of coated interior pipe used in oil fields, Exercise 8.24 (p. 390). The data (in micrometers) for 20 sampled pipe sections are reproduced in the table. Use the bootstrap procedure to test (at $\alpha = .05$) whether the mean surface roughness of coated interior pipe, $\mu$, differs from 2 micrometers. Compare the bootstrap ASL to the p-value obtained from the test in Exercise 8.24.

ROUGHPIPE

1.72 2.50 2.16 2.13 1.06 2.24 2.31 2.03 1.09 1.40
2.57 2.64 1.26 2.05 1.19 2.13 1.27 1.51 2.41 1.95

Source: Farshad, F., and Pesacreta, T. “Coated pipe interior surface roughness as measured by three scanning probe instruments.” Anti-corrosion Methods and Materials, Vol. 50, No. 1, 2003 (Table III).

8.93 Cooling method for gas turbines. Refer to the Journal of Engineering for Gas Turbines and Power (Jan. 2005) study of three types of gas turbines augmented with high-pressure inlet fogging, Exercise 8.39 (p. 399). Heat rate data (kilojoules per kilowatt per hour) for advanced and aeroderivative gas turbines are shown in the table. Use the bootstrap procedure to test (at $\alpha = .05$) for a difference between the mean heat rates of advanced augmented gas turbines and aeroderivative augmented gas turbines. Compare the bootstrap ASL to the p-value obtained from the test in Exercise 8.39b.

GASTURBINE

Advanced:       9722 10481 9812 9669 9643 9115 9115 11588
                10888 9738 9295 9421 9105 10233 10186 9918
                9209 9532 9933 9152 9295
Aeroderivative: 16243 14628 12766 8714 9469 11948 12414

8.94 Plant investment per delivered quad. Refer to Example 8.13 (p. 382) and the comparison of electric and gas utility plants. The data on plant investment per delivered quad for 11 plants using electrical utilities and 16 plants using gas utilities are reproduced in the next table. Use the bootstrap procedure (at $\alpha = .05$) to test for a difference in the average investment/quad between all plants using gas and all those using electric utilities.

INVQUAD

Electric: 204.15 0.57 62.76 89.72 0.35 85.46
          0.78 0.65 44.38 9.28 78.60
Gas:      0.78 16.66 74.94 0.01 0.54 23.59 88.79 0.64
          0.83 91.84 7.20 66.64 0.74 64.67 165.60 0.36

8.95 Study of lunar soil. Refer to the Meteoritics (Mar. 1995) study of lunar soil evolution, Exercise 8.62 (p. 411). Recall that one theory is that the proportion p of lunar soil grains that are coated with dust and/or glass fragments will be less than .5 at the bottom of the lunar core soil sample. Assuming that the prior distribution for p is a beta probability distribution with parameters $\alpha = 1$ and $\beta = 2$, conduct a Bayesian test of the hypothesis of interest. (Note: From Exercise 8.62, 29 of 81 grains sampled from the bottom of the lunar core were coated.)

Theoretical Exercises

8.96 Let $y_1, y_2, y_3, \ldots, y_n$ represent a random sample of size n selected from a Poisson probability distribution with unknown mean $\lambda$. Let X represent the sum of the Poisson values, $X = \sum y_i$. Then X has a Poisson distribution with mean $n\lambda$. Assume that the prior distribution for $\lambda$ is an exponential probability distribution with parameter $\beta$. Find a Bayesian decision rule for testing $H_0: \lambda = \lambda_0$ versus $H_a: \lambda > \lambda_0$. [Hint: Use the posterior distribution, $g(\lambda \mid x)$, found in Exercise 7.104 (p. 355).]

8.97 Let $y_1, y_2, y_3, \ldots, y_n$ represent a random sample of size n selected from a normal probability distribution with unknown mean $\mu$ and variance $\sigma^2 = 1$. Then the sample mean, $\bar{y}$, has a normal distribution with mean $\mu$ and variance $\sigma^2 = 1/n$. Assume that the prior distribution for $\mu$ is a normal distribution with a mean of 5 and a variance of 1. Find a Bayesian decision rule for testing $H_0: \mu = \mu_0$ versus $H_a: \mu < \mu_0$. [Hint: Use the posterior distribution, $g(\mu \mid \bar{y})$, found in Exercise 7.105 (p. 355).]

• STATISTICS IN ACTION REVISITED



• Comparing Methods for Dissolving Drug Tablets—Dissolution Method Equivalence Testing

We now return to the drug assay problem outlined in the Statistics in Action application discussed at the beginning of this chapter (p. 369). Recall that a pharmaceutical company first measures the dissolution of a new drug in a Research and Development (R&D) laboratory by quantifying how much of the drug is contained in a dissolving solution; this value is expressed as percent of label strength (%LS). The process is then repeated at a manufacturing facility. Federal regulations require that quality engineers at the manufacturing site produce results equivalent to those at any other site (including the R&D lab).
Dissolution test data for an analgesic in tablet form conducted at two manufacturing sites (New Jersey and Puerto Rico) were listed in Table SIA8.1 (p. 433) and are saved in the DISSOLVE file. Recall that %LS values were obtained at four different points in time (after 20 minutes, after 40 minutes, after 60 minutes, and after 120 minutes) for each of the six test vessels. Based on the sample data, do the two sites produce equivalent assay results?

An initially appealing approach to answering this question is to conduct a test of hypothesis on the difference between the mean %LS measurements at the two sites. Let $\mu_1$ represent the population mean %LS for tests conducted at the New Jersey site and let $\mu_2$ represent the population mean %LS for tests conducted at the Puerto Rico site. If the test results at the two sites are equivalent, then $\mu_1 = \mu_2$. The null and alternative hypotheses can be stated:
$H_0: (\mu_1 - \mu_2) = 0$ (i.e., dissolution equivalence)
$H_a: (\mu_1 - \mu_2) \neq 0$ (i.e., nonequivalence)
To simplify the analysis, the statisticians suggested conducting this test at each of the four time periods separately. The above test was conducted using SAS at each time point, with the results shown in Figure SIA8.1 on the next two pages. The p-values for the two-tailed tests (highlighted on the printout) for 20, 40, 60, and 120 minutes of dissolving time are .1528, .0395, .3499, and .4956, respectively. If we select a Type I error rate of $\alpha = .05$, then we fail to reject $H_0$ (p-value > .05) for three of the four time points; only when dissolving time is set at 40 minutes is there sufficient evidence to conclude that the mean %LS values for the two sites differ. In other words, one might reasonably conclude from the hypothesis tests that the two sites produce equivalent results at dissolving times of 20, 60, and 120 minutes, but do not produce equivalent results at a dissolving time of 40 minutes.
There are several caveats to this hypothesis testing approach, as the statisticians warned in their chapter, “Dissolution Method Equivalence”. First, the idea of equivalence in the above test is established by “accepting $H_0$”. Recall that a measure of reliability for the conclusion “accept $H_0$” is $\beta$ = P(Type II error) = P(Accept $H_0$ | $H_0$ is false). For this application, $\beta$ is the probability of saying $\mu_1 = \mu_2$, when, in fact, the means differ. Since the sampling distribution of $\mu_1 - \mu_2$ is unknown when the alternative condition, $\mu_1 \neq \mu_2$, is true, the exact value of $\beta$ is unknown. Second, the notion of “practical significance” is ignored in the hypothesis test. That is, although the population means may be statistically different at $\alpha = .05$, the true difference may be small and not considered a meaningful difference in practice. Finally, the test above may have the unfortunate effect of penalizing a testing site with a small (smaller than average) %LS variance. You can see this by examining the formula for the test statistic in Section 8.7. When the difference in sample means is divided by a small standard error (which will likely occur if one site has a small variance), the resulting t-value will be large (and likely to be significant).
To overcome these problems, pharmaceutical companies have developed alternative approaches to the equivalence problem. One method, suggested by the statisticians in their chapter, requires that you first find a 90% confidence interval for $\mu_1 - \mu_2$. If the confidence interval for the difference between mean %LS values lies within equivalence limits established by the company, then accept the assays of the two sites as being equivalent. The company in this application uses the equivalence limits in Table SIA8.2. Note that the limits depend on the magnitude of the mean %LS.
Using the equivalence limits of Table SIA8.2, we will accept the assays of the two sites as being equivalent if the 90% confidence interval for $\mu_1 - \mu_2$: (a) lies between -15 and 15 when the mean %LS is less than 90, or (b) lies between -7 and 7 when the mean %LS is greater than or equal to 90. Note that this approach is equivalent to testing the following hypotheses (for those assays with mean < 90%):
$H_0: (\mu_1 - \mu_2) < -15$ or $(\mu_1 - \mu_2) > 15$ (i.e., nonequivalence)
$H_a: -15 < (\mu_1 - \mu_2) < 15$ (i.e., dissolution equivalence)
For this reason, this methodology is referred to as the two one-sided t-test (TOST).

TABLE SIA8.2 Determining Dissolution Equivalence

If Mean %LS is     Dissolution Equivalence Occurs If Mean Difference Is Between:
< 90%              -15% and 15%
≥ 90%              -7% and 7%
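The decision rule built on Table SIA8.2 is simple enough to express directly in code. The Python sketch below only illustrates the rule; the function names are hypothetical, and the confidence limits in the usage lines are made-up numbers, not values from the SAS printout. The 90% confidence interval itself must be computed separately.

```python
def equivalence_limit(mean_ls):
    """Half-width of the equivalence band from Table SIA8.2 (in %LS)."""
    return 15.0 if mean_ls < 90 else 7.0


def is_equivalent(ci_lower, ci_upper, mean_ls):
    """True if the 90% CI for mu1 - mu2 lies entirely inside the band."""
    limit = equivalence_limit(mean_ls)
    return -limit < ci_lower and ci_upper < limit


# Hypothetical confidence limits, for illustration only
print(is_equivalent(-3.2, 4.1, mean_ls=95.0))   # band is (-7, 7)
print(is_equivalent(-9.0, 2.0, mean_ls=95.0))   # lower limit falls outside (-7, 7)
print(is_equivalent(-9.0, 2.0, mean_ls=80.0))   # band is (-15, 15)
```

Note that the conclusion depends on the whole interval, not on whether it contains 0: a narrow interval that excludes 0 can still support equivalence if it sits inside the band.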
FIGURE SIA8.1
SAS Dissolution Equivalence Hypothesis Tests
[SAS printout not reproduced]

To apply TOST to the data of Table SIA8.1, we find the 90% confidence intervals for $\mu_1 - \mu_2$. These confidence intervals, as well as the mean %LS values, are also shaded in the SAS printout, Figure SIA8.1. The confidence intervals for each of the four time points are all within their respective equivalence limits (i.e., between -15 and 15 for time points 20 and 40 minutes, and between -7 and 7 for time points 60 and 120 minutes). Consequently, the data support dissolution assay equivalence between the two sites for all four dissolution times.
TOST is now considered the standard method for bioequivalence testing of pharmaceutical products and is becoming widely accepted in process engineering, chemistry, and environmental science. An excellent tutorial on TOST is given in “Beyond the t-Test: Statistical Equivalence Testing”, Analytical Chemistry (June 1, 2005). There, the authors provide insight into TOST sample size determination and into how to choose the all-important equivalence limits.

Quick Review
Key Terms
Note: Starred (*) terms are from the optional section in this chapter.
*Achieved significance level (bootstrap) 426
Alternative (research) hypothesis 370
*Bayesian testing method 429
*Bootstrap hypothesis test 426
Conclusion 418
Large-sample (normal) test 378
Likelihood ratio test statistic 376
Lower-tailed test 380
Null hypothesis 370
Observed significance level (p-value) 383
One-tailed statistical test 377
p-value 383
Power of a test 374
Rejection region 377
Test statistic 376
Two one-sided t-test 432
Two-tailed statistical test 377
Type I error 372
Type II error 372
Upper-tailed test 380

Key Formulas
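Summary of Hypothesis Tests: One-Sample Case

Parameter ($\theta$): $\mu$
  Null hypothesis ($H_0$): $\mu = \mu_0$
  Point estimator ($\hat{\theta}$): $\bar{y}$
  Large sample ($n \ge 30$); additional assumptions: none
    Test statistic: $Z = \dfrac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} \approx \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$
  Small sample ($n < 30$); additional assumptions: normal population
    Test statistic: $T = \dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}$, where $T$ is based on $\nu = (n - 1)$ df

Parameter ($\theta$): $p$
  Null hypothesis ($H_0$): $p = p_0$
  Point estimator ($\hat{\theta}$): $\hat{p} = \dfrac{y}{n}$
  Test statistic: $Z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$
  Sample size: $n$ large enough so that $n\hat{p} \ge 4$ and $n\hat{q} \ge 4$
  Additional assumptions: none

Parameter ($\theta$): $\sigma^2$
  Null hypothesis ($H_0$): $\sigma^2 = \sigma_0^2$
  Point estimator ($\hat{\theta}$): $s^2$
  Test statistic: $\chi^2 = \dfrac{(n - 1)s^2}{\sigma_0^2}$, where $\chi^2$ is based on $\nu = (n - 1)$ df
  Sample size: all $n$
  Additional assumptions: normal population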
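
As a rough illustration of the one-sample formulas above, the following Python sketch computes each test statistic from summary statistics. The function names and the numbers in the usage lines are hypothetical, chosen only to make the arithmetic easy to follow; critical values and p-values would still come from the appropriate table.

```python
from math import sqrt


def z_stat_mean(ybar, mu0, s, n):
    """Large-sample Z for H0: mu = mu0, using s in place of sigma (n >= 30)."""
    return (ybar - mu0) / (s / sqrt(n))


def t_stat_mean(ybar, mu0, s, n):
    """Small-sample T for H0: mu = mu0; returns (T, df) with df = n - 1."""
    return (ybar - mu0) / (s / sqrt(n)), n - 1


def z_stat_proportion(y, n, p0):
    """Large-sample Z for H0: p = p0, where y is the number of successes."""
    p_hat = y / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


def chi2_stat_variance(s2, sigma0_sq, n):
    """Chi-square statistic for H0: sigma^2 = sigma0^2; returns (chi2, df)."""
    return (n - 1) * s2 / sigma0_sq, n - 1


# Hypothetical summary statistics.
print(z_stat_mean(ybar=52, mu0=50, s=8, n=64))
print(t_stat_mean(ybar=10.5, mu0=10, s=1, n=16))
print(z_stat_proportion(y=60, n=100, p0=0.5))
print(chi2_stat_variance(s2=2.0, sigma0_sq=1.0, n=11))
```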

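Summary of Hypothesis Tests: Two-Sample Case

Parameter ($\theta$): $(\mu_1 - \mu_2)$, independent samples
  Null hypothesis ($H_0$): $(\mu_1 - \mu_2) = D_0$
    (If we want to detect a difference between $\mu_1$ and $\mu_2$, then $D_0 = 0$.)
  Point estimator ($\hat{\theta}$): $(\bar{y}_1 - \bar{y}_2)$
  Large samples ($n_1 \ge 30$, $n_2 \ge 30$); additional assumptions: none
    Test statistic: $Z = \dfrac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \approx \dfrac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
  Small samples (either $n_1 < 30$ or $n_2 < 30$ or both); additional assumptions: both populations
  normal with equal variances ($\sigma_1^2 = \sigma_2^2$)
    Test statistic: $T = \dfrac{(\bar{y}_1 - \bar{y}_2) - D_0}{\sqrt{s_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}$,
    where $T$ is based on $\nu = n_1 + n_2 - 2$ df and
    $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
    (For situations in which $\sigma_1^2 \ne \sigma_2^2$, see the modifications listed in the box on p. 000.)

Parameter ($\theta$): $\mu_d = (\mu_1 - \mu_2)$, matched pairs
  Null hypothesis ($H_0$): $\mu_d = D_0$
    (If we want to detect a difference between $\mu_1$ and $\mu_2$, then $D_0 = 0$.)
  Point estimator ($\hat{\theta}$): $\bar{d} = \sum_{i=1}^{n_d} d_i / n_d$, the mean of the sample differences
  Test statistic: $T = \dfrac{\bar{d} - D_0}{s_d/\sqrt{n_d}}$, where $T$ is based on $\nu = (n_d - 1)$ df
  Sample size: all $n_d$ (If $n_d \ge 30$, then the standard normal (z) test may be used.)
  Additional assumptions: population of differences $d_i$ is normal

Parameter ($\theta$): $(p_1 - p_2)$
  Null hypothesis ($H_0$): $(p_1 - p_2) = D_0$
    (If we want to detect a difference between $p_1$ and $p_2$, then $D_0 = 0$.)
  Point estimator ($\hat{\theta}$): $(\hat{p}_1 - \hat{p}_2)$
  Test statistic, for $D_0 = 0$: $Z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\hat{q} \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}}$, where $\hat{p} = \dfrac{y_1 + y_2}{n_1 + n_2}$
  Test statistic, for $D_0 \ne 0$: $Z = \dfrac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\dfrac{\hat{p}_1 \hat{q}_1}{n_1} + \dfrac{\hat{p}_2 \hat{q}_2}{n_2}}}$
  Sample size: $n_1$ and $n_2$ large enough so that $n_1\hat{p}_1 \ge 4$, $n_1\hat{q}_1 \ge 4$ and $n_2\hat{p}_2 \ge 4$, $n_2\hat{q}_2 \ge 4$
  Additional assumptions: independent samples

Parameter ($\theta$): $\dfrac{\sigma_1^2}{\sigma_2^2}$
  Null hypothesis ($H_0$): $\dfrac{\sigma_1^2}{\sigma_2^2} = 1$ (i.e., $\sigma_1^2 = \sigma_2^2$)
  Point estimator ($\hat{\theta}$): $\dfrac{s_1^2}{s_2^2}$
  Test statistic, for $H_a$: $\sigma_1^2 > \sigma_2^2$: $F = \dfrac{s_1^2}{s_2^2}$
  Test statistic, for $H_a$: $\sigma_2^2 > \sigma_1^2$: $F = \dfrac{s_2^2}{s_1^2}$
  Test statistic, for $H_a$: $\sigma_1^2 \ne \sigma_2^2$: $F = \dfrac{\text{Larger } s^2}{\text{Smaller } s^2}$
  where $F$ is based on $\nu_1$ = numerator df and $\nu_2$ = denominator df
  Sample size: all $n_1$ and $n_2$
  Additional assumptions: independent random samples from normal populations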
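
The two-sample statistics can be sketched the same way. The Python functions below are an illustrative reading of the formulas in this summary, with hypothetical names and made-up inputs; they cover the pooled t statistic, the matched-pairs t statistic, the two-proportion Z for D0 = 0, and the two-tailed F ratio.

```python
from math import sqrt


def pooled_t(y1bar, y2bar, s1, s2, n1, n2, d0=0.0):
    """Small-sample T for H0: mu1 - mu2 = D0 (equal variances); returns (T, df)."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    t = ((y1bar - y2bar) - d0) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def paired_t(x, y, d0=0.0):
    """Matched-pairs T for H0: mu_d = D0 from two equal-length lists; returns (T, df)."""
    d = [a - b for a, b in zip(x, y)]
    nd = len(d)
    dbar = sum(d) / nd
    sd = sqrt(sum((v - dbar) ** 2 for v in d) / (nd - 1))
    return (dbar - d0) / (sd / sqrt(nd)), nd - 1


def two_proportion_z(y1, n1, y2, n2):
    """Large-sample Z for H0: p1 - p2 = 0, using the pooled estimate of p."""
    p1_hat, p2_hat = y1 / n1, y2 / n2
    p_hat = (y1 + y2) / (n1 + n2)
    return (p1_hat - p2_hat) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))


def f_two_tailed(s1_sq, s2_sq, n1, n2):
    """F = larger s^2 / smaller s^2; returns (F, numerator df, denominator df)."""
    if s1_sq >= s2_sq:
        return s1_sq / s2_sq, n1 - 1, n2 - 1
    return s2_sq / s1_sq, n2 - 1, n1 - 1


# Hypothetical inputs.
print(pooled_t(10, 8, 2, 2, 8, 8))
print(paired_t([5, 7, 9, 11], [4, 5, 6, 7]))
print(two_proportion_z(30, 50, 20, 50))
print(f_two_tailed(4.0, 9.0, 10, 15))
```

Note that in the two-tailed F test the degrees of freedom follow whichever sample supplies the larger variance, which is why the function returns them alongside the ratio.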

LANGUAGE LAB
Symbol    Pronunciation           Description
H0        H-oh                    Null hypothesis
Ha        H-a                     Alternative hypothesis
α         alpha                   Probability of Type I error
β         beta                    Probability of Type II error
θ0        theta naught            Hypothesized value of population parameter in H0
μ0        mu naught               Hypothesized value of population mean in H0
D0        D naught                Hypothesized value of population difference in H0
σ0²       sigma-squared naught    Hypothesized value of population variance in H0

Chapter Summary Notes


• Elements of a test of hypothesis: null hypothesis, alternative hypothesis, test statistic, significance level (α),
rejection region, p-value, and conclusion.
• Two types of errors in a hypothesis test: Type I error (reject H0 when H0 is true), Type II error (accept H0 when H0 is
false).
• Probabilities of errors: α = P(Type I error) = P(Reject H0 | H0 true), β = P(Type II error) = P(Accept H0 | H0
false).
• Three forms of the alternative hypothesis: lower-tailed test (<), upper-tailed test (>), two-tailed test (≠).
• Observed significance level (p-value) is the smallest value of α that can be used to reject the null hypothesis.
• Decision rule for rejecting H0: (1) test statistic falls into rejection region, or (2) p-value < α.
• Power of the test = 1 - β = P(Reject H0 | H0 false).
• Key words for identifying μ as the parameter of interest: mean, average.
• Key words/phrases for identifying μ1 - μ2 as the parameter of interest: difference between means or averages,
compare two means using independent samples.
• Key words/phrases for identifying μd as the parameter of interest: mean or average of paired differences, compare two
means using matched pairs.
• Key words for identifying p as the parameter of interest: proportion, percentage, rate.
• Key words/phrases for identifying p1 - p2 as the parameter of interest: difference between proportions or percentages,
compare two proportions using independent samples.
• Key words for identifying σ² as the parameter of interest: variance, spread, variation.
• Key words/phrases for identifying σ1²/σ2² as the parameter of interest: difference between variances, compare variation
in two populations using independent samples.
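
The p-value decision rule and the relationship between α, β, and power can be illustrated with a short Python sketch for a large-sample z test. It uses only the standard library's `statistics.NormalDist`; the function names and the numbers in the usage lines are hypothetical, and the power calculation assumes an upper-tailed test with known σ.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def p_value(z, tail):
    """Observed significance level for a z statistic; tail is 'lower', 'upper', or 'two'."""
    if tail == "lower":
        return STD_NORMAL.cdf(z)
    if tail == "upper":
        return 1 - STD_NORMAL.cdf(z)
    if tail == "two":
        return 2 * (1 - STD_NORMAL.cdf(abs(z)))
    raise ValueError("tail must be 'lower', 'upper', or 'two'")


def reject_h0(z, alpha, tail):
    """Decision rule (2) from the summary: reject H0 when p-value < alpha."""
    return p_value(z, tail) < alpha


def power_upper_tailed(mu0, mu_a, sigma, n, alpha):
    """Power = 1 - beta for H0: mu = mu0 vs. Ha: mu > mu0 when the true mean is mu_a."""
    se = sigma / sqrt(n)
    cutoff = mu0 + STD_NORMAL.inv_cdf(1 - alpha) * se  # reject H0 when ybar > cutoff
    beta = NormalDist(mu_a, se).cdf(cutoff)            # P(accept H0 | mu = mu_a)
    return 1 - beta


# Hypothetical values.
print(p_value(1.96, "two"))
print(reject_h0(2.5, alpha=0.05, tail="upper"))
print(power_upper_tailed(mu0=50, mu_a=52, sigma=8, n=64, alpha=0.05))
```

When mu_a is set equal to mu0, the function returns α itself, which is a quick consistency check on the definitions of the two error probabilities.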

Supplementary Exercises
8.98 Mongolian desert ants. The Journal of Biogeography (Dec. 2003) published a study of ants in
Mongolia (Central Asia). Botanists placed seed baits at five sites in the Dry Steppe region and six
sites in the Gobi Desert and observed the number of ant species attracted to each site. These data are
listed in the table on p. 438. Is there evidence to conclude that a difference exists between the
average number of ant species found at sites in the two regions of Mongolia? Draw the appropriate
conclusion using α = .05.

Data for Exercise 8.98

GOBIANTS
Site    Region         Number of Ant Species
1       Dry Steppe     3
2       Dry Steppe     3
3       Dry Steppe     52
4       Dry Steppe     7
5       Dry Steppe     5
6       Gobi Desert    49
7       Gobi Desert    5
8       Gobi Desert    4
9       Gobi Desert    4
10      Gobi Desert    5
11      Gobi Desert    4
Source: Pfeiffer, M., et al. “Community organization and species richness of ants in Mongolia along an
ecological gradient from steppe to Gobi desert.” Journal of Biogeography, Vol. 30, No. 12, Dec. 2003.

8.99 Mongolian desert ants (continued). Refer to the Journal of Biogeography (Dec. 2003) study of ants
in Mongolia (Central Asia), Exercise 8.98, where you compared the mean number of ants at two desert
sites. Since the sample sizes were small, the variances of the populations at the two sites must be
equal in order for the inference to be valid.
a. Set up H0 and Ha for determining whether the variances are the same.
b. Use the data in the GOBIANTS file to find the test statistic for the test.
c. Give the rejection region for the test if α = .05.
d. Find the approximate p-value of the test.
e. Make the appropriate conclusion in the words of the problem.
f. What conditions are required for the test results to be valid?

8.100 Coverage of fluid mechanics. The Journal of Professional Issues in Engineering Education and
Practice (Apr. 2005) reported on the results of a 2005 survey of courses offered at undergraduate
engineering programs. Of the 90 engineering programs that participated in the 2005 survey, 68 covered
fluid mechanics. In a survey taken 20 years earlier (Engineering Education, Apr. 1986), 66 of the 100
undergraduate engineering programs covered fluid mechanics. Conduct a test to determine whether the
fraction of undergraduate engineering programs covering fluid mechanics increased from 1986 to 2005.
Use α = .01.

8.101 General engineering program. The European Journal of Engineering Education (Vol. 38, 2013)
published a study of the feasibility of adding a general engineering program to a university’s
specialized engineering programs (e.g., civil, mechanical, electrical engineering). A preliminary study
found that about half (50%) of engineering students responded favorably to a general engineering
program. Let Y represent the number of students in a sample of 10 who favor a general engineering
program and let p represent the true proportion of all students who favor a general engineering
program. Suppose you want to test H0: p = .5 against Ha: p ≠ .5. One possible procedure is to reject H0
if Y ≤ 1 or Y ≥ 8.
a. Find α for this test.
b. Find β if p = .4. What is the power of the test?
c. Find β if p = .8. What is the power of the test?

8.102 Accuracy of wet samplers. Wet samplers are standard devices used to measure the chemical
composition of precipitation. The accuracy of the wet deposition readings, however, may depend on the
number of samplers stationed in the field. Experimenters in The Netherlands collected wet deposition
measurements using anywhere from one to eight identical wet samplers (Atmospheric Environment,
Vol. 24A, 1990). For each sampler (or sampler combination), data were collected every 24 hours for an
entire year; thus, 365 readings were collected per sampler (or sampler combination). When one wet
sampler was used, the standard deviation of the hydrogen readings (measured as percentage relative to
the average reading from all eight samplers) was 6.3%. When three wet samplers were used, the standard
deviation of the hydrogen readings (measured as percentage relative to the average reading from all
eight samplers) was 2.6%. Conduct a test to compare the variation in hydrogen readings for the two
sampling schemes (i.e., one wet sampler versus three wet samplers). Test using α = .05.

8.103 Perceptions of automation problems. According to a popular model of managerial behavior, the
current state of automation in a manufacturing firm influences managers’ perceptions of problems of
automation. To investigate this proposition, researchers at Concordia University (Montreal) surveyed
managers at firms with a high level of automation and at firms with a low level of automation (IEEE
Transactions on Engineering Management, Aug. 1990). Each manager was asked to give his or her
perception of the problems of automation at the firm. Responses were measured on a 5-point scale
(1: No problem, . . . , 5: Major problem). Summary statistics for the two groups of managers, provided
in the table, were used to test the hypothesis of no difference in the mean perceptions of automation
problems between managers of highly automated and less automated manufacturing firms.

              Sample Size    Mean     Standard Deviation
Low Level     17             3.274    .762
High Level    8              3.280    .721
Source: Farhoomand, A. F., Kira D., and Williams, J. “Managers’ perceptions towards automation in
manufacturing.” IEEE Transactions on Engineering Management, Vol. 37, No. 3, Aug. 1990, p. 230.

a. Conduct the test for the researchers, assuming that the perception variances for the two groups of
managers are equal. Use α = .01.
b. Conduct the test for the researchers, if it is known that the perception variances differ for
managers at low-level and high-level firms.

8.104 Real-time scheduling with robots. Researchers at Purdue University compared human real-time
scheduling in a processing environment to an automated approach that utilizes computerized robots and
sensing devices (IEEE Transactions, Mar. 1993). The experiment consisted of eight simulated scheduling
problems. Each task was performed by a human scheduler and by the automated system. Performance was
measured by the throughput rate, defined as the number of good jobs produced weighted by product
quality. The resulting throughput rates are shown in the accompanying table. Analyze the data using a
test of hypothesis.

THRUPUT
Task    Human Scheduler    Automated Method
1       185.4              180.4
2       146.3              248.5
3       174.4              185.5
4       184.9              216.4
5       240.0              269.3
6       253.8              249.6
7       238.8              282.0
8       263.5              315.9
Source: Yih, Y., Liang, T., and Moskowitz, H. “Robot scheduling in a circuit board production line: A
hybrid OR/ANN approach.” IEEE Transactions, Vol. 25, No. 2, March 1993, p. 31 (Table 1).

8.105 Radioactive water. A problem that occurs with certain types of mining is that some by-products
tend to be mildly radioactive and these products sometimes get into our fresh water supply. The EPA has
issued regulations concerning a limit on the amount of radioactivity in supplies of drinking water.
Particularly, the maximum level for naturally occurring radiation is 5 picocuries per liter of water. A
random sample of 24 water specimens from a city’s water supply produced the sample statistics
ȳ = 4.61 picocuries per liter and s = .87 picocurie per liter.
a. Do these data provide sufficient evidence to indicate that the mean level of radiation is safe (below
the maximum level set by the EPA)? Test using α = .01.
b. Why should you want to use a small value of α for the test in part a?
c. Calculate the value of β for the test if μa = 4.5 picocuries per liter of water.
d. Calculate and interpret the p-value for the test.

8.106 PhD’s in engineering. The National Science Foundation, in a survey of 2,237 engineering graduate
students who earned their PhD degrees, found that 607 were U.S. citizens; the majority (1,630) of the
PhD degrees were awarded to foreign nationals. Conduct a test to determine whether the true proportion
of engineering PhD degrees awarded to foreign nationals exceeds .5. Use α = .01.

DDT
8.107 Contamination of fish. Refer to the U.S. Army Corps of Engineers study of contaminated fish in the
Tennessee River (Alabama).
a. Use a random number table (Table 1 of Appendix B) to generate a random sample of n = 40 observations
on DDT concentration in fish from the DDT file. Compute ȳ and s for the sample measurements.
b. The Food and Drug Administration (FDA) sets the limit for DDT content in individual fish at 5 parts
per million (ppm). Does the sample of part a provide sufficient evidence to conclude that the average
DDT content of individual fish inhabiting the Tennessee River and its creek tributaries exceeds 5 ppm?
Test using a significance level of α = .01.
c. Suppose the test of hypothesis, part b, was based on a random sample of only n = 8 fish. What are the
disadvantages of conducting this small-sample test?
d. Repeat part b using only the information on the DDT contents of a sample of 8 fish (randomly selected
from the 40 observations of part a). Compare the results of the large- and small-sample tests.

8.108 Ball bearing specifications. In the manufacture of machinery, it is essential to utilize parts
that conform to specifications. In the past, diameters of the ball bearings produced by a certain
manufacturer had a variance of .00156. To cut costs, the manufacturer instituted a less expensive
production method. The variance of the diameters of 100 randomly sampled bearings produced by the new
process was .00211. Do the data provide sufficient evidence to indicate that diameters of ball bearings
produced by the new process are more variable than those produced by the old process? Test at α = .05.

8.109 Active versus passive solar heating. Home solar heating systems can be categorized into two
groups, passive solar heating systems and active solar heating systems. In a passive solar heating
system, the house itself is a solar energy collector, whereas in an active solar heating system,
elaborate mechanical equipment is used to convert the sun’s rays into heat. Consider the difference
between the proportions of passive solar and active solar heating systems that require less than 200
gallons of oil per year in fuel consumption. Independent random samples of 50 passive and 50 active
solar-heated homes are selected and the numbers that required less than 200 gallons of oil last year
are noted, with the results given in the table on the next page. Is there evidence of a difference
between the proportions of passive and active solar-heated homes that required less than 200 gallons of
oil in fuel consumption last year? Test at a level of significance of α = .02.

Table for Exercise 8.109

                                                        Passive Solar    Active Solar
Number of Homes                                         50               50
Number That Required Less Than
200 Gallons of Oil Last Year                            37               46

8.110 Cyanide contamination. Environmental Science & Technology (Oct. 1993) reported on a study of
contaminated soil in The Netherlands. A total of 72 400-gram soil specimens were sampled, dried, and
analyzed for the contaminant cyanide. The cyanide concentration (milligrams per kilogram of soil) of
each soil specimen was determined using an infrared microscopic method. The sample resulted in a mean
cyanide level of ȳ = 84 mg/kg and a standard deviation of s = 80 mg/kg.
a. Test the hypothesis that the true mean cyanide level in soil in The Netherlands falls below
100 mg/kg. Use α = .10.
b. Would you reach the same conclusion in part a using α = .05? α = .01? Explain.

8.111 Organic carbon in sewage. Engineers periodically analyze water samples for various types of
organic material. The total organic carbon (TOC) level was measured in water samples collected at two
sewage treatment sites in England. The accompanying table gives the summary information on the TOC
levels (measured in mg/L) found in the rivers adjacent to the two sewage facilities. Since the river at
the Foxcote sewage treatment works was subject to periodic spillovers, not far upstream of the plant’s
intake, it is believed that the TOC levels found at Foxcote will have greater variation than the levels
at Bedford. Does the sample information support this hypothesis? Test at α = .05.

Bedford        Foxcote
n1 = 61        n2 = 52
ȳ1 = 5.35      ȳ2 = 4.27
s1 = .96       s2 = 1.27
Source: Pinchin, M. J. “A study of the trace organics profiles of raw and potable water systems.”
Journal of the Institute of Water Engineers & Scientists, Vol. 40, No. 1, Feb. 1986, p. 87.

8.112 Single-T swim maze. Merck Research Labs conducted an experiment to evaluate the effect of a new
drug using the Single-T swim maze. Nineteen impregnated dam rats were captured and allocated a dosage
of 12.5 milligrams of the drug. One male and one female pup were randomly selected from each resulting
litter to perform in the swim maze. Each rat pup is placed in water at one end of the maze and allowed
to swim until it successfully escapes at the opposite end. If the rat pup fails to escape after a
certain period of time, it is placed at the beginning end of the maze and given another attempt to
escape. The experiment is repeated until three successful escapes are accomplished by each rat pup. The
number of swims required by each pup to perform three successful escapes is reported in the table. Is
there sufficient evidence (at α = .10) of a difference between the mean number of swims required by
male and female rat pups?

RATPUPS
Litter    Male    Female        Litter    Male    Female
1         8       5             11        6       5
2         8       4             12        6       3
3         6       7             13        12      5
4         6       3             14        3       8
5         6       5             15        3       4
6         6       3             16        8       12
7         3       8             17        3       6
8         5       10            18        6       4
9         4       4             19        9       5
10        4       4
Source: Bradstreet, Thomas E. Merck Research Labs, BL 3-2, West Point, PA 19486.

8.113 Solder joint inspections. Current technology uses X-rays and lasers for inspection of
solder-joint defects on printed circuit boards (PCBs). (Quality Congress Transactions, 1986.) A
particular manufacturer of laser-based inspection equipment claims that its product can inspect on
average at least 10 solder joints per second when the joints are spaced .1 inch apart. The equipment was
tested by a potential buyer on 48 different PCBs. In each case, the equipment was operated for exactly
1 second. The number of solder joints inspected on each run follows:

PCB
10    9     10    10    11    9     12    8     8     9     6     10
7     10    11    9     9     13    9     10    11    10    12    8
9     9     9     7     12    6     9     10    10    8     7     9
11    12    10    0     10    11    12    9     7     9     9     10

a. The potential buyer wants to know whether the sample data refute the manufacturer’s claim. Specify
the null and alternative hypotheses that the buyer should test.
b. In the context of this exercise, what is a Type I error? A Type II error?
c. Conduct the hypothesis test you described in part a, and interpret the test’s results in the context
of this exercise. Use α = .05.

8.114 Stacked menu displays. One feature of a user-friendly computer interface is a stacked menu
display. Each time a menu item is selected, a submenu is displayed partially over the parent menu, thus
creating a series of “stacked” menus. The Special Interest Group on Computer Human Interaction Bulletin
(July 1993) reported on a study to determine the effects of the presence or absence of a stacked menu
structure on search time. Twenty-two subjects were randomly placed into one of two groups, and each was
asked to search a menu-driven software package for a particular item. In the experimental group
(n1 = 11), the stacked menu format was used; in the control group (n2 = 11), only the current menu was
displayed.
a. The researcher’s initial hypothesis is that the mean time required to find a target item does not
differ for the two menu displays. Describe the statistical method appropriate for testing this
hypothesis.
b. What assumptions are required for inferences derived from the analysis to be valid?
c. The mean search times for the two groups were 11.02 seconds and 11.07 seconds, respectively. Is this
enough information to conduct the test? Explain.
d. The observed significance level for the test, part a, exceeds .10. Interpret this result.

8.115 Performance of R&D. Does competition between separate research and development (R&D) teams in the
U.S. Department of Defense, working independently on the same project, improve performance? To answer
this question, performance ratings were assigned to each of 58 multisource (competitive) and 63
sole-source R&D contracts (IEEE Transactions on Engineering Management, Feb. 1990). With respect to
quality of reports and products, the competitive contracts had a mean performance rating of 7.62,
whereas the sole-source contracts had a mean of 6.95.
a. Set up the null and alternative hypotheses for determining whether the mean quality performance
rating of competitive R&D contracts exceeds the mean for sole-source contracts.
b. Find the rejection region for the test using α = .05.
c. The p-value for the test was reported to be between .02 and .03. What is the appropriate conclusion?

8.116 Strength of sewer pipe. The building specifications in a certain city require that the sewer pipe
used in residential areas have a mean breaking strength of more than 2,500 pounds per lineal foot. A
manufacturer who would like to supply the city with sewer pipe has submitted a bid and provided the
following additional information: An independent contractor randomly selected seven sections of the
manufacturer’s pipe and tested each for breaking strength. The results (pounds per lineal foot) follow:

SEWER
2,610    2,750    2,420    2,510    2,540    2,490    2,680

a. Is there sufficient evidence to conclude that the manufacturer’s sewer pipe meets the required
specifications? Use a significance level of α = .10.
b. Find the value of β for μa = 2575. What is the power of the test?
c. Find the value of β for μa = 2800.
d. Find the power of the test for μa = 2800.
