Statistical hypothesis testing. Formulation of hypotheses.
Validation of tests. Errors of the first and second kind.
Formulation of statistical inference. General considerations
for testing hypotheses about the equality of parameters of
independent normal populations.
7.1 INTRODUCTION
Statistics plays an important role in decision making. In statistics, one utilizes random samples to
make inferences about the population from which the samples were obtained. Statistical inference
regarding population parameters takes two forms: estimation and hypothesis testing, although both
hypothesis testing and estimation may be viewed as different aspects of the same general problem of
arriving at decisions on the basis of observed data. We already saw several estimation procedures in
earlier chapters. Hypothesis testing is the subject of this chapter. Hypothesis testing has an important
role in the application of statistics to real-life problems. Here we utilize the sampled data to make
decisions concerning the unknown distribution of a population or its parameters. Pioneering work
on the explicit formulation as well as the fundamental concepts of the theory of hypothesis testing
are due to J. Neyman and E. S. Pearson.
A statistical hypothesis is a statement concerning the probability distribution of a random variable
or population parameters that are inherent in a probability distribution. The following example
illustrates the concept of hypothesis testing. An important industrial problem is that of accepting or
rejecting lots of manufactured products. Before releasing each lot for the consumer, the manufacturer
usually performs some tests to determine whether the lot conforms to acceptable standards. Let us
say that both the manufacturer and the consumer agree that if the proportion of defectives in a lot is
less than or equal to a certain number p, the lot will be released. Very often, instead of testing every
item in the lot, we may test only a few items chosen at random from the lot and make decisions
about the proportion of defectives in the lot; that is, we make the decisions about the population
on the basis of sample information. Such decisions are called statistical decisions. In attempting to
reach decisions, it is useful to make some initial conjectures about the population involved. Such
conjectures are called statistical hypotheses. Sometimes the results from the sample may be markedly
different from those expected under the hypothesis. Then we can say that the observed differences
are significant and we would be inclined to reject the initial hypothesis. These procedures that enable
us to decide whether to accept or reject hypotheses or to determine whether observed samples differ
significantly from expected results are called tests of hypotheses, tests of significance, or rules of decision.
In any hypothesis testing problem, we formulate a null hypothesis and an alternative hypothesis such that
if we reject the null, then we have to accept the alternative. The null hypothesis usually is a statement
of either the “status quo” or “no effect.” A guideline for selecting a null hypothesis is that when the
objective of an experiment is to establish a claim, the nullification of the claim should be taken as
the null hypothesis. The experiment is often performed to determine whether the null hypothesis is
false. For example, suppose the prosecution wants to establish that a certain person is guilty. The null
hypothesis would be that the person is innocent and the alternative would be that the person is guilty.
Thus, the claim itself becomes the alternative hypothesis. Customarily, the alternative hypothesis is
the statement that the experimenter believes to be true. For example, the alternative hypothesis is
the reason a person is arrested (police suspect the person is not innocent). Once the hypotheses
have been stated, appropriate statistical procedures are used to determine whether to reject the null
hypothesis. For the testing procedure, one begins with the assumption that the null hypothesis is true.
If the information furnished by the sampled data strongly contradicts (beyond a reasonable doubt)
the null hypothesis, then we reject it in favor of the alternative hypothesis. If we do not reject the
null, then we automatically reject the alternative. Note that we always make a decision with respect
to the null hypothesis. Note that the failure to reject the null hypothesis does not necessarily mean
that the null hypothesis is true. For example, a person being judged “not guilty” does not mean the
person is innocent. This basically means that there is not enough evidence to reject the null hypothesis
(presumption of innocence) beyond “a reasonable doubt.”
We summarize the elements of a statistical hypothesis in the following.
THE ELEMENTS OF A STATISTICAL HYPOTHESIS
1. The null hypothesis, denoted by H0, is usually the nullification of a claim. Unless evidence from the
data indicates otherwise, the null hypothesis is assumed to be true.
2. The alternate hypothesis, denoted by Ha (or sometimes denoted by H1), is customarily the claim
itself.
3. The test statistic, denoted by TS, is a function of the sample measurements upon which the
statistical decision, to reject or not reject the null hypothesis, will be based.
4. A rejection region (or a critical region) is the region (denoted by RR) that specifies the values
of the observed test statistic for which the null hypothesis will be rejected. This is the range of
values of the test statistic that corresponds to the rejection of H0 at some fixed level of significance,
α, which will be explained later.
5. Conclusion: If the value of the observed test statistic falls in the rejection region, the null hypothesis
is rejected and we will conclude that there is enough evidence to decide that the alternative
hypothesis is true. If the TS does not fall in the rejection region, we conclude that we cannot reject
the null hypothesis.
In practice one may have hypotheses such as H0 : μ = μ0 against one of the following alternatives:

Ha : μ ≠ μ0, called a two-tailed alternative,
or Ha : μ < μ0, called a lower (or left) tailed alternative,
or Ha : μ > μ0, called an upper (or right) tailed alternative.
A test with a lower or upper tailed alternative is called a one-tailed test. In an applied hypothesis testing
problem, we can use the following general steps.
GENERAL METHOD FOR HYPOTHESIS TESTING
1. From the (word) problem, determine the appropriate null hypothesis, H0, and the alternative, Ha.
2. Identify the appropriate test statistics and calculate the observed test statistic from the data.
3. Find the rejection region by looking up the critical value in the appropriate table.
4. Draw the conclusion: Reject or fail to reject the null hypothesis, H0.
5. Interpret the results: State in words what the conclusion means to the problem we started with.
It is always necessary to state a null and an alternate hypothesis for every statistical test performed.
All possible outcomes should be accounted for by the two hypotheses.
Example 7.1.1
In a coin-tossing experiment, let p be the probability of heads. We start with the claim that the coin is fair,
that is, H0 : p = 1/2. We test this against one of the following alternatives:
(a) Ha: The coin is not fair (p ≠ 1/2). This is a two-tailed alternative.
(b) Ha: The coin is biased in favor of heads (p > 1/2). This is an upper tailed alternative.
(c) Ha: The coin is biased in favor of tails (p < 1/2). This is a lower tailed alternative.
It is important to observe that the test statistic is a function of a random sample. Thus, the test statistic
itself is a random variable whose distribution is known under the null hypothesis. The value of a test
statistic when specific sample values are substituted is called the observed test statistic or simply test
statistic.
For example, consider the hypothesis H0 : μ = μ0 versus Ha : μ ≠ μ0, where μ0 is known. Assume
that the population is normal with a known variance σ². Consider X̄, an unbiased estimator of μ
based on the random sample X1, . . . , Xn. Then Z = (X̄ − μ0)/(σ/√n) is a function of the random
sample X1, . . . , Xn, and has a known distribution, a standard normal, under H0. If x1, x2, . . . , xn are
specific sample values, then z = (x̄ − μ0)/(σ/√n) is called the observed test statistic or simply test
statistic.
Definition 7.1.1 A hypothesis is said to be a simple hypothesis if that hypothesis uniquely specifies
the distribution from which the sample is taken. Any hypothesis that is not simple is called a composite
hypothesis.
Example 7.1.2
Refer to Example 7.1.1. The null hypothesis p = 1/2 is simple, because the hypothesis completely specifies
the distribution, which in this case will be a binomial with p = 1/2 and with n being the number of tosses.
The alternative hypothesis p ≠ 1/2 is composite because the distribution now is not completely specified
(we do not know the exact value of p).
Because the decision is based on the sample information, we are prone to commit errors. In a statistical
test, it is impossible to establish the truth of a hypothesis with 100% certainty. There are two possible
types of errors. On the one hand, one can make an error by rejecting H0 when in fact it is true. On
the other hand, one can also make an error by failing to reject the null hypothesis when in fact it is
false. Because the errors arise as a result of wrong decisions, and the decisions themselves are based
on random samples, it follows that the errors have probabilities associated with them. We now have
the following definitions.
Table 7.1 Statistical Decision and Error Probabilities

                     | True state of null hypothesis
Statistical decision | H0 true           | H0 false
Do not reject H0     | Correct decision  | Type II error (β)
Reject H0            | Type I error (α)  | Correct decision

The decision and the errors are represented in Table 7.1.
Definition 7.1.2 (a) A type I error is made if H0 is rejected when in fact H0 is true. The probability of
type I error is denoted by α. That is,
P (rejecting H0|H0 is true) = α.
The probability of type I error, α, is called the level of significance.
(b) A type II error is made if H0 is accepted when in fact Ha is true. The probability of a type II error is
denoted by β. That is,
P (not rejecting H0|H0 is false) = β.
It is desirable that a test should have α = β = 0 (this can be achieved only in trivial cases), or at least
we prefer to use a test that minimizes both types of errors. Unfortunately, it so happens that for a
fixed sample size, as α decreases, β tends to increase and vice versa. There are no hard and fast rules
that can be used to make the choice of α and β. This decision must be made for each problem based
on quality and economic considerations. However, in many situations it is possible to determine
which of the two errors is more serious. It should be noted that a type II error is only an error in
the sense that a chance to correctly reject the null hypothesis was lost. It is not an error in the sense
that an incorrect conclusion was drawn, because no conclusion is made when the null hypothesis is
not rejected. In the case of type I error, a conclusion is drawn that the null hypothesis is false when,
in fact, it is true. Therefore, type I errors are generally considered more serious than type II errors.
For example, it is mostly agreed that finding an innocent person guilty is a more serious error than
finding a guilty person innocent. Here, the null hypothesis is that the person is innocent, and the
alternate hypothesis is that the person is guilty. “Not rejecting the null hypothesis” is equivalent to
acquitting a defendant. It does not prove that the null hypothesis is true, or that the defendant is
innocent. In statistical testing, the significance level α is the probability of wrongly rejecting the null
hypothesis when it is true (that is, the risk of finding an innocent person guilty). Here the type II risk
is acquitting a guilty defendant. The usual approach to hypothesis testing is to find a test procedure
that limits α, the probability of type I error, to an acceptable level while trying to lower β as much as
possible.
The consequences of different types of errors are, in general, very different. For example, if a doctor
tests for the presence of a certain illness, incorrectly diagnosing the presence of the disease (type I
error) will cause a waste of resources, not to mention the mental agony to the patient. On the other
hand, failure to determine the presence of the disease (type II error) can lead to a serious health risk.
To formulate a hypothesis testing problem, consider the following situation. Suppose a toy store
chain claims that at least 80% of girls under 8 years old prefer dolls over other types of toys. We feel
that this claim is inflated. In an attempt to dispose of this claim, we observe the buying pattern of 20
randomly selected girls under 8 years old, and we observe X, the number of girls under 8 years old
who buy stuffed toys or dolls. Now the question is, how can we use X to confirm or reject the store’s
claim? Let p be the probability that a girl under 8 chosen at random prefers stuffed toys or dolls. The
question now can be reformulated as a hypothesis testing problem. Is p ≥ 0.8 or p < 0.8? Because we
would like to reject the store’s claim only if we are highly certain of our decision, we should choose
the null hypothesis to be H0 : p ≥ 0.8, the rejection of which is considered to be more serious. The
null hypothesis should be H0 : p ≥ 0.8, and the alternative Ha : p < 0.8. In order to make the null
hypothesis simple, we will use H0 : p = 0.8, which is the boundary value with the understanding that
it really represents H0 : p ≥ 0.8. We note that X, the number of girls under 8 years old who prefer
stuffed toys or dolls, is a binomial random variable. Clearly a large sample value of X would favor
H0. Suppose we arbitrarily choose to accept the null hypothesis if X > 12. Because our decision is
based on only a sample of 20 girls under 8, there is always a possibility of making errors whether
we accept or reject the store chain’s claim. In the following example, we will now formally state this
problem and calculate the error probabilities based on our decision rule.
Example 7.1.3
A toy store chain claims that at least 80% of girls under 8 years old prefer dolls over other types of toys.
After observing the buying pattern of many girls under 8 years old, we feel that this claim is inflated. In an
attempt to dispose of this claim, we observe the buying pattern of 20 randomly selected girls under 8 years
old, and we observe X, the number of girls who buy stuffed toys or dolls. We wish to test the hypothesis
H0 : p = 0.8 against Ha : p < 0.8. Suppose we decide to accept H0 if X > 12 (that is, X ≥ 13). This
means that if X ≤ 12 (that is, X < 13) we will reject H0.
(a) Find α.
(b) Find β for p = 0.6.
(c) Find β for p = 0.4.
(d) Find the rejection region of the form {X ≤ K} so that (i) α = 0.01; (ii) α = 0.05.
(e) For the alternative Ha :p = 0.6, find β for the values of α in part (d).
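The error probabilities in parts (a) through (c) can be checked directly from the binomial distribution. The following sketch, using only the Python standard library, evaluates the stated decision rule (accept H0 when X > 12); the helper name `binom_cdf` is ours, not from the text.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n = 20  # number of girls sampled
# Decision rule: reject H0: p = 0.8 when X <= 12.
alpha = binom_cdf(12, n, 0.8)        # (a) P(reject H0 | H0 true)
beta_06 = 1 - binom_cdf(12, n, 0.6)  # (b) P(accept H0 | p = 0.6)
beta_04 = 1 - binom_cdf(12, n, 0.4)  # (c) P(accept H0 | p = 0.4)

print(round(alpha, 4), round(beta_06, 4), round(beta_04, 4))
```

Note how β shrinks as the true p moves farther from the hypothesized 0.8: the farther the alternative is from the null, the easier it is to detect.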
Then by definition,

β = P(X̄ ≤ 15.8225 when μ = 16).

Consequently, for μ = 16,

β = P( (X̄ − 16)/(σ/√n) ≤ (15.8225 − 16)/(3/√36) )
  = P(Z ≤ −0.36)
  = 0.3594.
That is, under the given information, there is a 35.94% chance of not rejecting a false null hypothesis.
7.1.1 Sample Size
It is clear from the preceding example that once we are given the sample size n, an α, a simple
alternative Ha, and a test statistic, we have no control over β and it is exactly determined. Hence, for
a given sample size and test statistic, any effort to lower β will lead to an increase in α and vice versa.
This means that for a test with fixed sample size it is not possible to simultaneously reduce both α
and β. We also notice from Example 7.1.4 that by increasing the sample size n, we can decrease β
(for the same α) to an acceptable level. The following discussion illustrates that it may be possible to
determine the sample size for a given α and β.
Suppose we want to test H0 : μ = μ0 versus Ha : μ > μ0. Given α and β, we want to find n, the
sample size, and K, the point at which the rejection begins. We know that
α = P(X̄ > K when μ = μ0)
  = P( (X̄ − μ0)/(σ/√n) > (K − μ0)/(σ/√n) when μ = μ0 )          (7.1)
  = P(Z > zα)

and

β = P(X̄ ≤ K when μ = μa)
  = P( (X̄ − μa)/(σ/√n) ≤ (K − μa)/(σ/√n) when μ = μa )          (7.2)
  = P(Z ≤ −zβ).

From Equations (7.1) and (7.2),

zα = (K − μ0)/(σ/√n)   and   −zβ = (K − μa)/(σ/√n).
This gives us two equations with two unknowns (K and n), and we can proceed to solve them.
Eliminating K, we get
μ0 + zα(σ/√n) = μa − zβ(σ/√n).

From this we can derive

√n = (zα + zβ)σ/(μa − μ0).

Thus, the sample size for an upper tail alternative hypothesis is

n = (zα + zβ)²σ²/(μa − μ0)².
The sample size increases with the square of the standard deviation and decreases with the square of the difference
between mean value of the alternative hypothesis and the mean value under the null hypothesis. Note that in
real-world problems, care should be taken in the choice of the value of μa for the alternative hypothesis. It may
be tempting for a researcher to take a large value of μa in order to reduce the required sample size. This will
seriously affect the accuracy (power) of the test. This alternative value must be realistic within the experiment
under study. Care should also be taken in the choice of the standard deviation σ. Using an underestimated
value of the standard deviation to reduce the sample size will result in inaccurate conclusions similar to
overestimating the difference of means. Usually, the value of σ is estimated using a similar study conducted
earlier. The problem could be that the previous study may be old and may not represent the new reality. When
accuracy is important, it may be necessary to conduct a pilot study only to get some idea on the estimate of σ.
Once we determine the necessary sample size, we must devise a procedure by which the appropriate data can
be randomly obtained. This aspect of the design of experiments is discussed in Chapter 9.
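The sample-size formula above is easy to evaluate with the standard library's NormalDist. The function name and the illustrative numbers (μ0 = 16, μa = 17, σ = 3) below are our own choices for the sketch, not values from the text; results are rounded up since n must be an integer.

```python
from math import ceil
from statistics import NormalDist

def sample_size(alpha, beta, sigma, mu0, mua):
    """n = (z_alpha + z_beta)^2 * sigma^2 / (mua - mu0)^2, rounded up."""
    z_a = NormalDist().inv_cdf(1 - alpha)  # upper-tail critical value z_alpha
    z_b = NormalDist().inv_cdf(1 - beta)   # z_beta
    return ceil((z_a + z_b) ** 2 * sigma ** 2 / (mua - mu0) ** 2)

# Detect a shift from mu0 = 16 to mua = 17 with sigma = 3,
# alpha = 0.05 and beta = 0.10 (power 0.90).
print(sample_size(0.05, 0.10, 3, 16, 17))  # -> 78
```

Halving the difference μa − μ0 quadruples the required n, which is the sensitivity to μa that the paragraph above warns about.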
7.2 THE NEYMAN-PEARSON LEMMA
In practical hypothesis testing situations, there are typically many tests possible with significance level α for a null
hypothesis versus alternative hypothesis (see Project 7A). This leads to some important questions, such as (1)
how to decide on the test statistic and (2) how to know that we selected the best rejection region. In this section,
we study the answer to these questions using the Neyman-Pearson approach.
Definition 7.2.1 Suppose that W is the test statistic and RR is the rejection region for a test of hypothesis concerning
the value of a parameter θ. Then the power of the test is the probability that the test rejects H0 when the alternative is
true. That is,
π = Power(θ) = P(W in RR when the parameter value is an alternative θ).

If H0 : θ = θ0 and Ha : θ ≠ θ0, then the power of the test at some θ = θ1 ≠ θ0 is

Power(θ1) = P(reject H0 | θ = θ1).

But β(θ1) = P(accept H0 | θ = θ1). Therefore,

Power(θ1) = 1 − β(θ1).
A good test will have high power.
Note that the power of a test of H0 cannot be found until some true situation under Ha is specified. That is,
the sampling distribution of the test statistic when Ha is true must be known or assumed. Because
β depends on the alternative hypothesis, which being composite most of the time does not specify
the distribution of the test statistic, it is important to observe that the experimenter cannot control
β. For example, the alternative Ha : μ < μ0 does not specify the value of μ, as in the case of the null
hypothesis, H0 : μ = μ0.
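Because β, and hence the power, depends on which alternative value is true, it is natural to tabulate the power as a function of the true mean. A minimal sketch for the upper-tailed z-test, assuming a normal population with known σ (the function name and the numbers μ0 = 16, σ = 3, n = 36 are our illustrative choices):

```python
from math import sqrt
from statistics import NormalDist

def power_upper_z(mu1, mu0, sigma, n, alpha=0.05):
    """Power of the upper-tailed z-test of H0: mu = mu0 at the alternative mu1."""
    z_a = NormalDist().inv_cdf(1 - alpha)      # critical value z_alpha
    shift = (mu1 - mu0) / (sigma / sqrt(n))    # standardized distance from H0
    return 1 - NormalDist().cdf(z_a - shift)   # P(reject H0 | mu = mu1)

# At mu1 = mu0 the "power" is just alpha; it climbs toward 1 as mu1 grows.
for mu1 in (16, 16.5, 17):
    print(mu1, round(power_upper_z(mu1, 16, 3, 36), 4))
```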
Example 7.2.1
Let X1, . . . , Xn be a random sample from a Poisson distribution with parameter λ, that is, the pmf is
given by f(x) = e^(−λ)λ^x/x!. Then the hypothesis H0 : λ = 1 uniquely specifies the distribution, because
f(x) = e^(−1)/x!, and hence is a simple hypothesis. The hypothesis Ha : λ > 1 is composite, because f(x) is
not uniquely determined.
Definition 7.2.2 A test at a given α of a simple hypothesis H0 versus the simple alternative Ha that has
the largest power among tests with the probability of type I error no larger than the given α is called a most
powerful test.
Consider the test of hypothesis H0 : θ = θ0 versus Ha : θ = θ1. If α is fixed, then our interest is to
make β as small as possible. Because β = 1 − Power(θ1), by minimizing β we would obtain a most
powerful test. The following result says that among all tests with given probability of type I error, the
likelihood ratio test given later minimizes the probability of a type II error, in other words, it is most
powerful.
Theorem 7.2.1 (Neyman-Pearson Lemma) Suppose that one wants to test a simple hypothesis H0 :
θ = θ0 versus the simple alternative hypothesis Ha : θ = θ1 based on a random sample X1, . . . , Xn from a
distribution with parameter θ. Let L(θ) ≡ L(θ; X1, . . . , Xn) > 0 denote the likelihood of the sample when
the value of the parameter is θ. If there exist a positive constant K and a subset C of the sample space R^n (the
Euclidean n-space) such that

1. L(θ0)/L(θ1) ≤ K for (x1, x2, . . . , xn) ∈ C,
2. L(θ0)/L(θ1) ≥ K for (x1, x2, . . . , xn) ∈ C′, where C′ is the complement of C, and
3. P[(X1, . . . , Xn) ∈ C; θ0] = α,

then the test with critical region C will be the most powerful test for H0 versus Ha. We call α the size of the
test and C the best critical region of size α.
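The content of conditions 1 and 2 can be seen numerically: for a normal sample with μ1 > μ0, the ratio L(θ0)/L(θ1) depends on the data only through x̄ and decreases as x̄ grows, so a region of the form {L(θ0)/L(θ1) ≤ K} is exactly an upper tail in x̄. A small sketch (the sample values are made up for illustration):

```python
from math import exp, sqrt, pi

def likelihood(xs, mu, sigma):
    """Joint normal likelihood of the sample xs at mean mu."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return (1 / (sqrt(2 * pi) * sigma)) ** n * exp(-ss / (2 * sigma ** 2))

def lr(xs, mu0, mu1, sigma):
    """Neyman-Pearson ratio L(theta0)/L(theta1)."""
    return likelihood(xs, mu0, sigma) / likelihood(xs, mu1, sigma)

# Three samples with increasing xbar: the ratio strictly decreases,
# so {lr <= K} corresponds to {xbar >= c} -- an upper-tailed test.
samples = [[0.0, 0.2, 0.1], [0.5, 0.7, 0.6], [1.0, 1.2, 1.1]]
ratios = [lr(xs, mu0=0.0, mu1=1.0, sigma=1.0) for xs in samples]
print(ratios)
```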
Proof. We prove this theorem for continuous random variables. For discrete random variables, the
proof is identical with sums replacing the integrals. Let S be some region in R^n, an n-dimensional
Euclidean space. For simplicity we will use the following notation:

∫_S L(θ) = ∫ · · · ∫_S L(θ; x1, x2, . . . , xn) dx1 dx2 · · · dxn.
Note that

P((X1, . . . , Xn) ∈ C; θ0) = ∫_C f(x1, . . . , xn; θ0) dx1 · · · dxn = ∫_C L(θ0; x1, . . . , xn) dx1 · · · dxn.
Suppose that there is another critical region, say B, of size less than or equal to α, that is,
∫_B L(θ0) ≤ α. Then

0 ≤ ∫_C L(θ0) − ∫_B L(θ0),

because ∫_C L(θ0) = α by assumption 3.
Therefore,

0 ≤ ∫_C L(θ0) − ∫_B L(θ0)
  = ∫_{C∩B} L(θ0) + ∫_{C∩B′} L(θ0) − ∫_{C∩B} L(θ0) − ∫_{C′∩B} L(θ0)
  = ∫_{C∩B′} L(θ0) − ∫_{C′∩B} L(θ0).
Using assumption 1 of Theorem 7.2.1, KL(θ1) ≥ L(θ0) at each point in the region C and hence in
C ∩ B′. Thus

∫_{C∩B′} L(θ0) ≤ K ∫_{C∩B′} L(θ1).
By assumption 2 of the theorem, KL(θ1) ≤ L(θ0) at each point in C′, and hence in C′ ∩ B. Thus,

∫_{C′∩B} L(θ0) ≥ K ∫_{C′∩B} L(θ1).
Therefore,

0 ≤ ∫_{C∩B′} L(θ0) − ∫_{C′∩B} L(θ0) ≤ K { ∫_{C∩B′} L(θ1) − ∫_{C′∩B} L(θ1) }.

That is,

0 ≤ K { ∫_{C∩B} L(θ1) + ∫_{C∩B′} L(θ1) − ∫_{C∩B} L(θ1) − ∫_{C′∩B} L(θ1) }
  = K { ∫_C L(θ1) − ∫_B L(θ1) }.
As a result,

∫_C L(θ1) ≥ ∫_B L(θ1).
Because this is true for every critical region B of size ≤ α, C is the best critical region of size α, and
the test with critical region C is the most powerful test of size α.
When testing two simple hypotheses, the existence of a best critical region is guaranteed by the
Neyman-Pearson lemma. In addition, the foregoing theorem provides a means for determining
what the best critical region is. However, it is important to note that Theorem 7.2.1 gives only the
form of the rejection region; the actual rejection region depends on the specific value of α.
In real-world situations, we are seldom presented with the problem of testing two simple hypotheses.
There is no general result in the form of Theorem 7.2.1 for composite hypotheses. However, for
hypotheses of the form H0 : θ = θ0 versus Ha : θ > θ0, we can take a particular value θ1 > θ0 and
then find a most powerful test for H0 : θ = θ0 versus Ha : θ = θ1. If this test (that is, the rejection
region of the test) does not depend on the particular value θ1, then this test is said to be a uniformly
most powerful test for H0 : θ = θ0 versus Ha : θ > θ0.
The following example illustrates the use of the Neyman-Pearson lemma.
Consider testing H0 : μ = μ0 for a normal population with known variance σ², using the test statistic

Z = (X̄ − μ0)/(σ/√n).
For Ha : μ = μ1 > μ0, the rejection region for the most powerful test would be
Reject H0 if z > zα.
On the other hand for Ha : μ = μ2 < μ0, the rejection region for the most powerful test would be
Reject H0 if z < −zα.
Thus, the rejection region depends on the specific alternative. Consequently, the two-sided hypothesis
just given has no UMP test.
7.3 LIKELIHOOD RATIO TESTS
In this section, we shall study a general procedure that is applicable when one or both of H0 and Ha are
composite. In fact, this procedure works for simple hypotheses as well. This method is based on the
maximum likelihood estimation and the ratio of likelihood functions used in the Neyman-Pearson
lemma. We assume that the pdf or pmf of the random variable X is f(x, θ), where θ can be one or
more unknown parameters. Let Θ represent the total parameter space, that is, the set of all possible
values of the parameter θ given by either H0 or Ha.
Consider the hypotheses

H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θa = Θ − Θ0,

where θ is the unknown population parameter (or parameters) with values in Θ, and Θ0 is a subset of Θ.
Let L(θ) be the likelihood function based on the sample X1, . . . , Xn. Now we define the likelihood
ratio corresponding to the hypotheses H0 and Ha. This ratio will be used as a test statistic for the
testing procedure that we develop in this section. This is a natural generalization of the ratio test used
in the Neyman-Pearson lemma when both hypotheses were simple.
Definition 7.3.1 The likelihood ratio λ is the ratio

λ = [ max_{θ∈Θ0} L(θ; x1, . . . , xn) ] / [ max_{θ∈Θ} L(θ; x1, . . . , xn) ] = L*₀/L*.

We note that 0 ≤ λ ≤ 1. Because λ is the ratio of nonnegative functions, λ ≥ 0. Because Θ0 is a subset
of Θ, we know that max_{θ∈Θ0} L(θ) ≤ max_{θ∈Θ} L(θ). Hence, λ ≤ 1.
If the maximum of L in Θ0 is much smaller than the maximum of L in Θ, that is, if
λ is small, it would appear that the data X1, . . . , Xn do not support the null hypothesis θ ∈ Θ0. On
the other hand, if λ is close to 1, one could conclude that the data support the null hypothesis, H0.
Therefore, small values of λ result in rejection of the null hypothesis, and values near 1 result in a
decision in support of the null hypothesis.
For the evaluation of λ, it is important to note that max_{θ∈Θ} L(θ) = L(θ̂_ml), where θ̂_ml is the maximum
likelihood estimator of θ ∈ Θ, and max_{θ∈Θ0} L(θ) is the likelihood function with unknown parameters
replaced by their maximum likelihood estimators subject to the condition that θ ∈ Θ0. We can
summarize the likelihood ratio test as follows.
LIKELIHOOD RATIO TESTS (LRTs)
To test

H0 : θ ∈ Θ0 vs. Ha : θ ∈ Θa,

the likelihood ratio

λ = [ max_{θ∈Θ0} L(θ; x1, . . . , xn) ] / [ max_{θ∈Θ} L(θ; x1, . . . , xn) ] = L*₀/L*

will be used as the test statistic.
The rejection region for the likelihood ratio test is given by

Reject H0 if λ ≤ K.

K is selected such that the test has the given significance level α.
Example 7.3.1
Let X1, . . . , Xn be a random sample from an N(μ, σ²) population. Assume that σ² is known. We wish to
test, at level α, H0 : μ = μ0 vs. Ha : μ ≠ μ0. Find an appropriate likelihood ratio test.
Solution
We have seen that for testing

H0 : μ = μ0 vs. Ha : μ ≠ μ0

there is no uniformly most powerful test. The likelihood function is

L(μ) = (1/(√(2π)σ))^n e^( −Σᵢ(xi − μ)²/(2σ²) ).

Here, Θ0 = {μ0} and Θa = R − {μ0}.
Hence,

L*₀ = max_{μ=μ0} (1/(√(2π)σ))^n e^( −Σᵢ(xi − μ)²/(2σ²) )
    = (1/(√(2π)σ))^n e^( −Σᵢ(xi − μ0)²/(2σ²) ).
Similarly,

L* = max_{−∞<μ<∞} (1/(√(2π)σ))^n e^( −Σᵢ(xi − μ)²/(2σ²) ).

Because the only unknown parameter in the parameter space is μ, −∞ < μ < ∞, the maximum of the
likelihood function is achieved when μ equals its maximum likelihood estimator, that is,

μ̂_ml = X̄.
Therefore, with a simple calculation we have

λ = e^( −Σᵢ(xi − μ0)²/(2σ²) ) / e^( −Σᵢ(xi − x̄)²/(2σ²) ) = e^( −n(x̄ − μ0)²/(2σ²) ).
Thus, the likelihood ratio test has the rejection region

Reject H0 if λ ≤ K,

which is equivalent to

−(n/(2σ²))(X̄ − μ0)² ≤ ln K ⇔ (X̄ − μ0)²/(σ²/n) ≥ −2 ln K ⇔ |X̄ − μ0|/(σ/√n) ≥ √(−2 ln K) = c1, say.
Note that we use the symbol ⇔ to mean "if and only if." We now compute c1. Under H0,
(X̄ − μ0)/(σ/√n) ∼ N(0, 1). Observe that

α = P( |X̄ − μ0|/(σ/√n) ≥ c1 )

gives the value of c1 as c1 = zα/2. Hence, the LRT for the given hypothesis is

Reject H0 if |X̄ − μ0|/(σ/√n) ≥ zα/2.
Thus, in this case, the likelihood ratio test is equivalent to the two-tailed z-test.
In fact, when both the hypotheses are simple, the likelihood ratio test is identical to the Neyman-
Pearson test. We can now summarize the procedure for the likelihood ratio test, LRT.
PROCEDURE FOR THE LIKELIHOOD RATIO TEST (LRT)
1. Find the largest value of the likelihood L(θ) for any θ ∈ Θ0 by finding the maximum likelihood
estimate within Θ0 and substituting back into the likelihood function.
2. Find the largest value of the likelihood L(θ) for any θ ∈ Θ by finding the maximum likelihood
estimate within Θ and substituting back into the likelihood function.
3. Form the ratio

λ = λ(x1, x2, . . . , xn) = [ max of L(θ) in Θ0 ] / [ max of L(θ) in Θ ].

4. Determine a K so that the test has the desired probability of type I error, α.
5. Reject H0 if λ ≤ K.
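The steps above can be checked numerically for the normal-mean example: computing λ as a ratio of maximized likelihoods (steps 1 through 3) reproduces the closed form e^(−n(x̄ − μ0)²/(2σ²)). The data values below are hypothetical, chosen only for the sketch.

```python
from math import exp, sqrt, pi

def norm_likelihood(xs, mu, sigma):
    """Joint N(mu, sigma^2) likelihood of the sample xs."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return (1 / (sqrt(2 * pi) * sigma)) ** n * exp(-ss / (2 * sigma ** 2))

xs = [4.8, 5.6, 5.1, 4.9, 5.4]   # hypothetical observations
mu0, sigma = 5.0, 0.5
xbar = sum(xs) / len(xs)         # MLE of mu over the full parameter space

# Steps 1-3: numerator maximized over {mu0}, denominator over all mu.
lam = norm_likelihood(xs, mu0, sigma) / norm_likelihood(xs, xbar, sigma)
closed_form = exp(-len(xs) * (xbar - mu0) ** 2 / (2 * sigma ** 2))

print(round(lam, 6), round(closed_form, 6))  # the two agree
```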
In the next example, we find an LRT for a testing problem when both H0 and Ha are simple.
7.4 HYPOTHESES FOR A SINGLE PARAMETER
In this section, we first introduce the concept of p-value. After that, we study hypothesis testing
concerning a single parameter.
7.4.1 The p-Value
In hypothesis testing, the choice of the value of α is somewhat arbitrary. For the same data, if the test
is based on two different values of α, the conclusions could be different. Many statisticians prefer to
compute the so-called p-value, which is calculated based on the observed test statistic. For computing
the p-value, it is not necessary to specify a value of α. We can use the given data to obtain the
p-value.
Definition 7.4.1 Corresponding to an observed value of a test statistic, the p-value
(or attained
significance level) is the lowest level of significance at which the null hypothesis would have been
rejected.
For example, suppose we are testing a given hypothesis with α = 0.05 and we decide to reject H0.
If the calculated p-value equals 0.03, this means that we could have used an α as low as 0.03 and still
maintained the same decision, rejecting H0.
Based on the alternative hypothesis, one can use the following steps to compute the p-value.
STEPS TO FIND THE p-VALUE
1. Let TS be the test statistic.
2. Compute the value of TS using the sample X1, . . . , Xn . Say it is a.
3. The p-value is given by
p-value = P(TS < a | H0),     if lower tail test;
p-value = P(TS > a | H0),     if upper tail test;
p-value = P(|TS| > |a| | H0), if two tail test.
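For a z statistic, these three cases are mechanical to compute with the standard library's NormalDist; the function below is a sketch, and its name is ours.

```python
from statistics import NormalDist

def z_p_value(a, tail):
    """p-value for an observed z statistic a; tail in {'lower', 'upper', 'two'}."""
    Z = NormalDist()
    if tail == "lower":
        return Z.cdf(a)                 # P(Z < a)
    if tail == "upper":
        return 1 - Z.cdf(a)             # P(Z > a)
    return 2 * (1 - Z.cdf(abs(a)))      # P(|Z| > |a|)

# The two-tailed and upper-tailed p-values for z = 1.58,
# matching the table values 0.1142 and 0.0571 up to rounding.
print(round(z_p_value(1.58, "two"), 4))
print(round(z_p_value(1.58, "upper"), 4))
```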
Example 7.4.1
To test H0 : μ = 0 vs. Ha : μ ≠ 0, suppose that the test statistic Z results in a computed value of 1.58.
Then the p-value = P(|Z| > 1.58) = 2(0.0571) = 0.1142. That is, we must allow a type I error probability
of 0.1142 in order to reject H0. Also, if Ha : μ > 0, then the p-value would be P(Z > 1.58) = 0.0571. In
this case we must have an α of at least 0.0571 in order to reject H0.
The p-value can be thought of as a measure of support for the null hypothesis: The lower its value,
the lower the support. Typically one decides that the support for H0 is insufficient when the p-value
drops below a particular threshold, which is the significance level of the test.
REPORTING TEST RESULT AS p-VALUES
1. Choose the maximum value of α that you are willing to tolerate.
2. If the p-value of the test is less than the maximum value of α, reject H0.
If the exact p-value cannot be found, one can give an interval in which the p-value can lie. For example,
if the test is significant at α = 0.05 but not significant for α = 0.025, report that 0.025 ≤ p-value ≤
0.05. So for α > 0.05, reject H0, and for α < 0.025, do not reject H0.
In another interpretation, 1−(p-value) is considered as an index of the strength of the evidence against
the null hypothesis provided by the data. It is clear that the value of this index lies in the interval
[0, 1]. If the p-value is 0.02, the value of index is 0.98, supporting the rejection of the null hypothesis.
p-values thus provide not only a yes or no answer but also a sense of the strength of the
evidence against the null hypothesis: the lower the p-value, the stronger the evidence. In any
test, reporting the p-value is therefore good practice.
Because most of the outputs from statistical software used for hypothesis testing include the p-value,
the p-value approach to hypothesis testing is becoming more and more popular. In this approach,
the decision of the test is made in the following way. If the value of α is given, and if the p-value of the
test is less than the value of α, we will reject H0. If the value of α is not given and the p-value associated
with the test is small (usually set at p-value < 0.05), there is evidence to reject the null hypothesis in
favor of the alternative. In other words, there is evidence that the value of the true parameter (such as
the population mean) is significantly different from (greater or less than) the hypothesized value. If the
p-value associated with the test is not small (p > 0.05), we conclude that there is not enough evidence
to reject the null hypothesis. In most of the examples in this chapter, we give both the rejection region
and p-value approaches.
Example 4.2
The management of a local health club claims that its members lose on the average 15 pounds or more
within the first 3 months after joining the club. To check this claim, a consumer agency took a random
sample of 45 members of this health club and found that they lost an average of 13.8 pounds within the
first 3 months of membership, with a sample standard deviation of 4.2 pounds.
(a) Find the p-value for this test.
(b) Based on the p-value in (a), would you reject the null hypothesis at α = 0.01?
Solution
(a) Let μ be the true mean weight loss in pounds within the first 3 months of membership in this club.
Then we have to test the hypothesis
H0 : μ = 15 versus Ha : μ < 15
Here n = 45, x̄ = 13.8, and s = 4.2. Because n = 45 > 30, we can use the normal approximation.
Hence, the observed test statistic is

z = (x̄ − μ0)/(s/√n) = (13.8 − 15)/(4.2/√45) = −1.9166
and
p-value = P (Z < −1.9166) ≃ P (Z < −1.92) = 0.0274.
Thus, we can use an α as small as 0.0274 and still reject H0.
(b) No. Because the p-value = 0.0274 is greater than α = 0.01, one cannot reject H0.
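The computations in this example can be reproduced with a short script (a sketch, assuming SciPy is available):

```python
import math
from scipy.stats import norm

n, xbar, s, mu0 = 45, 13.8, 4.2, 15.0

# Observed large-sample test statistic
z = (xbar - mu0) / (s / math.sqrt(n))

# Lower-tailed test: p-value = P(Z < z)
p_value = norm.cdf(z)

print(round(z, 4), round(p_value, 4))
```

The computed p-value differs slightly from 0.0274 because the text rounds z to −1.92 before consulting the normal table.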
In any hypothesis testing, after an experimenter determines the objective of an experiment and decides
on the type of data to be collected, we recommend the following step-by-step procedure for hypothesis
testing.
STEPS IN ANY HYPOTHESIS TESTING PROBLEM
1. State the alternative hypothesis, Ha (what is believed to be true).
2. State the null hypothesis, H0 (what is doubted to be true).
3. Decide on a level of significance α.
4. Choose an appropriate TS and compute the observed test statistic.
5. Using the distribution of TS and α, determine the rejection region(s) (RR).
6. Conclusion: If the observed test statistic falls in the RR, reject H0 and conclude that based on the
sample information, we are (1 − α)100% confident that Ha is true. Otherwise, conclude that there is
not sufficient evidence to reject H0. In all the applied problems, interpret the meaning of your
decision.
7. State any assumptions you made in testing the given hypothesis.
8. Compute the p-value from the null distribution of the test statistic and interpret it.
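The steps above, specialized to the large-sample test for a mean, can be sketched as a small helper function (the name z_test and the SciPy dependency are choices of this sketch, not from the text):

```python
import math
from scipy.stats import norm

def z_test(xbar, s, n, mu0, alpha=0.05, tail="two"):
    """Large-sample (n >= 30) test of H0: mu = mu0, following the steps above.

    tail is "upper" (Ha: mu > mu0), "lower" (Ha: mu < mu0),
    or "two" (Ha: mu != mu0). Returns (z, p-value, reject decision).
    """
    z = (xbar - mu0) / (s / math.sqrt(n))
    if tail == "upper":
        p = norm.sf(z)            # P(Z > z)
    elif tail == "lower":
        p = norm.cdf(z)           # P(Z < z)
    else:
        p = 2 * norm.sf(abs(z))   # P(|Z| > |z|)
    return z, p, p < alpha

# Example 4.2 revisited: at alpha = 0.01 the test fails to reject H0
z, p, reject = z_test(13.8, 4.2, 45, 15.0, alpha=0.01, tail="lower")
```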
4.2 Hypothesis Testing for a Single Parameter
Now we study the testing of a hypothesis concerning a single parameter, θ, based on a random sample
X1, . . . , Xn. Let θ̂ denote the corresponding sample statistic (the point estimator of θ). First, we deal with tests for the population mean μ for large
and small samples. Next, we study procedures for testing the population variance σ2. We conclude
the section by studying a test procedure for the true proportion p.
To test the hypothesis H0 : μ = μ0 concerning the true population mean μ, when we have a large
sample (n ≥ 30) we use the test statistic

Z = (X̄ − μ0)/(S/√n)

where S is the sample standard deviation and μ0 is the claimed mean under H0 (if the population
variance is known, we replace S with σ).
For a small random sample (n < 30), the test statistic is

T = (X̄ − μ0)/(S/√n)
where μ0 is the claimed value of the true mean, and X and S are the sample mean and standard
deviation, respectively. Note that we are using the lowercase letters, such as z and t, to represent the
observed values of the test statistics Z and T , respectively.
In practice, with raw data, it is important to verify the assumptions. For example, in the small sample
case, it is important to check for normality by using normal plots. If this assumption is not satisfied,
the nonparametric methods described in Chapter 12 may be more appropriate. In addition, because
sample statistics such as X̄ and S are greatly affected by the presence of outliers, drawing a box
plot to check for outliers is a basic practice we should incorporate in our analysis.
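These diagnostic checks can be carried out numerically as well as graphically. The snippet below (hypothetical data, assuming SciPy and NumPy are available) computes the correlation coefficient of the normal probability plot and applies the usual 1.5 × IQR box-plot rule for flagging outliers:

```python
import numpy as np
from scipy import stats

# Hypothetical small sample, for illustration only
x = np.array([12.1, 11.4, 13.0, 12.6, 11.9, 12.3, 13.4, 12.0])

# Normal probability plot: probplot also returns the correlation coefficient r
# of the plotted points; r close to 1 suggests normality is plausible.
(osm, osr), (slope, intercept, r) = stats.probplot(x)

# Box-plot style outlier screen: flag observations beyond the 1.5 * IQR fences.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
```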
We now summarize the typical test of hypothesis for tests concerning population (true) mean.
In order to compute the observed test statistic, z in the large sample case and t in the small sample
case, calculate the value of (x̄ − μ0)/(s/√n); it is referred to the standard normal distribution in the large sample case and to the t distribution with n − 1 degrees of freedom in the small sample case.
SUMMARY OF HYPOTHESIS TESTS FOR μ

Large Sample (n ≥ 30)
To test H0 : μ = μ0 versus
    Ha : μ > μ0 (upper tail test), μ < μ0 (lower tail test), or μ ≠ μ0 (two-tailed test)
Test statistic: Z = (X̄ − μ0)/(σ/√n). Replace σ by S, if σ is unknown.
Rejection region:
    z > zα,        upper tail RR
    z < −zα,       lower tail RR
    |z| > zα/2,    two tail RR
Assumption: n ≥ 30.

Small Sample (n < 30)
To test H0 : μ = μ0 versus
    Ha : μ > μ0 (upper tail test), μ < μ0 (lower tail test), or μ ≠ μ0 (two-tailed test)
Test statistic: T = (X̄ − μ0)/(S/√n)
Rejection region:
    t > tα,n−1,        upper tail RR
    t < −tα,n−1,       lower tail RR
    |t| > tα/2,n−1,    two tail RR
Assumption: the random sample comes from a normal population.

Decision: Reject H0 if the observed test statistic falls in the RR, and conclude that Ha is true with
(1 − α)100% confidence. Otherwise, keep H0; there is not enough evidence to conclude that
Ha is true for the given α, and more experiments may be needed.
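The small-sample case of this summary can be sketched as a function that applies the rejection-region rule directly (the function name, the illustrative numbers, and the SciPy dependency are assumptions of this sketch):

```python
import math
from scipy.stats import t as t_dist

def t_test_rr(xbar, s, n, mu0, alpha, tail="two"):
    """Small-sample t-test for H0: mu = mu0 via the rejection-region rule above."""
    tobs = (xbar - mu0) / (s / math.sqrt(n))
    df = n - 1
    if tail == "upper":
        reject = tobs > t_dist.ppf(1 - alpha, df)           # t > t_{alpha, n-1}
    elif tail == "lower":
        reject = tobs < -t_dist.ppf(1 - alpha, df)          # t < -t_{alpha, n-1}
    else:
        reject = abs(tobs) > t_dist.ppf(1 - alpha / 2, df)  # |t| > t_{alpha/2, n-1}
    return tobs, reject

# Hypothetical numbers: xbar = 2.1, s = 0.4, n = 16, testing H0: mu = 2.0
tobs, reject = t_test_rr(2.1, 0.4, 16, 2.0, alpha=0.05, tail="upper")
```

Here tobs = 1.0, which is below the critical value t0.05,15 ≈ 1.753, so H0 is not rejected.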
Example 4.3
It is claimed that sports-car owners drive on the average 18,000 miles per year. A consumer firm believes that
the average mileage is probably lower. To check, the consumer firm obtained information from 40 randomly
selected sports-car owners that resulted in a sample mean of 17,463 miles with a sample standard deviation
of 1348 miles. What can we conclude about this claim? Use α = 0.01.
Solution
Let μ be the true population mean. We can formulate the hypotheses as

H0 : μ = 18,000 versus Ha : μ < 18,000.
The observed test statistic (for n ≥ 30, with σ unknown and replaced by s) is

z = (x̄ − μ0)/(s/√n) = (17,463 − 18,000)/(1348/√40) = −2.52.
Rejection region is {z < −z0.01} = {z < −2.33}.
Decision: Because z = −2.52 is less than −2.33, the null hypothesis is rejected at α = 0.01. There is
sufficient evidence to conclude that the mean mileage on sports cars is less than 18,000 miles per year.
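As a check on the arithmetic, this example can be reproduced in a few lines (a sketch, assuming SciPy is available):

```python
import math
from scipy.stats import norm

n, xbar, s, mu0, alpha = 40, 17463, 1348, 18000, 0.01

z = (xbar - mu0) / (s / math.sqrt(n))   # observed test statistic, about -2.52
z_crit = norm.ppf(alpha)                # lower-tail critical value -z_0.01, about -2.33
reject = z < z_crit                     # True: H0 is rejected at alpha = 0.01
```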
Example 4.4
In a frequently traveled stretch of the I-75 highway, where the posted speed is 70 mph, it is thought that
people travel on average at least 75 mph. To check this claim, the following radar measurements of
the speeds (in mph) are obtained for 10 vehicles traveling on this stretch of the interstate highway:

66  74  79  80  69  77  78  65  79  81
Do the data provide sufficient evidence to indicate that the mean speed at which people travel on this
stretch of highway exceeds 75 mph? Test the appropriate hypothesis using α = 0.01. Draw a box plot and
a normal plot for these data, and comment.
Solution
We need to test
H0 : μ = 75 vs. Ha : μ > 75
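Before carrying out the test, the sample mean, standard deviation, and observed t statistic for these data can be computed as follows (a plain-Python sketch):

```python
import math

speeds = [66, 74, 79, 80, 69, 77, 78, 65, 79, 81]
n = len(speeds)
xbar = sum(speeds) / n                                         # sample mean, 74.8
s = math.sqrt(sum((v - xbar) ** 2 for v in speeds) / (n - 1))  # sample std. deviation
t_obs = (xbar - 75) / (s / math.sqrt(n))                       # observed t statistic
# t_obs is negative, so it cannot fall in the upper-tail rejection region.
```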