Validity/Reliability
Reliability

From the perspective of classical test
theory, an examinee's obtained test
score (X) is composed of two
components, a true score component
(T) and an error component (E):

X=T+E

Reliability
The true score component reflects the
examinee's status with regard to the
attribute that is measured by the test,
while the error component represents
measurement error.

Measurement error is random error. It is
due to factors that are irrelevant to what
is being measured by the test and that
have an unpredictable (unsystematic)
effect on an examinee's test score.

Reliability
The score you obtain on a test is likely to
be due both to the knowledge you have
about the topics addressed by exam items
(T) and the effects of random factors (E)
such as the way test items are written,
any fluctuations in anxiety, attention, or
motivation you experience while taking the
test, and the accuracy of your "educated
guesses."


Reliability
Whenever we administer a test to
examinees, we would like to know how
much of their scores reflects "truth" and
how much reflects error. It is a measure
of reliability that provides us with an
estimate of the proportion of variability in
examinees' obtained scores that is due to
true differences among examinees on the
attribute(s) measured by the test.


Reliability

When a test is reliable, it provides
dependable, consistent results
and, for this reason, the term
consistency is often given as a
synonym for reliability (e.g.,
Anastasi, 1988).

Consistency = Reliability
The Reliability Coefficient

Ideally, a test's reliability would be
calculated by dividing true score variance by
the obtained (total) variance to derive a
reliability index. This index would indicate
the proportion of observed variability in test
scores that reflects true score variability.

True Score Variance/Total Variance =
Reliability Index
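
To make this concrete, here is a minimal sketch in Python
(using NumPy) that simulates the X = T + E model and
estimates the reliability index as true score variance
divided by total variance. The sample size, variances, and
variable names are illustrative assumptions, not part of
the original material.

import numpy as np

rng = np.random.default_rng(0)

n = 10_000                          # number of simulated examinees (assumed)
true_scores = rng.normal(50, 8, n)  # T: true score component (variance = 64)
errors = rng.normal(0, 4, n)        # E: random measurement error (variance = 16)
obtained = true_scores + errors     # X = T + E

# Reliability index: true score variance / obtained (total) variance
reliability = true_scores.var() / obtained.var()
print(round(reliability, 2))        # close to 64 / (64 + 16) = .80

With these assumed variances, about 80% of the variability
in obtained scores is true score variability.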

The Reliability Coefficient
A test's true score variance is not known, however,
and reliability must be estimated rather than
calculated directly.

There are several ways to estimate a test's
reliability. Each involves assessing the consistency
of an examinee's scores over time, across different
content samples, or across different scorers.

The common assumption for each of these
reliability techniques is that consistent variability
is true score variability, while variability that is
inconsistent reflects random error.
The Reliability Coefficient

Most methods for estimating reliability
produce a reliability coefficient, which is a
correlation coefficient that ranges in value
from 0.0 to +1.0. When a test's reliability
coefficient is 0.0, this means that all
variability in obtained test scores is due to
measurement error. Conversely, when a
test's reliability coefficient is +1.0, this
indicates that all variability in scores
reflects true score variability.
The Reliability Coefficient
The reliability coefficient is symbolized
with the letter "r" and a subscript that
contains two of the same letters or
numbers (e.g., "r_xx").

The subscript indicates that the
correlation coefficient was calculated by
correlating a test with itself rather than
with some other measure.

The Reliability Coefficient
Regardless of the method used to calculate a
reliability coefficient, the coefficient is interpreted
directly as the proportion of variability in obtained
test scores that reflects true score variability. For
example, as depicted in Figure 1, a reliability
coefficient of .84 indicates that 84% of variability
in scores is due to true score differences among
examinees, while the remaining 16% (1.00 - .84)
is due to measurement error.



Figure 1. Proportion of variability in test scores:
True Score Variability (84%) | Error (16%)
The Reliability Coefficient
Note that a reliability coefficient does not provide
any information about what is actually being
measured by a test!

A reliability coefficient only indicates whether the
attribute measured by the test, whatever it is, is
being assessed in a consistent, precise way.

Whether the test is actually assessing what it was
designed to measure is addressed by an analysis of
the test's validity.

The Reliability Coefficient
Study Tip: Remember that, in contrast to other correlation
coefficients, the reliability coefficient is never squared to
interpret it but is interpreted directly as a measure of true
score variability. A reliability coefficient of .89 means that
89% of variability in obtained scores is true score variability.
Methods for Estimating
Reliability
The selection of a method for
estimating reliability depends on the
nature of the test.

Each method not only entails
different procedures but is also
affected by different sources of error.
For many tests, more than one
method should be used.

1. Test-Retest Reliability:
The test-retest method for estimating
reliability involves administering the
same test to the same group of
examinees on two different occasions and
then correlating the two sets of scores.
When using this method, the reliability
coefficient indicates the degree of
stability (consistency) of examinees'
scores over time and is also known as the
coefficient of stability.


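A minimal sketch of this calculation, assuming two NumPy
arrays that hold the same examinees' scores from the two
administrations (the scores themselves are invented for
illustration):

import numpy as np

time1 = np.array([80, 75, 62, 90, 71, 84])  # first administration (invented data)
time2 = np.array([78, 77, 60, 92, 69, 85])  # same examinees, second administration

# Coefficient of stability: Pearson correlation of the two sets of scores
r_tt = np.corrcoef(time1, time2)[0, 1]
print(round(r_tt, 2))
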
Test-Retest Reliability
The primary sources of measurement error for
test-retest reliability are any random factors
related to the time that passes between the two
administrations of the test.

These time sampling factors include random
fluctuations in examinees over time (e.g.,
changes in anxiety or motivation) and random
variations in the testing situation.

Memory and practice also contribute to error
when they have random carryover effects; i.e.,
when they affect many or all examinees but not
in the same way.

Test-Retest Reliability
Test-retest reliability is appropriate for
determining the reliability of tests designed to
measure attributes that are relatively stable
over time and that are not affected by repeated
measurement.

It would be appropriate for a test of aptitude,
which is a stable characteristic, but not for a test
of mood, since mood fluctuates over time, or a
test of creativity, which might be affected by
previous exposure to test items.

2. Alternate (Equivalent,
Parallel) Forms Reliability:
To assess a test's alternate forms reliability,
two equivalent forms of the test are
administered to the same group of
examinees and the two sets of scores are
correlated.

Alternate forms reliability indicates the
consistency of responding to different item
samples (the two test forms) and, when the
forms are administered at different times,
the consistency of responding over time.

Alternate (Equivalent,
Parallel) Forms Reliability
The alternate forms reliability
coefficient is also called the coefficient
of equivalence when the two forms are
administered at about the same time;

and the coefficient of equivalence and
stability when a relatively long period
of time separates administration of the
two forms.

Alternate (Equivalent, Parallel)
Forms Reliability
The primary source of measurement error
for alternate forms reliability is content
sampling, or error introduced by an
interaction between different examinees'
knowledge and the different content
assessed by the items included in the two
forms (e.g., Form A and Form B).


Alternate (Equivalent,
Parallel) Forms Reliability
The items in Form A might be a better match of
one examinee's knowledge than items in Form
B, while the opposite is true for another
examinee.

In this situation, the two scores obtained by
each examinee will differ, which will lower the
alternate forms reliability coefficient.

When administration of the two forms is
separated by a period of time, time sampling
factors also contribute to error.

Alternate (Equivalent, Parallel)
Forms Reliability
Like test-retest reliability, alternate
forms reliability is not appropriate
when the attribute measured by the
test is likely to fluctuate over time
(and the forms will be administered at
different times) or when scores are
likely to be affected by repeated
measurement.


Alternate (Equivalent,
Parallel) Forms Reliability
If the same strategies required to solve
problems on Form A are used to solve problems
on Form B, even if the problems on the two
forms are not identical, there are likely to be
practice effects.

When these effects differ for different examinees
(i.e., are random), practice will serve as a
source of measurement error.

Although alternate forms reliability is considered
by some experts to be the most rigorous (and
best) method for estimating reliability, it is not
often assessed due to the difficulty in developing
forms that are truly equivalent.
3. Internal Consistency
Reliability:
Reliability can also be estimated by measuring the
internal consistency of a test.

Split-half reliability and coefficient alpha are two
methods for evaluating internal consistency. Both
involve administering the test once to a single
group of examinees, and both yield a reliability
coefficient that is also known as the coefficient of
internal consistency.


Internal Consistency
Reliability
To determine a test's split-half reliability,
the test is split into equal halves so that
each examinee has two scores (one for
each half of the test).

Scores on the two halves are then
correlated. Tests can be split in several
ways, but probably the most common
way is to divide the test on the basis of
odd- versus even-numbered items.


Internal Consistency Reliability
A problem with the split-half method is that it produces a
reliability coefficient that is based on scores from only
one-half of the full-length test.

If a test contains 30 items, each score is based on 15 items.
Because reliability tends to decrease as the length of a test
decreases, the split-half reliability coefficient usually
underestimates a test's true reliability.

For this reason, the split-half reliability coefficient is
ordinarily corrected using the Spearman-Brown prophecy
formula, which provides an estimate of what the reliability
coefficient would have been had it been based on the full
length of the test.
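
A sketch of the odd/even split and the Spearman-Brown
correction, assuming an examinees-by-items NumPy matrix of
dichotomously scored items (the data are simulated, so the
coefficients will be near zero; real item responses would
correlate positively):

import numpy as np

# 100 examinees x 30 items, scored 1 (correct) or 0 (incorrect); simulated
scores = np.random.default_rng(1).integers(0, 2, size=(100, 30))

odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]  # half-test correlation

# Spearman-Brown correction: estimated reliability at full test length
r_full = (2 * r_half) / (1 + r_half)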


Internal Consistency
Reliability
Cronbach's coefficient alpha also involves administering
the test once to a single group of examinees. However,
rather than splitting the test in half, a special formula is
used to determine the average degree of inter-item
consistency.

One way to interpret coefficient alpha is as the average
reliability that would be obtained from all possible splits
of the test. Coefficient alpha tends to be conservative
and can be considered the lower boundary of a test's
reliability (Novick and Lewis, 1967).

When test items are scored dichotomously (right or
wrong), a variation of coefficient alpha known as the
Kuder-Richardson Formula 20 (KR-20) can be used.
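
Coefficient alpha is commonly written as
alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores),
where k is the number of items. Below is a minimal sketch
under that formula (the function name and data layout are
illustrative assumptions); applied to dichotomous 0/1 data
it yields KR-20:

import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an examinees-by-items score matrix.
    With dichotomously scored (0/1) items this equals KR-20."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Simulated 0/1 responses (random data, so alpha will be near zero here)
responses = np.random.default_rng(2).integers(0, 2, size=(100, 30))
print(round(cronbach_alpha(responses), 2))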

Internal Consistency
Reliability
Content sampling is a source of error for both
split-half reliability and coefficient alpha.

For split-half reliability, content sampling refers to
the error resulting from differences between the
content of the two halves of the test (i.e., the
items included in one half may better fit the
knowledge of some examinees than items in the
other half);

for coefficient alpha, content (item) sampling
refers to differences between individual test items
rather than between test halves.
Internal Consistency
Reliability
For coefficient alpha, the heterogeneity
of the content domain is an additional
source of error.

A test is heterogeneous with regard
to content domain when its items
measure several different domains
of knowledge or behavior.


Internal Consistency
Reliability
The greater the heterogeneity of the content
domain, the lower the inter-item correlations and
the lower the magnitude of coefficient alpha.

Coefficient alpha could be expected to be smaller
for a 200-item test that contains items assessing
knowledge of test construction, statistics, ethics,
epidemiology, environmental health, social and
behavioral sciences, rehabilitation counseling, etc.
than for a 200-item test that contains questions on
test construction only.

Internal Consistency
Reliability
The methods for assessing internal consistency
reliability are useful when a test is designed to
measure a single characteristic, when the
characteristic measured by the test fluctuates over
time, or when scores are likely to be affected by
repeated exposure to the test.

They are not appropriate for assessing the
reliability of speed tests because, for these tests,
they tend to produce spuriously high coefficients.
(For speed tests, alternate forms reliability is
usually the best choice.)
4. Inter-Rater (Inter-Scorer,
Inter-Observer) Reliability:
Inter-rater reliability is of concern
whenever test scores depend on a rater's
judgment.

A test constructor would want to make sure
that an essay test, a behavioral observation
scale, or a projective personality test has
adequate inter-rater reliability. This type of
reliability is assessed either by calculating a
correlation coefficient (e.g., a kappa
coefficient or coefficient of concordance) or
by determining the percent agreement
between two or more raters.

Inter-Rater (Inter-Scorer,
Inter-Observer) Reliability
Although the latter technique is frequently used, it
can lead to erroneous conclusions since it does not
take into account the level of agreement that
would have occurred by chance alone.

This is a particular problem for behavioral
observation scales that require raters to record the
frequency of a specific behavior.

In this situation, the degree of chance agreement
is high whenever the behavior has a high rate of
occurrence, and percent agreement will provide an
inflated estimate of the measure's reliability.
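
The sketch below illustrates the point with invented
ratings of a frequently occurring behavior: percent
agreement looks high, while a chance-corrected index such
as Cohen's kappa is noticeably lower. The data and variable
names are assumptions for illustration only.

import numpy as np

rater_a = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])  # 1 = behavior recorded (invented)
rater_b = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1])

observed = np.mean(rater_a == rater_b)  # percent agreement = .90

# Agreement expected by chance alone: both rate 1, plus both rate 0
p_chance = (rater_a.mean() * rater_b.mean()
            + (1 - rater_a.mean()) * (1 - rater_b.mean()))

kappa = (observed - p_chance) / (1 - p_chance)  # about .62 here
print(round(kappa, 2))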

Inter-Rater (Inter-Scorer,
Inter-Observer) Reliability
Sources of error for inter-rater reliability
include factors related to the raters, such
as lack of motivation and rater biases, as
well as characteristics of the measuring device.

An inter-rater reliability coefficient is
likely to be low, for instance, when rating
categories are not exhaustive (i.e., don't
include all possible responses or
behaviors) and/or are not mutually
exclusive.

Inter-Rater (Inter-Scorer,
Inter-Observer) Reliability
The inter-rater reliability of a behavioral rating scale
can also be affected by consensual observer drift,
which occurs when two (or more) observers
working together influence each other's ratings so
that they both assign ratings in a similarly
idiosyncratic way.

(Observer drift can also affect a single observer's
ratings when he or she assigns ratings in a
consistently deviant way.) Unlike other sources of
error, consensual observer drift tends to artificially
inflate inter-rater reliability.

Inter-Rater (Inter-Scorer,
Inter-Observer) Reliability
The reliability (and validity) of ratings can
be improved in several ways:
Consensual observer drift can be eliminated by
having raters work independently or by
alternating raters.
Rating accuracy is also improved when raters
are told that their ratings will be checked.
Overall, the best way to improve both inter- and
intra-rater accuracy is to provide raters with
training that emphasizes the distinction between
observation and interpretation (Aiken, 1985).
RELIABILITY AND VALIDITY


Study Tip: Remember that the Spearman-Brown formula is related
to split-half reliability and KR-20 is related to coefficient alpha.
Also know that alternate forms reliability is considered the most
rigorous method for estimating reliability and that internal
consistency reliability is not appropriate for speed tests.
Factors That Affect The
Reliability Coefficient
The magnitude of the reliability coefficient
is affected not only by the sources of error
discussed earlier, but also by the length of
the test, the range of the test scores, and
the probability that the correct response to
items can be selected by guessing.

Test Length
Range of Test Scores
Guessing


1. Test Length:
The larger the sample of the attribute being
measured by a test, the less the relative
effects of measurement error and the more
likely the sample will provide dependable,
consistent information.

Consequently, a general rule is that the
longer the test, the larger the test's
reliability coefficient.


Test Length
The Spearman-Brown prophecy formula is most
associated with split-half reliability but can actually
be used whenever a test developer wants to
estimate the effects of lengthening or shortening a
test on its reliability coefficient.

For instance, if a 100-item test has a reliability
coefficient of .84, the Spearman-Brown formula
could be used to estimate the effects of increasing
the number of items to 150 or reducing the number
to 50.

A problem with the Spearman-Brown formula is that
it does not always yield an accurate estimate of
reliability: In general, it tends to overestimate a
test's true reliability (Gay, 1992).
Test Length
This is most likely to be the case when the added
items do not measure the same content domain as
the original items and/or are more susceptible to
the effects of measurement error.

Note that, when used to correct the split-half
reliability coefficient, the situation is more complex,
and this generalization does not always apply:
When the two halves are not equivalent in terms of
their means and standard deviations, the
Spearman-Brown formula may either over- or
underestimate the test's actual reliability.
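
The prophecy formula itself is usually written as
r_new = (n * r_old) / (1 + (n - 1) * r_old), where n is the
factor by which the test is lengthened or shortened. A
minimal sketch using the 100-item example above (the
function name is illustrative):

def spearman_brown(r_old, n):
    """Predicted reliability when test length is multiplied by n."""
    return (n * r_old) / (1 + (n - 1) * r_old)

print(spearman_brown(0.84, 1.5))  # lengthen to 150 items: about .89
print(spearman_brown(0.84, 0.5))  # shorten to 50 items: about .72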

2. Range of Test Scores:
Since the reliability coefficient is a
correlation coefficient, it is
maximized when the range of
scores is unrestricted.

The range is directly affected by
the degree of similarity of
examinees with regard to the
attribute measured by the test.

Range of Test Scores
When examinees are heterogeneous, the range of
scores is maximized.

The range is also affected by the difficulty level of
the test items.

When all items are either very difficult or very easy,
all examinees will obtain either low or high scores,
resulting in a restricted range.

Therefore, the best strategy is to choose items so
that the average difficulty level is in the mid-range
(p = .50).

3. Guessing:
A test's reliability coefficient is also affected by the
probability that examinees can guess the correct
answers to test items.

As the probability of correctly guessing answers
increases, the reliability coefficient decreases.

All other things being equal, a true/false test will
have a lower reliability coefficient than a four-
alternative multiple-choice test which, in turn, will
have a lower reliability coefficient than a free recall
test.


The Interpretation of
Reliability
The interpretation of a test's
reliability entails considering
its effects on the scores
achieved by a group of
examinees as well as the
score obtained by a single
examinee.

Interpretation of
Reliability Coefficient
The Reliability Coefficient: As discussed
previously, a reliability coefficient is interpreted
directly as the proportion of variability in a set of
test scores that is attributable to true score
variability.

A reliability coefficient of .84 indicates that 84% of
variability in test scores is due to true score
differences among examinees, while the remaining
16% is due to measurement error.

While different types of tests can be expected to
have different levels of reliability, for most tests in
the social sciences, reliability coefficients of .80 or
larger are considered acceptable.

The Interpretation of
Reliability
When interpreting a reliability coefficient, it
is important to keep in mind that there is
no single index of reliability for a given test.

Instead, a test's reliability coefficient can
vary from situation to situation and sample
to sample. Ability tests, for example,
typically have different reliability
coefficients for groups of individuals of
different ages or ability levels.
Interpretation of Standard
Error of Measurement
While the reliability coefficient is useful for
estimating the proportion of true score variability
in a set of test scores, it is not particularly helpful
for interpreting an individual examinee's obtained
test score.

When an examinee receives a score of 80 on a
100-item test that has a reliability coefficient of
.84, for instance, we can only conclude that, since
the test is not perfectly reliable, the examinee's
obtained score might or might not be his or her
true score.


Interpretation of Standard
Error of Measurement
A common practice when interpreting an
examinee's obtained score is to construct a
confidence interval around that score.

The confidence interval helps a test user estimate
the range within which an examinee's true score is
likely to fall given his or her obtained score.

This range is calculated using the standard error of
measurement, which is an index of the amount of
error that can be expected in obtained scores due
to the unreliability of the test. (When raw scores
have been converted to percentile ranks, the
confidence interval is referred to as a percentile
band.)
Interpretation of Standard
Error of Measurement
The following formula is used to estimate the
standard error of measurement:

Formula 1: Standard Error of Measurement

SE_meas = SD_x * (1 - r_xx)^(1/2)

Where:
SE_meas = standard error of measurement
SD_x = standard deviation of test scores
r_xx = reliability coefficient
Interpretation of Standard
Error of Measurement
As shown by the formula, the magnitude of
the standard error is affected by two factors:

the standard deviation of the test scores (SD_x), and

the test's reliability coefficient (r_xx).

The lower the test's standard deviation and
the higher its reliability coefficient, the
smaller the standard error of measurement
(and vice versa).
Interpretation of Standard
Error of Measurement
Because the standard error is a type of standard
deviation, it can be interpreted in terms of the
areas under the normal curve.

With regard to confidence intervals, this means
that a 68% confidence interval is constructed by
adding and subtracting one standard error to an
examinee's obtained score; a 95% confidence
interval is constructed by adding and subtracting
two standard errors; and a 99% confidence
interval is constructed by adding and subtracting
three standard errors.
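
A minimal sketch of this arithmetic, using the same values
that appear in the worked example that follows (SD = 10,
r_xx = .84, obtained score = 80):

import math

def sem(sd, r_xx):
    """Standard error of measurement: SD_x * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - r_xx)

se = sem(10, 0.84)                       # 10 * sqrt(.16) = 4.0
lower, upper = 80 - 2 * se, 80 + 2 * se  # 95% interval: 72 to 88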

Interpretation of Standard
Error of Measurement
Example: A psychologist administers an interpersonal assertiveness
test to a sales applicant who receives a score of 80. Since the test's
reliability is less than 1.0, the psychologist knows that this score
might be an imprecise estimate of the applicant's true score and
decides to use the standard error of measurement to construct a
95% confidence interval. Assuming that the test's reliability
coefficient is .84 and its standard deviation is 10, the standard error
of measurement is equal to 4.0:

SE_meas = SD_x * (1 - r_xx)^(1/2) = 10 * (1 - .84)^(1/2) = 10(.4) = 4.0

The psychologist constructs the 95% confidence interval by adding
and subtracting two standard errors from the applicant's obtained
score: 80 ± 2(4.0) = 72 to 88. This means that there is a 95%
chance that the applicant's true score falls somewhere between 72
and 88.
Interpretation of Standard
Error of Measurement
One problem with the standard error is that
measurement error is not usually equally
distributed throughout the range of test scores.

Use of the same standard error to construct
confidence intervals for all scores in a distribution
can, therefore, be somewhat misleading.

To overcome this problem, some test manuals
report different standard errors for different score
intervals.
Estimating True Scores from
Obtained Scores
As discussed earlier, because of the effects
of measurement error, obtained test scores
tend to be biased (inaccurate) estimates of
true scores.

More specifically, scores above the mean of
a distribution tend to overestimate true
scores, while scores below the mean tend
to underestimate true scores.
