Chapter 8: Test Development
TEST CONCEPTUALIZATION
Test conceptualization often begins with self-talk, in behavioral terms: the thought that there
ought to be a test designed to measure a particular construct.
A review of the available literature on existing tests designed to measure that construct
might indicate that such tests leave much to be desired in psychometric soundness.
What special training will be required of test users for administering or interpreting the test?
What background and qualifications will a prospective user of data derived from an
administration of this test need to have?
What restrictions, if any, should be placed on distributors of the test and on the test's usage?
Is there any potential for harm as the result of an administration of this test?
What safeguards are built into the recommended testing procedure to prevent any sort of harm
to any of the parties involved in the use of this test?
Pilot Work
Pilot work, pilot study, and pilot research refer,
in general, to the preliminary research surrounding the creation of a prototype of the test.
Pilot work
In pilot work, the test developer typically attempts to determine how best to measure a targeted construct.
This may entail the creation, revision, and deletion of many test items in addition to literature reviews,
experimentation, and related activities.
Once pilot work has been completed, the process of test construction begins.
TEST CONSTRUCTION
Pilot work is a necessity when constructing tests or other measuring instruments for publication
and wide distribution. By contrast, pilot work need not be part of the process of developing
teacher-made tests for classroom use.
SCALING
Measurement
may be defined as the assignment of numbers according to rules.
Scaling
may be defined as the process of setting rules for assigning numbers in measurement.
It is the process by which a measuring device is designed and calibrated and by which numbers (or
other indices), called scale values, are assigned to different amounts of the trait, attribute, or
characteristic being measured.
L. L. Thurstone
(Figure 8-2) is credited with being at the forefront of efforts to develop methodologically sound
scaling methods.
He adapted psychophysical scaling methods to the study of psychological variables such as
attitudes and values
Types of scales
Scales
may also be conceived of as instruments used to measure.
NOTE:
There is no one method of scaling.
There is no best type of scale.
Test developers scale a test in the manner they believe is optimally suited to their conception of
the measurement of the trait (or whatever) that is being measured.
Scaling methods
Note:
The higher or lower the score, the more or less of the characteristic the testtaker presumably
possesses.
SORTING
Comparative scaling
One method of sorting, it entails judgments of a stimulus in comparison with every other stimulus
on the scale. For example, testtakers might be given 30 index cards, on each of which is printed
one of 30 items describing behaviors, and asked to sort the cards from most justifiable to least justifiable.
Comparative scaling could also be accomplished by providing testtakers with a list of 30 items
on a sheet of paper and asking them to rank the justifiability of the items from 1 to 30.
Categorical scaling
Stimuli are placed into one of two or more alternative categories that differ quantitatively with
respect to some continuum.
For example, testtakers might be given 30 index cards, on each of which is printed one of the 30 items.
Testtakers would be asked to sort the cards into three piles: those behaviors that are never
justified, those that are sometimes justified, and those that are always justified.
Guttman scale
is yet another scaling method that yields ordinal-level measures. Items on it range sequentially
from weaker to stronger expressions of the attitude, belief, or feeling being measured.
The scale is designed so that all respondents who agree with the stronger statements of the
attitude will also agree with milder statements.
If this were a perfect Guttman scale, then all respondents who agree with item a (the most extreme
position) should also agree with items b, c, and d. All respondents who disagree with a but
agree with b should also agree with c and d, and so forth.
The resulting data are then analyzed by means of scalogram analysis, an item-analysis
procedure and approach to test development that involves a graphic mapping of a testtaker's
responses.
Guttman scales are also used in consumer research, where an objective may be to learn whether a
consumer who will purchase one product will also purchase another product.
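To make the "perfect scale" idea concrete, here is a minimal Python sketch on hypothetical data (a full scalogram analysis would also quantify how reproducible the patterns are):

```python
# Minimal check of the Guttman pattern on hypothetical 0/1 endorsements.
# Items are ordered from mildest (first) to strongest (last); in a perfect
# Guttman scale, endorsing a stronger item implies endorsing all milder ones,
# so a consistent pattern is a run of 1s followed by a run of 0s.

def is_guttman_consistent(responses):
    """responses: list of 0/1 endorsements, ordered mild -> strong."""
    seen_zero = False
    for r in responses:
        if r == 1 and seen_zero:
            return False  # a stronger item endorsed after a milder one was not
        if r == 0:
            seen_zero = True
    return True

print(is_guttman_consistent([1, 1, 1, 0]))  # True: milder items endorsed only
print(is_guttman_consistent([1, 0, 1, 0]))  # False: inconsistent pattern
```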
Equal-appearing intervals,
first described by Thurstone (1929), this is one scaling method used to obtain data that are
presumed to be interval in nature.
It is an example of a scaling method of the direct estimation variety.
In contrast to scaling methods involving indirect estimation, there is no need to transform the
testtaker's responses into some other scale.
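A minimal sketch of how scale values might be derived under this method, using hypothetical judge ratings (the statement names, ratings, and the 11-point continuum here are illustrative assumptions; operational Thurstone scaling uses many judges and formal item-selection criteria):

```python
# Hypothetical equal-appearing-intervals workup: each statement's scale value
# is the median of judges' ratings on an 11-point favorability continuum;
# statements whose ratings scatter widely (large IQR) are dropped as ambiguous.
from statistics import median, quantiles

judge_ratings = {                     # item -> hypothetical judges' ratings
    "Statement A": [2, 3, 2, 3, 2, 3, 2, 3],
    "Statement B": [6, 5, 7, 6, 6, 5, 7, 6],
    "Statement C": [1, 6, 11, 3, 9, 2, 10, 4],  # judges disagree: ambiguous
}

for item, ratings in judge_ratings.items():
    q1, _, q3 = quantiles(ratings, n=4)
    print(f"{item}: scale value = {median(ratings)}, IQR = {q3 - q1:.1f}")
```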
Writing Items
The test developer or item writer immediately faces three questions related to the test blueprint:
What range of content should the items cover?
Which of the many different types of item formats should be employed?
How many items should be written in total and for each content area covered?
Should:
The first draft should contain approximately twice the number of items that the final version of
the test will contain. Because approximately half of these items will be eliminated on the way to
the test's final version, starting with a double-sized pool helps ensure that the surviving items
still adequately sample the domain.
Another consideration here is whether or not alternate forms of the test will be created and, if
so, how many. Multiply the number of items required in the pool for one form of the test by the
number of forms planned, and you have the total number of items needed for the initial item
pool.
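A quick worked example of this arithmetic (the numbers are hypothetical):

```python
# Hypothetical numbers: a 50-item final test, drafted at twice the final
# length, with two alternate forms planned.
final_items_per_form = 50
draft_items_per_form = 2 * final_items_per_form     # first-draft rule of thumb
forms_planned = 2
initial_pool = draft_items_per_form * forms_planned
print(initial_pool)  # 200 items needed in the initial item pool
```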
How does one develop items for the item pool? The test developer may write a large number
of items from personal experience or academic acquaintance with the subject matter. Help
may also be sought from others, including experts. For psychological tests designed to be used in
clinical settings, clinicians, patients, patients' family members, clinical staff, and others may be
interviewed for insights that could assist in item writing.
Item pool
is the reservoir or well from which items will or will not be drawn for the final version of the
test.
Note:
A comprehensive sampling provides a basis for content validity of the final version of the test.
Item format
the form, plan, structure, arrangement, and layout of individual test items
TWO TYPES OF ITEM FORMAT
Selected-response format
require testtakers to select a response from a set of alternative responses.
If the test is designed to measure achievement and the items are written in a selected-response
format, then examinees must select the response that is keyed as correct.
Constructed-response format
require testtakers to supply or to create the correct answer, not merely to select it.
SELECTED-RESPONSE FORMAT
MULTIPLE-CHOICE FORMAT has three elements:
a stem,
a correct alternative or option, and
several incorrect alternatives or options, variously referred to as distractors or foils
MATCHING ITEM
the testtaker is presented with two columns:
premises on the left and
responses on the right
Note
The testtaker's task is to determine which response is best associated with which premise.
Young testtakers may be asked to draw a line connecting each premise to its response;
testtakers other than young children are typically asked to write the letter of the matching response next to each premise.
Note
The two columns should contain different numbers of items.
If the number of items in the two columns were the same, then a person unsure about one of
the actors' roles could merely deduce it by matching all the other options first.
A perfect score would then result even though the testtaker did not actually know all the
answers.
Providing more options than needed minimizes such a possibility
Should:
wording of the premises and the responses should be fairly short and to the point
two columns contain different numbers of items
No more than a dozen or so premises should be included; otherwise, some students will forget
what they were looking for as they go through the lists
The lists of premises and responses should both be homogeneous; that is, lists of the same
sort of thing.
Disadvantage of binary items
The probability of obtaining a correct response purely on the basis of chance (guessing) on any
one item is .5, or 50% (for a four-option multiple-choice item, 25%).
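A worked illustration of why this matters (hypothetical 100-item tests):

```python
# Expected score from blind guessing alone on a 100-item test.
n_items = 100
print("True-false:", n_items * 0.50)                 # 50.0 correct by chance
print("4-option multiple choice:", n_items * 0.25)   # 25.0 correct by chance
```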
CONSTRUCTED-RESPONSE FORMAT
COMPLETION ITEM
requires the examinee to provide a word or phrase that completes a sentence, as in the
following example: "The standard deviation is generally considered the most useful measure of ______."
Should
should be worded so that the correct answer is specific.
Note:
Completion items that can be correctly answered in many ways lead to scoring problems.
(The correct completion here is variability.)
An alternative way of constructing this question would be as a short-answer item:
What descriptive statistic is generally considered the most useful measure of variability?
SHORT-ANSWER ITEM
A completion item may also be referred to as a short-answer item.
It should be written clearly enough that the testtaker can respond succinctly, that is, with a short
answer.
There are no hard-and-fast rules for how short an answer must be to be considered a short
answer; a word, a term, a sentence, or a paragraph may qualify.
ESSAY ITEM
is a test item that requires the test taker to respond to a question by writing a composition,
typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation.
useful when the test developer wants the examinee to demonstrate a depth of knowledge
about a single topic
the essay question not only permits the restating of learned material but also allows for the
creative integration and expression of the material in the testtaker's own words
Essay vs other types of response
The latter types of items require only recognition; an essay requires recall, organization, planning,
and writing ability.
Disadvantage:
tends to focus on a more limited area than can be covered in the same amount of time when
using a series of selected-response items or completion items
there can be subjectivity in scoring and inter-scorer differences
Item bank
is a relatively large and easily accessible collection of test questions.
Instructors who regularly teach a particular course sometimes create their own item bank of
questions that they have found to be useful on examinations
Advantage:
accessibility to a large number of test items conveniently classified by subject area, item
statistics, or other variables
And just as funds may be added to or withdrawn from a more traditional bank, so items may be
added to, withdrawn from, and even modified in an item bank
Advantages of CAT
only a sample of the total number of items in the item pool is administered to any one test
taker.
Note:
On the basis of previous response patterns, items that have a high probability of being answered
in a particular fashion (correctly if an ability test) are not presented, thus providing economy
in terms of testing time and total number of items presented.
CAT has been found to reduce the number of test items that need to be administered by as much
as 50% while simultaneously reducing measurement error by 50%.
CAT tends to reduce floor effects and ceiling effects
Floor effect
refers to the diminished utility of an assessment tool for distinguishing test takers at the low end
of the ability, trait, or other attribute being measured
Ceiling effect
Refers to the diminished utility of an assessment tool for distinguishing test takers at the high
end of the ability, trait, or other attribute being measured.
Returning to our example of the ninth-grade mathematics test, what would happen if all of the
test takers answered all of the items correctly? It is likely that the test user would conclude that
the test was too easy for this group of test takers and so discrimination was impaired by a ceiling
effect
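A deliberately simplified sketch of the adaptivity that produces these benefits (hypothetical item difficulties; operational CATs estimate ability with IRT models rather than a fixed step rule):

```python
# Toy adaptive item selection: after a correct answer, aim for a harder item;
# after an error, aim for an easier one. Keeping items near the testtaker's
# level is what counteracts floor and ceiling effects.
items = {-2.0: "very easy", -1.0: "easy", 0.0: "medium",
         1.0: "hard", 2.0: "very hard"}          # difficulty -> label

target = 0.0                                     # start at medium difficulty
for correct in [True, True, False]:              # hypothetical response pattern
    target += 0.5 if correct else -0.5
    chosen = min(items, key=lambda b: abs(b - target))
    print(f"administer {items.pop(chosen)} item (b = {chosen:+.1f})")
```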
Item branching
The ability of the computer to tailor the content and order of presentation of test items on the
basis of responses to previous items
Note:
Item branching may be used in the measurement not only of achievement but also of personality.
For example, if a respondent answers an item in a way that suggests he or she is depressed, the
computer might automatically probe for depression-related symptoms and behavior. The next item
presented might be designed to probe the respondent's sleep patterns or the existence of suicidal ideation.
Item-branching technology may be used in personality tests to recognize nonpurposive or
inconsistent responding.
For example, on a computer-based true-false test, if the examinee responds "true" to an item
such as "I summered in Baghdad last year," then there would be reason to suspect that the
examinee is responding nonpurposively, randomly, or in some way other than genuinely. And if
the same respondent responds "false" to the identical item later on in the test, the respondent is
being inconsistent as well.
Should the computer recognize a nonpurposive response pattern, it may be programmed to
respond in a prescribed way, for example, by admonishing the respondent to be more careful
or even by refusing to proceed until a purposive response is given.
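A minimal sketch of the inconsistency check described above (hypothetical item IDs and responses):

```python
# Flag items that the respondent answers two different ways across the test.
def inconsistent_items(responses):
    """responses: list of (item_id, answer) pairs in presentation order."""
    seen, flags = {}, []
    for item_id, answer in responses:
        if item_id in seen and seen[item_id] != answer:
            flags.append(item_id)      # same item, contradictory answers
        seen[item_id] = answer
    return flags

log = [("summered_in_baghdad", True),
       ("trouble_sleeping", False),
       ("summered_in_baghdad", False)]  # contradicts the earlier answer
print(inconsistent_items(log))          # ['summered_in_baghdad']
```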
Scoring Items
Cumulative model
is the most commonly used model, owing in part to its simplicity and logic. The assumption is
that the higher the score on the test, the higher the testtaker is on the ability, trait, or other
characteristic that the test purports to measure.
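A minimal sketch of cumulative scoring (hypothetical answer key and responses):

```python
# Each response matching the key earns one point; points simply accumulate.
key       = ["b", "d", "a", "c", "b"]
responses = ["b", "d", "c", "c", "b"]
score = sum(k == r for k, r in zip(key, responses))
print(score)  # 4 -- the higher the total, the more of the measured attribute
```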
Class or category scoring
testtaker responses earn credit toward placement in a particular class or category with other
testtakers whose pattern of responses is presumably similar in some way.
This approach is used by some diagnostic systems wherein individuals must exhibit a certain
number of symptoms to qualify for a specific diagnosis.
Ipsative scoring
departs radically in rationale from either cumulative or class models.
It involves comparing a testtaker's score on one scale within a test to another scale within that
same test.
An ipsatively scored test does not yield information on the strength of a testtaker's need relative
to the presumed strength of that need in the general population.
TEST TRYOUT
WHO. The test should be tried out on people who are similar in critical respects to the people
for whom the test was designed
HOW MANY. An informal rule of thumb is that there should be no fewer than five subjects, and
preferably as many as ten, for each item on the test. In general, the more subjects in the tryout
the better; with too few subjects, phantom factors (factors that are actually just artifacts of the
small sample size) may emerge.
WHEN/WHERE. The tryout should be executed under conditions as identical as possible to the
conditions under which the standardized test will be administered; all instructions, and everything
from the time limits allotted for completing the test to the atmosphere at the test site, should be
as similar as possible.
Note:
A standardized tryout endeavors to ensure that differences in response to the test's items are due
in fact to the items themselves, not to extraneous factors.
ITEM ANALYSIS
Statistical procedures used to analyze items may become quite complex, and our treatment of
this subject should be viewed as only introductory
The criteria for the best items may differ as a function of the test developers objectives.
Among the tools test developers might employ to analyze and select items are the following (a sketch of computing the first and last of these appears after the list):
an index of the item's difficulty
an index of the item's reliability
an index of the item's validity
an index of item discrimination
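As promised above, a sketch of how the difficulty and discrimination indices are conventionally computed on 0/1-scored data (the data are hypothetical, and the upper/lower split here uses halves, where some texts use the top and bottom 27% of scorers):

```python
# Item difficulty p: proportion of testtakers who answered the item correctly.
# Item discrimination d: pass rate in the high-scoring group minus pass rate
# in the low-scoring group (groups formed from total test scores).
item_scores  = [1, 1, 1, 0, 1, 0, 1, 0]           # one entry per testtaker
total_scores = [48, 45, 40, 22, 38, 18, 42, 25]   # same testtakers, whole test

p = sum(item_scores) / len(item_scores)           # difficulty index = 0.62

ranked = sorted(zip(total_scores, item_scores), reverse=True)
n = len(ranked) // 2
upper = sum(item for _, item in ranked[:n]) / n   # top-half pass rate
lower = sum(item for _, item in ranked[n:]) / n   # bottom-half pass rate
d = upper - lower                                 # discrimination index = 0.75

print(f"p = {p:.2f}, d = {d:.2f}")
```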
ITEM-CHARACTERISTIC CURVES
IRT
can be a powerful tool not only for understanding how test items perform but also for creating
or modifying individual test items, building new tests, and revising existing tests
Item characteristic curves (ICCs)
can play a role in decisions about which items are working well and which items are not
is a graphic representation of item difficulty and discrimination.
The steeper the slope, the greater the item discrimination
An easy item will shift the ICC to the left along the ability axis, indicating that many people will
likely get the item correct.
A difficult item will shift the ICC to the right along the horizontal axis, indicating that fewer
people will answer the item correctly.
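The slope and left/right shifts can be made concrete with the two-parameter logistic (2PL) model commonly used in IRT to draw ICCs; here a is the slope (discrimination) and b the location (difficulty), with hypothetical parameter values:

```python
# P(theta) = 1 / (1 + exp(-a * (theta - b))): probability of a correct
# response at ability theta. Larger a -> steeper curve (more discriminating);
# larger b -> curve shifted right (harder item).
import math

def icc(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

for theta in (-2, 0, 2):
    easy = icc(theta, a=1.5, b=-1.0)   # shifted left: easy item
    hard = icc(theta, a=1.5, b=+1.0)   # shifted right: hard item
    print(f"theta = {theta:+d}: P(easy) = {easy:.2f}, P(hard) = {hard:.2f}")
```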
Speed tests
Item analyses of tests taken under speed conditions yield misleading or uninterpretable results:
because many testtakers never reach the items near the end of the test before time runs out, the
closer an item is to the end of the test, the more difficult it may appear to be.
Expert panels
may also provide qualitative analyses of test items
sensitivity review
is a study of test items, typically conducted during the test development process, in which items
are examined for fairness to all prospective testtakers and for the presence of offensive
language, stereotypes, or situations.
TEST REVISION
cross-validation
refers to the revalidation of a test on a sample of test takers other than those on whom test
performance was originally found to be a valid predictor of some criterion
validity shrinkage
is the decrease in item validities that inevitably occurs after cross-validation of findings.
Shrinkage is expected and is viewed as integral to the test development process.
co-validation
may be defined as a test validation process conducted on two or more tests using the same
sample of testtakers.
Co-validation is beneficial to test publishers because it is economical.
co-norming
When co-validation is used in conjunction with the creation of norms or the revision of existing
norms, the process may also be referred to as co-norming.
Co-validating and/or co-norming tests
is a current trend among test publishers who publish more than one test designed for use with
the same population.
DIF analysis
test developers scrutinize group-by-group item response curves, looking for what are termed
DIF items
DIF items
are those items that respondents from different groups at the same level of the underlying trait
have different probabilities of endorsing as a function of their group membership.
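A deliberately simplified illustration of the group-by-group comparison on hypothetical counts (operational DIF analyses use procedures such as Mantel-Haenszel statistics or direct comparison of the IRT item response curves themselves):

```python
# Compare endorsement rates for two groups *matched* on the underlying trait.
# A large gap at the same trait level is the signature of a possible DIF item.
matched = {   # trait level -> ((group A endorsed, n), (group B endorsed, n))
    "low":    ((12, 40), (11, 38)),
    "medium": ((25, 40), (14, 41)),   # big gap at equal trait level
    "high":   ((33, 40), (31, 39)),
}

for level, ((ea, na), (eb, nb)) in matched.items():
    gap = ea / na - eb / nb
    flag = "<- possible DIF" if abs(gap) > 0.10 else ""
    print(f"{level:>6}: endorsement gap = {gap:+.2f} {flag}")
```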