Journal of Psychoeducational Assessment
http://jpa.sagepub.com

Psychometric Assessment and Reporting Practices: Incongruence Between Theory and Practice
Kathleen L. Slaney, Masha Tkatchouk, Stephanie M. Gabriel, and Michael D. Maraun
Journal of Psychoeducational Assessment 2009; 27; 465; originally published online July 9, 2009
DOI: 10.1177/0734282909335781
The online version of this article can be found at: http://jpa.sagepub.com/cgi/content/abstract/27/6/465
Downloaded from http://jpa.sagepub.com at SIMON FRASER LIBRARY on November 20, 2009
Psychometric Assessment and Reporting Practices: Incongruence Between Theory and Practice

Journal of Psychoeducational Assessment
Volume 27, Number 6, December 2009, 465-476
© 2009 SAGE Publications
DOI: 10.1177/0734282909335781
http://jpa.sagepub.com hosted at http://online.sagepub.com

Kathleen L. Slaney
Masha Tkatchouk
Stephanie M. Gabriel
Michael D. Maraun
Simon Fraser University
The aim of the current study is twofold: (a) to investigate the rates at which researchers assess and report on the psychometric properties of the measures they use in their research and (b) to examine whether researchers appear generally to be employing sound or unsound rationales in how they conduct test evaluations. Based on a sample of 368 articles published in four journals in the year 2004, the findings suggest that, although evidence bearing on score precision/reliability and the internal structure of item responses remains underreported, researchers appear to be assessing the relationships between test scores and external variables relatively more frequently than in the past. However, findings also indicate that, all told, very few researchers are assessing and reporting on internal score validity, score precision/reliability, and external score validity in that sequence, suggesting that applied researchers may not always be adopting sound test-evaluative rationales in their psychometric assessments.
Keywords: psychometric assessment; psychometric reporting practices; test analysis; internal score validity; external score validity
The past several decades have seen substantial and impressive developments in psychometric theory, resulting in the availability of an ever-growing set of concepts and tools
from which the applied researcher may choose when analyzing the properties of a test.
However, despite the existence of guidelines such as those in the most recent versions of
Standards for Educational and Psychological Testing (American Educational Research
Association, the American Psychological Association, and the National Council on
Measurement in Education, 1999) and the set of recommendations published by the APA
Task Force on Statistical Inference (Wilkinson & The APA TFSI, 1999), practices concerning the analysis and reporting of test data remain inconsistent.
Authors’ Note: This study was supported by a Simon Fraser University–Social Sciences and Humanities
Research Council of Canada (SFU–SSHRC) Institutional Grant awarded to the first author. Please address correspondence to Kathleen L. Slaney, Department of Psychology, Simon Fraser University, 8888 University
Drive, Burnaby, British Columbia, V5A 1S6, Canada; e-mail: klslaney@sfu.ca.
Almost 30 years ago, in his review of articles published in the American Educational
Research Journal (AERJ) between 1969 and 1978, Willson (1980) found that only 37.0%
of the studies explicitly reported reliability coefficients for the data analyzed and only
another 18% reported reliability estimates from previously published studies. He consequently remarked, “That reliability . . . is unreported in almost half the published research
is . . . inexcusable at this late date,” and that “Editors and reviewers ought to routinely
return papers that fail to establish the psychometric properties of the instruments they use”
(p. 9). However, the situation in more recent years appears to be little improved: In their
meta-analysis of reliability generalization (RG) studies, Vacha-Haase, Henson, and Caruso
(2002) found that in most empirical studies, authors fail to report reliability of their own
scores (M = 75.6%, SD = 17.0%) and often do not even mention reliability (M = 56.3%,
SD = 14.4%); Vacha-Haase, Ness, Nilsson, and Reetz (1999) found that on average 36.4%
of reviewed articles did not make any reference to reliability at all, only 35.6% provided
reliability coefficients for the data analyzed in the study at hand, 22.9% reported reliability
coefficients from previous studies, and 3.8% provided only citations of articles in which
reliability was reported; in their review of all of the 1996 issues of the Journal of Counseling
& Development (JCD), Thompson and Snyder (1998) found that only 36.0% of the reviewed studies reported reliability for the data analyzed, but 84% reported reliability
estimates from previous studies in which the same measure was employed; Whittington
(1998) found, among other things, that researchers often fail to consider sample/population characteristics when reporting reliability (75% of measures from other sources/64% of articles) or validity evidence (86% of measures from other sources/82% of articles). Qualls and Moss (1996) examined all articles published in 22 of the then 25 American Psychological
Association (APA) journals for the year 1992 and found that score reliability was reported for 41% of measures and validity evidence for only 31.7% and, furthermore, that this evidence was not always based on the data collected for the study at hand.
Research in this area has not only examined whether psychometric evidence is reported
but also what type of evidence is reported. Hogan, Benjamin, and Brezinski (2000) found
that, although reliability information was reported for 93.8% of measures, in most cases,
only one type of reliability was reported, and this was most often coefficient alpha. In a
subsequent study, Hogan and Agnello (2004) found that when validity evidence was
reported, it was most often quantified in terms of bivariate correlations between test scores
and other variables (i.e., criterion-related validity).
In addition to these findings, we believe another striking feature of current applied test-analytic practice is the lack of consistency in terms of how test evaluation proceeds. In particular, although researchers almost universally appear to understand the difference between the reliability (or, more generally, precision) and the validity of test scores, the relationship between the two, and its relevance to test evaluation, often seems not to be recognized. Slaney and Maraun (2008) have distinguished between two major components of data-based validity assessment that must be decoupled so that each may be appropriately dealt with. They refer to these components as internal test validity and external test validity.
Internal test validity refers, roughly speaking, to the extent to which item responses relate
to one another in a way predicted by the theory about the construct(s) that the test was
designed to measure (Cronbach & Meehl, 1955; Loevinger, 1957; Peak, 1953). Typically,
assessment of internal test validity involves determining whether the item responses
can be adequately described by a measurement model that has been chosen to represent the
internal association structure of responses to the items of the test under consideration. The
external test validity of a test, conversely, encapsulates any and all evidence supporting
particular predicted relations among the item scores (or, more typically, composites of item
scores) and variables external to the test (e.g., particular criteria, such as GPA, other test
scores, etc.).
It is argued by Slaney and Maraun (2008) that only once a set of item responses has been
shown to have internal test validity, can the responses be justifiably composited and the
precision/reliability of the resulting composite(s) assessed. If such a composite(s) is shown
to be adequately precise, then it (they) can be entered into investigations of external test
validity (e.g., correlating it [them] with relevant variables which are external to the measure
in question). In this way, the various psychometric properties of a test are intimately interrelated. To our knowledge, very little, if any, research has examined test-evaluative practices bearing on this interrelated nature of the three psychometric characteristics of tests.
Here, to keep clear that what is (or is not) being assessed is the validity of test scores, or
interpretations of scores, and not tests per se, we will refer to internal score validity and
external score validity.
The aim of the current work is, thus, twofold. First, we would like to follow up on previous work investigating psychometric assessment and reporting practices by conducting a
general investigation into the rates at which score precision/reliability and validity evidence are being reported in more recently published studies. Second, we will investigate
the extent to which researchers’ understandings of the particular relationships among distinct components of test evaluation are reflected in the manners in which they analyze/
report on these different features of test data.
Method
Sample
Articles published in four peer-reviewed journals in the year 2004 were reviewed.
Because the aim of the study was to describe commonly occurring test-evaluative practices
and the rationales underlying them, the following journals were chosen because of the high
likelihood that they would contain articles in which test data was collected and analyzed:
Educational and Psychological Measurement (EPM; volume 64—all articles, not just validity studies), Psychological Assessment (PA; volume 16), Journal of Personality Assessment
(JPA; volumes 82 and 83), and Personality and Individual Differences (PAID; volumes 36
and 37). Only articles that could be identified as employing at least one quantitative measure, and in which some component of evaluation of at least one of the measures employed was reported (i.e., reliability and/or validity evidence of any kind), were subject to further review, resulting in a sample of 368 articles. This sample of articles contained information
on a total of 1,211 measures. Given that many of the articles reviewed employed more than
one measure, we coded the test-evaluative practices separately for each measure. The results
reported herein are summarized at both the article- and measure-level of analysis.
Procedure
A coding form was developed for assessing frequencies of a number of test-evaluative
practices. Amongst other things, for each measure we coded (1) whether the measure was
(a) preexisting, (b) newly developed, or (c) modified for the study at hand; (2) type of
response format. We also coded for (3) whether (a) the issue of precision/reliability was
addressed at all, and/or evidence of precision/reliability of measurement was (b) invoked
from test manuals and/or previous studies in which the same measure(s) was used, and/or
(c) generated from the data analyzed in the study; (4) whether (a) the issue of validity was
addressed at all, and/or validity evidence was (b) invoked from external sources, and/or
(c) generated from the data analyzed in the study. For measures in which evidence bearing
on precision/reliability of measurement was reported directly, we coded for (5) whether
reported precision/reliability estimates were (a) “internal consistency” (e.g., coefficient
alpha, KR20, etc.), (b) test–retest, (c) other coefficients, and (d) if test–retest, was it explicitly used as a measure of stability. For measures in which validity evidence was reported
directly, we assessed (6) whether or not the “theoretical structure” of the test was explicitly
identified (i.e., how many and which attributes/qualities/properties the measure has been
designed to measure); (7) whether and how the (a) internal score validity and (b) external
score validity of the test were assessed. Finally, for measures for which precision/reliability
of measurement was directly assessed and internal score validity assessed and external
score validity assessed, we coded (8) whether (a) internal score validity evidence was
reported prior to reporting evidence bearing on the precision/reliability of scores, or vice
versa, (b) internal score validity evidence was reported prior to reporting external score
validity evidence, or vice versa, and (c) score precision/reliability evidence was reported
prior to external score validity evidence, or vice versa.
The coding was completed over a period of 1 year by two of the authors on the current
article. Each coded a separate set of articles; no articles were coded by more than one author.
To determine whether coding was stable over time, a subset of the items for 20 articles
was recoded (5 randomly chosen articles from each of the four reviewed journals) and
the percentage of absolute agreement calculated for a subset of the items appearing on
the coding form. The resulting percentages of absolute agreement for the original and
recoded items ranged between 85% and 100%, with an average of 93.3% over the 12
recoded items.
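The percentage-of-absolute-agreement statistic used in this stability check can be computed as in the following minimal sketch; the code lists shown are hypothetical, not the study's actual data.

```python
def percent_absolute_agreement(original, recoded):
    """Percentage of coding decisions that match exactly between the
    original coding pass and the later recoding pass of the same items."""
    if len(original) != len(recoded) or not original:
        raise ValueError("code lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(original, recoded) if a == b)
    return 100.0 * matches / len(original)

# Hypothetical example: one coding item recoded for 20 articles,
# with a single disagreement between the two passes.
original_codes = ["yes"] * 18 + ["no"] * 2
recoded_codes = ["yes"] * 17 + ["no"] * 3
agreement = percent_absolute_agreement(original_codes, recoded_codes)  # 95.0
```

Averaging this statistic over the recoded coding-form items yields the 93.3% figure reported above.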
Results
Demographics
Table 1 summarizes both article- and measure-level demographic information. Of the
368 articles, 39 (10.6%) were published in EPM, 38 (10.3%) in PA, 36 (9.8%) in JPA, and
255 (69.3%) in PAID. Of the 1,211 measures coded, 49 (4.0%) appeared in EPM, 146
(12.1%) in PA, 178 (14.7%) in JPA, and 838 (69.2%) in PAID. The studies reviewed in EPM, PA, JPA, and PAID reported on average 1.26, 3.84, 4.94, and 3.29 measures, respectively. Overall, the average number of measures employed in a given article was 3.29.
Table 1
Demographic Article/Measure Information

Journal (Volume)                      EPM (64)          PA (16)           JPA (82, 83)      PAID (36, 37)     Overall
Number of articles reviewed           39 (10.6%)        38 (10.3%)        36 (9.8%)         255 (69.3%)       n = 368
Number of measures coded              49 (4.0%)         146 (12.1%)       178 (14.7%)       838 (69.2%)       n = 1,211
Average number of employed measures   1.26 (SD = 0.74)  3.84 (SD = 3.24)  4.94 (SD = 5.55)  3.29 (SD = 2.17)  3.29 (SD = 2.86)
Status of measure
  Preexisting                         22                114               157               692               985 (81.3%)
  New                                 5                 15                13                61                94 (7.8%)
  Modified                            17                16                8                 83                124 (10.2%)
Type of response format
  Nominal                             3                 1                 0                 1                 5 (0.4%)
  Dichotomous                         5                 14                14                74                107 (8.8%)
  Ordered-categorical                 31                59                85                387               562 (46.4%)
  Continuous                          0                 0                 0                 3                 3 (0.2%)

NOTES: EPM = Educational and Psychological Measurement; PA = Psychological Assessment; JPA = Journal of Personality Assessment; PAID = Personality and Individual Differences.
Table 1 also summarizes demographic information related to measure status (i.e., preexisting, newly developed, or modified) and type of response format for both the total sample
and broken down by journal.
Psychometric Assessment and Reporting
Table 2 lists both the article- and measure-level frequencies of the primary psychometric
assessment and reporting practices investigated in the current study. At the article level, for at
least one of the measures employed, 334 (90.8%) of the reviewed articles addressed (i.e., at
least mentioned) the issue of precision/reliability, 225 (61.1%) invoked score precision/
reliability evidence from external sources, and 268 (72.8%) reported reliability estimates for
the data analyzed in the study at hand. In all, 101 (27.4%) cited precision/reliability evidence
from other studies but did not actually directly assess the precision/reliability of the test/
subtest scores. Of the 268 articles in which precision/reliability estimates were generated for
at least one measure, 245 (91.4%) employed an “internal consistency” (e.g., Cronbach’s
alpha, KR20, etc.) coefficient, whereas only 51 (19.0%) employed other estimates of precision/reliability, including test–retest coefficients, the latter of which were employed in 32
(11.9%) of the articles reviewed. However, only 31.3% (n = 10) of the test–retest coefficients
were explicitly interpreted as measures of stability of measurement over time.
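For reference, coefficient alpha, the dominant "internal consistency" estimate in these articles, is defined for a composite X = Y_1 + . . . + Y_k of k items as

```latex
\alpha \;=\; \frac{k}{k-1}\left(1 \;-\; \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)
```

where \sigma^{2}_{Y_i} is the variance of item i and \sigma^{2}_{X} is the variance of the total score; KR20 is the special case for dichotomous items, with \sum_{i} p_i(1-p_i) in place of the summed item variances.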
Validity was addressed in 354 (96.2%), validity evidence invoked in 259 (70.4%), and
validity evidence generated in 340 (92.4%) of the reviewed articles. Validity evidence was
cited but not generated for at least one of the measures employed in 219 (59.5%) articles.
Overall, at the article level of analysis, validity was addressed/invoked/reported more often
Table 2
Frequency of Psychometric Assessment and Reporting by Article and Measure

Assessment and Reporting Practice           Articles^a (n = 368)   Measures (n = 1,211)
Precision/reliability addressed             334 (90.8%)            947 (78.2%)
Precision/reliability evidence invoked      225 (61.1%)            547 (45.2%)
Precision/reliability evidence generated    268 (72.8%)            696 (57.5%)
  Internal consistency                      245 (91.4% of 268)     635 (91.2% of 696)
  Test–retest                               32 (11.9% of 268)      47 (6.8% of 696)
  Other                                     19 (7.1% of 268)       63 (9.1% of 696)
Validity addressed                          354 (96.2%)            1,167 (96.4%)
Validity evidence invoked                   259 (70.4%)            552 (45.6%)
Validity evidence generated                 340 (92.4%)            1,139 (94.1%)
  Theoretical structure identified          233 (68.5% of 340)     622 (54.6% of 1,139)
  Internal score validity assessed          175 (51.5% of 340)     315 (27.7% of 1,139)
  External score validity assessed          287 (84.4% of 340)     1,040 (91.3% of 1,139)

a. For at least one of the measures described in the study.
than was evidence bearing on precision/reliability. Of the 340 articles in which validity
evidence was generated, the theoretical structure of at least one of the measures employed
in the study was explicitly identified in 233 (68.5%). However, of these, only 36 involved
studies in which more than one measure was employed and, of these, only 8 (22.2%)
explicitly identified the theoretical structure for all of the measures employed in the study
at hand. For those articles in which validity evidence was generated, the internal score
validity and external score validity of at least one measure were assessed in 175 (51.5%)
and 287 (84.4%), respectively.
As can be seen in Table 2, the measure-level results are generally consistent with this
pattern of findings. Most of the measure-level percentages reported are similar to or lower
than the corresponding percentages reported for particular practices seen at the article level.
This indicates that researchers engage in certain practices for a subset, but not necessarily
all, of the measures they employ. However, there were two notable exceptions to this general pattern: Whereas the rate at which internal score validity was assessed was much higher at the article level than at the measure level (51.5% vs. 27.7%), the opposite pattern was true for
external score validity (84.4% vs. 91.3%). This indicates that internal score validity is often
assessed only for a subset of the measures employed; external score validity, conversely,
when it is being assessed by researchers, is being assessed for most if not all of the measures employed.
Test-Analytic Rationale
For the purposes of the current study, we take researchers' providing evidence for both the score precision/reliability and validity of the measures they employ in their research as an indication that they have adopted a minimally appropriate test-evaluative rationale. However, if the aim is to determine whether a test score provides a good "indicator" of the attribute purportedly measured by the test, then merely assessing a measure's precision/reliability and validity falls somewhat short of this aim. Rather, for reasons outlined above, a coherent approach to data-based test analysis requires that internal score validity is assessed prior to the assessment of precision/reliability, that internal score validity is assessed prior to external score validity, that score precision/reliability is assessed prior to external score validity, and, ideally, that internal score validity, score precision/reliability, and external score validity are all assessed, and in that sequence.

Table 3
Frequency of Practices Bearing on Test-Analytic Rationale by Article and Measure

Test-Analytic Practice                                                 Articles^a (n = 368)   Measures^b (n = 1,211)
Precision/reliability or validity evidence (or both) generated         361 (98.1%)            1,182 (97.6%)
Precision/reliability or validity evidence (but not both) generated    200 (54.3%)            529 (43.7%)
  Precision/reliability (but not validity) evidence generated          27 (13.5% of 200)      43 (8.1% of 529)
  Validity (but not precision/reliability) evidence generated          173 (86.5% of 200)     486 (91.9% of 529)
Both precision/reliability and validity evidence generated             244 (66.3%)            653 (53.9%)
Theoretical structure identified, internal score validity,
  and external score validity assessed                                 89 (24.2%)             136 (11.2%)
Internal score validity, precision/reliability, and external
  score validity assessed                                              103 (28.0%)            177 (14.6%)
  Internal prior to precision/reliability                              41 (39.8% of 103)      61 (34.5% of 177)
  Internal prior to external                                           82 (79.6% of 103)      131 (74.0% of 177)
  Precision/reliability prior to external                              89 (86.4% of 103)      153 (86.4% of 177)
  In full sequence                                                     33 (32.0% of 103)      50 (28.2% of 177)
  Precision/reliability prior to internal                              57 (55.3% of 103)      101 (57.1% of 177)
  External prior to internal                                           14 (13.6% of 103)      30 (16.9% of 177)
  External prior to precision/reliability                              7 (6.8% of 103)        9 (5.1% of 177)

a. For at least one of the measures described in the study.
b. Six cases were removed from this analysis because of inconsistency in coding.
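Read as an analysis pipeline, the recommended sequence can be sketched as follows. This is a hypothetical illustration, not the authors' procedure: the eigenvalue ratio is only a crude stand-in for fitting the chosen measurement model (e.g., a confirmatory factor model), and the cutoffs (0.40 and .70) are arbitrary placeholders.

```python
import numpy as np

def evaluate_test(items: np.ndarray, external: np.ndarray,
                  min_alpha: float = 0.70) -> dict:
    """Sketch of the three-step sequence: internal score validity first,
    then precision/reliability of the composite, then external validity.
    items: (n_respondents, k_items); external: (n_respondents,)."""
    n, k = items.shape
    results = {}

    # Step 1 -- internal score validity (crude proxy): ask whether the
    # inter-item correlation matrix looks consistent with a single
    # dimension, via the share of variance on the first eigenvalue.
    R = np.corrcoef(items, rowvar=False)
    eigvals = np.linalg.eigvalsh(R)[::-1]          # descending order
    results["first_eig_ratio"] = eigvals[0] / k
    results["internal_ok"] = results["first_eig_ratio"] > 0.40  # placeholder cutoff

    # Step 2 -- only if compositing is defensible, form the composite
    # and estimate its precision/reliability with coefficient alpha.
    if results["internal_ok"]:
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        results["alpha"] = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
        results["precise_ok"] = results["alpha"] >= min_alpha

        # Step 3 -- only if adequately precise, examine external score
        # validity, e.g., the correlation with a relevant external variable.
        if results["precise_ok"]:
            results["external_r"] = np.corrcoef(items.sum(axis=1), external)[0, 1]

    return results
```

The nesting makes the rationale explicit: a failed internal-structure check blocks the reliability estimate, and an imprecise composite blocks the external-validity correlations.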
Table 3 summarizes a number of practices which, we believe, indicate whether or not a
sound test-analytic rationale was adopted by researchers. At the article level of analysis,
98.1% (n = 361) of the studies generated either score precision/reliability evidence, validity
evidence, or both, 54.3% (n = 200) generated either one or the other (but not both), but only
66.3% (n = 244) assessed both score precision/reliability and validity for at least one of the
measures described in the study (and many did not assess both the precision/reliability and
validity for all of the measures described). In only 89 (24.2%) of the reviewed articles was
the theoretical structure identified and both the internal score validity and the external score
validity of the measure assessed, all of which, we argue, are minimally required to assess
whether the particular administration of the test yields “valid” scores of the attribute(s) in
question for the population under study.
Of the 103 articles for which internal score validity, precision/reliability, and external
score validity were all assessed for at least one of the measures employed, 41 (39.8%)
assessed internal score validity prior to assessing precision/reliability of scores as compared
to the 57 (55.3%) articles in which the pattern was reversed for at least one of the measures
employed; 82 (79.6%) assessed internal score validity prior to investigating external score
validity as compared with 14 (13.6%) in which the pattern was reversed for at least one of
the measures employed; 89 (86.4%) investigated measurement precision/reliability prior to
investigating external score validity as compared with the 7 (6.8%) in which the pattern was
reversed for at least one of the measures employed. However, in only 33 (32.0%) of the
articles were internal score validity, score precision/reliability, and external score validity
all assessed, and in that sequence, for at least one of the measures employed.
This potentially indicates that not enough researchers are aware of, or appreciate, the notion that it may be of little use to assess the precision/reliability of a "score" if that score
is in fact a composite of items measuring relatively distinct attributes or qualities (or distinct facets of a given attribute or quality). However, researchers appear to have a much
better understanding of the relationship between internal and external score validity, and
also between score precision/reliability and external score validity.
The measure-level results are, once again, quite consistent with those at the article level
of analysis. The only exceptions are the relative rates at which theoretical structure, internal
score validity and external score validity are all assessed and the relative rates at which internal score validity, precision/reliability, and external score validity are all assessed, with the
article-level rates being higher than the measure-level rates. This indicates that some
researchers are not only employing a potentially unsound rationale with respect to assessing
internal score validity, score precision/reliability, and external score validity, but that they
may also be inconsistent in their assessments of different measures used in the same study.
Measure-Level Results by Journal and Status
It seems reasonable to consider that the extent to which psychometric evidence is
reported could be influenced by the (possibly differing) editorial guidelines provided by
different journals. Likewise, when a study involves the use of either a newly developed or
modified measure, it is possible that researchers, reviewers, and editors alike may place
more importance on the assessment of the psychometric properties of such measures than
they might on previously developed measures, especially those which have had long-term
and broad application. For these reasons, we thought it pertinent to provide a subset of the
results presented above broken down by both journal and status of measure.
Tables 4 and 5 summarize the relative rates at which a subset of psychometric assessment
and reporting practices occur, broken down by journal and measure status, respectively.
Relatively speaking, both precision/reliability and validity were addressed, and precision/
reliability and validity evidence invoked and generated, more often in measures described
in articles published in PA than in the other three journals. However, whereas external score
validity was also assessed relatively most often in PA, internal score validity was assessed
with the highest relative frequency for measures published in EPM. With respect to the
(usually implicit) logic guiding the evaluation of psychometric properties of measures,
evidence that sound test-analytic rationales are being adopted by researchers is most
evident for measures appearing in JPA and PA.
With regard to measure status, precision/reliability was addressed and evidence generated proportionately more for new and modified measures than for preexisting measures.
Table 4
Percentage of Measures Reporting Psychometric Practices by Journal

Practice                                               EPM    JPA    PA     PAID
Precision/reliability addressed                        59.5   60.8   88.3   75.4
Precision/reliability evidence invoked                 30.4   44.7   66.0   37.3
Precision/reliability evidence generated               54.4   37.7   58.7   54.5
Validity addressed                                     57.0   86.9   97.3   89.3
Validity evidence invoked                              32.9   41.2   64.7   38.6
Validity evidence generated                            57.0   81.4   91.3   88.0
Internal score validity assessed                       51.9   12.6   19.3   24.6
External score validity assessed                       43.0   78.4   86.0   80.6
Precision/reliability and validity evidence generated  83.7   38.2   54.8   55.4
Internal prior to precision/reliability                37.9   60.0   50.0   26.4
Internal prior to external                             48.2   93.3   83.3   72.7
Precision/reliability prior to external                87.5   93.3   94.4   94.2
In full sequence                                       31.0   53.3   44.4   21.7

NOTES: EPM = Educational and Psychological Measurement; PA = Psychological Assessment; JPA = Journal of Personality Assessment; PAID = Personality and Individual Differences.
Not surprisingly, precision/reliability evidence was invoked, relatively speaking, most often
for preexisting measures. As regards validity, there appears to be little difference between
the three statuses of measure in the relative rates at which validity was addressed and
validity evidence generated. However, as with precision/reliability evidence, validity evidence was invoked relatively more often for preexisting measures, and internal score validity was assessed at a higher rate for new measures. The percentages of measures for which
external score validity was assessed were similar across the different statuses of measure.
As for whether researchers appeared to adopt differing logics for assessing new or modified
as opposed to preexisting measures, generally speaking, sounder test-evaluative rationales
were adopted for assessing new measures than for preexisting and modified measures.
Discussion
Consistent with previous related research, there is evidence that score precision/reliability continues to be underreported by applied researchers. In particular, too high a
percentage of researchers still rely heavily on invoked precision/reliability evidence.
This practice, in the absence of a demonstration that the composition and variability of the sample at hand are similar to those of the sample from which the reliability estimate was inducted (Vacha-Haase, Kogan, & Thompson, 2000), may be unjustified. The
present study also indicates that researchers may be underreporting validity evidence
bearing on internal structure. However, in contrast to the findings of others (e.g., Hogan
& Agnello, 2004; Qualls & Moss, 1996; Whittington, 1998) that the rates of reporting
validity evidence lagged substantially behind that of reliability, we found the opposite
Table 5
Percentage of Measures Reporting Psychometric Practices by Status of Measure

Practice                                               Preexisting   New     Modified
Precision/reliability addressed                        78.5          87.2    90.3
Precision/reliability evidence invoked                 48.9          9.6^a   44.4
Precision/reliability evidence generated               52.1          83.0    79.8
Validity addressed                                     96.3          97.9    95.2
Validity evidence invoked                              49.8          7.4^a   41.1
Validity evidence generated                            93.7          95.7    95.2
Internal score validity assessed                       21.4          56.4    37.9
External score validity assessed                       86.8          81.9    83.9
Precision/reliability and validity evidence generated  48.6          79.8    75.0
Internal prior to precision/reliability                27.0          45.9    32.3
Internal prior to external                             68.5          81.1    67.7
Precision/reliability prior to external                84.7          91.9    87.1
In full sequence                                       21.7          41.7    25.8

a. Likely due to coding unreliability.
pattern to be the case. One potential explanation for this discrepancy is that previous
studies did not distinguish between the assessment of internal and external score validity;
when the two were decoupled in the current study, we found that although the rate at which external score
validity was assessed was much higher than in previous studies, internal score validity
was ignored by researchers in the majority of the articles reviewed in the current study.
A second possibility is that our particular sample of journals might be more likely to
include studies whose primary focus is the assessment of validity, so one is more
likely to encounter the reporting of validity evidence in articles published therein than
would generally be the case.
With respect to whether researchers appear to be adopting sound test-analytic rationales, the answer seems to be a resounding “somewhat.” Researchers are generally not
assessing in a confirmatory manner whether the items of a given test measure a specific attribute or set of attributes in a particular way (i.e., they do not often identify the theoretical structures of measures and then assess internal test score
validity). Additionally, although most researchers who assess both internal and external score validity do so in that order, and most assess score precision/reliability prior to external test score validity, far too few assess internal test score validity prior to precision/reliability, or assess precision/reliability and test validity (both internal and external) in the proper sequence. This is not an insignificant finding: if the question of a test's internal score validity remains unanswered,
there are obvious implications for the relevance of findings bearing on score precision/
reliability and external test score validity. Given the paucity of work investigating
the logic inherent in test-analytic practices, further research is required to shed greater
light on these findings.
The current study could also be improved on in several important ways. First, it should be
determined whether data-based test evaluation practices have improved since 2004, the
year in which the reviewed articles were published. Second, it is possible that our results
were influenced by the particular sample of journals that were reviewed and do not represent well the test evaluation practices across a more diverse set of research areas and/or
types of measurement instruments. To explore this issue further, we are involved in
another study in which the sample consists of articles published more recently in a cross-section of journals. A third potential limitation is that the results may not simply reflect
what the researchers did or did not do with respect to psychometric assessment, but other
factors, such as the editorial policies of the journals reviewed or the requirements of the
specific purpose for which the test is being evaluated (e.g., internal score validity studies). Future research (along the lines of Fidler, Thomason, Cumming, Finch, & Leeman, 2004, and Vacha-Haase, Nilsson, Reetz, Lance, & Thompson, 2000) is required to explore
these issues. The primary value of the current study is in providing a reference point in
time for particular test analytic conventions and in indicating where future work in this
area is needed.
References
American Educational Research Association, American Psychological Association, & National Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC:
American Educational Research Association.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity and psychological tests. Psychological Bulletin, 52,
281-302.
Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can’t make them think: Statistical reform lessons from medicine. Psychological Science,
15, 119-126.
Hogan, T. P., & Agnello, J. (2004). An empirical study of reporting practices concerning measurement validity. Educational and Psychological Measurement, 64, 802-812.
Hogan, T. P., Benjamin, A., & Brezinski, K. L. (2000). Reliability methods: A note on the frequency of various types. Educational and Psychological Measurement, 60, 523-531.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.
Peak, H. (1953). Problems of objective observation. In L. Festinger & D. Katz (Eds.), Research methods in the
behavioral sciences (pp. 243-299). New York: Holt, Rinehart & Winston.
Qualls, A. L., & Moss, A. D. (1996). The degree of congruence between test standards and test documentation
within journal publications. Educational and Psychological Measurement, 56, 209-214.
Slaney, K. L., & Maraun, M. D. (2008). A proposed framework for conducting data-based test analysis.
Psychological Methods, 13, 376-390.
Thompson, B., & Snyder, P. A. (1998). Statistical significance and reliability analyses in recent Journal of
Counseling & Development research articles. Journal of Counseling & Development, 76, 436-441.
Vacha-Haase, T., Henson, R. K., & Caruso, J. C. (2002). Reliability generalization: Moving toward improved
understanding and use of score reliability. Educational and Psychological Measurement, 62, 562-569.
Vacha-Haase, T., Kogan, L. R., & Thompson, B. (2000). Sample compositions and variabilities in published
studies versus those in test manuals: Validity and score reliability inductions. Educational and Psychological
Measurement, 60, 509-522.
Vacha-Haase, T., Ness, C., Nilsson, J., & Reetz, D. (1999). Practices regarding reporting of reliability coefficients: A review of three journals. Journal of Experimental Education, 67, 335-341.
Vacha-Haase, T., Nilsson, J., Reetz, D., Lance, T., & Thompson, B. (2000). Reporting practices and APA editorial policies regarding statistical significance and effect size. Theory and Psychology, 10, 413-425.
Whittington, D. (1998). How well do researchers report their measures? An evaluation of measurement in
published educational research. Educational and Psychological Measurement, 58, 21-37.
Wilkinson, L., & The APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Willson, V. L. (1980). Research techniques in AERJ articles: 1969-1978. Educational Researcher, 9(6), 5-10.