
Psychometric Assessment and Reporting Practices: Incongruence Between Theory and Practice

Kathleen L. Slaney, Masha Tkatchouk, Stephanie M. Gabriel, and Michael D. Maraun
Simon Fraser University

Journal of Psychoeducational Assessment, Volume 27, Number 6, December 2009, 465-476. Originally published online July 9, 2009. DOI: 10.1177/0734282909335781. © 2009 SAGE Publications. http://jpa.sagepub.com

The aim of the current study is twofold: (a) to investigate the rates at which researchers assess and report on the psychometric properties of the measures they use in their research and (b) to examine whether or not researchers appear to be employing generally sound (or unsound) rationales in how they conduct test evaluations. Based on a sample of 368 articles published in four journals in the year 2004, the findings suggest that, although evidence bearing on score precision/reliability and the internal structure of item responses remains underreported, researchers appear to be assessing the relationships between test scores and external variables relatively more frequently than in the past. However, the findings also indicate that, all told, very few researchers are assessing and reporting on internal score validity, score precision/reliability, and external score validity, and in that sequence, suggesting that applied researchers may not always be adopting sound test-evaluative rationales in their psychometric assessments.

Keywords: psychometric assessment; psychometric reporting practices; test analysis; internal score validity; external score validity

Authors' Note: This study was supported by a Simon Fraser University–Social Sciences and Humanities Research Council of Canada (SFU–SSHRC) Institutional Grant awarded to the first author. Please address correspondence to Kathleen L. Slaney, Department of Psychology, Simon Fraser University, 8888 University Drive, Burnaby, British Columbia, V5A 1S6, Canada; e-mail: klslaney@sfu.ca.

The past several decades have seen substantial and impressive developments in psychometric theory, resulting in the availability of an ever-growing set of concepts and tools from which the applied researcher may choose when analyzing the properties of a test. However, despite the existence of guidelines such as those in the most recent version of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) and the set of recommendations published by the APA Task Force on Statistical Inference (Wilkinson & the APA TFSI, 1999), practices concerning the analysis and reporting of test data remain inconsistent.
Almost 30 years ago, in his review of articles published in the American Educational Research Journal (AERJ) between 1969 and 1978, Willson (1980) found that only 37.0% of the studies explicitly reported reliability coefficients for the data analyzed and that only another 18% reported reliability estimates from previously published studies. He consequently remarked, "That reliability . . . is unreported in almost half the published research is . . . inexcusable at this late date," and that "Editors and reviewers ought to routinely return papers that fail to establish the psychometric properties of the instruments they use" (p. 9). However, the situation in more recent years appears little improved. In their meta-analysis of reliability generalization (RG) studies, Vacha-Haase, Henson, and Caruso (2002) found that in most empirical studies authors fail to report reliability for their own scores (M = 75.6%, SD = 17.0%) and often do not even mention reliability (M = 56.3%, SD = 14.4%). Vacha-Haase, Ness, Nilsson, and Reetz (1999) found that, on average, 36.4% of reviewed articles made no reference to reliability at all, only 35.6% provided reliability coefficients for the data analyzed in the study at hand, 22.9% reported reliability coefficients from previous studies, and 3.8% provided only citations of articles in which reliability was reported. In their review of all of the 1996 issues of the Journal of Counseling & Development (JCD), Thompson and Snyder (1998) found that only 36.0% of the reviewed studies reported reliability for the data analyzed, although 84% reported reliability estimates from previous studies in which the same measure was employed. Whittington (1998) found, among other things, that researchers often fail to consider sample/population characteristics when reporting reliability (75% of measures from other sources/64% of articles) or validity evidence (86% of measures from other sources/82% of articles). Qualls and Moss (1996) examined all articles published in 22 of the then 25 American Psychological Association (APA) journals for the year 1992 and found that score reliability was reported for 41% of measures and validity evidence for only 31.7% of measures and, furthermore, that this evidence was not always based on the data collected for the study at hand.

Research in this area has examined not only whether psychometric evidence is reported but also what type of evidence is reported. Hogan, Benjamin, and Brezinski (2000) found that, although reliability information was reported for 93.8% of measures, in most cases only one type of reliability was reported, most often coefficient alpha. In a subsequent study, Hogan and Agnello (2004) found that when validity evidence was reported, it was most often quantified in terms of bivariate correlations between test scores and other variables (i.e., criterion-related validity). In addition to these findings, we believe another striking feature of current applied test-analytic practice is the lack of consistency in how test evaluation proceeds.
In particular, although researchers almost universally appear to understand the difference between the reliability (or, more generally, precision) and the validity of test scores, the relationship between the two, and its relevance to test evaluation, seems often not to be recognized. Slaney and Maraun (2008) have distinguished between two major components of data-based validity assessment that must be decoupled so that each may be appropriately dealt with. They refer to these components as internal test validity and external test validity. Internal test validity refers, roughly speaking, to the extent to which item responses relate to one another in a way predicted by the theory about the construct(s) that the test was designed to measure (Cronbach & Meehl, 1955; Loevinger, 1957; Peak, 1953). Typically, assessment of internal test validity involves determining whether the item responses can be adequately described by a measurement model that has been chosen to represent the internal association structure of responses to the items of the test under consideration. The external test validity of a test, conversely, encapsulates any and all evidence supporting particular predicted relations among the item scores (or, more typically, composites of item scores) and variables external to the test (e.g., particular criteria, such as GPA, other test scores, etc.). Slaney and Maraun (2008) argue that only once a set of item responses has been shown to have internal test validity can the responses be justifiably composited and the precision/reliability of the resulting composite(s) assessed. If such a composite(s) is shown to be adequately precise, then it (they) can be entered into investigations of external test validity (e.g., by correlating it [them] with relevant variables external to the measure in question). In this way, the various psychometric properties of a test are intimately interrelated. To our knowledge, very little, if any, research has examined test-evaluative practices bearing on this interrelated nature of the three psychometric characteristics of tests. Here, to keep clear that what is (or is not) being assessed is the validity of test scores, or of interpretations of scores, and not of the test per se, we will refer to internal score validity and external score validity.
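To make the sequence concrete, the following is a minimal sketch (ours, not the authors') of what this three-step rationale might look like for a single, putatively unidimensional scale. It uses only NumPy; the first-eigenvalue ratio is a crude stand-in for fitting a proper measurement model (e.g., a confirmatory factor model), and the function and variable names (evaluate_scale, item_scores, criterion) are illustrative assumptions rather than anything prescribed by Slaney and Maraun (2008).

```python
import numpy as np

def evaluate_scale(item_scores: np.ndarray, criterion: np.ndarray) -> dict:
    """Illustrative three-step test evaluation for one scale.

    item_scores: (n_respondents, n_items) matrix of item responses.
    criterion:   (n_respondents,) external variable (e.g., GPA).
    """
    # Step 1: internal score validity (crude unidimensionality check).
    # A formal assessment would fit the chosen measurement model instead.
    R = np.corrcoef(item_scores, rowvar=False)            # inter-item correlations
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
    first_eig_ratio = eigvals[0] / eigvals[1]              # large ratio suggests one dominant dimension

    # Step 2: precision/reliability of the composite (Cronbach's alpha),
    # meaningful only if Step 1 supports compositing the items.
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)
    total_var = item_scores.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Step 3: external score validity, e.g., the correlation between the
    # composite and a criterion, warranted only if Steps 1 and 2 pass.
    composite = item_scores.sum(axis=1)
    r_external = np.corrcoef(composite, criterion)[0, 1]

    return {"first_eig_ratio": first_eig_ratio, "alpha": alpha, "r_external": r_external}
```

Reporting the three quantities in this order mirrors the rationale just described; reversing it (e.g., reporting alpha for a composite whose internal structure has not been examined) is precisely the kind of practice the present study examines.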
The aim of the current work is, thus, twofold. First, we follow up on previous work investigating psychometric assessment and reporting practices by conducting a general investigation into the rates at which score precision/reliability and validity evidence are reported in more recently published studies. Second, we investigate the extent to which researchers' understanding of the particular relationships among distinct components of test evaluation is reflected in the manner in which they analyze and report on these different features of test data.

Method

Sample

Articles published in four peer-reviewed journals in the year 2004 were reviewed. Because the aim of the study was to describe commonly occurring test-evaluative practices and the rationales underlying them, the following journals were chosen because of the high likelihood that they would contain articles in which test data were collected and analyzed: Educational and Psychological Measurement (EPM; volume 64—all articles, not just validity studies), Psychological Assessment (PA; volume 16), Journal of Personality Assessment (JPA; volumes 82 and 83), and Personality and Individual Differences (PAID; volumes 36 and 37). Only articles that could be identified as employing at least one quantitative measure, and in which some component of evaluation of at least one of the measures employed was reported (i.e., reliability and/or validity evidence of any kind), were subject to further review, resulting in a sample of 368 articles. This sample of articles contained information on a total of 1,211 measures. Given that many of the articles reviewed employed more than one measure, we coded the test-evaluative practices separately for each measure. The results reported herein are summarized at both the article and measure levels of analysis.

Procedure

A coding form was developed for assessing the frequencies of a number of test-evaluative practices. Amongst other things, for each measure we coded (1) whether the measure was (a) preexisting, (b) newly developed, or (c) modified for the study at hand; and (2) the type of response format. We also coded (3) whether (a) the issue of precision/reliability was addressed at all, and/or evidence of precision/reliability of measurement was (b) invoked from test manuals and/or previous studies in which the same measure(s) was used, and/or (c) generated from the data analyzed in the study; and (4) whether (a) the issue of validity was addressed at all, and/or validity evidence was (b) invoked from external sources, and/or (c) generated from the data analyzed in the study. For measures for which evidence bearing on precision/reliability of measurement was reported directly, we coded (5) whether reported precision/reliability estimates were (a) "internal consistency" (e.g., coefficient alpha, KR20, etc.), (b) test–retest, or (c) other coefficients, and (d) if test–retest, whether it was explicitly used as a measure of stability. For measures for which validity evidence was reported directly, we assessed (6) whether or not the "theoretical structure" of the test was explicitly identified (i.e., how many and which attributes/qualities/properties the measure has been designed to measure); and (7) whether and how the (a) internal score validity and (b) external score validity of the test were assessed. Finally, for measures for which precision/reliability of measurement, internal score validity, and external score validity were all directly assessed, we coded (8) whether (a) internal score validity evidence was reported prior to evidence bearing on the precision/reliability of scores, or vice versa, (b) internal score validity evidence was reported prior to external score validity evidence, or vice versa, and (c) score precision/reliability evidence was reported prior to external score validity evidence, or vice versa.

The coding was completed over a period of 1 year by two of the authors of the current article. Each coded a separate set of articles; no articles were coded by more than one author. To determine whether coding was stable over time, a subset of the items on the coding form was recoded for 20 articles (5 randomly chosen articles from each of the four reviewed journals) and the percentage of absolute agreement between the original and recoded items was calculated. The resulting percentages of absolute agreement ranged between 85% and 100%, with an average of 93.3% over the 12 recoded items.
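For readers unfamiliar with the index, percentage of absolute agreement is simply the proportion of coded items on which the two coding passes match exactly, expressed as a percentage. The sketch below is our own illustration with hypothetical codes (the arrays and the 95.0 result are made up for the example, not data from the study).

```python
import numpy as np

def percent_absolute_agreement(first_pass: np.ndarray, second_pass: np.ndarray) -> float:
    """Percentage of coded items with identical codes across the two passes."""
    assert first_pass.shape == second_pass.shape
    return 100.0 * np.mean(first_pass == second_pass)

# Hypothetical codes for one recoded item across 20 articles (1 = practice present, 0 = absent).
original = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
recoded  = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
print(percent_absolute_agreement(original, recoded))  # 95.0 for these hypothetical data
```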
Results

Demographics

Table 1 summarizes both article- and measure-level demographic information. Of the 368 articles, 39 (10.6%) were published in EPM, 38 (10.3%) in PA, 36 (9.8%) in JPA, and 255 (69.3%) in PAID. Of the 1,211 measures coded, 49 (4.0%) appeared in EPM, 146 (12.1%) in PA, 178 (14.7%) in JPA, and 838 (69.2%) in PAID. The studies reviewed in EPM, PA, JPA, and PAID reported on an average of 1.26, 3.84, 4.94, and 3.29 measures, respectively. Overall, the average number of measures employed in a given article was 3.29.

Table 1
Demographic Article/Measure Information

                                     | EPM (64)          | PA (16)           | JPA (82, 83)      | PAID (36, 37)     | Overall
Number of articles reviewed          | 39 (10.6%)        | 38 (10.3%)        | 36 (9.8%)         | 255 (69.3%)       | n = 368
Number of measures coded             | 49 (4.0%)         | 146 (12.1%)       | 178 (14.7%)       | 838 (69.2%)       | n = 1,211
Average number of employed measures  | 1.26 (SD = 0.74)  | 3.84 (SD = 3.24)  | 4.94 (SD = 5.55)  | 3.29 (SD = 2.17)  | 3.29 (SD = 2.86)
Status of measure
  Preexisting                        | 22                | 114               | 157               | 692               | 985 (81.3%)
  New                                | 5                 | 15                | 13                | 61                | 94 (7.8%)
  Modified                           | 17                | 16                | 8                 | 83                | 124 (10.2%)
Type of response format
  Nominal                            | 3                 | 1                 | 0                 | 1                 | 5 (0.4%)
  Dichotomous                        | 5                 | 14                | 14                | 74                | 107 (8.8%)
  Ordered-categorical                | 31                | 59                | 85                | 387               | 562 (46.4%)
  Continuous                         | 0                 | 0                 | 0                 | 3                 | 3 (0.2%)

NOTES: EPM = Educational and Psychological Measurement; PA = Psychological Assessment; JPA = Journal of Personality Assessment; PAID = Personality and Individual Differences.

Table 1 also summarizes demographic information related to measure status (i.e., preexisting, newly developed, or modified) and type of response format, both for the total sample and broken down by journal.

Psychometric Assessment and Reporting

Table 2 lists both the article- and measure-level frequencies of the primary psychometric assessment and reporting practices investigated in the current study. At the article level, for at least one of the measures employed, 334 (90.8%) of the reviewed articles addressed (i.e., at least mentioned) the issue of precision/reliability, 225 (61.1%) invoked score precision/reliability evidence from external sources, and 268 (72.8%) reported reliability estimates for the data analyzed in the study at hand. In all, 101 (27.4%) cited precision/reliability evidence from other studies but did not directly assess the precision/reliability of the test/subtest scores. Of the 268 articles in which precision/reliability estimates were generated for at least one measure, 245 (91.4%) employed an "internal consistency" coefficient (e.g., Cronbach's alpha, KR20, etc.), whereas only 51 (19.0%) employed other estimates of precision/reliability, including test–retest coefficients, the latter of which were employed in 32 (11.9%) of these 268 articles. However, only 31.3% (n = 10) of the test–retest coefficients were explicitly interpreted as measures of stability of measurement over time.
Validity was addressed in 354 (96.2%), validity evidence invoked in 259 (70.4%), and validity evidence generated in 340 (92.4%) of the reviewed articles. Validity evidence was cited but not generated for at least one of the measures employed in 219 (59.5%) articles. Overall, at the article level of analysis, validity was addressed/invoked/reported more often than was evidence bearing on precision/reliability.

Table 2
Frequency of Psychometric Assessment and Reporting by Article and Measure

Assessment and Reporting Practice          | Articles (n = 368)   | Measures (n = 1,211)
Precision/reliability addressed            | 334 (90.8%)          | 947 (78.2%)
Precision/reliability evidence invoked     | 225 (61.1%)          | 547 (45.2%)
Precision/reliability evidence generated   | 268 (72.8%)          | 696 (57.5%)
  Internal consistency                     | 245 (91.4% of 268)   | 635 (91.2% of 696)
  Test–retest                              | 32 (11.9% of 268)    | 47 (6.8% of 696)
  Other                                    | 19 (7.1% of 268)     | 63 (9.1% of 696)
Validity addressed                         | 354 (96.2%)          | 1,167 (96.4%)
Validity evidence invoked                  | 259 (70.4%)          | 552 (45.6%)
Validity evidence generated                | 340 (92.4%)          | 1,139 (94.1%)
  Theoretical structure identified         | 233 (68.5% of 340)   | 622 (54.6% of 1,139)
  Internal score validity assessed         | 175 (51.5% of 340)   | 315 (27.7% of 1,139)
  External score validity assessed         | 287 (84.4% of 340)   | 1,040 (91.3% of 1,139)

NOTE: Article-level frequencies count a practice if it occurred for at least one of the measures described in the study.

Of the 340 articles in which validity evidence was generated, the theoretical structure of at least one of the measures employed in the study was explicitly identified in 233 (68.5%). However, of these, only 36 involved studies in which more than one measure was employed and, of these, only 8 (22.2%) explicitly identified the theoretical structure for all of the measures employed in the study at hand. For those articles in which validity evidence was generated, the internal score validity and external score validity of at least one measure were assessed in 175 (51.5%) and 287 (84.4%), respectively.

As can be seen in Table 2, the measure-level results are generally consistent with this pattern of findings. Most of the measure-level percentages are similar to or lower than the corresponding article-level percentages, indicating that researchers engage in certain practices for a subset, but not necessarily all, of the measures they employ. However, there were two notable exceptions to this general pattern: whereas the rate at which internal score validity was assessed was much higher at the article level than at the measure level (51.5% vs. 27.7%), the opposite pattern held for external score validity (84.4% vs. 91.3%). This indicates that internal score validity is often assessed for only a subset of the measures employed; external score validity, conversely, when it is assessed at all, is assessed for most if not all of the measures employed.
Test-Analytic Rationale

For the purposes of the current study, we take researchers to have adopted a minimally appropriate test-evaluative rationale when they provide evidence for both the score precision/reliability and the validity of the measures they employ in their research. However, if the aim is to determine whether a test score provides a good "indicator" of the attribute purportedly measured by the test, then merely assessing a measure's precision/reliability and validity falls somewhat short of this aim. Rather, for reasons outlined above, a coherent approach to data-based test analysis requires that internal score validity be assessed prior to the assessment of precision/reliability, that internal score validity be assessed prior to external score validity, that score precision/reliability be assessed prior to external score validity, and, ideally, that internal score validity, score precision/reliability, and external score validity all be assessed, and in that sequence.

Table 3 summarizes a number of practices which, we believe, indicate whether or not a sound test-analytic rationale was adopted by researchers. At the article level of analysis, 98.1% (n = 361) of the studies generated score precision/reliability evidence, validity evidence, or both; 54.3% (n = 200) generated one or the other (but not both); and only 66.3% (n = 244) assessed both score precision/reliability and validity for at least one of the measures described in the study (and many did not assess both for all of the measures described). In only 89 (24.2%) of the reviewed articles was the theoretical structure identified and both the internal score validity and the external score validity of the measure assessed, all of which, we argue, are minimally required to assess whether the particular administration of the test yields "valid" scores of the attribute(s) in question for the population under study.

Table 3
Frequency of Practices Bearing on Test-Analytic Rationale by Article and Measure

Test-Analytic Practice                                                                  | Articles (n = 368)   | Measures (n = 1,211)
Precision/reliability or validity evidence (or both) generated                          | 361 (98.1%)          | 1,182 (97.6%)
Precision/reliability or validity evidence (but not both) generated                     | 200 (54.3%)          | 529 (43.7%)
  Precision/reliability (but not validity) evidence generated                           | 27 (13.5% of 200)    | 43 (8.1% of 529)
  Validity (but not precision/reliability) evidence generated                           | 173 (86.5% of 200)   | 486 (91.9% of 529)
Both precision/reliability and validity evidence generated                              | 244 (66.3%)          | 653 (53.9%)
Theoretical structure identified, internal and external score validity assessed         | 89 (24.2%)           | 136 (11.2%)
Internal score validity, precision/reliability, and external score validity assessed    | 103 (28.0%)          | 177 (14.6%)
  Internal prior to precision/reliability                                               | 41 (39.8% of 103)    | 61 (34.5% of 177)
  Internal prior to external                                                            | 82 (79.6% of 103)    | 131 (74.0% of 177)
  Precision/reliability prior to external                                               | 89 (86.4% of 103)    | 153 (86.4% of 177)
  In full sequence                                                                      | 33 (32.0% of 103)    | 50 (28.2% of 177)
  Precision/reliability prior to internal                                               | 57 (55.3% of 103)    | 101 (57.1% of 177)
  External prior to internal                                                            | 14 (13.6% of 103)    | 30 (16.9% of 177)
  External prior to precision/reliability                                               | 7 (6.8% of 103)      | 9 (5.1% of 177)

NOTES: Article-level frequencies count a practice if it occurred for at least one of the measures described in the study. Six cases were removed from the measure-level analysis because of inconsistency in coding.
Of the 103 articles in which internal score validity, precision/reliability, and external score validity were all assessed for at least one of the measures employed, 41 (39.8%) assessed internal score validity prior to assessing the precision/reliability of scores, as compared with the 57 (55.3%) articles in which the pattern was reversed for at least one of the measures employed; 82 (79.6%) assessed internal score validity prior to investigating external score validity, as compared with 14 (13.6%) in which the pattern was reversed for at least one of the measures employed; and 89 (86.4%) investigated measurement precision/reliability prior to investigating external score validity, as compared with 7 (6.8%) in which the pattern was reversed for at least one of the measures employed. However, in only 33 (32.0%) of the articles were internal score validity, score precision/reliability, and external score validity all assessed, and in that sequence, for at least one of the measures employed. This potentially indicates that not enough researchers appreciate that it may be of little use to assess the precision/reliability of a "score" if that score is in fact a composite of items measuring relatively distinct attributes or qualities (or distinct facets of a given attribute or quality). However, researchers appear to have a much better understanding of the relationship between internal and external score validity, and also between score precision/reliability and external score validity.

The measure-level results are, once again, quite consistent with those at the article level of analysis. The only exceptions are the relative rates at which theoretical structure, internal score validity, and external score validity are all assessed, and the relative rates at which internal score validity, precision/reliability, and external score validity are all assessed, with the article-level rates being higher than the measure-level rates. This indicates that some researchers are not only employing a potentially unsound rationale with respect to assessing internal score validity, score precision/reliability, and external score validity, but may also be inconsistent in their assessments of different measures used in the same study.

Measure-Level Results by Journal and Status

It seems reasonable to consider that the extent to which psychometric evidence is reported could be influenced by the (possibly differing) editorial guidelines of different journals. Likewise, when a study involves the use of either a newly developed or a modified measure, it is possible that researchers, reviewers, and editors alike may place more importance on the assessment of the psychometric properties of such measures than they might for previously developed measures, especially those which have had long-term and broad application. For these reasons, we thought it pertinent to provide a subset of the results presented above broken down by both journal and status of measure. Tables 4 and 5 summarize the relative rates at which a subset of psychometric assessment and reporting practices occur, broken down by journal and measure status, respectively.
Table 4
Percentage of Measures Reporting Psychometric Practices by Journal

Practice                                               | EPM  | JPA  | PA   | PAID
Precision/reliability addressed                        | 59.5 | 60.8 | 88.3 | 75.4
Precision/reliability evidence invoked                 | 30.4 | 44.7 | 66.0 | 37.3
Precision/reliability evidence generated               | 54.4 | 37.7 | 58.7 | 54.5
Validity addressed                                     | 57.0 | 86.9 | 97.3 | 89.3
Validity evidence invoked                              | 32.9 | 41.2 | 64.7 | 38.6
Validity evidence generated                            | 57.0 | 81.4 | 91.3 | 88.0
Internal score validity assessed                       | 51.9 | 12.6 | 19.3 | 24.6
External score validity assessed                       | 43.0 | 78.4 | 86.0 | 80.6
Precision/reliability and validity evidence generated  | 83.7 | 38.2 | 54.8 | 55.4
Internal prior to precision/reliability                | 37.9 | 60.0 | 50.0 | 26.4
Internal prior to external                             | 48.2 | 93.3 | 83.3 | 72.7
Precision/reliability prior to external                | 87.5 | 93.3 | 94.4 | 94.2
In full sequence                                       | 31.0 | 53.3 | 44.4 | 21.7

NOTES: Values are percentages of measures. EPM = Educational and Psychological Measurement; PA = Psychological Assessment; JPA = Journal of Personality Assessment; PAID = Personality and Individual Differences.

Relatively speaking, both precision/reliability and validity were addressed, and precision/reliability and validity evidence invoked and generated, more often for measures described in articles published in PA than in the other three journals. However, whereas external score validity was also assessed relatively most often in PA, internal score validity was assessed with the highest relative frequency for measures published in EPM. With respect to the (usually implicit) logic guiding the evaluation of the psychometric properties of measures, evidence that sound test-analytic rationales are being adopted by researchers is most evident for measures appearing in JPA and PA.

With regard to measure status (Table 5), precision/reliability was addressed, and evidence generated, proportionately more often for new and modified measures than for preexisting measures. Not surprisingly, precision/reliability evidence was invoked, relatively speaking, most often for preexisting measures. As regards validity, there appears to be little difference among the three statuses of measure in the relative rates at which validity was addressed and validity evidence generated. However, as with precision/reliability evidence, validity evidence was invoked relatively more often for preexisting measures, and internal score validity was assessed at a higher rate for new measures. The percentages of measures for which external score validity was assessed were similar across the different statuses of measure. As for whether researchers appeared to adopt differing logics for assessing new or modified as opposed to preexisting measures, generally speaking, sounder test-evaluative rationales were adopted for assessing new measures than for preexisting and modified measures.

Table 5
Percentage of Measures Reporting Psychometric Practices by Status of Measure

Practice                                               | Preexisting | New    | Modified
Precision/reliability addressed                        | 78.5        | 87.2   | 90.3
Precision/reliability evidence invoked                 | 48.9        | 9.6*   | 44.4
Precision/reliability evidence generated               | 52.1        | 83.0   | 79.8
Validity addressed                                     | 96.3        | 97.9   | 95.2
Validity evidence invoked                              | 49.8        | 7.4*   | 41.1
Validity evidence generated                            | 93.7        | 95.7   | 95.2
Internal score validity assessed                       | 21.4        | 56.4   | 37.9
External score validity assessed                       | 86.8        | 81.9   | 83.9
Precision/reliability and validity evidence generated  | 48.6        | 79.8   | 75.0
Internal prior to precision/reliability                | 27.0        | 45.9   | 32.3
Internal prior to external                             | 68.5        | 81.1   | 67.7
Precision/reliability prior to external                | 84.7        | 91.9   | 87.1
In full sequence                                       | 21.7        | 41.7   | 25.8

NOTE: Values are percentages of measures. *Likely due to coding unreliability.

Discussion

Consistent with previous related research, there is evidence that score precision/reliability continues to be underreported by applied researchers. In particular, too high a percentage of researchers still rely heavily on invoked precision/reliability evidence. This practice, in the absence of a demonstration of similarities between the sample composition and variability of the sample at hand and those of the sample from which the reliability estimate was inducted (Vacha-Haase, Kogan, & Thompson, 2000), may be unjustified.
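As an aside not drawn from the article, one simple way to see why an invoked estimate need not transfer is to hold the standard error of measurement fixed and let the sample standard deviation change. The sketch below illustrates this with hypothetical numbers; the function name and values are ours, and the constant-SEM assumption is a deliberate simplification rather than a claim about the reviewed studies.

```python
def projected_reliability(r_reported: float, sd_reported: float, sd_sample: float) -> float:
    """Reliability implied for the sample at hand if the standard error of
    measurement (SEM) were the same as in the source study.

    SEM = SD * sqrt(1 - r), so holding SEM fixed while SD changes gives
    1 - r_new = (SD_old**2 * (1 - r_old)) / SD_new**2.
    """
    sem_sq = sd_reported ** 2 * (1.0 - r_reported)
    return 1.0 - sem_sq / sd_sample ** 2

# Hypothetical numbers: a manual reports alpha = .90 with SD = 10, but the
# sample at hand is more homogeneous (SD = 6); the implied reliability drops to .72.
print(round(projected_reliability(0.90, 10.0, 6.0), 2))
```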
The present study also indicates that researchers may be underreporting validity evidence bearing on internal structure. However, in contrast to the findings of others (e.g., Hogan & Agnello, 2004; Qualls & Moss, 1996; Whittington, 1998) that the rates of reporting validity evidence lagged substantially behind those for reliability, we found the opposite pattern to be the case. One potential explanation for this discrepancy is that previous studies did not distinguish between the assessment of internal and external score validity; with the two decoupled in the current study, we found that, although the rate at which external score validity was assessed was much higher than in previous studies, internal score validity was ignored by researchers in the majority of the articles reviewed. A second possibility is that our particular sample of journals might be more likely to include studies whose primary focus is the assessment of validity, and so one is more likely to encounter the reporting of validity evidence in articles published therein than would generally be the case.

With respect to whether researchers appear to be adopting sound test-analytic rationales, the answer seems to be a resounding "somewhat." Researchers are generally not assessing in a confirmatory manner whether the items of a given test are measuring a specific attribute or set of attributes in a specific manner (i.e., they are not often identifying the theoretical structures of measures and then assessing internal score validity). Additionally, despite the finding that most researchers who do assess both internal and external score validity do so in that order, and also assess score precision/reliability prior to external score validity, far too few are assessing internal score validity prior to assessing precision/reliability, or assessing precision/reliability and validity (both internal and external) in the proper sequence. This is not an insignificant finding: if the question of a test's internal score validity remains unanswered, there are obvious implications for the relevance of findings bearing on score precision/reliability and external score validity. Given the paucity of work investigating the logic inherent in test-analytic practices, further research is required to shed greater light on these findings.

The current study could also be improved on in two important ways. First, it should be determined whether data-based test evaluation practices have improved since 2004, the year in which the reviewed articles were published.
Second, it is possible that our results were influenced by the particular sample of journals reviewed and do not represent well the test evaluation practices across a more diverse set of research areas and/or types of measurement instruments. To explore this issue further, we are involved in another study in which the sample consists of articles published more recently in a cross-section of journals. A third potential limitation is that the results may reflect not simply what the researchers did or did not do with respect to psychometric assessment but also other factors, such as the editorial policies of the journals reviewed or the requirements of the specific purpose for which a test is being evaluated (e.g., internal score validity studies, etc.). Future research (along the lines of Fidler, Thomason, Cumming, Finch, & Leeman, 2004; Vacha-Haase, Nilsson, Reetz, Lance, & Thompson, 2000) is required to explore these issues. The primary value of the current study is in providing a reference point in time for particular test-analytic conventions and in indicating where future work in this area is needed.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can't make them think: Statistical reform lessons from medicine. Psychological Science, 15, 119-126.

Hogan, T. P., & Agnello, J. (2004). An empirical study of reporting practices concerning measurement validity. Educational and Psychological Measurement, 64, 802-812.

Hogan, T. P., Benjamin, A., & Brezinski, K. L. (2000). Reliability methods: A note on the frequency of various types. Educational and Psychological Measurement, 60, 523-531.

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 634-694.

Peak, H. (1953). Problems of objective observation. In L. Festinger & D. Katz (Eds.), Research methods in the behavioral sciences (pp. 243-299). New York: Holt, Rinehart & Winston.

Qualls, A. L., & Moss, A. D. (1996). The degree of congruence between test standards and test documentation within journal publications. Educational and Psychological Measurement, 56, 209-214.

Slaney, K. L., & Maraun, M. D. (2008). A proposed framework for conducting data-based test analysis. Psychological Methods, 13, 376-390.

Thompson, B., & Snyder, P. A. (1998). Statistical significance and reliability analyses in recent Journal of Counseling & Development research articles. Journal of Counseling & Development, 76, 436-441.

Vacha-Haase, T., Henson, R. K., & Caruso, J. C. (2002). Reliability generalization: Moving toward improved understanding and use of score reliability. Educational and Psychological Measurement, 62, 562-569.

Vacha-Haase, T., Kogan, L. R., & Thompson, B. (2000). Sample compositions and variabilities in published studies versus those in test manuals: Validity of score reliability inductions. Educational and Psychological Measurement, 60, 509-522.

Vacha-Haase, T., Ness, C., Nilsson, J., & Reetz, D. (1999). Practices regarding reporting of reliability coefficients: A review of three journals. Journal of Experimental Education, 67, 335-341.
Vacha-Haase, T., Nilsson, J., Reetz, D., Lance, T., & Thompson, B. (2000). Reporting practices and APA editorial policies regarding statistical significance and effect size. Theory and Psychology, 10, 413-425.

Whittington, D. (1998). How well do researchers report their measures? An evaluation of measurement in published educational research. Educational and Psychological Measurement, 58, 21-37.

Wilkinson, L., & the APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Willson, V. L. (1980). Research techniques in AERJ articles: 1969-1978. Educational Researcher, 9(6), 5-10.