The serial use of child neurocognitive tests: Development versus practice effects

Psychological Assessment, 2008, Vol. 20, No. 4, 361–369. Copyright 2008 by the American Psychological Association. DOI: 10.1037/a0012950

Peter D. Slade (Expert Data Analysis for Doctors and Others); Brenda D. Townes and Gail Rosenbaum (University of Washington); Isabel P. Martins, Henrique Luis, and Mario Bernardo (University of Lisbon); Michael D. Martin and Timothy A. DeRouen (University of Washington)

When serial neurocognitive assessments are performed, 2 main factors are of importance: test–retest reliability and practice effects. With children, however, there is a third, developmental factor, which occurs as a result of maturation. Child tests recognize this factor through the provision of age-corrected scaled scores. Thus, a ready-made method for estimating the relative contribution of developmental versus practice effects is the comparison of raw (developmental and practice) and scaled (practice only) scores. Data from a pool of 507 Portuguese children enrolled in a study of dental amalgams (T. A. DeRouen, B. G. Leroux, et al., 2002; T. A. DeRouen, M. D. Martin, et al., 2006) showed that practice effects over a 5-year period varied on 8 neurocognitive tests. Simple regression equations are provided for calculating individual retest scores from initial test scores.

Keywords: neurocognitive tests, repeat testing, practice effects, test reliability, child development

Author note: Peter D. Slade, Expert Data Analysis for Doctors and Others, West Kirby, United Kingdom; Brenda D. Townes, Department of Psychiatry and Behavioral Sciences, University of Washington; Gail Rosenbaum, Regional Epilepsy Center, University of Washington; Isabel P. Martins, Language Research Laboratory, Department of Neurology, University of Lisbon, Lisbon, Portugal; Henrique Luis and Mario Bernardo, Faculty of Dental Medicine, University of Lisbon; Michael D. Martin, Departments of Oral Medicine and Epidemiology, University of Washington; Timothy A. DeRouen, Departments of Dental Public Health Sciences and Biostatistics, University of Washington. This project was funded by National Institute of Dental and Craniofacial Research Cooperative Agreement U01 DE 11894. The authors thank the staff and students of the Casa Pia School System, Lisbon, Portugal, for their assistance with the project. Correspondence concerning this article should be addressed to Brenda D. Townes, 20 Park Road, West Kirby, Wirral CH48 4DW, United Kingdom. E-mail: btownes@u.washington.edu

Neuropsychologists are often involved in carrying out serial assessments of children and adolescents. These may include monitoring patterns of intellectual development, the study of children's recovery after epilepsy surgery, or observing the degree of improvement following localized or generalized head injury. There is also a crucial use in determining the efficacy of therapeutic interventions (pharmacological or surgical) with primary cognitive endpoints. Although some tests come with information on test–retest reliability—the WISC–III (Wechsler, 1991), for example—many others do not. Even when such data are provided, they usually cover only relatively short test–retest intervals of a few months, and then for only a single retest. But many serial assessments are conducted over much longer time periods—9 months to a year—and often involve three or more repetitions of the same tests.

The lack of serial assessment information on neurocognitive tests should not be taken to imply that little work has been done in this area. As Lezak, Howieson, and Loring (2004, p. 116) point out, many studies researching the effects of repeated examinations have revealed "an overall pattern of test susceptibility to practice effects." Moreover, the same authors note that "numerous studies have also shown a general test-taking benefit in which enhanced performance may occur after repeated examinations" (p. 116). This phenomenon has been referred to as test sophistication by Anastasi (1988).

The most comprehensive overview of the effects of repeat testing was carried out by McCaffrey, Duff, and Westervelt (2000). They reviewed hundreds of studies that had been carried out on tests of intelligence (mainly on the Wechsler scales), in both healthy samples and patient groups and for both adults and children. They outline their findings in summary tables covering a total of 212 pages, a useful if not essential resource for anyone using the Wechsler tests on a repeated basis. As well as reviewing the extent of practice effects with differing test–retest intervals, they looked at some of the individual difference variables that have been found to affect the magnitude of practice effects. These include age, gender, intelligence, education, and the presence (or absence) of a disease process.

As pointed out by McCaffrey et al. (2000), the effects of repeated testing involve two main factors: test–retest reliability and practice effects. The former refers to the measurement error associated with any given test and involves the stability of the relative rankings of individuals' scores across testing occasions. That is, do individuals who score high on initial testing also score high on retest? Do individuals who score low on testing also score low on retest? And do the rest maintain their relative intermediate positions? Test–retest reliability is traditionally measured by the correlation coefficient between test and retest scores. Practice effects, on the other hand, refer to the amount of overall change in scores from test to retest. Where there are only two testing occasions, the significance of the mean change is usually evaluated by using matched pairs t tests (e.g., Dikmen, Heaton, Grant, & Temkin, 1999). Where more than two testing occasions are involved, repeated measures analysis of variance (ANOVA) is commonly used instead (e.g., Collie, Maruff, Darby, & McStephen, 2003; Hinton-Bayre et al., 1999).
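To make these two indices concrete, the short sketch below (not part of the original study; the scores are simulated and the use of Python/SciPy is our own assumption) computes a test–retest correlation and a matched-pairs t test for one hypothetical test administered twice.

```python
# Illustrative sketch only: the two indices described above, computed on
# hypothetical (simulated) initial and retest scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
test1 = rng.normal(10, 3, size=100)               # hypothetical initial scores
test2 = 0.7 * test1 + rng.normal(4, 2, size=100)  # hypothetical retest scores

# Test-retest reliability: stability of relative rankings across occasions.
r, _ = stats.pearsonr(test1, test2)

# Practice effect with two occasions: mean change tested with a matched-pairs t test.
t, p = stats.ttest_rel(test2, test1)

print(f"test-retest r = {r:.2f}")
print(f"mean change = {np.mean(test2 - test1):.2f}, t = {t:.2f}, p = {p:.3f}")
```

With more than two occasions, the same kind of data would instead feed a repeated measures ANOVA, as noted above.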
However, in the case of children and adolescents there is another factor, in addition to practice effects, that comes into play when considering repeated testing: namely, developmental factors. Test stability therefore involves three factors: test reliability, practice effects, and development.

Most of the studies on test–retest reliability and practice effects have used only two testing occasions. A notable exception is the study by Wilson, Watson, Baddeley, Emslie, and Evans (2000). They administered a battery of 11 tests to two small groups of brain-injured (n = 10) and control (n = 13) subjects on 20 separate occasions over the course of a 4-week period. Many of the tests showed continuing practice effects, which were larger in the controls than in the brain-injured patients. A semantic processing test and word fluency tests showed the largest practice effects, while digit span and simple reaction time showed the smallest. This study followed a smaller scale study by Benedict and Zgaljardic (1998), who had administered the same forms and parallel forms of verbal and nonverbal memory tests to their participants every 2 weeks for 8 weeks. Benedict and Zgaljardic found significant practice effects when the same tests were used, but these significant effects disappeared when the tests were replaced by alternate test forms.

Practice effects have been found on many if not most neurocognitive tests (Bird, Papadopoulu, Ricciardelli, Rossor, & Cipolotti, 2003, 2004; Collie et al., 2003). However, it is generally recognized that some tests are more susceptible to practice effects than others (Dodrill & Troupin, 1975; Lezak, 1995). Lezak et al. (2004) summarized this differential practice effect in the following way:

    Tests that have a large speed component, require an unfamiliar or infrequently practiced mode of response, or have a single solution—particularly if it can be easily conceptualized once it is attained—are more likely to show practice effects. (p. 116)

Rationale

Many of the tests developed for assessing neuropsychological performance in children and adolescents are known to show developmental (maturational) changes with age. Consequently, these tests were developed with age-corrected norms that allow the examiner to compare any given child with other children of the same age. In the case of the WISC–III, scaled scores are provided at 3-month intervals up to the age of 16 years 11 months; at this point the WAIS–III (Wechsler, 1997) takes over with yearly age increments. In the case of the Wide Range Assessment of Memory and Learning (WRAML; Sheslow & Adams, 1990), age norms are provided at 6-month intervals between the ages of 5 and 13 years and at 1-year intervals for ages 14 and 15 years. And in the case of the Wide Range Assessment of Visual Motor Abilities (WRAVMA; Adams & Sheslow, 1995), age norms are provided at 6-month intervals between the ages of 3 and 13 years and at 1-year intervals for ages 13 to 17 years 11 months.

Thus, with tests designed for use with children and adolescents, we can expect to find increases in raw scores over time due to a combination of two factors: developmental (maturational) changes and changes due to repeat testing (practice effects). Of course, if no practice effects are operating, then raw score changes over time will be entirely due to developmental changes (plus error). However, where we have age-corrected norms (scaled scores) available for a test, we have a ready-made method for estimating how much of the change in scores is due to development and how much is due to practice. Namely, changes in raw scores reflect both of these factors in combination, while changes in age-corrected scaled scores are likely to reflect practice alone. That is, once raw scores have been converted to scaled scores, we would not expect to find any further mean changes over time in the scaled scores unless there are additional practice effects operating. Of course, the major assumption here is that the age-corrected scaled scores do fully remove the developmental (maturational) changes in performance from repeat scores.
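The logic of this raw-versus-scaled comparison can be pictured with a toy simulation (ours, not the authors'; every quantity below is invented for illustration): if age norms remove maturation, the mean change in scaled scores isolates the practice component that is mixed into the raw-score change.

```python
# Toy illustration of the decomposition described above (all values hypothetical).
import numpy as np

rng = np.random.default_rng(1)
n, years = 400, 5
development = np.cumsum(np.full((n, years), 2.0), axis=1)  # hypothetical maturational gain per year
practice = np.cumsum(np.full((n, years), 0.5), axis=1)     # hypothetical practice gain per year
noise = rng.normal(0, 1.5, size=(n, years))

raw = 20 + development + practice + noise        # raw scores rise with both factors
scaled = 10 + (practice + noise) / 3.0           # age norms strip out the developmental part

raw_change = raw[:, -1].mean() - raw[:, 0].mean()          # development + practice
scaled_change = scaled[:, -1].mean() - scaled[:, 0].mean() # practice (plus error) only
print(f"mean raw-score change over 5 years:    {raw_change:.1f}")
print(f"mean scaled-score change over 5 years: {scaled_change:.1f}")
```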
Method

Participants

As part of a clinical trial on the safety of dental treatments with amalgams containing mercury, 507 Portuguese children were randomly assigned to receive dental treatment with either amalgam restorations for posterior teeth or composite-only restorations (DeRouen et al., 2002). The 4 children who were 12 years of age at the outset were dropped from this aspect of the study. This left a total of 503 participants who were 8 to 11 years of age at onset; due to dropouts, there were 437 participants ages 16–19 years upon completion of the study 8 years later. Inclusion criteria were (a) having at least one carious lesion in a permanent tooth, (b) no previous exposure to amalgam treatments, (c) urinary mercury level less than 10 μg/L at baseline, (d) blood lead level less than 15 μg/dL, (e) no interfering health conditions, and (f) IQ equal to or greater than 67 as obtained on the Comprehensive Test of Nonverbal Intelligence (CTONI; Hammill, Pearson, & Wiederholt, 1997). The CTONI was chosen as a brief, nonverbal test of intelligence designed for use in cross-cultural settings that was available at the onset of the study (1997). During selection of participants, a lower IQ boundary of 67 on the CTONI was applied. Though truncated at the lower end, the CTONI IQ at baseline was otherwise normally distributed with a mean of 85.10 (SD = 9.96). These scores are consistent with (a) clinical experience suggesting that the CTONI underestimates intelligence in non-U.S. populations by up to one standard deviation (Martins et al., 2005) and (b) the fact that the CTONI IQ is known to underestimate fluid intelligence (Lassiter, Harrison, Matthews, Bell, & The Citadel, 2001).

At baseline, 55% of participants were male; 71% were Caucasian, 28% African Portuguese, and 1% Asian Portuguese. The mean age was 10.1 years (range 8.0–11.9), and the mean CTONI IQ was 85.10 (range 67–118). Institutional review board approval and parental or guardian consent were secured for all children who took part in neurocognitive testing. Since no significant differences were found between the two treatment groups at any time on any of the neurocognitive outcome measures (DeRouen et al., 2006), the two groups were combined into a single cohort.

Procedure

Participants completed a battery of neurocognitive tests in Portuguese annually for 8 years. Translation of all tests and instructions was done by a professor at the University of Washington who was a native speaker of Portuguese. Back translations were performed by Mario Bernardo and Henrique Luis, both of whom are fluent in English. The back translations were verified by the senior U.S. psychometrist (Gail Rosenbaum), who has had extensive experience in test administration and scoring as well as cross-cultural testing. Three psychometrists were trained to administer the tests, and their performance was continually monitored by Gail Rosenbaum. Their work was calibrated throughout the study by using ratings on a 136-item checklist to review videotaped testing sessions (with 94.5% to 97.8% accuracy). Tests were double-scored, and data were corrected when errors were identified (no severe violations of protocol requiring discarding of data were observed).

The neurocognitive tests utilized were chosen on the basis of their known sensitivity in measuring brain–behavior relationships (Lezak, 1995), appropriateness for children 8–12 years of age, and, where possible, minimal practice effects. Functional areas assessed were learning and memory, motor and visual motor abilities, and attention. Of the 19 neuropsychological tests administered during the study, 8 had age-corrected scaled scores available in addition to raw scores.
These were Pegboard with the dominant (PegsDom) and nondominant (PegsND) hands and Matching Figures from the WRAVMA; Finger Windows and Visual Learning from the WRAML; and Coding, Symbol Search, and Digit Span from the WISC–III.

The three test batteries were standardized on fairly substantial and representative samples of U.S. children. The WISC–III was standardized on a sample of 2,200 children between the ages of 6 and 16 years 11 months. The normative sample was stratified in terms of age, gender, ethnic origin, and region. The WISC–IV (Wechsler, 2004) was not available at the inception of this study in 1997. The WRAML was standardized on a sample of 2,363 children between the ages of 5 years and 17 years 11 months. The normative sample was stratified in terms of age, gender, ethnic origin, region, and urban–rural distribution. The WRAVMA was standardized on a sample of 2,282 children between the ages of 3 years and 17 years 11 months. The normative sample was stratified by age, gender, ethnic origin, region, and socioeconomic status.

Of the eight tests from the above three batteries, six had been given on five successive occasions, a year apart, and then replaced by similar or equivalent adult tests for the duration of the study. Rather than attempt to make corrections for the transition from child to adult tests, we decided that for present purposes we would restrict our analyses to the participants who had completed the eight child tests on all or most occasions over the first 5 years of the study. Of the 503 children who were ages 8–11.9 years at the start of the study, 408 completed all eight tests on all five occasions.

Statistical Design and Analysis

All analyses were conducted with SPSS 15.0. Test–retest reliability was assessed in the traditional manner with Pearson product–moment correlations between test scores for consecutive pairs of study years. Repeated measures analyses of variance (ANOVAs) were used to assess the significance of changes over the five study years. The analyses were carried out on both raw scores and scaled scores. In the case of raw scores, any significant changes over time were attributed to a combination of both developmental and practice effects. In the case of scaled scores, any significant changes were attributed to practice effects alone.

The repeated measures ANOVAs included a within-subjects factor of study year; a between-subjects factor of age subgroup (subgroups were formed in the first year of the study and were categorized as 8-year-olds [8 to 8.99 years], 9-year-olds [9 to 9.99 years], 10-year-olds [10 to 10.99 years], and 11-year-olds [11 to 11.99 years]); and the Study Years × Age Subgroups interaction. A between-subjects factor of gender was not used because the focus was upon estimating potential age differences in practice effects and the impact of attempting to control them through use of age-adjusted scaled scores. Furthermore, the test batteries that were used (WISC–III, WRAVMA, and WRAML) provide separate normative tables only for different age groups, not for boys and girls.

Three aspects of the results were used to establish the significance and magnitude of the practice effects in the case of the scaled scores analyses:

1. the significance of the within-subjects effect due to study years,

2. the significance of the linear trend due to study years, and
3. the amount of variance accounted for by the linear trend, as indexed by partial eta-squared. (Given the large sample size, we felt justified in using partial eta-squared rather than the more complex omega-squared as a measure of effect size.)

We determined a priori that a practice effect can be deemed to be present if all three of the above were found; that is, a highly significant within-subjects effect indicating that the scaled scores of the sample varied significantly over the 5 years of the study; a significant linear trend indicating that a linear function provided a significant fit to the scaled score differences over the 5-year period; and a linear trend that accounted for a sizable proportion of the test variance. Where one or more of these criteria were not met, we concluded that a practice effect had not been demonstrated.

In order to check our major assumption that changes in raw scores reflect a combination of both developmental and practice effects, while changes in scaled scores reflect practice alone, we expected that there would be age subgroup differences in raw scores (between 8-, 9-, 10-, and 11-year-olds) but no age subgroup differences in scaled scores. That is, the age-adjusted norms should remove these age differences in the case of the latter.

And finally, we calculated simple linear regression equations for predicting scaled scores on retest from scaled scores on initial testing for those measures that demonstrated clear practice effects.
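A simplified way to picture the second and third criteria is sketched below. The authors ran full repeated measures ANOVAs in SPSS; the per-subject linear contrast, the simulated scores, and the Python code here are our own illustration, not their procedure.

```python
# Illustrative sketch (not the authors' SPSS analysis): the linear trend across
# five yearly scaled scores tested via a per-subject contrast, with its size
# expressed as partial eta-squared. Scores are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subjects, n_years = 408, 5
scaled = 10 + 0.3 * np.arange(n_years) + rng.normal(0, 2, size=(n_subjects, n_years))

# Linear contrast weights for five equally spaced occasions.
weights = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
contrast = scaled @ weights            # one linear-trend score per subject

# Test whether the mean linear trend differs from zero (equivalent to the
# within-subjects linear trend test for a single group).
t, p = stats.ttest_1samp(contrast, 0.0)
df = n_subjects - 1
partial_eta_sq = t**2 / (t**2 + df)    # for a 1-df contrast: F/(F + df_error) with F = t**2

print(f"linear trend: t({df}) = {t:.1f}, p = {p:.3g}, partial eta-squared = {partial_eta_sq:.2f}")
```

The omnibus within-subjects effect and the between-subjects age factor would come from the full mixed-design ANOVA; the contrast above only mirrors criteria 2 and 3.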
Results

Test–Retest Reliability

The test–retest reliability coefficients are presented in Table 1. Although some of them are somewhat low, such as those for Finger Windows and Digit Span in the earlier years, over 60% of the year-on-year comparisons are in excess of .60. It is notable also that there is a general increase in the size of the coefficients over time, such that five of the eight tests had coefficients in excess of .60 in the Year 4/5 comparison.

Table 1
Test–Retest Reliability of Cognitive Tests Administered Over 5 Consecutive Years

Test                Y1–Y2 (n = 473)   Y2–Y3 (n = 457)   Y3–Y4 (n = 437)   Y4–Y5 (n = 417)
PegsDom                  .49               .51               .55               .63
PegsND                   .51               .56               .63               .64
Matching Figures         .51               .56               .49               .53
Symbol Search            .60               .68               .72               .70
Visual Learning          .66               .73               .70               .65
Coding                   .71               .79               .81               .81
Finger Windows           .35               .31               .29               .44
Digit Span               .27               .32               .42               .54

Note. Listwise deletions were used with each pair of study years. All correlations were significant at the .01 level or beyond. Y = year; PegsDom = Pegboard with dominant hand; PegsND = Pegboard with nondominant hand.

Analysis of Raw Scores

The raw scores of the age subgroups for each of the eight tests are shown in Figure 1. It can be seen that in all cases there is substantial change in scores over the 5-year period, with most of the curves showing steady incremental changes. This impression is confirmed by the results of the repeated measures ANOVAs, which are presented in Table 2. These show highly significant within-subjects effects (Fs ranging from 28.9 to 638.2), highly significant linear trends (Fs ranging from 187.2 to 3,153.7), and partial eta-squared values showing that the linear trend accounts for between 18% of the variance in the case of Visual Learning and 81% in the cases of PegsND and Coding. It is notable also that there are significant between-subjects (age subgroups) differences for seven out of the eight tests, the exception being Visual Learning. These subgroup differences account for between 3% (Digit Span) and 12% (Symbol Search and Coding) of the variance, confirming the importance of using age-adjusted norms in clinical and research work with these instruments. There are also significant Study Years × Age Subgroups interaction effects for four out of the eight tests.

Analyses of Scaled Scores

The scaled scores of the age subgroups for each of the eight tests are shown in Figure 2. It can be seen that some tests show steady incremental changes while others show relatively flat trends. This is again confirmed by the results of the repeated measures ANOVAs, which are presented in Table 3.

The first thing of note in Table 3 is the set of between-subjects age subgroup ANOVAs. Unlike the case with raw scores, where seven out of eight tests were statistically significant, all of the age subgroup differences for the scaled scores were nonsignificant. This suggests that the age-adjusted norms have in all cases done a satisfactory job of removing the developmental factor from the equation. What are left, therefore, are primarily practice (and error) effects.

In the case of PegsDom and PegsND, the within-subjects ANOVAs are highly significant; the linear trends are highly significant; and the linear trends account for 45% and 36% of the variance, respectively. These visual motor tests show a strong year-on-year practice effect.

In the case of Matching Figures and Symbol Search, the within-subjects ANOVAs are statistically significant and the linear trends are significant, but the linear trend accounts for only 15% and 13% of the variance, respectively. These tests show a moderate year-on-year practice effect.

In the case of Visual Learning, the within-subjects ANOVA is significant, as is the linear trend. However, the variance accounted for by the linear trend is only 8%. This reflects the much flatter curve of this test, as shown in Figure 2. Visual Learning shows, at most, a small practice effect after controlling for developmental change.

In the case of Coding, Finger Windows, and Digit Span, the first two tests show a significant within-subjects effect, while Digit Span does not; Coding shows a significant linear trend, while Finger Windows and Digit Span do not; finally, the linear trends for all three account for only 0% or 1% of the variance. We can conclude that none of these three tests (Coding, Finger Windows, or Digit Span) shows any detectable practice effect.

Only two of the tests, Matching Figures and Finger Windows, show a significant Study Years × Age Subgroups interaction term.

Finally, because there appear to be age subgroup differences in scaled scores in Year 1 similar to those shown in the raw scores (see Figure 2), we carried out a further set of ANOVAs on the eight sets of Year 1 scaled scores, with age group as the grouping factor. For six of the variables, there were no significant differences between age groups, the exceptions being Matching Figures, F(3, 499) = 3.32, p < .05, and Finger Windows, F(3, 499) = 5.97, p < .01. Post hoc Tukey HSD tests showed that, in the case of Matching Figures, the 8-year-olds scored significantly lower than did the other three age groups, while in the case of Finger Windows, the 11-year-olds scored significantly lower than did the other three age groups.
That is, for the Year 1 data alone, the scaled scores did not fully correct for age for the 8-year-olds in the case of Matching Figures, whereas in the case of Finger Windows, they appeared to overcorrect for age for the 11-year-olds.

Figure 1. Raw scores of age subgroups on eight measures over a 5-year period. PegsDom = Pegboard with the dominant hand; PegsND = Pegboard with the nondominant hand.

Table 2
Results of Repeated Measures Analyses of Variance for Raw Scores
[For each of the eight tests, the table reports dfs, F, p, and partial eta-squared for the within-subjects effect of study years, the linear trend for years, the between-subjects effect of age subgroups, and the Study Years × Age Subgroups interaction; individual cell values are not reproduced here.]
Note. PegsDom = Pegboard with dominant hand; PegsND = Pegboard with nondominant hand. When Mauchly's test is significant, Greenhouse–Geisser adjusted dfs are used.

Figure 2. Scaled scores of age subgroups on eight measures over a 5-year period. The first three tests—PegsDom, PegsND, and Matching Figures—have a mean scaled score of 100 (SD = 15). The other five have a mean of 10 (SD = 3). PegsDom = Pegboard with the dominant hand; PegsND = Pegboard with the nondominant hand.

Table 3
Results of Repeated Measures Analyses of Variance for Scaled Scores
[Same layout as Table 2, computed on the scaled scores; individual cell values are not reproduced here.]
Note. PegsDom = Pegboard with dominant hand; PegsND = Pegboard with nondominant hand. When Mauchly's test is significant, Greenhouse–Geisser adjusted dfs are used.

Linear Regression Equations for Predicting Retest Scores

Table 4 presents simple linear regression equations for estimating retest scaled scores from initial scaled scores, for the five measures that manifested clear evidence of practice effects. This is comparable to the third model used by Temkin, Heaton, Grant, and Dikmen (1999).
Table 4
Linear Regression Equations for Predicting Individual Scaled Scores on Retesting From Initial Test Scores

Retest Year 2:
  PegsDom: Ŷ = 64.75 + (0.43 × PegsDom1), 90% CI ±22.98
  PegsND:  Ŷ = 63.76 + (0.44 × PegsND1),  90% CI ±20.48
  MF:      Ŷ = 53.22 + (0.48 × MF1),      90% CI ±18.70
  SS:      Ŷ = 4.30 + (0.60 × SS1),       90% CI ±4.10
  VL:      Ŷ = 3.21 + (0.64 × VL1),       90% CI ±3.49

Retest Year 3:
  PegsDom: Ŷ = 65.10 + (0.48 × PegsDom1), 90% CI ±22.68
  PegsND:  Ŷ = 63.38 + (0.47 × PegsND1),  90% CI ±21.15
  MF:      Ŷ = 60.05 + (0.41 × MF1),      90% CI ±17.85
  SS:      Ŷ = 4.73 + (0.56 × SS1),       90% CI ±4.23
  VL:      Ŷ = 3.30 + (0.63 × VL1),       90% CI ±3.82

Retest Year 4:
  PegsDom: Ŷ = 71.70 + (0.44 × PegsDom1), 90% CI ±21.96
  PegsND:  Ŷ = 72.74 + (0.40 × PegsND1),  90% CI ±20.13
  MF:      Ŷ = 64.33 + (0.38 × MF1),      90% CI ±18.41
  SS:      Ŷ = 4.80 + (0.60 × SS1),       90% CI ±4.38
  VL:      Ŷ = 3.38 + (0.64 × VL1),       90% CI ±3.78

Retest Year 5:
  PegsDom: Ŷ = 77.30 + (0.41 × PegsDom1), 90% CI ±24.44
  PegsND:  Ŷ = 72.71 + (0.43 × PegsND1),  90% CI ±21.70
  MF:      Ŷ = 57.92 + (0.45 × MF1),      90% CI ±17.93
  SS:      Ŷ = 4.83 + (0.59 × SS1),       90% CI ±4.77
  VL:      Ŷ = 3.82 + (0.62 × VL1),       90% CI ±4.11

Note.
In all equations, test measures followed by a 1 (e.g., MF1) signify the initial test score for the measure. PegsDom = Pegboard with dominant hand; PegsND = Pegboard with nondominant hand; MF = Matching Figures; SS = Symbol Search; VL = Visual Learning.

These equations can be used to calculate predicted retest scores (Ŷ) that can then be compared directly with observed retest scores and evaluated for significance with the 90% confidence intervals (CIs) also presented in Table 4. The 90% CI has been calculated by multiplying the standard deviation of the residuals estimated by the regression by 1.645, the z-score cutoff for a 90% interval based on the normal distribution.

For example, suppose a 10-year-old boy obtained a Matching Figures scaled score of 75 on first testing and then one year later obtained a score of 69. We can determine whether this represents a significant decline, given the data on practice effects for this test, by first calculating the predicted retest score using the appropriate equation in Table 4 (Retest Year 2). This is given by Ŷ = 53.22 + (0.48 × 75) = 89.22. We then calculate the lower bound of the CI for the predicted retest score: 89.22 − 18.70 = 70.52. The obtained retest score of 69 is below the lower CI boundary. We can therefore conclude that the boy's performance has declined significantly on this test. It should be noted that the actual score difference between test and retest is only 6 points (75 − 69), while the expected practice effect is more than double this magnitude, namely 89.22 − 75 = 14.22.
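The worked example can be expressed as a small helper routine. This is a sketch rather than anything provided by the authors: the function and its name are ours, but the intercept, slope, and CI half-width are the Matching Figures Retest Year 2 values from Table 4.

```python
# Sketch of the Table 4 procedure; the function is illustrative, the numeric
# inputs in the call below come from the article's worked example.
def significant_change(initial, observed, intercept, slope, ci_halfwidth):
    """Compare an observed retest score with the practice-adjusted prediction."""
    predicted = intercept + slope * initial               # Y-hat from the regression equation
    lower, upper = predicted - ci_halfwidth, predicted + ci_halfwidth
    if observed < lower:
        verdict = "significant decline"
    elif observed > upper:
        verdict = "significant improvement"
    else:
        verdict = "no significant change"
    return predicted, lower, upper, verdict

# Matching Figures, Retest Year 2: Y-hat = 53.22 + 0.48 * MF1, 90% CI +/- 18.70.
predicted, lower, upper, verdict = significant_change(
    initial=75, observed=69, intercept=53.22, slope=0.48, ci_halfwidth=18.70)
print(f"predicted retest = {predicted:.2f}, 90% CI [{lower:.2f}, {upper:.2f}] -> {verdict}")
# predicted retest = 89.22, 90% CI [70.52, 107.92] -> significant decline
```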
Discussion

In the present study the test–retest reliabilities are comparable to those recently reported by Collie et al. (2003) for a set of computer tests with a sample of 113 adults. They carried out their four tests during the same day and considered their retest reliabilities acceptable. We would also consider our reliabilities adequate, given that we used a serial retest schedule of 12-month intervals over a total 5-year period. It was notable also that the reliability coefficients increased over the 5-year comparison periods: participants appeared to become more consistent with practice. This is in line with the idea that early development is more variable and unpredictable, while later development is more stable.

The raw scores presented in Figure 1 show steady increases over the 5-year period for all eight measures. These increases represent a combination of developmental changes and practice effects. It is notable that the four age subgroups have roughly parallel curves.

The scaled scores presented in Figure 2 show more of a mixed picture. To check the effectiveness of the normative age-corrected scaled scores in removing developmental changes, we first looked at the between-subjects factor of age subgroup differences. None of the eight tests showed significant age subgroup differences on the scaled scores, while all but one did on the raw scores. This supports our general argument that changes in the age-adjusted (scaled) scores reflect mainly practice (and error) effects. Some measures in Figure 2 continue to show steep linear changes, others show more modest ones, and a few show no obvious systematic changes over time.

We used repeated measures ANOVAs to test the magnitude and strength of the practice effect for each of the eight neurocognitive measures. Repeated measures ANOVA is one of the most powerful statistical methods available (Murphy & Myors, 2004). It is especially powerful where the aim is to examine practice effects over five test occasions as opposed to only two.

Using the threefold criteria derived from the repeated measures ANOVAs, we concluded that, after controlling for developmental change, PegsDom and PegsND show a strong year-on-year practice effect; Matching Figures and Symbol Search show a moderate practice effect; Visual Learning shows a small practice effect; and Coding, Finger Windows, and Digit Span show no demonstrable practice effect. The absence of a practice effect for Digit Span was also found by Brown, Rourke, and Cicchetti (1989); Dikmen et al. (1999); and Wilson et al. (2000), among others.

Two of the tests (Matching Figures and Finger Windows) showed significant Study Years × Age Subgroups interactions on the scaled scores. This indicates that there were differential practice effects for different age subgroups on these measures. Figure 2 shows that on Matching Figures, 8-year-olds had the lowest score in Year 1 but the highest score in Year 5. Similarly, 9-year-olds had the second-lowest score in Year 1 and the second-highest score in Year 5. The two younger groups appeared to be more responsive to practice (as measured by the scaled scores) than the two older groups were on this test. On Finger Windows, the two older groups start lower but end up at similar levels in Year 5.

Further ANOVAs on the Year 1 scaled score data showed that, while there were no significant age group differences for six out of the eight tests, there were significant differences on Matching Figures and Finger Windows. These are the two tests that showed significant Study Year × Age Group interactions. It may be the case, therefore, that the normative data did not adequately correct for age on these two measures in Year 1 of the study, although they did appear to do so in subsequent years.

The findings of differential practice effects for the eight measures are partially consistent with Lezak et al.'s (2004) proposal that practice effects are determined by speed of response, unfamiliarity of the response required, and ease of conceptualization of the solution. In their model, the Pegboard tests would have large practice effects because of the necessity for speed and the lack of familiarity with the response required. Symbol Search and Matching Figures would show practice effects because of the necessity for speed in the former and the fact that there is only a single solution to each problem in the latter. Visual Learning may have a small practice effect because of the low ceiling of the test and the fact that neither speed nor unfamiliar responses are required. Lezak and her colleagues suggest that Digit Span would show practice effects only from the first to the second exposure because of the low ceiling on the test. This was not supported in the present study.

Another way of construing the observed differences in practice effects is in terms of what is known about different types of learning/memory systems. The distinction between procedural (skill learning) and declarative (learning of facts) memory, as outlined by Cohen and Squire (1980) and Squire (1986), is particularly relevant in this respect. As the latter investigators have noted, procedural memory systems are responsive to pure repetition and often remain intact during advanced degenerative processes like Alzheimer's disease. Visual motor tasks such as the Pegboard would seem to rely heavily on procedural memory, which probably explains why they are especially sensitive to the effects of repeated practice.
By contrast, tasks that are dependent on declarative memory, such as Visual Learning, owe more to semantic organization and depth of information processing (Craik & Lockhart, 1972) than to practice per se. It is therefore likely that they will show less of a practice effect, as was found here. Finally, there are the tests that rely on primary memory/working memory systems, such as Digit Span and Finger Windows (and possibly Coding). Although working memory is sensitive to developmental change, Baddeley's (1990) concept of working memory suggests a relatively fixed (hard-wired) system, which is unlikely to be subject to a practice effect. The data from the repeated measures ANOVAs support the contention that practice effects are influenced by both the type of task and the cognitive systems involved in its solution. It would appear, however, that primary memory/working memory systems are impervious to practice effects, at least in this age span for this group.

Given the consistency of the findings in this study, we calculated linear regression equations that can potentially be used by clinicians to estimate, from initial test scores, scaled scores on the five practice-sensitive neurocognitive tests for each of four retest years and to test their significance. In the hypothetical example given, the individual's decline was evidenced in large measure by a failure to show the normal practice effect rather than by the degree of actual decline exhibited on the test. Separating out these two factors is a common clinical problem that should be aided by the simple equations presented in this article. However, it should be noted that the confidence intervals shown in Table 4 are substantial; for example, approximately 3 standard deviations (45 points) in the case of the two Pegboard tests and not much smaller in the case of the other tests. The reason for this is that the year-on-year retest coefficients are much lower than the ideal (i.e., in the .40s to .80s rather than in the .80s to .90s). In consequence, the table can be of assistance only where substantial changes have taken place in a positive direction or moderate changes have occurred in a negative direction, such as in the hypothetical example quoted above. The table should also be used with caution because:

1. The equations generated are generalizable only to comparable populations of healthy Portuguese children retested over similar time periods.

2. Some of the data derive from the WISC–III, which limits the potential utility of these equations for evaluators using more current versions of the WISC, such as the WISC–IV.

3. The U.S. standard scores may not be representative of developmental changes among Portuguese children, whose language, educational, and cultural experiences differ from those of U.S. children.

It has been suggested that data on practice effects in healthy children may not be relevant to children with pathological problems such as brain damage or childhood schizophrenia. The reasoning underlying this argument is that those with pathological problems may fail to show the same magnitude of practice effects as do healthy subjects and that use of the normative regression equations given in Table 4 may, therefore, underestimate the improvement of pathological subjects over time. There is certainly evidence for the first part of this argument, namely that those with pathological problems show less of a practice effect compared with healthy controls (McCaffrey et al., 2000; Wilson et al., 2000).
However, the counterargument is that the task of the clinician is to estimate the full extent of deficits shown by such individuals, including those deficits arising from test factors such as difficulty in understanding instructions, impulsivity, distractibility, and failure to benefit from the practice effect. It follows that it is only by using normative data such as those represented in the regression equations here that it will be possible to evaluate the full extent of deficit or recovery.

A similar argument was previously found to be relevant to research on cardiac bypass surgery in adults. Many group studies in this area had found substantial deficits in neuropsychological performance immediately postsurgery that seemed to disappear on follow-up. However, most of these studies did not use a control group and did not take account of practice effects. Those that did use a healthy control group found continuing significant deficits at follow-up. This led two of the current authors (Peter D. Slade and Brenda D. Townes) to publish a methodological article on research design in the area in which one of the major suggestions was the necessity of using a healthy control group to estimate, and control for, practice effects (Slade, Sanchez, Townes, & Aldea, 2001).

One limitation of the present study was our inability to obtain information about parental education and occupation, due to concerns about invasion of privacy. Environmental factors are known to influence test performance, with individuals of higher socioeconomic status (SES) performing better than those of lower SES, particularly on verbal tests (Sattler, 2001). However, no investigation isolating the effects of SES on the presence, absence, or magnitude of practice effects was found.

A second possible criticism that might be directed at the present analysis is that age-adjusted scaled scores do not remove all of the variance due to developmental factors. In a general sense this is obviously true. The normative data are unlikely to correct completely for all the developmental effects observed; and in fact we have shown that this is true for the data of two age subgroups (8-year-olds and 11-year-olds) on two tests (Matching Figures and Finger Windows) in the first year of testing. However, age-adjusted scaled scores do seem to have removed the overall general effects of developmental changes.

In conclusion, the dataset examined in this article is unique, with a large cohort of children being tested yearly for 5 years on a battery of neuropsychological tests. We believe that the simple methodology of comparing raw scores with scaled scores is a useful one for separating out developmental from practice effects in the serial neurocognitive testing of children and adolescents.

References

Adams, W., & Sheslow, D. (1995). Wide Range Assessment of Visual Motor Abilities: Manual. Wilmington, DE: Wide Range.

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.

Baddeley, A. (1990). Human memory: Theory and practice. Hove, United Kingdom: Erlbaum.

Benedict, R. H. B., & Zgaljardic, D. J. (1998). Practice effects during repeated administrations of memory tests with and without alternate forms. Journal of Clinical and Experimental Neuropsychology, 20, 339–352.
Bird, C. M., Papadopoulu, K., Ricciardelli, P., Rossor, M. N., & Cipolotti, L. (2003). Test-retest reliability, practice effects and reliable change indices for the Recognition Memory Test. British Journal of Clinical Psychology, 42, 407–425.

Bird, C. M., Papadopoulu, K., Ricciardelli, P., Rossor, M. N., & Cipolotti, L. (2004). Monitoring cognitive changes: Psychometric properties of six cognitive tests. British Journal of Clinical Psychology, 43, 197–210.

Brown, S. J., Rourke, B. D., & Cicchetti, D. V. (1989). Reliability of tests and measures used in the neuropsychological assessment of children. Clinical Neuropsychologist, 3, 353–368.

Cohen, N. J., & Squire, L. R. (1980, October 10). Preserved learning and retention of pattern-analyzing skill in amnesia: Dissociation of knowing how and knowing that. Science, 210(4466), 207–210.

Collie, A., Maruff, P., Darby, D. G., & McStephen, M. (2003). The effects of practice on the cognitive test performance of neurologically normal individuals assessed at brief test-retest intervals. Journal of the International Neuropsychological Society, 9, 419–428.

Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior, 11, 671–684.

DeRouen, T. A., Leroux, B. G., Martin, M. D., Townes, B. D., Woods, J. S., Leitao, J., et al. (2002). Issues in design and analysis of a randomized clinical trial to assess the safety of dental amalgam restorations in children. Controlled Clinical Trials, 23, 301–320.

DeRouen, T. A., Martin, M. D., Leroux, B. G., Townes, B. D., Woods, J. S., Leitao, J., et al. (2006). Neurobehavioral effects of dental amalgam in children: A randomized clinical trial. Journal of the American Medical Association, 295, 1784–1792.

Dikmen, S. S., Heaton, R. K., Grant, I., & Temkin, N. R. (1999). Test-retest reliability and practice effects of Expanded Halstead-Reitan Neuropsychological Test Battery. Journal of the International Neuropsychological Society, 5, 346–356.

Dodrill, C. B., & Troupin, A. S. (1975). Effects of repeated administrations of a comprehensive neuropsychological battery among chronic epileptics. Journal of Nervous and Mental Diseases, 161, 185–190.

Hammill, D. D., Pearson, N. A., & Wiederholt, J. L. (1997). Comprehensive Test of Nonverbal Intelligence: Manual. Austin, TX: Pro-Ed.

Hinton-Bayre, A. D., Geffen, G. M., Geffen, L. B., McFarland, K. A., & Friss, P. (1999). Concussion in contact sports: Reliable change indices of impairment and recovery. Journal of Clinical and Experimental Neuropsychology, 21, 70–86.

Lassiter, K. S., Harrison, T. K., Matthews, T. D., Bell, N. L., & The Citadel. (2001). The validity of the Comprehensive Test of Nonverbal Intelligence as a measure of fluid intelligence. Assessment, 8, 95–103.

Lezak, M. D. (1995). Neuropsychological assessment. New York: Oxford University Press.

Lezak, M. D., Howieson, D. B., & Loring, D. W. (2004). Neuropsychological assessment (4th ed.). New York: Oxford University Press.

Martins, I. P., Castro-Caldas, A., Townes, B. D., Ferreira, G., Rodrigues, P., Marques, S., et al. (2005). Age and sex differences in neurobehavioral performance: A study of Portuguese elementary school children. International Journal of Neuroscience, 115, 1687–1709.

McCaffrey, R. J., Duff, K., & Westervelt, H. J. (2000). Practitioner's guide to evaluating change with intellectual assessment instruments. New York: Kluwer Academic/Plenum Press.

Murphy, K. R., & Myors, B. (2004). Statistical power analysis. Mahwah, NJ: Erlbaum.

Sattler, J. M. (2001). Assessment of children: Cognitive applications (3rd ed.). La Mesa, CA: Sattler.
Sheslow, D., & Adams, W. (1990). Wide Range Assessment of Memory and Learning: Manual. Wilmington, DE: Jastak Associates.

Slade, P. D., Sanchez, P., Townes, B. D., & Aldea, G. S. (2001). The use of neurocognitive tests in evaluating the outcome of cardiac surgery: Some methodologic considerations. Journal of Cardiothoracic and Vascular Anesthesia, 15, 4–8.

Squire, L. R. (1986, June 27). Mechanisms of memory. Science, 232(4758), 1612–1619.

Temkin, N. R., Heaton, R. K., Grant, I., & Dikmen, S. S. (1999). Detecting significant change in neuropsychological test performance: A comparison of four models. Journal of the International Neuropsychological Society, 5, 357–369.

Wechsler, D. (1991). Wechsler Intelligence Scale for Children—Third Edition: Manual. San Antonio, TX: Psychological Corporation.

Wechsler, D. (1997). Wechsler Adult Intelligence Scale—Third Edition. San Antonio, TX: Psychological Corporation.

Wechsler, D. (2004). Wechsler Intelligence Scale for Children—Fourth Edition: Manual. San Antonio, TX: Psychological Corporation.

Wilson, B. A., Watson, P. C., Baddeley, A. D., Emslie, H., & Evans, J. J. (2000). Improvement or simply practice? The effects of twenty repeated assessments on people with and without brain injury. Journal of the International Neuropsychological Society, 6, 469–479.

Received September 24, 2007
Revision received May 1, 2008
Accepted May 8, 2008