Psychological Assessment
2008, Vol. 20, No. 4, 361–369
Copyright 2008 by the American Psychological Association
1040-3590/08/$12.00 DOI: 10.1037/a0012950
The Serial Use of Child Neurocognitive Tests: Development Versus
Practice Effects
Peter D. Slade
Expert Data Analysis for Doctors and Others

Brenda D. Townes and Gail Rosenbaum
University of Washington

Isabel P. Martins, Henrique Luis, and Mario Bernardo
University of Lisbon

Michael D. Martin and Timothy A. DeRouen
University of Washington
When serial neurocognitive assessments are performed, 2 main factors are of importance: test–retest
reliability and practice effects. With children, however, there is a third, developmental factor, which
occurs as a result of maturation. Child tests recognize this factor through the provision of age-corrected
scaled scores. Thus, a ready-made method for estimating the relative contribution of developmental
versus practice effects is the comparison of raw (developmental and practice) and scaled (practice only)
scores. Data from a pool of 507 Portuguese children enrolled in a study of dental amalgams (T. A.
DeRouen, B. G. Leroux, et al., 2002; T. A. DeRouen, M. D. Martin, et al., 2006) showed that practice
effects over a 5-year period varied on 8 neurocognitive tests. Simple regression equations are provided
for calculating individual retest scores from initial test scores.
Keywords: neurocognitive tests, repeat testing, practice effects, test reliability, child development
Peter D. Slade, Expert Data Analysis for Doctors and Others, West Kirby, United Kingdom; Brenda D. Townes, Department of Psychiatry and Behavioral Sciences, University of Washington; Gail Rosenbaum, Regional Epilepsy Center, University of Washington; Isabel P. Martins, Language Research Laboratory, Department of Neurology, University of Lisbon, Lisbon, Portugal; Henrique Luis and Mario Bernardo, Faculty of Dental Medicine, University of Lisbon; Michael D. Martin, Departments of Oral Medicine and Epidemiology, University of Washington; Timothy A. DeRouen, Departments of Dental Public Health Sciences and Biostatistics, University of Washington.

This project was funded by National Institute of Dental and Craniofacial Research Cooperative Agreement U01 DE 11894. The authors wish to thank the staff and students of the Casa Pia School System, Lisbon, Portugal, for their assistance with the project.

Correspondence concerning this article should be addressed to Brenda D. Townes, 20 Park Road, West Kirby, Wirral CH48 4DW, United Kingdom. E-mail: btownes@u.washington.edu

Neuropsychologists are often involved in carrying out serial assessments of children and adolescents. These may include monitoring patterns of intellectual development, the study of children's recovery after epilepsy surgery, or observing the degree of improvement following localized or generalized head injury. There is also a crucial use in determining the efficacy of therapeutic interventions (pharmacological or surgical) with primary cognitive endpoints.

Although some tests come with information on test–retest reliability—the WISC–III (Wechsler, 1991), for example—many others do not. Even when such data are provided, they usually cover only relatively short test–retest intervals of a few months—and then for only a single retest. But many serial assessments are conducted over much longer time periods—9 months to a year—and often involve three or more repetitions of the same tests.

The lack of serial assessment information on neurocognitive tests should not be taken to imply that little work has been done in this area. As Lezak, Howieson, and Loring (2004, p. 116) point out, many studies researching the effects of repeated examinations have revealed "an overall pattern of test susceptibility to practice effects." Moreover, the same authors note that "numerous studies have also shown a general test-taking benefit in which enhanced performance may occur after repeated examinations" (p. 116). This phenomenon has been referred to as test sophistication by Anastasi (1988).

The most comprehensive overview of the effects of repeat testing was carried out by McCaffrey, Duff, and Westervelt (2000). They reviewed hundreds of studies that had been carried out on tests of intelligence (mainly on the Wechsler scales) in both healthy samples and patient groups and for both adults and children. They outline their findings in summary tables covering a total of 212 pages, a useful if not essential resource for anyone using the Wechsler tests on a repeated basis. As well as reviewing the extent of practice effects with differing test–retest intervals, they looked at some of the individual difference variables that have been found to affect the magnitude of practice effects. These include age, gender, intelligence, education, and the presence (or absence) of a disease process.

As pointed out by McCaffrey et al. (2000), the effects of repeated testing involve two main factors: test–retest reliability and practice effects. The former refers to the measurement error associated with any given test and involves the stability of the relative rankings of individuals' scores across testing occasions. That is, do individuals who score high on initial testing also score high on retest? Do individuals who score low on testing also score low on retest? And do the rest maintain their relative intermediate positions? Test–retest reliability is traditionally measured by the correlation coefficient between test and retest scores. Practice effects, on the other hand, refer to the amount of overall change in scores from test to retest. Where there are only two testing occasions, the significance of the mean change is usually evaluated by using matched-pairs t tests (e.g., Dikmen, Heaton, Grant, & Temkin, 1999). Where more than two testing occasions are involved, repeated measures analysis of variance (ANOVA) is commonly used instead (e.g., Collie, Maruff, Darby, & McStephen, 2003; Hinton-Bayre et al., 1999). However, in the case of children and adolescents there is another factor, in addition to practice effects, that comes into play when considering repeated testing: namely, developmental factors. Test stability thus involves three factors: test reliability, practice effects, and development.
Most of the studies on test–retest reliability and practice effects
have used only two testing occasions. A notable exception is the
study by Wilson, Watson, Baddeley, Emslie, and Evans (2000).
They administered a battery of 11 tests to two small groups of
brain-injured (n = 10) and control (n = 13) subjects on 20 separate occasions over the course of a 4-week period. Many of the tests showed continuing practice effects, which were larger in the controls than in the brain-injured participants. A semantic processing test
and word fluency tests showed the largest practice effects, while
digit span and simple reaction time showed the smallest. This
study followed a smaller scale study by Benedict and Zgaljardic
(1998), who had administered the same forms and parallel forms of
verbal and nonverbal memory tests to their participants every 2
weeks for 8 weeks. Benedict and Zgaljardic found significant
practice effects when the same tests were used, but these significant effects disappeared when the tests were replaced by alternate
test forms.
Practice effects have been found on many if not most neurocognitive tests (Bird, Papadopoulu, Ricciardelli, Rossor, & Cipolotti, 2003, 2004; Collie et al., 2003). However, it is generally
recognized that some tests are more susceptible to practice effects
than others (Dodrill & Troupin, 1975; Lezak, 1995). Lezak et al.
(2004) summarized this differential practice effect in the following
way:
Tests that have a large speed component, require an unfamiliar or
infrequently practiced mode of response, or have a single solution—
particularly if it can be easily conceptualized once it is attained—are
more likely to show practice effects. (p. 116)
Rationale
Many of the tests developed for assessing neuropsychological
performance in children and adolescents are known to show developmental (maturational) changes with age. Consequently, these
tests were developed with age-corrected norms that allow the
examiner to compare any given child with other children of the
same age. In the case of the WISC–III, scaled scores are provided
at 3-month intervals up to the age of 16 years 11 months; at
this point the WAIS–III (Wechsler, 1997) takes over with yearly
age increments. In the case of the Wide Range Assessment of
Memory and Learning (WRAML; Sheslow & Adams, 1990), age
norms are provided at 6-month intervals between the ages of 5 and
13 years and at 1-year intervals for ages 14 and 15 years. And in
the case of the Wide Range Assessment of Visual Motor Abilities
(WRAVMA; Adams & Sheslow, 1995), age norms are provided at
6-month intervals between the ages of 3 and 13 years and at 1-year
intervals for ages 13 to 17 years 11 months.
Thus, with tests designed for use with children and adolescents,
we can expect to find increases in raw scores over time due to a
combination of two factors: developmental (maturational) changes
and changes due to repeat testing (practice effects). Of course, if
no practice effects are operating, then raw score changes over time
will be entirely due to developmental changes (plus error).
However, where we have age-corrected norms (scaled scores)
available for a test, we have a ready-made method for estimating
how much of the change in scores is due to development and how
much is due to practice. Namely, changes in raw scores reflect
both of these factors in combination, while changes in age-corrected scaled scores are likely to reflect practice alone. That is,
once raw scores have been converted to scaled scores, we would
not expect to find any further mean changes over time in the scaled
scores unless there are additional practice effects operating. Of
course, the major assumption here is that the age-corrected scaled
scores do fully remove the developmental (maturational) changes
in performance from repeat scores.
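This assumption can be written as a simple measurement model. The notation below is ours, a sketch of the logic rather than a formal model taken from the test manuals:

    \mathrm{Raw}_t = \mu(\mathrm{age}_t) + \pi(t) + \varepsilon_t, \qquad
    \mathrm{Scaled}_t = f(\mathrm{Raw}_t \mid \mathrm{age}_t \text{ norms}) \approx \pi(t) + \varepsilon'_t,

where \mu(\mathrm{age}_t) is the expected maturational level at the child's age on testing occasion t, \pi(t) is the cumulative practice effect after t administrations, and \varepsilon_t is measurement error. Mean changes in raw scores therefore estimate development plus practice, whereas mean changes in scaled scores estimate practice alone, provided that the age norms fully absorb \mu.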
Method
Participants
As part of a clinical trial on the safety of dental treatments with
amalgams containing mercury, 507 Portuguese children were randomly assigned to receive dental treatment with either amalgam
restorations for posterior teeth or composite-only restorations
(DeRouen et al., 2002). The 4 children who were 12 years of age at the outset were dropped from this aspect of the study. This left a total of 503 participants who were 8 to 11 years of age at onset, and, due to dropouts, there were 437 participants ages 16–19 years
upon completion of the study 8 years later. Inclusion criteria were
(a) having at least one carious lesion in a permanent tooth, (b) no
previous exposure to amalgam treatments, (c) urinary mercury
level less than 10 μg/L at baseline, (d) blood lead level less than 15 μg/dL, (e) no interfering health conditions, and (f) IQ equal to or
greater than 67 as obtained on the Comprehensive Test of Nonverbal Intelligence (CTONI; Hammill, Pearson, & Wiederholt,
1997). The CTONI was chosen as an instrument designed for use
in cross-cultural settings as a brief, nonverbal test of intelligence
that was available at the onset of the study (1997).
During selection of participants, the lower IQ boundary of 67 on
the CTONI was chosen. Though truncated at the lower end, the
CTONI IQ at baseline was otherwise normally distributed, with a mean of 85.10 (SD = 9.96). These scores are consistent with (a) clinical experience suggesting that the CTONI underestimates intelligence in non-U.S. populations by up to one standard deviation (Martins et al., 2005) and (b) the fact that the CTONI IQ is known to underestimate fluid intelligence (Lassiter, Harrison, Matthews, Bell, & The Citadel, 2001). At baseline, 55% of participants were male; 71% were Caucasian, 28% African Portuguese, and 1% Asian Portuguese. The mean age was 10.1 years (range = 8.0–11.9), and the mean IQ on the CTONI was 85.10 (range = 67–118).
Institutional review board approval and parental or guardian consent were obtained for all children who took part in neurocognitive testing.
Since no significant differences were found between the two
groups at any time on any of the neurocognitive outcome measures
(DeRouen et al., 2006), the two groups were combined into a
single cohort.
Procedure
Participants completed a battery of neurocognitive tests in Portuguese annually for 8 years. Translation of all tests and instructions was done by a professor at the University of Washington who
was a native speaker of Portuguese. Back translations were performed by Mario Bernardo and Henrique Luis, both of whom are
fluent in English. The back translations were verified by the senior U.S. psychometrist (Gail Rosenbaum), who has had extensive experience in test administration and scoring as well as cross-cultural testing. Three psychometrists were trained to administer the tests, and their performance was continually monitored by Gail Rosenbaum. Their work was calibrated throughout the study by using ratings on a 136-item checklist to review videotaped testing sessions (with 94.5% to 97.8% accuracy). Tests were double-scored, and data were corrected when errors were identified (no severe violations of protocol that required discarding data were observed).
The neurocognitive tests utilized were chosen on the basis of
their known sensitivity in measuring brain-behavior relationships
(Lezak, 1995), appropriateness for children 8–12 years of age, and,
where possible, minimal practice effects. Functional areas assessed
were learning and memory, motor and visual motor abilities, and
attention.
Of the 19 neuropsychological tests administered during the study, 8 had age-corrected scaled scores available in addition to raw scores. These were Pegboard with the dominant (PegsDom) and nondominant (PegsND) hands and Matching Figures
from the WRAVMA; Finger Windows and Visual Learning from
the WRAML; and Coding, Symbol Search, and Digit Span from
the WISC–III.
The three test batteries were standardized on fairly substantial
and representative samples of U.S. children. The WISC–III was
standardized on a sample of 2,200 children between the ages of 6
and 16 years 11 months. The normative sample was stratified in
terms of age, gender, ethnic origin, and region. The WISC–IV
(Wechsler, 2004) was not available at the inception of this study in
1997. The WRAML was standardized on a sample of 2,363
children between the ages of 5 years and 17 years 11 months. The
normative sample was stratified in terms of age, gender, ethnic
origin, region, and urban–rural distribution. The WRAVMA was
standardized on a sample of 2,282 children between the ages of 3
years and 17 years 11 months. The normative sample was stratified
by age, gender, ethnic origin, region, and socioeconomic status.
Of the eight tests from the above three batteries, six had been
given on five successive occasions, a year apart, and then replaced
by similar or equivalent adult tests for the duration of the study.
Rather than attempt to make corrections for the transition from
child to adult tests, we decided that for present purposes we would
restrict our analyses to the participants who had completed the
eight child tests on all or most occasions over the first 5 years of
the study. Of the 503 children who were ages 8–11.9 years at the
start of the study, 408 completed all eight tests on all five occasions.
Statistical Design and Analysis
All analyses were conducted with SPSS 15.0. Test–retest
reliability was assessed in the traditional manner with Pearson
product–moment correlations between test scores for consecutive
pairs of study years.
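As a sketch, coefficients of this kind can be reproduced from a long-format score file along the following lines. The file layout, column names, and test name here are our assumptions for illustration, not the study's actual data structure:

    import pandas as pd
    from scipy.stats import pearsonr

    # Hypothetical long-format file: one row per child per study year,
    # with an id column, a year column (1-5), and one column per test.
    df = pd.read_csv("scores.csv")
    wide = df.pivot(index="id", columns="year", values="pegs_dom")

    for y1, y2 in [(1, 2), (2, 3), (3, 4), (4, 5)]:
        pair = wide[[y1, y2]].dropna()        # listwise deletion per year pair
        r, p = pearsonr(pair[y1], pair[y2])
        print(f"Y{y1}-Y{y2}: r = {r:.2f} (n = {len(pair)}, p = {p:.3g})")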
Repeated measures analyses of variance (ANOVAs) were used
to assess the significance of changes over the five study years. The
analyses were carried out on both raw scores and scaled scores. In
the case of raw scores, any significant changes over time were
attributed to a combination of both developmental and practice
effects. In the case of scaled scores, any significant changes were
attributed to practice effects alone.
The repeated measures ANOVAs included a within-subjects
factor of study year; a between-subjects factor of age subgroups
(the latter subgroups were formed in the first year of the study and
were categorized as 8-year-olds [8 to 8.99 years], 9-year-olds [9 to
9.99 years], 10-year-olds [10 to 10.99 years], and 11-year-olds [11
to 11.99 years]); and the Study Years × Age Subgroups interaction.
A between-subjects factor of gender was not used because the
focus was upon estimating potential age differences in practice
effects and the impact of attempting to control them through use of
age-adjusted scaled scores. Furthermore, the test batteries used (WISC–III, WRAVMA, and WRAML) provide separate normative tables only for different age groups, not for boys and girls.
Three aspects of the results were used to establish the significance and magnitude of the practice effects in the case of the scaled score analyses:

1. the significance of the within-subjects effect due to study years,

2. the significance of the linear trend due to study years, and

3. the amount of variance accounted for by the linear trend, as indexed by partial eta-squared.¹
We determined a priori that a practice effect can be deemed to
be present if all three of the above were found; that is, a highly
significant within-subjects effect indicating that the scaled scores
of the sample varied significantly over the 5 years of the study; a
significant linear trend indicating that a linear function provided a
significant fit to the scaled score differences over the 5-year
period; and a linear trend that accounted for a sizable proportion of
the test variance. Where one or more of these criteria were not met,
we concluded that a practice effect had not been demonstrated.
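The linear-trend criterion can also be checked directly from the five yearly scores: a single-df polynomial contrast tested against zero gives the same F as the ANOVA trend test. The following Python sketch mirrors that logic; it is our illustration, not the SPSS procedure used in the study:

    import numpy as np
    from scipy import stats

    def linear_trend(scores):
        """scores: (n_subjects, 5) array of scaled scores for Years 1-5."""
        weights = np.arange(5) - 2.0             # centered linear contrast -2..2
        contrast = scores @ weights              # each subject's linear trend score
        t, p = stats.ttest_1samp(contrast, 0.0)  # is the mean trend nonzero?
        df_error = len(contrast) - 1
        F = t ** 2                               # for a single-df contrast, F = t^2
        eta_p2 = F / (F + df_error)              # partial eta-squared for the trend
        return F, p, eta_p2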
In order to check on our major assumption that changes in raw scores reflect a combination of both development and practice effects, while changes in scaled scores reflect purely practice, we reasoned that there should be age subgroup differences in raw scores (between 8-, 9-, 10-, and 11-year-olds) but no age subgroup differences in scaled scores. That is, the age-adjusted norms should remove these age differences in the case of the latter.
¹ Given the large sample size, we felt justified in using partial eta-squared rather than the more complex omega squared as a measure of effect size.
And finally, we calculated simple linear regression equations for
predicting scaled scores on retest from scaled scores on initial
testing for those measures that demonstrated clear practice effects.
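Each equation and interval of this kind can be derived from paired initial and retest scores as sketched below (NumPy). The ddof = 2 used for the residual standard deviation is our assumption, since the article does not state the degrees of freedom used:

    import numpy as np

    def retest_equation(initial, retest):
        # initial, retest: NumPy arrays of Year 1 and retest scaled scores.
        # Fit retest = a + b * initial, as in Table 4.
        b, a = np.polyfit(initial, retest, 1)        # slope first, then intercept
        residuals = retest - (a + b * initial)
        half_width = 1.645 * residuals.std(ddof=2)   # 90% CI half-width (normal z)
        return a, b, half_width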
Results
Test–Retest Reliability
The test–retest reliability coefficients are presented in Table 1. Although some of them are somewhat low, such as those for Finger Windows and Digit Span in the earlier years, over 60% of the year-on-year comparisons are in excess of .60. It is notable also that there is a general increase in the size of the coefficients over time, such that five of the eight tests had coefficients in excess of .60 in the Year 4/5 comparison.

Table 1
Test–Retest Reliability of Cognitive Tests Administered Over 5 Consecutive Years

Test                Y1–Y2 (n = 473)   Y2–Y3 (n = 457)   Y3–Y4 (n = 437)   Y4–Y5 (n = 417)
PegsDom                  .49               .51               .55               .63
PegsND                   .51               .56               .63               .64
Matching Figures         .51               .56               .49               .53
Symbol Search            .60               .68               .72               .70
Visual Learning          .66               .73               .70               .65
Coding                   .71               .79               .81               .81
Finger Windows           .35               .31               .29               .44
Digit Span               .27               .32               .42               .54

Note. Listwise deletions were used with each pair of study years. All correlations were significant at the .01 level or beyond. Y = year; PegsDom = Pegboard with dominant hand; PegsND = Pegboard with nondominant hand.
Analysis of Raw Scores
The raw scores of the age subgroups for each of the eight tests
are shown in Figure 1. It can be seen that in all cases there is
substantial change in scores over the 5-year period, with most of
the curves showing steady incremental changes. This impression is
confirmed by the results of repeated measures ANOVAs, which are presented in Table 2. These show highly significant within-subjects effects (Fs ranging from 28.9 to 638.2), highly significant linear trends (Fs ranging from 187.2 to 3,153.7), and partial eta-squared values showing that the linear trend accounts for between 18% of the variance in the case of Visual Learning and 81% in the cases of PegsND and Coding. It is notable also that there are significant between-subjects (age subgroups) differences for seven out of the eight tests, the exception being Visual Learning. These subgroup differences account for between 3% (Digit Span) and 12% (Symbol Search and Coding) of the variance, confirming the importance of using age-adjusted norms in clinical and research work with these instruments. There are also significant Study Years × Age Subgroups interaction effects for four out of the eight tests.

Figure 1. Raw scores of age subgroups on eight measures over a 5-year period. PegsDom = Pegboard with the dominant hand; PegsND = Pegboard with the nondominant hand. [Eight line-graph panels, one per test, plotting raw score against study year for the four age subgroups; not reproduced.]

Table 2
Results of Repeated Measures Analyses of Variance for Raw Scores

Within-subjects study years (dfs*; F; p; partial eta-squared):
  PegsDom: 3.82, 1565; 327.9; <.001; .45
  PegsND: 3.67, 1500; 638.2; <.001; .61
  Matching Figures: 3.74, 1539; 157.5; <.001; .28
  Symbol Search: 3.70, 1521; 552.4; <.001; .57
  Visual Learning: 3.83, 1572; 28.9; <.001; .07
  Coding: 3.44, 1413; 713.7; <.001; .64
  Finger Windows: 4.00, 1644; 126.0; <.001; .24
  Digit Span: 3.76, 1543; 61.32; <.001; .13

Linear trend for years (dfs; F; p; partial eta-squared):
  PegsDom: 1, 409; 976.8; <.001; .71
  PegsND: 1, 409; 1,725.3; <.001; .81
  Matching Figures: 1, 412; 540.2; <.001; .57
  Symbol Search: 1, 411; 1,505.6; <.001; .79
  Visual Learning: 1, 411; 3,153.7; <.001; .18
  Coding: 1, 411; 1,693.8; <.001; .81
  Finger Windows: 1, 411; 383.6; <.001; .48
  Digit Span: 1, 411; 187.2; <.001; .31

Between-subjects age subgroups (dfs; F; p; partial eta-squared):
  PegsDom: 3, 409; 9.79; <.001; .07
  PegsND: 3, 409; 10.93; <.001; .07
  Matching Figures: 3, 412; 6.77; <.05; .05
  Symbol Search: 3, 411; 19.22; <.001; .12
  Visual Learning: 3, 411; 0.50; ns; .00
  Coding: 3, 411; 19.15; <.001; .12
  Finger Windows: 3, 411; 8.62; <.001; .06
  Digit Span: 3, 411; 3.55; <.02; .03

Study Years × Age Subgroups (dfs*; F; p; partial eta-squared):
  PegsDom: 11.48, 1565; 2.60; <.003; .02
  PegsND: 11.01, 1500; 2.24; <.009; .02
  Matching Figures: 11.21, 1539; 4.01; <.001; .03
  Symbol Search: 11.10, 1521; 1.04; ns; .01
  Visual Learning: 11.48, 1572; 2.21; <.02; .02
  Coding: 10.32, 1413; 1.42; ns; .01
  Finger Windows: 12.00, 1644; 0.93; ns; .01
  Digit Span: 11.27, 1543; 1.40; ns; .01

Note. PegsDom = Pegboard with dominant hand; PegsND = Pegboard with nondominant hand.
* When Mauchly's test is significant, Greenhouse–Geisser adjusted dfs are used.
Analyses of Scaled Scores
The scaled scores of the age subgroups for each of the eight tests are shown in Figure 2. It can be seen that some tests show steady incremental changes while others show relatively flat trends. This is again confirmed by the results of the repeated measures ANOVAs, which are presented in Table 3.

Figure 2. Scaled scores of age subgroups on eight measures over a 5-year period. The first three tests—PegsDom, PegsND, and Matching Figures—have a mean scaled score of 100 (SD = 15). The other five have a mean of 10 (SD = 3). PegsDom = Pegboard with the dominant hand; PegsND = Pegboard with the nondominant hand. [Eight line-graph panels, one per test, plotting scaled score against study year for the four age subgroups; not reproduced.]

Table 3
Results of Repeated Measures Analyses of Variance for Scaled Scores

Within-subjects study years (dfs*; F; p; partial eta-squared):
  PegsDom: 3.82, 1564; 112.2; <.001; .22
  PegsND: 3.80, 1542; 75.27; <.001; .16
  Matching Figures: 3.89, 1602; 21.28; <.001; .05
  Symbol Search: 3.77, 1547; 22.86; <.001; .05
  Visual Learning: 3.87, 1588; 10.83; <.001; .03
  Coding: 3.55, 1457; 8.69; <.001; .02
  Finger Windows: 4.00, 1644; 5.87; <.001; .01
  Digit Span: 3.79, 1560; 0.76; ns; .00

Linear trend for years (dfs; F; p; partial eta-squared):
  PegsDom: 1, 409; 335.2; <.001; .45
  PegsND: 1, 406; 232.7; <.001; .36
  Matching Figures: 1, 412; 74.46; <.001; .15
  Symbol Search: 1, 411; 63.25; <.001; .13
  Visual Learning: 1, 411; 34.66; <.001; .08
  Coding: 1, 411; 3.89; <.05; .01
  Finger Windows: 1, 411; 2.64; ns; .01
  Digit Span: 1, 411; 0.04; ns; .00

Between-subjects age subgroups (dfs; F; p; partial eta-squared):
  PegsDom: 3, 409; 0.53; ns; .00
  PegsND: 3, 406; 0.39; ns; .00
  Matching Figures: 3, 412; 0.07; ns; .00
  Symbol Search: 3, 411; 1.17; ns; .01
  Visual Learning: 3, 411; 0.22; ns; .00
  Coding: 3, 411; 1.91; ns; .01
  Finger Windows: 3, 411; 1.97; ns; .01
  Digit Span: 3, 411; 1.71; ns; .01

Study Years × Age Subgroups (dfs*; F; p; partial eta-squared):
  PegsDom: 11.47, 1564; 0.67; ns; .01
  PegsND: 11.40, 1542; 0.59; ns; .00
  Matching Figures: 11.67, 1602; 2.00; <.03; .01
  Symbol Search: 11.30, 1547; 0.66; ns; .01
  Visual Learning: 11.59, 1588; 1.05; ns; .01
  Coding: 10.64, 1457; 1.21; ns; .01
  Finger Windows: 12.00, 1644; 1.82; <.05; .01
  Digit Span: 11.37, 1560; 1.21; ns; .01

Note. PegsDom = Pegboard with dominant hand; PegsND = Pegboard with nondominant hand.
* When Mauchly's test is significant, Greenhouse–Geisser adjusted dfs are used.
The first thing of note in Table 3 is the between-subjects age
subgroup ANOVAs. Unlike the case with raw scores, where seven
out of eight tests were statistically significant, all of the age
subgroup differences for the scaled scores were nonsignificant.
This suggests that the age-adjusted norms have in all cases done a
satisfactory job of removing the developmental factor from the
equation. What are left, therefore, are primarily practice (and error)
effects.
In the case of PegsDom and PegsND, the within-subjects
ANOVAs are highly significant; the linear trends are highly significant; and the linear trends account for 45% and 36% of the
variance, respectively. These visual motor tests show a strong
year-on-year practice effect.
In the case of Matching Figures and Symbol Search, the within-subjects ANOVAs are statistically significant and the linear trends are significant, but the linear trends account for only 15% and 13% of the variance, respectively. These tests show a moderate year-on-year practice effect.
In the case of Visual Learning, the within-subjects ANOVA is
significant, as is the linear trend. However, the variance accounted for by the linear trend is only 8%. This reflects the much flatter
curve of this test, as shown in Figure 2. Visual Learning shows, at
most, a small practice effect after controlling for developmental
change.
In the case of Coding, Finger Windows, and Digit Span, the first
two tests show a significant within-subjects effect, while Digit
Span does not; Coding shows a significant linear trend, while
Finger Windows and Digit Span do not; finally, the linear trends
for all three account for only 0% or 1% of the variance.
We can conclude that none of these three tests (Coding, Finger
Windows, or Digit Span) show any detectable practice effect. Only
two of the tests, Matching Figures and Finger Windows, show a
significant Study Years × Age Subgroups interaction term.
Finally, because there appear to be age subgroup differences in scaled scores in Year 1 similar to those shown in the raw scores (see Figure 2), we carried out a further set of ANOVAs on the eight sets of Year 1 scaled scores, with age groups as the grouping factor. For six of the variables, there were no significant differences between age groups, the exceptions being Matching Figures, F(3, 499) = 3.32, p < .05, and Finger Windows, F(3, 499) = 5.97, p < .01. Post hoc Tukey HSD tests showed that, in the case of Matching Figures, the 8-year-olds scored significantly lower than did the other three age groups, while in the case of Finger Windows, the 11-year-olds scored significantly lower than did the other three age groups. That is, for the Year 1 data alone, the scaled scores did not fully correct for age for 8-year-olds in the case of Matching Figures, whereas in the case of Finger Windows, they appeared to overcorrect for age for the 11-year-olds.
Linear Regression Equations for Predicting Retest Scores
Table 4 presents simple linear regression equations for estimating retest scaled scores from initial scaled scores, for the five
measures that manifested clear evidence of practice effects. This is
comparable to the third model used by Temkin, Heaton, Grant, and
Dikmen (1999). These equations can be used to calculate predicted
retest scores (Ŷ) that can then be compared directly with observed
retest scores and evaluated for significance with the 90% confidence intervals (CI) also presented in Table 4. The 90% CI has
been calculated by multiplying the standard deviations of the
residuals estimated by the regression by 1.645, the z score cutoff
for a 90% interval based on the normal distribution.

Table 4
Linear Regression Equations for Predicting Individual Scaled Scores on Retesting From Initial Test Scores

Retest Year 2:
  PegsDom: Ŷ = 64.75 + (0.43 × PegsDom1); 90% CI = ±22.98
  PegsND: Ŷ = 63.76 + (0.44 × PegsND1); 90% CI = ±20.48
  MF: Ŷ = 53.22 + (0.48 × MF1); 90% CI = ±18.70
  SS: Ŷ = 4.30 + (0.60 × SS1); 90% CI = ±4.10
  VL: Ŷ = 3.21 + (0.64 × VL1); 90% CI = ±3.49

Retest Year 3:
  PegsDom: Ŷ = 65.10 + (0.48 × PegsDom1); 90% CI = ±22.68
  PegsND: Ŷ = 63.38 + (0.47 × PegsND1); 90% CI = ±21.15
  MF: Ŷ = 60.05 + (0.41 × MF1); 90% CI = ±17.85
  SS: Ŷ = 4.73 + (0.56 × SS1); 90% CI = ±4.23
  VL: Ŷ = 3.30 + (0.63 × VL1); 90% CI = ±3.82

Retest Year 4:
  PegsDom: Ŷ = 71.70 + (0.44 × PegsDom1); 90% CI = ±21.96
  PegsND: Ŷ = 72.74 + (0.40 × PegsND1); 90% CI = ±20.13
  MF: Ŷ = 64.33 + (0.38 × MF1); 90% CI = ±18.41
  SS: Ŷ = 4.80 + (0.60 × SS1); 90% CI = ±4.38
  VL: Ŷ = 3.38 + (0.64 × VL1); 90% CI = ±3.78

Retest Year 5:
  PegsDom: Ŷ = 77.30 + (0.41 × PegsDom1); 90% CI = ±24.44
  PegsND: Ŷ = 72.71 + (0.43 × PegsND1); 90% CI = ±21.70
  MF: Ŷ = 57.92 + (0.45 × MF1); 90% CI = ±17.93
  SS: Ŷ = 4.83 + (0.59 × SS1); 90% CI = ±4.77
  VL: Ŷ = 3.82 + (0.62 × VL1); 90% CI = ±4.11

Note. In all equations, test measures with a subscript 1 signify the initial test score for the measure. PegsDom = Pegboard with dominant hand; PegsND = Pegboard with nondominant hand; MF = Matching Figures; SS = Symbol Search; VL = Visual Learning.
For example, suppose a 10-year-old boy obtained a Matching
Figures scaled score of 75 on first testing and then one year later
obtained a score of 69. We can determine whether this represents
a significant decline, given the data on practice effects for this test,
by first calculating the predicted retest score using the appropriate
equation in Table 4 (Retest Year 2). This is given by Ŷ = 53.22 + (0.48 × 75) = 89.22. We then calculate the lower bound of the CI for the predicted retest score. This is 89.22 − 18.70 = 70.52. The obtained retest score of 69 is below the lower CI boundary. We can therefore conclude that the boy's performance has declined significantly on this test. It should be noted that the actual score difference between test and retest is only 6 points (75 − 69), while the expected practice effect is more than double this magnitude, namely 89.22 − 75 = 14.22.
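The same check can be packaged as a small function. This is a sketch; the intercept, slope, and CI half-width are the Table 4 values for Matching Figures at Retest Year 2, hard-coded purely for illustration:

    def check_retest(initial, observed, intercept=53.22, slope=0.48, ci=18.70):
        # Compare an observed retest scaled score with its predicted value.
        predicted = intercept + slope * initial        # e.g., 53.22 + 0.48*75 = 89.22
        lower, upper = predicted - ci, predicted + ci  # 90% confidence bounds
        if observed < lower:
            verdict = "significant decline"
        elif observed > upper:
            verdict = "significant improvement"
        else:
            verdict = "within the expected range"
        return predicted, (lower, upper), verdict

    # Worked example from the text: Matching Figures, 75 in Year 1, 69 in Year 2.
    print(check_retest(75, 69))  # predicted 89.22, CI (70.52, 107.92) -> decline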
Discussion
In the present study the test–retest reliabilities are comparable to
those recently reported by Collie et al. (2003) on a set of computer
tests with a sample of 113 adults. They carried out their four tests
during the same day and considered their retest reliabilities acceptable. We would also consider our reliabilities adequate, given that
we used a serial retest period of 12-month intervals over a total
5-year period. It was notable also that the reliability coefficients
increased over the 5-year comparison periods. Participants appeared to become more consistent with practice. This is in line
with the idea that early development is more variable and unpredictable, while later development is more stable.
The raw scores presented in Figure 1 show steady increases over
the 5-year period for all eight measures. These increases represent
a combination of developmental changes and practice effects. It is
notable that the four age subgroups have roughly parallel curves.
The scaled scores presented in Figure 2 show more of a mixed
picture. To check the effectiveness of the normative age-corrected scaled scores in removing developmental changes, we first looked at
the between-subject factor of age subgroup differences. None of
the eight tests showed significant age subgroup differences, while
all but one of the raw scores did. This supports our general
argument that changes in the age-adjusted scores (scaled scores)
reflect mainly practice (and error) effects.
Some measures in Figure 2 continue to show steep linear
changes, others show more modest ones, and a few show no
obvious systematic changes over time. We used repeated measures ANOVAs to test the significance and magnitude of the practice effect
with each of the eight neurocognitive measures. Repeated measures ANOVA is one of the most powerful statistical methods
available (Murphy & Myors, 2004). It is especially powerful
where the aim is to examine practice effects over five test occasions as opposed to only two. Using the threefold criteria derived
from the repeated measures ANOVAs, we concluded that, after
controlling for developmental change, PegsDom and PegsND
show a strong year-on-year practice effect; Matching Figures and
Symbol Search show a moderate practice effect; Visual Learning
shows a small practice effect; and Coding, Finger Windows, and
Digit Span show no demonstrable practice effect. The absence of
a practice effect for Digit Span was also found by Brown, Rourke,
and Cicchetti (1989); Dikmen et al. (1999); and Wilson et al.
(2000), among others.
Two of the tests (Matching Figures and Finger Windows)
showed significant Study Years ⫻ Age Subgroups interactions on
the scaled scores. This indicates that there were differential practice effects for different age subgroups on these measures. Figure 2
shows that on Matching Figures, 8-year-olds had the lowest score
in Year 1 but the highest score in Year 5. Similarly, 9-year-olds
had the second-lowest score in Year 1 and the second-highest
score in Year 5. The two younger groups appeared to be more
responsive to practice (as measured by the scaled scores) than the
two older groups were on this test. On Finger Windows, the two
older groups start lower but end up at similar levels in Year 5.
Further ANOVAs on the Year 1 scaled score data showed that,
while there were no significant age group differences for six out of
the eight tests, there were significant differences on Matching
Figures and Finger Windows. These are the two tests that showed
significant Study Year ⫻ Age Group interactions. It may be the
case, therefore, that the normative data did not adequately correct
for age on these two measures in Year 1 of the study, although they did appear to do so in subsequent years.
The findings of differential practice effects for the eight measures are partially consistent with Lezak et al.’s (2004) proposal
that practice effects are determined by speed of response, unfamiliarity of the response required, and ease of conceptualization of
the solution. In their model, the Pegboard tests would have large
practice effects because of the necessity for speed and the lack of
familiarity with the response required. Symbol Search and Matching Figures tests would show practice effects because of the
necessity for speed in the former and the fact that there is only a
single solution to each problem in the latter. Visual Learning may
have a small practice effect because of the low ceiling of the test
and the fact that neither speed nor unfamiliar responses are required. Lezak and her colleagues suggest that Digit Span would
show practice effects only from the first to the second exposure
because of the low ceiling on the test. This was not supported in
the present study.
Another way of construing the observed differences in practice
effects is in terms of what is known about different types of
learning/memory systems. The distinction between procedural
(skill learning) and declarative (learning of facts) memory, as
outlined by Cohen and Squire (1980) and Squire (1986), is particularly relevant in this respect. As the latter investigators have
noted, procedural memory systems are responsive to pure repetition and often remain intact during advanced degenerative processes like Alzheimer’s disease. Visual motor tasks such as the
Pegboard would seem to rely heavily on procedural memory,
which probably explains why they are especially sensitive to the
effects of repeated practice. By contrast, tasks that are dependent
on declarative memory, such as Visual Learning, owe more to semantic organization and depth of information processing (Craik & Lockhart, 1972) than to practice per se. It is therefore likely
that they will show less of a practice effect, as was found here.
Finally, there are the tests that rely on primary memory/working
memory systems such as Digit Span and Finger Windows (and
possibly Coding). Although working memory is sensitive to developmental change, Baddeley's (1990) concept suggests a relatively fixed (hard-wired) system that is unlikely to be subject to a practice effect. The data from the repeated measures ANOVAs would support the contention that practice effects are influenced by both the type of task and the cognitive systems involved in
their solution. It would appear, however, that primary memory/
working memory systems are impervious to practice effects, at
least in this age span for this group.
Given the consistency of the findings in this study, we calculated linear regression equations that potentially can be used by
clinicians to estimate, from initial test scores, scaled scores on the
five practice-sensitive neurocognitive tests for each of four retest
years and to test their significance. In the hypothetical example
given, the individuals’ decline was evidenced in large measure by
a failure to show the normal practice effect rather than by the
degree of actual decline exhibited on the test. Separating out these
two factors is a common clinical problem that should be aided by
the simple equations presented in this article.
However, it should be noted that the confidence intervals shown
in Table 4 are substantial; for example, approximately 3 standard
deviations (45 points) in the case of the two Pegboard tests and not
much smaller in the case of the other tests. The reason for this is
the fact that the year-on-year retest coefficients are much lower
than the ideal (i.e., in the .40s to .80s rather than in the .80s to
.90s). In consequence, the table can be of assistance only where
substantial changes have taken place in a positive direction or
moderate changes have occurred in a negative direction, such as in
the hypothetical example quoted above.
The table should also be used with caution because

1. the equations generated are generalizable only to comparable populations of healthy Portuguese children retested over similar time periods;

2. some of the data derive from the WISC–III, which limits the potential utility of these equations for evaluators using more current versions of the WISC, such as the WISC–IV; and

3. the U.S. standard scores may not be representative of developmental changes among Portuguese children, whose language, educational, and cultural experiences differ from those of U.S. children.
It has been suggested that data on practice effects in healthy
children may not be relevant to children with pathological problems such as brain damage or childhood schizophrenia. The reason
underlying this argument is that those with pathological problems
may fail to show the same magnitude of practice effects as do
healthy subjects and that use of the normative regression equations
given in Table 4 may, therefore, underestimate the improvement of
pathological subjects over time. There is certainly evidence for the
first part of this argument, namely that those with pathological
problems show less of a practice effect compared with healthy
controls (McCaffrey et al., 2000; Wilson et al., 2000). However,
the counterargument is that the task of the clinician is to estimate
the full extent of deficits shown by such individuals, including
those deficits arising from test factors such as difficulty in understanding instructions, impulsivity, distractibility, and failure to
benefit from the practice effect. It follows that it is only by using
normative data such as those represented in the regression equations here that it will be possible to evaluate the full extent of
deficit or recovery.
A similar argument was previously found to be relevant to
research on cardiac by-pass surgery in adults. Many group studies
in this area had found substantial deficits in neuropsychological
performance immediately postsurgery that seemed to disappear on
follow-up. However, most of these studies did not use a control
group and did not take account of practice effects. Those that did
use a healthy control group found continuing significant deficits at
follow-up. This led two of the current authors (Peter D. Slade and
Brenda D. Townes) to publish a methodological article on research
design in the area in which one of the major suggestions was the
necessity of using a healthy control group to estimate, and control
for, practice effects (Slade, Sanchez, Townes, & Aldea, 2001).
One limitation of the present study was our inability to obtain
information about parental education and occupation, due to concerns about invasion of privacy. Environmental issues are known
to influence test performance, with individuals of higher socioeconomic status (SES) performing better than those of lower SES,
particularly on verbal tests (Sattler, 2001). However, no investigation isolating the effects of SES on the presence, absence, or
magnitude of practice effects was found.
A second possible criticism that might be directed at the present
analysis is that age-adjusted scaled scores do not remove all of the
variance due to developmental factors. In a general sense this is
obviously true. The normative data are unlikely to correct completely for all the developmental effects observed; and in fact we
have shown that this is true for the data of two age subgroups
(8-year-olds and 11-year-olds) on two tests (Matching Figures and
Finger Windows) in the first year of testing. However, age-adjusted scaled scores do seem to have removed the overall general effects of developmental changes.
In conclusion, the dataset examined in this article is unique, with
a large cohort of children being tested yearly for 5 years on a
battery of neuropsychological tests. We believe that the simple
methodology of comparing raw scores with scaled scores is a
useful one for separating out developmental from practice effects
on serial neurocognitive testing of children and adolescents.
References

Adams, W., & Sheslow, D. (1995). Wide Range Assessment of Visual Motor Abilities: Manual. Wilmington, DE: Wide Range.

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.

Baddeley, A. (1990). Human memory: Theory and practice. Hove, United Kingdom: Erlbaum.

Benedict, R. H. B., & Zgaljardic, D. J. (1998). Practice effects during repeated administrations of memory tests with and without alternate forms. Journal of Clinical and Experimental Neuropsychology, 20, 339–352.

Bird, C. M., Papadopoulu, K., Ricciardelli, P., Rossor, M. N., & Cipolotti, L. (2003). Test–retest reliability, practice effects and reliable change indices for the Recognition Memory Test. British Journal of Clinical Psychology, 42, 407–425.

Bird, C. M., Papadopoulu, K., Ricciardelli, P., Rossor, M. N., & Cipolotti, L. (2004). Monitoring cognitive changes: Psychometric properties of six cognitive tests. British Journal of Clinical Psychology, 43, 197–210.

Brown, S. J., Rourke, B. D., & Cicchetti, D. V. (1989). Reliability of tests and measures used in the neuropsychological assessment of children. Clinical Neuropsychologist, 3, 353–368.

Cohen, N. J., & Squire, L. R. (1980, October 10). Preserved learning and retention of pattern-analyzing skill in amnesia: Dissociation of knowing how and knowing that. Science, 210(4466), 207–210.

Collie, A., Maruff, P., Darby, D. G., & McStephen, M. (2003). The effects of practice on the cognitive test performance of neurologically normal individuals assessed at brief test–retest intervals. Journal of the International Neuropsychological Society, 9, 419–428.

Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior, 11, 671–684.

DeRouen, T. A., Leroux, B. G., Martin, M. D., Townes, B. D., Woods, J. S., Leitao, J., et al. (2002). Issues in design and analysis of a randomized clinical trial to assess the safety of dental amalgam restorations in children. Controlled Clinical Trials, 23, 301–320.

DeRouen, T. A., Martin, M. D., Leroux, B. G., Townes, B. D., Woods, J. S., Leitao, J., et al. (2006). Neurobehavioral effects of dental amalgam in children: A randomized clinical trial. Journal of the American Medical Association, 295, 1784–1792.

Dikmen, S. S., Heaton, R. K., Grant, I., & Temkin, N. R. (1999). Test–retest reliability and practice effects of Expanded Halstead–Reitan Neuropsychological Test Battery. Journal of the International Neuropsychological Society, 5, 346–356.

Dodrill, C. B., & Troupin, A. S. (1975). Effects of repeated administrations of a comprehensive neuropsychological battery among chronic epileptics. Journal of Nervous and Mental Diseases, 161, 185–190.

Hammill, D. D., Pearson, N. A., & Wiederholt, J. L. (1997). Comprehensive Test of Nonverbal Intelligence: Manual. Austin, TX: Pro-Ed.

Hinton-Bayre, A. D., Geffen, G. M., Geffen, L. B., McFarland, K. A., & Friss, P. (1999). Concussion in contact sports: Reliable change indices of impairment and recovery. Journal of Clinical and Experimental Neuropsychology, 21, 70–86.

Lassiter, K. S., Harrison, T. K., Matthews, T. D., Bell, N. L., & The Citadel. (2001). The validity of the Comprehensive Test of Nonverbal Intelligence as a measure of fluid intelligence. Assessment, 8, 95–103.

Lezak, M. D. (1995). Neuropsychological assessment. New York: Oxford University Press.

Lezak, M. D., Howieson, D. B., & Loring, D. W. (2004). Neuropsychological assessment (4th ed.). New York: Oxford University Press.

Martins, I. P., Castro-Caldas, A., Townes, B. D., Ferreira, G., Rodrigues, P., Marques, S., et al. (2005). Age and sex differences in neurobehavioral performance: A study of Portuguese elementary school children. International Journal of Neuroscience, 115, 1687–1709.

McCaffrey, R. J., Duff, K., & Westervelt, H. J. (2000). Practitioner's guide to evaluating change with intellectual assessment instruments. New York: Kluwer Academic/Plenum Press.

Murphy, K. R., & Myors, B. (2004). Statistical power analysis. Mahwah, NJ: Erlbaum.

Sattler, J. M. (2001). Assessment of children: Cognitive applications (3rd ed.). La Mesa, CA: Sattler.

Sheslow, D., & Adams, W. (1990). Wide Range Assessment of Memory and Learning: Manual. Wilmington, DE: Jastak Associates.

Slade, P. D., Sanchez, P., Townes, B. D., & Aldea, G. S. (2001). The use of neurocognitive tests in evaluating the outcome of cardiac surgery: Some methodologic considerations. Journal of Cardiothoracic and Vascular Anesthesia, 15, 4–8.

Squire, L. R. (1986, June 27). Mechanisms of memory. Science, 232(4758), 1612–1619.

Temkin, N. R., Heaton, R. K., Grant, I., & Dikmen, S. S. (1999). Detecting significant change in neuropsychological test performance: A comparison of four models. Journal of the International Neuropsychological Society, 5, 357–369.

Wechsler, D. (1991). Wechsler Intelligence Scale for Children—Third Edition: Manual. San Antonio, TX: Psychological Corporation.

Wechsler, D. (1997). Wechsler Adult Intelligence Scale—Third Edition. San Antonio, TX: Psychological Corporation.

Wechsler, D. (2004). Wechsler Intelligence Scale for Children—Fourth Edition: Manual. San Antonio, TX: Psychological Corporation.

Wilson, B. A., Watson, P. C., Baddeley, A. D., Emslie, H., & Evans, J. J. (2000). Improvement or simply practice? The effects of twenty repeated assessments on people with and without brain injury. Journal of the International Neuropsychological Society, 6, 469–479.
Received September 24, 2007
Revision received May 1, 2008
Accepted May 8, 2008