Journal of Clinical and Experimental Neuropsychology

1998, Vol. 20, No. 3, pp. 419-427

Swets & Zeitlinger

Brain is Related to Behavior (p < .05)*
Konstantine K. Zakzanis
York University, Toronto, Canada and Baycrest Centre for Geriatric Care, Toronto, Canada

This article demonstrates that sole reliance on tests of statistical significance in the analysis and interpretation of neuropsychological data that is grounded in quasi-experimentation can systematically confound the
conclusions drawn from our neuropsychological research regarding brain-behavior relations. The conclusion of this article is that we must accompany the statistical significance test with more appropriate statistics, namely point-estimate effect sizes along with interval estimation, and meta-analysis for the analysis
of data from multiple studies. The argument for this conclusion is demonstrated from the re-analysis of
published neuropsychological test findings. It is recommended on the basis of this review that the consumer of neuropsychological reports will be better served if due consideration is given to the magnitude of
effect in brain-behavior statistical analyses.

Neuropsychology is defined by its knowledge
base, which is typically meant to represent the
accumulation of scientific study of the relationship between brain and behavior. Indeed, the
first attempts to localize mental processes to the
brain may be traced back to antiquity when Hippocrates of Croton claimed that the brain was
the organ of intellect and the heart the organ of
the senses (see Heilman & Valenstein, 1993).
However, it was not until Gall laid the foundation of modern neuropsychology that the likes of
Bouillaud (1825), Broca (1865), Wernicke
(1874), Lichtheim (1885), Liepmann (1920),
and later, Geschwind (1965), further advanced
our science by way of careful observation followed by analysis as well as hypothesis-testing
based on the observation of single case studies
(Heilman & Valenstein, 1993). This rich methodological tradition has been elegantly employed
over the past few decades to further define the
relationship between brain and behavior (e.g.,
Benson, 1994; Cummings, 1993; Damasio,
1994; Heilman, 1973; Heilman & Valenstein,
1972, 1979, 1993; Kaplan, 1988; Kertesz, 1994;
Lezak, 1995; Luria, 1973; Mesulam, 1982;
Ogden, 1996; Scoville & Milner, 1957; Shallice,
1988; Snowden, Neary, & Mann, 1996; Stuss &
Benson, 1986).
A vast number of investigations that pass for
research in the field of neuropsychology today,
however, entail the use of quasi-experimental
research designs and statistical tests of significance. "Quasi-experimental" denotes experiments that have treatments, outcome measures,
and experimental units, but do not use random
assignment to create the comparisons from
which treatment-caused change is inferred. Instead, the comparisons depend on nonequivalent
groups that differ from each other in many ways
other than the presence of a treatment whose
effects are being tested. "Most characteristically, when a [neuro]psychologist finds a problem he wishes to investigate, he converts his
intuitions and hypotheses into procedures which
will yield a test of significance, and will characteristically allow the result of the test of significance to bear the essential responsibility for the
conclusions he will draw" (Bakan, 1966, p. 1).
As such, the quasi-experimental approach to
neuropsychological research has enabled researchers to test a hypothesis based on a far
greater number of patients compared to the n of
1 in single case studies. However, by allowing
the level of significance (i.e., p < .05) to bear the
essential responsibility for the conclusions that
we draw in our quasi-experimentation, we fail to
give due consideration to the magnitude of effect, which in essence, is the principal measure
used in the single case study approach (e.g.,
brain ablation paradigms).
Perhaps a major attraction of significance
testing among modern researchers and consumers of research findings is the apparent simplicity and meaningfulness in dichotomizing all statistical results (involving associations) as "significant" (p < .05) or "nonsignificant" (p >
.05). In fact, this common practice has no scientific meaning and was introduced by statisticians
in the early part of the century to confront common decision-making problems (e.g., which of
two fertilizers should be used?), not causal inference in science. Thus, not only does
dichotomizing p values at an arbitrary level degrade the available statistical evidence, but it
readily leads to misinterpretations of findings.
The purpose of this article is to demonstrate
that sole reliance on tests of statistical significance in the analysis and interpretation of neuropsychological data that is grounded in quasi-experimentation can systematically confound the
conclusions drawn from our neuropsychological
research regarding brain-behavior relations.
Some have argued (e.g., Bakan, 1966; Cohen,
1994; Hunter & Schmidt, 1990; Schmidt, 1992,
1996) that null hypothesis statistical significance
testing has not only failed to support the advance of psychology as a science but also has
seriously impeded it. The argument herein is
that tests of statistical significance should be
accompanied by more explicit measures of effect to aid in the interpretation of neuropsychological findings. Before providing a rationale for
such a conclusion, it should first be noted that
the argument to be put forth here is not new and
has been voiced in the field of general psychology. It has been articulated in different ways by
Rozeboom (1960), Bakan (1966), Meehl (1967),
Carver (1978), Guttman (1985), Oakes (1986),
Loftus (1991, 1994), Cohen (1994), and most
recently by Schmidt (1996). This conclusion,
however, has not been voiced in the pages of a
journal with a specialty in clinical and experimental neuropsychology where its implications
for the field of neuropsychology can be considered. Thus, it shall be argued here that we must
accompany the statistical significance test with
more appropriate statistics, namely point-estimate effect sizes along with interval estimation
and meta-analysis for the analysis of data from
multiple studies. The argument for this conclusion will be demonstrated from the re-analysis
of published neuropsychological test findings
rather than reiterating the illogical statistical
intricacies of null hypothesis statistical significance testing which have been eloquently articulated in the works of Schmidt (1996), Cohen
(1994), and others (Bakan, 1966; Carver, 1978;
Guttman, 1985; Rozeboom, 1960).


It is all too often that we find competing hypotheses in neuropsychology, or even empirical
findings that do not correspond to our clinical
experience with patients who suffer from neurologic or psychiatric disease. Take for example
our understanding of schizophrenia. It has long
been debated whether schizophrenia is a disease of
the frontal lobes, the so-called frontal-executive
hypothesis (Zakzanis & Heinrichs, 1997). Proponents of this hypothesis (e.g., Morice &
Delahunty, 1996; Weinberger, 1988) have consistently demonstrated that patients with
schizophrenia perform poorly on neuropsychological tasks that are sensitive to frontal-executive functioning (e.g., Wisconsin Card Sorting
Test [WCST]; Heaton, Chelune, Talley, Kay, &
Curtiss, 1993) by providing data that achieves
statistical significance. Despite illustrating that
the two groups (typically patients with schizophrenia and normal healthy controls) are statistically different from one another, the question
that fails to be addressed when support for a hypothesis is based solely on the interpretation of
a dichotomized p value is this: how much of a difference is there between the two groups being compared or, more accurately, what is the magnitude
of deficit (or effect) in the patient sample, and
how confident can we be in our obtained results?
Further, it is all too often that a significant p
value is taken to imply the presence of a significant deficit. This is a fatal error in data interpretation (Soper, Cicchetti, Satz, Light, &
Orsini, 1988). To illustrate, a re-analysis of
WCST results in patients with schizophrenia
from published studies with competing conclusions will serve the point. Although the issues raised in this paper are illustrated with a
specific example of a cross-sectional comparison of a continuous outcome (i.e., WCST performance) between two groups (i.e., patients
with schizophrenia and healthy normal controls), the reader is reminded that the point being
made applies to all types of experimental, quasi-experimental, and observational designs involving the estimation of effects in populations.
In the first study of frontal-executive functioning, WCST results for 117 patients with
schizophrenia and 68 healthy normal controls are
presented. A statistically significant two-tailed
independent sample t test with a level of significance at p < .05 is reported for the perseverative
error score on the WCST. This is interpreted to
support frontal-executive impairment in patients
with schizophrenia. In a second study, a nonsignificant (p > .05) two-tailed independent sample
t test is reported for the perseverative error score
on the WCST for 10 patients with schizophrenia
and 10 healthy normal control comparisons.
This result is interpreted as failing to support
frontal-executive impairment in schizophrenia,
but is also qualified with "if there were more
patients and controls in the respective groups,
the trend toward a significant difference would
surely have achieved statistical significance."
What have these two studies told us? The first
study supports the frontal-executive hypothesis
in schizophrenia whereas the other does not.
Neither study has indexed the magnitude of
frontal-executive impairment in patients with
schizophrenia. Is it little or none? Is it a defining
characteristic of schizophrenia? Are the normals
more impaired than the patients? The studies
have only demonstrated a statistically significant
difference between two group means on the
WCST perseverative error variable. Granted, a
quick glimpse at the mean performance from
each group will reveal which group obtained a
greater number of perseverative errors, but that
glimpse, and the reported statistical significance
and corresponding p value, will not provide a
quantifiable index of frontal-executive impairment which will make clear whether the frontal-executive system is impaired in schizophrenia
and, therefore, settle any competing hypotheses.
Moreover, the second study demonstrates the
type of convoluted logic that has indeed served
[neuro]psychology wrongly (Bakan, 1966). That
is, if all that is needed to give truth to our hypotheses is to increase the sample size, then
what is the purpose in testing any hypothesis?
Because this is theoretically possible (see Meehl, 1967),
all our hypotheses regarding brain-behavior
relations can be automatically given truth,
provided a large enough sample size and the
test of significance. One has to wonder, if Gall
had had available the test of significance, would
we still be practicing Craniology?
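The dependence of the significance verdict on sample size can be illustrated with a minimal sketch. It assumes two equal-sized groups, a standardized mean difference (difference in means divided by the pooled standard deviation) held fixed at a hypothetical 0.2, and a normal approximation to the t distribution that is rough at small n. The function names and the numbers are illustrative assumptions, not values from any study discussed here.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_two_tailed_p(d: float, n_per_group: int) -> float:
    """Approximate two-tailed p for an independent-samples comparison.

    With equal group sizes the t statistic equals d * sqrt(n / 2).
    The p value uses a normal approximation to the t distribution
    (an assumption made to keep the sketch standard-library only).
    """
    t = d * math.sqrt(n_per_group / 2.0)
    return 2.0 * (1.0 - normal_cdf(abs(t)))


# Hypothetical fixed effect: the groups always differ by 0.2 pooled SDs.
fixed_d = 0.2
for n in (20, 200, 2000):
    p = approx_two_tailed_p(fixed_d, n)
    verdict = "significant" if p < 0.05 else "nonsignificant"
    print(f"n per group = {n:5d}  d = {fixed_d}  p ~ {p:.4f}  ({verdict})")
```

Under these assumptions the same small difference moves from nonsignificant to significant purely because the groups grow, which is the point made in the paragraph above.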
What then is our alternative if the test of significance is really of such limited appropriateness? As Bakan (1966) notes, "at the very least
it would appear that we would be much better
off if we were to attempt to estimate the magnitude of the parameters in the populations, and
recognize that we then need to make other inferences concerning the [neuro]psychological phenomena which may be manifesting themselves
in these magnitudes" (p. 27). The effect size
estimate d measures magnitude. It is a measure
of the degree to which the phenomenon is present in the population or the degree to which the
null hypothesis is false (Cohen, 1988). In mathematical terms, d is simply the difference in patient and control means calibrated in pooled
standard deviation units (i.e., [patient mean -
control mean] / pooled standard deviation). The
effect size d is neither dependent on nor influenced
by sample size. Moreover, effect sizes can demonstrate the overlap in test score dispersion between
two groups by utilizing and inverting Cohen's
(1988) idealized population distributions. That
is, a hypothetical percent overlap is associated
with the varying degrees of effect size. For example, an effect size of 0.0 corresponds to complete overlap: the two groups are completely
indistinguishable from one another on the variable measure. If d = 1.0, the corresponding overlap is 45%: about half of the patient group can
be discriminated from the control group on the
basis of the variable measure. If d = 3.0, the corresponding overlap is less than 5%: the two
groups are almost completely distinguishable from one another with respect to the
variable measure. Thus, if d does equal about
3.0 for the variable measure, the effect size may
serve as a marker on account of approximate
complete discriminability between experimental
(i.e., patient) and control groups (see Zakzanis,
1998b). Briefly, a diagnostic marker should be
capable of discriminating approximately all patients from all normal healthy controls on the
dependent variable of interest. Such discriminability would have to have an associated effect size greater than 3.0 as this size of effect
corresponds to test score dispersion overlap of
less than 5% between patients and normal
healthy controls. For example, Zakzanis (1998b)
showed that delayed recall and
structural imaging of the hippocampus in patients with dementia of the Alzheimer's type
correspond to effect sizes greater than 3.0 and
percentage overlap (OL%) values of less than
5%. This finding was taken to support the notion
that temporal-hippocampal dysfunction is a
marker for dementia of the Alzheimer's type
(Zakzanis, 1998b). In doing so, heuristic benchmark criteria were proposed (i.e., d > 3.0, OL%
< 5) that could help articulate further the
strength of neuroanatomic and neuropsychological evidence in other disorders with prominent
brain pathology. Although such a standard is not
entirely justifiable, it can serve as a heuristic
benchmark against which the magnitude of an effect
size can be articulated when interpreting brain-behavior relations from quasi-experimental research.
When an effect size d is calculated for the
two examples given above, a very different interpretation regarding perseverative error in
schizophrenia emerges from the authors' original data. However, first note that although the
emphasis placed in this paper is on the effect
size d, the definition of an effect measure (i.e.,
d; Cohen, 1988) is problematic because the measure is expressed in terms of standard-deviation
units of neuropsychological test scores (the outcome). Thus, the (causal) effect of the predictor
(being schizophrenic) on neuropsychological
deficit is literally confounded with the variances
of those variables being assessed. This characteristic of all variance-based measures of association, including correlation coefficients, means
that the value of such measures will depend, not
only on the effect of interest, but also on the
methods for selecting a study population. For an
elaboration of these points, see two related papers by Greenland, Maclure, Schlesselman, and
Morgenstern (1991) and Greenland, Schlesselman and Criqui (1986).
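The definition of d given earlier, and the caution that d is tied to the variances of the scores, can be made concrete with a short sketch. The group means, standard deviations, and sample sizes below are hypothetical and chosen only for illustration; the function names are likewise assumptions rather than anything taken from the cited sources.

```python
import math


def pooled_sd(sd_a: float, n_a: int, sd_b: float, n_b: int) -> float:
    """Pooled standard deviation of two independent groups."""
    numerator = (n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2
    return math.sqrt(numerator / (n_a + n_b - 2))


def cohens_d(mean_patient: float, sd_patient: float, n_patient: int,
             mean_control: float, sd_control: float, n_control: int) -> float:
    """(patient mean - control mean) / pooled standard deviation."""
    spread = pooled_sd(sd_patient, n_patient, sd_control, n_control)
    return (mean_patient - mean_control) / spread


# Hypothetical perseverative-error summaries (not taken from any study).
# The raw mean difference is 10 errors in both cases; only the spread differs.
narrow = cohens_d(30.0, 10.0, 50, 20.0, 10.0, 50)
wide = cohens_d(30.0, 20.0, 50, 20.0, 20.0, 50)
print(f"SD = 10 in both groups: d = {narrow:.2f}")
print(f"SD = 20 in both groups: d = {wide:.2f}")
```

The same raw difference yields a d half as large when the sampled population is twice as variable, which is one way to read the point that variance-based measures depend on how the study population was selected.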
Returning to our example, in the first study
the calculated effect size is 0.5, corresponding to
approximately 67% overlap between patient and
control test score distributions. In the second
study, the calculated effect size is 1.1, which corresponds to 41% overlap. On the basis of these
results (and remember, study 1 had a statistically significant result, whereas study 2 did not),
study 2 would be in a better position to argue for
frontal-executive impairment in schizophrenia
than study 1, which uses its statistically significant finding to argue in favor of the hypothesis.
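The overlap percentages quoted for the two studies can be approximated with a short sketch. It assumes that percent overlap is 100 minus Cohen's (1988) U1 nonoverlap statistic for two normal distributions with equal variances; that reading reproduces the 67% and 41% figures above, but the exact procedure used in the text (for example, table lookup with rounding of d) is not stated, so the formula should be treated as an assumption.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def overlap_percent(d: float) -> float:
    """Percent overlap taken as 100 * (1 - U1).

    U1 is Cohen's nonoverlap measure for two equal-variance normal
    distributions: U1 = (2 * P - 1) / P, where P = Phi(|d| / 2).
    """
    p = normal_cdf(abs(d) / 2.0)
    u1 = (2.0 * p - 1.0) / p
    return 100.0 * (1.0 - u1)


# Effect sizes quoted in the text for study 1 and study 2, plus d = 0.
for d in (0.0, 0.5, 1.1):
    print(f"d = {d:.1f}  overlap ~ {overlap_percent(d):.0f}%")
```

Because the formula is continuous in d, values read from a printed table at rounded effect sizes may differ from it by a point or two.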
Indeed, the true significance of the study
findings masquerades behind convoluted statistical tests of significance. The reinterpretation of
the data using effect size analyses provides a
much more valid index of frontal-executive impairment in patients with schizophrenia, one which is
not directly determined by sample size. Moreover, the presentation of effect estimates with
confidence intervals is far more informative
than presenting point estimates alone and would
certainly further strengthen the interpretation of
the findings. That is, in keeping with the two
studies examining the possible effect of schizophrenia on WCST deficits, interval estimation
might show, for example, that the 95%
confidence interval around the larger effect estimate in study 2 includes the 95% confidence
interval around the smaller effect estimate in
study 1. Returning to our example then, the calculated effect sizes coupled with Cohen's (1988)
corresponding inverted overlap percentages
from these two particular studies indicate that
most patients with schizophrenia obtain an average number of perseverative errors on the
WCST which makes them indistinguishable
from healthy normal controls, whereas a minority of patients obtain scores that clearly discriminate them from healthy normal controls. Further, because the mean effect size falls short of the level needed to
completely discriminate all patients with schizophrenia from healthy controls (i.e., d > 3.0), it
would be hard to argue in favor of frontal-executive impairment as being a reliable characteristic of the illness, assuming of course that
perseverative error was sensitive and specific to
frontal-executive function in schizophrenia. Although the two studies mentioned above did not
provide confidence intervals, nor did they provide a means to calculate confidence intervals
based on the published results, a 95% confidence interval drawn around each point-estimate
effect size would further help articulate the accuracy of this conclusion.
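What such intervals might look like can be sketched from the sample sizes and effect sizes quoted above. The sketch assumes a common large-sample approximation to the standard error of d and a normal critical value of 1.96; it is not the author's computation, it is rough for groups as small as 10, and the resulting intervals need not agree with the t tests reported in the original studies, which may have rested on different variance estimates.

```python
import math


def d_confidence_interval(d: float, n1: int, n2: int, z: float = 1.96):
    """Approximate 95% confidence interval for a standardized mean difference.

    Uses the large-sample standard error
    sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))).
    """
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2)))
    return d - z * se, d + z * se


# Effect sizes and group sizes as quoted in the text for the two studies.
study1 = d_confidence_interval(0.5, 117, 68)
study2 = d_confidence_interval(1.1, 10, 10)
print(f"study 1: d = 0.5, 95% CI ~ ({study1[0]:.2f}, {study1[1]:.2f})")
print(f"study 2: d = 1.1, 95% CI ~ ({study2[0]:.2f}, {study2[1]:.2f})")

# Does the wider interval from the small study contain the narrower one?
nested = study2[0] <= study1[0] and study1[1] <= study2[1]
print("study 2 interval contains study 1 interval:", nested)
```

Under these assumptions the small study's interval is several times wider than the large study's, which shows how much less precisely its larger point estimate is pinned down.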
In keeping with the conclusion met from the
re-analysis of the published data, it would be
safe to say that such a conclusion is free of convoluted logic. That is, the conclusion has not
been determined on the basis of the significance of the finding that is directly decided by
sample size alone. Most importantly, it is based
on the magnitude of deficit, rather than a statistically significant difference between two means.
Unfortunately, the effect size statistic does not
allow one to generalize from patient samples to
the population. However, it is the accumulation
and synthesis of effect sizes across independent
studies that can reveal the reliability of a finding
and allow generalizations to be made. This can
be accomplished with meta-analysis.


While the test of significance has been carrying too much of the burden of scientific inference in [neuro]psychology (Bakan, 1966, p. 1),
it has impeded our knowledge base and created the difficult and
onerous task of trying to compare and synthesize
the myriad neuropsychological findings into
systematic and objective profiles of neurocognitive function in patient samples. That is, systematic knowledge about neuropsychological phenomena is commonly dependent on research
conducted within the quasi-experimental framework. In keeping with the WCST example, such
research can often yield an ambiguous mix of
results: decidedly significant, suggestive, convincingly null, and sometimes hopelessly inconclusive (Lipsey & Wilson, 1993). Researchers
are then left with the meticulous task of trying to
weigh the evidence that is reported with each
conclusion, while clinicians pick through the
results with hopes of finding a preponderance of
evidence supporting a consistent neuropsychological profile. Moreover, although our science
purports to serve the acquisition of knowledge
and the pursuit of truth, it is all too easy to fall
into the trap of interpreting data selectively in
the service of an a priori position (Jacobson &
Hollon, 1996). When comparing and/or synthesizing more than one statistically significant
finding, there is no numeric unit which allows
for valid comparison or synthesis of statistically significant results. What is lacking from
the literature that will aid in the valid comparison and synthesis of neuropsychological findings across studies is magnitude or evidential
strength calibrated in numerical bits of data-information that the test of significance does not provide.
An empirically valid index by which we can
weigh and compare the evidential strength of
neuropsychological findings across studies is the
effect-size statistic. One methodological approach that utilizes this statistic is meta-analysis. Meta-analysis has become a statistically sophisticated tool for objective research integration (Cooper & Hedges, 1994; Glass, McGaw, &
Smith, 1981; Hedges & Olkin, 1985; Hunter,
Schmidt, & Jackson, 1982; Rosenthal, 1991;
Schmidt, 1996). In addition to solving problems
with traditional literature reviews, such as the
selective inclusion of studies often based on the
reviewer's own impressionistic view of the
quality of the study; differential subjective
weighting of studies in the interpretation of a set
of findings; misleading interpretations of study
findings; failure to examine characteristics of
the studies as potential explanations for disparate or consistent results across studies; and failure to examine moderating variables in the relationship under examination (Wolf, 1986), meta-analysis provides tools for the analysis of magnitude (i.e., the effect size d). Eligible research
studies comprising a common dependent variable and statistics that can be transformed into
effect sizes are viewed as a population to be systematically sampled and surveyed. Individual
study results (typically means and standard deviations from each group) and moderator variables
(e.g., education, duration of disease, gender,
age) are then abstracted, quantified and coded,
and assembled into a database that is statistically
analyzed (Lipsey & Wilson, 1993).
The main statistic presented in a meta-analysis is the mean effect size when there is little or
no heterogeneity of effect observed across studies. This statistic is meant to reflect the average
individual effect size across the sample of studies included in the synthesis. However, in the
vast majority of meta-analyses in which there is
appreciable heterogeneity of effect observed
across studies, the primary goal should be to
document and explain such heterogeneity in
terms of various characteristics of the study populations or methods along with the mean effect
across studies. That is, moderator variables are
correlated to the effect size in order to parse relationships of subject characteristics that may
influence the magnitude of the size of effect between the groups being compared. Moreover, as
indicated above, the effect size can then be
transformed into an overlap percentage by inverting Cohen's (1988) nonoverlap idealized
distributions, which can then be used as a measure of sensitivity that can indicate neuroanatomic or neuropsychological markers for a
disease (Zakzanis, 1998b).
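How a mean effect size and a check on heterogeneity might be computed is sketched below. The sketch assumes fixed-effect, inverse-variance weighting and Cochran's Q, which is one common choice and not necessarily the weighting used in the meta-analyses cited here. The first two input rows reuse the effect sizes and group sizes quoted earlier for the two WCST studies; the third row is invented for illustration.

```python
import math


def d_variance(d: float, n1: int, n2: int) -> float:
    """Large-sample sampling variance of a standardized mean difference."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2.0 * (n1 + n2))


def pool_effects(studies):
    """Fixed-effect pooling of (d, n_patients, n_controls) tuples.

    Returns the inverse-variance weighted mean d, its standard error,
    and Cochran's Q, a weighted sum of squared deviations from the mean
    that grows as study results become more heterogeneous.
    """
    weights = [1.0 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    total_weight = sum(weights)
    mean_d = sum(w * s[0] for w, s in zip(weights, studies)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    q = sum(w * (s[0] - mean_d) ** 2 for w, s in zip(weights, studies))
    return mean_d, se, q


# (d, n_patients, n_controls); the last row is hypothetical.
example_studies = [(0.5, 117, 68), (1.1, 10, 10), (0.9, 40, 40)]
mean_d, se, q = pool_effects(example_studies)
print(f"weighted mean d ~ {mean_d:.2f} (SE ~ {se:.2f})")
print(f"Cochran's Q ~ {q:.2f} on {len(example_studies) - 1} df")
```

Larger studies receive more weight under this scheme, so the pooled estimate sits closer to the large study's d than a simple average would. A Q well above its degrees of freedom would suggest that moderator variables deserve examination.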
If we return to our example of WCST deficit
in schizophrenia and the frontal-executive hypothesis, it has been shown that the meta-analytic mean effect size across studies (N = 28) for
WCST perseverative errors is 0.87 (Zakzanis &
Heinrichs, 1997). This corresponds to approximately 48% overlap between patients with
schizophrenia and healthy normal controls.
Thus, the meta-analytic effect size supports the
conclusion that frontal-executive impairment is
not a core deficit of schizophrenia. That is,
schizophrenia is neither a necessary nor a sufficient
cause of frontal-executive impairment: not all
cases of such deficit are caused by schizophrenia, and not all patients with schizophrenia
will eventually develop such neuropsychological deficit. Indeed, it is true that to negate
the necessary-cause hypothesis, we could simply
identify some confirmed cases of frontal-executive impairment among nonschizophrenics. Similarly, to negate the sufficient-cause hypothesis,
we could simply identify some chronic patients
with schizophrenia who never develop frontal-executive impairment. If so, why then must we
do a meta-analysis to reach a similar conclusion
regarding schizophrenia as that purported
above? In addition to being a much simpler and
more direct way of reaching such a conclusion, meta-analysis allows the clinician and researcher to
test how robust the evidence is over the accumulation of studies. That is, we can never be sure
that the conclusions drawn from a single study
are not attributable to chance or error, even with
a test of significance (see Schmidt, 1996). However, if the results from all studies are considered together, as in meta-analysis, the clinician
and researcher are in a better position to evaluate
and articulate the nature and pattern of brain-behavior relations knowing that a given finding
is a reliable, or unreliable, finding.
For example, the conclusion of insufficient
frontal-executive deficit in schizophrenia can be
taken with considerable evidence of reliability.
That is, Zakzanis (1998c) demonstrated that the
mean meta-analytic effect size from four independent reviewers of WCST performance in patients with schizophrenia corresponds to an
intraclass reliability correlation for d of 0.98.
This is excellent reliability. The four independent reviewers met with the same conclusion
based on their meta-analytic results. The unity in
the interpretation of their results is unlike the
conflicting hypotheses and interpretations regarding frontal-executive impairment in schizophrenia that can be found in single quasi-experimental studies or in typical narrative reviews
where the strength of a hypothesis is based on a
count of significant and nonsignificant findings
(e.g., Taylor, 1995; Weinberger, 1988). Indeed,
meta-analysis has been employed by several
investigators to review and resolve controversies
regarding brain-behavior relationships (e.g.,
Binder, Rohling, & Larrabee, 1997; Christensen,
Griffiths, Mackinnon, & Jacomb, 1997; Heinrichs & Zakzanis, 1998; Kinderman & Brown,
1997; Meiran & Jelicic, 1995; Thornton & Raz,
1997; Zakzanis, 1998a, 1998b, 1998c; Zakzanis,
in press-a; Zakzanis, Leach, & Kaplan, in press).
Thus, in the analysis of neurocognitive data
from multiple studies, meta-analysis is a reliable
and valid methodological approach to research.


It has been shown that the consumer of neuropsychological reports will be better served if due
consideration is given to the magnitude of effect
along with the results of tests of statistical significance in quasi-experimental studies of brain
and behavior. It should be evident in keeping
with the example that we must adopt more appropriate statistics, namely point-estimate effect sizes along with interval estimation, and
meta-analysis for the analysis of data from multiple studies. Effect sizes and 95% confidence
intervals should be reported along with traditional descriptive statistics such as means, standard deviations, minimum-maximum values,
and exact p values when quantitative data is
being reported for two groups. As for the test of
statistical significance, it is ideal for testing ordinal claims relating the order of conditions (see
Frick, 1996), but is insufficient when clinical
significance is important (also see Bieliauskas,
Fastenau, Lacey & Roper, 1997). As such, when
p values are desired, reporting an exact p value (e.g., p
= 0.06) and deleting from our written reports
any reference to the results being "significant"
or "nonsignificant" would indeed serve the
reader of neuropsychological reports well. It
would appear, therefore, that when designing a
study to test a particular brain-behavior hypothesis in neuropsychology that incorporates a
quasi-experimental research design, the magnitude of effect and interval estimation should be
taken into consideration and reported alongside
our conclusions, for example: The brain is related to behavior ( d = ).

