NHMRC Levels and Grades (2009) PDF
December 2009
NHMRC levels of evidence and grades for recommendations
for developers of guidelines
Introduction
In 1999 the National Health and Medical Research Council (NHMRC) in Australia released a suite
of handbooks to support organisations involved in the development of evidence-based clinical
practice guidelines (www.nhmrc.gov.au/publications/synopses/cp65syn.htm).
Reflecting the general impetus of the previous decade, these handbooks focused predominantly on
assessing the clinical evidence for interventions. As a consequence, the handbooks present ‘levels
of evidence’ appropriate mainly for intervention studies. However, feedback from guideline
developers received by the NHMRC indicated that the levels of evidence used by the NHMRC
for intervention studies have been found to be restrictive. This was particularly so where the areas
of study do not lend themselves to research designs appropriate to intervention studies (i.e.
randomised controlled trials).
This paper presents a new approach to grading evidence recommendations, which should be
relevant to any clinical guideline (not just those dealing with interventions).
This process of developing and grading evidence recommendations has received robust scrutiny
and refinement through two public consultation phases, and formal pilot-testing has been
conducted across a range of guideline development projects.
The Pilot Program on ‘NHMRC additional levels of evidence and grades for recommendations for
developers of guidelines’ was initially released for public consultation from 2005 until mid-2006,
with feedback on usability and applicability sought until 30 June 2007. A revised version was then
released for a second stage of public consultation over the period January 2008 to February 2009.
Several guideline development teams, with guidance from a NHMRC Guideline Assessment
Register (GAR) consultant, tested the revised grading approach in guidelines that were developed
during the pilot period. The website feedback and the practical experience of guideline developers
support the clinical utility and academic rigour of the new NHMRC hierarchy of levels of evidence
and their role in the formulation of the new grades of recommendation.
Further peer review was solicited on one aspect of the grading process (specifically revising the
levels of evidence hierarchy) through submission of a manuscript to BMC Medical Research
Methodology, which was published in June 2009. It is anticipated that a subsequent manuscript
outlining the process for grading recommendations will be submitted to a peer reviewed journal
later in 2009.
Levels of evidence
Guidelines can have different purposes, dealing with clinical questions such as intervention,
diagnosis, prognosis, aetiology and screening. To address these clinical questions adequately,
guideline developers need to include different research designs. This consequently requires different
evidence hierarchies that recognise the importance of research designs relevant to the purpose of the
guideline. A new evidence hierarchy has been developed by the NHMRC GAR consultants. This
hierarchy assigns levels of evidence according to the type of research question, recognising the
importance of appropriate research design to that question. As well as the current NHMRC levels of
evidence for interventions, new levels have been developed for studies relevant for guidelines on
diagnosis, prognosis, aetiology and screening.
Grades of recommendations
However, ascribing a level of evidence to a study, that reflects the risk of bias in its design, is only
one small part of assessing evidence for a guideline recommendation. Consideration also needs to be
given to: the quality of the study and the likelihood that the results have been affected by bias during
its conduct; the consistency of its findings to those from other studies; the clinical impact of its
results; the generalisability of the results to the population for whom the guideline is intended; and
the applicability of the results to the Australian (and/or local) health care setting.
To further assist guideline developers to make judgments on the basis of the body of evidence
relevant to a research question, a grading system for recommendations has been developed (see
Part A). This takes the form of an evidence matrix, which lists the evidence components that should
be considered when judging the body of evidence. The grade of a recommendation is based on an
overall assessment of the rating of individual components in the evidence matrix.
Authors
This work was undertaken by the following NHMRC GAR consultants:
NHMRC Management:
This project was managed by the NHMRC Evidence Translation Section with support from the
NHMRC National Institute of Clinical Studies.
To assist guideline developers, the NHMRC GAR consultants have developed an approach for
assessing the body of evidence and formulating recommendations. This will ensure that while
guidelines may differ in their purpose and formulation, their developmental processes are
consistent, and their recommendations are formulated in a consistent manner. Part A describes
how to grade the ‘body of evidence’ for each guideline recommendation. The body of evidence
considers the evidence dimensions of all the studies relevant to that recommendation. Part B
gives further detail on how to appraise individual studies contributing to the body of evidence.
Consequently, the NHMRC Evidence Statement Form is intended to be used for each clinical
question addressed in a guideline. Before completing the form, each included study should be
critically appraised and the relevant data extracted and summarised as shown in the NHMRC
standards and procedures for externally developed guidelines (NHMRC 2007) and with
reference to Part B below. This information assists in the formulation of the recommendation,
and in determining the overall grade of the ‘body of evidence’ that supports that
recommendation.
The NHMRC Evidence Statement Form sets out the basis for rating five key components of the
‘body of evidence’ for each recommendation. These components are:
1. The evidence base, in terms of the number of studies, level of evidence and quality of studies
(risk of bias).
2. The consistency of the study results.
3. The potential clinical impact of the proposed recommendation.
4. The generalisability of the body of evidence to the target population for the guideline.
5. The applicability of the body of evidence to the Australian healthcare context.
The first two components give a picture of the internal validity of the study data in support of
efficacy (for an intervention), accuracy (for a diagnostic test), or strength of association (for a
prognosis or aetiological question). The third component addresses the likely clinical impact of
the proposed recommendation. The last two components consider external factors that may
influence the effectiveness of the proposed recommendation in practice, in terms of the
generalisability of study results to the intended target population for the Guideline and setting
of the proposed recommendation, and applicability to the Australian (or other local) health care
system.
1. Evidence base
The evidence base is assessed in terms of the quantity, level and quality (risk of bias) of the
included studies:
• Quantity of evidence reflects the number of the studies that have been included as the
evidence base for each guideline (and listed in the evidence summary table or text). The
quantity assessment also takes into account the number of patients in relation to the
frequency of the outcomes measured (ie the statistical power of the studies). Small,
underpowered studies that are otherwise sound may be included in the evidence base if their
findings are generally similar — but at least some of the studies cited as evidence must be
large enough to detect the size and direction of any effect. Alternatively, the results of the
studies could be considered in a meta-analysis to increase the power and statistical precision
of the effect estimate.
• Level of evidence reflects the best study types for the specific type of question (see Part B,
Table 3). The most appropriate study design to answer each type of clinical question
(intervention, diagnostic accuracy, aetiology or prognosis) is level II evidence. Level I
studies are systematic reviews of the appropriate level II studies in each case. Study designs
that are progressively less robust for answering each type of question are shown at levels III
and IV. Systematic reviews of level III and IV studies are ascribed the same level of
evidence as the studies included in the review to address each outcome. For example, a
systematic review of cohort studies and case series for an intervention question would be
given a Level III-2 ranking in the hierarchy, even if the quality of the systematic review was
exceptional. The levels of evidence hierarchy is specifically concerned with the risk of bias
in the presented results that is related to study design (see Explanatory note 4 to Table 3),
whereas the quality of the evidence is assessed separately.
• Quality of evidence reflects how well the studies were conducted in order to eliminate bias,
including how the subjects were selected, allocated to groups, managed and followed up and
how the study outcomes were measured (see Part B, Dimensions of evidence, and Table 4
for further information).
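The quantity bullet above notes that combining small studies in a meta-analysis can increase power and statistical precision. The following is a minimal sketch of that idea, assuming a fixed-effect inverse-variance model (one common pooling method, not one prescribed by this document). The function name and the study values are hypothetical and purely illustrative.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance fixed-effect pooling of study effect estimates.

    estimates: per-study effects on an additive scale (e.g. log relative risks).
    std_errors: the matching standard errors.
    Returns (pooled_estimate, pooled_standard_error).
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

# Hypothetical log relative risks from three small studies (illustrative only).
log_rr = [-0.30, -0.10, -0.25]
se = [0.20, 0.25, 0.30]
pooled, pooled_se = pool_fixed_effect(log_rr, se)
```

In this sketch the pooled standard error is smaller than that of any single study, which is the gain in precision the text refers to. Pooling is only meaningful when the studies are similar enough to combine.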
2. Consistency
The consistency component of the ‘body of evidence’ assesses whether the findings are
consistent across the included studies (including across a range of study populations and study
designs). It is important to determine whether study results are consistent to ensure that the
results are likely to be replicable or only likely to occur under certain conditions. Ideally, for a
meta-analysis of randomised studies, there should be a statistical analysis of heterogeneity
showing little statistical difference (consistent or homogeneous) between the studies. However,
given that statistical tests for heterogeneity are underpowered, presentation of an I² statistic², as
well as an appraisal of the likely reasons for the differences in results across studies, would be
useful. Heterogeneity in the results of studies may be due to differences in the study design, the
quality of the studies (risk of bias), the population studied, the definition of the outcome being
assessed, as well as many other factors. Non-randomised studies may have larger estimates of
effect as a result of the greater bias in such studies; however, such studies may also be important
for confirming or questioning results from randomised trials in larger populations that may be
more representative of the target population for the proposed guideline.
¹ Adapted from the Scottish Intercollegiate Guidelines Network (SIGN) guide to using their Considered Judgement Form (available
from http://www.sign.ac.uk/guidelines/fulltext/50/annexd.html Accessed 19.10.07)
² Whereas most statistical tests of heterogeneity (eg Cochran’s Q) assess whether heterogeneity exists between studies, I² is a
statistic that quantifies how much heterogeneity exists between the studies (see Higgins & Thompson, 2002)
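As an illustration of the heterogeneity statistics mentioned under the consistency component, the sketch below computes Cochran's Q and the I² statistic from study estimates and their standard errors, using the usual definition I² = max(0, (Q − df)/Q) × 100%. The function name and the input values are hypothetical. It assumes effects are on an additive scale (for example log odds ratios).

```python
def cochran_q_and_i2(estimates, std_errors):
    """Return (Q, I2 in percent) for a set of study estimates."""
    if len(estimates) != len(std_errors) or len(estimates) < 2:
        raise ValueError("need at least two studies with standard errors")
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I2: share of total variation attributable to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2

# Hypothetical, deliberately inconsistent studies (illustrative only).
q_high, i2_high = cochran_q_and_i2([0.1, 0.5, 0.9], [0.1, 0.1, 0.1])
# Hypothetical studies with identical estimates.
q_none, i2_none = cochran_q_and_i2([0.5, 0.5, 0.5], [0.1, 0.2, 0.3])
```

A high I² signals that the reasons for the differences between studies need to be appraised, as described above. It does not by itself determine the consistency rating.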
3. Clinical impact
Clinical impact is a measure of the potential benefit from application of the guideline to a
population. Factors that need to be taken into account when estimating clinical impact include:
• the relevance of the evidence to the clinical question, the statistical precision and size of the
effect (including clinical importance) of the results in the evidence-base, and the relevance of
the effect to the patients, compared with other management options (or none)
• the duration of therapy required to achieve the effect, and
• the balance of risks and benefits (taking into account the size of the patient population
concerned).
4. Generalisability
This component covers how well the subjects and settings of the included studies will match
those of the Guideline recommendations, specifically the patient population being targeted by
the Guideline and the clinical setting where the recommendation will be implemented.
Population issues that might influence the relative importance of recommendations include
gender, age or ethnicity, baseline risk, or the level of care (eg community or hospital). This is
particularly important for evidence from randomised controlled trials (RCTs), as the setting and
entry requirements for such trials are generally narrowly based and therefore may not be
representative of all the patients to whom the recommendation may be applied in practice.
Confirmation of RCT evidence by broader-based population studies may be helpful in this
regard (see ‘2. Consistency’). Basically, an assessment of generalisability is about determining
whether the available body of evidence is answering the clinical question that was asked.
In the case of studies of diagnostic accuracy, a number of additional criteria also need to be taken
into account, including the stage of the disease (eg early versus advanced), the duration of illness
and the prevalence of the disease in the study population as compared to the target population for
the guideline.
5. Applicability
This component addresses whether the evidence base is relevant to the Australian health care
system generally, or to more local settings for specific recommendations (such as rural areas or
cities).
Factors that may reduce the direct application of study findings to the Australian or more local
settings include organisational factors (eg availability of trained staff, clinic time, specialised
equipment, tests or other resources) and cultural factors (eg attitudes to health issues, including
those that may affect compliance with the recommendation).
The components described above should be rated according to the matrix shown in Table 1.
Enter the results into the NHMRC Evidence Statement Form (Attachment 1) along with any
further notes relevant to the discussions for each component.
Evidence base¹
A one or more level I studies with a low risk of bias or several level II studies with a low risk of bias
B one or two level II studies with a low risk of bias or a SR/several level III studies with a low risk of bias
C one or two level III studies with a low risk of bias, or level I or II studies with a moderate risk of bias
D level IV studies, or level I to III studies/SRs with a high risk of bias
Consistency²
A all studies consistent
B most studies consistent and inconsistency may be explained
C some inconsistency reflecting genuine uncertainty around clinical question
D evidence is inconsistent
Generalisability
A population/s studied in body of evidence are the same as the target population for the guideline
B population/s studied in the body of evidence are similar to the target population for the guideline
C population/s studied in body of evidence differ to target population for guideline but it is clinically sensible to apply this evidence to target population³
D population/s studied in body of evidence differ to target population and hard to judge whether it is sensible to generalise to target population
The Evidence Statement Form also provides space to enter any other relevant factors that were
taken into account by the guideline developers when judging the body of evidence and developing
the wording of the recommendation.
NHMRC overall grades of recommendation are intended to indicate the strength of the body of
evidence underpinning the recommendation. This should assist users of the clinical practice
guidelines to make appropriate and informed clinical judgments. Grade A or B recommendations
are generally based on a body of evidence that can be trusted to guide clinical practice, whereas
Grade C or D recommendations must be applied carefully to individual clinical and
organisational circumstances and should be interpreted with care (see Table 2).
Grade of Description
recommendation
1. Evidence base (number of studies, level of evidence and risk of bias in the included studies)
A One or more level I studies with a low risk of bias or several level II studies with a low risk of bias
B One or two Level II studies with a low risk of bias or SR/several Level III studies with a low risk of bias
C One or two Level III studies with a low risk of bias or Level I or II studies with a moderate risk of bias
D Level IV studies or Level I to III studies/SRs with a high risk of bias
2. Consistency (if only one study was available, rank this component as ‘not applicable’)
A All studies consistent
B Most studies consistent and inconsistency can be explained
C Some inconsistency, reflecting genuine uncertainty around question
D Evidence is inconsistent
NA Not applicable (one study only)
3. Clinical impact (Indicate in the space below if the study results varied according to some unknown factor (not simply study quality or sample size) and thus the clinical impact of the intervention could not be determined)
A Very large
B Moderate
C Slight
D Restricted
4. Generalisability (How well does the body of evidence match the population and clinical settings being targeted by the Guideline?)
A Evidence directly generalisable to target population
B Evidence directly generalisable to target population with some caveats
C Evidence not directly generalisable to the target population but could be sensibly applied
D Evidence not directly generalisable to target population and hard to judge whether it is sensible to apply
5. Applicability (Is the body of evidence relevant to the Australian healthcare context in terms of health services/delivery of care and cultural factors?)
A Evidence directly applicable to Australian healthcare context
B Evidence applicable to Australian healthcare context with few caveats
C Evidence probably applicable to Australian healthcare context with some caveats
D Evidence not applicable to Australian healthcare context
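The component ratings above can be recorded in a simple structure. The sketch below is one hypothetical way to hold and validate them: each component takes a rating of A to D, and only the consistency component may be 'NA' (one study only). The class and field names are assumptions for illustration. The sketch deliberately does not compute an overall grade, because this document describes that as an overall assessment of the individual component ratings rather than a formula.

```python
from dataclasses import dataclass

COMPONENTS = ("evidence_base", "consistency", "clinical_impact",
              "generalisability", "applicability")
RATINGS = ("A", "B", "C", "D")

@dataclass(frozen=True)
class EvidenceStatement:
    """Hypothetical record of the five component ratings for one recommendation."""
    evidence_base: str
    consistency: str
    clinical_impact: str
    generalisability: str
    applicability: str

    def __post_init__(self):
        for name in COMPONENTS:
            rating = getattr(self, name)
            # 'NA' is accepted only for consistency (single-study evidence base).
            allowed = RATINGS + ("NA",) if name == "consistency" else RATINGS
            if rating not in allowed:
                raise ValueError(f"{name}: rating {rating!r} not in {allowed}")

# Illustrative ratings for a recommendation supported by a single study.
statement = EvidenceStatement("B", "NA", "B", "A", "A")
```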
Other factors (Indicate here any other factors that you took into account when assessing the evidence base (for example, issues that might cause the group to downgrade or upgrade the recommendation))
2. Consistency
3. Clinical impact
4. Generalisability
5. Applicability
Indicate any dissenting opinions
IMPLEMENTATION OF RECOMMENDATION
Please indicate yes or no to the following questions. Where the answer is yes please provide explanatory information about this. This information will be used to develop the implementation plan for the
guidelines.
Will this recommendation result in changes in usual care?
YES
NO
Are there any resource implications associated with implementing this recommendation?
YES
NO
Will the implementation of this recommendation require changes in the way care is currently organised?
YES
NO
Are the guideline development group aware of any barriers to the implementation of this recommendation?
YES
NO
1. Strength of evidence
a. Level of evidence: Each study design is assessed according to its place in the research
hierarchy. The hierarchy reflects the potential of each study or systematic review included in
the systematic review(s) underpinning the Guidelines to adequately answer a particular
research question, based on the probability that its design has minimised the impact of bias
on the results. See pages 6–10 of How to use the evidence: assessment and application of
scientific evidence (NHMRC 2000b).
The original NHMRC levels of evidence for intervention studies (NHMRC 2000b),
together with the new levels of evidence for questions on diagnosis, prognosis, aetiology
and screening are shown in the evidence hierarchy in Table 3. A glossary describing
each of the study designs is provided in Attachment 2.
b. Quality of evidence (risk of bias): The methodological quality of each included study
is critically appraised. Each study is assessed according to the likelihood that bias,
confounding and/or chance may have influenced its results. The NHMRC toolkit
How to review the evidence: systematic identification and review of the scientific
literature (NHMRC 2000a) lists examples of ways that methodological quality can
be assessed. In cases where other critical appraisal approaches may be required, there
are a number of alternatives. The NHMRC/NICS can advise on the choice of an
alternative to supplement and/or replace those in the NHMRC handbook (see Table
4).
c. Statistical precision: The primary outcomes of each included study are evaluated to
determine whether the effect is real, rather than due to chance (using a level of significance
expressed as a P-value and/or a confidence interval). See page 17 of How to use the
evidence: assessment and application of scientific evidence (NHMRC 2000b).
2. Size of effect
This dimension is useful for assessing the clinical importance of the findings of each study (and
hence addresses the clinical impact component of the body of evidence matrix in Part A). This
is a different concept to statistical precision and specifically refers to the measure of effect or
point estimate provided in the results of each study (eg mean difference, relative risk, odds
ratio, hazard ratio, sensitivity, specificity). In the case of a meta-analysis it is the pooled
measure of effect from the studies included in the systematic review (eg weighted mean
difference, pooled relative risk). These point estimates are calculated in comparison to either
doing nothing or versus an active control.
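To make the distinction between the point estimate (size of effect) and its statistical precision concrete, the sketch below computes a relative risk and an odds ratio from a 2×2 table, each with a 95% confidence interval obtained by the standard log-scale (Wald) approximation. The function names and the counts are hypothetical. The approximation assumes no zero cells.

```python
import math

def relative_risk(a, b, c, d, z=1.96):
    """a, b: events / non-events in the intervention group;
    c, d: events / non-events in the control group.
    Returns (RR, lower, upper)."""
    rr = (a / (a + b)) / (c / (c + d))
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return rr, math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se)

def odds_ratio(a, b, c, d, z=1.96):
    """Same cell layout as relative_risk. Returns (OR, lower, upper)."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_, math.exp(math.log(or_) - z * se), math.exp(math.log(or_) + z * se)

# Hypothetical trial: 15/100 events with the intervention, 30/100 with control.
rr, rr_lo, rr_hi = relative_risk(15, 85, 30, 70)
or_, or_lo, or_hi = odds_ratio(15, 85, 30, 70)
```

Here the relative risk is the size of effect, while the width of the interval reflects the statistical precision discussed under ‘Strength of evidence’.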
3. Relevance of evidence
This dimension deals with the translation of research evidence into clinical practice and is
potentially the most subjective of the evidence assessments. There are two key questions.
a. Appropriateness of the outcomes: Are the outcomes measured in the study relevant to
patients? This question focuses on the patient-centredness of the study. See pages 23–27 of
How to use the evidence: assessment and application of scientific evidence (NHMRC
2000b).
b. Relevance of study question: How closely do the elements of the research question (‘PICO’3)
match those of the clinical question being considered in the guideline? This is important in
determining the extent to which the study results are relevant (generalisable) for the
population who will be the recipients of the clinical guideline.
The results of these assessments for each included study should be entered into a data extraction
form described in the NHMRC standards and procedures for externally developed guidelines
(NHMRC 2007). Once each included study is assessed according to these dimensions of evidence,
a summary can be made that is relevant to the whole body of evidence, which can then be graded
as described in Part A of this document. The data extraction process provides the evidence base on
which the systematic review, and subsequent guideline recommendations are built.
Level II
Intervention: A randomised controlled trial
Diagnostic accuracy: A study of test accuracy with: an independent, blinded comparison with a valid reference standard,⁵ among consecutive persons with a defined clinical presentation⁶
Prognosis: A prospective cohort study⁷
Aetiology: A prospective cohort study
Screening: A randomised controlled trial
Level III-1
Intervention: A pseudorandomised controlled trial (i.e. alternate allocation or some other method)
Diagnostic accuracy: A study of test accuracy with: an independent, blinded comparison with a valid reference standard,⁵ among non-consecutive persons with a defined clinical presentation⁶
Prognosis: All or none⁸
Aetiology: All or none⁸
Screening: A pseudorandomised controlled trial (i.e. alternate allocation or some other method)
Level III-2
Intervention: A comparative study with concurrent controls:
  ▪ Non-randomised, experimental trial⁹
  ▪ Cohort study
  ▪ Case-control study
  ▪ Interrupted time series with a control group
Diagnostic accuracy: A comparison with reference standard that does not meet the criteria required for Level II and III-1 evidence
Prognosis: Analysis of prognostic factors amongst persons in a single arm of a randomised controlled trial
Aetiology: A retrospective cohort study
Screening: A comparative study with concurrent controls:
  ▪ Non-randomised, experimental trial
  ▪ Cohort study
  ▪ Case-control study
Level III-3
Intervention: A comparative study without concurrent controls:
  ▪ Historical control study
  ▪ Two or more single arm study¹⁰
  ▪ Interrupted time series without a parallel control group
Diagnostic accuracy: Diagnostic case-control study⁶
Prognosis: A retrospective cohort study
Aetiology: A case-control study
Screening: A comparative study without concurrent controls:
  ▪ Historical control study
  ▪ Two or more single arm study
Level IV
Intervention: Case series with either post-test or pre-test/post-test outcomes
Diagnostic accuracy: Study of diagnostic yield (no reference standard)¹¹
Prognosis: Case series, or cohort study of persons at different stages of disease
Aetiology: A cross-sectional study or case series
Screening: Case series
Intervention: Page 45
Diagnosis: Page 62; QUADAS (Whiting et al., 2003)
Prognosis: Page 81; GATE checklist for prognostic studies (NZGG, 2001)
Aetiology: Page 73
Screening: Page 45; UK National Screening Committee Guidelines (2000)
Systematic Review: Page 16²; SIGN checklist (SIGN, 2006), CASP checklist (CASP, 2006)
1 Included in How to review the evidence: systematic identification and review of the scientific literature (NHMRC
2000a)
2 Included in How to use the evidence: assessment and application of scientific evidence (NHMRC 2000b)
Conclusion
This paper outlines an approach to developing guideline recommendations that was piloted and
refined over four years by NHMRC GAR consultants. This approach reflects the concerted
input of experience in assisting a range of guideline developers to develop guidelines for a range
of conditions and purposes. It also incorporates feedback from the guideline developers
themselves to improve the utility of the process and the clarity of the instructions and
suggestions.
There are some types of evidence that have not been captured in this new grading approach,
specifically the appraisal of qualitative studies and cost-effectiveness analyses. The empirical
and theoretical basis for appraising and synthesising these types of evidence in a standard
manner is still uncertain and undergoing refinement. It is expected that, with developments
in these fields, subsequent revision of the presented approach to developing guideline
recommendations may occur.
This new methodological approach provides a way forward for guideline developers to appraise,
classify and grade evidence relevant to the purpose of a guideline and develop recommendations
that are evidence-based, action-oriented and implementable.
Note: This is a specialised glossary that relates specifically to the study designs mentioned in the
NHMRC Evidence Hierarchy. Glossaries of terms that relate to wider epidemiological concepts
and evidence based medicine are also available – see http://www.inahta.org/HTA/Glossary/;
http://www.ebmny.org/glossary.html
All or none – all or none of a series of people (case series) with the risk factor(s) experience the
outcome. The data should relate to an unselected or representative case series which provides an
unbiased representation of the prognostic effect. For example, no smallpox develops in the
absence of the specific virus; and clear proof of the causal link has come from the disappearance
of smallpox after large scale vaccination. This is a rare situation.
A study of test accuracy with: an independent, blinded comparison with a valid reference
standard, among consecutive patients with a defined clinical presentation – a cross-sectional
study where a consecutive group of people from an appropriate (relevant) population receive the
test under study (index test) and the reference standard test. The index test result is not
incorporated in (is independent of) the reference test result/final diagnosis. The assessor
determining the results of the index test is blinded to the results of the reference standard test and
vice versa.
A study of test accuracy with: an independent, blinded comparison with a valid reference
standard, among non-consecutive patients with a defined clinical presentation – a cross-
sectional study where a non-consecutive group of people from an appropriate (relevant)
population receive the test under study (index test) and the reference standard test. The index test
result is not incorporated in (is independent of) the reference test result/final diagnosis. The
assessor determining the results of the index test is blinded to the results of the reference standard
test and vice versa.
Adjusted indirect comparisons – an adjusted indirect comparison compares single arms from
two or more interventions from two or more separate studies via the use of a common reference ie
A versus B and B versus C allows a comparison of A versus C when there is statistical adjustment
for B. This is most commonly done in meta-analyses (see Bucher et al 1997). Such an indirect
comparison should only be attempted when the study populations, common comparator/reference,
and settings are very similar in the two studies (Song et al 2000).
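A minimal sketch of an adjusted indirect comparison in the style of Bucher et al 1997 follows, using the orientation given above (A versus B and B versus C). On the log scale the two effects are added to give A versus C, and their variances are summed. The function name and the input values are hypothetical. The sketch assumes the two studies are independent and similar enough for the comparison to be sensible.

```python
import math

def indirect_comparison(log_ab, se_ab, log_bc, se_bc, z=1.96):
    """Indirect A-versus-C effect from A-versus-B and B-versus-C effects.

    Inputs are log effect measures (e.g. log odds ratios) and their
    standard errors. Returns (log_ac, se_ac, (lower, upper)) on the log scale.
    """
    log_ac = log_ab + log_bc
    # Variances add, so the indirect estimate is less precise than either input.
    se_ac = math.sqrt(se_ab ** 2 + se_bc ** 2)
    return log_ac, se_ac, (log_ac - z * se_ac, log_ac + z * se_ac)

# Hypothetical odds ratios: A vs B = 0.8, B vs C = 0.5 (illustrative only).
log_ac, se_ac, (ci_lo, ci_hi) = indirect_comparison(
    math.log(0.8), 0.15, math.log(0.5), 0.20)
or_ac = math.exp(log_ac)
```

If the available studies were instead A versus B and C versus B, the second log effect would be subtracted rather than added.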
Case-control study – people with the outcome or disease (cases) and an appropriate group of
controls without the outcome or disease (controls) are selected and information obtained about
their previous exposure/non-exposure to the intervention or factor under study.
Case series – a single group of people exposed to the intervention (factor under study).
Post-test – only outcomes after the intervention (factor under study) are recorded in the
series of people, so no comparisons can be made.
Pre-test/post-test – measures on an outcome are taken before and after the intervention is
introduced to a series of people and are then compared (also known as a ‘before- and-after
study’).
Retrospective cohort study – where the cohorts (groups of people exposed and not
exposed) are defined at a point of time in the past and information collected on
subsequent outcomes, eg. the use of medical records to identify a group of women using
oral contraceptives five years ago, and a group of women not using oral contraceptives,
and then contacting these women or identifying in subsequent medical records the
development of deep vein thrombosis.
Cross-sectional study – a group of people are assessed at a particular point (or cross-section) in
time and the data collected on outcomes relate to that point in time ie proportion of people with
asthma in October 2004. This type of study is useful for hypothesis-generation, to identify whether
a risk factor is associated with a certain type of outcome, but more often than not (except when the
exposure and outcome are stable eg. genetic mutation and certain clinical symptoms) the causal
link cannot be proven unless a time dimension is included.
Diagnostic (test) accuracy – in diagnostic accuracy studies, the outcomes from one or more
diagnostic tests under evaluation (the index test/s) are compared with outcomes from a reference
standard test. These outcomes are measured in individuals who are suspected of having the
condition of interest. The term accuracy refers to the amount of agreement between the index test
and the reference standard test in terms of outcome measurement. Diagnostic accuracy can be
expressed in many ways, including sensitivity and specificity, likelihood ratios, diagnostic odds
ratio, and the area under a receiver operator characteristic (ROC) curve (Bossuyt et al 2003).
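Several of the accuracy measures listed above can be derived from a single 2×2 table of index test results against the reference standard. The sketch below shows the standard definitions. The function name and the counts are hypothetical, and the code assumes no zero cells (otherwise some ratios are undefined).

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """tp/fn: index test positive/negative among reference-standard positives;
    fp/tn: index test positive/negative among reference-standard negatives."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
        "diagnostic_odds_ratio": (tp * tn) / (fp * fn),
    }

# Hypothetical study: 100 people with the condition, 100 without.
result = diagnostic_accuracy(tp=90, fp=20, fn=10, tn=80)
```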
Diagnostic case-control study – the index test results for a group of patients already known to
have the disease (through the reference standard) are compared to the index test results with a
separate group of normal/healthy people known to be free of the disease (through the use of the
reference standard). In this situation patients with borderline or mild expressions of the disease,
and conditions mimicking the disease are excluded, which can lead to exaggeration of both
sensitivity and specificity. This is called spectrum bias because the spectrum of study
participants will not be representative of patients seen in practice. Note: this does not apply to
well-designed population based case-control studies.
Historical control study – outcomes for a prospectively collected group of people exposed to the
intervention (factor under study) are compared with either (1) the outcomes of people treated at the
same institution prior to the introduction of the intervention (ie. control group/usual care), or (2)
the outcomes of a previously published series of people undergoing the alternate or control
intervention.
Interrupted time series with a control group – trends in an outcome or disease are measured
over multiple time points before and after the intervention (factor under study) is introduced to a
group of people, and then compared to the outcomes at the same time points for a group of people
that do not receive the intervention (factor under study).
Interrupted time series without a parallel control group – trends in an outcome or disease
are measured over multiple time points before and after the intervention (factor under study) is
introduced to a group of people, and the trends before and after its introduction are compared
within that group (as opposed to being compared to an external control group).
Randomised controlled trial – the unit of experimentation (eg. people, or a cluster of people⁴)
is allocated to either an intervention (the factor under study) group or a control group, using a
random mechanism (such as a coin toss, random number table, computer-generated random
numbers) and the outcomes from each group are compared. Cross-over randomised controlled
trials – where the people in the trial receive one intervention and then cross-over to receive the
alternate intervention at a point in time – are considered to be the same level of evidence as a
randomised controlled trial, although appraisal of these trials would need to be tailored to address
the risk of bias specific to cross-over trials.
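The "random mechanism" described above can be sketched with computer-generated random numbers. This is a minimal illustration of simple (unrestricted) randomisation only; the function name, unit labels and seed are hypothetical, and the sketch does not cover refinements such as block or stratified randomisation or allocation concealment.

```python
import random


def allocate(units, seed=None):
    """Assign each unit independently, with equal probability, to a group.

    `units` may be individual people or clusters (eg. clinics); in the
    latter case the whole cluster receives the same allocation.
    """
    rng = random.Random(seed)  # a seeded generator makes the list reproducible
    return {unit: rng.choice(["intervention", "control"]) for unit in units}


# Hypothetical cluster labels, for illustration only.
clusters = ["clinic_A", "clinic_B", "clinic_C", "clinic_D"]
allocation = allocate(clusters, seed=2009)
```

Because each unit is assigned independently, group sizes may be unequal, particularly when the number of units is small.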
Reference standard – the reference standard is considered to be the best available method for
establishing the presence or absence of the target condition of interest. The reference standard can
be a single method, or a combination of methods. It can include laboratory tests, imaging tests, and
pathology, but also dedicated clinical follow-up of individuals (Bossuyt et al 2003).
Study of diagnostic yield – these studies provide the yield of diagnosed patients, as
determined by the index test, without confirmation of the accuracy of the diagnosis (ie.
whether the patient is actually diseased) by a reference standard test.
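To make the distinction from accuracy concrete, the sketch below computes a yield as the proportion of tested people whom the index test labels as diseased. The function name and counts are hypothetical. No reference standard result appears anywhere in the calculation, which is why a yield alone cannot show whether those diagnoses are correct.

```python
def diagnostic_yield(index_test_positive: int, total_tested: int) -> float:
    """Proportion of tested people diagnosed by the index test."""
    if total_tested <= 0:
        raise ValueError("total_tested must be a positive integer")
    if not 0 <= index_test_positive <= total_tested:
        raise ValueError("index_test_positive must lie between 0 and total_tested")
    return index_test_positive / total_tested


# Hypothetical counts: 30 of 200 people tested are index-test positive.
yield_estimate = diagnostic_yield(30, 200)
```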
Systematic review – systematic location, appraisal and synthesis of evidence from scientific
studies.
⁴ Known as a cluster randomised controlled trial.
Test – any method of obtaining additional information on a person’s health status. It includes
information from history and physical examination, laboratory tests, imaging tests, function tests,
and histopathology (Bossuyt et al 2003).
Two or more single arm study – the outcomes of a single series of people receiving an
intervention (case series) from two or more studies are compared. Also see entry on unadjusted
indirect comparisons.
References
Bandolier editorial. Diagnostic testing emerging from the gloom? Bandolier, 1999;70. Available
at: http://www.jr2.ox.ac.uk/bandolier/band70/b70-5.html
Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, Lijmer JG, Moher D,
Rennie D, de Vet HCW for the STARD Group. Towards complete and accurate reporting of
studies of diagnostic accuracy: the STARD initiative. AJR, 2003;181:51-56.
Bucher HC, Guyatt GH, Griffith LE, Walter SD. The results of direct and indirect treatment
comparisons in meta-analysis of randomized controlled trials. J Clin Epidemiol, 1997;50:683-91.
CASP (2006). Critical Appraisal Skills Programme (CASP) - making sense of evidence: 10
questions to help you make sense of reviews. England: Public Health Resource Unit. Available at:
http://www.phru.nhs.uk/Doc_Links/S.Reviews%20Appraisal%20Tool.pdf
Elwood M. (1998) Critical appraisal of epidemiological studies and clinical trials. Second edition.
Oxford: Oxford University Press.
Glasziou P, Irwig L, Bain C, Colditz G. (2001) Systematic reviews in health care. A practical
guide. Cambridge: Cambridge University Press.
Higgins JP, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med, 2002;
21(11):1539-58.
Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JHP, Bossuyt PMM.
Empirical evidence of design-related bias in studies of diagnostic tests. JAMA, 1999;
282(11):1061-6.
Medical Services Advisory Committee (2005). Guidelines for the assessment of diagnostic
technologies. [Internet] Available at: www.msac.gov.au
Mulherin S, Miller WC. Spectrum bias or spectrum effect? Subgroup variation in diagnostic test
evaluation. Ann Intern Med, 2002;137:598-602.
NHMRC (1999). A guide to the development, implementation and evaluation of clinical practice
guidelines. Canberra: National Health and Medical Research Council.
NHMRC (2000a). How to review the evidence: systematic identification and review of the
scientific literature. Canberra: National Health and Medical Research Council.
NHMRC (2000b). How to use the evidence: assessment and application of scientific evidence.
Canberra: National Health and Medical Research Council.
NHMRC (2007). NHMRC standards and procedures for externally developed guidelines. Canberra:
National Health and Medical Research Council.
Available at: http://www.nhmrc.gov.au/publications/synopses/_files/nh56.pdf
NZGG (2001). Handbook for the preparation of explicit evidence-based clinical practice
guidelines. Wellington: New Zealand Guidelines Group. Available at: http://www.nzgg.org.nz
Phillips B, Ball C, Sackett D, Badenoch D, Straus S, Haynes B, Dawes M (2001). Oxford Centre
for Evidence-Based Medicine levels of evidence (May 2001). Oxford: Centre for Evidence-Based
Medicine. Available at: http://www.cebm.net/levels_of_evidence.asp
Sackett DL, Haynes RB. The architecture of diagnostic research. BMJ, 2002;324:539-41.
SIGN. SIGN 50. A guideline developers’ handbook. Methodology checklist 1: Systematic reviews
and meta-analyses. Edinburgh: Scottish Intercollegiate Guidelines Network. Available at:
http://www.sign.ac.uk/guidelines/fulltext/50/checklist1.html
Song F, Glenny A-M, Altman DG. Indirect comparison in evaluating relative efficacy illustrated
by antimicrobial prophylaxis in colorectal surgery. Controlled Clinical Trials, 2000;21(5):488-497.
UK National Screening Committee (2000). The UK National Screening Committee’s criteria for
appraising the viability, effectiveness and appropriateness of a screening programme. In: Second
Report of the UK National Screening Committee. London: United Kingdom Departments of
Health. Pp. 26-27. Available at: http://www.nsc.nhs.uk/
UK National Screening Committee. What is screening? [Internet]. Available at:
http://www.nsc.nhs.uk/whatscreening/whatscreen_ind.htm [Accessed August 2007].
Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool
for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC
Med Res Methodol, 2003;3(1):25. Available at: http://www.biomedcentral.com/1471-2288/3/25