Post-admission Language Assessment of University Students
English Language Education
Volume 6
Series Editors
Chris Davison, The University of New South Wales, Australia
Xuesong Gao, The University of Hong Kong, China
Post-admission Language Assessment of University Students
Editor
John Read
School of Cultures, Languages and Linguistics
University of Auckland
Auckland, New Zealand
Preface

This volume grew out of two conference events that I organised in 2013 and 2014.
The first was a symposium at the Language Testing Research Colloquium in Seoul,
South Korea, in July 2013 with the title “Exploring the diagnostic potential of post-
admission language assessments in English-medium universities”. The other event
was a colloquium entitled “Exploring post-admission language assessments in uni-
versities internationally” at the Annual Conference of the American Association for
Applied Linguistics (AAAL) in Portland, Oregon, USA, in March 2014. The AAAL
colloquium attracted the attention of the Springer commissioning editor, Jolanda
Voogt, who invited me to submit a proposal for an edited volume of the papers
presented at one conference or the other. In order to expand the scope of the book,
I invited Edward Li and Avasha Rambiritch, who were not among the original
presenters, to prepare additional chapters. Several of the chapters acquired an extra
author along the way to provide specialist expertise on some aspects of the
content.
I want to express my great appreciation first to the authors for the rich and stimu-
lating content of their papers. On a more practical level, they generally met their
deadlines to ensure that the book would appear in a timely manner and they will-
ingly undertook the necessary revisions of their original submissions. Whatever my
virtues as an editor, I found that as an author I tended to trail behind the others in
completing my substantive contributions to the volume.
At Springer, I am grateful to Jolanda Voogt for seeing the potential of this topic
for a published volume and encouraging us to develop it. Helen van der Stelt has
been a most efficient editorial assistant and a pleasure to work with. I would also
like to thank the series editors, Chris Davison and Andy Gao, for their ongoing sup-
port and encouragement. In addition, two anonymous reviewers of the draft manu-
script gave positive feedback and very useful suggestions for revisions.
Contents

Part I Introduction

1 Some Key Issues in Post-Admission Language Assessment
John Read

Part V Conclusion

11 Reflecting on the Contribution of Post-Admission Assessments
John Read

Index
Contributors
Rebecca Patterson Office of the Dean: Humanities, University of the Free State,
Bloemfontein, South Africa
Anna Pot Office of the Dean: Humanities, University of the Free State,
Bloemfontein, South Africa
Avasha Rambiritch Unit for Academic Literacy, University of Pretoria, Pretoria,
South Africa
Michelle Raquel Centre for Applied English Studies, University of Hong Kong,
Hong Kong, China
John Read School of Cultures, Languages and Linguistics, University of Auckland,
Auckland, New Zealand
Thomas Roche SCU College, Southern Cross University, Lismore, NSW, Australia
Yogesh Sinha Department of English Language Teaching, Sohar University, Al
Sohar, Oman
Suthathip Ploy Thirakunkovit English Department, Mahidol University,
Bangkok, Thailand
Alan Urmston English Language Centre, Hong Kong Polytechnic University,
Hong Kong, China
Janet von Randow Diagnostic English Language Needs Assessment, University
of Auckland, Auckland, New Zealand
Albert Weideman Office of the Dean: Humanities, University of the Free State,
Bloemfontein, South Africa
Xun Yan Department of Linguistics, University of Illinois at Urbana-Champaign,
Urbana, IL, USA
Part I
Introduction
Chapter 1
Some Key Issues in Post-Admission Language Assessment
John Read
Abstract This chapter introduces the volume by briefly outlining trends in English-
medium higher education internationally, but with particular reference to post-entry
language assessment (PELA) in Australian universities. The key features of a PELA
are described, in contrast to a placement test and an international proficiency test.
There is an overview of each of the other chapters in the book, providing appropri-
ate background information on the societies and education systems represented:
Australia, Canada, Hong Kong, the USA, New Zealand, Oman and South Africa.
This is followed by a discussion of three themes running through several chapters.
The first is how to validate post-admission language assessments; the second is the
desirability of obtaining feedback from the test-takers; and the third is the extent to
which a PELA is diagnostic in nature.
J. Read (*)
School of Cultures, Languages and Linguistics, University of Auckland,
Auckland, New Zealand
e-mail: ja.read@auckland.ac.nz
employment in Australia. This score (or 6.5 in many cases) is the standard require-
ment for direct admission to undergraduate degree programmes, but the problem
was that many students were following alternative pathways into the universities
which allowed them to enter the country originally with much lower test scores, and
they had not been re-tested at the time they were accepted for degree-level study.
Media coverage of Birrell’s work generated a large amount of public debate
about English language standards in Australian universities. A national symposium
(AEI 2007) organised by a federal government agency was held in Canberra to
address the issues and this led to a project by the Australian Universities Quality
Agency (AUQA) to develop the Good Practice Principles for English Language
Proficiency for International Students in Australian Universities (AUQA 2009). The
principles have been influential in prompting tertiary institutions to review their
provisions for supporting international students and have been incorporated into the
regular cycle of academic audits conducted by the AUQA and its successor, the
Tertiary Education Quality and Standards Agency (TEQSA). In fact, the promotion
of English language standards (or academic literacy) is now seen as encompassing
the whole student body, rather than just international students (see, e.g., Arkoudis
et al. 2012).
From an assessment perspective, the two most relevant Good Practice Principles
are these:
1. Universities are responsible for ensuring that their students are sufficiently com-
petent in English to participate effectively in their university studies.
2. Students’ English language development needs are diagnosed early in their stud-
ies and addressed, with ongoing opportunities for self-assessment (AUQA 2009,
p. 4).
A third principle, which assigns shared responsibility to the students themselves,
should also be noted:
3. Students have responsibilities for further developing their English language pro-
ficiency during their study at university and are advised of these responsibilities
prior to enrolment. (ibid.)
These principles have produced initiatives in many Australian institutions to
design what have become known as post-entry language assessments (PELAs).
Actually, a small number of assessments of this kind pre-date the developments of
the last 10 years, notably the Diagnostic English Language Assessment (DELA) at
the University of Melbourne (Knoch et al. this volume) and Measuring the Academic
Skills of University Students at the University of Sydney (Bonanno and Jones
2007). The more recent developments have been documented in national surveys by
Dunworth (2009) and Dunworth et al. (2013). The latter project led to the creation
of the Degrees of Proficiency website (www.degreesofproficiency.aall.org.au),
which offers a range of useful resources on implementing the Good Practice
Principles, including a database of PELAs in universities nationwide.
In New Zealand, although the term PELA is not used, the University of Auckland
has implemented the most comprehensive assessment programme of this kind, the Diagnostic English Language Needs Assessment (DELNA).
The first four chapters of the book, following this one, focus on the assessment of
students entering degree programmes in English-medium universities at the under-
graduate level. The second chapter, by Ute Knoch, Cathie Elder and Sally
O’Hagan, discusses recent developments at the University of Melbourne, which
has been a pioneering institution in Australia in the area of post-admission
assessment, not only because of the high percentage of students from non-English-
speaking backgrounds on campus but also because the establishment of the
Language Testing Research Centre (LTRC) there in 1990 made expertise in test
design and development available to the University. The original PELA at
Melbourne, the Diagnostic English Language Assessment (DELA), which goes
back to the early 1990s, has always been administered on a limited scale for various
reasons. A policy was introduced in 2009 that all incoming undergraduate students
whose English scores fell below a certain threshold on IELTS (for international
students) or the Victorian matriculation exams (domestic students) would be
required to take DELA, followed by an academic literacy development programme
as necessary (Ransom 2009). However, it has been difficult to achieve full compli-
ance with the policy. This provided part of the motivation for the development of a
new assessment, now called the Post-entry Assessment of Academic Language (PAAL),
which is the focus of the Knoch et al. chapter.
Although Knoch et al. report on a trial of PAAL in two faculties at the University
of Melbourne, the assessment is intended for use on a university-wide basis and thus it
measures general academic language ability. By contrast, in Chap. 3 Janna Fox,
John Haggerty and Natasha Artemeva describe a programme tailored specifi-
cally for the Faculty of Engineering at Carleton University in Canada. The starting
point was the introduction of generic screening measures and a writing task licensed
from the DELNA programme at the University of Auckland in New Zealand
(discussed in Chap. 7), but as the Carleton assessment evolved, it was soon
recognised that a more discipline-specific set of screening measures was required to
meet the needs of the faculty. Thus, both the input material and the rating criteria for
the writing task were adapted to reflect the expectations of engineering instructors,
and recently a more appropriate reading task and a set of mathematical problems
have been added to the test battery. Another feature of the Carleton programme has
been the integration of the assessment with the follow-up pedagogical support. This
has involved the embedding of the assessment battery into the delivery of a required
first-year engineering course and the opening of a support centre staffed by
upper-year students as peer mentors. Fox et al. report that there is a real sense in
which students in the faculty have taken ownership of the centre, with the result
that it is not stigmatised as a remedial place for at-risk students, but somewhere
where a wide range of students can come to enhance their academic literacy in
engineering.
The term academic literacy is used advisedly here to refer to the discipline-
specific nature of the assessment at Carleton, which distinguishes it from the other
programmes presented in this volume; otherwise, they all focus on the more generic
construct of academic language proficiency (for an extended discussion of the two
constructs, see Read 2015). The one major example of an academic literacy assess-
ment in this sense is Measuring the Academic Skills of University Students
(MASUS) (Bonanno and Jones 2007), a procedure developed in the early 1990s at
the University of Sydney, Australia, which involves the administration of a
discipline-specific integrated writing task. This model requires the active involve-
ment of instructors in the discipline, and has been implemented most effectively in
The second section of the book includes two studies of post-admission assessments
for postgraduate students, and more specifically doctoral candidates. Although
international students at the postgraduate level have long been required to achieve a
minimum score on a recognised English proficiency test for admission purposes,
normally this has involved setting a somewhat higher score on a test that is other-
wise the same as for undergraduates. However, there is growing recognition of the
importance of addressing the specific language needs of doctoral students, particularly
in relation to speaking and writing skills. Such students have usually developed
academic literacy in their discipline through their previous university education but,
if they are entering a fully English-medium programme for the first time, their
doctoral studies will place new demands on their proficiency in the language.
In the third section of the book, there are three chapters which shift the focus back
to English-medium university education in societies where (as in Hong Kong) most
if not all of the domestic student population are primary speakers of other lan-
guages. These chapters are also distinctive in the attention they pay to design issues
in post-admission assessments.
Chapter 8, by Thomas Roche, Michael Harrington, Yogesh Sinha and
Christopher Denman, investigates the use of a particular test format for the pur-
poses of post-admission assessment at two English-medium universities in Oman, a
Gulf state which came under British influence in the twentieth century but where
English remains a foreign language for most of the population. The instrument is
what the authors call the Timed Yes/No (TYN) vocabulary test, which measures the
accuracy and speed with which candidates report whether they know each of a set
of target words or not. Such a measure would not normally be acceptable in a
contemporary high-stakes proficiency test, but it has a place in post-admission
assessments. Vocabulary sections are included in the DELTA, DELNA and ELPA
test batteries, and the same applies to TALL and TALPS (Chaps. 9 and 10). Well-
constructed vocabulary tests are highly reliable and efficient measures which have
been repeatedly shown to be good predictors of reading comprehension ability and
indeed of general language proficiency (Alderson 2005). They fit well with the
diagnostic purpose of many post-admission assessments, as distinct from the more
communicative language use tasks favoured in proficiency test design. Roche et al.
argue that a TYN test should be seriously considered as a cost-effective alternative
to the existing resource-intensive placement tests used as part of the admission
process to foundation studies programmes at the two institutions.
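To make the format concrete, here is a minimal sketch of how a TYN administration might be scored. The inclusion of pseudowords and the hit-rate-minus-false-alarm-rate correction are standard for the Yes/No format but are assumptions here, since the summary above does not specify the authors' scoring rule; all items and timings are invented.

```python
# A minimal sketch of scoring a Timed Yes/No (TYN) vocabulary test. The use of
# pseudowords and the simple hits-minus-false-alarms correction are standard
# for the Yes/No format but are assumptions, not the authors' stated procedure.

from dataclasses import dataclass

@dataclass
class Response:
    item: str        # word presented to the candidate
    is_real: bool    # True for a target word, False for a pseudoword
    said_yes: bool   # candidate claimed to know the word
    rt_ms: int       # response time in milliseconds (the "timed" element)

def tyn_summary(responses: list[Response]) -> dict:
    real = [r for r in responses if r.is_real]
    pseudo = [r for r in responses if not r.is_real]
    hit_rate = sum(r.said_yes for r in real) / len(real)
    false_alarm_rate = sum(r.said_yes for r in pseudo) / len(pseudo)
    return {
        "hit_rate": hit_rate,
        "false_alarm_rate": false_alarm_rate,
        "corrected_accuracy": hit_rate - false_alarm_rate,
        "mean_rt_ms": sum(r.rt_ms for r in responses) / len(responses),
    }

demo = [
    Response("hypothesis", True, True, 850),
    Response("frimble", False, False, 1200),   # pseudoword correctly rejected
    Response("mandate", True, False, 1900),    # known word missed
    Response("plindge", False, True, 700),     # false alarm
]
print(tyn_summary(demo))
```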
The TYN test trial at the two institutions produced promising results but also
some reasons for caution in implementing the test for operational purposes. The
vocabulary test was novel to the students not only in its Yes/No format but also in
the fact that it was computer-based. A comparison of student performance at the two
universities showed evidence of a digital divide between students at the metropoli-
tan institution and those at the regional one; there were also indications of a gender
gap in favour of female students at the regional university. This points to the need
to ensure that the reliability of placement tests and other post-admission assess-
ments is not threatened by the students’ lack of familiarity with the format and the
mode of testing. It also highlights the value of obtaining feedback from test-takers
themselves, as several of the projects described in earlier chapters have done.
The other two chapters in the section come from a team of assessment specialists
affiliated to the Inter-institutional Centre for Language Development and Assessment
(ICELDA), which – like the DELTA project in Hong Kong – involves collaboration
among four participating universities in South Africa to address issues of academic
literacy faced by students entering each of the institutions. The work of ICELDA is
informed by the multilingual nature of South African society, as well as the ongoing
legacy of the political and educational inequities inflicted by apartheid on the major-
ity population of the country. This makes it essential that students who will struggle
to meet the language demands of university study through the media of instruction
of English or Afrikaans should be identified on entry to the institution and should be
directed to an appropriate programme of academic literacy development.
Two tests developed for this purpose, the Test of Academic Literacy Levels
(TALL) and its Afrikaans counterpart, the Toets van Akademiese Geletterdheidsvlakke
(TAG), are discussed in Chap. 9 by Albert Weideman, Rebecca Patterson and
Anna Pot. These tests are unusual among post-admission assessments in the extent
to which an explicit definition of academic literacy has informed their design. It
should be noted here that the construct was defined generically in this case, rather
than in the discipline-specific manner adopted by Read (2015) and referred to in
Chap. 3 in relation to the Carleton University assessment for engineering students.
The original construct definition draws on current thinking in the applied linguistic
literature, particularly work on the nature of academic discourse. The authors
acknowledge that compromises had to be made in translating the components of
the construct into an operational test design, particularly given the need to employ
objectively-scored test items for practical reasons in such large-scale tests. The
practical constraints precluded any direct assessment of writing ability, which many
would consider an indispensable element of academic literacy.
In keeping with contemporary thinking about the need to re-validate tests peri-
odically, Weideman et al. report on their recent exercise to revisit the construct,
leading to some proposed new item types targeting additional components of aca-
demic literacy. One interesting direction, following the logic of two of the additions,
is towards the production of some field-specific tests based on the same broad con-
struct. It would be useful to explore further the diagnostic potential of these tests
through the reporting of scores for individual sections, rather than just the total
score. To date this potential has not been realised, largely again on the practical
grounds that more than 30,000 students need to be assessed annually, and thus over-
all cut scores are simply used to determine which lower-performing students will be
required to take a 1-year credit course in academic language development.
This brings us to Chap. 10, by Avasha Rambiritch and Albert Weideman,
which complements Chap. 9 by giving an account of the other major ICELDA
instrument, the Test of Academic Literacy for Postgraduate Students (TALPS). As
the authors explain, the development of the test grew out of a recognition that post-
graduate students were entering the partner institutions with inadequate skills in
academic writing. The construct definition draws on the one for TALL and TAG but
with some modification, notably the inclusion of an argumentative writing task. The
test designers decided that a direct writing task was indispensable if the test was to
be acceptable (or in traditional terminology, to have face validity) to postgraduate
supervisors in particular.
The last point is an illustration of the authors’ emphasis on the need for test
developers to be both transparent and accountable in their dealings with stakehold-
ers, including of course the test-takers. At a basic level, it means making informa-
tion easily available about the design of the test, its structure and formats, and the
meaning of test scores, as well as providing sample forms of the test for prospective
candidates to access. Although this may seem standard practice in high-stakes
testing programmes internationally, Rambiritch and Weideman point out that such
openness is not common in South Africa. In terms of accountability, the test
developers identify themselves and provide contact details on the ICELDA website.
They are also active participants in public debate about the assessment and related
issues through the news media and in talks, seminars and conferences. Their larger
purpose is to promote the test not as a tool for selection or exclusion but as one
means of giving access to postgraduate study for students from disadvantaged
backgrounds.
Although universities in other countries may not be faced with the extreme
inequalities that persist in South African society, this concern about equity of access
can be seen as part of the more general rationale for post-admission language
assessment and the subsequent provision of an “intervention” (as Rambiritch and
Weideman call it), in the form of opportunities for academic language development.
The adoption of such a programme signals that the university accepts a responsibil-
ity for ensuring that students it has admitted to a degree programme are made aware
of the fact that they may be at risk of underachievement, if not outright failure, as a
result of their low level of academic language proficiency, even if they have met the
standard requirements for matriculation. The institutional responsibility also
extends to the provision of opportunities for students to enhance their language
skills, whether it be through a compulsory course, workshops, tutorial support,
online resources or peer mentoring.
The concluding Chap. 11, by John Read, discusses what is involved for a par-
ticular university in deciding whether to introduce a post-admission language
assessment, as part of a more general programme to enhance the academic language
development of incoming students from diverse language backgrounds. There
are pros and cons to be considered, such as how the programme will be viewed
externally and whether the benefits will outweigh the costs. Universities are paying
increasing attention to the employability of their graduates, whose attributes are
often claimed to include effective communication ability. This indicates that both
academic literacy and professional communication skills need to be developed not
just in the first year of study but throughout students’ degree programmes. Thus,
numerous authors now argue that language and literacy enhancement should be
embedded in the curriculum for all students, but there are daunting challenges in
fostering successful and sustained collaboration between English language specialists
and subject teaching staff. The chapter concludes by exploring the ideas associated
with English as a Lingua Franca (ELF) and how they might have an impact on the
post-admission assessment of students.
To conclude this introduction, I will identify three themes that run through several chapters of the volume.
As with any assessment, a key question with PELAs is how to validate them. The
authors of this volume have used a variety of frameworks and conceptual tools for
this purpose, especially ones which emphasise the importance of the consequences
of the assessment. This is obviously relevant to post-admission assessment pro-
grammes, where by definition the primary concern is not only to identify incoming
students with academic language needs but also to ensure that subsequently they
have the opportunity to develop their language proficiency in ways which will
enhance their academic performance at the university.
In Chap. 2, Knoch, Elder and O’Hagan present a framework which is specifically
tailored for the validation of post-admission assessments. The framework is an
adapted version of the influential one in language testing developed by Bachman
and Palmer (2010), which in turn draws on the seminal work on test validation of
Samuel Messick and more particularly the argument-based approach advocated by
Michael Kane. It specifies the sequence of steps in the development of an argument
to justify the interpretation of test scores for a designated purpose, together with the
kinds of evidence required at each step in the process. The classic illustration of this
approach is the validity argument for the internet-based TOEFL articulated by
Chapelle et al. (2008). Knoch and Elder have applied their version of the framework
to several PELAs and here use it as the basis for evaluating the Post-entry Assessment
of Academic Language (PAAL) at the University of Melbourne.
An alternative approach to validation is Cyril Weir’s (2005) socio-cognitive
model, which incorporates the same basic components as the Bachman and Palmer
A notable feature of several studies in the volume is the elicitation of feedback from
students who have taken the assessment. This can be seen as a form of validity
evidence or, as we have just seen in the case of the OEPP at Purdue University, as
input to a quality management procedure. Although an argument can be made for
obtaining test-taker views at least at the development stage of any language testing
programme, it is particularly desirable for a post-admission assessment for three
reasons. First, like a placement test, a PELA is administered shortly after students
arrive on campus and, if they are uninformed or confused about the nature and
purpose of the assessment, it is less likely to give a reliable measure of their aca-
demic language ability. The second reason is that the assessment is intended to alert
students to difficulties they may face in meeting the language demands of their stud-
ies and often to provide them with beneficial diagnostic feedback. This means that
taking the assessment should ideally be a positive experience for them and anything
which frustrates them about the way the test is administered or the results are
reported will not serve the intended purpose. The other, related point is that the
assessment is not an end in itself but should be the catalyst for actions taken by the
students to enhance their academic language ability. Thus, feedback from the stu-
dents after a period of study provides evidence as to whether the consequences of
the assessment are positive or not, in terms of what follow-up activities they engage
in and what factors may inhibit their uptake of language development
opportunities.
Feedback from students was obtained in different ways and for various purposes
in these studies. In the development of the PAAL (Chap. 2), a questionnaire was
administered shortly after the two trials, followed later by focus groups. A similar
pattern of data-gathering was conducted in the studies of the two Hong Kong tests,
ELPA (Chap. 4) and DELTA (Chap. 5). On the other hand, the Post Test Questionnaire
is incorporated as a routine component of every administration of the OEPT (Chap.
6), to monitor student reactions to the assessment on an ongoing basis. A third
model is implemented for DELNA (Chap. 7). In this case, students are invited to
complete a questionnaire and then participate in an interview only after they have
completed a semester of study. Although this voluntary approach greatly reduces
the response rate, it provides data on the students’ experiences of engaging (or not)
in academic language enhancement activities as well as their reactions to the assess-
ment itself.
References
Jenkins, J. (2007). English as a lingua franca: Attitudes and identity. Oxford: Oxford University
Press.
Kunnan, A. J. (Ed.). (2014). The companion to language assessment. Chichester: Wiley Blackwell.
Lee, Y. W. (Ed.) (2015). Special issue: Future of diagnostic language testing. Language Testing,
32(3), 293–418.
Leki, I. (2007). Undergraduates in a second language: Challenges and complexities of academic
literacy development. Mahwah: Lawrence Erlbaum.
Murray, N. (2016). Standards of English in higher education: Issues, challenges and strategies.
Cambridge: Cambridge University Press.
Phillipson, R. (2009). English in higher education: Panacea or pandemic? In R. Phillipson (Ed.),
Linguistic imperialism continued (pp. 195–236). New York: Routledge.
Poon, A. Y. K. (2013). Will the new fine-tuning medium-of-instruction policy alleviate the threats
of dominance of English-medium instruction in Hong Kong? Current Issues in Language
Planning, 14(1), 34–51.
Qian, D. D. (2007). Assessing university students: Searching for an English language exit test.
RELC Journal, 38(1), 18–37.
Ransom, L. (2009). Implementing the post-entry English language assessment policy at the
University of Melbourne: Rationale, processes and outcomes. Journal of Academic Language
and Learning, 3(2), 13–25.
Read, J. (2015). Assessing English proficiency for university study. Basingstoke: Palgrave
Macmillan.
So, D. W. C. (1989). Implementing mother‐tongue education amidst societal transition from
diglossia to triglossia in Hong Kong. Language and Education, 3(1), 29–44.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Basingstoke:
Palgrave Macmillan.
Xi, X. (2008). Investigating the criterion-related validity of the TOEFL speaking scores for ITA
screening and setting standards for ITAs (ETS research reports, RR-08-02). Princeton:
Educational Testing Service.
Part II
Implementing and Monitoring
Undergraduate Assessments
Chapter 2
Examining the Validity of a Post-Entry Screening Tool Embedded in a Specific Policy Context

Ute Knoch, Cathie Elder and Sally O'Hagan
1 Introduction
The authors also showed how the institutional policy determines how the decisions
and consequences associated with a certain PELA play out. This crucial role of
policy in PELA implementation will be taken up later in this chapter.
This framework has subsequently been applied to the evaluation of the Diagnostic
English Language Needs Assessment at the University of Auckland (Read 2015)
and also to two well-established Australian PELAs claiming either explicitly or
implicitly to be diagnostic in their orientation: the Measuring the Academic Skills
of University Students (MASUS) test at the University of Sydney (Bonanno and
Jones 2007) and the Diagnostic English Language Assessment (DELA) at the
University of Melbourne (Knoch and Elder 2016). In the latter study, Knoch and
Elder consider the different inferences (evaluation, generalization, explanation/
extrapolation, decisions and consequences) that underpin claims about diagnostic
score interpretation in a PELA context and the associated warrants for which evi-
dential support is required. The findings of the evaluation revealed that support for
some of these warrants is lacking and that neither instrument can claim to be fully
diagnostic. Although each PELA was found to have particular strengths, the claim
that each provides diagnostic information about students which can be used as a
basis for attending to their specific language needs is weakened by particular fea-
tures of the assessment instruments themselves, or by the institutional policies
determining the manner in which they are used. The rich diagnostic potential of the
MASUS was seen to be undermined by limited evidence for reliability and also by
the lack of standardized procedures for administration. The DELA, while statisti-
cally robust and potentially offering valid and useful information about reading,
listening and writing sub-skills, was undermined by the policy of basing support
recommendations on the student’s overall score rather than on the sub-skill profile.
The authors concluded that the DELA ‘really functions as a screening test to group
students into three broad bands – at risk, borderline or proficient – and there is no
obvious link between the type of support offered and the particular needs of the
student’ (p. 16).
A further problem with both of these PELAs is that they are not universally
applied: MASUS is administered only in certain departments and DELA is admin-
istered only to categories of students perceived to be academically at risk.
Furthermore, there are few or no sanctions imposed on students who fail to sit the
test, on the one hand, or fail to follow the support recommendations, on the other.
Similar problems of uptake were identified with the Diagnostic English Language
Needs Assessment (DELNA) at the University of Auckland, a two-tiered instrument
which includes both an initial screening test involving indirect, objectively-scored
items and a follow-up performance-based diagnostic component, with the latter
administered only to those falling below a specified threshold on the former. One of
the challenges faced was convincing students who had performed poorly on the
screening component to return for their subsequent diagnostic assessment (Elder
and von Randow 2008). While uptake of the diagnostic component has improved
over time, as awareness of the value of the DELNA initiative has grown, the success
of the two-tiered system has been largely due to the institution providing resources
for a full-time manager, a half-time administrator and a half-time adviser whose
jobs include raising awareness of the initiative among academic staff and students,
pursuing those who fail to return for the second round of testing, and offering
one-on-one counseling on their English needs as required.
Given that many institutions are unwilling to make a similar commitment of
resources to any PELA initiative, an alternative approach, which attempts to build
on the strengths of previous models and to address their weaknesses, was devised by
the authors of this chapter.
This chapter offers an evaluation of the resulting instrument, known as the Post-
entry Assessment of Academic Language (PAAL), based on the PELA evaluation
framework referred to above. PAAL is the name adopted for a form of an academic
English screening test known historically, and currently in other contexts, as the
Academic English Screening Test (AEST). PAAL is designed to provide a quick
and efficient means of identifying those students in a large and linguistically diverse
student population who are likely to experience difficulties coping with the English
language demands of academic study, while at the same time providing some diag-
nostic information. In the interests of efficiency it combines features of the indirect
screening approach adopted at the University of Auckland (Elder and von Randow
2008) with a single task designed to provide some, albeit limited, diagnostic infor-
mation, removing the need for a second round of assessment. It is based on the
principle of universal testing, to allow all at-risk students to be identified, rather
than targeting particular categories of students (as was the case for DELA) and
builds on over 10 years of development and research done at the University of
Melbourne and the University of Auckland.
The AEST/PAAL was developed at the Language Testing Research Centre (LTRC)
at the University of Melbourne in early 2009. It was initially commissioned for use
by the University of South Australia; however, the rights to the test remain with the
LTRC.
The test is made up of three sections: a text completion task, a speed reading task
and an academic writing task, all completed within a 1-h time frame. (For a detailed
account of the initial development and trialling, refer to Elder and Knoch 2009). The
writing task was drawn from the previously validated DELA and the other two tasks
were newly developed. The 20-min text completion task consists of three short texts
and uses a C-test format (Klein-Braley 1985), in which the second half of every
second word has been deleted. Students are required to reconstruct the text by filling in the gaps.
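By way of illustration, here is a minimal sketch of how such a gapped text could be generated. The exact deletion conventions of the PAAL are not specified here, so the textbook C-test rule (delete the second half of every second word) stands in for the operational algorithm; the sample sentence is invented.

```python
# A minimal sketch of C-test gap generation, assuming the textbook deletion
# rule (second half of every second word); the operational PAAL rule may differ.

def make_c_test(text: str) -> tuple[str, list[str]]:
    """Return the gapped text and the answer key for the deleted halves."""
    words = text.split()
    gapped, key = [], []
    for i, word in enumerate(words):
        if i % 2 == 1 and len(word) > 1:      # every second word
            keep = (len(word) + 1) // 2       # keep the first half (round up)
            gapped.append(word[:keep] + "_" * (len(word) - keep))
            key.append(word[keep:])           # the candidate must restore this
        else:
            gapped.append(word)
    return " ".join(gapped), key

gapped, key = make_c_test(
    "Students reconstruct each text by restoring the missing word endings"
)
print(gapped)  # Students recons_____ each te__ by resto____ the miss___ word endi___
print(key)     # ['truct', 'xt', 'ring', 'ing', 'ngs']
```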
The speed reading task, an adaptation of the cloze-elide format used for the screen-
ing component of DELNA at the University of Auckland (Elder and von Randow
2008), requires students to read a text of approximately 1,000 words in 10 min and
identify superfluous words that have been randomly inserted. The writing task is an
argumentative essay for which students are provided with a topic and have 30 min
to respond with 250–300 words (see Elder et al. 2009 for further detail). The text
completion and speed reading tasks, used for screening purposes, are objectively
scored and the writing task, intended for diagnostic use, is scored by trained raters
using a three-category analytic rating scale. The test scores from the two screening
components place students in one of three support categories as follows:
• Proficient. Students scoring in the highest range are deemed to have sufficient
academic English proficiency for the demands of tertiary study.
• Borderline. Students scoring in the middle range are likely to be in need of fur-
ther language support or development.
• At Risk. Students scoring in the lowest range are deemed likely to be at risk of
academic failure if they do not undertake further language support and
development.
It is recommended, for the sake of efficiency, that the writing component of the
AEST/PAAL be completed by all students but be marked only for those scoring in
the Borderline and At Risk categories on the two screening components. The writ-
ing thus serves to verify the results of the screening components for Borderline
students (where categorization error is more likely) and potentially yields diagnos-
tic information for less able students so that they can attend to their language weak-
nesses. The AEST/PAAL was however designed primarily as a screening test which
efficiently groups students into the three categories above. For reasons of practical-
ity and due to financial constraints the test was not designed to provide detailed
feedback to test takers beyond the classification they are placed into and informa-
tion about the support available to them on campus.
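To make the decision logic just described concrete, the following is a minimal sketch of the three-band classification and selective marking of writing. The numeric cut scores are invented placeholders; the operational values were set against DELA benchmarks and are not reported here.

```python
# A minimal sketch of the three-band screening decision and selective marking
# of writing described above. Cut scores are hypothetical placeholders; the
# operational values were benchmarked against the DELA.

AT_RISK_CUT = 60      # hypothetical: below this => At Risk
PROFICIENT_CUT = 80   # hypothetical: at or above this => Proficient

def band(screening_total: float) -> str:
    """Map the combined text-completion + speed-reading score to a category."""
    if screening_total >= PROFICIENT_CUT:
        return "Proficient"
    if screening_total >= AT_RISK_CUT:
        return "Borderline"
    return "At Risk"

def writing_is_marked(category: str) -> bool:
    """Writing is completed by all but marked only for the two lower bands."""
    return category in ("Borderline", "At Risk")

for total in (92, 71, 48):
    category = band(total)
    print(total, category, "writing marked:", writing_is_marked(category))
```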
Following the development and trial of the prototype outlined in Elder and
Knoch (2009), three more parallel versions of the test were developed (Knoch
2010a, b, 2011), initially for use at the University of South Australia, as noted above.
In 2012, following a feasibility study (Knoch et al. 2012a), the University of
Melbourne’s English Language Development Advisory Group (ELDAG) supported
a proposal by the LTRC to put the test online for eventual use at the University of
Melbourne. This was funded in 2012 by a university Learning and Teaching
Initiative Grant. The online platform was then developed by Learning Environments,
a group of IT specialists supporting online learning and teaching initiatives at the
University. Following the completion of the platform, the online delivery was tested
on 50 test takers who had previously taken the University’s DELA (Knoch et al.
2012b). The students also completed a questionnaire designed to elicit their experi-
ences with the online platform. This small trial served as an extra check to verify
cut-scores (between the Proficient, Borderline and At risk levels) which had been set
during the development of the test using performance on the DELA as the bench-
mark (see below, and for further detail, Elder and Knoch 2009).
Based on the results of the small trial, a number of technical changes were
made to the delivery of the test and in Semester 2, 2013, a trial on two full cohorts
was undertaken (again funded by a Teaching and Learning Initiative grant) (Knoch
and O’Hagan 2014). The trial targeted all newly incoming Bachelor of Commerce
and Master of Engineering students as these two cohorts were considered to be
3 Methodology
As the overview of the historical development of the PAAL above shows, a number
of trials and data collections have been conducted over the years. In this section, we
will describe the following sources of data which we will draw on in this chapter:
1. the PAAL development trial
2. the small trial of the online platform
3. the full trial on two student cohorts
There were 156 students who took part in the development trial of the PAAL, 71
from the University of South Australia and 85 from the University of Melbourne.
All students were first year undergraduates and were from a range of L1 back-
grounds, including a quarter from English-speaking backgrounds.
Test takers at both universities took the following components of the PAAL:
• Part A: Text completion (C-test with 4 texts of 25 items each; four texts are used in trial administrations, whilst the final test form includes only three) – 20 min
• Part B: Speed reading (Cloze elide with 75 items) – 10 min
• Part C: Writing task – 30 min
Test takers at the University of Melbourne had previously taken the DELA as
well, and therefore recent scores for the following skills were also available:
• Reading: 46 item reading test based on two written texts – 45 min
• Listening: 30 item listening test based on lecture input – 30 min
Fifty students from the University of Melbourne were recruited to take part in the
small trial of the online platform developed by Learning Environments. The stu-
dents completed a full online version of the PAAL from a computer or mobile
device at a place and time convenient to them. The final format of the PAAL adopted
for the online trial and the full trial was as follows:
• Part A: Text completion (C-test with 3 texts of 25 items each) – 15 min
• Part B: Speed reading (Cloze elide with 75 items) – 10 min
• Part C: Writing task – 30 min
Following the trial, 49 of the participants completed an online questionnaire
designed to elicit information about what device and browser they used to access the
test, any technical issues they encountered, and whether the instructions to the test
were clear and the timer visible at all times.
The full trial was conducted on two complete cohorts of commencing students at the
beginning of Semester 2, 2013: Bachelor of Commerce (BCom) and Master of
Engineering (MEng).
In the lead-up to the pilot implementation, extensive meetings were held with
key stakeholders, in particular Student Centre staff from the respective faculties. It
became evident very early in these discussions that universal testing is not possible
as there is no mechanism to enforce this requirement on the students. Although it
was not compulsory, all students in participating cohorts were strongly encouraged
through Student Centre communications and orientation literature to complete the
PAAL. At intervals during the pilot period, up to three reminder emails were sent by
the respective Student Centres to remind students they were expected to take the
test.
On completing the PAAL, all students were sent a report containing brief feed-
back on their performance and a recommendation for language support according to
their results. The support recommendations were drafted in consultation with the
Student Centres and Academic Skills² to ensure recommendations were in accord
with appropriate and available offerings. The reports were emailed by the Language
Testing Research Centre (LTRC) to each student within 1–2 days of their complet-
ing the PAAL. Cumulative spreadsheets of all students’ results were sent by the
LTRC to the Student Centres on a weekly basis throughout the pilot testing period.
The PAAL was taken by 110 BCom students, or 35 % of the incoming cohort of
310 students. In the MEng cohort, PAAL was taken by 60 students, comprising
12 % of the total of 491. The level of uptake for the BCom cohort was reported by
the Commerce Student Centre as favourable compared with previous Semester 2
administrations of the DELA. Lower uptake for the MEng cohort was to be expected
since traditionally post-entry language screening has not been required for graduate
students at the University of Melbourne.
² The unit responsible for allocation and delivery of academic language support at the University.
The full trial was followed up with an evaluation in which feedback was sought
from University stakeholders in order to develop recommendations for the best
future form of the PAAL. Feedback came from students in the trial by means of a
participant questionnaire and focus groups. Face-to-face and/or email consultation
was used to gather feedback from Student Centres and Academic Skills.
Student consultation commenced with an online survey distributed to all pilot
participants 2 weeks after taking the PAAL. Responses were received from 46 stu-
dents, representing a 27 % response rate. Survey respondents were asked for feed-
back on the following topics: the information they received about the PAAL prior to
taking the assessment; their experience of taking the PAAL (technical aspects, task
instructions, face validity and difficulty of tasks); the PAAL report (results and rec-
ommendations); and the options for support/development and follow-up advice
after taking the PAAL.
To gather more detailed feedback on these aspects of the PAAL, and to give stu-
dents an opportunity to raise any further issues, four faculty-specific focus groups
of up to 60 min duration were held, attended by 20 students in all (an average of
5 per group). Group discussion was structured around the
themes covered in the survey and participants were given the opportunity to elabo-
rate their views and to raise any other issues relating to the PAAL that were of inter-
est or concern to them.
4 Results
The remainder of the chapter will present some of the findings from the multiple
sources of evidence collected. We will organize the results under the inferences set
out by Knoch and Elder (2013) and have summarized the warrants and evidence in
a table for each inference at the beginning of each section. Below each table, we
describe the different sources of evidence in more detail and present the results for
each.
4.1 Evaluation
Table 2.1 summarizes the three key warrants underlying the Evaluation inference.
Evidence collected to find backing for each warrant is summarized in the final
column.
To find backing for the first warrant in Table 2.1, the statistical properties of the
PAAL were evaluated in the original trial as well as during the development of sub-
sequent versions. Table 2.2 summarizes the Cronbach alpha results, which are all
fairly consistent across the four versions. We also found a consistent spread of can-
didate abilities between new versions and the prototype version and a good spread
of item difficulty.
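For readers unfamiliar with the statistic, a minimal sketch of the Cronbach's alpha computation behind this consistency check follows. The response matrix is simulated; only its dimensions (156 candidates, 150 items) mirror the trial described in this chapter.

```python
# A minimal sketch of Cronbach's alpha for a matrix of dichotomously scored
# items (rows = candidates, columns = items). The data are simulated; only the
# dimensions (156 candidates, 150 items) mirror the trial described above.

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

rng = np.random.default_rng(0)
ability = rng.normal(0, 1, (156, 1))        # one ability value per candidate
difficulty = rng.normal(0, 1, (1, 150))     # one difficulty value per item
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
responses = (rng.random((156, 150)) < p_correct).astype(float)
print(round(cronbach_alpha(responses), 3))  # high alpha for coherent data
```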
Table 2.1 Warrants and related evidence for the Evaluation inference

Warrant 1: The psychometric properties of the test are adequate.
Evidence: Psychometric properties of the test as reported in the initial development report (Elder and Knoch 2009) and subsequent development reports (Knoch 2010a, b, 2011).

Warrant 2: Test administration conditions are clearly articulated and appropriate.
Evidence: Responses to feedback questionnaires from the small trial and full trial (Knoch and O'Hagan 2014).

Warrant 3: Instructions and tasks are clear to all test takers.
Evidence: Responses to feedback questionnaires and focus groups from the full trial (Knoch and O'Hagan 2014).
The reliability statistics for the writing task used in this trial were also within
acceptable limits for rater-scored writing tasks, as was the case for previous versions
(Elder et al. 2009).
To examine the second warrant in Table 2.1, we scrutinized the responses from
the feedback questionnaires from the small and the full trial. The small trial of the
online capabilities showed that there were several technical issues that needed to be
dealt with before the test could be used for the full trial. For example, slow loading
time tended to be an issue and some participants had experiences of the site ‘crash-
ing’ or losing their connection with the site. The trial also indicated the need to
further explore functionality to enable auto-correction features of some browsers to
be disabled and adjustments to be made to font size for small screen users. In addi-
tion, feedback from trial participants indicated that fine-tuning of the test-taker
interface was required. For example, some participants reported problems with the
visibility/position of the on-screen timer and with the size of the text box for the
writing task. The results of the trial further showed that there were variations in
functionality across different platforms and devices (most notably, the iPad).
Following this trial of the online capabilities of the system, a number of technical
changes were made to the online system before the full trial was conducted.
Overall, the findings of the student survey and focus groups conducted following
the full trial indicated that students’ experiences of the online testing system were
mostly positive in terms of accessibility of the website, clarity of the task instruc-
tions, and timely receipt of their report. Few students reported any technical prob-
lems, although there were a small number of students who found the system ‘laggy’,
or slow to respond to keystrokes, and some reported that they had lost their internet
connection during the assessment. Overall, the purpose of the assessment and the
benefits of taking it were clear to participating students and almost all of them
appreciated being able to take the assessment from home in their own time.
The final warrant concerns whether students understood all the instructions
and whether the task demands were clear. The questionnaire results from the two
trials show that the students commented positively about these two areas.
In sum, the Evaluation inference was generally supported by the data collected
from the different sources. The statistical properties of the PAAL were excellent,
and the administration conditions suited the students and were adequate for the
purpose, despite a few minor technical problems which may have been caused by
test takers' internet connections rather than the PAAL software. The task demands and task instructions
seemed clear to the test takers.
4.2 Generalizability
Table 2.3 lists the key warrants and associated supporting evidence we will draw on
in our discussion of the Generalizability inference.
The first warrant supporting the Generalizability inference is that different test
forms are parallel in design. The PAAL currently has four parallel forms or versions
(and two more are nearly completed), all of which have been based on the same speci-
fication document. The psychometric results from the development of Versions 2, 3
and 4 show that each of these closely resembles the prototype version (Version 1).
The second warrant is that appropriate equating methods are used to ensure
equivalence of test forms. The development reports of Versions 2, 3 and 4 outline
the statistical equating methods that have been used to ensure equivalence in the
meaning of test scores. Each time, a new version was trialed together with the
anchor version and Rasch analysis was used to statistically equate the two versions.
Table 2.3 Warrants and related evidence for the Generalizability inference

Warrant 1: Different test forms are parallel in design.
Evidence: Review of test features and statistical evidence from reports of the development of parallel versions (Knoch 2010a, b, 2011).

Warrant 2: Appropriate equating procedures are used to ensure equivalent difficulty across test forms.
Evidence: Review of equating evidence from reports of the development of parallel versions (Knoch 2010a, b, 2011).

Warrant 3: Sufficient tasks are included to provide stable estimates of test taker ability.
Evidence: Psychometric properties of the test as reported in the initial development report (Elder and Knoch 2009) and subsequent development reports (Knoch 2010a, b, 2011).

Warrant 4: Test administration conditions are consistent.
Evidence: Discussion of test delivery and results from the survey of the full trial (Knoch and O'Hagan 2014).
Statistically equating the writing tasks is more difficult as no suitable anchor items
are available and only one writing task is included. However, the developers of the
writing task aim to stay close to the test specifications and small trials of
new writing versions are carefully evaluated by a team of test developers to ensure
they are as equivalent in design as possible and are eliciting assessable samples of
writing performance from test candidates. Successive administrations of writing
tasks for the DELA (from which the PAAL writing task is drawn) have shown stable
estimates over different test versions as noted above.
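As an illustration of the anchor-based approach mentioned above, here is a minimal sketch of mean-mean linking between two Rasch calibrations. The chapter states only that Rasch analysis was used for equating; mean-mean linking is one standard way of doing this, and all difficulty values below are invented. In a Rasch framework two calibrations of the same items differ only by a shift constant, which can be estimated from the common (anchor) items.

```python
# A minimal sketch of mean-mean linking between two Rasch calibrations, the
# kind of anchor-based equating described above. Difficulty values (in logits)
# are invented; operational equating starts from a full Rasch calibration.

import numpy as np

def linking_constant(anchor_on_old_scale: np.ndarray,
                     anchor_on_new_scale: np.ndarray) -> float:
    """Shift that moves the new calibration onto the old (anchor) scale."""
    return float(anchor_on_old_scale.mean() - anchor_on_new_scale.mean())

# The same anchor items, as calibrated in the old and the new administration.
anchor_old = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
anchor_new = np.array([-0.9, -0.1, 0.4, 1.1, 1.8])  # sits 0.3 logits higher

shift = linking_constant(anchor_old, anchor_new)     # -> -0.3
new_items = np.array([-0.5, 0.2, 0.9])               # unique to the new form
print(new_items + shift)                             # now on the anchor scale
```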
The third warrant is that sufficient tasks are included to arrive at stable indicators
of candidate performance. Each PAAL has 150 items, 25 for each of the three texts
which make up the C-test and 75 in the cloze elide, as well as one writing task. As
the PAAL is a screening test, the duration of 1 h is already at the upper limit of an
acceptable amount of administration time. It is therefore practically impossible to
add any more tasks. However, the trials have shown that the test results are fairly
reliable indicators of test performance, with students being classified into the same
categories when taking two parallel forms of the test.
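The classification consistency mentioned here can be quantified in several standard ways; the following minimal sketch computes exact agreement and Cohen's kappa over the three support categories for two parallel forms. These indices are an assumption for illustration (the chapter does not name its consistency statistic), and the band assignments are invented.

```python
# A minimal sketch of checking classification consistency across two parallel
# forms: exact agreement and Cohen's kappa over the three support categories.
# The band assignments are invented for illustration.

from collections import Counter

def exact_agreement(a: list[str], b: list[str]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    p_observed = exact_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_chance = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_observed - p_chance) / (1 - p_chance)

form_1 = ["Proficient", "Borderline", "At Risk", "Proficient", "Borderline"]
form_2 = ["Proficient", "Borderline", "Borderline", "Proficient", "Borderline"]
print(exact_agreement(form_1, form_2))            # 0.8
print(round(cohens_kappa(form_1, form_2), 2))     # 0.67
```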
The final warrant supporting generalizability relates to the consistency of the test
administration. As students can take the test in their own time at a place of their
choosing, it is likely that the conditions are not absolutely consistent. For example,
a student might choose to take the test in a student computer laboratory that is not
entirely free of noise, or at home in quiet conditions. However, due to the low stakes
of the test, any differences in test taking conditions are probably not of great con-
cern. Due to the fact that the test is computer-delivered, the timing and visual pre-
sentation of the test items are likely to be the same for all students.
By and large, it seems that the Generalizability inference is supported by the
evidence collected from our trials.
4.3 Explanation and Extrapolation

Table 2.4 presents the warrants underlying the Explanation and Extrapolation infer-
ences as well as the evidence we have collected.
The first warrant states that test takers’ performance on the PAAL relates to their
performance on other assessments of academic language proficiency. During the
development of the prototype version of the PAAL, the cohort of students from the
University of Melbourne had already taken the Diagnostic English Language
Assessment (DELA) and their results could therefore be compared directly with
their performance on the PAAL.
Table 2.5 presents the correlational results of the two PELA tests. It can be seen
that overall screening test results correlated significantly with both the DELA over-
all raw scores and the DELA scaled scores.
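A minimal sketch of the correlational check behind Table 2.5 follows. The paired score vectors are invented, since only the fact of significant correlations, not the raw data, is reported here.

```python
# A minimal sketch of the criterion-relatedness check reported in Table 2.5:
# correlating candidates' PAAL screening totals with their DELA scores.
# Both score vectors are invented for illustration.

from scipy.stats import pearsonr

paal_totals = [55, 62, 70, 48, 81, 66, 74, 59]
dela_scores = [51, 60, 72, 45, 85, 63, 70, 62]

r, p = pearsonr(paal_totals, dela_scores)
print(f"r = {r:.2f}, p = {p:.4f}")
```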
The second warrant states that the scoring rubric captures relevant aspects of
performance. The scoring rubric used to rate the writing performances has been
developed on the basis of test developers’ intuitions from their experience in EAP
Table 2.4 Warrants and related evidence for the Explanation and Extrapolation inferences

Warrant 1: Performance on the PELA relates to performance on other assessments of academic language proficiency.
Evidence: Correlational results from the development report (Elder and Knoch 2009).

Warrant 2: Scoring criteria and rubrics capture relevant aspects of performance.
Evidence: Review of the literature on academic writing.

Warrant 3: Test results are good predictors of language performance in the academic domain.
Evidence: No data collected.

Warrant 4: Characteristics of test tasks are similar to those required of students in the academic domain (and those in the language development courses students are placed in).
Evidence: No data collected.

Warrant 5: Linguistic knowledge, processes, and strategies employed by test takers are in line with theoretically informed expectations and observations of what is required in the corresponding academic context.
Evidence: No data collected.

Warrant 6: Tasks do not unfairly favor certain groups of test takers.
Evidence: No data collected.
The third warrant states that test results are good predictors of language performance in the academic domain. An in-house study relating DELA scores to students’ performance in their first year (as measured through WAMs3) shows that the DELA
(which correlates strongly with the PAAL) is a very strong predictor of WAMs. The
study clearly shows that a higher score on DELA is associated with higher WAMs
and that a higher DELA score is associated with a lower risk of failing.
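A minimal sketch of the kind of analysis such a claim rests on is given below. The DELA scores and WAMs are invented for illustration, and the actual in-house study may well have used different methods.

# Minimal sketch relating screening scores to first-year outcomes.
# All DELA scores and WAMs below are hypothetical.
import numpy as np
from scipy.stats import linregress

dela = np.array([45, 52, 58, 61, 64, 70, 73, 79])  # hypothetical DELA scores
wam  = np.array([48, 55, 59, 62, 66, 71, 70, 78])  # hypothetical first-year WAMs

fit = linregress(dela, wam)
print(f"slope = {fit.slope:.2f}, r = {fit.rvalue:.2f}")
# A positive slope with a strong r mirrors the reported pattern: higher DELA
# scores associated with higher WAMs. The risk-of-failure claim could be
# examined analogously with a logistic regression of pass/fail on DELA score.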
The fourth warrant states that the task types in the PELA are similar to those
required of students in the academic domain. In the case of the PAAL, the test
designers set out to develop a screening test which would be automatically scored
and practical for test takers. It was therefore not possible to closely model the kinds
of tasks test takers undertake in the academic domain (e.g. listening to a lecture and
taking notes). However, the tasks chosen were shown to be very good predictors of
the scores test takers receive on the more direct language tasks included in the
DELA and it was therefore assumed that these indirect tasks could be used as sur-
rogates. Similarly, warrant five sets out that the test takers’ cognitive processes
would be similar when taking the PELA and when completing tasks in the academic
domain. Again, due to the very nature of the test tasks chosen, backing for this war-
rant might be difficult to collect. However, studies investigating the cognitive pro-
cesses of test takers completing indirect tasks such as C-tests and cloze elide (e.g.
Matsumura 2009) have shown that test takers draw on a very wide range of linguis-
tic knowledge to complete these tasks, including lexical, grammatical, lexico-
grammatical, syntactic and textual knowledge.
As for the final warrant, potential evidence has yet to be gathered from a larger
test population encompassing students from different backgrounds, including
native English-speaking (NES) and non-native English-speaking (NNES) students, and
those in university Foundation courses, who may be from low literacy backgrounds
or have experienced interrupted schooling. A previous study by Elder, McNamara
and Congdon (2003) in relation to the DELNA screening component at Auckland
would suggest that such biases may affect performance on certain items but do not
threaten the validity of the test overall. Nevertheless, the warrant of absence of bias
needs to be tested for this new instrument.
In sum, it would seem that the warrants for which evidence is available are rea-
sonably well supported, with the caveat that the scope and screening function of the
PAAL inevitably limit its capacity to fully represent the academic language
domain.
4.4 Decisions
Table 2.6 sets out the warrants and associated evidence for the Decisions
inference.
3 WAM (weighted average mark) scores are the mean of students’ first-year course grades. The results of this in-house study are from an unpublished report undertaken for the University of Melbourne English Language Development Advisory Group, the committee which oversees the English language policy of the University of Melbourne.
Table 2.6 Warrants and related evidence for the Decisions inference

1. Students are correctly categorized based on their test scores. Evidence: review of standard-setting activities.
2. The test results include feedback on test performance and a recommendation. Evidence: questionnaires and focus groups of the full trial.
3. The recommendation is closely linked to on-campus support. Evidence: review of institutional policy and evidence from focus groups of the full trial (Knoch and O’Hagan 2014).
4. Assessment results are distributed in a timely manner. Evidence: review of test documentation and evidence from focus groups of the full trial.
5. The test results are available to all relevant stakeholders. Evidence: review of test procedures.
6. Test users understand the meaning and intended use of the scores. Evidence: questionnaires and focus groups of the full trial.
The first warrant in the Decision inference states that the students are categorized
correctly based on their test score. Finding backing for this warrant involved two
standard-setting activities. The first was conducted as part of the development of the
prototype of the PAAL. A ROC (receiver operating characteristic) curve analysis,
a standard-setting technique, was used to establish optimum cut-scores or
thresholds on the screening components of the test (C-test and cloze elide). A num-
ber of alternative cut-scores were proposed using either a specified DELA Writing
score or an overall DELA score (representing the average of reading and listening
and writing performance) as the criterion for identification of students as linguisti-
cally at risk. While these different cut-scores vary in sensitivity and specificity (see
Elder and Knoch 2009 for an explanation of these terms), they are all acceptably
accurate as predictors, given the relatively low stakes nature of the PAAL. Moreover,
the level of classification accuracy can be improved through the use of the writing
score to assist in decisions about borderline cases.
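To make the ROC procedure concrete, the sketch below shows how an optimum threshold can be located with scikit-learn, using Youden’s J (sensitivity + specificity − 1) as the selection rule. The at-risk flags and screening scores are hypothetical, and the published analysis (Elder and Knoch 2009) may have weighted sensitivity and specificity differently when choosing among candidate cut-scores.

# Minimal sketch of a ROC-based cut-score search; all data are hypothetical.
import numpy as np
from sklearn.metrics import roc_curve

at_risk = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])           # 1 = at risk on the DELA criterion
screen  = np.array([38, 42, 47, 50, 52, 58, 61, 66, 70, 75])  # screening-test scores

# roc_curve treats higher values as more 'positive'; negate the scores so
# that low screening scores count as evidence of risk.
fpr, tpr, thresholds = roc_curve(at_risk, -screen)

j = tpr - fpr               # Youden's J = sensitivity + specificity - 1
best = int(np.argmax(j))
cut = -thresholds[best]     # undo the negation to recover a raw cut-score
print(f"cut-score = {cut:.0f}, sensitivity = {tpr[best]:.2f}, "
      f"specificity = {1 - fpr[best]:.2f}")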
A further standard-setting exercise was conducted in preparation for the full trial.
To set the cut-scores for the three result categories outlined in the previous section
(i.e. ‘proficient’, ‘borderline’, ‘at risk’), a standard-setting exercise was conducted
with a team of trained raters at the Language Testing Research Centre using the
writing scripts from the small trial. All 50 writing scripts were evaluated by the
raters individually, evaluations were compared, and rating decisions moderated
through discussion until raters were calibrated with each other and agreement was
reached on the placement of each script in one of three proficiency groups: high,
medium or low. To arrive at the two cut-scores needed, i.e. between ‘proficient’ and
‘borderline’, and between ‘borderline’ and ‘at risk’, we used the analytic judgement
method (Plake and Hambleton 2001), a systematic procedure for identifying the most
defensible point for each cut-score. On this basis, the cut-scores from the development
trial were slightly shifted.
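For the group-boundary step of such an exercise, one deliberately simplified computation is to place each cut-score midway between the boundary scripts of adjacent groups, as sketched below with invented analytic writing scores. The full analytic judgement method involves further judgments about borderline performances, so this sketch illustrates only the basic arithmetic.

# Simplified sketch of deriving two cut-scores from moderated script groupings.
# All analytic writing scores are hypothetical.
import numpy as np

low    = np.array([9, 10, 11, 12, 13])   # scripts the raters placed in 'low'
medium = np.array([14, 15, 15, 16, 17])  # 'medium' scripts
high   = np.array([18, 19, 20, 21, 23])  # 'high' scripts

def midpoint_cut(lower_group, upper_group):
    # Place the cut halfway between adjacent groups' boundary scores.
    return (lower_group.max() + upper_group.min()) / 2

print("at risk / borderline cut:", midpoint_cut(low, medium))       # 13.5
print("borderline / proficient cut:", midpoint_cut(medium, high))   # 17.5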
The second warrant states that test takers receive feedback on their performance
and a recommendation. The feedback component following the PAAL is minimal, a
fact that was criticized by the participants in the full trial. Students expressed disap-
pointment with the results statement given in the report, describing it as somewhat
generic and lacking in detail. Students in general would have preferred more diag-
nostic feedback to guide their future learning. Many also indicated they would have
liked to discuss their report with an advisor to better understand their results, and to
learn more about support opportunities.
Concerns were also raised about the vagueness of the support recommendation
given in the report, with many students wanting a clearer directive for what was
required of them. Many students stated they would have liked to receive follow up
advice on how to act on the recommendation, with many reporting that they did not
know how to access an advisor, or that they had received advice but it had not met
their expectations.
The third warrant states that the recommendation is closely linked to on-campus
support. Availability of appropriate support was identified as a problem by students
who felt that offerings were not suited to their proficiency, level of study or aca-
demic discipline, or were otherwise not appropriate to their needs. Some students
were also concerned that, in accessing the recommended support, they would incur
costs additional to their course fees. Where a credit-bearing course was recom-
mended, students expressed concerns about the implications for their academic
record of failing the course.
The fourth warrant states that the assessment results are distributed in a timely
manner. This was the case during the full trial, with results being distributed within
1–2 days of a student taking the assessment. Accordingly, in the evaluation of the
full trial, students commented positively on the timely manner in which the results
were distributed.
The next warrant relates to the availability of assessment results. During the full
trial implementation, all accessible stakeholders were made aware that
assessment results could be requested from the Language Testing Research Centre.
Students were sent their results as soon as possible, and the Student Centres of the
two cohorts were regularly updated with spreadsheets of the results. Unfortunately,
because of the size of the cohorts, it was not possible to identify lecturers who
would be responsible for teaching the students in question, and therefore lecturers
may not have been aware that they could request the results.
The final warrant states that test users understand the meaning and the intended
uses of the scores. The results of the evaluation of the full trial indicate that this was
not an issue, at least from the test-taker perspective. The purpose of the assessment
and the benefits of taking it were generally clear, and students in the
focus groups indicated that they all understood that the test was intended to be help-
ful, that it was important but not compulsory, and that it did not affect grades.
Overall, the Decisions inference was only partially supported. We could find
support for some aspects, including the categorization of the test takers, the prompt
handling of test scores and the fact that test takers generally understood their meaning.
No data was collected from other test users, however, so the extent to which they
understood the meanings and intended uses of the scores requires further investiga-
tion. The warrants relating to the feedback profile, the recommendation and the
close link to on-campus support were not supported.
4.5 Consequences
Table 2.7 outlines the warrants and associated evidence for the Consequences
inference.
The first warrant underlying the Consequences inference states that all targeted
test takers take the test; the streamlined test design was intended to make universal
administration to all new students feasible. During the
full trial it became evident that institutional policy at the University of Melbourne
makes it impossible to mandate such an assessment because it goes beyond content
course requirements. Of the undergraduate Bachelor of Commerce cohort, only 110
students out of 310 (35 %) took the assessment. The numbers were even lower for
the Master of Engineering cohort, where only 60 out of 491 students (12 %)
took the PAAL. Evidence from the focus groups also showed that students
differed in their understandings of whom the PAAL is for, with many believing it to
be intended for ‘international’ students only. There was overall no sense among
students that the assessment was meant to be universally administered.
The next warrant states that test takers’ perceptions of the assessment are positive
and that they find the assessment useful. The data from the full trial, some findings
of which have already been reported under the Decision inference above, show
mixed results. Students were generally positive about the ease of the test administra-
tion and the information provided prior to taking the test, but were less positive
about the level of feedback provided and the follow-up support options available.
Therefore, this warrant is only partially supported.
Table 2.7 Warrants and related evidence for the Consequences inference

1. All targeted test takers sit for the test. Evidence: the full trial.
2. The test does not result in any stigma or disadvantage for students. Evidence: questionnaires and focus groups.
3. Test takers’ perceptions of the test and its usefulness are positive. Evidence: questionnaires and focus groups.
4. The feedback from the test is useful and directly informs future learning. Evidence: questionnaires and focus groups.
5. Students act on the test recommendation. Evidence: questionnaires, focus groups of the full trial and follow-up correspondence with Student Centres.

The next warrant states that the feedback from the assessment is useful and
directly informs future learning. It is clear from the data from the full trial that the
students did not find the feedback from the assessment particularly useful; however,
it is important to remember that the PAAL is intended as a screening test, which is
designed to identify students deemed to be at risk in minimum time and with mini-
mum financial expenditure. When the test was designed, it was clear that no detailed
feedback would be possible due to financial as well as practical limitations. While
the analytically scored writing task potentially allowed for more detailed feedback,
the resources to prepare this feedback were not available for a large cohort of stu-
dents such as the one that participated in the full trial. More suitable online or on-
campus support options would also have improved the chances of this warrant being
supported. Unfortunately, offering more varied support provisions is costly and, in
the current climate of cost-savings, probably not a viable option in the near future.
The final warrant states that students act on the score recommendation.
Approximately 15 % of students taking the PAAL as part of the full trial were
grouped into the ‘at risk’ group. It is not clear how many of these students acted
upon the recommendation provided to them by enrolling in a relevant English lan-
guage support course, but historically the compliance rates at the University of
Melbourne have been low. We suspect that the same would apply to this cohort for
many reasons, including the limited array of support options and the lack of any
institutional incentive or requirement to take such courses.
Overall, it can be argued that the Consequences inference was either not sup-
ported by the data collected in the full trial or that the relevant evidence was
lacking.
The chapter has described a new type of PELA instrument, which builds on previ-
ous models of PELA adopted in the Australian and New Zealand context. The
online PAAL, taking just 1 h to administer, was designed to be quick and efficient
enough to be taken by a large and disparate population of students immediately fol-
lowing their admission to the university, for the purpose of flagging those who
might face difficulties with academic English and identifying the nature of their
English development needs. Various types of validity evidence associated with the
PAAL have been presented, using an argument-based framework for PELA valida-
tion previously explicated by the first two authors (Knoch and Elder 2013) and
drawing on data from a series of trials.
The argument-based framework identifies the inferences that underlie validity
claims about a test, and the warrants associated with each inference. The first is
the Evaluation inference with its warrants of statistical robustness, appropriate
test administration conditions and clarity (for test-takers) of tasks and instruc-
tions. These warrants were generally supported by the different sources of evi-
dence collected, with an item analysis of each test component yielding excellent
reliability statistics, the writing rater reliability being within expected limits, and
feedback from test takers revealing that instructions were clear and tasks generally
manageable, apart from a few remediable technical issues associated with the
online testing platform.
Warrants that were tested in relation to the second, Generalizability, inference
were that different forms of the test were parallel in design and statistically compa-
rable in level of difficulty, that there were sufficient tasks or items to provide stable
estimates of ability and that test administration conditions were consistent. Results
reported above indicate that different forms of the test were comparable, both in
content and difficulty and that candidates were sorted into the same categories,
regardless of the version they took. The online delivery of the PAAL moreover
ensures consistency in the way test tasks are presented to candidates and in the time
allowed for task performance.
As for the third inference, Explanation and Extrapolation, which has to do with
the test’s claim to be tapping relevant language abilities, correlational evidence
from the development trial showed a strong relationship between the PAAL scores
for Parts A and B and the more time-intensive listening and reading items of the
previously validated DELA, which had been administered concurrently to trial can-
didates. The warrant that the writing criteria capture relevant aspects of the aca-
demic writing construct is supported by research undertaken at the design stage.
The test’s predictive power is also supported by in-house
data collected at the University of Melbourne showing the strong predictive
relationship between DELA scores and WAM scores. Other warrants associated with this
inference have yet to be tested, however, and it is acknowledged that the length of
the test and the indirect nature of the screening tasks in Parts A and B somewhat
constrain its capacity to capture the academic language ability construct.
The Decision inference, the fourth in the argument-based PELA framework,
encompasses warrants relating to the categorization of students based on test scores
and the way test results are reported and received. Here the evidence presented gives
a mixed picture. Standard-setting procedures ensured that the test’s capacity to clas-
sify candidates into different levels was defensible. The meaning and purpose of the
testing procedure was well understood by test users and score reports were made
available to them in a timely manner. However, test-taker feedback revealed some
dissatisfaction with the level of detail provided in the score reports and with the
advice given about further support – perhaps because the avenues for such support
were indeed quite limited. The fact that feedback was gathered from only a portion
of the potential test taker population may also be a factor in these reactions as it
tends to be the more motivated students who participate in trials. Such students are
more likely to engage with the testing experience and expect rewards from it, includ-
ing a full description of their performance and associated advice.
Evidence supporting the warrants relating to the fifth, Consequences, inference
is even more patchy. Although the PAAL is designed to be administered to all
incoming students, participation in the testing was by no means universal in the
Faculties selected for the trial. In addition, there were mixed feelings about the use-
fulness of the initiative in informing future learning, due partly to the limited diag-
nostic information provided in the score reports but, more importantly, to the lack of
available support options linked to these reports. Whether many students in the ‘at
risk’ category went on to act on the recommendations they received is doubtful,
given the historically low uptake of such support at this institution.
References
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford
University Press.
Bonanno, H., & Jones, J. (2007). The MASUS procedure: Measuring the academic skills of univer-
sity students – A diagnostic assessment. Sydney: Learning Centre, University of Sydney.
Available at: http://sydney.edu.au/stuserv/documents/learning_centre/MASUS.pdf
Dunworth, K. (2009). An investigation into post-entry English language assessment in Australian
universities. Journal of Academic Language & Learning, 3(1), A1–A13.
Elder, C., & Knoch, U. (2009). Report on the development and trial of the Academic English
Screening Test (AEST). Melbourne: University of Melbourne.
Elder, C., & von Randow, J. (2008). Exploring the utility of a web-based English language screen-
ing tool. Language Assessment Quarterly, 5(3), 173–194.
Elder, C., McNamara, T., & Congdon, P. (2003). Rasch techniques for detecting bias in perfor-
mance assessments: An example comparing the performance of native and non-native speakers
on a test of academic English. Journal of Applied Measurement, 4(2), 181–197.
Elder, C., Knoch, U., & Zhang, R. (2009). Diagnosing the support needs of second language writ-
ers: Does the time allowance matter? TESOL Quarterly, 43(2), 351–359.
Ginther, A., & Elder, C. (2014). A comparative investigation into understandings and uses of the
TOEFL iBT test, the International English Language Testing System (Academic) Test, and the
Pearson Test of English for Graduate Admissions in the United States and Australia: A case
study of two university contexts (ETS research report series, Vol. 2). Princeton: Educational
Testing Service.
English Language Development Advisory Group. (2012). Performance in New Generation degrees
and the University’s policy on English language diagnostic assessment and support. Melbourne:
University of Melbourne.
Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3),
527–535.
Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (pp. 17–64).
Westport: American Council on Education/Praeger.
Klein-Braley, C. (1985). A cloze-up on the C-Test: A study in the construct validation of authentic
tests. Language Testing, 2(1), 76–104.
Knoch, U. (2010a). Development report of version 2 of the Academic English Screening Test
(AEST). Melbourne: University of Melbourne.
Knoch, U. (2010b). Development report of version 3 of the Academic English Screening Test.
Melbourne: University of Melbourne.
Knoch, U. (2011). Development report of version 4 of the Academic English Screening Test.
Melbourne: University of Melbourne.
Knoch, U., & Elder, C. (2013). A framework for validating post-entry language assessments
(PELAs). Papers in Language Testing and Assessment, 2(2), 1–19.
Knoch, U., & Elder, C. (2016). Post-entry English language assessments at university: How diag-
nostic are they? In V. Aryadoust & J. Fox (Eds.), Trends in language assessment research and
practice: The view from the Middle East and the Pacific Rim. Newcastle-upon-Tyne: Cambridge
Scholars Publishing.
Knoch, U., & O’Hagan, S. (2014). Report on the trial implementation of the post-entry assessment
of academic language (PAAL). Unpublished paper. Melbourne: University of Melbourne.
Knoch, U., Elder, C., & McNamara, T. (2012a). Report on Feasibility Study of introducing the
Academic English Screening Test (AEST) at the University of Melbourne. Melbourne:
University of Melbourne.
Knoch, U., O’Hagan, S., & Kim, H. (2012b). Preparing the Academic English Screening Test
(AEST) for computer delivery. Melbourne: University of Melbourne.
Matsumura, N. (2009). Towards identifying the construct of the cloze-elide test: A mixed-methods
study. Unpublished Masters thesis, University of Melbourne.
Plake, B., & Hambleton, R. (2001). The analytic judgement method for setting standards on com-
plex performance assessments. In G. J. Cizek (Ed.), Setting performance standards: Concepts,
methods, and perspectives (pp. 283–312). Mahwah: Lawrence Erlbaum.
Read, J. (2015). Assessing English proficiency for university study. London: Palgrave Macmillan.
Chapter 3
Mitigating Risk: The Impact of a Diagnostic
Assessment Procedure on the First-Year
Experience in Engineering
Abstract The global movement of students, the linguistic and cultural diversity of
university classrooms, and mounting concerns about retention and program com-
pletion have prompted the increased use of post-entry diagnostic assessment, which
identifies students at risk and provides them with early academic support. In this
chapter we report on a multistage-evaluation mixed methods study, now in its sixth
year, which is evaluating the impact of a diagnostic assessment procedure on the
first-year experience, student engagement, achievement, and retention in an under-
graduate engineering program. The diagnostic assessment procedure and concomi-
tant student support are analyzed through the lens of Activity Theory, which views
socio-cultural object-oriented human activity as mediated through the use of tools,
both symbolic (e.g., language) and material (e.g., computers, pens). Changes in
Activity Systems and their interrelationships are of central interest. In this chapter
we report on changes resulting from modifications to the diagnostic assessment
procedure that have increased its impact on the first-year experience by: (1) apply-
ing a disciplinary (rather than generic) assessment approach which was fine grained
enough to trigger actionable academic support; (2) embedding the diagnostic assess-
ment procedure within a required first-year engineering course, which increased the
numbers of students who voluntarily sought support; and (3) paying increased
attention to the development of social connections, which play an important role in
student retention and success.
1 Introduction
1 English and French are official languages of Canada and serve as mediums of instruction in Canadian universities.
In this study, the diagnostic assessment procedure has two goals: (1) to identify
entering students at-risk in the first year of their undergraduate engineering pro-
gram; and (2) to generate a useful learning profile (i.e., one that, as Cai 2015 notes,
must lead to actionable feedback), which is linked to individually tailored (Fox 2009)
and readily available academic support for first-year engineering students.
The study is informed by sociocultural theory (Vygotsky 1987), which views
knowledge as contextualized and learning as social (e.g., Artemeva and Fox 2014;
Brown et al. 1989; Lave and Wenger 1991). From this perspective, both the assess-
ment and the pedagogical interactions that it triggers are situated within and
informed by the context and the community in which they occur (Lave and Wenger
1991). However, context is a complex and multi-layered construct, extending from
micro to increasingly (infinitely) macro levels. Thus, early in the study we encoun-
tered what is known in the literature as the “frame problem” (Gee 2011a, p. 67,
2011b, p. 31), namely, determining the degree of situatedness that would be most
useful for the purposes of the diagnostic assessment procedure.
In the initial implementation of the diagnostic assessment, we operationalized
what Read (2012) refers to as a general academic language proficiency construct, of
relevance to university-level academic work. For our pilot in 2010, we drew three
tasks from the Diagnostic English Language Needs Assessment (DELNA) test bat-
tery (see, Read 2008, 2012; or, the DELNA website: http://www.delna.auckland.
ac.nz/). DELNA is a Post-Entry Language Assessment (PELA) procedure. As dis-
cussed in Read (2012), such PELA procedures operationalize the construct of aca-
demic English proficiency and draw on generic assessment materials of relevance
across university faculties and programmes. Alderson (2007) argues that “diagnosis
need not concern itself with authenticity and target situations, but rather needs to
concentrate on identifying and isolating components of performance” (p. 29).
Isolating key components of performance as the result of diagnosis provides essen-
tial information for structuring follow-up pedagogical support.
Of the three DELNA tasks that were used for the initial diagnostic assessment
procedure, two were administered and marked by computer and tested academic
vocabulary knowledge and reading (using multiple-choice and cloze-elide test for-
mats). The third task tested academic writing with a graph interpretation task that
asked test takers to write about data presented in a histogram. The task was marked
by raters trained to use the DELNA rubric for academic writing. We drew two
groups of raters, with language/writing studies or engineering backgrounds, and all
were trained and certified through the DELNA on-line rater training system (Elder
and von Randow 2008). However, while the language/writing raters tended to focus
on language-related descriptors, the engineering raters tended to focus on content.
Although inter-rater reliability was high (.94), there was a disjuncture between what
the raters attended to in the DELNA rating scale and what they valued in the mark-
ing (Fox and Artemeva 2011). Further, some of the descriptors in the DELNA
generic grid were not applicable to the engineering context (e.g., valuing length as
opposed to concise expression; focusing on details to support an argument, rather
than interpreting trends). The descriptors needed to map onto specific, actionable
3 Theoretical Framework
Our analysis of the diagnostic assessment procedure and concomitant student sup-
port discussed in this chapter are enriched through the use of Activity Theory (AT)
(Engeström 1987; Engeström et al. 1999; Leont’ev 1981). Vygotsky (1987) argued
that object-oriented human activity is always mediated through the use of signs and
symbols (e.g., language, texts). One of Vygotsky’s students, Leont’ev (1981), later
posited that collective human activity is mediated not only by symbolic, but also by
material tools (e.g., pens, paper). Leont’ev (1981) represented human activity as a
triadic structure, often depicted as a triangle, which includes a subject (i.e., human
actors), an object (i.e., something or someone that is acted upon), and symbolic or
material tools which mediate the activity. The subject has a motive for acting upon
the object in order to reach an outcome. AT was further developed by Engeström
(1987), who observed that “object-oriented, collective and culturally mediated
human activity” (Engeström and Miettinen 1999, p. 9) is best modelled as an activ-
ity system (e.g., see Figs. 3.1 and 3.2). In other words, “an activity system comes
into existence when there is a certain need . . . that can be satisfied by a certain activ-
ity” (Artemeva 2006, p. 44). As Engeström noted, multiple activity systems interact
over time and space. In the present chapter, we consider the diagnostic assessment
procedure as comprising two interrelated activity systems (Fig. 3.1): a diagnostic
assessment activity and a pedagogical support activity.
The diagnostic assessment procedure prompts or instigates the development of
the activity system of the undergraduate engineering student (Fig. 3.2), who
voluntarily seeks support from peer mentors within the Centre in order to improve
his or her academic achievement (i.e., grades, performance).
Fig. 3.1 Activity systems of the diagnostic assessment procedure in undergraduate engineering: a diagnostic assessment activity (subject: raters of the diagnostic assessment; object: students; outcome: learning profile) and a pedagogical support activity (subject: peer mentors; object: students; outcome: learning, academic development, disciplinary acculturation)

Fig. 3.2 Activity system of the undergraduate engineering student (subject: engineering student; object: diagnostic assessment results, course assignments, tests, projects; outcome: academic achievement, improved performance in engineering courses)
It is important to note that the actors/subjects in the activity systems of the diag-
nostic assessment procedure (Fig. 3.1) are not the students themselves. As indicated
above, the raters in the diagnostic assessment activity system become peer mentors
in the pedagogical support activity. However, the students who use the information
provided to them in the learning profile and seek additional feedback or pedagogical
support bring into play another activity system (Fig. 3.2), triggered by the activities
depicted in Fig. 3.1.
In our study, the activities presented in Figs. 3.1 and 3.2 are all situated within
the community of undergraduate engineering, and, as is the case in any activity
system (Engeström 1987), are inevitably characterized by developing tensions and
contradictions. For example, there may be a contradiction between the mentors’ and
the students’ activity systems because of contradictions in the motives for these
activities. Mentors working with students within the Centre are typically motivated
to support students’ long-term learning in undergraduate engineering, whereas most
students tend to have shorter-term motives in mind, such as getting a good grade on
an assignment, clarifying instructions, or unpacking what went wrong on a test.
According to AT, such tensions or contradictions between the mentors’ activity sys-
tem and the students’ activity system are the site for potential change and develop-
ment (Engeström 1987). Over time, a student’s needs evolve and the student’s
motive for activity may gradually approach the motive of the mentors’ activity. One
of the primary goals of this study is to find evidence that the activity systems in Figs.
3.1 and 3.2 are aligning to ever greater degrees as motives become increasingly
interrelated. Evidence that the activity systems of mentors and students are aligning
will be drawn from students’ increasing capability, and awareness of what works,
why, and how best to communicate knowledge and understanding to others within
their community of undergraduate engineering as an outcome of their use of the
Centre. Increased capability and awareness enhance the first-year experience
(Artemeva and Fox 2010; Scanlon et al. 2007), and ultimately influence academic
success.
The diagnostic assessment procedure is also informed by empirical research
which is consistent with the AT perspective described above. This research has
investigated undergraduates’ learning in engineering (e.g., Artemeva 2008, 2011;
Artemeva and Fox 2010), academic and social engagement (Fox et al. 2014), and
the process of academic acculturation (Cheng and Fox 2008), as well as findings
produced in successive stages of the study itself (e.g., Fox et al. 2016). We are ana-
lyzing longitudinal data on an on-going basis regarding the validity of the at-risk
designation and the impact of pedagogical support (e.g., retention, academic suc-
cess/failure, the voluntary use of pedagogical support), and using a case study
approach to develop our understanding of the phenomenon of risk in first-year
engineering.
Having provided a brief discussion of the theoretical framework and on-going
empirical research that inform the diagnostic assessment procedure, in the section
below we describe the broader university context within which the procedure is situ-
ated and the issues related to its development and implementation.
Fig. 3.3 Issues and tensions in diagnostic assessment: an ongoing balancing act (raters; students; mode of delivery; assessment quality; presentation and marketing; pedagogical support)
Once a mandate for an assessment procedure has been established, and it has been
determined that a test or testing procedure would best address that mandate, the
critical first question is ‘do we know what we are measuring?’ (Alderson 2007,
p. 21). As Alderson points out, “Above all, we need to clarify what we mean by
diagnosis … and what we need to know in order to be able to develop useful diag-
nostic procedures” (p. 21). Assessment quality depends on having theoretically
informed and evidence-driven responses to each of the following key questions:
4.1.1 Is the construct well-enough understood and defined to warrant its operational
definition in the test?
4.1.2 Are the items and tasks meaningfully operationalizing the construct? Do they
provide information that we can use to shape pedagogical support?
4.1.3 Is the rating scale sufficiently detailed to provide useful diagnostic
information?
4.1.4 Are the raters consistent in their interpretation of the scale?
4.1.5 Do the rating scale and the resulting score or scores provide sufficient infor-
mation to trigger specific pedagogical interventions?
Once the results of a diagnostic assessment are available, there are considerable
challenges in identifying the most effective type of pedagogical support to provide.
The other contributors to this volume have identified the many different approaches
taken to provide academic support for students, especially for those who are identi-
fied as being at-risk. For example, in some contexts academic counsellors are
assigned to meet with individual students and provide advice specific to their needs
(e.g. Read 2008). In other contexts, a diagnosis of at-risk triggers a required course
or series of workshops.
Determining which pedagogical responses will meet the needs of the greatest
number of students and have the most impact on their academic success necessitates
a considerable amount of empirical research, trial, and (unfortunately) error.
Tensions occur, however, because there are always tradeoffs between optimal
support and what is practical and possible in a given context (see Sect. 4.3
below).
Research over time helps to identify the best means of support. Only with time is
sufficient evidence accumulated to support pedagogical decisions and provide
evidence-driven responses to the following questions.
4.2.1 Should follow-up pedagogical support for students be mandatory? If so, what
should students be required to do?
4.2.2 Should such support be voluntary? If so, how can the greatest numbers of
students be encouraged to seek support? How can students at-risk be reached?
4.2.3 What type of support should be provided, for which students, when, how, and
by whom?
4.2.4 How precisely should information about test performance and pedagogical
support be communicated to the test taker/student? What information should be
included?
4.2.5 How do we assess the effectiveness of the pedagogical support? When should
it begin? How often? For what duration?
4.2.6 What on-going research is required to evaluate the impact of pedagogical
interventions? Is there evidence of a change in students’ engagement and their
first-year experience?
4.2.7 What evidence do we need to collect in order to demonstrate that interventions
are working to increase overall academic retention and success?
4.3.5 How should we recruit and train personnel to provide effective pedagogical
support?
4.3.6 Who should monitor their effectiveness?
4.3.7 How clear is the tradeoff between cost and benefit?
4.3.8 How can we tailor the diagnostic assessment procedure and pedagogical fol-
low-up to maximize practicality, cost-effectiveness, and impact?
Given the complex balancing act of developing and implementing a diagnostic
assessment procedure, we have limited our discussion in the remainder of this chap-
ter to two changes that have occurred since the assessment was first introduced in
2010. These evidence-driven changes were perhaps the most significant in improv-
ing the quality of the assessment itself and increasing the overall impact of the
assessment procedure. The two changes that were implemented in 2014 are:
• Embedding diagnostic assessment within a mandatory, introductory engineering
course; and,
• Providing on-going pedagogical support during the full academic year through
the establishment of a permanent support Centre for engineering students.
These changes were introduced as a result of research studies, occurring at dif-
ferent phases and multiple stages of development and implementation of the proce-
dure, informed by the theoretical framework, and guided by the questions listed
above and the overall research question: Does a diagnostic assessment procedure,
combining assessment with individual pedagogical support, improve the first-year
experience, achievement, and retention of undergraduate engineering students? If
so, in what ways, how, and for whom? In this chapter we focus on evidence which
relates the two changes in 2014 to differences in the first-year experience.
5.1 Method
The two key changes to the diagnostic assessment procedure, namely, embedding it
in a required course and providing a permanent Centre for support, were supported
by initial results of the longitudinal mixed-methods study which is exploring the
effectiveness of this diagnostic assessment procedure by means of a multistage-
evaluation design (Creswell 2015). As Creswell notes, “[t]he intent of the multi-
stage evaluation design is to conduct a study over time that evaluates the success of
a program or activities implemented into a setting” (p. 47). It is multistage in that
each phase of research may, in effect, constitute many stages (or studies). These
stages may be qualitative (QUAL or qual, depending upon dominance), quantitative
(QUAN or quan, depending upon dominance) or mixed methods, but taken together
they share a common purpose. Figure 3.4 provides an overview of the research
design and includes information on the studies undertaken within phases of the
research; how the qualitative and quantitative strands were integrated; and the rela-
tionship of one phase of the research to the next from 2010 to 2016.
Engaging in what is essentially a ‘program-of-research’ approach is both com-
plex and time-consuming. Findings are reported concurrently as part of the overall
research design. For example, within the umbrella of the shared purpose for this
longitudinal study, Fox et al. (2016) report on a Phase 1 research project investigat-
ing the development of a writing rubric, which combines generic language assess-
ment with the specific requirements of writing for undergraduate engineering
purposes. As discussed above, the new rubric was the result of tensions in the diag-
nostic assessment procedure activities (Fig. 3.1) in Phase 1 of the study. The generic
DELNA grid and subsequent learning profile contradicted some of the expectations
of engineering writing (e.g., awarded points for length, anticipated an introduction
and conclusion, did not value point form lists). Increasing the engineering relevance
of the rubric also increased the diagnostic feedback potential of the assessment
procedure and triggered differential pedagogical interventions. In other words, ten-
sions in the original activity system led to the development of a more advanced
activity system (Engeström 1987), which better identifies and isolates components
of performance (Alderson 2007). In another Phase 1 project, McLeod (2012)
explored the usefulness of diagnostic feedback in a case study which tracked the
academic acculturation (Artemeva 2008; Cheng and Fox 2008) of a first-year under-
graduate engineering student over the first year of her engineering program. She
documented the interaction of diagnostic assessment feedback, pedagogical sup-
port, and the student’s exceptional social skill in managing risk, as the student navi-
gated through the challenges of first-year engineering. McLeod’s work further
supports the contention that social connections within and across a new disciplinary
community directly contribute to a student’s academic success.
Another component of the multistage evaluation design includes a qualitative
case study in Phase 2 with academic success in engineering being the phenomenon
of interest. The case study is drawing on semi-structured interviews with first-year
and upper-year engineering students, TAs, professors, peer mentors, administrators,
etc. to investigate the impact of the diagnostic assessment from different stakeholder
perspectives. There is also a large-scale quantitative study in Phase 2, which is track-
ing the academic performance of students who were initially identified as being at-
risk and comparing it with that of their more academically skilled
counterparts using various indicators of student academic success (e.g., course
withdrawals, course completions and/or failures, drop-out rates, and use of the
Centre which provides pedagogical support).
As indicated above, the design is recursive in that the analysis in Phase 2 will be
repeated for a new cohort of entering engineering students and the results will fur-
ther develop the protocols designed to support their learning. The current study will
be completed in 2016. On-going in-house research is now the mandate of the per-
manent Centre (see Fig. 3.4, Phase 3).
5.2 Participants
As indicated in the introduction to this chapter, from the beginning there were two
overarching concerns to address in evaluating the impact of the diagnostic assess-
ment procedure within this engineering context:
• Development of a post-entry diagnostic assessment which would effectively
identify undergraduate engineering students at-risk early in their first term of
study; and
• Provision of effective and timely academic support for at-risk students, as well as
any other first-year engineering students, who wanted to take advantage of the
support being offered.
Initially, although students were directly encouraged to take the diagnostic
assessment, their participation and use of the information provided by the assess-
ment as well as follow-up feedback and post-entry support was voluntary. There
were no punitive outcomes (e.g., placement in a remedial course; required atten-
dance in workshops; reduction in course loads and/or demotion to part-time status).
Students received feedback and advice on their diagnostic assessment results by
email a week after completing the assessment. Their performance was confidential
(neither their course professors nor the TAs assigned to their courses were informed
of their results). The emails urged students to drop in for additional feedback at a
special Centre to meet with other, upper year students (in engineering and writing/
language studies) and get additional feedback, information, and advice on how to
succeed in their engineering courses.
In 2014, two critical changes occurred in the delivery of the diagnostic assess-
ment procedure which dramatically increased its impact. These two changes, which
are the focus of the findings below, were the cumulative result of all previous stages
of research, and as indicated above, have had to date the largest impact on the qual-
ity of the diagnostic assessment procedure.
Each of the key changes implemented in 2014 is discussed separately in relation
to the findings which informed the changes.
In Phase 1 of the study (2011–2012), 489 students (50 % of the first-year under-
graduate engineering cohort) were assessed with three of DELNA’s diagnostic tasks
leased from the University of Auckland (DELNA’s test developer). The DELNA
tasks were administered during orientation week – the week which precedes the
start of classes in a new academic year and introduces new students to the univer-
sity. Students were informed of their results by email and invited on a voluntary
basis to meet with peer mentors to receive pedagogical support during the first
months of their engineering program. A Support Centre for engineering students
was set up during the first 2 months of the 2011–2012 term. It was staffed by upper-
year engineering and writing/language studies students who covered shifts in the
Centre from Monday through Friday, and who had previously rated the DELNA
writing sub-test.
During the 6 weeks the Centre was open, only 12 (2 %) of the students sought
feedback on their diagnostic assessment results. The students tended to be those
who were outstanding academically and would likely avail themselves of every
opportunity to improve their academic success, or students who wanted information
on an assignment. Only 3 of the 27 students who were identified as at-risk (11 %)
visited the Centre for further feedback on their results and took advantage of the
pedagogical support made available. At the end of the academic year, ten of the at-
risk group had dropped out or were failing; seven were borderline failures; and ten
were performing well (including the three at-risk students who had sought
feedback).
In 2012–2013, 518 students (70 % of the cohort) were assessed, but only 33 students
(4 %) voluntarily followed up on their results. However, there was evidence that
three of these students remained in the engineering program because of early diag-
nosis and pedagogical intervention by mentors in the Centre. Learning of the suc-
cess of these three students, the Dean of Engineering commented, “Retaining even
two students pays for the expense of the entire academic assessment procedure.”
In 2013–2014, the DELNA rating scale was adapted to better reflect the engi-
neering context. This hybrid writing rubric (see Fox et al. 2016) improved the grain
size or specificity of relevant information provided by the rubric and enhanced the
quality of pedagogical interventions. Further, DELNA’s graph interpretation task
was adapted to represent the engineering context. Graphic interpretation is central
to work as a student of engineering (and engineering as a profession as well).
However, when the DELNA graph task was vetted with engineering faculty and
students, they remarked that “engineers don’t do histograms”. This underscored the
importance of disciplinarity (Prior 1994) in this diagnostic assessment procedure.
It became clear that many of the versions of the generic DELNA graph task were
better suited to social science students than to engineering students, who most
frequently interpret trends with line graphs. In order to refine the diagnostic feed-
back elicited by the assessment and shape pedagogical interventions to support stu-
dents in engineering, it became essential that engineering content, tasks, and
conventions be part of the assessment. Evidence suggested that the frame of general
academic language proficiency (Read 2015) was too broad; decreasing the frame
size and situating the diagnostic assessment procedure within engineering texts, tasks,
and expectations of performance also increased both the overall usefulness of the
assessment (Bachman and Palmer 1996) and the relevance and specificity
(grain size) of information included in the learning profiles of individual students.
More specificity in learning profiles also increased the quality and appropriacy of
interventions provided to students in support of their learning. From the perspective
of Activity Theory, the mentors were increasingly able to address the students’
motives: to use the learning profiles as a starting point; to mediate activity in rela-
tion to the students’ motives to improve their performance in their engineering
classes or achieve higher marks on a lab report or problem set. The increased rele-
vance of the mentors’ feedback and support suggests increased alignment between
the activity systems of mentors and students (Figs. 3.1 and 3.2).
A domain analysis of undergraduate engineering supported the
view that engineering students might be at-risk due to more than academic language
proficiency. For example, some students had gaps in their mathematics background
while others had difficulty reading scientific texts. Still others faced challenges in
written expression required for engineering (e.g., concise lab reports; collaborative
or team writing projects). Importantly, a number of entering students, who were
deemed at-risk when the construct was refined to reflect requirements for academic
success in engineering, were first-language English speakers. Thus, as the theory
and research had suggested, a disciplinary-specific approach would potentially have
the greatest impact in supporting undergraduates in engineering.
As early as 2012–2013, we had pilot tested two line graphs to replace the generic
DELNA graphs. The new graphs illustrated changes in velocity over time in an
acceleration test of a new vehicle. However, the graphs proved to be too difficult to
interpret, given that they appeared on a writing assessment without any supporting
context. Convinced that it was important to provide a writing task that better repre-
sented writing in engineering, in 2013 we also piloted and then administered a writ-
ing task that was embedded in the first lecture of a mandatory course, which all
entering engineering students are required to take regardless of their future disci-
pline (e.g., mechanical, aerospace, electrical engineering). During the first lecture,
the professor introduced the topic, explained its importance, showed a video that
extended the students’ understanding of the topic, and announced that in the follow-
ing class, the students would be asked to write about the topic by explaining differ-
ences in graphs which illustrated projected versus actual performance. Students
were invited to review the topic on YouTube and given additional links for readings,
should they choose to access them.
In 2014–2015, using the same embedded approach, we again administered the
engineering graph task to 1014 students (99 % of the cohort). As in 2013, students
wrote their responses to the diagnostic assessment in the second class of their
required engineering course. The writing samples were far more credible and infor-
mative than had been the case with the generic task, which was unrelated to and
unsupported by their academic work within the engineering program. The informa-
tion produced by the hybrid rubric provided more useful information for peer men-
tors. Other diagnostic tasks were added to the assessment including a reading from
the first-year chemistry textbook in engineering, with a set of multiple choice ques-
tions to assess reading comprehension, and a series of mathematics problems which
represented foundational mathematics concepts required in the first year.
Thus, the initial generic approach evolved into a diagnostic procedure that was
embedded within the university discipline of engineering and operationalized as an
academic literacy construct as opposed to a language proficiency construct. The
embedded approach is consistent with the theory that informed the study. As Figs.
3.1 and 3.2 illustrate, activities are situated within communities characterized by
internal rules and a division of labour.
In the literature on post-entry diagnostic assessment, an embedded approach was
first implemented at the University of Sydney in Australia as Measuring the
Academic Skills of University Students (MASUS) (Bonanno and Jones 2007; Read
2015). Like the diagnostic procedure that is the focus of the present study, MASUS:
• operationalizes an academic literacy construct,
• draws on materials and tasks that are representative of a discipline,
• is integrated with the teaching of courses within the discipline, and
• is delivered within a specific academic program.
In December 2014, data from field notes collected by peer mentors indicated a
dramatic increase in the number of students who were using the Centre (see details
below). In large part, the establishment of a permanent support Centre, staffed with
upper-year students in both engineering and writing/language studies, filled a gap
in disciplinary support that had been evident for some time. The change in the medi-
ational tools in the support activity system (i.e., to a permanent Centre), embedded
within the context of the first-year required course in engineering, has also had an
important impact on student retention and engagement.
In the context of voluntary uptake (Freadman 2002), where the decision to seek sup-
port is left entirely to the student, one of the greatest challenges was reaching stu-
dents at-risk. As McLeod (2012) notes, such students are at times fearful, unsure,
unaware, or unwilling to approach a Centre for help – particularly at the beginning
of their undergraduate program. Only those students with exceptional social net-
working skills are likely to drop-in to a support Centre in the first weeks of a new
year. These students manage a context adeptly (Artemeva 2008; Artemeva and Fox
2010) so that support works to their advantage (like the at-risk student who was the
focus of McLeod’s research).
From 2010 to 2012, the Centre providing support was open during the first 2
months of the Fall term (September–October) and was located in a number of dif-
ferent sites (wherever space was available). When the Centre closed for the year,
interviews with engineering professors and TAs, instructors in engineering com-
munications courses, and upper-year students who had worked in the Centre sug-
gested the need for pedagogical support was on-going. Of particular note was the
comment of one TA in a first-year engineering course who recounted an experience
with a student who was failing. She noted, “I had no place to send him. He had no
place to go.” This sentiment was echoed by one of the engineering communications
instructors who, looking back over the previous term, reported that one of her stu-
dents had simply needed on-going support to meet the demands of the course.
However, both she and her TA lamented their inability to devote more time to this
student: “He was so bright. I could see him getting it. But there was always a line of
other students outside my office door who also needed to meet with me. I just
couldn’t give him enough extra time to make a difference.” Again, the type of sup-
port the student needed was exactly that which had been provided by the Centre. It
was embedded in the context of engineering courses, providing on-going relevant
feedback on engineering content, writing, and language.
As a result of evidence presented to administrators that the diagnostic assessment
procedure and concomitant pedagogical support were having a positive impact, in
2013 a permanent space in the main engineering building was designated and named
the Elsie MacGill Centre2 (by popular vote of students in the engineering program).
It was staffed for the academic year by 11 peer mentors: 8 upper-year students from engineering and 3 from language/writing studies. In addition, funding was made
available for on-going research to monitor the impact of the assessment procedure
and pedagogical support.
From an Activity Theory perspective, the engagement of engineering students in
naming, guiding, and increasingly using the Centre is important in understanding
the evolution of the initial activities (see Figs. 3.1 and 3.2). As students increasingly
draw on the interventions provided by mentors within the Centre (Fig. 3.2), motives
of the two activities become more aligned and coherent; the potential for the nov-
ice’s participation in the community of undergraduate engineering is increased
because motives are less likely to conflict (or at least will be better understood by
both mentors and students). As motives driving the activities of mentors and stu-
dents increasingly align, the potential for positive impact on a student’s experience,
retention and academic success is also increased. Evidence of increasingly positive
impact was gathered from a number of sources.
From September to December 2014, the peer mentors with an engineering back-
ground recorded approximately 135 mentoring sessions (often with one student, but
also with pairs or small groups). However, the engineering peer mentors did not
document whether students seeking support had been identified as at-risk.
During the same 3-month period, three students in the at-risk group made
repeated visits (according to the log maintained by the writing/language studies
peer mentors). However, not only at-risk students were checking in at the Centre
and asking for help. There were 46 other students who used the pedagogical support
provided by the writing/language studies peer mentors in the Elsie Centre (as it is
now popularly called). In total, approximately 184 first-year students (19 % of
the cohort) sought pedagogical support in the first 3 months of the 2014–2015 academic
year, and the number has continued to grow. Peer mentors reported that there were
so many students seeking advice that twice during the first semester they had to turn
some students away.
Increasingly, second-year students were also seeking help from the Elsie Centre.
The majority were English as a Second Language (ESL) students who were
struggling with challenges posed by a required engineering communications course.
It was agreed, following recommendations of the engineering communications
course instructors, that peer mentors would work with these students as well as all
first-year students in the required (core) engineering course. In January 2015, one of
the engineering communications instructors, with the support of the Elsie Centre,
began awarding 1 % of a student’s mark in the communications course for a visit
and consultation at the Elsie Centre.
2 Elsie MacGill was the first woman to receive an Electrical Engineering degree in Canada and the
first woman aircraft designer in the world. She may be best known for her design of the Hawker
Hurricane fighter airplanes during World War II. Many credit these small and flexible airplanes for
the success of the Allies in the Battle of Britain. Students within the engineering program voted to
name the Centre after Elsie MacGill.
Consistent with the theoretical framework informing this research, situating the
diagnostic assessment procedure within a required engineering course has made a
meaningful difference in students’ voluntary uptake (Freadman 2002) of pedagogi-
cal support. As discussed above, in marketing the diagnostic assessment procedure
to these engineering students, it was critical to work towards student ownership.
Findings suggest that the students’ increased ownership is leading to an important
change in how students view and participate in the activity of the Centre. The engi-
neering students in the 2014–2015 cohort seem to view the ‘Elsie Centre’ as an
integral part of their activity system. As one student, who had just finished working
on a lab report with a writing/language studies mentor, commented: “That was awe-
some. I’m getting to meet so many other students here, and my grades are getting
better. When I just don’t get it, or just can’t do it, or I feel too stressed out by all the
work…well this place and these people have really made a difference for me.”
The Elsie Centre mentors have also begun to offer workshops for engineering
students on topics and issues that are challenging, drawing on the personal accounts
of the students with whom they have worked. The mentors have also undertaken a
survey within the Centre to better understand what is working with which students
and why. The survey grew out of the mentors’ desire to elicit more student feedback
and examine how mentors might improve the quality and impact of their pedagogi-
cal support. The activity system of the diagnostic assessment procedure (Fig. 3.1) is
evolving over time, informed by systematic research, self-assessment, and the men-
tors’ developing motive to be more effective in supporting more students. In other
words, a more advanced activity system is emerging (Engeström 1987) which
allows for further alignment in the activity systems of the mentors and the first-year
students (Figs. 3.1 and 3.2).
6 Conclusion
Although the final figures for the 2014–2015 academic year are not yet available,
there is every indication that the two changes made to the diagnostic assessment
procedure, namely embedding the assessment in the content and context of a first-
year engineering course, and setting up a permanent support Centre named and
owned by these engineering students, have greatly increased its impact. Students are
more likely to see the relevance and usefulness of diagnostic feedback and peda-
gogical support when it relates directly to their performance in a required engineer-
ing course. The Centre is open to all first-year students, and students of all abilities
are using it. As a result, the Centre does not suffer from the stigma of a mandatory
(e.g., remedial) approach. Increased engagement and participation by students are evidence of a growing interconnectedness that is shaping students’ identities as
members of the undergraduate engineering community. Lave and Wenger (1991)
and Artemeva (2011) discuss the development of a knowledgeably skilled identity as
an outcome of a novice’s learning to engage with and act with confidence within a
community. The development of this new academic identity (i.e., functioning as a knowledgeably skilled member of the undergraduate engineering community) appears to be a further outcome of the embedded diagnostic assessment procedure and the pedagogical support described in this chapter.
References
Alderson, J. C. (2007). The challenges of (diagnostic) testing: Do we know what we are measur-
ing? In J. Fox, M. Wesche, D. Bayliss, L. Cheng, C. Turner, & C. Doe (Eds.), Language testing
reconsidered (pp. 21–39). Ottawa: University of Ottawa Press.
Anderson, T. (2015). Seeking internationalization: The state of Canadian higher education.
Canadian Journal of Higher Education, 45(4), 166–187.
Artemeva, N. (2006). Approaches to learning genres: A bibliographical essay. In N. Artemeva &
A. Freedman (Eds.), Rhetorical genre studies and beyond (pp. 9–99). Winnipeg: Inkshed
Publications.
Artemeva, N. (2008). Toward a unified theory of genre learning. Journal of Business and Technical
Communication, 22(2), 160–185.
Artemeva, N. (2011). “An engrained part of my career”: The formation of a knowledge worker in
the dual space of engineering knowledge and rhetorical process. In D. Starke-Meyerring,
A. Paré, N. Artemeva, M. Horne, & L. Yousoubova (Eds.), Writing in knowledge societies
(pp. 321–350). Fort Collins: The WAC Clearinghouse and Parlor Press. Available at http://wac.
colostate.edu/books/winks/
Artemeva, N., & Fox, J. (2010). Awareness vs. production: Probing students’ antecedent genre
knowledge. Journal of Business and Technical Communication, 24(4), 476–515.
Artemeva, N., & Fox, J. (2014). The formation of a professional communicator: A sociorhetorical
approach. In V. Bhatia & S. Bremner (Eds.), The Routledge handbook of language and profes-
sional communication (pp. 461–485). London: Routledge.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford: Oxford University Press.
Bonanno, H., & Jones, J. (2007). The MASUS procedure: Measuring the academic skills of university students. A resource document. Sydney: Learning Centre, University of Sydney.
Retrieved from http://sydney.edu.au/stuserv/documents/learning_centre/MASUS.pdf
Brown, J. S., Collins, A., & Duguid, P. (1989). Situated cognition and the culture of learning.
Educational Researcher, 18(1), 32–42.
Browne, S., & Doyle, H. (2010). Discovering the benefits of a first year experience program for
under-represented students: A preliminary assessment of Lakehead University’s Gateway
Program. Toronto: Higher Education Quality Council of Ontario.
Cai, H. (2015). Producing actionable feedback in EFL diagnostic assessment. Paper delivered at
the Language Testing Research Colloquium (LTRC), Toronto, 18 Mar 2015.
Cheng, L., & Fox, J. (2008). Towards a better understanding of academic acculturation: Second
language students in Canadian universities. Canadian Modern Language Review, 65(2),
307–333.
Creswell, J. W. (2015). A concise introduction to mixed methods research. Los Angeles: Sage.
Elder, C., & von Randow, J. (2008). Exploring the utility of a web-based English language screen-
ing tool. Language Assessment Quarterly, 5(3), 173–194.
Engeström, Y. (1987). Learning by expanding: An activity-theoretical approach to developmental
research. Helsinki: Orienta-Konsultit Oy.
Engeström, Y., & Miettinen, R. (1999). Introduction. In Y. Engeström, R. Miettinen, & R. I.
Punamäki (Eds.), Perspectives on activity theory (pp. 1–16). Cambridge: Cambridge University
Press.
Engeström, Y., Miettinen, R., & Punamäki, R. I. (Eds.). (1999). Perspectives on activity theory.
Cambridge: Cambridge University Press.
Fox, J. (2009). Moderating top-down policy impact and supporting EAP curricular renewal:
Exploring the potential of diagnostic assessment. Journal of English for Academic Purposes,
8(1), 26–42.
Fox, J. (2015). Editorial, Trends and issues in language assessment in Canada: A consideration of
context. Special issue on language assessment in Canada. Language Assessment Quarterly,
12(1), 1–9.
Fox, J., & Artemeva, N. (2011). Raters as stakeholders: Uptake in the context of diagnostic assess-
ment. Paper presented at the Language Testing Research Colloquium (LTRC), University of
Michigan, Ann Arbor.
Fox, J., & Haggerty, J. (2014). Mitigating risk in first-year engineering: Post-admission diagnostic
assessment in a Canadian university. Paper presented at the American Association of Applied
Linguistics (AAAL) Conference, Portland.
Fox, J., Cheng, L., & Zumbo, B. (2014). Do they make a difference? The impact of English language programs on second language (L2) students in Canadian universities. TESOL Quarterly,
48(1), 57–85. doi:10.1002/tesq.
Fox, J., von Randow, J., & Volkov, A. (2016). Identifying students-at-risk through post-entry diag-
nostic assessment: An Australasian approach takes root in a Canadian university. In V. Aryadoust
& J. Fox (Eds.), Trends in language assessment research and practice: The view from the
Middle East and the Pacific Rim (pp. 266–285). Newcastle upon Tyne: Cambridge Scholars
Press.
Freadman, A. (2002). Uptake. In R. Coe, L. Lingard, & T. Teslenko (Eds.), The rhetoric and ideol-
ogy of genre (pp. 39–53). Cresskill: Hampton Press.
Gee, J. P. (2011a). An introduction to discourse analysis: Theory and method (3rd ed.). London:
Routledge.
Gee, J. P. (2011b). How to do discourse analysis: A toolkit. London: Routledge.
Huhta, A. (2008). Diagnostic and formative assessment. In B. Spolsky & F. M. Hult (Eds.), The
handbook of educational linguistics (pp. 469–482). Malden: Blackwell.
Lave, J., & Wenger, E. (1991). Situated learning: Legitimate peripheral participation. Cambridge:
Cambridge University Press.
Leont’ev, A. N. (1981). The problem of activity in psychology. In J. V. Wertsch (Ed.), The concept
of activity in Soviet psychology (pp. 37–71). Armonk: Sharp.
McLeod, M. (2012). Looking for an ounce of prevention: The potential for diagnostic assessment
in academic acculturation. Unpublished M.A. thesis. Carleton University, Ottawa.
Office of Institutional Research and Planning. (2014). Retention and Graduation of Undergraduates
for “B. Engineering” – 1998 to 2012. Ottawa: Carleton University.
Prior, P. (1994). Writing/disciplinarity: A sociohistoric account of literate activity in the academy.
Mahwah: Lawrence Erlbaum.
Read, J. (2008). Identifying academic language needs through diagnostic assessment. Journal of
English for Academic Purposes, 7(2), 180–190.
Read, J. (2012). Issues in post-entry language assessment in English-medium universities. Revised
version of a plenary address given at the Inaugural Conference of the Association for Language
Testing and Assessment of Australia and New Zealand (ALTAANZ), University of Sydney, Australia, 10 Nov 2012.
Read, J. (2015). Assessing English proficiency for university study. Gordonsville: Palgrave.
Scanlon, D. L., Rowling, L., & Weber, Z. (2007). “You don’t have like an identity… you are just lost in a crowd”: Forming a student identity in the first-year transition to university. Journal of Youth
Studies, 10(2), 223–241.
Tinto, V. (1993). Leaving college: Rethinking the causes and cures of student attrition (2nd ed.).
Chicago: University of Chicago Press.
Vygotsky, L. S. (1987). Thinking and speech (N. Minick, Trans.). In R. W. Rieber & A. S. Carton
(eds.), The collected works of L. S. Vygotsky: Vol. 1. Problems of general psychology
(pp. 39–285). New York: Plenum Press. (Original work published 1934).
Chapter 4
The Consequential Validity of a Post-Entry
Language Assessment in Hong Kong
Edward Li
Abstract The launch of the 3 + 3 + 4 education reform in Hong Kong has posed
challenges to as well as created opportunities for tertiary institutions. It has invari-
ably led to reviews of the effectiveness of their existing English language curricula
and discussions among language practitioners in the tertiary sector as to what kind
of English curriculum and assessment would serve the needs and interests of the new
breed of senior secondary school graduates, who have had only six years to study
English in the new education system as compared with seven years in the old sys-
tem. This chapter reports on the pedagogical and assessment strategies adopted by
the Hong Kong University of Science and Technology (HKUST) to meet these
challenges, and the findings of a pilot study conducted to investigate the consequen-
tial validity of a post-entry language assessment used at HKUST. Consequential
validity is often associated with test washback. In Messick’s expanded notion of test
validity (Messick 1989), the evidential and consequential bases of test score inter-
pretation and test score use are considered as crucial components of validity. It cov-
ers not just elements of test use, but also the impact of testing on students and
teachers, the interpretation of test scores by stakeholders, and the unintentional
effects of the test. This chapter reports the findings of the pilot study and discusses
their implications for the use of PELAs.
E. Li (*)
Center for Language Education, The Hong Kong University of Science and Technology,
Hong Kong, China
e-mail: lcedward@ust.hk
1 Introduction
The teaching context at HKUST closely resembles what these five Good Practice
Principles recommend. The migration from the 3-year degree to the 4-year degree
has led the University’s senior management to reinstate the importance of English
language development in the new 4-year undergraduate programme and the need for
an English proficiency threshold for progression to Year 2 and beyond.
Communication competence in English (and Chinese) is stated as one of the major
learning outcomes of undergraduate programmes and as a graduate attribute for the
4-year degree. As indicated in Table 4.1, 12 out of 120 credits for an undergraduate
programme are allocated to the development of English language ability and half of
them to the first year English Core curriculum to help students build a solid founda-
tion of proficiency and adequate academic literacy in English before they proceed to
the senior years of studies. Six credits of study means six contact hours in the class-
room with another six hours of out-of-class learning. The purpose of having a
bottom-heavy English curriculum is to utilize the foundation year as much as pos-
sible to engage students intensively and actively in developing their English profi-
ciency. At the same time, the Center for Language Education (CLE) has also greatly expanded the range of informal language learning activities to cater for students’ needs and interests. Students could choose to spend two full weeks in an
intensive on-campus immersion programme before the start of the first semester,
participate in academic listening workshops, enroll in non-credit-bearing
short courses each targeting a specific language skill or aspect, join the regular
theme-based conversation groups, or enjoy blockbusters at movie nights. They
could also seek one-on-one consultations with CLE advisors for more personalised
help with their learning problems. In the 4-year degree programme, students are
surrounded by various accessible learning opportunities in and out of the
classroom.
In the English Core curriculum, the English Language Proficiency Assessment (ELPA) serves both formative and summative assessment purposes. Like many other PELAs, it is administered to students
before the start of the first semester for early identification of the at-risk group who
would need additional support for their English language learning. It also plays the
role of a no-stakes pre-test to capture students’ proficiency profile at the starting point of their year-long English learning journey. The test results can be used as a refer-
ence point for students’ self-reflection and choice of learning resources. At the end
of the second semester of the English Core, ELPA is administered as a post-test to
track students’ proficiency gain throughout the year and to see if they have reached
the proficiency threshold to proceed to Year 2. Unlike the pre-test, the ELPA post-
test carries much higher stakes and the test results are counted towards the grades
for the English Core. Because of its importance in the curriculum, care has been
taken to incorporate assessment features in the design of ELPA that would maxi-
mize the possibility of positive washback on students and teachers.
Incorporating a test into a curriculum appropriately is never an easy task.
Developing a homegrown PELA is even more daunting. Why reinvent the wheel?
“If a test is regarded as important, if the stakes are high, preparation for it can come
to dominate all teaching and learning activities. And if the test content and testing
techniques are at variance with the objectives of the course, there is likely to be
harmful backwash” (Hughes 2003: 1). In the reconceptualization of the curriculum
and assessment for the 4-year degree, ELPA as a PELA is designed in such a way as
to:
• Capitalize on the flexibility and autonomy of an in-house PELA concerning what
to assess and how to assess to serve the needs of the curriculum. ELPA is
curriculum-led and curriculum-embedded;
• Help to constructively align teaching, learning and assessment. It should add value by consolidating the strengths, or even enhancing the effectiveness, of the English Core curriculum. The ultimate aim is that ELPA should generate positive wash-
back on learning and teaching;
• Create a common language about standards or achievements between students
and teachers. The senior secondary education examination results might not be
appropriate as a reference to articulate expectations in the context of university
study;
• Frame the assessment rubrics for the English Core course assessments; and
• Support teaching and learning by establishing a feedback loop. The test results
should act as a guide to inform students about how to make better use of the
learning resources at CLE and plan their study more effectively.
ELPA is designed to assess the extent to which first-year students can cope with
their academic studies in the medium of English. It assesses both the receptive
(reading, listening and vocabulary) and productive (speaking and writing) language
skills in contexts relevant to the academic studies. ELPA consists of two main test
components. The written test assesses students’ proficiency in reading, listening and
writing, and their mastery of vocabulary knowledge. The speaking test is an 8-min
face-to-face interview, assessing students’ readiness to engage in a conversation on
topics within the target language domain meaningfully and fluently (Table 4.2).
Although the pre-test results are used as an indication of possible areas for further work by students, ELPA is not a diagnostic test by design. Typical diagnostic tests
have the following characteristics (Alderson 2005: 10–12):
• They are more likely to focus on weaknesses than strengths. As a result, diagnos-
tic tests may not be able to generate a comprehensive proficiency profile of stu-
dents to inform decisions.
• They are likely to be less authentic than proficiency tests. Students would be
required to complete tasks different from what they are expected to do in real life.
Consequently, cooperation of students in taking the test may not be guaranteed.
• They are more likely to be discrete-point than integrative. This means diagnostic
tests would be likely to focus more on specific elements at local levels than on
global abilities. For the assessments of productive skills (speaking and writing) –
the vital components of first-year progress – the discrete-point testing approach
may not be the most appropriate.
ELPA mainly adopts a performance-based test design and a direct-testing approach, situating test items in academic contexts and using authentic materials
whenever possible. What students are required to do in ELPA resembles closely the
tasks they do in the real-life target domain. Assessments of the productive skills
adhere to the direct testing approach. The prompts are generically academic or
everyday social topics which an educated person is expected to be able to talk or
write about. The input materials for the reading and listening sections are taken
from authentic sources such as newspapers, magazines and journal articles, lectures,
and seminars, with little or no editing to avoid content bias or minimize the need for prior knowledge. The test domain for the assessment of vocabulary knowledge is defined by the Academic Word List and the benchmark word frequency levels for junior and senior secondary education in Hong Kong.
As pointed out by Messick (1996), possessing qualities of directness and authen-
ticity is necessary but not sufficient for a test to bring beneficial consequences. The
test should also be content and construct valid. In the case of ELPA, construct and content validity are addressed by full integration of the test and the curriculum, using the list of most desirable constructs for development not only as the specifications for test development but also as the curriculum blueprint for materials writing
and articulation of learning outcomes. This is to ensure as high a degree as possible
of critical alignment of the assessment, teaching and learning components of the foundation curriculum. Although the curriculum has a broader spectrum of contents and objectives than those of the ELPA test specifications, all the major constructs intended for assessment in ELPA overlap significantly with the curriculum.
The constructs are assessed in ELPA as snapshots and in course assessments as
achievements over time for formative and summative feedback. Overlap just
between the test and the course is not automatically equivalent to content validity.
The latter also involves the relationships of curriculum and assessment with the
target language domain, as suggested by Green (2007) as a condition for positive washback (Fig. 4.1).
ELPA also adopts the criterion-referenced testing approach. The ELPA rating
scale is made up of seven levels, with Level 4 set as the threshold level for the
English Core curriculum. The performance descriptors for each level are written as
can-do statements with typical areas of weakness identified for further improve-
ment. The performance descriptors are also used in an adapted form as the assess-
ment rubrics for the course assessments of the English Core. This is to establish a
close link between ELPA and the coursework. Both the ELPA test results and the
feedback on course assignments are presented to students in the form of ELPA lev-
els attained and corresponding performance descriptors. After the pre-test results
are released, there will be individual consultations between students and their class teachers.
Messick (1989, 1996) argues for the need in test validation studies to ensure that the social consequences of test use and interpretation support the intended purposes of testing and are consistent with other social values. He believes that the intended
consequences of a test should be made explicit at the test design stage and evidence
then be gathered to determine whether the actual test effects correspond to what was
intended. Regarding what evidence should be collected to validate the consequences of test use, and how it should be collected, Messick’s theoretical model of validity does not give much practical advice to practitioners. His all-
inclusive framework for validating the consequences of score interpretation and use
requires evidence on all six contributing facets of validity. There is no indication
as to the operationalization and prioritization of the various aspects of validity to
justify the consequences of test use.
On the other hand, Weir (2005) offers a more accessible validation framework as
an attempt to define and operationalize the construct. He proposes three areas in which to examine Messick’s consequential validity: differential validity – or what he
calls avoidance of test bias in his later writings (Khalifa and Weir 2009; Shaw and
Weir 2007); washback on individuals in the classroom/workplace; and impact on
institutions and society. As can be seen, test fairness is a key component of this
framework. Weir argues that one of the basic requirements for a test to create benefi-
cial effects on stakeholders is that the test has to be fair to the students who take it,
regardless of their backgrounds or personal attributes. If the intended construct for
assessment is under-represented in the test or the test contains a significant amount
of construct-irrelevant components, the performance of different groups of students
will likely be affected differentially and the basis for score interpretations and use
would not be meaningful or appropriate (American Educational Research
Association et al. 1999). This echoes Messick’s argument for authenticity and
directness as validity requirements by means of minimal construct under-
representation and minimal construct-irrelevant items in a test.
Weir’s validation framework also makes a distinction between washback and
impact by relating the former to the influences the test might have on teachers and
teaching, students and their learning (Alderson and Wall 1993), and defining the
latter as the influences on the individuals, policies and practices in a broader sense
(Wall 1997; Hamp-Lyons 1997; Bachman and Palmer 1996), though washback and
impact are not two entirely unrelated constructs. As Saville (2009:25) points out,
“the impact: washback distinction is useful for conceptualizing the notion of impact,
but it does not imply that the levels are unconnected.” Building on Messick and
Weir’s ideas above, an expanded validation framework for evaluating the conse-
quential validity of PELAs is proposed in Table 4.3, with avoidance of test bias
subsumed under test content and washback divided into two related categories: per-
ceptions of the test and actions taken as a result of test use (Bailey 1996).
This section reports on the first phase of an exploratory longitudinal study of the
consequential validity of ELPA. The findings are presented along the dimensions of
the expanded validation framework.
As stated in Sect. 3, the institutional context in which ELPA is used mirrors the
recommendations of the Good Practice Principles by AUQA (2009). One aspect
that might seem to have fallen short of Good Practice Principle No. 7 is that students
only take ELPA twice, first before the start of Year 1 and then at the end of Year 1.
It might be considered as not providing on-going opportunities for students’ self-assessment. The situation at HKUST is that reflection on one’s language
development needs does not happen through re-sitting ELPA multiple times for the
purposes of self-assessment. This is different from the use of DELTA, for example,
a PELA used in some sister institutions in Hong Kong, which is intended to track
students’ proficiency development at multiple points throughout their degree study
(see Urmston et al. this volume). In the case of ELPA, where assessment is fully
integrated with the curriculum, opportunities for self-reflection about language
development needs and further language work are provided in the context of the
course, particularly when the class teacher gives feedback to students on their course
assignment using the ELPA assessment framework.
The intended validity was examined by the ELPA test development team, the
English Core curriculum team and an external consultant in the form of a reflection
exercise. The intended or a priori validity evidence (Weir 2005) gathered before the
test event can be used to evaluate whether the test design has the potential for ben-
eficial consequences. The group generally felt that ELPA follows the sound princi-
ples of performance-based, criterion-referenced, and direct-testing approaches.
Given the limitations of PELAs, the challenge for the ELPA team is how to ensure
that the test assesses the most relevant and desirable constructs and can “sample
them widely and unpredictably” (Hughes 2003: 54) in the target language domain
without making the test too long or having students take the test multiple times
before inferences can be made. There is always tension between validity, reliability
and practicality. Our response to this challenge is that ELPA and the curriculum
complement each other in plotting the different facets of students’ proficiency
development – as test-takers vs learners; in test-based snapshots vs extended pro-
cess assessments; and in examination conditions, with no learning resources vs in
resource-rich classroom assessment situations where students can be engaged in
collaborative assessments with peers.
Students’ perceptions of test validity are one important source of evidence to support test use, and are arguably even more influential than statistical validity evidence in explaining their willingness to take part in the testing event and shaping
their learning behaviors afterwards. More than 1,400 incoming students participated
in a survey immediately after they took ELPA as a post-admission pre-test in
summer 2014. The questionnaire covered aspects including test administration and
delivery, affective factors such as anxiety and preferences for test formats, and the
test quality. Since students had not yet started the English Core courses, questions
related to content validity or the degree of overlap between the test, the course and
the English ability required in the academic courses were not included in this sur-
vey. Twenty students were randomly selected for two rounds of focus group discussions in
October 2014, after two months of degree studies, to give more detail about the
effect of ELPA on their perceptions or attitudes and the actions they took as a result
of taking the test.
Reported in this section are the findings on perceptions of test difficulty, test
importance, test fairness/bias, and validity. To reveal the possible differences
between students from different schools at HKUST, the data were subjected to one-
way ANOVAs. The ratings of each related item were analysed as dependent mea-
sures and “SCHOOL” (Business, Engineering, Science and Humanities) was a
between-subject factor. Post-hoc Tukey HSD tests were conducted to compare ratings between different schools when a significant main effect of SCHOOL was found.
The significance of the post-hoc tests was corrected by the Holm-Bonferroni
method.
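For readers less familiar with this procedure, the following is a minimal sketch in Python of the kind of analysis described above – a one-way ANOVA on difficulty ratings with SCHOOL as a between-subject factor, followed by Tukey HSD post-hoc comparisons. The data frame, values and column names are hypothetical, not the study’s data.

```python
# A minimal sketch (hypothetical data, not the study's) of a one-way ANOVA
# with SCHOOL as a between-subject factor, plus Tukey HSD post-hoc tests.
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical long-format data: one difficulty rating (1-5) per student.
ratings = pd.DataFrame({
    "school": ["Business"] * 4 + ["Engineering"] * 4 + ["Science"] * 4,
    "rating": [3, 2, 3, 3, 2, 2, 1, 2, 3, 2, 2, 3],
})

# One-way ANOVA: does the mean rating differ across schools?
groups = [g["rating"].values for _, g in ratings.groupby("school")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

# Post-hoc Tukey HSD comparisons, run only when the main effect is significant.
if p_value < 0.05:
    print(pairwise_tukeyhsd(ratings["rating"], ratings["school"]))
```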
Overall, students found the ELPA test to be between ‘3 = appropriate’ and ‘2 = difficult’ (mean = 2.43; SD = 0.766) in terms of test difficulty (Table 4.4). One-
way ANOVAs were conducted with ratings for each sub-test. The results showed
that there were significant main effects of SCHOOL (p < 0.05) in most cases, with
the only exception in Reading (p = 0.09). Post-hoc tests indicated that significant
differences in SCHOOL mainly existed between Business and Engineering students
(p < 0.01), with Engineering students finding the test more difficult. Writing and Speaking were rated most positively, as ‘appropriate’, while Vocabulary Part 2 was rated as significantly more difficult than the other parts.
To check if Writing and Speaking were statistically different from other parts,
the data of 1,049 students who rated all six parts were subjected to repeated-
measures ANOVA. The ratings of each part were analysed as dependent measures,
with “TEST” (Reading, Listening, Vocabulary Part 1, Vocabulary Part 2, Writing
and Speaking) as a within-subject factor. Results showed that there was a significant
main effect of TEST (F(4.57, 4786.85) = 351.04, p < 0.001; the Greenhouse-Geisser correction was applied for violation of sphericity). Post-hoc paired t-tests were conducted to com-
pare the ratings for Writing and Speaking to those for the other parts (significance
was corrected by the Holm-Bonferroni method). Results of the post-hoc tests
showed that Writing and Speaking were rated higher (p < 0.001), and closer to ‘appropriate’, than the other parts, but no difference was found between Writing and
Speaking.
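The Holm-Bonferroni step can be illustrated with a short sketch. The code below (hypothetical ratings and names, not the study’s data) runs paired t-tests comparing Writing against two other parts and applies the correction, which tests the i-th smallest of m p-values at a progressively less strict threshold.

```python
# A minimal sketch (hypothetical ratings) of paired t-tests with a
# Holm-Bonferroni correction, as in the post-hoc step described above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50  # hypothetical number of students who rated all parts
writing = rng.integers(2, 5, n)  # ratings on the 1-5 difficulty scale
others = {"Reading": rng.integers(1, 4, n),
          "Vocab Pt. 2": rng.integers(1, 4, n)}

# Paired t-tests: each student rated both parts being compared.
p_values = {part: stats.ttest_rel(writing, scores).pvalue
            for part, scores in others.items()}

# Holm-Bonferroni: sort p-values ascending; test the i-th smallest of m
# p-values at alpha / (m - i) (0-indexed), stopping at the first failure.
alpha, m = 0.05, len(p_values)
for i, (part, p) in enumerate(sorted(p_values.items(), key=lambda kv: kv[1])):
    threshold = alpha / (m - i)
    verdict = "significant" if p < threshold else "not significant"
    print(f"Writing vs {part}: p = {p:.4f} (threshold {threshold:.4f}): {verdict}")
    if p >= threshold:
        break  # all larger p-values are also non-significant under Holm
```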
This pattern can be attributed to students’ familiarity with and preference for test
formats. In the subsequent focus group discussions, students commented that they
were familiar and comfortable with the essay writing and interview test formats. As
they said, the test formats for Writing and Speaking were what they ‘expected’.
Vocabulary Part 2, on the other hand, consisted of 30 sentences with the target word
in each of them gapped. Students had to understand the given context and then fill
in the most appropriate word to complete the meaning of the sentence. Not only did
students say they were not familiar with this test format but recalling the target word
from their mental lexicon to match the given context was stress-provoking. The less common the target words, the more stress they would cause for students. Vocabulary Part 1 seemed more ‘manageable’ as choices were provided to match with the meaning of the target word. At least the students thought the multiple choice format induced less stress.
Table 4.4 Students’ perception of test difficulty
Question: Is the ELPA test easy? (5 = very easy; 3 = appropriate; 1 = very difficult)
Overall Business Engineering Science Humanities
Mean SD N Mean SD N Mean SD N Mean SD N Mean SD N
ELPA test 2.43 0.77 1462 2.52 0.79 474 2.34 0.76 572 2.43 0.75 378 2.63 0.63 38
Reading 2.54 0.85 1469 2.51 0.83 474 2.50 0.87 573 2.63 0.87 383 2.64 0.74 39
Listening 2.30 0.90 1470 2.37 0.89 475 2.24 0.91 573 2.27 0.91 383 2.62 0.78 39
Vocab Pt. 1 2.75 0.99 1469 2.93 0.96 476 2.63 1.00 572 2.68 1.00 383 2.89 0.86 38
Vocab Pt. 2 2.12 0.95 1469 2.29 0.94 475 1.99 0.96 573 2.09 0.93 382 2.36 0.81 39
Writing 3.05 0.88 1469 3.14 0.88 475 2.96 0.89 574 3.06 0.84 381 3.21 0.89 39
Speaking 3.07 0.87 1061 3.10 0.85 365 3.02 0.91 381 3.07 0.82 289 3.54 0.95 26
The content of the writing and speaking parts of the test was considered ‘neutral’ and ‘impartial’ (Table 4.5: written test – mean = 3.09, SD = 0.668; speaking test – mean = 3.06, SD = 0.670). No significant main effect of SCHOOL was found for
either test (written test: (F(3,1462) = 0.58, p = 0.63); speaking test: (F(3,1087) =
1.21, p = 0.31)). These questionnaire findings were confirmed by nearly all the par-
ticipants in the focus groups. The writing and speaking prompts they were given in
these two parts were topics which they said concerned their everyday life and there-
fore they had views to express. They did not feel they were ‘being tricked or
trapped’, or that the topics gave them any advantage or disadvantage in terms of
prior knowledge. Choice of topic was mentioned as a potential area for improving
the social acceptability of ELPA. For example, students understood that it might not
be possible to have more than one topic or theme to choose from in a live speaking
test, but they preferred a choice of topics in the writing paper if possible. This was
because they were used to choices in the Hong Kong Diploma of Secondary
Education (HKDSE) Examination, which adopts a graded approach whereby candidates can choose to attempt the easier or the more challenging versions of some
of the sections. They could also choose from a range of topics for the HKDSE
Writing paper.
The perceived validity of ELPA was rated between ‘3=accurately measured my
English ability’ and ‘2=my performance was somewhat worse than my English abil-
ity’ consistently across all papers and backgrounds (Table 4.6). No significant main
effect of SCHOOL (p > 0.05) was found in any of the tests.
Students generally felt satisfied that the test assessed their English proficiency to a
reasonably good extent, though more than two-thirds of the participants in the focus
groups said that they could have done better. Because of this ‘can-be-better’ mentality, they chose ‘2’ rather than ‘3’ for this question.
Overall, the students thought the test was fair and it was important for them to get
good results in ELPA (Table 4.7). The one-way ANOVA revealed a significant effect
of SCHOOL (F(3,1464) = 5.79, p = 0.001). Post-hoc Tukey HSD tests indicated that
Business students (mean = 4.38 ± 0.79) perceived getting good results in ELPA as
more important than Science students did (mean = 4.23 ± 0.89, p < 0.001), among all
comparisons. As both means are above 4, this can be interpreted as a difference in
the degree of importance rather than in whether it was considered important or not.
Table 4.5 Students’ perception of content bias
Question: Does the test content give you more advantage than other students? (5 = I have advantage over other students; 3 = I do not have any advantage
or disadvantage; 1 = I have disadvantage over other students)
Overall Business Engineering Science Humanities
Mean SD N Mean SD N Mean SD N Mean SD N Mean SD N
Written test 3.09 0.67 1466 3.07 0.66 473 3.08 0.66 573 3.12 0.69 381 3.15 0.71 39
Speaking test 3.05 0.67 1091 3.06 0.69 368 3.02 0.68 401 3.06 0.62 294 3.25 0.80 28
In the focus group discussions, students expressed mixed feelings about the test.
Three out of the 20 students in the discussions held rather negative views. The fact
that they had already been admitted to the university meant that they felt they should
have satisfied the University’s English language requirement. While they accepted
the need for the English Core for first year students, they did not quite understand
why ELPA was necessary after admission. To them, diagnosis was not meaningful because they said they already had a very good idea of their strengths and weaknesses in
English. When prompted to state what their weaknesses were, they made reference
to their results in the English Language subject in the HKDSE examination. Their
resentment seemed to be related to their perception of the policy’s fairness.
Interestingly, their ELPA pre-test results showed that they did not belong to one
single ability group – one achieved relatively high scores on all sections of ELPA, one scored mid-range around the threshold level, and one scored below the threshold. They said ELPA had no influence on them and that they would not do anything extra to boost their test
performance.
Other focus group participants accepted ELPA as a natural, appropriate arrange-
ment to help them identify their language needs for effective study in the medium
of English. They agreed that the learning environment in the university was much
more demanding on their English ability than that in secondary school. For exam-
ple, they were required to read long, unabridged authentic business cases or journal
articles and listen to lectures delivered by professors whose regional accents might
not be familiar to them. They showed enhanced motivation to do well in the ELPA
post-test not just for English Core but also for other academic courses, though to
some this attitude might have been caused by increased anxiety and trepidation
rather than appreciation of opportunities for personal development.
When asked whether they had done more to improve their English proficiency
than before as a result of ELPA, the 17 student participants who had a more positive
attitude towards ELPA said they had spent more time on English. However, the
washback intensity of the pre-test was obviously not very strong. They said this was
partly because the ELPA post-test was not imminent and partly because English
Core already took up a substantial amount of their study time allocated to language
development. They knew improvement of English proficiency would take a long
time but unanimously confirmed that they would do more on the requirements of
ELPA to get themselves prepared for the ELPA post-test. The washback intensity
was expected to become stronger in the second semester when the ELPA post-test
was closer in time.
Regarding what to learn and how to learn for ELPA, 15 student participants
chose expanding their vocabulary size as their top priority. They reasoned that
vocabulary was assessed not just in the vocabulary section, but also in the writing and speaking sections of the test. In addition, they felt that Vocabulary Part 2 (Gap Fill)
was the most difficult section of the test. Improving their lexical competence would
be the most cost-efficient way to raise their performance in ELPA, though they
doubted whether they could get better scores in Vocabulary Part 2 because of its
unpredictability. They also thought learning words was more manageable in terms
of setting learning targets and reducing the target domain into discrete units for
learning than the more complex receptive and productive language skills. They all
started with, probably on their teachers’ advice, the items on the Academic Word List, a much smaller finite set of words than the other word frequency levels. They
made use of online dictionaries for definitions and the concordance-based learning
materials to consolidate their understanding of the words in relation to their usage
and collocation. Two students who were from Mainland China said that they kept a
logbook for vocabulary learning, which included not just English words but also words and expressions from other languages such as Cantonese, the spoken language used by local Hong Kong people.
Apart from vocabulary knowledge, eight participants chose writing as another
major learning target for the first semester. Six focused on expanding the range of
sentence structures they could use in an essay whereas four identified paragraph
unity as their main area for further improvement. They said that these were criterial
features of the ELPA performance descriptors for writing and were taught in class.
They said they had a better idea of how to approach the learning targets – they fol-
lowed the methodology used in the textbook for practice, for example, varying sen-
tence patterns instead of repeating just a few well-mastered structures, writing clear
topic sentences and controlling ideas for better reader orientation. However, they
said that teachers’ feedback on the practice was crucial for the sustainability of the
learning habits.
Six teachers agreed to join a focus group discussion to share their views on ELPA
and how the test affected their perceptions and the way they taught in the classroom.
Overall the six teachers had a very positive attitude towards ELPA and its role in the
English Core curriculum. Nearly all teachers on the English Core teaching team are
ELPA writing and speaking assessors. Many of them had been involved in item
writing and moderation at some stage, and a few were designated to carry out additional core duties in assessor training, quality assurance, and test administration and
operation. Their minor grievances seemed to be all related to the extra workload
caused by ELPA, though these extra duties were all compensated with teaching
relief. However, they all commented that ELPA had helped to make the assessment component of the course more accurate and the assessment outcomes more credible.
They felt that their involvement in different aspects of ELPA had helped them gain
a better understanding of the assessment process. For example, they found the elab-
orate Rasch analysis-based rater anchoring plans for the speaking and writing
assessments extremely tedious, yet necessary for quality assurance.
Despite a heavier assessment load for the assessors, they agreed that the whole
procedure helped to minimize teacher variability in assessment across classes and
as a result they had more confidence in the assessment outcomes. Another aspect
that they appreciated most was the integration of criterion-referenced ELPA perfor-
mance descriptors into course assessment. They said ELPA performance descriptors
evolved into a common language with explicit, mutually understood reference
points among teachers and between teachers and students to discuss standards and
achievements. This was very useful for them to formulate teaching targets and to
help students to articulate needs and set milestones for learning. They felt that students’ extrinsic motivation was enhanced.
All the teacher participants agreed that ELPA was a high-stakes test in the
English Core, so naturally they considered it their duty to help students to at least
meet the threshold requirement by the end of the second semester. Regarding the
washback on teachers as to what to teach and how to teach, the influence from
ELPA was less clear. Two teachers pointed out that they would recommend that
students be more exposed to authentic English and practice using English whenever
and wherever possible, whether or not ELPA was an assessment component of the
course. They believed that an increase in proficiency would naturally lead to better
performance in ELPA, so what they did in the classroom was simply based on the
principles and good practice of an effective EFL classroom. ELPA could be one of
the more latent teaching goals, but certainly was not to be exploited as the means for
teaching.
While the other four teacher participants agreed that this naturalistic approach
was also what they followed in the classroom to help students to develop their
receptive skills, they saw more commonality between the sections of the ELPA test
on productive skills and vocabulary knowledge and the curriculum, so this could form a useful basis for teaching that served the purposes of both. For example, topic
development and coherence is one of the assessment focuses of both the ELPA
speaking and writing tests. What these teachers said they did was teach students
strategies for stronger idea development which were emphasized in the ELPA per-
formance descriptors, i.e., substantiation of arguments with examples and reasons,
rebuttal, paragraph unity with clear topic sentences and controlling ideas. They
argued that this approach would contribute to desirable performance in both the test
and the course assessment. Asked if they would teach in the same way were ELPA
not a component of the course, they admitted that they would do more or less the
same but the intensity perhaps would be much weaker and their focus would prob-
ably be only on the tasks required in the course assessment rather than a wider range
of tasks similar to those used in the ELPA test.
5.5 Impact
The first impact that ELPA has created is in the resources it demands to maintain the current scope of operation: test development, test administration, management of a team of assessors, statistical analysis, quality assurance, and score and result reporting. It is a resource-hungry assessment operation. Yet it is a
view shared by the development team, the curriculum team, and the Center senior
management that the current assessment-curriculum model adopted is the best pos-
sible approach in our context to ensure a high degree of overlap between the test, the
curriculum and the target language domain for positive washback. Colleagues in CLE regard their various involvements in ELPA as opportunities for personal and professional development, and the gains they obtain can benefit the development and delivery of other courses.
Because of the high stakes of ELPA in the English Core curriculum, a policy has
been put in place to counteract the probabilistic nature of assessments – all border-
line fail cases in ELPA and English Core are required to be closely scrutinized to
ensure accuracy in the evidence and appropriateness in the decisions made. Those
who fail to meet the threshold by the end of the second semester can retake the course and the
test in the summer. Classes for repeaters will be much smaller in size to provide
more support and attention for the students, and they will have a range of options as
to what types of out-of-class support they need and prefer. Evidence from the class-
room about achievements and learning progress is also gathered to support any
further decisions to be made.
6 Conclusion
References
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning
and assessment. New York: Continuum.
Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied Linguistics, 14(2), 115–129.
American Educational Research Association, American Psychological Association, and National
Council on Measurement in Education. (1999). Standards for educational and psychological
testing. Washington, DC: American Educational Research Association.
AUQA (Australian Universities Quality Agency). (2009). Good practice principles for English
language proficiency for international students in Australian universities. Report to the
Department of Education, Employment and Workplace Relations, Canberra.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University
Press.
Bailey, K. M. (1996). Working for washback: A review of the washback concept in language test-
ing. Language Testing, 13(3), 257–279.
Green, A. (2007). IELTS washback in context: Preparation for academic writing in higher educa-
tion (Studies in Language Testing, Vol. 25). Cambridge: UCLES/Cambridge University Press.
Hamp-Lyons, L. (1997). Washback, impact and validity: Ethical concerns. Language Testing,
14(3), 295–303.
Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge University
Press.
Khalifa, H., & Weir, C. J. (2009). Examining reading: Research and practice in assessing second
language reading (Studies in Language Testing, Vol. 29). Cambridge: UCLES/Cambridge
University Press.
Knoch, U., & Elder, C. (2013). A framework for validating post-entry language assessments
(PELAs). Papers in Language Testing and Assessment, 2(2), 48–66.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York:
American Council on Education and Macmillan Publishing Company.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3),
241–256.
Read, J. (2015). Plenary speech: Issues in post-entry language assessment in English-medium uni-
versities. Language Teaching, 48(2), 217–234.
Saville, N. (2009). Developing a model for investigating the impact of language assessment within
educational contexts by a public examination provider. Unpublished PhD thesis, University of
Bedfordshire.
Shaw, S., & Weir, C. J. (2007). Examining writing: Research and practice in assessing second
language writing (Studies in Language Testing, Vol. 26). Cambridge: UCLES/Cambridge
University Press.
Wall, D. (1997). Impact and washback in language testing. In C. Clapham & D. Corson (Eds.),
Language testing and assessment (pp. 291–302). Amsterdam: Kluwer Academic Publishers.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. New York:
Palgrave Macmillan.
Chapter 5
Can Diagnosing University Students’ English
Proficiency Facilitate Language Development?
A. Urmston (*)
English Language Centre, Hong Kong Polytechnic University, Hong Kong, China
e-mail: alan.urmston@polyu.edu.hk
M. Raquel
Centre for Applied English Studies, University of Hong Kong, Hong Kong, China
e-mail: michelle.raquel@hku.hk
V. Aryadoust
National Institute of Education, Nanyang Technological University,
Singapore, Republic of Singapore
e-mail: vahid.aryadoust@nie.edu.sg
1 Introduction
1 For a detailed description of the self-access centres in universities in Hong Kong, see Miller and
Gardner (2014).
2 See, for example, http://elc.polyu.edu.hk/Clubs-Societies
3 For more information on EES, see http://elc.polyu.edu.hk/EES/
4 The Hong Kong Diploma of Secondary Education (HKDSE) and previously the Hong Kong
Advanced Level Use of English examinations.
5 Between 2003 and 2014, the Hong Kong Government employed the International English Language Testing System (IELTS) under its Common English Proficiency Assessment Scheme, such that final-year undergraduates were funded to take IELTS and the results were used by the Government as a measure of the effectiveness of English language enhancement programmes at the Government-funded institutions. In addition, one institution, the Hong Kong Polytechnic University, requires its graduating students to take its own Graduating Students’ Language Proficiency Assessment (GSLPA).
what has been missing is a mechanism to determine students’ abilities in the lan-
guage earlier on in their university studies, at a time when they would still be able
to improve. In other words, there has been a perceived need for a formative assess-
ment of the English proficiency of students while at university that could both mea-
sure students’ development in the language and offer diagnostic information to help
them to improve. The Diagnostic English Language Tracking Assessment (DELTA)
was developed to address this need.
DELTA is considered to be an assessment of English language proficiency.
According to Bachman and Palmer’s (1996) conceptualisation, a frame of reference
such as a course syllabus, a needs analysis or a theory of language ability is required
to define the construct to be assessed. As DELTA is not tied to any particular sylla-
bus and is designed for use in different institutions, it is based on a theory
of language ability, namely that presented in Bachman and Palmer (1996), which
was derived from that proposed by Bachman (1990). Under their theory or frame-
work, language ability consists of language knowledge and strategic competence.
As the format of DELTA is selected response, it is considered that strategic compe-
tence does not play a major part in the construct and so language knowledge (includ-
ing organisational knowledge – grammatical and textual, and pragmatic
knowledge – functional and sociolinguistic) (Bachman and Palmer 1996) is regarded
as the underlying construct. Given the academic English flavour of DELTA, it
should be regarded as an academic language proficiency assessment, as students
have to apply their knowledge of academic English to be able to process the written
and spoken texts and answer the test items.
DELTA consists of individual multiple-choice tests of listening, vocabulary,
reading and grammar (with a writing component under development). The reading,
listening, and grammar items are text-based, while the vocabulary items are dis-
crete. Despite the fact that a multiple-choice item format limits the items to assess-
ing receptive aspects of proficiency, this format was chosen to allow for immediate
computerised marking. The Assessment lasts 70 min. Each component (except
vocabulary) consists of a number of parts, as shown in Table 5.2.
The listening and reading components consist of four and three parts respectively
and the grammar component of two parts. The DELTA system calculates the total
number of items in these three components and then adds items to the vocabulary
component such that the total number of items on the assessment equals 100.
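The assembly rule just described is simple enough to state as code. The following sketch is ours, not the DELTA engine, and the component counts in the example are invented:

```python
# Sketch of the DELTA assembly rule described above: after the text-based
# components are assembled, vocabulary items top the paper up to 100.
# The example component counts below are hypothetical, not real DELTA figures.
TOTAL_ITEMS = 100

def vocabulary_item_count(listening: int, reading: int, grammar: int) -> int:
    """Number of discrete vocabulary items needed to reach the 100-item total."""
    used = listening + reading + grammar
    if used > TOTAL_ITEMS:
        raise ValueError("text-based components already exceed the total")
    return TOTAL_ITEMS - used

print(vocabulary_item_count(listening=30, reading=25, grammar=20))  # -> 25
```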
DELTA uses discrete items which test identified subskills of listening, reading,
grammar and vocabulary. The subskills are written in accessible language in the
DELTA student report. In addition, Rasch measurement is used to calibrate items
and generate the reports. Student responses to the test items are exported from the
DELTA system and imported into WINSTEPS v3.81 (Linacre 2014), a Windows-
based Rasch measurement software program. After analysis, students’ DELTA
Measures and item calibrations are returned to the DELTA system and the reports
are generated. DELTA Measures are points on the DELTA proficiency scale, set at
0–200, which allows progress (or growth) to be tracked each time a student takes
the DELTA (see Fig. 5.1).6

[Fig. 5.1 A sample student tracking chart, plotting the student’s DELTA Measures (113, 118, 120) across the 08/2011, 08/2012 and 08/2013 administrations against a predicted measure and a target measure (125) at graduation; vertical axis: DELTA Measure, 60–180]
With the aid of the report that students receive after taking DELTA7 plus the kind
of programmes and activities described, DELTA has been employed at four univer-
sities in Hong Kong to help raise English proficiency and motivate students to con-
tinue to maintain or (hopefully) raise their proficiency throughout their time at
university.
6 For a detailed description of the development and implementation of DELTA, see Urmston et al. (2013).
7 For a sample DELTA report, see the Appendix.
Again, programmes such as the EES, in which DELTA is embedded, enable the
diagnosis to continue as a process beyond the taking of the assessment itself. At
Baptist University, students identified as in need of language support take DELTA
as a component of the English Language Mentoring Service (ELMS). According to
the website of the Language Centre at Baptist University:
ELMS … aims to provide students with a supportive environment conducive to English
language learning in their second semester at university. The content will be discussed and
negotiated between the students and lecturers, taking into consideration the wishes and
needs of the students and the expertise and advice of the lecturer. Students will also be
encouraged to learn relevant topics at their own pace and in their own time, and according
to their individual needs and priorities (http://lc.hkbu.edu.hk/course_noncredit.php).
Wherever feasible this is the case with DELTA, though students are able to take
the assessment on a voluntary basis and whether or not there is future treatment
depends on the student. In such cases, students are strongly encouraged to make use
of the learning resources provided through links embedded into the DELTA report
and to seek advice from an instructor or language advisor.
8 In fact, for listening, reading and grammar, as DELTA adopts a testlet model, its items cannot be considered totally discrete, though for purposes of item calibration and ability measurement, unidimensionality is assumed.
2 The Study
This chapter reports on a study of the 475 students who took DELTA for the second
time in 2013. These were students for whom it would be possible to determine
whether their English proficiency had changed over one academic year at university
in Hong Kong. The study aimed to answer the following research questions:
1. Can DELTA reliably measure differences in students’ English language proficiency over a 1-year period?
2. Was there a difference in the students’ proficiency, i.e. their DELTA Measures,
between their first and second attempt of the DELTA?
3. What might account for this difference (if any) in DELTA Measures?
4. Does the DELTA have any impact on students’ language proficiency develop-
ment and if so, how?
3 Methodology
The study attempts to answer the first two research questions by determining empir-
ical differences in DELTA Measures of the 475 students over the two attempts at
DELTA using statistical methods described in the following sections. To answer the
third and fourth questions, a questionnaire was administered to gain information on
students’ past and current language learning activities, experiences and motivations,
their perceptions of the ability of DELTA to measure their English proficiency, and
the usefulness of the DELTA report. The questionnaire was delivered online through
Survey Monkey and students were asked to complete it after they had received their
second attempt DELTA report. A total of 235 students responded to the question-
naire. Of these 235, in-depth focus group interviews were conducted with eight
students to determine the extent to which DELTA has any impact on students’ lan-
guage learning while at university. Content analysis of the data (Miles and
Huberman 1994; Patton 2002) was undertaken using QSR NVivo 10.
Overall, 2244 test items (1052 in 2012 and 1192 items in 2013) were used to assess
the reading and listening ability as well as grammar and vocabulary knowledge of
the students who took DELTA in 2012 and 2013. One thousand and five items were
common and used to link the test takers. The psychometric quality of the items
across time was examined using Rasch measurement.

Table 5.4 Reliability and separation indices of the DELTA components across time

             2012                      2013
Test         Reliability  Separation   Reliability  Separation
Reading      .97          5.50         .96          4.70
Listening    .99          8.37         .98          6.65
Grammar      .97          5.47         .92          3.48
Vocabulary   .99          13.28        .99          8.27

Initially, the items adminis-
tered in 2012 were linked and fitted to the Rasch model. The reliability and separa-
tion statistics, item difficulty, and infit and outfit mean square (MNSQ) coefficients
were estimated. Item difficulty invariance as well as reliability across time were also
checked. Item reliability in Rasch measurement is an index for the reproducibility
of item difficulty measures if the items are administered to a similar sample drawn
from the same population. Separation is another expression of reliability that esti-
mates the number of item difficulty strata.
Infit and outfit MNSQ values are chi-square statistics used for quality control
analysis and range from zero to infinity; the expected value is unity (1), but a slight
deviation from unity, i.e., between 0.6 and 1.4, still indicates productive measure-
ment in the sense that the data is likely not affected by construct-irrelevant
variance.
To estimate item difficulty measures, each test component was analyzed sepa-
rately using WINSTEPS v3.81 (Linacre 2014). In each analysis, the item difficulty
measures were generated by first deleting misfitting items and the lowest 20 % of
persons (students) to ensure the best calibration of items for each component. These
calibrations were then used to generate person measures.
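As a minimal sketch of this two-stage purification, assuming scored responses and provisional estimates are available as NumPy arrays (the thresholds follow the MNSQ range given above; the function and array names are ours, not part of WINSTEPS):

```python
import numpy as np

def purify(responses: np.ndarray, infit: np.ndarray, outfit: np.ndarray,
           person_measures: np.ndarray,
           lo: float = 0.6, hi: float = 1.4, trim: float = 0.20) -> np.ndarray:
    """Drop misfitting items and the lowest 20% of persons before
    re-estimating item calibrations, as described in the text.

    responses       persons x items matrix of scored (0/1) responses
    infit, outfit   per-item mean-square fit statistics from a first run
    person_measures provisional person ability estimates in logits
    """
    # Keep items whose infit and outfit MNSQ both fall in [0.6, 1.4].
    item_ok = (infit >= lo) & (infit <= hi) & (outfit >= lo) & (outfit <= hi)
    # Drop the bottom 20% of persons by provisional measure.
    person_ok = person_measures > np.quantile(person_measures, trim)
    # The retained matrix would then be refitted to obtain final calibrations.
    return responses[np.ix_(person_ok, item_ok)]
```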
Table 5.4 presents the reliability and separation indices of the reading, listening,
grammar and vocabulary components across time. The components maintain their
discrimination power; for example, the reliability and separation coefficients of the
reading component in 2012 were .97 and 5.50, respectively, and the reliability index
in 2013 was highly similar (.96); the separation index, if rounded, indicates the pres-
ence of approximately five strata of difficulty across the two time points. We also
note similarities between separation and reliability estimates of the other DELTA
components across the 2 years. A seemingly large difference exists between the
separation statistics of the vocabulary test across the 2 years, despite their equal reli-
ability estimates. The discrepancy stems from the nature of reliability and separa-
tion indices: whereas the near-maximum reliability estimate (.99) is achieved, the
separation index has no upper bound limit and can be any value equal to or greater
than eight, depending on the sample size and measurement error (Engelhard 2012).
Overall, there is evidence that the reliability of the components did not drop across
time. In addition, the infit and outfit MNSQ values of the test items all fell between
0.6 and 1.40, indicating that the items met the requirements of the Rasch model and
it was highly unlikely that construct-irrelevant variance confounded the test data.
[Fig. 5.2 Scatterplot of the 2012 item calibrations against the 2013 item calibrations for the 1005 common items, with 95 % two-sided confidence bands]
We inspected the invariance of item difficulty over time by plotting the difficulty of
items in 2012 against that in 2013. Figure 5.2 presents the related scatterplot of the
1005 items which were common across administrations. As the figure shows, almost
all items fall within the 95 % two-sided confidence bands represented by the two
lines, suggesting that there was no significant change in item difficulty over time.
This can be taken as evidence for invariant measurement since the psychometric
features of the test items are sustained across different test administrations
(Engelhard 2012).
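A sketch of this invariance check, assuming paired item difficulties and standard errors in logits, and approximating the 95 % confidence bands with a constant-width band derived from the items’ joint standard errors (a simplification of the plot described above):

```python
import numpy as np
import matplotlib.pyplot as plt

def invariance_plot(d_2012, d_2013, se_2012, se_2013):
    """Plot common-item difficulties across administrations with
    approximate 95% confidence bands around the identity line."""
    joint_se = np.sqrt(se_2012**2 + se_2013**2)   # SE of the difference
    band = 1.96 * np.median(joint_se)             # constant-width approximation
    line = np.linspace(min(d_2013.min(), d_2012.min()),
                       max(d_2013.max(), d_2012.max()), 200)

    plt.scatter(d_2013, d_2012, s=10)
    plt.plot(line, line, "k-")          # identity line: no drift
    plt.plot(line, line + band, "k--")  # upper confidence band
    plt.plot(line, line - band, "k--")  # lower confidence band
    plt.xlabel("2013 item calibrations")
    plt.ylabel("2012 item calibrations")
    plt.show()

# Items falling outside the dashed bands would be flagged as drifting.
```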
After ensuring that the test items had stable psychometric qualities, we performed
several bootstrapped paired sample t-tests with 1000 bootstrap samples to deter-
mine whether there was a significant difference between the mean DELTA measures
of the students across time. Bootstrapping was used to control and examine the
stability of the DELTA results, approximate the population parameters, and estimate
the true confidence intervals (DiCiccio and Efron 1996). This test initially estimates
the correlation between the DELTA measures over time. Table 5.5 presents the cor-
relation coefficients of the DELTA measures in 2012 and 2013. Except in the case
of the overall measures, the correlation coefficients are below .700, meaning that,
on average, the rank-ordering of students across times one and two tended to be
rather dissimilar (Field 2005) (e.g. low/high measures in 2012 are not highly associ-
ated with low/high measures in 2013). This suggests that there might have been an
increasing or decreasing trend for the majority of the students who took the
DELTA. The bootstrapped correlation also provides 95 % confidence intervals, indi-
cating that the true correlations (the correlation of the components in the popula-
tion) fall between the lower and upper bands. For example, the estimated correlation
between the listening measures of the students in 2012 and 2013 was .449, with
lower and upper bands of .378 and .516 respectively. This suggests that the esti-
mated correlation is close to the mean of the bands and is therefore highly reliable.
The bootstrapped paired sample t-test results are presented in Table 5.6. While
the listening, vocabulary and overall test measures significantly increased across
time, as indicated by the significant mean differences (p < 0.001), the reading and
grammar measures had no significant increase.
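A minimal sketch of the bootstrapped paired comparison, using 1000 resamples as in the study (the percentile-interval implementation is our reconstruction, not the authors’ exact procedure):

```python
import numpy as np

rng = np.random.default_rng(2013)  # seed chosen arbitrarily for reproducibility

def bootstrap_paired_diff(time1: np.ndarray, time2: np.ndarray, reps: int = 1000):
    """Bootstrap the mean difference between paired DELTA measures.

    Returns the observed mean difference and a percentile 95% CI;
    an interval excluding zero indicates a reliable change over time.
    """
    diffs = time2 - time1               # per-student change in measures
    n = len(diffs)
    boot_means = np.array([rng.choice(diffs, size=n, replace=True).mean()
                           for _ in range(reps)])
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    return diffs.mean(), (lower, upper)
```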
To relate this evidence of increase in DELTA measures to growth in terms of
English language proficiency of the students, we took as a parameter a 0.5 logit
difference in the DELTA scale.9 This was then used as the cut-off point to determine
9 A difference of 0.5 logits on a scale of educational achievement is considered statistically significant in educational settings, based on OECD (2014) results of an average 0.3–0.5 annualised score point change reported for the PISA 2012 test. The PISA scale is from 0 to 500.
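Read together with the grower/sustainer/decliner grouping used later in the chapter, the cut-off amounts to a simple threshold rule; the sketch below is our reconstruction (in particular, the symmetric ±0.5 logit boundaries are assumed rather than stated):

```python
def classify(change_in_logits: float, cutoff: float = 0.5) -> str:
    """Label a student's change in DELTA measure between two attempts.

    'Grower', 'sustainer' and 'decliner' follow the chapter's terminology;
    the symmetric +/-0.5 logit rule is an assumed reading of the cut-off.
    """
    if change_in_logits >= cutoff:
        return "grower"
    if change_in_logits <= -cutoff:
        return "decliner"
    return "sustainer"
```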
[Survey charts: responses to “How would you rate your improvement in English … during your first year at university?” for:
– reading skills (identifying specific information; understanding main ideas (general comprehension); inferring meaning; feeling at ease while reading (confidence)), rated High/Average/Low/None;
– listening skills (the same four aspects), rated High/Average/Low/None;
– grammar (grammatical accuracy in writing; grammatical accuracy in speaking), rated High/Quite high/Average/Quite low/Low/None;
– vocabulary (understanding unknown words from context; academic vocabulary knowledge; academic vocabulary use; feeling at ease with vocabulary), rated High/Quite high/Average/Quite low/Low/None]
Of other activities, English was used by more than half the students for email,
online messaging, social media and listening to music for more than an hour a day
(Table 5.10).
The perceptions of improvement or lack of it are further explained by the stu-
dents’ motivations to learn English while at university. Survey results revealed that
students’ main reasons for English language learning were factors such as meeting
a course requirement, eligibility to participate in exchange programmes, importance
of English for future career, and encouragement from teachers and parents. However,
lack of confidence in their ability to learn English and feelings of anxiety while
learning a language continued to be hindrances. These results suggest that English
language learning while at university is mainly for pragmatic reasons, i.e. the need
to use English for academic purposes.
In order to determine how students perceive the impact of DELTA on their English
language learning habits, eight of the students from Lingnan University were asked
to participate in focus group interviews to elaborate on their perceptions of DELTA
and its impact on their English learning. At Lingnan University, students use their
DELTA report as input for the independent learning component of their compulsory
English language enhancement course, English for Communication II. The inde-
pendent learning component accounts for 20 % of the course grade and many stu-
dents do use their DELTA reports for diagnosis, i.e. to help identify areas of relative
weakness, formulate learning plans or pathways and work on these in their indepen-
dent learning.
Three growers and five sustainers participated in the focus group interviews. All
of the students claimed that DELTA was able to reflect their English proficiency in
that the DELTA report accurately reported their strengths and weaknesses. All of
them used the report as a first step to improve their English proficiency. What dis-
tinguished the growers from the sustainers, however, was how they approached their
own language learning. First, a quote from one of the growers:
I tried [using the independent learning links in the DELTA report] when I was in year one,
semester one but I stopped trying it because I have my own way of learning English, which
is like in last year, my semester one, I listened to TED speech. I spent summer time reading
ten English books and made handful notes. I also watched TVB news [a local English-
language TV station] online to practice my speaking. I also watched a TV programme. I
used to use the advanced grammar book and there is a writing task, I forget the name, I
bought it. It helped to improve my English. It’s really a good book, it helped me to improve
my grammar and writing skills. So people have different ways to learn English. I’ve found
my way to learn English. I think these websites may be useful to someone, but not to me.
Tony (grower)
Table 5.11 Top English activities by growers and sustainers that helped improve their English
1 Reading in English (fiction, non-fiction, magazines)
2 Using self-access centre (Speaking and/or Writing Assistance Programme)a
3 Listening to lecturesa
4 Watching TV shows or programmes
5 Text messaging
6 Talking to exchange students (inside or outside the classroom)
7 Academic reading (journal articles, textbooks)a
8 Using dictionary to look for unknown words
9 Listening to and/or watching TED talks
10 Doing grammar exercises
11 Listening to music
12 Test preparation
13 Watching YouTube clips
14 Watching movies
15 Doing online activities
16 Attending formal LCE classesa
17 Memorising vocabulary
18 Joining clubs and societies
19 Reading and writing emails
20 Exposure to English environment
a Study-related activities
Sustainers were similar to growers in that they did not find the independent learn-
ing links provided in the DELTA report useful for their learning. However, they
required further guidance from teachers to improve their English. They felt that the
DELTA report was useful and accurately reflected their strengths and weaknesses
but they attributed their lack of development to not having support from teachers to
show them what the next step in their language learning should be. This confirms
the survey finding that lack of confidence in their ability to learn English was a hin-
drance to further development, as well as supporting Alderson et al.’s (2015) second
principle for diagnostic assessment, that teacher involvement is key.
The participants were also asked to describe the top activities that they thought
helped them improve various aspects of their English. Table 5.11 lists the top 20
activities that growers and sustainers specifically thought were useful in their
English language growth.
Surprisingly, only three of the top ten activities are study-related (listening to
lectures, using self-access centre and academic reading) and the rest are all non-
study-related activities. Reading in English was the most popular activity followed
by the use of services offered by the self-access centre, then listening to lectures
and watching TV shows or programmes in English. These results suggest that
if students want to improve their English, they clearly have to find activities that suit
their learning styles, and this in turn will motivate them to learn. As Tony said,
So I think it’s very important when you think about your proficiency - if you’re a highly
motivated person then you will really work hard and find resources to improve your English.
But if you’re like my roommate, you don’t really work hard in improving English, then his
English proficiency skills will be just like a secondary school student. Seriously. Tony
(grower)
5 Discussion
The first of our research questions asked whether the diagnostic testing instrument
used, DELTA, can reliably measure differences in students’ English language profi-
ciency over a 1-year period. Overall, the results of the psychometric analysis pro-
vided fairly strong support for the quality of the four component tests (listening,
reading, grammar and vocabulary). In addition, the bootstrapped paired sample
t-test results indicated that there was a statistically significant difference between
students’ performance across time. In other words, DELTA can be used to measure
differences in English language proficiency over a 1-year period.
Secondly, there was a difference shown in some students’ proficiency, i.e. their
DELTA Measures, between their first and second attempts of the DELTA. Inevitably,
some students improved or grew while others actually showed regression or decline.
In most cases, though, there was no difference measured. Results seemed to indicate
an overall increase in the proficiency of the group, in that the number of growers was
greater than the number of decliners, which would no doubt please university
administrators and programme planners. More specifically, there were more grow-
ers than decliners in listening and vocabulary, while reading and grammar saw no
discernible change. Such information again is useful for programme planners and
teachers in that they can look at which aspects of their English language provision
they need to pay more attention to.
In seeking what might account for this difference in DELTA Measures, we have
looked at students’ reported English-use activities. Time spent in lectures, seminars
and tutorials requiring them to listen in English seems to have impacted their profi-
ciency in this skill, while their self-reported attention to academic reading seems to
have improved their academic vocabulary to a greater extent than their reading
skills. Indications are that students who do show growth are those that adopt their
own strategies for improvement to supplement the use of the language they make in
their studies.
Qualitative results suggest that DELTA has impact as students report that it is
valuable as a tool to inform them of their strengths and weaknesses in their English
proficiency. For those required to create independent learning plans, DELTA reports
are the first source of information students rely on. The real value of DELTA, how-
ever, is the tracking function it provides. Interviews with growers and sustainers
suggest that those students who want to improve their proficiency obviously do
more than the average student; these students are fully aware of their learning styles
and seek their own learning resources and maximize these. Thus, DELTA’s tracking
function serves to validate the perception that their efforts have not been in vain.
This suggests that perhaps DELTA should be part of a more organised programme
which helps students identify learning resources that suit their learning styles and
needs and involves the intervention of advisors or mentors. An example of this is the
Excel@English Scheme (EES) at Hong Kong Polytechnic University mentioned
previously. This scheme integrates DELTA with existing language learning activi-
ties as well as custom-made learning resources and teacher mentoring. It allows for
student autonomy by providing the support that is clearly needed.
6 Conclusion
This chapter has described how a diagnostic assessment can be used to inform and
encourage ESL students’ development in English language proficiency as support
for them as they progress through English-medium university studies. The assess-
ment in question, the Diagnostic English Language Tracking Assessment (DELTA),
has been shown to provide reliable measures of student growth in proficiency, while
the diagnostic reports have proved to be a useful starting point for students in their
pursuit of language development. What has become clear, though, is that the diag-
nostic report alone, even with its integrated language learning links, is not enough
and students need the support of teachers to help them understand the results of the
diagnostic assessment and provide the link to the resources they can use and the
materials that are most appropriate for them, given their needs and learning styles.
Clearly a bigger picture needs to be drawn to learn more about how a diagnostic
assessment like DELTA can impact language development, and this will be possible
as more students take the assessment for a second, third or even fourth time.
Language proficiency development is a process and it is to be hoped that for univer-
sity students, it is one that is sustained throughout their time at university.
7 Appendix
[Sample DELTA student report]
References
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning
and assessment. London: Continuum.
Alderson, J. C., Brunfaut, T., & Harding, L. (2015). Towards a theory of diagnosis in second and
foreign language assessment: Insights from professional practice across diverse fields. Applied
Linguistics, 36(2), 236–260.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford
University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University
Press.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education,
5(1), 7–71.
Buck, G. (2001). Assessing listening. New York: Cambridge University Press.
DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11,
189–228.
Engelhard, G. (2012). Invariant measurement: Using Rasch models in the social, behavioral, and
health sciences. New York: Routledge.
Evans, S., & Green, C. (2007). Why EAP is necessary: A survey of Hong Kong tertiary students.
Journal of English for Academic Purposes, 6(1), 3–17.
Evans, S., & Morrison, B. (2011). Meeting the challenges of English-medium higher education:
The first-year experience in Hong Kong. English for Specific Purposes, 30(3), 198–208.
Field, A. (2005). Discovering statistics using SPSS (2nd ed.). London: Sage.
Lee, I., & Coniam, D. (2013). Introducing assessment for learning for EFL writing in an assess-
ment of learning examination-driven system in Hong Kong. Journal of Second Language
Writing, 22(1), 34–50.
Linacre, J. M. (2014). Winsteps: Rasch model computer program (version 3.81). Chicago: www.
winsteps.com
Miles, M. B., & Huberman, M. A. (1994). Qualitative data analysis: An expanded sourcebook.
Thousand Oaks: Sage.
Miller, L., & Gardner, D. (2014). Managing self-access language learning. Hong Kong: City
University Press.
OECD. (2014). PISA 2012 results in focus: What 15-year-olds know and what they can do with
what they know. The Organisation for Economic Co-operation and Development (OECD).
Retrieved from http://www.oecd.org/pisa/keyfindings/pisa-2012-results-overview.pdf
Patton, M. Q. (2002). Qualitative research and evaluation methods. Thousand Oaks: Sage.
Sadler, R. (1998). Formative assessment: Revisiting the territory. Assessment in Education, 5(1),
77–84.
Taras, M. (2005). Assessment – summative and formative – some theoretical considerations.
British Journal of Educational Studies, 53, 466–478.
Urmston, A., Raquel, M., & Tsang, C. (2013). Diagnostic testing of Hong Kong tertiary students’
English language proficiency: The development and validation of DELTA. Hong Kong Journal
of Applied Linguistics, 14(2), 60–82.
Yorke, M. (2003). Formative assessment in higher education: Moves towards theory and the
enhancement of pedagogic practice. Higher Education, 45(4), 477–501.
Part III
Addressing the Needs
of Doctoral Students
Chapter 6
What Do Test-Takers Say? Test-Taker
Feedback as Input for Quality Management
of a Local Oral English Proficiency Test
Xun Yan, Suthathip Ploy Thirakunkovit, Nancy L. Kauper, and April Ginther
X. Yan (*)
Department of Linguistics, University of Illinois at Urbana-Champaign,
Urbana-Champaign, IL, USA
e-mail: xunyan@illinois.edu
S.P. Thirakunkovit
English Department, Mahidol University, Bangkok, Thailand
e-mail: suthathip.thi@mahidol.ac.th
N.L. Kauper
Oral English Proficiency Program, Purdue University, West Lafayette, IN, USA
e-mail: nkauper@purdue.edu
A. Ginther
Department of English, Purdue University, West Lafayette, IN, USA
e-mail: aginther@purdue.edu
process during that same period of time. Carefully considering these responses has
contributed substantially to our quality control processes.
1 Introduction
The post-test questionnaire (PTQ) was introduced with the OEPT in 2001 in order to (1) track student access to and use of test preparation
materials and (2) understand and monitor examinee perception of general OEPT
characteristics. Section III of the PTQ, consisting of two open-ended questions, was
added in 2009 in order to identify any problems that may have been missed in test-
taker responses to the fixed-response items in Sections I and II. Monitoring exam-
inee feedback through the PTQ has become a central component of our quality
management process.
2 Literature Review
Of particular interest in our context are studies examining test-taker feedback about
semi-direct testing formats for oral proficiency testing. Given that the Speaking
subsection of the TOEFL iBT is semi-direct and is taken by the majority of interna-
tional applicants for North American universities to demonstrate required language
proficiency, semi-direct oral proficiency testing can now be assumed largely famil-
iar to prospective international examinees; however, familiarity does not ensure
comfort with, or acceptance of, the procedures associated with the semi-direct
format.
The benefits of semi-direct testing are largely associated with cost effectiveness
and efficiency in that interviewers are not required and ratings of recorded perfor-
mances can be captured, stored, and rated remotely after the real-time administra-
tion of the test. However, cost efficiency alone cannot justify the use of semi-direct
formats, and researchers have considered the comparability of semi-direct and
direct formats to determine whether examinees are ranked similarly across formats.
In a comparison of the ACTFL Oral Proficiency Interview (OPI) to its semi-direct
counterpart (the ACTFL SOPI), Stansfield and Kenyon (1992) reported a high
degree of concurrent validity based on strong positive correlations (0.89–0.92)
across direct and semi-direct formats. Shohamy (1994) also found strong positive
correlations across a Hebrew OPI and SOPI but cautioned against assuming total
fidelity of the formats as language samples produced in response to the direct OPI
tended to be more informal and conversational in nature, while those produced in
response to the SOPI displayed more formality and greater cohesion.
The absence of an interviewer can be seen as either a negative or positive attri-
bute of the semi-direct format. The most obvious drawback in the use of semi-direct
formats lies in the omission of responses to questions and in the lack of opportunity
for responses to be extended through the use of interviewer-provided probes; that is,
the apparent limitations to the validity of the format are due to the absence of inter-
activity. On the other hand, standardization of the test administration removes the
variability associated with the skill and style of individual interviewers, resulting in
an increase in reliability and fairness, in addition to cost effectiveness and
efficiency.
Test-takers can reasonably be expected to value validity over reliability, and they
may not appreciate cost effectiveness even if they benefit. Comparisons of semi-
direct and direct formats typically include examinations of test-taker perceptions
and historically these perceptions have favored direct oral testing formats over their
semi-direct counterparts. McNamara (1987), Stansfield et al. (1990), Brown (1993),
and Shohamy et al. (1993) report test-taker preferences for semi-direct formats
ranging from a low of 4 % (Shohamy et al.) to 57 % (Brown); however, the prefer-
ence for the semi-direct format reported by Brown seems to be an outlier. In a more
recent comparison of test-taker oral testing preferences in Hong Kong, Qian’s
(2009) results suggest that a shift may be taking place. In his study, while 32 % of
the respondents reported that they preferred the direct format, 40 % were neutral, so
perhaps the strongly negative perceptions reported in former studies may be dissi-
pating. Again, however, only 10 % of his respondents actually favored the semi-
direct format. The remainder disliked both formats or had no opinion.
In a much larger scale study of test-taker feedback, Stricker and Attali (2010)
report test-taker attitudes towards the TOEFL iBT after sampling iBT test-takers
from China, Colombia, Egypt, and Germany. Stricker and Attali reported that, while
test-taker attitudes towards the iBT tended to be moderately favorable overall, they
were more positive towards the Listening and Writing subsections, and decidedly
less favorable about the Speaking subsection. Test-takers were least favorable about
the Speaking subsection in Germany, where 63 % disagreed that The TOEFL gave
me a good opportunity to demonstrate my ability to speak in English, and in
Colombia, where 45 % also disagreed with the statement. Although the researchers
did not directly ask respondents to comment on direct versus semi-direct test for-
mats, it seems safe to assume that the semi-direct format is implicated in their less
favorable attitudes toward the Speaking subsection. Substantial percentages of
respondents in Egypt (40 %) and China (28 %) also disagreed with the statement,
but negative attitudes were less prevalent in these countries.
If a language testing program decides that the benefits of semi-direct oral testing
are compelling, then the program should acknowledge, and then take appropriate
steps to ameliorate, the negative test-taker attitudes towards the semi-direct format.
Informing prospective test-takers about the use of the format and providing test
preparation materials is an essential step in this process; however, the provision of
information is only the first step. Collecting, monitoring, and evaluating test-taker
feedback concerning their access to and use of test prep materials is also
necessary.
Test quality management processes and validation processes may intersect in the
type of evidence collected. The two processes differ, however, in their purposes and
in their use of evidence, quality management being a more practical process aimed
at effecting improvements to testing processes rather than providing supporting
warrants in a validity argument such as that articulated for TOEFL by Chapelle
et al. (2008). But because quality management processes may result in improve-
ments to test reliability and fairness, such procedures may affect, and be considered
part of, the overall validation process.
Acknowledging the scarcity of literature on systematic quality management pro-
cedures for language testing, Saville (2012) emphasizes the importance of periodic
test review for effective quality control (p. 408) and provides a quality management
model that links periodic test review to the assessment cycle. He argues that quality
control and validity go hand in hand. Saville’s quality management process, like the
test validation process, is iterative in nature and consists of the following five stages:
1. Definition stage: recognize goals and objectives of quality management;
2. Assessment stage: collect test-related feedback and identify potential areas for
improvement;
3. Decision stage: decide on targeted changes and develop action plans
accordingly;
4. Action stage: carry out action plans; and
5. Review stage: review progress and revise action plans as necessary.
(pp. 408–409)
A comprehensive quality management process favors collection of multiple fac-
ets of information about a test, including information about and from test-takers.
Shohamy (1982) was among the first language testers who called for incorporation
of test-taker feedback to inform the test development process and test score use.
Stricker et al. (2004) also advocated periodic monitoring of test-taker feedback.
Although one might argue that feedback from test-takers can be very subjective, and
some studies have shown that test-taker feedback is partially colored by proficiency
level or performance (Bradshaw 1990; Iwashita and Elder 1997), there are certain
types of information that only test-takers can provide. Only test-takers themselves
can report on their access to and use of test preparation materials, their understand-
ing of test instructions and materials, their experience of the physical test environ-
ment, and their attitudes about interactions with test administrators.
3 Research Questions
The purpose of the present study is to describe and discuss how test-taker feedback
is used in our quality control procedures for the OEPT. We have found that test-taker
feedback provides a practical starting point for improvement of the test and our
administrative procedures. The following are questions/concerns that we address by
examining responses to the three sections of the post-test questionnaire (PTQ).
1. To what extent are prospective test-takers provided information about the OEPT?
2. To what extent do they actually use OEPT test prep materials?
3. Do test-takers find selected characteristics of the test (prep time, item difficulty,
representation of the ITA classroom context) acceptable and useful?
4. What experiences and difficulties do test-takers report?
5. Which aspects of the testing process may require improvement?
6. Do actions taken to improve test administration have an effect?
When examinees fail the test, their OEPT test record serves both placement and
diagnostic functions for the ESL communication course for ITAs, a graduate-level
course taught in the OEPP. OEPT scores allow the OEPP to group students into
course sections by proficiency level. Because the course is reserved exclusively for
students who have failed the test, and course sections are small, instructors (who are
also OEPT raters) use the test recordings to learn about their assigned students’
language skills. Instructors listen to and rate test performances analytically, assign-
ing scores to a student’s skills in about a dozen areas related to intelligibility, flu-
ency, lexis, grammar and listening comprehension, using the same numerical scale
as the OEPT. By this means, and before the first class meeting, instructors begin to
identify and select areas that a student will be asked to focus on improving during
the semester-long course. Students then meet individually with instructors to listen
to and discuss their test recording, and the pair formulate individual goals, descrip-
tions of exercises and practice strategies, and statements of how progress towards
the goals will be measured. This OEPT review process also helps students under-
stand why they received their test score and were placed in the OEPP course. During
subsequent weekly individual conferences, goals and associated descriptions on the
OEPT review document may be adjusted until they are finalized on the midterm
evaluation document. Scores assigned to language skills on the OEPT review
become the baseline for scores assigned to those skills on the midterm and final
course evaluations, providing one means of measuring progress. The same scale is
also used for evaluating classroom assessment performances.
Not only is the OEPT test linked to the OEPP course by these placement, diag-
nostic and review practices, but test raters/course instructors benefit from their dual
roles when rating tests and evaluating students. Course instructors must make rec-
ommendations for or against certification of oral English proficiency for each stu-
dent at the end of the course. Training and practice as OEPT raters provide instructors
with a mental model of the level of proficiency necessary for oral English certifica-
tion; raters can compare student performances on classroom assessments to charac-
teristics described in the OEPT scale rubrics and test performances. In turn, when
rating tests, raters’ understanding of the test scale and of the examinee population is
enhanced by their experience with students in the course; OEPT examinees are not
just disembodied voices that raters listen to, but can be imagined as students similar
to those in their classes. Thus, the OEPT test and OEPP course are closely associ-
ated in ways both theoretical and practical.
Two forms of the OEPT practice test, identical in format and similar in content to
the actual test, are available online so that prospective test-takers may familiarize
themselves with the computer-mediated, semi-direct format, task types, and content
of the test. A description of the OEPT scale and sample item responses from actual
test-takers who passed the test are also available to help test-takers understand the
types and level of speaking and listening skills needed to pass the test. The practice
test website also provides video orientations to university policies and the OEPP,
and taped interviews with graduate students talking about student life.
The OEPT is administered in campus computer labs several times a month through-
out the academic year. At the beginning of each testing session, test-takers are given
an orientation to the test, in both written and oral, face-to-face formats, including
directions for proper use of the headset and information about the test-user interface
and the post-test questionnaire. Examinees are also given a brochure after taking the
test which provides information about scoring, score use, and consequences of test
scores.
Each examinee completes a PTQ immediately following the test. The questionnaire
is part of the computer test program and generally requires 5–10 min to complete.
The PTQ is divided into three sections. Section I consists of nine fixed-response
items that elicit information about awareness and use of the OEPT Practice Test
Website; Section II consists of 11 items that elicit information about the overall test
experience and includes questions about the within-test tutorial, within-test prepara-
tion time and response time, and whether test-takers believe their responses to the
OEPT actually reflect their ability. Each question in Section II uses either a binary scale
(yes or no) or a 5-point Likert scale (strongly agree, agree, no opinion, disagree, or
strongly disagree). Section III consists of the following two open-ended questions:
1. Did you encounter any difficulties while taking the test? Please explain.
2. We appreciate your comments. Is there anything else that you would like us to
know?
All questionnaire responses are automatically uploaded to a secure database on
the university computing system.
5 Method
We review OEPT survey responses after each test administration. In this paper, we
will briefly discuss responses from Section I of the PTQ that were collected in the
four test administrations during the fall semester of 2013 (N = 365), as these most
accurately reflect our current state. The remainder of this paper will discuss our
analysis of responses to the two open-ended questions collected over a 3-year
period – 1440 responses from 1342 test-takers1 who took the OEPT between August
2009 and July 2012.
1 The number of responses is higher than the number of examinees because some examinees took the test more than once.
All test-takers were matriculated graduate students from around
the world, most in their 20s and 30s. The majority of test-takers came from the
Colleges of Engineering (43 %) and Science (24 %). Responses to closed-ended
questions from Section I of the survey were analyzed with Statistical Analysis System
(SAS), version 9.3, in terms of frequency counts and percentages. Written responses
to open-ended questions were coded into categories.
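The closed-ended tabulation is a standard frequency analysis; a sketch of the equivalent computation in Python/pandas (the study itself used SAS 9.3, and the example item below is invented):

```python
import pandas as pd

def frequency_table(responses: pd.Series) -> pd.DataFrame:
    """Frequency counts and percentages for one fixed-response PTQ item,
    mirroring the kind of output SAS PROC FREQ reports."""
    counts = responses.value_counts()
    return pd.DataFrame({
        "Frequency": counts,
        "Percent": (100 * counts / counts.sum()).round(2),
    })

# Hypothetical Likert-scale item:
item = pd.Series(["agree", "agree", "strongly agree", "no opinion", "agree"])
print(frequency_table(item))
```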
Figure 6.1 presents item responses to questions about test-taker awareness and use
of the OEPT Practice Test. The figure is interesting in several aspects. First of all,
despite the fact that information about the practice test and the URL for the practice
test is included in the admission letter that the Graduate School sends to each admit-
ted student, slightly less than 60 % of our test-takers typically report that they were
informed by the Graduate School about the OEPT. A higher percentage (87 %)
report being informed by their departments about the OEPT requirement for pro-
spective teaching assistants. A slightly lower percentage (80 %) report that they were
advised to take the practice test, and 70 % report having completed the practice test.
We believe that 70 % completion of the practice test is problematic – especially
when we take into account the negative perceptions of the semi-direct format
reported in the literature. Furthermore, the absence of knowledge about the test
format may negatively affect subsequent test-taker performance.
If we break the results down by department, we can see that prospective test-takers’ completion
of the practice test differs considerably across departments. In some departments,
only 50 % of the examinees completed the practice test; in some smaller programs,
no one completed the test, whereas in other departments, virtually all test-takers
completed it (Fig. 6.2).
Fig. 6.1 Test-takers’ awareness of the OEPT and the OEPT practice test (N = 365)
[Fig. 6.2 Percent of examinees who completed the OEPT practice test (larger departments): ME Mechanical Engineering (N = 55), IE Industrial Engineering (N = 33), PULSE Purdue University Interdisciplinary Life Science (N = 21), STAT Statistics (N = 21), CS Computer Science (N = 24), ECE Electrical Computer Engineering (N = 53), CE Civil Engineering (N = 15)]
It is clearly the case that the provision of a practice test is not sufficient. Local
testing programs must accept the added responsibility/obligation for tracking over-
all awareness and use by department, and then alerting departments about the rela-
tive success of their efforts to inform their students.
After tracking awareness and use since the OEPT became operational in 2001,
we recently achieved our goal of 80 % completion of the practice test across all
departments, and have set a new goal of 90 %. Our ability to hit 90 % is negatively
influenced by turnover in departments of the staff/administrative positions that
function as our liaisons. In addition, some test-takers will probably always decide to
forego the opportunity.
Section II of the PTQ asks test-takers to provide information about the test itself.
Responses to these items are presented in Fig. 6.3.
First, we ask whether test-takers consider themselves prepared to take the test.
This item is somewhat problematic because respondents can interpret this question
as referring to test prep or language proficiency. In either case, only 60 % of the
test-takers over the 3-year period agreed or strongly agreed that they were prepared.
Test-takers are more positive about test length and the instructions within the test.
In both cases, at least 80 % agreed that both the test length and the instructions were
acceptable.
[Fig. 6.3 Responses to items on the OEPT post-test questionnaire Section II (N = 365), rated SA Strongly agree, A Agree, N No opinion, D Disagree, SD Strongly disagree; items include “I learned something about being a TA by taking the test” and “I had enough time to prepare my answers”]
We ask the next two questions because we are always interested in understanding
whether test-takers perceive that our items are reflective of actual TA tasks in class-
room contexts and whether they have learned anything about being a TA by taking
the OEPT. In both cases, around 80 % of the test-takers either agree or strongly
agree that the items are reflective of TA tasks and that the test was informative about
teaching contexts.
The next two questions ask whether test-takers had enough time (1) to prepare
their responses and (2) to respond. When we developed the OEPT, we decided to
allow test-takers 2 min to prepare and 2 min to respond to each item. Unlike other
tests that allow only a very short prep time (e.g., TOEFL iBT speaking allows 15 s),
we opted to turn down the pressure, perhaps at the expense of authenticity. Given
In the initial stages of the coding process, we read and analyzed responses to open-
ended questions several times and identified eight major categories of test-taker
feedback: (1) Positive comments; (2) Comments that indicate no difficulties without
further elaboration; Difficulties associated with: (3) test administration, (4) test
preparation, (5) test design, (6) test-taker characteristics and performance, (7) other
difficulties; and (8) Miscellaneous comments (See Appendix B for the coding
scheme). Next, one coder coded all 1440 comments using the eight categories listed
above. A second coder independently coded a random 25 % of the comments. Inter-coder reliabil-
ity, calculated as exact agreement, was 92.5 %. However, in our data, some com-
ments on difficulties covered more than one topic. Therefore, these difficulty
comments were coded multiple times according to the different topics mentioned
(see Table 6.1); as a result, the total number of topics is higher than the total number
of comments.
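Exact agreement here is simply the proportion of double-coded comments to which both coders assigned the same category; a sketch (function and variable names are ours):

```python
def exact_agreement(codes_a: list, codes_b: list) -> float:
    """Percent of double-coded comments assigned the same category by
    both coders (the study reports 92.5% on a 25% subsample)."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must code the same comments")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100 * matches / len(codes_a)

print(exact_agreement(["positive", "test design"],
                      ["positive", "test administration"]))  # -> 50.0
```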
Responses to the open-ended questions showed that examinee experience with
the OEPT over the 3-year period was overall positive, as seen in the last column of
Table 6.1.
Among positive comments, the most frequently mentioned topics were the test
administration process, interactions with test administrators, authenticity and rele-
vance of test content, and washback effects of the test, as illustrated by the following
comments:
The OEPT test was useful and the process of giving the test was reasonable and well
arranged. (November 2011)
I find the topics presented here more relevant and engaging than the ones in TOEFL.
(October 2010)
The test was administered very well. (August 2010)
The test was conducted in an excellent manner. (August 2009)
No difficulties encountered. System was nicely set up and program ran smoothly.
(August 2009)
The test was not only a good indicator of preparedness for a TA position but also rele-
vant in context including information that made me more aware of what it is to be a TA and
a graduate student in general. Kudos to the OEPT team! (January 2012)
These comments reflect advantages of local language tests over large-scale lan-
guage tests, many of which lie in the relative ease of anticipating and accommodat-
ing test-taker needs.
Another important advantage of local language tests has much to do with contex-
tualization and positive washback effects. Even though large professional testing
companies follow a set of rigorous procedures to monitor the quality of their
language tests, the fact that these tests serve selection or certification purposes
across a wide range of contexts somewhat limits the contextualization of test items
and interpretations of test scores to a general level. The development of local lan-
guage tests, on the other hand, is often dictated by particular purposes, such as
placement of language learners in specified courses according to proficiency levels
or certification of language proficiency for a specific purpose. These functionalities
permit item writers to situate language tasks in contexts that represent the range of
communicative tasks that may be found in the local context.
Written comments indicating problems or difficulties fell into five broad categories,
in descending order of frequency: test design, test administration, test preparation,
test-taker characteristics and performance, and other difficulties (see the first col-
umn of Table 6.2 below). These broad categories were further broken down into
individual topics, including—in descending order of frequency—test environment,
item preparation and response time, difficulty of individual test items, online prac-
tice test and associated test preparation materials, semi-direct test format, testing
equipment and supplies, and test-taker concerns about their language proficiency
and test performance.
For purposes of test quality management, we are concerned with the appropriate-
ness of test administration procedures, the availability of test preparation materials,
and minimization of construct-irrelevant variance. While collection of test-taker
feedback marked the starting point in this particular quality management process,
determining and implementing possible improvements represented the more chal-
lenging part of the process. As we discovered, some test-taker comments referred to
issues that test administrators and the OEPP had been making ongoing efforts to
address, while other comments required no action plans.
Among topics mentioned in the survey responses, we have been particularly atten-
tive to two areas of difficulty that call for improvement efforts: noise in the test
environment, and technical problems with the OEPT practice test website.
Noise and distraction from other test-takers is the most commonly identified diffi-
culty associated with test administration. OEPT testing labs are not equipped with
sound-proof booths that separate computer stations, or with sound-proof headsets.
As a result, survey comments often include complaints about noise, especially dur-
ing the big testing week in August prior to the beginning of the academic year, when
test sessions are generally full. Here are some sample comments on this topic:
I was able to listen to other students which distracted me sometimes. (August 2009)
The fellow test-takers talking beside each other was little disturbing. The headphone
needs to be noise proof. (August 2010)
The students sitting around me talked too loud making me hard to concentrate on my
test. (August 2011)
The noise problem has been an ongoing target of effort by test administrators
since the OEPT began. Having exhausted possible and affordable technical solu-
tions to the noise problem, we began to focus more narrowly on test administration.
In 2010, we began to tell examinees during pre-test orientation that if they were
speaking too loudly, we would request that they lower their volume a little by show-
ing them a sign, being careful not to disturb them while preparing for or responding
to an item. In 2011, we began to model for test-takers during the pre-test orientation
the desired level of speaking volume. Most examinees have been able to comply
with the volume guidelines, but there are occasionally one or two test-takers per test
administration who have difficulty keeping their voice volume at a moderate level;
the majority of those adjust, however, after being shown a sign that reads “Please
speak a little more softly.”
The decreasing numbers and percentages of complaints about noise during August test administrations in the three academic years covered in the data (Table 6.3) suggest that the noise-reducing efforts described above have been somewhat successful. However, despite this improvement, persistent noise issues require ongoing monitoring. In August 2012, we began to set up cardboard trifold presentation screens around each testing station. The screens do not reduce noise to any great degree, but they have improved the test-taking environment overall by reducing visual distractions to test-takers. The OEPP has also been involved recently in university-wide discussions about the need for a testing center on campus that could provide a more appropriate testing environment for the OEPT.
Table 6.3 Number of complaints about noise during August test administrations by year

Year    Frequency    Percent    Total responses
2009       21          9.29          226
2010       14          6.19          226
2011        9          3.77          239
Table 6.4 Number of complaints about accessibility to the OEPT practice test website by academic year

Academic year    Frequency    Percent    Total responses
2009–2010           30          6.40          469
2010–2011           11          2.24          492
2011–2012            5          1.04          479
Despite efforts to keep the OEPT practice test website up-to-date and accessible, changing technology standards, together with users across the globe attempting to access the site under a wide variety of hardware, software, and Internet access conditions, resulted in persistent reports of website problems from 2009 to 2012. Although Table 6.4 shows a decreasing trend in the number of complaints about accessibility to the practice test over the 3 years covered in our data, the OEPP considers even a small number of reported problems with the practice test website to be too many.
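The year-by-year tallies in Tables 6.3 and 6.4 follow a simple recipe: each open-ended PTQ response is assigned one or more topic codes, the codes are counted per administration, and each count is expressed as a percentage of all responses for that period. As a minimal illustration of this bookkeeping, the following Python sketch uses invented comment codes together with the response totals from Table 6.3; it is not the actual tooling used by the OEPP, which may well have managed the coding in other software.

from collections import Counter

# Invented (year, topic) pairs standing in for coded PTQ comments.
coded = [
    (2009, "noise"), (2009, "noise"), (2009, "website"),
    (2010, "noise"), (2010, "website"),
    (2011, "noise"),
]

# Total survey responses per August administration (from Table 6.3).
totals = {2009: 226, 2010: 226, 2011: 239}

# Count each (year, topic) pair and report it as a share of all responses.
counts = Counter(coded)
for (year, topic), freq in sorted(counts.items()):
    pct = 100 * freq / totals[year]
    print(f"{year}  {topic:<8}  {freq:>2}  {pct:.2f} %")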
In response to ongoing test-taker comments about difficulties accessing the
practice test, the OEPP contracted software developers in 2012 to replace the Java
version of the practice test with an internet-based version, dispensing with the previously required JAR files, and to improve its compatibility with the Mac OS operating system. To better track and evaluate technical problems, an online contact form was created for users to report problems. Monitoring of contact form submissions and of post-test questionnaire responses from the 2012–2013 academic year test administrations has indicated very few user problems with the new version of the OEPT practice test website since these improvements were made.
A small number of examinees (n = 74)² mentioned a desire or need for longer preparation and response time.
It would be better if the preparation time was longer. But it was acceptable. (August 2009)
Very helpful if the prepation [sic] time can be longer would be nice. (January 2012)
Nothing much except at one point I felt the time was little less for the response. (August
2010)
Yes, I wish I had more time in some questions. In fact, I wish there was no timing at all.
The timer either makes you talk faster or omit valious [sic] information. (August 2011)
² This number is smaller than the frequency of the topic of item preparation and response time shown in Table 6.2 because some test-takers requested shorter item preparation and response times in their comments.

Table 6.5 OEPT score distributions for examinees requesting more preparation and response time

            Preparation time         Response time
Score     Frequency    Percent    Frequency    Percent
35             5          9.62         2          9.09
40            15         28.85         3         13.64
45             6         11.54         0          0.00
50            19         36.54        13         59.09
55             6         11.54         2          9.09
60             1          1.91         2          9.09
Total         52        100.00        22        100.00

Table 6.5 captures a rather interesting phenomenon related to this topic: 19 complaints (almost 40 %) about preparation time and 13 complaints (almost 60 %) about response time were made by examinees who passed the test with a score of 50 (fourth row). This observation may reflect examinees' attempts to ensure their best test performance, especially when some might have realized that their proficiency level was near the cut-off for passing.
Examinee efforts to maximize the quality of their test performance coincide with the original "bias for best" rationale of the OEPT test developers for allotting 2 min of preparation time and 2 min of response time for test items. As mentioned above, to reduce test anxiety and elicit better test-taker performance, the test developers decided to extend the length of preparation and response times to 2 min each. About 80 % of OEPT examinees in the PTQ Part I data agreed that 2 min was sufficient time for item preparation and item response.
Table 6.6 OEPT score distributions for examinees commenting on the difficulty of test items

              Graph items            Listening items
Score     Frequency    Percent    Frequency    Percent
35             1          3.23         5          8.62
40             8         25.81        19         32.76
45            11         35.48        16         27.59
50             9         29.03        14         24.14
55             2          6.45         3          5.17
60             0          0.00         1          1.72
Total         31        100.00        58        100.00

Moreover, as the results in Table 6.6 suggest, complaints about graph and listening items came from examinees across most score levels (second and fourth rows), suggesting that test-takers, regardless of their oral English proficiency level, tend to consider graph and listening items more difficult than text items. Compared with text items, graph and listening items are more integrated in terms of the cognitive skills required to process the item information.
Statistical item analyses of the OEPT also identify graph items as the most difficult, followed by listening items (Oral English Proficiency Program 2013a, p. 29). Test-taker perceptions of the greater difficulty of graph and listening items are therefore well founded. However, the OEPP relies partly on these integrated items to differentiate
between higher and lower proficiency examinees. Because of their relatively higher
difficulty levels, graph items—as suggested in the literature on using graphic items
in language tests (Katz et al. 2004; Xi 2010)—are effective in eliciting higher-level
language performances, e.g., the use of comparatives and superlatives to describe
statistical values and differences and the use of other vocabulary needed to describe
trends, changes and relationships between variables on a graph.
Nevertheless, it should be acknowledged that test-taker familiarity with graphs
may have a facilitating effect on test performance (Xi 2005). Prospective OEPT
examinees are encouraged to familiarize themselves with OEPT graph item formats
by taking the OEPT practice tests. Familiarity with these item formats can also be
increased by listening to the sample test responses and examining the OEPT rating
scale on the practice test website. This information can help test-takers and test
score users understand the intent of the graph items, which is to elicit a general
description and interpretation of trends illustrated by the graph, rather than a recita-
tion of all the numbers shown (as is also explained in the test item instructions). Yet,
to address examinees' concerns from a customer service perspective, we continue to seek better and more varied ways to stress to examinees the importance of taking the practice test and of familiarizing themselves with the purpose and scoring of the test.
In addition to the test-taker difficulties mentioned above, there were also some types
of comments referring to conditions beyond the control or purview of test develop-
ers or test administrators. These comments include low-scoring examinees' concerns about their language proficiency, remarks relating test-takers' physical or mental condition to their test performance, and miscellaneous requests and suggestions. Some sample comments of these types are provided below:
No problem test went smoothly except I was feeling cold and had sore throat. (August 2009)
May be drining [sic] water fecility [sic] should be provided. Throat becomes dry some-
times while speaking continuously. (August 2009)
Nothing in particular but some snacks after the test would be nice as a reward for finish-
ing the test. (November 2010)
Sorry about evaluating my record maybe it is the worst English you ever heard. (August
2010)
Sometime [sic] I just forgot the words I want to use when I am speaking. (August 2009)
I was very nervous so there are some cases where I paused and repeated one thing a
number of times. (August 2011)
The addition in 2009 of Part III to the post-test questionnaire (the two open-
ended questions) was made to open up a channel of communication for test-takers
to express freely their reactions to the test. This channel is as much an effort to
involve test-takers—an important group of stakeholders—in the quality manage-
ment procedures as an assurance to the test-takers that their voices are heard by the
language testers. Reading comments such as those above contribute to test adminis-
trators’ awareness of examinees as individuals, each with unique feelings, disposi-
tions, and personal circumstances. This awareness is in keeping with the recognition
of the humanity of test-takers, mentioned in the International Language Testing
Association (ILTA) Code of Ethics (2000).
One consequence resulting from these types of comments was the creation of a
new test preparation brochure that was distributed on paper in 2013 to all graduate
departments and added electronically to the OEPP website (Oral English Proficiency
Program 2013b). In addition to providing general information about the purpose of
the test, the brochure directs readers to the practice test website, advises students to
bring a bottle of water and a jacket or sweater with them to the test, and alerts pro-
spective examinees about issues of noise in the test environment. The purpose of the
brochure is to facilitate test registration and administration, but as with most attempts to better inform constituents, the challenge is to get the brochure into the hands (literally or figuratively) of prospective examinees and to ensure that they read and understand it.
7 Conclusion
Although we routinely monitor test-taker comments on PTQ Section III, a systematic review of comments collected over a longer period of time has offered us substantial benefits not possible with smaller data sets from shorter time spans. The process was enlightening in that it offered a different perspective on the data, allowing us to observe
trends over time and subsequently to better identify and evaluate possible changes
or improvements to the test. The open-ended questions on the OEPT PTQ in par-
ticular have provided an opportunity to collect a wide variety of information that
contributes to the quality management process. We can therefore recommend prac-
tices of this sort to local testing organizations that use similar instruments for col-
lecting test-taker feedback but do not have a large number of examinees on a
monthly basis.
While not all feedback requires action, all feedback should be reviewed and con-
sidered in some way. Test developers must examine test-taker feedback in relation
to the purpose of the test and the rationale behind the test design. Although an important purpose of quality management is to minimize construct-irrelevant variance, test developers must maintain a realistic perspective on what they can do given the parameters of their testing contexts.
The choice to collect and examine test-taker feedback as a routine practice stems
not only from recommendations to involve test-takers in quality management pro-
cedures as a best practice for local language testing organizations, but also from our
recognition of test-takers as important stakeholders in the OEPT. In regard to qual-
ity control, the European Association for Language Testing and Assessment
(EALTA) Guidelines for Good Practice (2006) mention that there needs to be a
means for test-takers to make complaints. The ILTA Code of Ethics (2000) states
that the provision of ways for test-takers to inform language testers about their con-
cerns is not just their right but also a responsibility. However, in order for test-takers
to act on that responsibility, they must be given opportunities to do so. Having a channel built into the test administration process is a responsible way for language testers to facilitate the involvement of test-takers in quality management.
Appendices
References
Bailey, K. (1984). The foreign ITA problem. In K. Bailey, F. Pialorsi, & J. Zukowski-Faust (Eds.),
Foreign teaching assistants in U.S. universities (pp. 3–15). Washington, DC: National Association for Foreign Student Affairs.
Bradshaw, J. (1990). Test-takers’ reactions to a placement test. Language Testing, 7(1), 13–30.
Brown, A. (1993). The role of test-taker feedback in the test development process: Test-takers’
reactions to a tape-mediated test of proficiency in spoken Japanese. Language Testing, 10(3),
277–301.
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (2008). Building a validity argument for the
Test of English as a Foreign Language. New York: Routledge.
ETS (Educational Testing Service). (1985). A guide to SPEAK. Princeton: Educational Testing
Service.
European Association for Language Testing and Assessment. (2006). Guidelines for good practice
in language testing and assessment. Retrieved August 1, 2013, from http://www.ealta.eu.org/
documents/archive/guidelines/English.pdf
Fox, J. (2004). Biasing for the best in language testing and learning: An interview with Merrill
Swain. Language Assessment Quarterly, 1(4), 235–251.
Ginther, A., Dimova, S., & Yang, R. (2010). Conceptual and empirical relationships between tem-
poral measures of fluency and oral English proficiency with implications for automated scor-
ing. Language Testing, 27(3), 379–399.
International Language Testing Association. (2000). Code of ethics. Retrieved August 1, 2013, from http://www.iltaonline.com/index.php?option=com_content&view=article&id=57&Itemid=47
International Language Testing Association. (2010). Guidelines for practice. Retrieved August 1,
2013, from http://www.iltaonline.com/index.php?option=com_content&view=article&id=122
&Itemid=133
Iwashita, N., & Elder, C. (1997). Expert feedback? Assessing the role of test-taker reactions to a
proficiency test for teachers of Japanese. Melbourne Papers in Language Testing, 6(1), 53–67.
Katz, I. R., Xi, X., Kim, H., & Cheng, P. C.-H. (2004). Elicited speech from graph items on the Test of Spoken English (TOEFL Research Report No. 74). Princeton: Educational Testing Service.
McNamara, T.F. (1987). Assessing the language proficiency of health professionals.
Recommendations for the reform of the Occupational English Test (Report submitted to the
Council of Overseas Professional Qualifications.) Department of Russian and Language
Studies, University of Melbourne, Melbourne, Australia.
Oppenheim, N. (1997). How international teaching assistant programs can protect themselves
from lawsuits. (ERIC Document Reproduction Service No. ED408886).
Oral English Proficiency Program. (2013a). OEPT technical manual. West Lafayette: Purdue
University.
Oral English Proficiency Program. (2013b). Preparing for the oral English proficiency test: A
guide for students and their Departments. West Lafayette: Purdue University.
Qian, D. (2009). Comparing direct and semi-direct modes for speaking assessment: Affective
effects on test takers. Language Assessment Quarterly, 6, 113–125.
Saville, N. (2012). Quality management in test production and administration. In G. Fulcher &
F. Davidson (Eds.), The Routledge handbook of language testing (pp. 395–412). New York:
Routledge Taylor & Francis Group.
Shohamy, E. (1982). Affective considerations in language testing. Modern Language Journal,
66(1), 13–17.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing, 11(2),
99–123.
Shohamy, E., Donitsa-Schmidt, S., & Waizer, R. (1993). The effect of the elicitation mode on the
language samples obtained in oral tests. Paper presented at the 15th Language Testing Research
Colloquium, Cambridge, UK.
Stansfield, C. W., & Kenyon, D. M. (1992). Research on the comparability of the oral proficiency
interview and the simulated oral proficiency interview. System, 20(3), 347–364.
Stansfield, C. W., Kenyon, D. M., Paiva, R., Doyle, F., Ulsh, I., & Cowles, M. A. (1990). The
development and validation of the Portuguese speaking test. Hispania, 73, 641–651.
Stricker, L. J., & Attali, Y. (2010). Test takers' attitudes about the TOEFL iBT (ETS Research Report No. RR-10-02). Princeton: Educational Testing Service.
Stricker, L. J., Wilder, G. Z., & Rock, D. A. (2004). Attitudes about the computer-based test of
English as a foreign language. Computers in Human Behavior, 20, 37–54.
Xi, X. (2005). Do visual chunks and planning impact the overall quality of oral descriptions of
graphs? Language Testing, 22(4), 463–508.
Xi, X. (2010). Aspects of performance on line graph description tasks: Influenced by graph famil-
iarity and different task features. Language Testing, 27(1), 73–100.
Zeidner, M., & Bensoussan, M. (1988). College students’ attitudes towards written versus oral tests
of English as a foreign language. Language Testing, 5(1), 100–114.
Chapter 7
Extending Post-Entry Assessment
to the Doctoral Level: New Challenges
and Opportunities
John Read and Janet von Randow

J. Read (*)
School of Cultures, Languages and Linguistics, University of Auckland,
Auckland, New Zealand
e-mail: ja.read@auckland.ac.nz

J. von Randow
Diagnostic English Language Needs Assessment, University of Auckland,
Auckland, New Zealand
e-mail: janetvonrandow@gmail.com
1 Introduction
and, often in the natural sciences, they feel that they do not have the expertise to
tackle language issues (Aitchison et al. 2012; Murray 2012).
EAL doctoral candidates with low levels of language proficiency will struggle to
acquire the discipline-specific academic literacy which is essential for graduate
studies (Braine 2002), especially considering that at the end of 3–4 years they have
to produce a written thesis (Owens 2006). Assisting such students with appropriate
English language enrichment as early as possible in their candidature has been
shown to be essential if students are to rise to the language challenge (Manathunga
2014). Other researchers in the field of doctoral study support this view, suggesting
that such students should have an assessment on arrival and ongoing support as
required (Sawir et al. 2012), that language “should be taken seriously … and should
be addressed very early in the candidature” (Owens 2007, p. 148) and that language
enrichment be ongoing for students “who require support in refining their use of
English over time” (Manathunga 2014, p. 73). Addressing language needs early
should give the EAL doctoral candidates the confidence to engage more actively
with their peers, their department and most importantly their supervisors (Benzie
2010; Sawir et al. 2012), thus positively influencing their learning (Seloni 2012).
Early identification of language needs in a systematic way seems to call for a
post-admission assessment programme. Although numerous Australian universities
have introduced post-entry language assessments (PELA) at the undergraduate level
(see Dunworth 2009; Dunworth et al. 2013; Read 2015, Chap. 2), it appears that
only the University of Auckland in New Zealand has expanded the scope of its
assessment to cover doctoral candidates as well. The University’s Diagnostic
English Language Needs Assessment (DELNA) is described in some detail below.
One feature which distinguishes a PELA from the kind of placement test that
international students commonly take when they enter an English-medium univer-
sity is the scheduling of an individual advisory session after the students have com-
pleted the assessment, to discuss their results and recommend options for enhancing
their language development. These sessions have been shown to work well with
undergraduate students in Canada (Fox 2008) and Australia (Knoch 2012). While
some authors argue that the individual consultation is not cost-effective (Arkoudis
et al. 2012), it would seem that, when doctoral students may already be somewhat
demoralised by the language difficulties they have been experiencing, a one-on-one
meeting to follow up the assessment is highly desirable (Aitchison 2014; Laurs
2014). This session enables them to discuss their unique language needs and be
listened to. It also provides information about specific language resources and initi-
ates the networking that Carter and Laurs (2014) have described as important for
doctoral candidates as they embark on their studies.
Any evaluation of a language support programme of this kind needs to focus on
its impact in terms of positive outcomes for the students. Thus, Knoch and Elder
(2013; this volume, Chap. 1) have developed a framework for the validation of post-
entry assessments which gives prominence to the consequences of the assessment.
As these authors argue, “the consequences of using the PELA and the decisions
informed by the PELA should be beneficial to all stakeholders” (Knoch and Elder
2013, p. 60). The stakeholders here include university faculties and departments,
supervisors and, of course, the students. In the first instance, “the success of any
PELA initiative relies on uptake of the advice stemming from test results” (2013,
pp. 52–53). In his recent book, Read (2015, Chap. 10) has presented the available
evidence for the impact of DELNA on the academic language development of
undergraduate students at Auckland. The present study represents an initial investi-
gation of the consequences for student learning of extending the DELNA require-
ment to doctoral students. It is necessary, then, to give some background information
about how DELNA operates and what provisions are made by the University to
enhance the academic language skills of doctoral candidates.
3.1 Background
The response of the Board of Graduate Studies was to make the DELNA assess-
ment a requirement for all incoming doctoral candidates, including domestic students
with English as a first language. In 2011, therefore, completion of the DELNA pro-
cess became one of the goals set for doctoral candidates in their first year of registra-
tion, which is known as the provisional year (The University of Auckland 2013b).
Once the provisional year goals are achieved, the registration is confirmed. As a
result, since 2011 more than 400 doctoral candidates who were required to complete
the DELNA Diagnosis and take up a language enrichment programme have complied
with this provision.
year, to form the basis of an exit interview with the doctoral consultant at ELE. The
whole process is outlined in Fig. 7.1.
There are two main forms of language enrichment available at the University.
• The first consists of credit courses in academic English, English writing and
scientific communication. Taking one of these courses, as appropriate to their
needs, is a requirement for students with a language waiver and those whose
band scores in the Diagnosis are below 6. No additional tuition is charged but the
students must complete the course requirements with a minimum grade of B. For
more advanced candidates, there is a postgraduate credit course on Developing
Academic Literacy, which is of particular value for Arts students.
• The other main type of enrichment consists of activities which involve more
individual motivation and effort on the student’s part. The English Language
Enrichment (ELE) centre offers individual consultations with an adviser, online
and print resources, and weekly small-group conversation sessions – all designed
specifically for EAL students (www.library.auckland.ac.nz/ele). ELE is a spe-
cialist unit within Student Learning Services (SLS), which runs workshops and
provides resources for doctoral candidates on topics such as thesis writing, aca-
demic integrity, accessing the literature and presenting research (www.library.
auckland.ac.nz/study-skills/postgraduate). The adviser also informs the candi-
date about individual learning opportunities such as accessing academic English
podcasts.
There is an ongoing evaluation procedure for DELNA which is well established in
the case of undergraduate students and has now been extended to doctoral candidates
as well. Students who have taken the DELNA Diagnosis receive an email at the end
of their first semester of study, requesting that they complete an anonymous online
questionnaire about their experience of DELNA and their first semester of study.
Over the last 11 years, the response rate has been about 22 %. Those who submit the
questionnaire are then invited to provide their contact details if they are willing to
participate in a follow-up interview with a DELNA staff member to express their
views and experiences at greater length. Typically three to four interviews have been
conducted each semester. The interviews are quite separate from the process of deter-
mining whether the students have fulfilled the post-DELNA requirement to enhance
their academic English skills at the end of their provisional year.
4 The Study
Thus, drawing on both the literature on doctoral students and the experience to date
with DELNA as a programme for undergraduate students, the present study aimed
to obtain some preliminary evidence of the effectiveness of the DELNA require-
ment and follow-up action in enhancing the academic language development of
doctoral students at Auckland. More specifically, the following research questions
were addressed:
1. How did the participants react to being required to take DELNA as they began
their provisional year of registration?
2. How did they respond to the advisory session and advice?
3. How did they evaluate the language enrichment activities they were required or
recommended to undertake?
4. What other strategies did they adopt to help them cope with the language
demands of their provisional year?
5. What role did the doctoral supervisors play in encouraging language
development?
4.1 Participants
The participants in this study were 20 doctoral students who took the DELNA
Diagnosis between 2011 and 2013 and who agreed to participate in the evaluation
interview. As explained above, they were very much a self-selected group and thus
a non-probability sample. A demographic profile of the participants is presented in
Table 7.1. In an informal sense, the group was representative of the international
student cohort at the doctoral level in having Engineering as the dominant faculty
and Chinese as the most common first language. Seven of the students had language
waivers, which meant that they had not achieved the minimum IELTS or TOEFL
scores before admission, whereas the other 13 had not reached the required cut
score in the DELNA Screening. The four students with a band score of 5 from the
DELNA Diagnosis were at particular risk of language-related difficulties, but the 13
with a score of 6 were also deemed in considerable need of language enrichment.
The main source of data for this study was the set of interviews with the 20 doctoral
candidates. The interviews were transcribed and imported into NVivo, where they
were double-coded line by line to extract the salient themes for analysis. Other data
sources available to the researchers were: the candidates’ results profiles generated
from the DELNA Diagnosis, their responses to the online evaluation questionnaire,
and their exit reports on the language enrichment undertaken by the end of the pro-
visional year. However, the analysis in this chapter focuses primarily on the inter-
view data.
4.3 Results
The findings will be presented in terms of responses to the research questions, draw-
ing selectively on the data sources.
4.3.1 How Did the Participants React to Being Required to Take DELNA
as They Began Their Provisional Year of Registration?
Thirteen of the participants accepted the requirement. One commented that "IELTS got you in, but that was only the beginning", and another advised others to do it because "it could record your English level and … improve your English".
I think it’s okay because you know they might have some specific goals to see the English
level of the students. I think it’s okay. (#4 Engineering student, China)
Seven expressed certain concerns: two were nervous beforehand about taking
DELNA; one found that post-DELNA language study interrupted his research proj-
ect; two noted that their supervisors suddenly focused more on their language skills;
and one was worried she would be sent home if her English skills were found want-
ing. Another felt that she had passed IELTS and so her “English was over”, a per-
ception that is not uncommon.
Actually after I know I need to take DELNA I am a little worried – other English examina-
tion – I passed the IELTS already you know – my English is over. (#7 Engineering student,
China)
I think, oh my God, another exam, I hope I can pass! I felt like that, I thought I just worry
about if I could not pass. (#8, Engineering student, China)
Before I take the DELNA I didn’t want to do it, to be honest. I am very afraid of any
kinds of English test and I am afraid if the results are not good I may come home. (#1
Education student, China)
Unsurprisingly, the students did not always grasp the nature of DELNA before they took the assessment, even though they are given access to information about the programme in advance. Several of them felt that the Screening did not give as good an indication of their ability as the Diagnosis, which consisted of tasks that were relatively familiar to those who had taken IELTS.
Maybe IELTS is some structure I am familiar with – I familiar with that type test but DELNA
was quite different. I don’t prepare enough to do right. (#14, Engineering student, Vietnam)
4.3.2 How Did They Respond to the Advisory Session and Advice?
All the participants welcomed the DELNA advisory session, where they felt “lis-
tened to” and “valued”, which Chanock (2007) points out is essential. It gave them
the opportunity to discuss the issues as they saw them; their strengths were acknowl-
edged and the areas that needed improvement identified. All agreed that the advice
was detailed and clear, the interview was valuable, and that they were able to keep
in touch with the language adviser later.
She gave me lots of useful suggestions about my language, focus on my language. And I
followed her suggestions and I received a lot of benefit. (#17, Business student, China).
The Language Adviser provided direction for English learning, suggests variable infor-
mation to improve listening and writing skills. She indicated what I should study to
improve – this is a good way to understand. Experts can help identify. After the discussion
I know what I should improve. She gave me some documents and provided useful websites.
(#18, Engineering student, Thailand)
Credit Courses
Ten of the participants, especially those with language waivers, were required to
pass a suitable credit course in their first year of study. Since most of the students in
our sample were in Engineering or Science, seven of them took SCIGEN 101:
Communicating for a Knowledge Society. Offered by the Science Faculty, the
course develops the ability to communicate specialist knowledge to a variety of
audiences, through written research summaries, oral presentations and academic
posters. Even though it is primarily a first-year undergraduate course, the partici-
pants found it very helpful and relevant to their future academic and professional
needs.
In SCIGEN they teach us how …how to explain, how to transfer knowledge, I mean if I am
writing something I am sensing – it is really helpful. …. And then a good thing teach me
how to make a poster – the important thing how to do the writing. (#2, Engineering student,
China)
And also give a talk in front of about five students and two teachers. Yes, because before
I give the talk I had few opportunity to speak so many people, in front of so many people.
(#19 Science student, China)
Two students expressed appreciation for being able to take a taught course, espe-
cially one presented “in English style”, alongside native-speaking students. Thus,
there were opportunities for social contacts with others in the class, although one
student from China commented that he found “few chance[s] to speak with others”.
One source of stress for two participants was the need to obtain at least a B grade
for the course, particularly since 50 % of the grade was based on a final written
exam.
Apart from SCIGEN 101, two Business students took courses in Academic
English, focusing on listening and reading, and writing respectively; and a more
proficient Education student enrolled in a Research Methods course, which he
found “challenging”, but “helpful” and “important”. A few students also took a
course, as recommended by their supervisor, to enhance their knowledge of a sub-
ject relevant to their doctoral research.
Workshops
The participants were advised about the workshops offered by Student Learning
Services (SLS). There is an initial induction day attended by all new doctoral can-
didates which introduces them to the whole range of workshops available to them.
It is also possible for them to book an individual session with a learning adviser,
particularly for assistance with writing, at both SLS and ELE. In the interviews,
they made universally favourable comments about the value of the workshops and
individual consultations they participated in, as well as the ELE learning resources.
yeah mostly the workshop in ELE and the library is very good. I try to attend most of them
but I missed some of them. (#5 Business student, Iran)
Before I submit this I attended a Student Learning Workshop – you carry your research
proposal draft and they provide a native English speaker – the person who helped me edit
the proposal told me my English writing was very good – because he helped others and
couldn’t understand as well. He proofread – … he worked with me and some information
he did not understand we rewrote. (#18 Engineering student, Thailand)
Writing was the major focus of all these activities, given the ultimate goal of producing a doctoral thesis. However, four participants were more immediately engaged in writing conference papers with their supervisors, who were doing major editing of their work, and other supervisors were insisting on early writing outputs. This lent urgency to the students' efforts to seek assistance in improving their writing.
Conversation Groups
present and assess their own improvement week by week as they tried to match up
to the level of those with greater fluency.
Before I come to NZ I had so few opportunities to speak English but I come here and joined
the Let’s Talk. I have more expression practise English with other people and make good
communication. It is so good, really. (#19 Science student, China)
I have been to Let’s Talk – it was great – … it is like a bunch of PhD students getting
there and talk about their PhD like and how they deal with supervisors – … – people from
all over the world. You can hear from people talking with different accents, different cultural
backgrounds. It is very interesting. I am really interesting to learning Japanese – and actu-
ally have met some Japanese friends there. (#17, Business student, China)
There were numerous favourable comments about the impact of these sessions
on the students’ confidence to communicate in English. On the other hand, one par-
ticipant (#11) found that a few fluent students tended to dominate the discussion,
and another came to the conclusion that the value of the sessions diminished over
time:
The problem with Let’s Talk was after a while everything was the same. I mean after a few
months, and I think everyone going to Let’s Talk I meet a lot of people also and they just
give up. I mean after that you just don’t need it. (#12, Science student, Iran)
This of course can be seen as evidence of the success of the programme if it gave
the students the confidence to move on and develop their oral skills in other ways.
In the area of listening skills, the DELNA advisers recommended that students
access an academic podcast site developed at another New Zealand university
(http://martinmcmorrow.podomatic.com). This resource draws on the well-known
TED talks (www.ted.com) and has a particular focus on academic vocabulary devel-
opment as well as comprehension activities. Fourteen of the participants listened
regularly and all agreed that using this podcast simultaneously increased their
knowledge of New Zealand and significantly improved their listening and reading
skills.
As reflected in their low DELNA scores, reading was expected to be really dif-
ficult for four of the participants, and indeed they found the reading demands of
their doctoral programmes somewhat overwhelming. Two participants followed the
DELNA advice to attend workshops on reading strategies and access an online pro-
gramme on effective reading. Others reported no particular difficulty with reading
in their disciplinary area – apart from the volume of reading they were expected to
complete. Three of the students found it particularly useful to work through the
recommended graded readers and other materials available at ELE.
I love this part [ELE] – I can borrow lots of books – not academic books – there are lots of
magazines and story books – not – because in our university most of the books are very ,
you know, just for the research, academic books, but I can go there to … borrow some
books, very nice books and they also have the great website online and we can practise in
English on the website- I try to do it in the lunchtime – … so I just open their website, have
a look, just a few minutes, but it’s good. (#7, Engineering student, China)
4.3.4 What Other Strategies Did the Participants Adopt to Help Them
Cope with the Language Demands of Their Provisional Year?
It was a common observation by the participants that they had less opportunity to
communicate with others in English on campus than they might have expected.
Typically, they were assigned to an office or a lab with a small number of other
doctoral candidates and tended to focus on their own projects for most of the day:
Because I am a PhD student and we are always isolated in our research and only the time
that I have during the day is just – speaking with my friend during lunch time – it is very
short. Although my partner and I speak together in English… (#14 Engineering student,
Vietnam)
The situation was exacerbated when fellow PhD students were from one’s own
country:
We spend most of our time in our office – we are quiet. About 10 sentences a day – the case
is getting worse with Iranian colleagues in the office. So far no gatherings of doctoral stu-
dents to present work but it is going to happen. (#9 Engineering student, Iran)
Frankly speaking, in my research group there are most – 80 % Chinese. We discuss the
research details in Chinese, which is very convenient for us, so we use the most common
way. (#15 Engineering student, China)
This made it all the more important for the students to follow the recommenda-
tions of the DELNA language advisers and take up the formal language enrichment
activities available on campus, which they all did to varying degrees. However, five
of the participants, demonstrating the “agency” that Fotovatian (2012) describes in
her study of four doctoral candidates, created their own opportunities to improve
their language skills. One moved out of her original accommodation close to the
university because there were too many students who spoke her L1 and went into a
New Zealand family homestay. She felt that the distance from the University was
offset by the welcoming atmosphere and the constant need to speak English. She
kept the BBC as a favourite on her laptop in her office and, as recommended by an ELE adviser, took 10-min breaks every now and then to listen to the news in English. In
the interview her communication skills were impressive and she reported that her
friends had commented on her improvement. Another engineering student, who had
seemed completely disoriented on arrival in Auckland, was confident and enthusias-
tic in the interview. He too had moved into an English-speaking homestay and had
improved his spoken English by cooking the evening meal with his host.
Three others also consciously worked on their language skills by finding ways to
mix with English speakers both in and outside the university. They audited lectures
and attended all seminars and social occasions organised by their Faculty and by the
Postgraduate Students Association. One tried to read more of the newspaper each
day and they all listened to the radio, and watched films on television.
Since 2013, supervisors have been able to comment on their students’ language
development in the provisional year report. All four of the supervisors of our partici-
pants who completed their provisional year in 2013 added their reflections, with one
noting that his candidate’s speaking was fine and that her writing “keeps improv-
ing”. Another described his student’s achievements so far: a paper for which she
was the main author, several presentations in faculty seminars, and involvement in
the organising committee of a conference. He noted that she would be able to write
her thesis “comfortably”. Two comments came from supervisors whose students
had had language waivers: one commented that his student “writes better than many
students and will be able to write up his work”; while the other was “happy with the
improvement made”. These latter students were two of the five described above who
had consciously taken every opportunity within and outside the university to
improve their language skills during that time.
5 Discussion
The fact that newly-enrolled candidates were required to take part in DELNA was
accepted by 13 of the participants, while seven voiced certain concerns. The partici-
pant who thought IELTS meant English was “over” (cf. O’Loughlin 2008) soon
discovered for herself that this was not the case. The initial anxiety about the assess-
ment experienced by at least two of the participants could have been reduced if they
had been better informed about DELNA during the induction period, and this high-
lights the ongoing need to ensure that all candidates engage in advance with the
information resources available (see Read 2008, on the presentation of DELNA; see
also Yan et al. this volume). The DELNA team must also be in regular contact with
the Graduate School, supervisors, Student Learning Services and other stakehold-
ers, to ensure that the role of the assessment is clearly understood and it continues
to contribute to effective academic language development among those students
who need it.
Most of the participants commented that English as learnt in their home country
and English in their new environment were different. For some this was so difficult
on arrival that they were depressed, as Cotterall (2011) and Fotovatian (2012) have
also observed. Therefore, they were grateful to simply follow advice from DELNA
to take advantage of the academic enrichment activities, which all the participants
generally described as helpful and relevant. The one-on-one advisory session helped
dispel their concerns by offering sympathetic and targeted support in what
Manathunga (2007a) describes as “this often daunting process of becoming a
knowledgeable scholar” (p. 219). The main constraint for two of the participants,
whose DELNA results suggested that they needed considerable language develop-
ment, was that their research commitments from early on in their candidacy limited
the amount of time they could spend on language enrichment. This constraint needs
to be accommodated as far as possible in the DELNA advisory discussion.
As our participants reported, by taking up the language enrichment activities
recommended by the adviser and sharing the responsibility for improving their aca-
demic English ability, they improved their self-confidence and this facilitated more
interaction with their local peers and further language development. They thereby
created a virtuous circle which “tends to increase prospects of academic success”
(Sawir et al. 2012, p. 439). Those students who took courses as part of the provi-
sional year requirements generally benefitted from them. Although one of the stu-
dents with a language waiver found the credit course and the required pass a
particular burden, the others who were enrolled in courses found the input from a
lecturer and interaction with other students particularly useful. This interaction,
which was missing in the university experience of the six doctoral researchers in
Cotterall’s (2011) study, was one key to confidence building, thus leading to
increased use of English and an acknowledgement by the students that their skills
had improved (Seloni 2012).
The Let’s Talk groups at ELE were invaluable for improving confidence in listen-
ing and enabling conversation in English in an atmosphere that encouraged unem-
barrassed participation (Manathunga 2007a). Even though the discussions were not
particularly academic in nature, this led to greater confidence in academic contexts,
which was the reported outcome of a similar programme at another New Zealand
university as well (Gao and Commons 2014).
As has been shown in the research on doctoral learning, the supervisory relation-
ship tends to be stressful when the candidate is not highly proficient in English
(Knight 1999; Strauss 2012). For the majority of our participants the supervisory
relationship was being managed reasonably well with what appeared to be under-
standing of the language challenge faced by the students. The supervisors who did
not seem to have time for their students and the two who spoke the students’ L1
exclusively were a concern, whereas those who gave frequent writing practice and
then showed that they were taking it seriously by giving feedback on the writing
were doing their students a service, as Owens (2006) and Manathunga (2014) have
also observed. The same applied to supervisors who engaged in joint writing tasks
with their students at an early stage – something Paltridge and Starfield (2007) see
as critical.
One kind of evidence for the positive engagement of supervisors has come in the
form of responses to the reports sent by the DELNA Language Adviser after the
advisory session for each new candidate. One supervisor, who has English as an
additional language himself, recently bemoaned the fact that something similar had
not been in place when he was a doctoral candidate. Another supervisor expressed
full support for the DELNA and ELE programmes and was happy to work with his
student to improve all aspects of her English comprehension and writing because
she was:
one of the best postgraduate students I have ever supervised and I think her participation
in the English language enrichment programme will be of great benefit to her both in her
PhD and beyond. (PhD supervisor, Science)
This emphasises the point that assisting international doctoral students to rise to
the language challenge is a key step in allowing them to fulfil their academic
potential.
As previously acknowledged, this was a small study of 20 self-selected partici-
pants, who may have been proactive students already positively inclined towards
DELNA. In addition, the research covers only their provisional year of study and
particularly their reactions to the first months at an English-medium university.
Their experiences and opinions, however, mirror those reported in the studies cited
above. It would be desirable to track the students through to the completion of their
degree and conduct a further set of interviews in their final year of study. More
comprehensive input from the supervisors would also add significantly to the
evidence on the consequences of the DELNA-related academic language enrich-
ment that the students undertook.
6 Conclusion
The provisions for academic language enrichment at Auckland are not particularly
innovative. Similar courses, workshops and advisory services are provided by
research universities around the world for their doctoral candidates. What is more
distinctive at Auckland is, first, the administration of a post-admission English
assessment to all students, regardless of their language background, to identify
those at risk of language-related difficulties in their studies. Secondly, the assess-
ment does not function simply as a placement procedure to assign students to com-
pulsory language courses, but leads to an individual advisory session in which the
student’s academic language needs are carefully reviewed in the light of their
DELNA performance and options for language enrichment are discussed. The
resulting report, which is received by the supervisor and the Graduate School as
well as by the student, includes a blend of required and recommended actions, and
the student is held to account as part of the subsequent review of their provisional
year of doctoral registration. The process is intended to communicate to the student
that the University takes their academic language needs seriously and is willing to
devote significant resources to addressing those needs.
This study has presented evidence that participants appreciated this initiative by
the University and responded with effective actions to enhance their academic lan-
guage skills. At least among these students who were willing to be interviewed, the
indications are that the programme boosted the students’ confidence in communi-
cating in English in ways which would enhance their doctoral studies. However,
academic language development needs to be an ongoing process and, in particular,
the participants in this study had yet to face the language demands of producing a
thesis, communicating the findings of their doctoral research, and negotiating their
entry into the international community of scholars in their field. Further research is
desirable to investigate the longer-term impact of the University’s initiatives in lan-
guage assessment and enrichment, to establish whether indeed they contribute sub-
stantially to the ability of international doctoral students to rise to the language
challenge.
I think English study is most challenging. It is quite hard. I don't think much in the degree is as challenging – the language, yeah! (#15, Engineering student, China)
References
Aitchison, C. (2014). Same but different: A 20-year evolution of generic provision. In S. Carter &
D. Laurs (Eds.), Developing generic support for doctoral students: Practice and pedagogy.
Abingdon: Routledge.
Aitchison, C., Catterall, J., Ross, P., & Burgin, S. (2012). ‘Tough love and tears’: Learning doctoral
writing in the sciences. Higher Education Research & Development, 31(4), 435–448.
Arkoudis, S., Baik, C., & Richardson, S. (2012). English language standards in higher education:
From entry to exit. Camberwell: ACER Press.
Basturkmen, H., East, M., & Bitchener, J. (2014). Supervisors’ on-script feedback comments on
drafts of dissertations: Socialising students into the academic discourse community. Teaching
in Higher Education, 19(4), 432–445.
Benzie, H. J. (2010). Graduating as a ‘native speaker’: International students and English language
proficiency in higher education. Higher Education Research & Development, 29(4), 447–459.
Braine, G. (2002). Academic literacy and the nonnative speaker graduate student. Journal of
English for Academic Purposes, 1(1), 59–68.
Carter, S., & Laurs, D. (Eds.). (2014). Developing generic support for doctoral students: Practice
and pedagogy. Abingdon: Routledge.
Chanock, K. (2007). Valuing individual consultations as input into other modes of teaching.
Journal of Academic Language & Learning, 1(1), A1–A9.
Cotterall, S. (2011). Stories within stories: A narrative study of six international PhD researchers’
experiences of doctoral learning in Australia. Unpublished doctoral thesis, Macquarie
University, Australia.
Dunworth, K. (2009). An investigation into post-entry English language assessment in Australian
universities. Journal of Academic Language and Learning, 3(1), 1–13.
Dunworth, K., Drury, H., Kralik, C., Moore, T., & Mulligan, D. (2013). Degrees of proficiency:
Building a strategic approach to university students’ English language assessment and devel-
opment. Sydney: Australian Government Office for Learning and Teaching. Retrieved February 24, 2016, from www.olt.gov.au/project-degrees-proficiency-building-strategic-approach-university-studentsapos-english-language-ass
Edwards, B. (2006). Map, food, equipment and compass – preparing for the doctoral journey. In
C. Denholm & T. Evans (Eds.), Doctorates downunder: Keys to successful doctoral study in
Australia and New Zealand (pp. 6–14). Camberwell: ACER Press.
Elder, C., & von Randow, J. (2008). Exploring the utility of a web-based English language screen-
ing tool. Language Assessment Quarterly, 5(3), 173–194.
Fotovatian, S. (2012). Three constructs of institutional identity among international doctoral stu-
dents in Australia. Teaching in Higher Education, 17(5), 577–588.
Fox, R. (2008). Delivering one-to-one advising: Skills and competencies. In V. N. Gordon, W. R.
Habley, & T. J. Grites (Eds.), Academic advising: A comprehensive handbook (2nd ed.,
pp. 342–355). San Francisco: Jossey-Bass.
Gao, X., & Commons, K. (2014). Developing international students’ intercultural competencies. In
S. Carter & D. Laurs (Eds.), Developing generic support for doctoral students: Practice and
pedagogy (pp. 77–80). Abingdon: Routledge.
Grant, B. (2003). Mapping the pleasures and risks of supervision. Discourse Studies in the Cultural
Politics of Education, 24(2), 175–190.
Knight, N. (1999). Responsibilities and limits in the supervision of NESB research students in the
Social Sciences and Humanities. In Y. Ryan & O. Zuber-Skerrit (Eds.), Supervising postgradu-
ates from Non-English speaking backgrounds (pp. 15–24). Buckingham: The Society for
Research into Higher Education & Open University Press.
Knoch, U. (2012). At the intersection of language assessment and academic advising:
Communicating results of a large-scale diagnostic academic English writing assessment to
students and other stakeholders. Papers in Language Testing and Assessment, 1, 31–49.
Knoch, U., & Elder, C. (2013). A framework for validating post-entry language assessments. Papers in Language Testing and Assessment, 2(2), 48–66.
Laurs, D. (2014). One-to-one generic support. In S. Carter & D. Laurs (Eds.), Developing generic
support for doctoral students: Practice and pedagogy (pp. 29–32). Abingdon: Routledge.
Lutz-Spalinger, G. (2010). Final proposal, English language proficiency: Doctoral candidates.
Unpublished report, The University of Auckland.
Manathunga, C. (2007a). Supervision as mentoring: The role of power and boundary crossing.
Studies in Continuing Education, 29(2), 207–221.
Abstract Research has shown that vocabulary recognition skill is a readily mea-
sured and relatively robust predictor of second language performance in university
settings in English-speaking countries. This study builds on that research by devel-
oping an understanding of the relationship between word recognition skill and
Academic English performance in English-medium instruction (EMI) university
programs in English-as-a-lingua-franca (ELF) contexts. The use of a Timed Yes/No
(TYN) test of vocabulary recognition skill was assessed as a screening tool in two
EMI university foundation programs in an Arab Gulf State: in a metropolitan state
university (N = 93) and a regional private institution (N = 71). Pearson correlation
coefficients between the TYN test and performance on university placement and
final test scores ranged between 0.3 and 0.6 across the two groups and by gender
within those groups. This study indicates that TYN test measures have predictive value in university ELF settings for screening purposes. The trade-offs among the validity, reliability, usability, and cost-effectiveness of the TYN test in academic ELF settings are discussed, with consideration of test-takers' digital literacy levels.
T. Roche (*)
SCU College, Southern Cross University, Lismore, NSW, Australia
e-mail: Thomas.Roche@scu.edu.au
M. Harrington
School of Languages and Cultures, University of Queensland, Brisbane, Australia
e-mail: m.harrington@uq.edu.au
Y. Sinha
Department of English Language Teaching, Sohar University, Al Sohar, Oman
e-mail: Yogesh@soharuni.edu.om
C. Denman
Humanities Research Center, Sultan Qaboos University, Muscat, Oman
e-mail: denman@squ.edu.om
1 Introduction
1
The United Nations Educational, Scientific and Cultural Organization (UNESCO) predicts that
the number of internationally mobile students will increase from 3.4 million in 2009 to approxi-
mately 7 million by 2020, with a minimum of 50 % of these students (some 3.5 million students)
undertaking English language education (UNESCO 2012 in Chaney 2013). English is currently
the most frequently used language of instruction in universities around the globe (Ammon 2006;
Jenkins 2007; Tilak 2011).
2 Research also points to the importance of factors such as social connections (Evans and Morrison
2011), cultural adjustment (Fiocco 1992 cited in Lee and Greene 2007) and students’ understand-
ing of and familiarity with the style of teaching (Lee and Greene 2007) as significantly contribut-
ing to students’ academic success in English-medium university programs.
The comprehensive nature of such tests makes them extremely time- and resource-inten-
sive, and the feasibility of this approach for many ELF settings is questionable.
The relative speed with which vocabulary knowledge can be measured recom-
mends it as a tool for placement decisions (Bernhardt et al. 2004). Although the
focus on vocabulary alone may seem to be too narrow to adequately assess levels of
language proficiency, there is evidence that this core knowledge can provide a sensi-
tive means for discriminating between levels of learner proficiency sufficient to
make reliable placement decisions (Harrington and Carey 2009; Lam 2010; Meara
and Jones 1988; Wesche et al. 1996); in addition, a substantial body of research
shows that vocabulary knowledge is correlated with academic English proficiency
across a range of settings (see Alderson and Banerjee 2001 for a review of studies
in the 1980s and 1990s). The emergence of the lexical approach to language learning and
teaching (McCarthy 2003; Nation 1983, 1990) reflects the broader uptake of the
implications of such vocabulary knowledge research in second language
classrooms.
Vocabulary knowledge is a complex trait (Wesche and Paribakht 1996). The size
of an L2 English user’s vocabulary, i.e. the number of word families they know, is
referred to as breadth of vocabulary knowledge in the literature (Wesche and
Paribakht 1996; Read 2004). Initial research in the field (Laufer 1992; Laufer and
Nation 1999) suggested that knowledge of approximately 3000 word families pro-
vides L2 English users with 95 % coverage of academic texts, a sufficient amount
for unassisted comprehension. More recent research has indicated that knowledge
of as many as 8000–9000 word families, accounting for 98 % coverage of English
academic texts, is required for unassisted comprehension of academic English
material (Schmitt et al. 2011, 2015). Adequate vocabulary breadth is a necessary but
not a sufficient precondition for comprehension.3
For L2 users to know a written word they must access orthographic, phonologi-
cal, morphological and semantic knowledge. Studies of reading comprehension in
both L1 (Cortese and Balota 2013; Perfetti 2007) and L2 (Schmitt et al. 2011) indi-
cate that readers must first be able to recognize a word before they can successfully
integrate its meaning into a coherent message. Word recognition performance has
been shown in empirical studies to correlate with L2 EAP sub-skills: reading
(Schmitt et al. 2011; Qian 2002); writing (Harrington and Roche 2014; Roche and
Harrington 2013); and listening (Al-Hazemi 2001; Stæhr 2009). In addition, Loewen
and Ellis (2004) found a positive relationship between L2 vocabulary knowledge
and academic performance, as measured by grade point average (GPA), in English
medium-university programs, with a breadth measure of vocabulary knowledge
accounting for 14.8 % of the variance in GPA.
In order to engage in skilled reading and L2 communication (i.e. to process spoken language in real time), learners need not only the appropriate breadth of vocabulary,
but also the capacity to access that knowledge quickly (Segalowitz and Segalowitz
1993; Shiotsu 2001). Successful text comprehension requires lower level linguistic
3 Other factors have also been shown to affect reading comprehension, such as background knowledge (e.g., Pulido 2004 in Webb and Paribakht 2015).
processes (e.g. word recognition) to be efficient, that is, fast and with a high degree
of automaticity, providing information to the higher level processes (Just and
Carpenter 1992). For the above reasons a number of L2 testing studies (e.g. Roche
and Harrington 2013) have taken response time as an index of L2 language profi-
ciency. In this study we focus on word recognition skill, as captured in a test of
written receptive word recognition (not productive or aural knowledge) measuring
the breadth of vocabulary knowledge and the ability to access that knowledge with-
out contextual cues in a timely fashion.
2 The Study
2.2 Participants
Participants in this study (N = 164) were Arabic L1 users enrolled in the
English language component of general foundation programs at two institutions.
They were 17–25 years old. The programs serve as pathways to English-medium
undergraduate study at University A, a metropolitan national university (N = 93),
and University B, a more recently established regional private university (N = 71) in
Oman.
The primary data collection (TYN test, Screening Test) took place at the start of
a 15-week semester at the beginning of the academic year. Students’ consent to take
part in the voluntary study was formally obtained in accordance with the universi-
ties’ ethical guidelines.
2.3 Materials
Recognition vocabulary skill was measured using an on-line TYN screening test.
Two versions of the test were used, each consisting of 62 test items. Items are words
drawn from the 1,001st–4,000th most frequently occurring word families in the British National Corpus (BNC) in Test A, and from the 1 K, 2 K, 3 K, and 5 K frequency bands in Test B. Test A therefore consists of more commonly occurring
words, while Test B includes lower frequency 5 K items thought to be facilitative in
authentic reading (Nation 2006), and an essential part of academic study (Roche
and Harrington 2013; Webb and Paribakht 2015). The composition of the TYN tests
used here differed from earlier TYN testing research by the authors (Roche and
Harrington 2013; Harrington and Roche 2014), which incorporated a set of items
drawn from even lower frequency bands, i.e., the 10 K band. Previous research
(Harrington 2006) showed that less proficient learners found lower frequency words
difficult, with recognition performance near zero for some individuals. In order to
make the test more accessible to the target group, the lowest frequency band (i.e. 10 K) was excluded. During the test, items are presented individually on a computer
screen. The learners indicate via the keyboard whether they know each test item, pressing the right arrow key for ‘yes’ or the left arrow key for ‘no’. In order to control for guessing, the test items consist of not only 48 real word prompts (12 each from four BNC frequency levels) but also 14 pseudoword prompts presented individually
(Meara and Buxton 1987). The latter are phonologically permissible strings in
English (e.g. stoffels). The TYN test can be administered and scored quickly, and
provides an immediately generated, objectively scored measure of proficiency that
can be used for placement and screening purposes.
Item accuracy and response time data were collected. Accuracy (a reflection of
size) was measured by the number of word items correctly identified, minus the
number of pseudowords the participants claimed to know (Harrington 2006). Since
this corrected-for-guessing score can result in negative values, 55 points were added
to the total accuracy score (referred to as the vocabulary score in this paper).
Participants were given a short 12-item practice test before doing the actual test.
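As an illustration, the scoring rule just described can be sketched in a few lines of Python; the function and argument names are our own illustrative choices rather than part of the published test specification.

def vocabulary_score(words_correct: int, pseudowords_claimed: int) -> int:
    """Corrected-for-guessing accuracy: hits minus false alarms, plus a
    constant of 55 so that the total cannot be negative."""
    return words_correct - pseudowords_claimed + 55

# A candidate who correctly identifies 40 real words but claims to know
# 6 pseudowords would receive 40 - 6 + 55 = 89 points.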
Speed of response (referred to as vocabulary speed here) for individual items was
measured from the time the item appeared on the screen until the student initiated
the key press. Each item remained on the screen for 3 s (3000 ms), after which it was
timed out if there was no response. A failure to respond was treated as an incorrect
response. Students were instructed to work as quickly and as accurately as possible.
Instructions were provided in a video/audio presentation recorded by a local native
speaker using Modern Standard Arabic. The test was administered using
LanguageMap, a web-based testing tool developed at the University of Queensland,
Australia.
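The timing and time-out rules can be sketched in the same way. In the hypothetical trial records below, rt_ms is the latency from item onset to key press and is None when no key was pressed; the chapter does not state whether timed-out items enter the speed measure, so this sketch simply excludes them from it.

TIMEOUT_MS = 3000  # each item remains on screen for 3 s before timing out

def is_correct(trial: dict) -> bool:
    """A response is correct only if it arrives before the time-out and
    matches the item's status; a failure to respond counts as incorrect."""
    if trial["rt_ms"] is None or trial["rt_ms"] >= TIMEOUT_MS:
        return False
    return trial["said_yes"] == trial["is_real_word"]

def vocabulary_speed(trials: list) -> float:
    """Mean latency in ms over items answered before the time-out."""
    latencies = [t["rt_ms"] for t in trials
                 if t["rt_ms"] is not None and t["rt_ms"] < TIMEOUT_MS]
    return sum(latencies) / len(latencies) if latencies else float("nan")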
Placement Tests. Both of the in-house placement tests included sections on read-
ing, writing and listening, using content that was locally relevant and familiar to the
students. A variety of question types was used, including multiple-choice, short-answer, and true/false items.
Final Test. Overall performance was measured by results on end of semester
exams. These determine whether students need to take another semester of EAP, or
if they are ready to enter the undergraduate university programs. At both institutions
the Final Test mirrors the format of the Placement Tests. All test items are based on
expected outcomes for the foundation program as specified by the national accredi-
tation agency (Oman Academic Accreditation Authority 2008).
2.4 Procedure
Placement tests were administered in the orientation week prior to the start of the
15-week semester by university staff. The TYN tests were given in a computer lab.
Students were informed that the test results would be added to their record along
with their Placement Test results, but that the TYN was not part of the Placement
Test. The test was introduced by the first author in English and then video instruc-
tions were given in Modern Standard Arabic. Three local research assistants were
also present to explain the testing format and guide the students through a set of
practice items.
The Final Tests were administered under exam conditions at both institutions at
the end of the semester. Results for the Placement and Final Tests were provided by
the Heads of the Foundation programs at the respective universities with the partici-
pants’ formal consent. Only overall scores were provided.
3 Results
A total of 171 students were in the original sample, with seven removed for not hav-
ing a complete set of scores for all the measures. Students at both institutions com-
pleted the same TYN vocabulary test but sat different Placement and Final Tests
depending on which institution they attended. As such, the results are presented by
institution. The TYN vocabulary tests were administered in two parts (Tests A and B) to lessen the task demands placed on the test-takers. Each part had the same format but
contained different items (see above). Reliability for the vocabulary test was mea-
sured using Cronbach’s alpha, calculated separately for the two parts. Analyses
were done separately for word (N = 48) and pseudoword (N = 14) items for the vocabulary score and vocabulary speed measures (see Harrington 2006). Each response type is analysed separately.
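Cronbach’s alpha itself is straightforward to compute from a participants-by-items matrix of dichotomous item scores (1 = correct, 0 = incorrect); the following is a minimal sketch of the statistic, not the authors’ own analysis script.

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a participants x items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Applied separately, as above, to the word items and the pseudoword items
# within each part of the test.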
Table 8.1 Descriptive statistics for TYN vocabulary, vocabulary speed, placement test and false alarm scores by university and gender

Vocabulary test (total points = 155)
  Uni A  Female (n = 63)  Range 72–128    M = 98.97   SD = 11.56  95 % CI [96.05, 101.88]
  Uni A  Male (n = 30)    Range 76–125    M = 101.16  SD = 18.12  95 % CI [95.75, 104.47]
  Uni A  Total (N = 93)                   M = 99.34   SD = 11.55  95 % CI [96.96, 101.71]
  Uni B  Female (n = 37)  Range 27–122    M = 88.75   SD = 21.52  95 % CI [81.57, 95.92]
  Uni B  Male (n = 34)    Range 14–119    M = 69.68   SD = 25.99  95 % CI [60.61, 78.75]
  Uni B  Total (N = 71)                   M = 79.62   SD = 25.47  95 % CI [73.59, 85.64]

Response time (ms)
  Uni A  Female           Range 967–1910  M = 1335    SD = 213    95 % CI [1281, 1388]
  Uni A  Male             Range 739–1648  M = 1270    SD = 214    95 % CI [1179, 1360]
  Uni A  Total                            M = 1314    SD = 223    95 % CI [1268, 1360]
  Uni B  Female           Range 953–2336  M = 1390    SD = 294    95 % CI [1292, 1488]
  Uni B  Male             Range 836–2185  M = 1575    SD = 391    95 % CI [1439, 1712]
  Uni B  Total                            M = 1479    SD = 354    95 % CI [1395, 1563]

False alarms
  Uni A  Female           Range 4–68      M = 28.69   SD = 14.24  95 % CI [25.10, 32.27]
  Uni A  Male             Range 0–64      M = 23.69   SD = 13.96  95 % CI [18.48, 28.90]
  Uni A  Total                            M = 27.07   SD = 14.27  95 % CI [24.13, 30.01]
  Uni B  Female           Range 0–82      M = 32.24   SD = 20.75  95 % CI [25.32, 39.16]
  Uni B  Male             Range 11–86     M = 49.69   SD = 19.47  95 % CI [42.98, 56.48]
  Uni B  Total                            M = 40.59   SD = 21.84  95 % CI [35.42, 45.77]

Placement score % (university-specific)
  Uni A  Female           Range 41–65     M = 54.93   SD = 5.64   95 % CI [53.26, 56.60]
  Uni A  Male             Range 39–65     M = 54.10   SD = 7.07   95 % CI [51.46, 56.74]
  Uni A  Total                            M = 54.66   SD = 6.75   95 % CI [53.27, 56.05]
  Uni B  Female           Range 3–90      M = 46.65   SD = 27.54  95 % CI [37.47, 55.83]
  Uni B  Male             Range 0–78      M = 24.50   SD = 26.53  95 % CI [15.24, 33.76]
  Uni B  Total                            M = 36.04   SD = 29.08  95 % CI [29.16, 42.93]

End of semester test % (university-specific)
  Uni A  Female           Range 60–86     M = 75.80   SD = 5.35   95 % CI [74.44, 78.00]
  Uni A  Male             Range 62–84     M = 75.15   SD = 5.64   95 % CI [73.18, 77.12]
  Uni A  Total                            M = 75.59   SD = 5.43   95 % CI [74.48, 77.71]
  Uni B  Female           Range 35–86     M = 73.37   SD = 10.99  95 % CI [70.50, 76.24]
  Uni B  Male             Range 21–85     M = 61.18   SD = 12.86  95 % CI [58.18, 64.17]
  Uni B  Total                            M = 67.53   SD = 13.33  95 % CI [64.37, 70.68]

Uni A University A, a national public university; Uni B University B, a private regional university
The false alarm rates for both University A males and females (24 % and 29 %,
respectively), as well as the University B females (32 %) were comparable to the
mean false alarm rates of Arab university students (enrolled in first and fourth year
of Bachelor degrees) reported in previous research (Roche and Harrington 2013).
They were also comparable to, though higher than, the 25 % for beginners and 10 %
for advanced learners evident in the results of pre-university English-language path-
way students in Australia (Harrington and Carey 2009). As with other TYN studies
with Arabic L1 users in academic credit courses in ELF university contexts (Roche
and Harrington 2013; Harrington and Roche 2014), there were students with
extremely high false alarm rates that might be treated as outliers. The mean false
alarm rate of nearly 50 % by the University B males here goes well beyond this. The
unusually high false alarm rate for this group, and the implications it has for the use
of the TYN test for similar populations, will be taken up in the Discussion.
The extremely high false alarm level for the University B males and some indi-
viduals in the other groups means the implications for Placement and Final Test
performance must be interpreted with caution. As a result, two sets of tests evaluat-
ing the vocabulary measures as predictors of test performance were performed. The
tests were first run on the entire sample and then on a trimmed sample in which
individuals with false alarm rates that exceeded 40 % were removed. The latter itself
is a very liberal cut-off level, since other studies have removed any participants who
incorrectly identified pseudowords at a rate as low as 10 % (Schmitt et al. 2011).
The trimming process reduced the University A sample size by 16 %, representing
19 % fewer females and 10 % fewer males. The University B sample was reduced by almost half, reflecting a reduction of 17 % in the total for the females and a very large
68 % reduction for the males. It is clear that the males in University B handled the
pseudowords in a very different manner than either the University B females or both
the genders in University A, despite the use of standardised Arabic-language instruc-
tions at both sites.
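The trimming procedure amounts to a simple filter over the participant data. A minimal sketch, assuming a pandas DataFrame with one row per participant and a hypothetical column holding each participant’s pseudoword ‘yes’ count:

import pandas as pd

N_PSEUDOWORDS = 14  # pseudoword items per test part, as in the analyses above

def trim_high_false_alarms(df: pd.DataFrame, cutoff: float = 0.40) -> pd.DataFrame:
    """Remove participants whose false-alarm rate exceeds the cut-off
    (40 % here; some studies use cut-offs as low as 10 %)."""
    # The divisor is an assumption; the chapter does not specify whether
    # rates were computed per part or over the whole test.
    fa_rate = df["pseudowords_claimed"] / N_PSEUDOWORDS
    return df[fa_rate <= cutoff]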
The University A students outperformed the University B students on the vocab-
ulary size and vocabulary speed measures. Assumptions of equality of variance
were not met, so an Independent-Samples Mann-Whitney U test was used to test the
mean differences for the settings. Both vocabulary score and mean vocabulary
speed scores were significantly different: U = 4840, p = 0.001, and U = 6863, p = 0.001, respectively. All significance values are asymptotic; that is, the sample size was assumed to be statistically sufficient to approximate the characteristics of the larger population validly. For gender there was an overall difference on test score, U = 4675, p = 0.022, but not on vocabulary speed, U = 7952, p = 0.315. The gender effect for
test score reflects the effect of low performance by the University B males. A com-
parison of the University A males and females showed no significant difference
between the two, while the University B female vocabulary scores (U = 948,
p = 0.001) and vocabulary speed (U = 1155) (p = 0.042) were both significantly
higher than those for their male peers.
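The group comparisons reported above can be reproduced in outline with SciPy’s implementation of the Mann-Whitney U test; the data frame and its column names below are illustrative, not the study data.

import pandas as pd
from scipy.stats import mannwhitneyu

# Illustrative records: one row per participant.
scores = pd.DataFrame({
    "university": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "vocab_score": [99, 104, 97, 101, 82, 70, 88, 65],
})

uni_a = scores.loc[scores["university"] == "A", "vocab_score"]
uni_b = scores.loc[scores["university"] == "B", "vocab_score"]

# Two-tailed test with the asymptotic (large-sample) p-value, as reported.
u_stat, p_value = mannwhitneyu(uni_a, uni_b, alternative="two-sided",
                               method="asymptotic")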
Table 8.2 Bivariate correlations between vocabulary and test measures for University A (n = 93) and University B (n = 71), complete data set

                    University   Vocab speed   Placement test   Final test
Vocabulary size     A            −0.12          0.34**           0.10
                    B            −0.59**        0.65**           0.50**
Vocabulary speed    A            –             −0.11             0.04
                    B            –             −0.48**          −0.33**
Placement test      A            –              –                0.08
                    B            –              –                0.53**

**p < 0.001; all tests two-tailed
Table 8.3 Bivariate correlations between vocabulary and test measures for University A (n = 78) and University B (n = 38), data set trimmed for high false alarm values (false alarm rates >40 % removed)

                    University   Vocab speed   Placement test   Final test
Vocabulary size     A            −0.13          0.37**           0.10
                    B            −0.37*         0.34*            0.24
Vocabulary speed    A            –             −0.34*            0.03
                    B            –             −0.31            −0.19
Placement test      A            –              –                0.12
                    B            –              –                0.27

*p < 0.05, **p < 0.001; all tests two-tailed
The sensitivity of the vocabulary measures as predictors of placement and final test
performance was evaluated first by examining the bivariate correlations among the
measures and then by performing a hierarchical regression to assess how the measures
interacted. The latter permits the respective contributions of size and speed to be
assessed individually and in combination. Table 8.2 presents the results for the
entire data set. It shows the vocabulary measures were better predictors for
University B. These results indicate that more accurate word recognition skill had a
stronger correlation with Placement Test scores and Final Test scores for University
B than for University A.
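For the bivariate step, the Pearson coefficients in Tables 8.2 and 8.3 correspond to calls of the following form; the paired observations shown are invented for illustration.

from scipy.stats import pearsonr

# Hypothetical paired observations for one institution.
vocab_scores = [99, 104, 97, 101, 82, 70, 88, 65]
placement_scores = [55, 60, 52, 58, 46, 30, 50, 25]

r, p = pearsonr(vocab_scores, placement_scores)  # coefficient and two-tailed p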
When the data are trimmed for high false-alarm rates, the difference between the
two universities largely disappears (see Table 8.3). The resulting correlation of
approximately 0.35 for both universities shows a moderate relationship between
vocabulary recognition skill test scores and Placement Test performance. In the
trimmed data there is no relationship between TYN vocabulary recognition test
scores and Final Test scores.
Regression models were run to assess how much overall variance the measures
together accounted for, and the relative contribution of each measure to this amount.
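A hierarchical regression of this kind can be sketched with statsmodels: the vocabulary score is entered at step one, vocabulary speed is added at step two, and the contribution of speed is read off the change in R² between the two models. The data and column names are illustrative assumptions.

import pandas as pd
import statsmodels.api as sm

# Hypothetical participant records.
data = pd.DataFrame({
    "placement": [55, 60, 48, 52, 65, 40, 58, 45],
    "vocab_score": [99, 105, 88, 92, 110, 80, 101, 85],
    "vocab_speed": [1310, 1250, 1480, 1400, 1200, 1550, 1290, 1500],
})

# Step 1: vocabulary score alone.
step1 = sm.OLS(data["placement"],
               sm.add_constant(data[["vocab_score"]])).fit()
# Step 2: vocabulary speed added.
step2 = sm.OLS(data["placement"],
               sm.add_constant(data[["vocab_score", "vocab_speed"]])).fit()

r2_change = step2.rsquared - step1.rsquared  # unique variance added by speed
print(f"Step 1 R^2 = {step1.rsquared:.3f}, "
      f"Step 2 R^2 = {step2.rsquared:.3f}, change = {r2_change:.3f}")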
Table 8.4 Hierarchical regression analyses of placement test scores with vocab score and speed as predictors for complete and trimmed data sets

Predictor       R²      Adjusted R²   R² change    B         SEB      β
Complete data set
University A
  Vocab score   0.117   0.107         0.117****    0.200     0.058    0.334**
  Vocab speed   0.121   0.102         0.005       −6.051     8.813   −0.068
University B
  Vocab score   0.419   0.411         0.419****    0.739     0.130    0.567**
  Vocab speed   0.432   0.415         0.012        0.642    32.14    −0.136
Trimmed data set
University A
  Vocab score   0.134   0.123         0.134****    0.212     0.064    0.359**
  Vocab speed   0.139   0.116         0.005       −5.889     9.276   −0.068
University B
  Vocab score   0.111   0.087         0.111***     0.390     0.261    0.251
  Vocab speed   0.152   0.104         0.031       −6.227    50.048   −0.219

t significant at *p < 0.05, **p < 0.001. F significant at ***p < 0.05 and ****p < 0.001
Table 8.4 reports on the contribution of vocabulary speed to predicting the Placement
Test score criterion after the vocabulary scores were entered. Separate models were
calculated for University A and University B.
As expected from the bivariate correlations, the regression analysis shows that
the word measures and reaction time predicted significant variance in Placement
Test scores. The University A model accounted for nearly 12 % of the Placement
Test score variance, while the University B model accounted for over 40 %.
Table 8.5 shows the ability of the vocabulary score (and speed) to predict Final
Test performance. The vocabulary measures served as predictors of the Final Test
for University B, where there was a moderate correlation (0.5) between the vocabu-
lary scores and overall English proficiency scores at the end of the semester. There was no significant correlation between TYN Test scores and Final Test scores at University A, but it is of note that the Final Test scores for this group had a truncated range, which may reflect the higher academic entry requirements and concomitant English proficiency levels of students at the national metropolitan university, as discussed below in 4.1. The results not only indicate that the TYN word recognition test is a fair predictor of performance, but also reinforce the importance of vocabulary knowledge. The results of the study at University B confirm the significant role that the vocabulary knowledge of L2 English users plays in achieving success in higher education contexts, given what is already known about the
importance of other factors such as social connections (Evans and Morrison 2011),
cultural adjustment (Fiocco 1992 cited in Lee and Greene 2007) and students’
understanding of and familiarity with the style of teaching (Lee and Greene 2007).
The regression analyses for Final Test scores also reflect the results of the bivariate correlations: the word measures and vocabulary speed predicted significant variance in the Final Test results.
Table 8.5 Hierarchical regression analyses of final test scores with vocab score and speed as predictors for complete and trimmed data sets

Predictor       R²      Adjusted R²   R² change    B          SEB      β
Complete data set
University A
  Vocab score   0.117   0.107         –            0.200      0.058    0.334**
  Vocab speed   0.121   0.102         0.005       −6.05       8.81    −0.068
University B
  Vocab score   0.251   0.240         –            0.248      0.068    0.567**
  Vocab speed   0.252   0.230         0.001       −5.881     16.891   −0.136
Trimmed data set
University A
  Vocab score   0.010   −0.004        0.010        0.052      0.058    0.102
  Vocab speed   0.011   −0.015        0.005        3.048      8.504    0.041
University B
  Vocab score   0.059   0.033         0.111        0.096      0.058    0.203
  Vocab speed   0.068   0.015         –          −10.754     18.374   −0.103

t significant at *p < 0.05, **p < 0.001
The model based on the Vocabulary test measure accounted for nearly 22 % of the Final Test variance (total adjusted R² = 0.219), while the vocabulary speed model accounted for only 9 % (0.094).
4 Discussion
The findings are consistent with previous work that indicates vocabulary recogni-
tion skill is a stable predictor of academic English language proficiency, whether in
academic performance in English-medium undergraduate university programs in
ELF settings (Harrington and Roche 2014; Roche and Harrington 2013), or as a tool
for placement decisions in English-speaking countries (Harrington and Carey
2009). The current study extends this research, showing the predictive power of a
vocabulary recognition skill test as a screening tool for English-language university
foundation programs in an ELF context. It also identifies several limitations to the
approach in this context.
The correlations observed here, ranging from 0.3 to 0.6, are at the upper end of previous findings correlating TYN test performance with academic writing skill, where correlations ranged from 0.3 to 0.5 (Harrington and Roche 2014; Roche
and Harrington 2013). Higher correlations were observed at University B when the
students with very high false alarm rates were included. Although the inclusion of
the high false alarm data improves predictive validity in this instance, it also raises
more fundamental questions about this group’s performance on the computerised
test and therefore the reliability of the format for this group. This is discussed below
in 4.2. The vocabulary speed means for the present research were comparable to, if
not slightly faster than, those obtained in previous studies (Harrington and Roche
2014; Roche and Harrington 2013). Mean vocabulary speed was, however, found to
be a less sensitive measure of performance on Placement Tests, in contrast to previ-
ous studies where vocabulary speed was found to account for a unique amount of
variance in the criterion variables (Harrington and Carey 2009; Harrington and
Roche 2014; Roche and Harrington 2013).
Other TYN studies in university contexts (Harrington and Carey 2009; Roche
and Harrington 2013) included low frequency band items from the BNC (i.e. 10 K
band), whereas the current test versions did not. Given research highlighting the
instrumental role the 8000–9000 most commonly occurring words in the BNC play
in reading academic English texts (Schmitt et al. 2011, 2015), it is possible that
including such items would improve the test’s sensitivity in distinguishing English
proficiency levels between students. This remains a question for further placement
research with low proficiency English L2 students.
Results show a marked difference in performance between University A and
University B on all dimensions. This may reflect differences in academic standing
between the groups that are due to different admission standards at the two institu-
tions. University A is a prestigious state institution, which only accepts students
who score in the top 5 % of the annual graduating high school cohort, representing
a score of approximately 95 % on their graduating certificate. In contrast, University
B is a private institution, with lower entry requirements, typically attracting students
who score approximately 70 % and higher on their graduating high school certifi-
cate. The differences between the two groups indicate their relative levels of English
proficiency. The difference between the groups may also be due to the digital divide between metropolitan and regional areas in Oman, with better
connected students from the capital at University A more digitally experienced and
therefore performing better on the online TYN test. This issue is explored further in
4.2.
Participants in the study had much higher mean false-alarm rates and lower group
means (as a measure of vocabulary size) than pre-tertiary students in English-
speaking countries. A number of authors have suggested that Arabic L1 users are
likely to have greater difficulties with discrete-item English vocabulary tests than
users from other language backgrounds due to differences between the Arabic and
English orthographies and associated cognitive processes required to read those
systems (Abu Rabia and Siegel 1995; Fender 2003; Milton 2009; Saigh and Schmitt
2012). However, as research has shown that word recognition tests do serve as
effective indicators of Arabic L1 users’ EAP proficiency (Al-Hazemi 2001;
Harrington and Roche 2014; Roche and Harrington 2013), this is unlikely to be the
reason for these higher rates. As indicated in 2.3 the test format, in particular the
difference between words and pseudowords, was explained in instructions given in
Modern Standard Arabic. It is possible that some students did not fully understand
these instructions.
The comparatively high false-alarm rates at the regional institution may also
reflect relatively low levels of digital literacy among participants outside the capital.
The TYN test is an on-line instrument that requires the user to first supply biodata,
navigate through a series of computer screens of instructions and examples, and
then supply test responses. The latter involves using the left and right arrow keys to
indicate whether the presented item is a word or a pseudoword. It was noted during
the administration of the test at University B that male students (the group with the
highest false-alarm rates) required additional support from research assistants to
turn on their computers and log in, as well as to start their internet browsers and enter biodata into the test interface. As recently as 2009, when the participants in
this study were studying at high school, only 26.8 % of the nation’s population had
internet access, in comparison to 83.7 % in Korea, 78 % in Japan, and 69 % in
Singapore (World Bank 2013); and, by the time the participants in the present study
reached their senior year of high school in 2011, there were only 1.8 fixed (wired)
broadband subscriptions per 100 inhabitants in Oman, compared to a broadband
penetration of 36.9/100 in Korea, 27.4/100 in Japan, and 25.5/100 in Singapore
(Broadband Commission 2012). Test-takers in previous TYN test studies
(Harrington and Carey 2009) had predominantly come from these three highly con-
nected countries and were likely to have brought with them higher levels of com-
puter skills and digital literacy. It is also of note that broadband penetration is uneven
across Oman, with regional areas much more poorly served by the telecommunica-
tions network than the capital, where this study’s national university is located, and
this may in part account for the false-alarm score difference between the two univer-
sities. Previous studies in Arab Gulf States were with students who had already
completed foundation studies and undertaken between 1 and 4 years of undergradu-
ate study; they were therefore more experienced with computers and online testing
(Harrington and Roche 2014; Roche and Harrington 2013) and did not exhibit such
high false-alarm scores. The current study assumed test-takers were familiar with
computing and the Internet, which may not have been the case. With the spread of
English language teaching and testing into regional areas of developing nations, it
is necessary to be context sensitive in the application of online tests. Improved
results may be obtained with an accompanying test-taking video demonstration and
increased TYN item practice prior to the actual testing.
The poor performance by some of the participants may also be due to the fact
that the test was low-stakes for them (Read and Chapelle 2001). The TYN test was
not officially part of the existing placement test suite at either institution and was
administered after the decisive Placement Tests had been taken; high false-alarm rates may be due to some participants giving acquiescent responses (i.e. random
clicks that brought the test to an un-taxing end rather than reflecting their knowl-
edge or lack of knowledge of the items presented, see Dodorico-McDonald 2008;
Nation 2007). It is therefore important that test-takers understand not only how to
take the test but also see a reason for taking it. For example, we would expect fewer
false alarms if the test acted as a primary gateway to further study, rather than one
of many tests administered as part of a Placement Test suite, or if it was integrated
into courses and contributed towards students’ marks.
4.3 Gender
An unexpected finding to emerge in the study was the difference in performance due
to gender. The role of gender in second language learning remains very much an
open question, with support both for and against gender-based differences. Studies
investigating performance on discrete item vocabulary tests such as Lex30 (Espinosa
2010) and the Vocabulary Levels Test (Agustín Llach and Terrazas Gallego 2012;
Mehrpour et al. 2011) found no significant difference in test performance between
genders. Results reported by Jiménez Catalán (2010) showed no significant differ-
ence on the receptive tests between the male and female participants, though it was
noted that girls outperformed boys on productive tests. The stark gender-based dif-
ferences observed in the present study are not readily explained. Possible reasons
include the low-stakes nature of the test (Nation 2007; Read and Chapelle 2001),
other personality or affective variables or, as noted, comparatively lower levels of
digital literacy among at least some of the male students.
5 Conclusion
Results here show that the TYN test is a fair measure of English proficiency in
tertiary ELF settings, though with some qualification. It may serve an initial screen-
ing function, identifying which EAP levels (beginner, pre-intermediate and interme-
diate) students are best placed in, prior to more comprehensive in-class diagnostic
testing, but further research is needed to identify the TYN scores which best identify
placement levels. The test’s predictive ability could be improved, potentially by adding lower-frequency test items, or by adding another component, such as grammar or reading items, to replace the less effective higher-frequency version (Test A) of the two vocabulary tests trialled in this study.
The use of the TYN test with lower proficiency learners in a context like the one
studied here requires careful consideration. Experience with implementing the test
points to the importance of comprehensible instructions and the test-taker’s sense of
investment in the results. The findings also underscore the context-sensitive nature
of testing and highlight the need to consider test-takers’ digital literacy skills when
using computerised tools like the TYN test. As English continues to spread as the
language of tertiary instruction in developing nations, issues of general digital lit-
eracy and internet penetration become educational issues with implications for test-
ing and assessment.
Finally, the findings here contribute to a growing body of literature emphasising
the fundamental importance of vocabulary knowledge for students studying in ELF
settings. In particular, they show that the weaker a student’s vocabulary knowledge, the more poorly they are likely to perform on measures of academic English proficiency and, subsequently, the greater the difficulties they are likely to face in achieving their goal
of completing English preparation courses on their pathway to English-medium
higher education study.
Acknowledgements This research was supported by a grant from the Omani Research Council
[Grant number ORG SU HER 12 004]. The authors would like to thank the coordinators and Heads
of the Foundation Programs at both institutions for their support.
References
Abu Rabia, S., & Siegel, L. S. (1995). Different orthographies, different context effects: The
effects of Arabic sentence context in skilled and poor readers. Reading Psychology, 16, 1–19.
Agustín Llach, M. P., & Terrazas Gallego, M. (2012). Vocabulary knowledge development and
gender differences in a second language. Estudios de Lingüística Inglesa Aplicada, 12, 45–75.
Alderson, J. C., & Banerjee, J. (2001). Language testing and assessment (part 1). State-of-the-art
review. Language Testing, 18, 213–236.
Al-Hazemi, H. (2001). Listening to the Y/N vocabulary test and its impact on the recognition of
words as real or non-real. A case study of Arab learners of English. IRAL, 38(2), 81–94.
Ammon, U. (2006). The language of tertiary education. In E. K. Brown (Ed.), Encyclopaedia of
languages & linguistics (pp. 556–559). Amsterdam: Elsevier.
Bayliss, A., & Ingram, D. E. (2006). IELTS as a predictor of academic language performance.
Paper presented at the Australian International Education Conference. Retrieved from http://
www.aiec.idp.com.
Bernhardt, E., Rivera, R. J., & Kamil, M. L. (2004). The practicality and efficiency of web-based
placement testing for college-level language programs. Foreign Language Annals, 37(3),
356–366.
Broadband Commission. (2012). The state of broadband 2012: Achieving digital inclusion for all. Retrieved from www.broadbandcommission.org.
Chaney, M. (2013). Australia: Educating globally. Retrieved from https://aei.gov.au/IEAC2/
theCouncilsReport.
Chui, A. S. Y. (2006). A study of the English vocabulary knowledge of university students in Hong
Kong. Asian Journal of English Language Teaching, 16, 1–23.
Cortese, M. J., & Balota, D. A. (2013). Visual word recognition in skilled adult readers. In M. J.
Spivey, K. McRae, & M. F. Joanisse (Eds.), The Cambridge handbook of psycholinguistics
(pp. 159–185). New York: Cambridge University Press.
Cotton, F., & Conrow, F. (1998). An investigation of the predictive validity of IELTS amongst a
sample of international students studying at the University of Tasmania (IELTS Research
Reports, Vol. 1, pp. 72–115). Canberra: IELTS Australia.
Dodorico-McDonald, J. (2008). Measuring personality constructs: The advantages and disadvan-
tages of self-reports, informant reports and behavioural assessments. Enquire, 1(1). Retrieved
from www.nottingham.ac.uk/shared/McDonald.pdf.
Elder, C., & von Randow, J. (2008). Exploring the utility of a web-based English language screen-
ing tool. Language Assessment Quarterly, 5(3), 173–194.
Elder, C., Bright, C., & Bennett, S. (2007). The role of language proficiency in academic success:
Perspectives from a New Zealand university. Melbourne Papers in Language Testing, 12(1),
24–28.
Espinosa, S. M. (2010). Boys’ and girls’ L2 word associations. In R. M. Jiménez Catalán (Ed.),
Gender perspectives on vocabulary in foreign and second languages (pp. 139–163).
Chippenham: Macmillan.
Evans, S., & Morrison, B. (2011). Meeting the challenges of English-medium higher education:
The first-year experience in Hong Kong. English for Specific Purposes, 30(3), 198–208.
Fender, M. (2003). English word recognition and word integration skills of native Arabic- and
Japanese-speaking learners of English as a second language. Applied Psycholinguistics, 24(3),
289–316.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical proficiency. EUROSLA
Yearbook, 6, 147–168.
Harrington, M., & Carey, M. (2009). The on-line yes/no test as a placement tool. System, 37,
614–626.
Harrington, M., & Roche, T. (2014). Post–enrolment language assessment for identifying at–risk
students in English-as-a-Lingua-Franca university settings. Journal of English for Academic
Purposes, 15, 37–47.
Humphreys, P., Haugh, M., Fenton-Smith, B., Lobo, A., Michael, R., & Walkinshaw, I. (2012).
Tracking international students’ English proficiency over the first semester of undergraduate
study. IELTS Research Reports Online Series, 1. Retrieved from http://www.ielts.org.
Jenkins, J. (2007). English as a Lingua Franca: Attitudes and identity. Oxford: Oxford University
Press.
Jenkins, J. (2012). English as a Lingua Franca from the classroom to the classroom. ELT Journal,
66(4), 486–494. doi:10.1093/elt/ccs040.
Jiménez Catalán, R. M. (2010). Gender tendencies in EFL across vocabulary tests. In R. M.
Jiménez Catalán (Ed.), Gender perspectives on vocabulary in foreign and second languages
(pp. 117–138). Chippenham: Palgrave Macmillan.
Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension: Individual differences
in working memory. Psychological Review, 99(1), 122–149.
Kokhan, K. (2012). Investigating the possibility of using TOEFL scores for university ESL
decision-making: Placement trends and effect of time lag. Language Testing, 29(2), 291–308.
Lam, Y. (2010). Yes/no tests for foreign language placement at the post-secondary level. Canadian
Journal of Applied Linguistics, 13(2), 54–72.
Laufer, B. (1992). Reading in a foreign language: How does L2 lexical knowledge interact with the
reader’s general academic ability? Journal of Research in Reading, 15(2), 95–103.
Laufer, B., & Nation, I. S. P. (1999). A vocabulary size test of controlled productive ability.
Language Testing, 16(1), 33–51.
Lee, Y., & Greene, J. (2007). The predictive validity of an ESL placement test: A mixed methods
approach. Journal of Mixed Methods Research, 1, 366–389.
Loewen, S., & Ellis, R. (2004). The relationship between English vocabulary skill and the aca-
demic success of second language university students. New Zealand Studies in Applied
Linguistics, 10, 1–29.
McCarthy, M. (2003). Vocabulary. Oxford: Oxford University Press.
Meara, P., & Buxton, B. (1987). An alternative multiple-choice vocabulary test. Language Testing,
4, 142–145.
Meara, P., & Jones, G. (1988). Vocabulary size as a placement indicator. In P. Grunwell (Ed.),
Applied linguistics in society (pp. 80–87). London: Centre for Information on Language
Teaching and Research.
Mehrpour, S., Razmjoo, S. A., & Kian, P. (2011). The relationship between depth and breadth of
vocabulary knowledge and reading comprehension among Iranian EFL learners. Journal of
English Language Teaching and Learning, 53, 97–127.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol: Multilingual
Matters.
Nation, I. S. P. (1983). Learning vocabulary. New Zealand Language Teacher, 9(1), 10–11.
Nation, I. S. P. (1990). Teaching and learning vocabulary. Boston: Heinle & Heinle.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? The Canadian
Modern Language Review/La Revue Canadienne des Langues Vivantes, 63(1), 59–81.
Nation, I. S. P. (2007). Fundamental issues in modelling and assessing vocabulary knowledge. In
H. Daller, J. Milton, & J. Treffers-Daller (Eds.), Modelling and assessing vocabulary knowl-
edge (pp. 35–43). Cambridge: Cambridge University Press.
Oman Academic Accreditation Authority. (2008). The Oman academic standards for general foun-
dation programs. Retrieved from http://www.oac.gov.om/.
Perfetti, C. A. (2007). Reading ability: Lexical quality to comprehension. Scientific Studies of
Reading, 11(4), 357–383.
Qian, D. D. (2002). Investigating the relationship between vocabulary knowledge and academic
reading performance: An assessment perspective. Language Learning, 52(3), 513–536.
Read, J. (2004). Plumbing the depths: How should the construct of vocabulary knowledge be
defined? In P. Bogaards & B. Laufer (Eds.), Vocabulary in a second language: Selection,
acquisition and testing (pp. 209–227). Amsterdam: John Benjamins.
Read, J., & Chapelle, C. A. (2001). A framework for second language vocabulary assessment.
Language Testing, 18(1), 1–32.
Roche, T., & Harrington, M. (2013). Recognition vocabulary knowledge as a predictor of aca-
demic performance in an English-as-a-foreign language setting. Language Testing in Asia,
3(12), 133–144.
Saigh, K., & Schmitt, N. (2012). Difficulties with vocabulary word form: The case of Arabic ESL
learners. System, 40, 24–36. doi:10.1016/j.system.2012.01.005.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in a text and reading
comprehension. The Modern Language Journal, 95, 26–43.
doi:10.1111/j.1540-4781.2011.01146.x.
Schmitt, N., Cobb, T., Horst, M., & Schmitt, D. (2015). How much vocabulary is needed to use
English? Replication of Van Zeeland & Schmitt (2012), Nation (2006) and Cobb (2007).
Language Teaching, 1–15.
Segalowitz, N., & Segalowitz, S. (1993). Skilled performance, practice, and the differentiation of
speed-up from automatization effects: Evidence from second language word recognition. Applied
Psycholinguistics, 14, 369–385.
9 Construct Refinement in Tests of Academic Literacy
Abstract For several reasons, the construct underlying post-entry tests of academic
literacy in South Africa such as the Test of Academic Literacy Levels (TALL) and
its postgraduate counterpart, the Test of Academic Literacy for Postgraduate
Students (TALPS), deserves further scrutiny. First, the construct has not been fur-
ther investigated in close to a decade of use. Second, acknowledging the typicality
of academic discourse as a starting point for critically engaging with constructs of
academic literacy may suggest design changes for such tests. This contribution sur-
veys and critiques various attempts at identifying the typical features of academic
discourse and concludes that the uniqueness of academic discourse lies in the pri-
macy of the logical or analytical mode that guides it. Using this characteristic fea-
ture as a criterion is potentially productive in suggesting ways to add components to the current construct of the academic literacy tests that are widely used in South Africa, such as TALL, TAG (the Afrikaans counterpart of TALL), and TALPS, as
well as a new test of academic literacy for Sesotho. Third, a recent analysis of the
diagnostic information that can be gleaned from TALPS (Pot 2013) may inform
strategies of utilising post-entry tests of language ability (PELAs) more efficiently.
This contribution includes suggestions for modifications and additions to the design
of current task types in tests of academic literacy. These tentative suggestions allow
theoretically defensible modifications to the design of the tests, and will be useful to
those responsible for developing further versions of these tests of academic
literacy.
The effects of apartheid on education in South Africa range from the unequal alloca-
tion of resources to the continuing contestation about which language or languages
should serve as medium of instruction, and at what level or levels. Inequality in
education has effects, too, at the upper end of education provision, when students
enter higher education.
Though increasing access to higher education since 1994 has been the norm in
South Africa, the resulting accessibility has not been devoid of problems, including
problems of language proficiency and preparedness of prospective new enrolments.
What is more, low levels of ability in handling academic discourse are among the
prime – though not the only – reasons identified as potential sources of low overall
academic performance, resultant institutional risk, and potential financial wastage
(Weideman 2003).
Responses to such language problems usually take the form of an institutional
intervention by providers of tertiary education: either a current sub-organisational
entity is tasked, or a new unit established, to provide language development courses.
Conventionally, such an entity (e.g. a unit for academic literacy) would not only be
established, but the arrangements for its work would be foreseen and regulated by
an institutional language policy. So in such interventions two of the three prime
applied linguistic artefacts, a language policy and a set of language development
courses (Weideman 2014), normally come into play. How to identify which students
need to be exposed to the intervention brings to the fore the third kind of applied
linguistic instrument, a language assessment in the form of an adequate and accept-
able test of academic language ability. These are administered either as high-stakes
language tests before access is gained, or as post-entry assessments that serve to
place students on appropriate language courses.
A range of post-entry tests of language ability (PELAs) in South Africa has been
subjected to detailed analytical and empirical scrutiny over the last decade. These
assessments include the Test of Academic Literacy Levels (TALL), its Afrikaans
counterpart, the Toets van Akademiese Geletterdheidsvlakke (TAG), and a post-
graduate test, the Test of Academic Literacy for Postgraduate Students (TALPS).
Further information and background is provided in Rambiritch and Weideman
(Chap. 10, in this volume).
For post-entry tests of language ability, as for all other language assessments,
responsible test design is a necessity. Responsible language test developers are
required to start by examining and articulating with great care the language ability
that they will be assessing (Weideman 2011). That is so because their definition of
this ability, the formulation of the hypothesized competence that will be measured,
is the first critically important step to ensure that they will be measuring fairly and
appropriately. What is more, it is from this point of departure – an articulation of the
construct – that the technical effectiveness or validity of the design process will be
steered in a direction that might make the results interpretable and useful. Without
a clearly demarcated construct, the interpretation of the results of a test is impossi-
ble, and the results themselves practically useless. What is measured must inform
the interpretation and meaning of the results of the measurement. These notions –
interpretability and usefulness of results – are therefore two essential ingredients in
what is the currently orthodox notion of test validation (Read 2010: 288; Chapelle
2012). An intelligible construct will also help to ensure that the instrument itself is
relevant, appropriate, and reliable, and that its uses and impact are beneficial (Knoch
and Elder 2013: 54f.).
The quest for a clear definition of the ability to be measured is complicated,
however, and the ultimate choice of that definition is not devoid of compromise
(Knoch and Elder 2013: 62f.). One reason for that is that some definitions of lan-
guage ability may be more easily operationalisable than others. A construct has to
be translated by test designers into specifications that include, amongst other things,
the determination of which task types and assessment formats will be used (Davidson
and Lynch 2002). It follows that test specifications must align with the definition if
the test design is to be theoretically and technically defensible. Language tasks that
are typical of the kind of discourse that is the target of the assessment should pre-
dominate. Yet some compromises may have to be made, not the least because test
developers are constrained by any number of administrative, logistical, financial
and other resource limitations, and might have to choose for their test those task
types that best operationalise the construct within those constraints (Van Dyk and
Weideman 2004b). The result of this may be that parts or components of a theoreti-
cally superior definition of language ability may either be overlooked or under-
emphasised. Messick (1998: 11) refers to this as construct under-representation,
observing that “validity is compromised when the assessment is missing something
relevant to the focal construct…” While a tight formulation of test specifications
may curb this, some difficult design decisions might still need to be made.
A further complication that presents itself is that test designers may, once the test
has been administered and used, realise that parts of it may be providing less useful
or beneficial information on the ability of the candidates who sat for it, so that they
require adjustment or redesign (McNamara and Roever 2006: 81f.). For example, if
subtest intercorrelations are calculated (Van der Walt and Steyn 2007; Myburgh
2015: 34, 85f.), it may become evident that a pair of them (e.g. an assessment of text
comprehension and interpreting a graph) may be highly correlated, and may well be
testing the same component of the construct, which raises the question of whether
both should be retained in subsequent designs. Or a test may have both multiple-choice questions and an open-ended response section that needs to be written afresh. If the results of these two sections are closely correlated, test designers may
ask: how much more information is provided by the much more labour-intensive
assessment of the open-ended response section? In such cases, they are faced with
a choice, therefore, of retaining the latter, perhaps for reasons of face validity, or of
excluding the more labour-intensive and usually less reliable assessment, since that
will cost less, without having to give up much as regards additional information
about the level of language ability of candidates. In the case of the tests being
referred to here, selected-response item formats were preferable for reasons of
resource constraints, the necessity for a rapid turnaround in producing the results,
and reliability.
The point, however, is that even the most deliberate design and careful piloting
of a test is no guarantee that it will be perfect the first or even the twelfth time it is
administered. As the validation processes of any test might reveal, redesign may be
needed at any time, but what is being argued here is that the starting point is always
the construct.
There is a third potential complication in trying to stay true to the definition of
the language ability being tested, which is that new insight into the workings of
language may allow us to gauge that ability better. The turn in language testing
towards looking at language as communicative interaction instead of a merely for-
mal system of knowledge (Green 2014: 173ff.; Chapelle 2012) constitutes an exam-
ple of this. New perspectives on language must of necessity have an effect on what
is tested, and that has certainly been the case in the tests being referred to here
(Weideman 2003, 2011).
A fourth difficulty that we have encountered in test design is where the test of
language ability depends on the curriculum of a national school language syllabus,
as in South Africa (Department of Basic Education 2011a, b). Here the high-stakes
examinations that make up the Grade 12 school exit examinations for “Home
Languages” have patently over time drifted away from the original intentions of the
official curricula. The intention of not only the syllabus that preceded the current
syllabus for Home Languages, but also of the current version, is to emphasize com-
municative function. Very little of that emphasis is evident today in the three papers
through which the differential, ‘high-level’ language ability that the curriculum calls
for is being assessed (see report to Umalusi by Du Plessis et al. 2016). As this report
makes clear, the best possible way to restore the integrity of these language assess-
ments is to reinterpret the assessment with reference to the curriculum, which speci-
fies the language construct to be tested. Without that reinterpretation, the
misalignment between curriculum and the final, high-stakes assessment will endure.
Despite the possible need for compromise referred to above, or the undesirable
potential of moving away from what by definition must be assessed in a test of
language ability, this contribution takes as its point of departure that the clear artic-
ulation of a construct remains the best guarantee for responsible test design. This
is indeed also one of the limitations of the discussion, since responsible test design
patently derives from much more than a (re-) consideration of the construct. There
are indeed many other points of view and design components that might well be
useful points of entry for design renewal, and that will, with one exception, not be
considered here, given the focus of this discussion on a single dimension: the theoretical (re-)consideration of the construct.
In Van Dyk and Weideman (2004a, b) there are detailed descriptions of the process
of how the construct underlying the current tests referred to above was developed,
and how the specifications for the blueprint of the test were arrived at. The test
designers of TALL, TAG and TALPS were looking for a definition of academic lit-
eracy that was current, relevant, and reflected the use of academic discourse in a
way that aligned with the notions that academics themselves have about that kind of
language.
Yet finding such a construct involved a long process. The construct eventually
adopted therefore derives from a developmental line that looks at language as dis-
closed expression (Weideman 2009), as communication, and not as an object
restricted to a combination of sound, form and meaning. Moreover, as Weideman
(2003) points out, it was required to take a view of the development of the ability to
use academic language as the acquisition of a secondary discourse (Gee 1998).
Becoming academically literate, as Blanton (1994: 230) notes, happens when
… individuals whom we consider academically proficient speak and write with something
we call authority; that is one characteristic — perhaps the major characteristic — of the
voice of an academic reader and writer. The absence of authority is viewed as powerlessness …
How does one assess ‘authority’ as a measure of proficiency, however? And how
does one characterise the ‘academic’ that stamps this authority as a specific kind?
Even though the elaboration of this ability to use academic language fluently was
wholly acceptable to the designers of the tests we are referring to here, the question
While the tests that were based on this construct have now been widely scrutinised
and their results subjected to empirical and critical analyses of various kinds (the
‘Research’ tab of ICELDA 2015 lists more than five dozen such publications), the
construct referred to above has not been further investigated in close to a decade of
use. In two recent studies of the construct undertaken to remedy this lack of critical
engagement, Patterson and Weideman (2013a, b) take as a starting point the
typicality of academic discourse as a kind of discourse distinct from any other. They
begin by tracing the idea of the variability of discourse to the sociolingual idea of a
differentiated ability to use language that goes back to notions first introduced by
Habermas (1970), Hymes (1971), and Halliday (1978), noting at the same time how
those ideas have persisted in more current work (Biber and Conrad 2001; Hasan
2004; Hyland and Bondi 2006). Specifically, they consider how acknowledging that
academic discourse is a specific, distinctly different kind of language will benefit
construct renewal.
The typicality of academic discourse, viewed as a material lingual sphere
(Weideman 2009: 40f.), is found to be closely aligned with the views espoused by
Halliday (1978, 2002, 2003), specifically the latter’s ideas of “field of discourse”,
genre, rhetorical mode, and register. Halliday’s claim (1978: 202; cf. too Hartnett
2004: 183) that scientific language is characterised by a high degree of nominalisa-
tion, however, is deficient in several respects. First, there are other kinds of dis-
course (e.g. legal and administrative uses of language) in which nominalisation is
also found. Second, it gives only a formal criterion for distinctness, neglecting the
differences in content that can be acknowledged when one views types of discourse
as materially distinct.
Turning to a consideration of various current definitions of academic literacy, Patterson and Weideman (2013a) find similar problems. In the
‘critical’ features of academic discourse identified by scholars such as Flower
(1990), Suomela-Salmi and Dervin (2009), Gunnarsson (2009), Hyland (2011; cf.
too Hyland and Bondi 2006), Livnat (2012), Bailey (2007: 10–11), and Beekman
et al. (2011: 1), we find either circular definitions that identify ‘academic’ with ref-
erence to the academic world itself, or features that are shared across a number of
discourse types. As Snow and Uccelli (2009) observe, the formally conceptualised
features of academic language that they have articulated with reference to a wide
range of commentators are not sufficient to define academic discourse.
Patterson and Weideman (2013a: 118) conclude that an acknowledgement of the
typicality of academic discourse that is most likely to be productive is one that
acknowledges both its leading analytical function and its foundational formative
(‘historical’) dimension:
Academic discourse, which is historically grounded, includes all lingual activities associ-
ated with academia, the output of research being perhaps the most important. The typicality
of academic discourse is derived from the (unique) distinction-making activity which is
associated with the analytical or logical mode of experience.
What is more, if one examines the various components of the construct referred
to above, it is clear that the analytical is already prominent in many of them. For
example, in logical concept formation, which is characterised by abstraction and
analysis (Strauss 2009: 12–14), we proceed by comparing, contrasting, classifying
and categorising. All of these make up our analytical ability to identify and
distinguish.
As Patterson and Weideman (2013b) point out, a number of the components of the
current construct already, as they should, foreground the analytical qualifying aspect
of academic discourse. Which components should in that case potentially be added?
Having surveyed a range of current ideas, these investigators identify the following
possible additions as components of a command of academic language that can be
demonstrated through the ability to:
• think critically and reason logically and systematically in terms of one’s own
research and that of others;
• interact (both in speech and writing) with texts: discuss, question, agree/dis-
agree, evaluate, research and investigate problems, analyse, link texts, draw logi-
cal conclusions from texts, and then produce new texts;
• synthesize and integrate information from a multiplicity of sources with one’s
own knowledge in order to build new assertions, with an understanding of aca-
demic integrity and the risks of plagiarism;
• think creatively: devise imaginative and original solutions, methods or ideas, which may involve brainstorming, mind-mapping, visualisation, and association;
• understand and use a range of academic vocabulary, as well as content or
discipline-specific vocabulary in context;
• use specialised or complex grammatical structures, high lexical diversity, formal
prestigious expressions, and abstract/technical concepts;
• interpret and adapt one’s reading/writing for an analytical/argumentative pur-
pose and/or in light of one’s own experience; and
• write in an authoritative manner, which involves the presence of an “I” address-
ing an imagined audience of specialists/novices or a variety of public audiences.
The first two additions may indicate the need not so much for a new task type as
for a new emphasis on comparing one text with another, which is already acknowl-
edged as a component of the construct. In some versions of TALL, for example, test
takers are already expected to identify clearly different opinions in more than one
text. Undoubtedly, more such comparisons are necessary to test critical insight into
points of agreement and disagreement, for example. Perhaps shorter texts with con-
trasting opinions might also be considered, but if this ability were to be properly
tested, it would add considerably to the length of a test. Otherwise, test designers
might be required to ask more questions such as the following (similar to those in
existing tests), which ask test takers to compare one part of a longer text with another:
The further explanation of exactly what the author means by using the term ‘development’
in the first paragraph we find most clearly in paragraphs
A. 2 & 3.
B. 3 & 4.
C. 5 & 7.
D. 6 & 8.
or
The author discusses two divergent opinions about tapping into wind power. These opposite
views are best expressed in paragraphs
A. 2 & 3.
B. 3 & 4.
C. 5 & 7.
D. 6 & 8.
We return below to the third addition, the ability to synthesize and integrate information, when we discuss another way of handling an additional writing task, based on the work of Pot (2013). One skill that relates, even at entry level to the academic world, to avoiding plagiarism and maintaining academic integrity, namely the ability to refer accurately to a multiplicity of sources, can perhaps be measured at that lower level in a task type such as the following:
References
Imagine that you have gone to the library to search for information in the form of books,
articles and other material, on the topic of “Making effective presentations”. You have
found a number of possible sources, and have made notes from all of them for use in your
assignment on this topic, but have not had the time to arrange them in proper alphabetical
and chronological sequence.
Look at your notes below, then place the entry for each source in the correct order, as for
a bibliography, by answering the questions below:
[Figure: mind-map of notes on ‘The African elephant’ – herds; social structure (matriarch, sisters, offspring); lifespan (death at 60; 100+); gestation (22 months); food (100 kg/day); water (100 litres); habitat; size; population]
Where has the word been deleted?
A. At position (i).
B. At position (ii).
C. At position (iii).
D. At position (iv).

Which word has been left out here?
A. indeed
B. very
C. former
D. historically

Where has the word been deleted?
A. At position (i).
B. At position (ii).
C. At position (iii).
D. At position (iv).

Which word has been left out here?
A. historical
B. latter
C. now
D. incontrovertibly
The last two additions proposed above by Patterson and Weideman (2013b),
namely the adaptation of one’s reading or writing for the purposes of an academic
argument, and the authoritative manner in which it should be delivered, may be less
relevant for post-entry assessments at undergraduate level. At higher levels they
may profitably be combined, however, so that both ‘authority’ and audience differ-
ence are allowed to come into play. We therefore turn next to a further consideration
of how these additions might be assessed.
The last two additions proposed to the construct concern not only reading (finding
information and evidence for the academic argument), but also writing and more
specifically, writing persuasively (Hyland 2011: 177) and with authority either for a
specialist or lay audience. As in the case of some of the other proposals, these addi-
tions appear to be more relevant for the discourse expected from seasoned academ-
ics than from entry-level beginners, who are normally the prime targets of post-entry
language assessments. It should be noted that initially all versions of TALL did
include a writing component, but the resources required to mark it reliably, as well
as the high correlation between its results and those of the rest of the test, resulted in its eventually being dropped.
From a design angle, the examination of the construct underlying post-entry tests of academic literacy in South Africa is potentially highly productive. In addition, tapping more efficiently into the diagnostic information the tests yield, as well as making modifications and additions to current test task types, will provide theoretically defensible changes to their design.
The possible additions to the design of the tests referred to in this chapter will benefit not only the current set of assessments but are also likely to improve the design of similar tests, in more languages than the current English and Afrikaans versions of the instruments. Butler and his associates at North-West University have, for example, already begun to experiment with translations of these post-entry assessments into Sesotho, a language widely used as a first language by large numbers of students on some of their campuses, but one that remains underdeveloped as an academic language (Butler 2015). A greater range of test task types will enhance the potential of the tests to provide results that are useful and interpretable, which may in turn help to inform policy decisions at institutions of higher learning in South Africa about expanding the current two languages of instruction to three, at least at some already multilingual universities.
Tests are therefore never neutral instruments, and neither is their refinement. In
examining components of their design critically, and discussing modifications to
them on the basis of such an examination, our goal is to continue to enhance their
worth and impact. The one respect – the re-articulation of the construct – in which
possible changes might be made, and that was discussed here, therefore also needs
to be augmented in future research by other considerations and design principles. If
Read (2010: 292) is correct in observing that the “process of test development …
does not really count as a research activity in itself”, we have little issue with that.
Test development processes, however, are never fully complete. Once constructed,
they are subject to redesign and refinement. Since the creativity and inventiveness of test designers take precedence over the theoretical justification of our designs, it would be a pity if the history of the rather agonising process of how best to assess a given construct were not recorded; if it goes unrecorded, we miss the opportunity of sharing with others a potentially productive design ingredient for making or re-making tests. In examining more closely what may initially have been secondary, namely the theoretical defence of the design, designers are stimulated, once they scrutinise that theoretical basis again, to bring their imaginations to bear on the redesign and refinement of their assessment instruments. There is a reciprocal relationship between the leading design function of an assessment measure and
its foundational analytical base (Weideman 2014). This discussion has therefore
aimed to provide such a record of test design, and redesign, which has been prompted
by a reconsideration of the theoretical definition of what gets tested – that is, the
construct.
References
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford: Oxford University Press.
Bailey, A. L. (Ed.). (2007). The language demands of school: Putting academic English to the test.
New Haven: Yale University Press.
Beekman, L., Dube, C., & Underhill, J. (2011). Academic literacy. Cape Town: Juta.
Biber, D., & Conrad, S. (2001). Register variation: A corpus approach. In D. Schiffrin, D. Tannen,
& H. E. Hamilton (Eds.), The handbook of discourse analysis (pp. 175–196). Malden:
Blackwell Publishing.
Blanton, L. L. (1994). Discourse, artefacts and the Ozarks: Understanding academic literacy. In
V. Zamel & R. Spack (Eds.), Negotiating academic literacies: Teaching and learning across
languages and cultures (pp. 219–235). Mahwah: Lawrence Erlbaum Associates.
Butler, G. (2007). A framework for course design in academic writing for tertiary education. PhD
thesis, University of Pretoria, Pretoria.
Butler, G. (2013). Discipline-specific versus generic academic literacy intervention for university
education: An issue of impact? Journal for Language Teaching, 47(2), 71–88. http://dx.doi.
org/10.4314/jlt.v47i2.4.
Butler, G. (2015). Translating the Test of Academic Literacy Levels (TALL) into Sesotho. To
appear in Southern African Linguistics and Applied Language Studies.
Chapelle, C. A. (2012). Conceptions of validity. In G. Fulcher & F. Davidson (Eds.), The Routledge
handbook of language testing (pp. 21–33). Abingdon: Routledge.
Cliff, A. F., & Hanslo, M. (2005). The use of ‘alternate’ assessments as contributors to processes
for selecting applicants to health sciences’ faculties. Paper read at the Europe Conference for
Medical Education, Amsterdam.
Cliff, A. F., Yeld, N., & Hanslo, M. (2006). Assessing the academic literacy skills of entry-level
students, using the Placement Test in English for Educational Purposes (PTEEP).
Mimeographed MS.
Davidson, F., & Lynch, B. K. (2002). Testcraft: A teacher’s guide to writing and using language
test specifications. New Haven: Yale University Press.
Department of Basic Education. (2011a). Curriculum and assessment policy statement (CAPS) for
English Home Language, Further Education and Training phase, grades 10–12. Pretoria:
Department of Basic Education.
Department of Basic Education. (2011b). Curriculum and assessment policy statement (CAPS) for
English first additional language, further education and training phase, grades 10–12. Pretoria:
Department of Basic Education.
Du Plessis, C., Steyn, S., & Weideman, A. (2016). Towards a construct for assessing high level
language ability in grade 12. Report to the Umalusi Research Forum, 13 March 2013.
Forthcoming on LitNet.
Flower, L. (1990). Negotiating academic discourse. In L. Flower, V. Stein, J. Ackerman, M. Kantz,
K. McCormick, & W. C. Peck (Eds.), Reading-to-write: Exploring a cognitive and social pro-
cess (pp. 221–252). Oxford: Oxford University Press.
Gee, J. P. (1998). What is literacy? In V. Zamel & R. Spack (Eds.), Negotiating academic litera-
cies: Teaching and learning across languages and cultures (pp. 51–59). Mahwah: Lawrence
Erlbaum Associates.
Green, A. (2014). Exploring language assessment and testing: Language in action. London:
Routledge.
Gunnarsson, B. (2009). Professional discourse. London: Continuum.
Habermas, J. (1970). Toward a theory of communicative competence. In H. P. Dreitzel (Ed.),
Recent sociology 2 (pp. 41–58). London: Collier-Macmillan.
Halliday, M. A. K. (1978). Language as social semiotic: The social interpretation of language and
meaning. London: Edward Arnold.
Halliday, M. A. K. (2002). Linguistic studies of text and discourse (J. Webster, Ed.). London: Continuum.
Halliday, M. A. K. (2003). On language and linguistics (J. Webster, Ed.). London: Continuum.
Hartnett, C. G. (2004). What should we teach about the paradoxes of English nominalization? In
J. A. Foley (Ed.), Language, education and discourse: Functional approaches (pp. 174–190).
London: Continuum.
Hasan, R. (2004). Analysing discursive variation. In L. Young & C. Harrison (Eds.), Systemic
functional linguistics and critical discourse analysis: Studies in social change (pp. 15–52).
London: Continuum.
Hyland, K. (2011). Academic discourse. In K. Hyland & B. Paltridge (Eds.), Continuum compan-
ion to discourse analysis (pp. 171–184). London: Continuum.
Hyland, K., & Bondi, M. (Eds.). (2006). Academic discourse across disciplines. Bern: Peter Lang.
Hymes, D. (1971). On communicative competence. In J. B. Pride & J. Holmes (Eds.),
Sociolinguistics: Selected readings (pp. 269–293). Harmondsworth: Penguin.
ICELDA (Inter-Institutional Centre for Language Development and Assessment). (2015). [Online].
Available http://icelda.sun.ac.za/. Accessed 10 May 2015.
Knoch, U., & Elder, C. (2013). A framework for validating post-entry language assessments
(PELAs). Papers in Language Testing and Assessment, 2(2), 48–66.
Livnat, Z. (2012). Dialogue, science and academic writing. Amsterdam: John Benjamins.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Oxford: Blackwell.
Messick, S. (1998). Consequences of test interpretation and use: The fusion of validity and values
in psychological assessment. Princeton: Educational Testing Service. [Online]. Available http://
ets.org/Media/Research/pdf/RR-98-48.pdf. Accessed 15 Apr 2015.
Myburgh, J. (2015). The assessment of academic literacy at pre-university level: A comparison of
the utility of academic literacy tests and Grade 10 Home Language results. MA dissertation,
University of the Free State, Bloemfontein.
Patterson, R., & Weideman, A. (2013a). The typicality of academic discourse and its relevance for
constructs of academic literacy. Journal for Language Teaching, 47(1), 107–123. http://dx.doi.
org/10.4314/jlt.v47i1.5.
Patterson, R., & Weideman, A. (2013b). The refinement of a construct for tests of academic liter-
acy. Journal for Language Teaching, 47(1), 125–151. http://dx.doi.org/10.4314/jlt.v47i1.6.
Pot, A. (2013). Diagnosing academic language ability: An analysis of TALPS. MA dissertation,
Rijksuniversiteit Groningen, Groningen.
Pot, A., & Weideman, A. (2015). Diagnosing academic language ability: Insights from an analysis
of a postgraduate test of academic literacy. Language Matters, 46(1), 22–43. http://dx.doi.org/
10.1080/10228195.2014.986665.
Read, J. (2010). Researching language testing and assessment. In B. Paltridge & A. Phakiti (Eds.),
Continuum companion to research methods in applied linguistics (pp. 286–300). London:
Continuum.
Snow, C. E., & Uccelli, P. (2009). The challenge of academic language. In D. R. Olson &
N. Torrance (Eds.), The Cambridge handbook of literacy (pp. 112–133). Cambridge: Cambridge
University Press.
Strauss, D. F. M. (2009). Philosophy: Discipline of the disciplines. Grand Rapids: Paideia Press.
Suomela-Salmi, E., & Dervin, F. (Eds.). (2009). Cross-linguistic and cross-cultural perspectives
on academic discourse. Amsterdam: John Benjamins.
Van der Walt, J. L., & Steyn, H. (2007). Pragmatic validation of a test of academic literacy at ter-
tiary level. Ensovoort, 11(2), 138–153.
Van Dyk, T., & Weideman, A. (2004a). Switching constructs: On the selection of an appropriate
blueprint for academic literacy assessment. Journal for Language Teaching, 38(1), 1–13.
Van Dyk, T., & Weideman, A. (2004b). Finding the right measure: From blueprint to specification
to item type. Journal for Language Teaching, 38(1), 15–24.
Weideman, A. (2003). Assessing and developing academic literacy. Per Linguam, 19(1 and 2),
55–65.
Weideman, A. (2006). Assessing academic literacy in a task-based approach. Language Matters,
37(1), 81–101.
Weideman, A. (2007). Academic literacy: Prepare to learn (2nd ed.). Pretoria: Van Schaik
Publishers.
Weideman, A. (2009). Beyond expression: A systematic study of the foundations of linguistics.
Grand Rapids: Paideia Press.
Weideman, A. (2011). Academic literacy tests: Design, development, piloting and refinement.
SAALT Journal for Language Teaching, 45(2), 100–113.
Weideman, A. (2013). Academic literacy interventions: What are we not yet doing, or not yet doing
right? Journal for Language Teaching, 47(2), 11–23. http://dx.doi.org/10.4314/jlt.v47i2.1.
Weideman, A. (2014). Innovation and reciprocity in applied linguistics. Literator, 35(1), 1–10.
http://dx.doi.org/10.4102/lit.v35i1.1074.
Weideman, A., & Van Dyk, T. (Eds.). (2014). Academic literacy: Test your competence.
Potchefstroom: ICELDA.
Yeld, N., et al. (2000). The construct of the academic literacy test (PTEEP) (Mimeograph). Cape
Town: Alternative Admissions Research Project, University of Cape Town.
Chapter 10
Telling the Story of a Test: The Test
of Academic Literacy for Postgraduate
Students (TALPS)
Abstract This chapter follows Shohamy’s (2001) exhortation “to tell the story of the test”. It begins by highlighting the need for the Test of Academic Literacy for
Postgraduate Students (TALPS), for use primarily as a placement and diagnostic
mechanism for postgraduate study, before documenting the progress made from its
initial conceptualisation, design and development to its trial, results and its final
implementation. Using the empirical evidence gathered, assertions will be made
about the reliability and validity of the test. Documenting the design process ensures
that relevant information is available and accessible both to test takers and to the
public. This telling of the story of TALPS is the first step in ensuring transparency
and accountability. The second is related to issues of fairness, especially the use of
tests to restrict and deny access, which may occasion a negative attitude to tests.
Issues of fairness dictate that test designers consider the impact of the test;
employ effective ways to promote the responsible use of the test; be willing to miti-
gate the effects of mismeasurement; consider potential refinement of the format of
the test; and ensure alignment between the test and the teaching/intervention that
follows. It is in telling the story of TALPS, and in highlighting how issues of fair-
ness have been considered seriously in its design and use that we hope to answer a
key question that all test designers need to ask: Have we, as test designers, suc-
ceeded in designing a socially acceptable, fair and responsible test?
A. Rambiritch (*)
Unit for Academic Literacy, University of Pretoria, Pretoria, South Africa
e-mail: Avasha.Rambiritch@up.ac.za
A. Weideman
Office of the Dean: Humanities, University of the Free State, Bloemfontein, South Africa
e-mail: WeidemanAJ@ufs.ac.za
Twenty years after the demise of apartheid, South Africa is still reeling from its
effects, as was indicated in the previous chapter (Weideman et al. 2016). The trauma
of Bantu education, a separatist system of education, continues to reverberate
through the country. Unfair and unequal distribution of resources, poor and/or
unqualified teachers, and overcrowded classrooms have had long-term effects on
the education system, and on the actual preparedness of (historically disadvantaged)
students for tertiary study, even today. Despite the progress made in many areas, the
policy of racial segregation has left fractures that will take years still to heal. One
area most in need of healing remains the historically contentious issue of language
and its use as a medium of instruction, especially in institutions of higher
education.
Since 1994, tertiary institutions have had to deal with the challenge of accepting
students whose language proficiency may be at levels that would place them at risk,
leading to low pass rates and poor performance. This is a problem not specific only
to students from previously disadvantaged backgrounds. Language proficiency is
low even amongst students whose first languages are English or Afrikaans, which
are still the main languages of teaching and learning at tertiary level in South Africa.
Low levels of proficiency in English generally mean that students are not equipped
to deal with the kind of language they encounter at tertiary level. For many students
who have been taught in their mother tongue, entering university is their first experi-
ence of being taught in English.
Tertiary institutions, including those considered previously advantaged, today
need contingency measures to deal with this situation. Not accepting students
because of poor language proficiency would simply have meant a repetition of the
past, since the issue of being denied access is one that is rooted in the history of the
country. The trend has been to set up specific programmes to assist these students.
Different institutions have, however, taken different routes. Some have set up aca-
demic support programmes, departments and units, while others have offered
degrees and diplomas on an extended programme system, where the programme of
study is extended by a year to ensure that the relevant academic support is
provided.
At the University of Pretoria, as at other universities in the country, poor pass
rates and low student success are issues of concern. The university also attracts
students from other parts of Africa and the world, lending even more diversity to an
already diverse environment. Mdepa and Tshiwula (2012: 27) note that in 2008
more than 9500 students from African countries outside of the 15-state Southern
African Development Community (SADC) studied in South Africa. Very often
these students do not have English as a first language, so measures have to be put in place to assist them to succeed academically. At the University of
Pretoria one such measure was put in place at the first two levels of postgraduate
study (honours and master’s level), by offering as an intervention a module focused on helping students to develop their academic writing. Over time, there has been an
increasing demand for the course, as supervisors of postgraduate students have rec-
ognised the inadequate academic literacy levels of their students.
Butler’s (2007) study focused on the design of this intervention, a course that
provides academic writing support for postgraduate students. He found that the stu-
dents on the course were mostly unfamiliar with academic writing conventions,
were often unable to express themselves clearly in English, and had not “yet fully
acquired the academic discourse needed in order to cope independently with the
literacy demands of postgraduate study” (Butler 2007: 10). Information elicited
from a questionnaire administered to students and supervisors and from personal
interviews with supervisors as part of Butler’s study confirmed that these students
experienced serious academic literacy problems, and that as a result they might not complete their studies in the required time. Also worrying is the fact that the survey
showed that some students (20 %) had never received any formal education in
English and that a large group of them (30 % for their first degree and 44 % for hon-
ours) did not use English as a medium of instruction for their previous degrees
(Butler 2009: 13–14). What became clear from the results of the study was the need
for a “reliable literacy assessment instrument” that would “provide one with accu-
rate information on students’ academic literacy levels” (Butler 2007: 181).
This contribution is focused specifically on telling the story of the development
and use of that test, i.e. on the design and development of what is now called the Test
of Academic Literacy for Postgraduate Students (TALPS). In so doing, it takes the
first step towards ensuring transparency and accountability in the design and devel-
opment of such tests. We turn below first to an exposition of these two ideas, and
subsequently to how they relate to interpreting and using test results, as well as the
interventions – in the form of language development courses – that the use of their
results implies. Finally, we consider what other tests that aim to earn a reputation of
being responsibly designed might be able to learn from the intentions and the objec-
tives of the designers of TALPS.
Unfair tests, unfair testing methods and the use of tests to restrict or deny access
contribute to a negative attitude to tests. The move in the recent past (Shohamy
2001, 2008; Fulcher and Davidson 2007; McNamara and Roever 2006; Weideman
2006, 2009) has therefore been to promote the design and development of fair tests,
by test developers who are willing to be accountable for their designs. This can be
seen as part of a broader social trend whereby the concept of transparency has
become the watchword in government and politics, in the corporate world, in the
media and even in the humanities and social sciences. Naurin (2007) states that
transparency literally means that it is possible to look into something, to see what is
going on.
In explaining his use of the term accountability in the field of language testing,
Weideman (2006: 72) turns to the definition provided by Schuurman (2005), which
stresses the need for actors to be aware of their actions and to “give account of the
same to the public” (Schuurman 2005: 42).
It is clear that test developers should therefore be concerned with making infor-
mation about their tests available to those most affected, and be willing to take
responsibility for their designs. These issues become relevant when one works
within a framework such as that proposed by Weideman (2009), which calls for a
responsible agenda for applied linguistics, to ensure that the notions of responsibility,
integrity, accessibility and fairness can be articulated in a theoretically coherent and
systematic way. The framework he refers to is based on a “representation of the
relationship among a select number of fundamental concepts in language testing”
(Weideman 2009: 241; for a more detailed exposition, cf. Weideman 2014).
Weideman points out that the technical unity of multiple sources of evidence – relating, for example, to the reliability of a test, its validity and its rational justification, and brought together systematically in a validation argument – utilises several such foundational or constitutive applied linguistic concepts (2009: 247). These may also
be designated necessary requirements for tests, and in what follows these require-
ments will again be highlighted with reference to the process of the design of
TALPS.
In the framework of requirements employed here, the design of a test also links
with ideas of its public defensibility or accountability, and the fairness or care for
those taking a test (Weideman 2009: 247). In employing a set of design conditions
that incorporates reference to the empirical properties and analyses of a test, as well
as a concern for the social dimensions of language testing, one is able to ensure that
transparency and accountability can be taken into consideration in the testing pro-
cess. This contribution takes the telling of the story of the design and development
of TALPS as a starting point for ensuring transparency and accountability.
A first step for the developers of TALPS was to find an appropriate construct on
which to base the test and the intervention. Bachman and Palmer (1996: 21) define
a construct as the “specific definition of an ability that provides the basis for a given
test or test task and for interpreting scores derived from this task.” The developers
chose to base TALPS on the same construct as the Test of Academic Literacy Levels
(TALL) (see Patterson and Weideman 2013; Van Dyk and Weideman 2004). The
TALL, an undergraduate level test, was in that sense a sounding board for TALPS –
the success of TALL was in fact one of the most important motivations for the
development of the postgraduate assessment. Both these tests are designed to test
the academic literacy of students, the difference being that one is directed at first
year students while the other is intended for postgraduate students. For the blueprint
of the test, we refer to Weideman et al. (2016 – this volume). It should be noted,
however, that while the components of the construct were considered to be ade-
quate, the developers took into account that in its operationalisation – the specifica-
tion and development of task types to measure the construct – care had to be taken
with both the level (postgraduate) and format (including the consideration of more
and other subtests than in an undergraduate test).
2.2 Specification
The next step for the developers of TALPS was to align the construct of the test with
specifications. Davidson and Lynch (2002: 4; cf. too Davies et al. 1999: 207) state
that “the chief tool of language test development is a test specification, which is a
generative blueprint from which test items or tasks can be produced”. They observe
that a well-written test specification can generate many equivalent test tasks. The
discussion of specifications at this point is focused specifically on item type specifications and how these align with the construct of academic literacy used for this test.
In addition to the task types (subtests) employed in TALL, it was decided to
include in TALPS a section on argumentative writing. At postgraduate level it is
essential that students follow specific academic writing conventions and it is impor-
tant to test whether students are equipped with this knowledge. Butler (2009: 294)
states: “In the development of TALPS we have also considered the importance of
testing students’ productive writing ability specifically (in the production of an
authentic academic text), as well as their editing ability”. Important for Butler
(2009: 11) was the fact that if the test did not contain a section on writing, it would
affect face validity. In addition to the question on writing there is a question that
tests students’ editing skills. Table 10.1 outlines the eight sections that now appear in TALPS, together with a brief summary of the aspect of academic literacy that each tests.
With the exception of Section 8, all other subtests are in multiple-choice format, as
for TALL, and for the same reasons: ease of marking, reliable results, early avail-
ability of results, economical use of resources, and the availability of imaginative
and creative design capacity (Van Dyk and Weideman 2004: 16). It is also important, as pointed out by Du Plessis (2012), and this ties in closely with the need for transparency and accountability, that test takers have the opportunity to consult sample tests before attempting to write TALPS, so gaining an understanding of how multiple-choice test items work in language tests at this level.
The process that was followed in developing TALPS saw the refinement of three
drafts, with the third draft version of the test becoming the first post-refinement
pilot. This (third draft) version of the test was made up of the following task types,
items and marks, as indicated in Table 10.2.
Between draft one and draft three most changes were made in the Understanding
texts section. In draft one this section had 45 items, in draft two it had 28 items and
in the third and final draft version of the test it had 21 items, some weighted more
heavily in order to achieve better alignment with what were considered to be criti-
cally important components of the construct. The four items in this section that were
weighted more heavily measured aspects relating to the student’s ability to understand and …
The statistics package used (TiaPlus: cf. CITO 2006) provides us with two mea-
sures of test consistency: Cronbach’s alpha and Greatest Lower Bound (GLB). All
pilots of the test rendered very impressive reliability measures. The first pilot had a
reliability of 0.85 (Cronbach’s alpha) and 0.92 (GLB). One pre-final draft had mea-
sures of 0.93 (Cronbach’s alpha) and 1.00 (GLB). The final version of the test had
measures of 0.92 (Cronbach’s alpha) and 0.99 (GLB). In the TALPS final version,
the standard error of measurement for the combined group of students is at 3.84.
Another statistical measure rendered by the package is the average Rit-value (the item–test correlation), an index of the discriminative ability of the test items. One of the main purposes of a
test is to be able to discriminate between the test-takers (Kurpius and Stafford 2006:
115). The mean Rit-values for the third pilot are relatively stable at 0.40, which is
well above the 0.30 benchmark chosen. In addition, the variance around the mean
seems to be quite stable, suggesting a normal or even distribution of scores around
the mean.
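Although TiaPlus itself is proprietary software, the classical test theory indices cited here are straightforward to reproduce. The following sketch in Python is illustrative only: the function names are ours rather than TiaPlus output, and the GLB, whose estimation is computationally far more involved, is omitted.

import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_test_takers x n_items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def rit_values(scores):
    """Rit: the correlation of each item score with the total test score."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    return np.array([np.corrcoef(scores[:, i], total)[0, 1]
                     for i in range(scores.shape[1])])

def standard_error_of_measurement(scores):
    """SEM = standard deviation of total scores x sqrt(1 - reliability)."""
    scores = np.asarray(scores, dtype=float)
    total_sd = scores.sum(axis=1).std(ddof=1)
    return total_sd * np.sqrt(1 - cronbach_alpha(scores))

Applied to a matrix of dichotomously scored items, these functions return values on the same scales as those reported above: alpha as a coefficient between 0 and 1, Rit as a correlation, and the SEM in score points.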
The need for validation at the a priori stage of test development (Weir 2005) is
provided for here in this initial evidence that TALPS is a highly reliable test. What is
more, the test is based on a theoretically defensible construct, and an argument can
be made for the close alignment of that construct and the test items. The Rit-values
of the items further indicate that the test discriminates well between test-takers.
Finally, the internal correlations of the different test sections satisfy specific criteria
and the face validity of the test meets the expectations of potential users. As regards
the latter, the observations supporting Butler’s (2009) study persuaded the test
designers that it would be difficult to promote a test, especially among postgraduate
supervisors, a prime group of users of the results of the test, if it did not have a sub-
section assessing academic writing. While acknowledging that this open-ended for-
mat would be less easy to administer than a selected-response format, the test
developers believed that the evidence indicated that the inclusion of a writing sub-
test in the test would without doubt enhance its intuitive technical appeal or face
validity (for a further discussion, see Du Plessis 2012: 65). These initial observa-
tions – on reliability, alignment with the construct, ability to discriminate, subtest
intercorrelations, and face validity – provide the first pieces of evidence for a sys-
tematic and integrated argument in the process of validating TALPS.
However, this is just the first part of the tale. A fair and responsible test is one that
is not only valid and reliable, and for which validation evidence can be systemati-
cally presented, but socially acceptable as well. McNamara and Roever’s (2006: 2)
observation that a “psychometrically good test is not necessarily a socially good
test” is relevant here, because a core concern in test design is the social responsibil-
ity that the test developers have, not just to the test takers (postgraduate students) but
to everyone affected by the test – supervisors, parents, test administrators and soci-
ety at large. Experts acknowledge that language testing cannot take place in isola-
tion, without reference to context, or excluding the very people whose lives are most
affected by the use of test scores. Shohamy (1997, 2001), McNamara and Roever
(2006), Bachman and Palmer (1996), and Hamp-Lyons (2000a, b, 2001), among
others, have pointed out the negative ways in which tests have been used, and the
negative effects these have had on test takers. In addition, they have discussed what
they believe test developers (and test takers) should do to ensure the development
and use of fair and responsible tests. In the hope of contributing to our understand-
ing of responsible test design, the next part of this narrative deals with issues related
to the social dimensions of language testing.
… experts who may be tempted to hide behind the “scientific” authority of their designs
(Weideman 2006: 80). Full contact details of the designers are available on the
ICELDA website, allowing any prospective user or test taker the opportunity to
make contact should they choose to. In Du Plessis’s (2012) study, there is a fairly
comprehensive survey of students’ perceptions of TALPS, and she finds that,
although “much can be done to increase transparency about the nature and purpose
of the postgraduate literacy test, the results of the survey do not support the initial
hypotheses that students would be predominantly negative towards the TALPS and
that its face validity would be low” (Du Plessis 2012: 122). The test takers over-
whelmingly agreed (directly after they had taken the test, but before their results
were known) that they thought the test would measure accurately and fairly. In addi-
tion to this, and in line with the further suggestions made in this study for increasing
transparency, the test has been promoted effectively within the institutions where it
is used. Presentations have been made to faculties about the value of the test and the
intervention programme. As was the case with TALL, the test developers have pub-
lished articles about TALPS in scholarly journals (see Rambiritch 2013), in addition
to presenting papers/seminars at national and international conferences. By doing
this, the test developers have sought the opinions of other experts in the field. As has
been pointed out above, the opinions of prime users of the test, postgraduate super-
visors (cf. Butler 2009), had already been sought at the design stage, especially
when considering the face validity of the test. A test of this nature will constantly
need refinement, and such refinement is stimulated by its being evaluated by others
working in the same field.
The level of transparency and openness about TALPS – whether it is as yet adequate or not – that has been achieved is a prerequisite for satisfying concerns related
to accountability and the need to take responsibility for its design. In the case of
applied linguists and test designers the challenge, however, is always to be “doubly
accountable” (Bygate 2004: 19), that is, accountable to peers within the discipline
within which they work, and accountable to the communities they serve.
This need to publicly defend the design, or public accountability in language
testing, has been referred to by many in the field. Boyd and Davies (2002) call for
the profession of language testing to have high standards, with members who are
conscious of their responsibilities and open to the public (2002: 312), noting that it
is not too late for language testers to “build in openness to its professional life”
(2002: 312). Rea-Dickins (1997: 304) sees a need for healthy and open relationships
among all stakeholders (learners, teachers, parents, testers and authorities) in the
field of language testing. She states that “a stakeholder approach to assessment has
the effect of democratising assessment processes, of improving relationships
between those involved, and promoting greater fairness” (1997: 304). As can be
seen, the accountability of the language tester must extend to the public being
served. Defining public accountability, however, is a fairly easy task; ensuring
accountability to the public less so. Public accountability starts with transparency, with being aware of the kind of information that is made available (Bovens 2005: 10).
In the case of TALPS, the websites and the pamphlets distributed to students will
go a long way towards ensuring that users are provided with information regarding the test. Since that information should be available not only to users, but also to the
larger public, the challenge is to translate the technical concepts that are embodied
in the test and its assessment procedures into more readily accessible, non-specialist
language, while at the same time relating their theoretical meaning to real or per-
ceived social and political concerns. Test practices in South Africa are often exam-
ined more closely in radio interviews, newspaper reports and interviews, or formal
or informal talks. While there is unfortunately no comprehensive record of these,
the ‘News’ tab on the ICELDA website, which dates back to September 2011, none-
theless provides insight into some of the public appearances by ICELDA officials,
or of media coverage of test-related issues. In academic and non-academic settings,
however, public explanations of how test results can be used must be quite open
about both the value and the limitations of language tests, for example that tests are
highly useful for identifying risk, but still cannot predict everything. Language tests lose their predictive value for future performance, for instance, with every subsequent
year of a student’s study at university (Visser and Hanslo 2005). Humility about
what a test result can tell us, and what it cannot tell us, relates directly to openness
about the limited scope of such assessment (Weideman 2014: 8).
If we wish to demonstrate further the required care and concern for others
(Weideman 2009: 235) in our test designs, we should also look beyond the assess-
ment, to what happens after the test, and even after the students have completed
their studies (Kearns 1998: 140). The reality is that testing the academic literacy of
students but doing nothing to help them subsequently may be considered a futile
exercise. Issues of accountability dictate that if we test students, we should do
something to help them improve the ability that has been measured, if that is indi-
cated by the test results. The responsibility of ethical language testers extends into
the teaching that follows, a point that we shall return to below, in the discussion of
one institutional arrangement for this. Decisions related to possible interventions
are therefore not made in isolation. Rather, these concerns should be uppermost in
the minds of test designers, and from an early stage in the design process. The ear-
lier these concerns are acknowledged, even as early on as the initial conceptualisa-
tion of the test and its use, the more likely they are to be productively addressed.
One of the key considerations of the test designers when designing a test of this
nature is the question of how the results of the test will be used. In a responsible
conception of what test design and use involves, there is a shift in responsibility
from those who use tests to the test designers themselves. Once the need for TALPS
had been established, the next important consideration was, therefore, the interpre-
tation and use of the results of the test.
The use of tests to deny access has been well documented in the literature on
language testing (see Fulcher and Davidson 2007; Shohamy 2001). If the focus in
education and testing should be on granting rather than denying access (Fulcher and
Davidson 2007: 412), a test like TALPS can be used to do exactly that – facilitate
access. As is clear from the introduction, without an intervention nurturing the
development of academic literacy that follows the administration of the test, many
students may not successfully complete their studies. In discussing SATAP (the Standardised Assessment Test for Selection and Placement), Scholtz and Allen-Ile
(2007) observe that an academic literacy test is essential in providing “insight into
the intellectual profile and academic readiness of students” and that subsequent
interventions have positive and financial implications: the individual becomes economi-
cally productive, it improves through-put rates and subsidies for institutions, and contrib-
utes to economic advancement in South Africa. (Scholtz and Allen-Ile 2007: 921)
It is clear that the negative social and other consequences of language assess-
ments that have in the past affected such tests (as discussed by McNamara and
Roever 2006: 149 f. under the rubric of “Language tests and social identity”, as well
as by Du Plessis 2012: 122–124), have been mitigated in tests like TALL and
TALPS. These have purposely been designed to assist rather than disadvantage the
test taker. The test developers of TALPS insist that should users choose to use the
test for access – gaining entry or admission to postgraduate study – rather than
placement (post-admission referral or compulsion to enrol for an academic literacy
intervention), they should use at least three other criteria or instruments to measure
students’ abilities rather than rely solely on TALPS. The results of the test should be
used with care, since language ability alone can never predict all of a candidate’s future perfor-
mance. This is in keeping with AERA’s (1999: 146) advice that, in educational set-
tings, a decision or characterisation that will have a major impact on a student
should not be made on the basis of a single test score. Other relevant information
should be taken into account if it will enhance the overall validity of the decision
(AERA 1999: 146). A mix of information on which to base access decisions has
therefore been proposed, with a weighting of 60 % to prior academic performance,
and, in line with other findings by South African researchers of fairly low correla-
tions between academic literacy and academic performance (cf. Maher 2011: 33,
and the discussion of such investigations), not more than 10–20 % to language abil-
ity, with one possible exception:
This is where the ability is so low (usually in the lowest 7½ % of testees) that it raises ethical
questions about allowing those in who so obviously fall short of requirements that they will
waste their time and resources on a hopeless venture. (Weideman 2010, Personal
communication)
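As an illustration of how such a mix might be operationalised, consider the following sketch. The component names, the residual weighting, and the handling of the exception are our assumptions, since the text specifies only the 60 % weighting, the 10–20 % band, and the lowest-7½ % exception.

def access_decision(prior_academic, language, other,
                    w_prior=0.60, w_language=0.15, w_other=0.25,
                    language_percentile=None, floor_percentile=7.5):
    """Illustrative composite of admission evidence; inputs on a 0-100 scale.

    Weights follow the proposal above: 60% to prior academic performance,
    10-20% to language ability (15% assumed here), the rest to other criteria.
    The exception quoted above concerns the lowest 7.5% of test takers;
    it is modelled here as an optional percentile flag.
    """
    assert abs(w_prior + w_language + w_other - 1.0) < 1e-9
    if language_percentile is not None and language_percentile < floor_percentile:
        return None  # flag for separate ethical review rather than a composite
    return w_prior * prior_academic + w_language * language + w_other * other

Whatever the exact weights chosen, the point of the design is the same: the language test score is one piece of evidence among several, never the sole basis for an access decision.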
The test developers of TALPS have attempted to consider, from as early as the
design stage, to what use the results of the test would be put. This concern with the
consequences of the use of the test and its results to make judgements about test
takers points directly to issues of responsibility on the part of the test developers. It
also links to the issue of the responsible interpretation of test results. We return
below to further strategies for minimising the potentially negative impact of test
results. In the next section, however, we first consider the impact of the interpreta-
tion of test results.
It is also the responsibility of test designers to stipulate how to interpret the results
of a test. Because test results almost always have effects (positive and negative) on
test takers, it is imperative that a test is administered and scored according to the test
developers’ instructions (AERA 1999: 61). Thus, an important consideration on the
part of the test designers of TALPS was to provide advice to test users and test takers
on how to interpret the results of the test. The concern here was that allocating a
‘Pass’ or ‘Fail’, or a test score that was difficult to interpret, would stigmatise the
students and the test. Instead the test developers of TALL and TALPS use a scale
that does not distinguish between a ‘Pass’ or ‘Fail’, but rather indicates the level of
risk the student has in relation to language, as well as the measures that should be
taken to minimise or eliminate such risk. Table 10.3 presents the scoring scale for
TALPS, as well as advice to students on how this should be interpreted in the insti-
tutional context of the University of Pretoria:
The decision to use this scale, as pointed out by Du Plessis (2012: 108), has been
based on years of research undertaken by the test developers of the TALL and
TALPS, and the examination of test and average performance scores obtained at
different levels of study (Du Plessis 2012). By making the results available in risk
bands, they become more interpretable, and, since only the risk bands (in the
‘Interpretation’ column) are made public, and not the numerical range of the band
or the raw mark, the results are less likely to stigmatise. It is also more meaningful,
both for test takers and their supervisors, to know the degree of risk (high, clear,
less, etc.), rather than to be given a less easily interpretable raw mark.
This interpretation scale opens up another potential refinement to the administra-
tion of TALPS, to be discussed in more detail in the next section. Since no test is ever fully reliable, those who are possible borderline cases (as in the narrow band of code 3 cases above) could potentially be required to write a second-chance test.
Table 10.3 Guidelines for interpreting the test scores for the TALPS

Code 1 (0–33 %): High risk – an academic writing intervention (EOT 300) is compulsory
Code 2 (34–55 %): Clear risk – EOT 300 is compulsory
Code 3 (56–59 %): Risk – EOT 300 is compulsory
Code 4 (60–74 %): Less risk – you do not need to enrol for EOT 300
Code 5 (75 % and above): Little to no risk – you do not need to enrol for EOT 300
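Expressed as an explicit decision rule, the banding in Table 10.3 amounts to the following minimal sketch (the function name is ours; the thresholds are those of the table):

def talps_risk_band(score_pct):
    """Map a TALPS percentage score to the risk bands of Table 10.3."""
    if score_pct <= 33:
        return "Code 1", "High risk: EOT 300 compulsory"
    if score_pct <= 55:
        return "Code 2", "Clear risk: EOT 300 compulsory"
    if score_pct <= 59:
        return "Code 3", "Risk: EOT 300 compulsory"
    if score_pct <= 74:
        return "Code 4", "Less risk: EOT 300 not required"
    return "Code 5", "Little to no risk: EOT 300 not required"

In line with the policy described above, only the band label, not the numerical range or the raw mark, would be made public.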
This section sets out possible refinements that might be made to enhance the effec-
tiveness of TALPS (and by implication other tests of this nature). It does not yet
apply to the current version of TALPS, discussed above, but to designs that still
have to be thought through further before being realised in practice. The further
possibility of refining TALPS derives from a useful and potentially productive pro-
posal of how to deal effectively with, among other elements, the writing component
of the current tests. Pot (2013: 53) found that, in the case of TALPS, “the greatest
challenge for students is to present a coherent, well-structured academic argument,
and to do so by making use of the appropriate communicative functions used in
academic discourse. In essence, students fail to grasp the main concept of present-
ing an academic argument in written form.”
It is interesting to note that what she suggests in the case of postgraduate tests of academic literacy can equally and profitably be applied to undergraduate tests, which as a rule do not have a separate writing section, as TALPS does. Her proposal may be
less resource-intensive than the addition to the undergraduate tests of a full-blown
writing task. It is instructive to note, moreover, where this author’s engagement with these proposals derives from. As post-entry tests like TALL, TAG, and TALPS began to be widely used, the demand from course designers to derive diagnostic information from them grew. So the initial goal of her research was to untangle, from the mass of information yielded by the test results, that which could assist subsequent course design. The identification of the benefits
to be gained from unlocking the diagnostic information of TALPS is the focus of
another report (Pot and Weideman 2015), but we wish to focus here only on a num-
ber of proposals she makes in the conclusion of her investigation of the diagnostic
information to be gleaned from TALPS - proposals that might nonetheless enhance
the design of all the post-entry tests discussed in this volume.
Building on design ideas already used, for example, in post-entry tests developed
in Australia and New Zealand, Pot’s (2013: 54 ff.) proposal is that the designers
consider the introduction of a two-tier test. This would mean splitting the test in
two, first testing all candidates sitting for the TALPS on the first seven subsections
of the test, all of which are in multiple choice format: Scrambled text, Interpreting
graphs and visual information, Academic vocabulary, Text types, Understanding
texts, Grammar and text relations, and Text editing. Subsequently, should candi-
dates have scored below the cut-off point (currently at 60 %) for this first test, which
is an indication of risk, or if they have scored low on the two subtests (Sections 6 and
7, Grammar and text relations and Text editing) that have in the past shown very
high correlations with the writing section, or if they are borderline cases identified
through empirical analyses of potential misclassifications related to the reliability of
the test (Van der Slik and Weideman 2005: 28), they are given a second opportunity
to have their ability to handle academic discourse assessed. In the case of the risk
bands outlined in the previous section (Table 10.3), this might be those in the code
3 category.
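A sketch of how this two-tier routing might be implemented is given below. The subtest names and the 60 % cut-off come from the text; the trigger level for the two writing-correlated subtests and the use of the standard error of measurement to define borderline cases are our assumptions for illustration.

def route_to_second_tier(total_pct, subtest_pct,
                         cutoff=60.0, subtest_floor=40.0, sem=3.84):
    """Return True if a candidate should sit the second-tier writing assessment.

    total_pct: overall percentage on the seven multiple-choice subtests.
    subtest_pct: dict mapping each subtest name to its percentage score.
    The SEM of 3.84 reported above is in raw marks; it is treated here as
    percentage points purely for the sake of illustration.
    """
    if total_pct < cutoff:
        return True  # below the cut-off point: an indication of risk
    if (subtest_pct["Grammar and text relations"] < subtest_floor
            or subtest_pct["Text editing"] < subtest_floor):
        return True  # low on the subtests that correlate highly with writing
    if total_pct - cutoff <= sem:
        return True  # borderline case within one SEM of the cut-off
    return False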
For those in risk bands that indicate that they should enrol for an academic literacy
development course, however, specific provision must be made. At the University of
Pretoria, students whose test results show them to be ‘at risk’ in terms of their aca-
demic literacy levels are in the first instance provided with support in the form of a
course in academic writing. The intervention that is relevant in this specific instance
is the Postgraduate Academic Writing Module, which was developed by the Unit for
Academic Literacy (UAL). The test and the course work hand-in-hand. The test is
used to determine the academic literacy levels of postgraduate students. Students
who are shown to be at risk may be expected by their faculties (larger organisational
units, binding together the humanities departments, or engineering, or business sci-
ences, and so forth) at the University of Pretoria to enrol for this module. Having
students take the test before the course means that students who are not at risk do
not have to sit through a module they may not need. The positive effects are that the
test increases awareness among students of their academic literacy levels. In addi-
tion, students see the link between the intervention and succeeding in their studies.
Table 10.4 highlights the alignment between the sub-tests in TALPS and the tasks
students have to complete in the writing course (EOT 300).
This alignment between assessment and language development is by design: the test
and the course are based on the same definition of academic literacy (see
Rambiritch 2012).
8 Conclusion
In this narrative of the design and development of TALPS, a key question that we
hope to have answered is whether, as test designers, we have succeeded in designing
a socially acceptable, fair and responsible test. The need to ask such questions
becomes relevant when one works within a framework that incorporates due consid-
eration of the empirical analyses of a test, as well as a concern for the social dimen-
sions of language testing. As fair and responsible test developers it is our objective
to ensure that all information about the test, its design, and its use, is freely available
to those affected by or interested in its use. Test developers should aim to design
tests that are effective, reliable, accessible and transparent, and should be
willing to be accountable for their designs. In attempting to satisfy these
conditions, the designers of TALPS have sought to ensure that:
• The test is highly reliable, a systematic validation argument can be proposed
for it, and it is appropriate for the purpose for which it was designed;
• Information about the test is available and accessible to those interested in its
design and use;
• The test can be justified, explained and defended publicly;
• In ensuring transparency they have opened up a dialogue between all those
involved in the testing process, as is evidenced in the numerous internet-derived
enquiries fielded by ICELDA, the public debate referred to above, as well as in
the two perception studies of Butler (2009) and Du Plessis (2012); and
• They have designed a test that is widely perceived, not only by the
ever-increasing number of users of its results, but also by the test takers, to
have positive effects.
A commitment to the test takers they serve, and the assurance that their
responsibility does not end with a score on a sheet, are starting points for test
developers who wish to design language tests responsibly. It is pleasing to note
that, in the present case, the test is followed in almost every instance we know
of by effective teaching and learning focused on developing the academic literacy
abilities whose weakness may have put students at risk of either not completing
their studies or not completing them in the required time.
Though designers' rhetoric may well outstrip their practice, good intentions have
provided a starting point for the designers of TALPS, whose endeavours and goals
are perhaps best summarized in the observation that
our designs are done because we demonstrate through them the love we have for others: it
derives from the relation between the technical artefact that is our design and the ethical
dimension of our life. In a country such as ours, the desperate language needs of both adults
and children to achieve a functional literacy that will enable them to function in the econ-
omy and partake more fully of its fruits, stands out as possibly the biggest responsibility of
applied linguists. (Weideman 2007: 53)
Our argument has been that the design of TALPS reflects a conscious striving to
satisfy the requirements for a socially acceptable, fair and responsible test. Of
course, this narrative does not end here but is intended as the beginning
of many more narratives about the test. Ensuring interpretability and the beneficial
use of results, the accessibility of information, transparent design and the willing-
ness to defend that design in public – all contribute to responsible test design. It is
likely that most of the lessons learned in the development and administration of this
postgraduate assessment will spill over into the development or further refinement
of other tests of language ability. Tests need to be scrutinised, and
re-scrutinised, continually. Each new context in which they are administered calls
for further scrutiny and refinement, to determine whether and how the test
continues to conform to the principles or conditions for responsible test design.
As test designers we need to keep asking questions about our designs: about how
trustworthy the measurement is, and about how general or specific the trust is
that we can place in them.
References
American Educational Research Association (AERA). (1999). Standards for educational and psy-
chological testing. Washington, DC: American Educational Research Association.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing
useful language tests. Oxford: Oxford University Press.
Beu, D. S., & Buckley, M. R. (2004). Using accountability to create a more ethical climate. Human
Resource Management Review, 14, 67–83.
Bovens, M. (2005). Public accountability: A framework for the analysis and assessment of account-
ability arrangements in the public domain. In E. Ferlie, L. Lynne, & C. Pollitt (Eds.), The
Oxford handbook of public management (pp. 1–36). Oxford: Oxford University Press.
Boyd, K., & Davies, A. (2002). Doctors’ orders for language testers. Language Testing, 19(3),
296–322.
Butler, H.G. (2007). A framework for course design in academic writing for tertiary education.
PhD thesis, University of Pretoria, Pretoria.
Butler, H. G. (2009). The design of a postgraduate test of academic literacy: Accommodating stu-
dent and supervisor perceptions. Southern African Linguistics and Applied Language Studies,
27(3), 291–300.
Bygate, M. (2004). Some current trends in applied linguistics: Towards a generic view. AILA
Review, 17, 6–22.
CITO. (2006). TiaPlus, classical test and item analysis ©. Arnhem: Cito M. and R. Department.
Davidson, F., & Lynch, B. K. (2002). Testcraft. New Haven: Yale University Press.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (Eds.). (1999). Dictionary
of language testing. Studies in Language Testing, 7. Cambridge: Cambridge University Press.
Du Plessis, C. (2012). The design, refinement and reception of a test of academic literacy for post-
graduate students. MA dissertation, University of the Free State, Bloemfontein.
Frink, D. D., & Klimoski, R. J. (2004). Advancing accountability theory and practice: Introduction
to the human resource management review special edition. Human Resource Management
Review, 14, 1–17.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource
book. New York: Routledge.
Geldenhuys, J. (2007). Test efficiency and utility: Longer or shorter tests. Ensovoort, 11(2), 71–82.
Hamp-Lyons, L. (2000a). Fairnesses in language testing. In A.J. Kunnan (Ed.), Fairness and vali-
dation in language assessment. Studies in Language Testing, 9, (pp. 30–34). Cambridge:
Cambridge University Press.
Hamp-Lyons, L. (2000b). Social, professional and individual responsibility in language testing.
System, 28, 579–591.
Hamp-Lyons, L. (2001). Ethics, fairness(es), and developments in language testing. In C. Elder
et al. (Eds.), Experimenting with uncertainty: Essays in honour of Alan Davies, Studies in
Language Testing, 11, (pp. 222–227). Cambridge: Cambridge University Press.
Inter-Institutional Centre for Language Development and Assessment (ICELDA). (2015). [Online].
Available http://icelda.sun.ac.za. Accessed 7 May 2015.
Kearns, K. P. (1998). Institutional accountability in higher education: A strategic approach. Public
Productivity & Management Review, 22(2), 140–156.
Kurpius, S. E. R., & Stafford, M. E. (2006). Testing and measurement: A user-friendly guide.
Thousand Oaks: Sage Publications.
Maher, C. (2011). Academic writing ability and performance of first year university students in
South Africa. Research report for the MA dissertation, University of the Witwatersrand,
Johannesburg. [Online]. Available http://wiredspace.wits.ac.za/bitstream/. Accessed 20 July
2015.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Malden: Blackwell
Publishing.
Mdepa, W., & Tshiwula, L. (2012). Student diversity in South African Higher Education. Widening
Participation and Lifelong Learning, 13, 19–33.
Naurin, D. (2007). Transparency, publicity, accountability – The missing links. Unpublished paper
delivered at the CONNEX-RG 2 workshop on ‘Delegation and mechanisms of accountability
in the EU’, 8–9 March, Uppsala.
Norton, B. (1997). Accountability in language assessment. In C. Clapham & D. Corson (Eds.),
Language testing and assessment: Encyclopaedia of language and education 7 (pp. 323–333).
Dordrecht: Kluwer Academic.
Patterson, R., & Weideman, A. (2013). The typicality of academic discourse and its relevance for
constructs of academic literacy. Journal for Language Teaching, 47(1), 107–123. http://dx.doi.
org/10.4314/jlt.v47i1.5.
Pot, A. (2013). Diagnosing academic language ability: An analysis of TALPS. MA dissertation,
Rijksuniversiteit Groningen, Groningen.
Pot, A., & Weideman, A. (2015). Diagnosing academic language ability: Insights from an analysis
of a postgraduate test of academic literacy. Language Matters, 46(1), 22–43. Retrieved from:
http://dx.doi.org/10.1080/10228195.2014.986665
Rambiritch, A. (2012). Transparency, accessibility and accountability as regulative conditions for
a postgraduate test of academic literacy. PhD thesis, University of the Free State, Bloemfontein.
Rambiritch, A. (2013). Validating the test of academic literacy for postgraduate students (TALPS).
Journal for Language Teaching, 47(1), 175–193.
Rea-Dickins, P. (1997). So, why do we need relationships with stakeholders in language testing? A
view from the U.K. Language Testing, 14(3), 304–314.
Scholtz, D., & Allen-Ile, C. O. K. (2007). Is the SATAP test an indicator of academic preparedness
for first year university students? South African Journal of Higher Education, 21(7),
919–939.
Schuurman, E. (2005). The technological world picture and an ethics of responsibility: Struggles
in the ethics of technology. Sioux Center: Dordt College Press.
Second Language Testing Inc. (2013). Pilot testing and field testing. [Online]. Available: http://2lti.
com/test-development/pilot-testing-and-field-testing/
Shohamy, E. (1997). Testing methods, testing consequences: Are they ethical? Are they fair?
Language Testing, 14(3), 340–349.
Shohamy, E. (2001). The power of tests: A critical perspective on the uses of language tests.
London: Longman.
Shohamy, E. (2008). Language policy and language assessment: The relationship. Current Issues
in Language Planning, 9(3), 363–373.
Sinclair, A. (1995). The chameleon of accountability: Forms and discourses. Accounting,
Organisations and Society, 20(2/3), 219–237.
Van der Slik, F., & Weideman, A. (2005). The refinement of a test of academic literacy. Per
Linguam, 21(1), 23–35.
Van Dyk, T., & Weideman, A. (2004). Finding the right measure: From blueprint to specification
to item type. SAALT Journal for Language Teaching, 38(1), 15–24.
Visser, A. J., & Hanslo, M. (2005). Approaches to predictive studies: Possibilities and challenges.
South African Journal of Higher Education, 19(6), 1160–1176.
Weideman, A. (2006). Transparency and accountability in applied linguistics. Southern African
Linguistics and Applied Language Studies, 24(1), 71–86.
Weideman, A. (2007). A responsible agenda for applied linguistics: Confessions of a philosopher.
Per Linguam, 23(2), 29–53.
Weideman, A. (2009). Constitutive and regulative conditions for the assessment of academic lit-
eracy. South African Linguistics and Applied Language Studies, 27(3), 235–251.
Weideman, A. (2014). Innovation and reciprocity in applied linguistics. Literator, 35(1), 1–10.
[Online]. Available doi: http://dx.doi.org/10.4102/lit.v.35i1.1074.
Weideman, A., & Van Dyk, T. (Eds.). (2014). Academic literacy: Test your competence.
Potchefstroom: Inter-Institutional Centre for Language Development and Assessment
(ICELDA).
Weideman, A., Patterson, R., & Pot, A. (2016). Construct refinement in tests of academic literacy.
In J. Read (Ed.), Post-admission language assessment of university students. Dordrecht:
Springer.
Weir, C. J. (2005). Language testing and validation: An evidence-based approach. Houndmills:
Palgrave Macmillan.
Part V
Conclusion
Chapter 11
Reflecting on the Contribution
of Post-Admission Assessments
John Read
Abstract This chapter examines a number of issues arising from the earlier contri-
butions to this volume. It considers the decision by a university about whether to
introduce a post-admission language assessment in terms of the positive and nega-
tive messages such a decision may convey, as well as the costs versus the benefits.
There is some discussion of the need to develop professional communication skills
as attributes to enhance the employability of graduates and how such skills can be
fostered, along with the development of academic literacy in the disciplines, through
various forms of collaboration between English language specialists and academic
teaching staff. Finally, it explores ideas related to the concept of English as a lingua
franca and what implications they may have for the assessment of university stu-
dents from different language backgrounds.
As specialists in the field, the authors of this volume have naturally focused on the
design and delivery of the assessment programme in their respective institutions,
with a concern for improving the quality of the measurement of academic language
abilities and reporting the results in a meaningful fashion to the various stakehold-
ers. However, this obviously represents a narrow perspective. No matter how good
an assessment may be, it will not achieve its desired objectives unless there is strong
institutional support at the policy level as well as adequate resourcing – not just for
the assessment itself but for effective follow-up action through advising of students
and provision of opportunities for academic language development.
In societies like Hong Kong, Oman and South Africa, where a high proportion, if
not all, of the students entering English-medium universities come from
non-English-using backgrounds, the need to further enhance their English language
skills is obvious –
even if they have had some form of English-medium schooling previously. The
language enhancement may be in the form of a foundation programme, compulsory
English language courses in the first year of study and beyond, a learning and study
skills centre, or (as in the case of Hong Kong) a fourth year added to what has tra-
ditionally been a 3-year undergraduate degree.
On the other hand, universities in the major English-speaking countries vary
widely in the extent to which they have made provision for the language and learn-
ing needs of incoming students, as noted briefly in the Introduction. Universities in
the US have a long tradition, going back at least to the 1950s, of freshman composi-
tion programmes to develop the academic writing skills of first-year domestic
students, and the growth in foreign student numbers from the 1960s led to the
parallel development of ESL courses, in the form of both intensive pre-admission
programmes and credit courses for degree students. In the UK, the impetus for
addressing these issues came initially from the need to ensure that students from
Commonwealth countries with English as their second language, who were recipients
of scholarships and study awards, had adequate proficiency in academic English to
benefit from their studies in Britain. Summer pre-sessional courses have since
become an institution in British universities, serving the much broader range of
international students who are now admitted. In other English-speaking countries, it
has been the liberalising of immigration regulations to allow the recruitment of fee-
paying international students which has led to a variety of pre- and post-admission
programmes to enhance their academic English skills. The same liberalisation has
seen an influx of immigrant families with children who work their way as “English
language learners” through the school system to higher education without necessar-
ily acquiring full proficiency in academic English. For such students and for many
other domestic students who are challenged by the demands of academic literacy at
the tertiary level, there are learning centres offering short courses, workshops, peer
tutoring, individual consultations, online resources and so on.
Thus, in a variety of ways universities in the English-speaking countries already
offer study support and opportunities for academic language enrichment to their
students, at least on a voluntary basis. A proposal to introduce a post-admission
language assessment represents a significant further step by seeking to identify stu-
dents who would benefit from – or perhaps have an obvious need to access – such
services in meeting the language demands of their studies. This then leads to the
question of whether the assessment and any follow-up action on the student’s part
should be voluntary or mandatory. It also raises the issue of whether the language
and literacy needs revealed by the assessment results may be greater than can be
accommodated within existing provisions, meaning that substantial additional fund-
ing may be required.
In the cases we have seen in this book, some universities are subject to external
pressures to address these matters. The controversy over English language standards
in Australian universities has already been discussed in the Introduction. In 2012,
the Tertiary Education Quality and Standards Agency (TEQSA) announced that its
audits of universities in Australia would include comprehensive quality assessments
of English language proficiency provisions (Lane 2012). However, a change of
government and vigorous lobbying by tertiary institutions asserting that such assess-
ments imposed onerous demands on them led to a ministerial decision that TEQSA
would abandon this approach in favour of simply ensuring that minimum standards
were being met (Lane 2014a). In the most recent version of the Higher Education
Standards Framework, the statutory basis for TEQSA audits, there is just a single
explicit reference to English language standards, right at the beginning of the
document:
1 Student Participation and Attainment
1.1 Admission
1. Admissions policies, requirements and procedures are documented, are applied fairly
and consistently, and are designed to ensure that admitted students have the academic prep-
aration and proficiency in English needed to participate in their intended study, and no
known limitations that would be expected to impede their progression and completion.
(Australian Government 2015)
The change in TEQSA’s role was seen as reducing the pressure on tertiary institu-
tions to take specific initiatives such as implementing a post-entry language assess-
ment (PELA), and some such moves at particular universities stalled as a result.
Although it is generally recognised that the English language needs of students
should be addressed, there is ongoing debate about the most suitable strategy for
ensuring that universities take this responsibility seriously (Lane 2014b).
Another kind of external pressure featured in Chap. 6 (this volume). The Oral
English Proficiency Test (OEPT) at Purdue University is one example of an assess-
ment mandated by legislation in US states to ensure that prospective International
Teaching Assistants (ITAs) have sufficient oral proficiency in English to be able to
perform their role as instructors in undergraduate courses. This of course is a some-
what different concern from that of most other post-admission assessments, where
the issue is whether the test-takers can cope with the language and literacy demands
of their own studies.
In contrast to these cases of external motivation, other post-admission assess-
ments have resulted from internal pressure, in the form of a growing recognition
among senior management and academic staff that there were unmet language
needs in their linguistically diverse student bodies which could no longer be ignored,
particularly in the face of evidence of students dropping out of their first year of
study as a result of language-related difficulties. This applies to the original moves
towards a PELA at the University of Melbourne (Chap. 2, this volume; see also
Elder and Read 2015) in the 1990s, as well as the introduction of the Diagnostic
addresses common questions and concerns. Through all these means, the university
seeks to ensure that the purpose of the assessment is understood, and that students
take advantage of the opportunities it offers.
students when they have been attending university to little purpose rather than working—
probably $20,000 per student. Then there are all the non-financial costs—angst, frustrated
expectations and so on. (John Morrow, personal communication, 8 March 2016)
This quote refers specifically to the costs of the assessment, but the same line of
argument can be extended to the funding needed for a programme of academic lan-
guage development, much of which was already in place at the time that DELNA
was introduced.
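The arithmetic behind this line of argument is easy to make explicit. In the
sketch below, only the figure of roughly $20,000 for a largely wasted year comes
from the quotation above; the cohort size and the per-student assessment cost are
invented numbers, used simply to show how few withdrawals need to be averted
before an assessment programme pays for itself.

# Back-of-envelope cost-benefit sketch (Python). The ~$20,000 figure is
# quoted above; all other figures are hypothetical placeholders.

cohort_size = 5000            # incoming students assessed per year (hypothetical)
cost_per_assessment = 40      # delivery and rating cost per student (hypothetical)
cost_of_wasted_year = 20_000  # approximate cost per student, per the quote above

programme_cost = cohort_size * cost_per_assessment
break_even_cases = programme_cost / cost_of_wasted_year

print(f"Annual programme cost: ${programme_cost:,}")                    # $200,000
print(f"Wasted years to avert for break-even: {break_even_cases:.0f}")  # 10

On these invented figures, the assessment recovers its cost if the follow-up
support it triggers spares just ten students in five thousand from a largely
wasted first year, and the non-financial costs mentioned in the quotation only
strengthen the case.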
University of Sydney:
5. Communication
Graduates of the University will recognise and value communication as a
tool for negotiating and creating new understanding, interacting with others,
and furthering their own learning.
• use oral, written, and visual communication to further their own learning
• make effective use of oral, written and visual means to critique, negotiate,
create and communicate understanding
• use communication as a tool for interacting and relating to others
(http://www.itl.usyd.edu.au/graduateAttributes/policy_framework.pdf)
However, as with other graduate attributes, there is a lack of university-wide strate-
gies to determine whether graduating students have acquired such communication
skills, except through the assessment of the courses they have taken for their degree.
As Arkoudis and Kelly put it,
institutional graduate attribute statements that refer to the communication skills of gradu-
ates are merely claims until evidenced. Institutional leaders need to be able to point to evi-
dence demonstrating that the oral and written communication skills of their students are
developed, assessed, monitored and measured through the duration of a qualification.
(2016, p. 6)
They go on to note the need for research to articulate exit standards and to produce
an explicit framework which could guide academic staff to develop the relevant
skills through the teaching of their courses.
As a step in this direction, Murray (2010, 2016) proposes that the construct of
English language proficiency for university study should be expanded to include
professional communication skills, of the kind that students will require both for
work placements and practicums during their studies and in order to satisfy the
expectations of future employers and professional registration bodies once they
graduate. Murray identifies these skills as follows:
• Intercultural competence
• A cultural relativistic orientation
• Interpersonal skills
• Conversancy in the discourses and behaviours associated with particular domains
• Non-verbal communication skills
• Group and leadership skills
The one language testing project which has sought to produce a measure of at
least some of these skills is the Graduating Students’ Language Proficiency
Assessment (GSLPA), developed in the 1990s at Hong Kong Polytechnic University
(PolyU), with funding from the University Grants Committee (UGC) in Hong Kong
(Qian 2007). It is a task-based test of professional writing and speaking skills
designed in consultation with business leaders in Hong Kong. Although the test has
been administered to PolyU students since 1999 (see http://gslpa.polyu.edu.hk/eng/
web/), it was not accepted by the other Hong Kong universities and, as an alternative,
the UGC ran a scheme from 2002 to 2013 to pay the fee for students to take the
Academic Module of IELTS on a voluntary basis when they were completing their
degree. Two Australian universities (the University of Queensland and Griffith
University) have adopted a similar policy of subsidising the IELTS test fee as a
service to their graduating international students (Humphreys and Mousavi 2010).
While this strategy provides the students with a broad, internationally recognised
assessment of their academic language proficiency at the time of graduation, it can
scarcely be regarded as a valid measure of their professional communication skills.
Indeed, O’Loughlin (2008) has questioned the ethics of using IELTS for such a
purpose without proper validation.
A quite different approach involves embedding these skills, along with other aspects
of English language development, into the students’ degree programmes. This
already happens to varying degrees in professional faculties, like Engineering,
Business, Medical Sciences and Education, where students need to demonstrate the
application of relevant communication skills in order to be registered to practise
their chosen profession. The same strategy can in principle be applied to degree
programmes across the university. Numerous English language specialists in higher
education – notably Arkoudis et al. (2012) in Australia and Wingate (2015) in the
United Kingdom – strongly advocate the embedded delivery of academic language
development to all students as current best practice. In support of this position,
Arkoudis and Kelly cite studies which document “the limitations of communication
skills programs which sit outside the disciplinary curricula and are supported by
staff who are not recognised by students as disciplinary academics” (2016, p. 4).
This quote highlights the point that academic English programmes are typically
delivered as adjuncts to degree courses by tutors with low (and maybe insecure)
status within the institution who may not have the relevant knowledge of discourse
norms to address issues of academic literacy or professional communication skills
within the disciplines. On the other hand, subject lecturers and tutors tend to shy
away from dealing with problems with language and genre in their students’ writ-
ing, claiming a lack of expertise. In their influential study of academic literacies in
undergraduate courses in the UK, Lea and Street (1998) reported that tutors could
not adequately articulate their understanding of concepts like “critical analysis”,
“argument” or “clarity”. As Murray (2016) puts it, although academic teaching staff
have procedural knowledge of academic discourse norms in their discipline, they
lack the declarative (or metalinguistic) knowledge needed to give the kind of feed-
back on student writing that would allow the students to understand how they can
better meet the appropriate disciplinary norms.
This suggests that the way forward is to foster more collaboration between learn-
ing advisors and English language tutors on the one hand and academic teaching
staff on the other. Murray (2016) proposes as a starting point that the practice in
they also acknowledged that the level of support offered by their university to
non-native English speakers was inadequate, with consequent negative effects on
students’ confidence in their ability to meet the standards.
The latter view received support in a series of “conversations” Jenkins (2013)
conducted at a UK university with international postgraduate students, who
expressed frustration at the lack of understanding among their supervisors, lecturers
and native-speaking peers concerning the linguistic challenges they faced in under-
taking their studies. This included an excessive concern among supervisors with
spelling, grammar and other surface features as the basis for judging the quality of
the students’ work – often with the rationale that a high level of linguistic accuracy
was required for publication in an academic journal.
Jenkins (2013; see also Jenkins 2006a; Jenkins and Leung 2014) is particularly
critical of the role of the international English proficiency tests (IELTS, TOEFL,
Pearson Test of English (PTE)) in their gatekeeping role for entry to EMI degree
programmes. She and others (e.g., Canagarajah 2006; Clyne and Sharifian 2008;
Lowenberg 2002) argue that these and other tests of English for academic purposes
serve to perpetuate the dominance of standard native-speaker English, to the detri-
ment of ELF users, by requiring a high degree of linguistic accuracy, by associating
an advanced level of proficiency with facility in idiomatic expression, and by not
assessing the intercultural negotiating skills which are a key component of commu-
nication in English across linguistic boundaries, according to the ELF research.
These criticisms have been largely articulated by scholars with no background in
language assessment, although Shohamy (2006) and McNamara (2011) have also
lent some support to the cause.
Several language testers (Elder and Davies 2006; Elder and Harding 2008; Taylor
2006) have sought to respond to the criticisms from a position of openness to the
ideas behind ELF. Their responses have been along two lines. On the one hand, they
have discussed the constraints on the design and development of innovative tests
which might more adequately represent the use of English as a lingua franca, if the
tests were to be used to make high-stakes decisions about students. On the other
hand, these authors have argued that the critics have not recognised ways in which,
under the influence of the communicative approach to language assessment, con-
temporary English proficiency tests have moved away from a focus on native-
speaker grammatical and lexical norms towards assessing a broader range of
communicative abilities, including those documented in ELF research. The replies
from the ELF critics to these statements (Jenkins 2006b; Jenkins and Leung 2014)
have been disappointingly dismissive, reflecting an apparent disinclination to
engage in constructive debate about the issues.
This is not to say that the international proficiency tests are above criticism.
Language testers can certainly point to ways in which these testing programmes
fall into this category, particularly if they have already had the experience of using
English for purposes like presenting their work at conferences or writing for publi-
cation in English. On the other hand, a diagnostic assessment may reveal that such
students read very slowly, lack non-technical vocabulary knowledge, have difficulty
in composing cohesive and intelligible paragraphs, and are hampered in other ways
by limited linguistic competence. This makes it debatable whether such students
should be considered proficient users of the language.
A similar kind of issue arises with first-year undergraduates in English-speaking
countries matriculating from the secondary school system there. Apart from interna-
tional students who complete 2 or 3 years of secondary education to prepare for
university admission, domestic students cover a wide spectrum of language back-
grounds which make it increasingly problematic to distinguish non-native users
from native speakers in terms of the language and literacy skills required for aca-
demic study. In the United States, English language learners from migrant families
have been labelled Generation 1.5 (Harklau et al. 1999; Roberge et al. 2009) and are
recognised as often being in an uncomfortable in-between space where they have
not integrated adequately into the host society, culture and education system.
Linguistically, they may have acquired native-like oral communication skills, but
they lack the prerequisite knowledge of the language system on which to develop
good academic reading and writing skills. Such considerations strengthen the case
for administering a post-admission assessment to all incoming students, whatever
their language background; this is the position of the University of Auckland with
DELNA, but not many universities have been able to adopt a comprehensive policy
of this kind.
At the same time, there are challenging questions about how to design a post-
admission assessment to cater for the diverse backgrounds of students across the
native–non-native spectrum. It seems that the ELF literature has little to offer at
this point towards the definition of an alternative construct of academic language
ability which avoids reference to standard native-speaker norms and provides the
basis for a practicable assessment design. The work of Weideman and his colleagues
in South Africa, on defining and assessing the construct of academic literacy, as
reported in Chaps. 9 and 10, represents one stimulating model of test design, but
others are needed, especially if post-admission assessments are to operationalise an
academic literacies construct which takes account of the discourse norms in particu-
lar academic disciplines, as analysed by scholars such as Swales (1990), Hyland
(2000, 2008), and Nesi and Gardner (2012). At the moment the closest we have to a
well-documented assessment procedure of this type is the University of Sydney’s
Measuring the Academic Skills of University Students (MASUS) (Bonanno and
Jones 2007), as noted in the Introduction.
Nevertheless, the chapters of this volume show what can be achieved in a variety
of English-medium universities to assess the academic language ability of incoming
students at the time of admission, as a prelude to the delivery of effective pro-
grammes for language and literacy development. It is important to acknowledge that
all of the institutions represented here have been able to draw on their own applied
linguists and language testers in designing their assessments. As Murray noted in
identifying universities “at the vanguard” of PELA provision in Australia and New
Zealand, “It is certainly not coincidental that a number of these boast resident exper-
tise in testing” (2016, p. 121). The converse is that institutions lacking such capabil-
ity may implement assessments which do not meet professional standards. However,
by means of publications and conference presentations, as well as consultancies and
licensing arrangements, the expertise is being more widely shared, and we hope that
this book will contribute significantly to that process of dissemination.
References
Arkoudis, S., & Kelly, P. (2016). Shifting the narrative: International students and communication
skills in higher education (IERN Research Digest, 8). International Education Association of
Australia. Retrieved March 1, 2016, from: www.ieaa.org.au/documents/item/664
Arkoudis, S., Baik, C., & Richardson, S. (2012). English language standards in higher education.
Camberwell: ACER Press.
Australian Government. (2015). Higher education standards framework (threshold standards)
2015. Retrieved March 7, 2016, from: https://www.legislation.gov.au/Details/F2015L01639
Birrell, B. (2006). Implications of low English standards among overseas students at Australian
universities. People and Place, 14(4), 53–64. Melbourne: Centre for Population and Urban
Research, Monash University.
Bonanno, H., & Jones, J. (2007). The MASUS procedure: Measuring the academic skills of univer-
sity students. A resource document. Sydney: Learning Centre, University of Sydney. http://
sydney.edu.au/stuserv/documents/learning_centre/MASUS.pdf
Canagarajah, A. S. (2006). Changing communicative needs, revised assessment objectives: Testing
English as an international language. Language Assessment Quarterly, 3(3), 229–242.
Clyne, M., & Sharifian, F. (2008). English as an international language: Challenges and possibili-
ties. Australian Review of Applied Linguistics, 31(3), 28.1–28.16.
Dunworth, K. (2009). An investigation into post-entry English language assessment in Australian
universities. Journal of Academic Language and Learning, 3(1), 1–13.
Dunworth, K., Drury, H., Kralik, C., Moore, T., & Mulligan, D. (2013). Degrees of proficiency:
Building a strategic approach to university students’ English language assessment and devel-
opment. Sydney: Australian Government Office for Learning and Teaching. Retrieved February
24, 2016, from: www.olt.gov.au/project-degrees-proficiency-building-strategic-approach-
university-studentsapos-english-language-ass
Elder, C., & Davies, A. (2006). Assessing English as a lingua franca. Annual Review of Applied
Linguistics, 26, 282–304.
Elder, C., & Harding, L. (2008). Language testing and English as an international language:
Constraints and contributions. Australian Review of Applied Linguistics, 31(3), 34.1–34.11.
Elder, C., & Read, J. (2015). Post-entry language assessments in Australia. In J. Read (Ed.),
Assessing English proficiency for university study (pp. 25–39). Basingstoke: Palgrave
Macmillan.
Harklau, L., Losey, K. M., & Siegal, M. (Eds.). (1999). Generation 1.5 meets college composition:
Issues in the teaching of writing to U.S.-educated learners of ESL. Mahwah: Lawrence
Erlbaum.
Humphreys, P., & Mousavi, A. (2010). Exit testing: A whole-of-university approach. Language
Education in Asia, 1, 8–22.
Hyland, K. (2000). Disciplinary discourses: Social interactions in academic writing. Harlow:
Longman.
Hyland, K. (2008). Genre and academic writing in the disciplines. Language Teaching, 41(4),
543–562.
Jenkins, J. (2006a). The spread of EIL: A testing time for testers. ELT Journal, 60(1), 42–50.
Jenkins, J. (2006b). The times they are (very slowly) a-changin’. ELT Journal, 60(1), 61–62.
Jenkins, J. (2013). English as a lingua franca in the international university: The politics of aca-
demic English language policy. London: Routledge.
Jenkins, J., & Leung, C. (2014). English as a lingua franca. In A. J. Kunnan (Ed.), The companion
to language assessment (pp. 1–10). Chichester: Wiley. Chap. 95.
Jones, J., Bonanno, H., & Scouller, K. (2001). Staff and student roles in central and faculty-based
learning support: Changing partnerships. Paper presented at Changing Identities, 2001
National Language and Academic Skills Conference. Retrieved April 17, 2014, from: http://
learning.uow.edu.au/LAS2001/selected/jones_1.pdf
Kirkpatrick, A. (2010). English as a Lingua Franca in ASEAN: A multilingual model. Hong Kong:
Hong Kong University Press.
Lane, B. (2012, August 22). National regulator sharpens focus on English language standards. The
Australian. Retrieved March 7, 2016, from: www.theaustralian.com.au/higher-education/
national-regulator-sharpens-focus-onenglish-language-standards/story-e6frgcjx-
1226455260799
Lane, B. (2014a, March 12). English proficiency at risk as TEQSA bows out. The Australian.
Retrieved March 7, 2016, from: http://www.theaustralian.com.au/higher-education/
english-proficiency-at-risk-as-teqsa-bows-out/story-e6frgcjx-1226851723984
Lane, B. (2014b, August 22). Unis and language experts at odds over English proficiency. The
Australian. Retrieved March 7, 2016, from: http://www.theaustralian.com.au/higher-education/
unis-and-language-experts-at-odds-over-english-proficiency/news-story/
d3bc1083caa28eb8924e94b0d40b0928
Lea, M. R., & Street, B. V. (1998). Student writing in higher education: An academic literacies
approach. Studies in Higher Education, 23(2), 157–172.
Lowenberg, P. H. (2002). Assessing English proficiency in the expanding circle. World Englishes,
21(3), 431–435.
Mauranen, A. (2012). Exploring ELF: Academic English shaped by non-native speakers.
Cambridge: Cambridge University Press.
McNamara, T. (2011). Managing learning: Authority and language assessment. Language
Teaching, 44(4), 500–515.
Murray, N. (2010). Considerations in the post-enrolment assessment of English language profi-
ciency: From the Australian context. Language Assessment Quarterly, 7(4), 343–358.
Murray, N. (2016). Standards of English in higher education: Issues, challenges and strategies.
Cambridge: Cambridge University Press.
Nesi, H., & Gardner, S. (2012). Genres across the disciplines: Student writing in higher education.
Cambridge: Cambridge University Press.
O’Loughlin, K. (2008). The use of IELTS for university selection in Australia: A case study. IELTS
Research Reports, Volume 8 (Report 3). Retrieved March 11, 2016, from: https://www.ielts.
org/~/media/research-reports/ielts_rr_volume08_report3.ashx
Qian, D. (2007). Assessing university students: Searching for an English language exit test. RELC
Journal, 38(1), 18–37.
Read, J. (2008). Identifying academic language needs through diagnostic assessment. Journal of
English for Academic Purposes, 7(2), 180–190.
Read, J. (2015a). Assessing English proficiency for university study. Basingstoke: Palgrave
Macmillan.
Read, J. (2015b). The DELNA programme at the University of Auckland. In J. Read (Ed.),
Assessing English proficiency for university study (pp. 47–69). Basingstoke: Palgrave
Macmillan.
Read, J., & Chapelle, C. A. (2001). A framework for second language vocabulary assessment.
Language Testing, 18(1), 1–32.
Roberge, M., Siegal, M., & Harklau, L. (Eds.). (2009). Generation 1.5 in college composition:
Teaching academic writing to U.S.-educated learners of ESL. New York: Routledge.
Seidlhofer, B. (2011). Understanding English as a Lingua Franca. Oxford: Oxford University
Press.
Shohamy, E. (2006). Language policy: Hidden agendas and new approaches. London: Routledge.
Spolsky, B. (1995). Measured words: The development of objective language testing. Oxford:
Oxford University Press.
Spolsky, B. (2008). Language assessment in historical and future perspective. In E. Shohamy &
N. H. Hornberger (Eds.), Encyclopedia of language and education (Language testing and
assessment 2nd ed., Vol. 7, pp. 445–454). New York: Springer.
Swales, J. (1990). Genre analysis: English in academic and research settings. Cambridge:
Cambridge University Press.
Taylor, L. (2006). The changing landscape of English: Implications for language assessment. ELT
Journal, 60(1), 51–60.
University of Edinburgh. (2011). Employability initiative at Edinburgh. Retrieved March 9, 2016,
from: http://www.employability.ed.ac.uk/GraduateAttributes.htm
Widdowson, H. G. (1994). The ownership of English. TESOL Quarterly, 28(2), 377–389.
Wingate, U. (2015). Academic literacy and student diversity: The case for inclusive practice.
Bristol: Multilingual Matters.
Index
A
Abu Rabia, S., 175
Academic discourse, nature of, 13
Academic English Screening Test (AEST) (South Australia), 26
Academic language development programmes. See also Uptake of language support
  conversation groups, 70, 150
  embedded in subject courses, 7, 69
  learning centres, 58, 222
  online resources, 12, 222
  peer mentoring, 7, 14
  taught courses, 118
  workshops, 7, 12, 14, 51, 70, 150, 222, 230
Academic literacy, 5, 6, 8, 10, 12–16, 18, 47, 59, 69, 142, 145, 182–196, 200–216, 222, 229, 230, 233, 234
Access to higher education, 24, 182, 208
Accountability of test developers, 14
ACTFL Oral Proficiency Interview (OPI), 115
Activity theory, 47, 58, 61
Administration conditions. See Test administration
Advising of students (post-assessment), 221
Affective variables, 176
Afrikaans, 13, 182, 185, 195, 200, 208
Agustín Llach, M.P., 176
Aitchison, C., 141, 142
Alderson, J.C., 12, 17, 18, 45, 46, 51, 55, 71, 92, 163
Al-Hazemi, H., 163
Allen-Ile, C.O.K., 211
Ammon, U., 162
Anderson, T., 44
Arkoudis, S., 5, 142, 228–230
Artemeva, N., 8, 46, 47, 49, 55, 60, 63
Assessment design
  bias for best, 124, 130
  cloze-elide, 46, 144
  C-test, 26, 33
  graph-based speaking items, 130–131
  graph-based writing tasks, 46
  multiple-choice task types, 46
Australia, 4, 6–8, 24, 26, 28, 59, 68, 140, 142, 166, 170, 213, 223, 227, 229, 233, 235
Australian Education International (AEI), 5
Australian Universities Quality Agency (AUQA), 68

B
Bachman, L.F., 10, 15, 24, 58, 90, 186, 203, 207
Baik, C., 5, 142, 229, 230
Bailey, A.L., 189
Bailey, K.M., 114
Balota, D.A., 163
Banerjee, J., 163
Baptist University (Hong Kong), 92, 93
Basturkmen, H., 141
Bayliss, A., 162
Beekman, L., 189
Bennett, S., 162
Benzie, H.J., 140–142
Bernhardt, E., 163
Berry, V., 9
Beu, D.S., 202
Biber, D., 189
Birrell, B., 4, 227
Bitchener, J., 141
Black, P., 88
Blanton, L.L., 185, 186
Bonanno, H., 5, 8, 25, 59, 230, 234
Bondi, M., 189
Bovens, M., 202, 209
Boyd, K., 209
Braine, G., 140–142
Bright, C., 162
Brown, A., 204, 206
Brown, J.D., 204, 206
Brown, J.S., 46
Browne, S., 44
Brunfaut, T., 17, 18
Buck, G., 92
Buckley, M.R., 202
Burgin, S., 141, 142
Butler, H.G., 201, 204, 207, 209, 215
Buxton, B., 166
Bygate, M., 209

C
Cai, H., 46
Cambridge English Language Assessment, 16
Canagarajah, A.S., 232
Carey, M., 163, 170, 173–175
Carleton University, 8, 13, 224, 226
  Elsie MacGill Centre, 60–63
Carpenter, P.A., 164
Carter, S., 142
Catterall, J., 141, 142
Chan, J.Y.H., 9
Chaney, M., 162
Chanock, K., 148
Chapelle, C.A., 15, 117, 176, 183, 184
Cheng, L., 44, 49, 55, 63
Chui, A.S.Y., 164
City University of Hong Kong, 92
Cliff, A.F., 186
Clyne, M., 232
Cobb, T., 163, 174
Collins, A., 46
Commons, K., 155
Computer-based assessment, 10
  technical problem, 128
Conferences with students. See Advising of students (post-assessment)
Congdon, P., 35
Coniam, D., 99
Conrad, S., 189
Conrow, F., 162
Construct definition, 13, 14, 16
Cortese, M.J., 163
Cost-benefit analysis, 53, 116, 226–227
Cotterall, S., 140, 141, 154
Cotton, F., 162
Creswell, J.W., 16, 53, 56
Crystal, D., 4
Cut scores. See Standards setting

D
Davidson, F., 7, 183, 201, 210
Davies, A., 204, 206, 209, 232
Degrees of Proficiency website, 5, 224
Design of test formats. See Assessment design
Dervin, F., 189
Diagnostic assessment, 16, 17, 25, 44–63, 88, 93, 103, 224, 234
Diagnostic English Language Assessment (DELA) (Melbourne), 5, 8, 25, 33
Diagnostic English Language Needs Assessment (DELNA) (Auckland), 6, 11, 25, 224–226
Diagnostic English Language Tracking Assessment (DELTA) (Hong Kong), 10
DiCiccio, T.J., 96
Discipline-specific assessment, 8, 69, 142. See also Measuring the Academic Skills of University Students (MASUS)
  commerce/business students, 78, 148–151
Doctoral students. See Postgraduate students
Dodorico-McDonald, J., 176
Doyle, H., 44
Drury, H., 5, 142, 224
Du Plessis, C., 184, 204, 205, 207, 209, 211, 212, 215
Dube, C., 189
Dunworth, K., 24, 142, 224, 226

E
EALTA Guidelines for Good Practice, 133
East, M., 141
Educational Testing Service (ETS), 114
Edwards, B., 141
Efron, B., 96
Eignor, D., 162
Elder, C., 15, 24–27, 30, 31, 34–36, 39, 46, 68, 142, 144, 162, 164, 183, 204, 206, 232
Ellis, S., 163
Embedded assessments, 16
Engelhard, G. Jr., 96
Native English-speaking students, 35
Naurin, D., 201
Nesi, H., 234
Nieminen, L., 17
North-West University, 195, 206
Norton, B., 202

O
O'Hagan, S., 27, 31, 32, 36
Oman, 7, 12, 164, 165, 174, 175, 222
Oman Academic Accreditation Authority, 165, 167
OPI. See ACTFL Oral Proficiency Interview (OPI)
Oral English Proficiency Program (OEPP) (Purdue), 11, 16, 118, 131, 132
Oral English Proficiency Test (OEPT) (Purdue), 11, 18, 114–133, 223
Owens, R., 142, 155

P
Palmer, A.S., 10, 15, 24, 58, 90, 186, 203, 207
Paltridge, B., 155
Paribakht, T.S., 163, 166
Patterson, R., 13, 188–190, 194, 203
Pearson Test of English (PTE), 232, 233
Peer mentors, 8, 47, 49, 55–57, 59–61, 63
Perfetti, C.A., 163
Phillipson, R., 4
Placement testing, 7, 162
  English Placement Test at the University of Illinois at Urbana-Champaign (UIUC), 162
Plake, B., 36
Poon, A.Y.K., 9
Post-entry language assessment (PELA) in Australia, 4, 7
Postgraduate students
  doctoral candidates, 10
  role of doctoral supervisors, 146, 153
Pot, A., 13, 16, 195, 203, 213, 214
Presentation of assessments to stakeholders, 33
Professional communication skills, 227–229
Punamäki, R.I., 47
Purdue University, 11, 16, 223

Q
Qian, D., 163, 228
Quality management, 16

R
Rambiritch, A., 14, 18, 182, 195, 209, 215
Ransom, L., 8
Raquel, M., 18
Rasch Model
  test equating, 33
  WINSTEPS, 90, 95
Razmjoo, S.A., 176
Read, J., 7, 11, 13, 14, 25, 46, 47, 51, 58, 68, 163, 175, 176, 183, 195, 222–235
Reading assessment, 71, 90
Rea-Dickins, P., 209
Reliability of assessments, 13
Reporting of assessment results, 18, 84
  performance descriptors, 18, 84
Retention of students, 53
Richardson, S., 5, 142, 229, 230
Rivera, R.J., 163
Roberge, M., 234
Roche, T., 12, 162–164, 166, 170, 173–175
Roever, C., 183, 201, 207, 211
Ross, P., 141, 142
Rowling, L., 49

S
Saigh, K., 175
Saville, N., 16
Schmitt, D., 163, 174
Schmitt, N., 163, 170, 174, 175
Scholtz, D., 211
Schuurman, E., 202
Scouller, K., 230
Screening assessment, 25–27, 144, 164–165
Second-chance testing, 212, 214
Second Language Testing, Inc., 206
Segalowitz, N., 163
Segalowitz, S., 163
Seidlhofer, B., 162
Seigel, L.S., 175
Self-study. See Independent language learning
Semi-direct speaking tests. See Speaking assessment
Sharifian, F., 232
X
Xi, X., 11