Gabrielatos, C. (2007). Selecting query terms to build a specialised corpus from a restricted-access database. ICAME Journal, 31, 5-43.

This paper proposes an accessible measure of the relevance of additional terms to a given query, describes and comments on the steps leading to its development, and discusses its utility. The measure, termed relative query term relevance (RQTR), draws on techniques used in information retrieval, and can be combined with a technique used in creating corpora from the world wide web, namely keyword analysis. It is independent of reference corpora, and does not require knowledge of the number of (relevant) documents in the database. Although it does not make use of user/expert judgements of document relevance, it does allow for subjective decisions. However, subjective decisions are triangulated against two objective indicators: keyness and, mainly, RQTR.

Selecting query terms to build a specialised corpus from a restricted-access database*
Costas Gabrielatos
Lancaster University

1 Motivation and central issues

The primary motivation for examining issues related to query formulation and expansion was the need to compile a corpus for the ESRC funded project entitled Discourses of refugees and asylum seekers in the UK Press 1996–2006, which aims to explore the discourses surrounding these groups and to account for the construction of their identities in the UK press.1 Further motivation was provided by the idiosyncrasies of the online database from which the texts would be retrieved. Specifically, the database interface imposed certain access limitations, such as a cap on the number of documents returned for each query, and a lack of information regarding the number of database documents matching a given query (more details follow later in this section). Although the discussion will draw on the work carried out as part of the project, the technique presented in this paper can be employed in a wider set of circumstances, for example, in instances when the idiosyncrasies and restrictions outlined in this section do not apply.

When compiling a specialised corpus from a text database by use of a query, there is a trade-off between precision and recall (e.g. Chowdhury 2004: 170). That is, there is a tension between, on the one hand, creating a corpus in which all the texts are relevant, but which does not contain all relevant texts available in the database, and, on the other, creating a corpus which does contain all available relevant texts, albeit at the expense of irrelevant texts also being included. Seen from a different perspective, the trade-off is between a corpus that can be deemed incomplete, and one which contains noise (i.e. irrelevant texts). In the former case, some aspects of the use of, or relations between, terms and/or concepts may be underrepresented or missed – depending on the size of the corpus in relation to the body of relevant available data. In the latter case, statistical results may be skewed (notably keyness),2 and the corpus building, as well as any mark-up and annotation, can become unduly time consuming. It would be helpful, therefore, to use objective indicators of the degree to which a candidate query term is expected to return relevant documents, or, to be more precise, the degree to which the addition of a term to the query results in the addition of relevant documents. Such indicators would then inform decisions regarding the terms to be included in the query.

In order for the term ‘relevant document’ to have any meaning, the compilers of a specialised corpus need to define what the corpus would ideally contain, and then “adjust [their] parameters” according to what is feasible under the particular circumstances (Sinclair 2004: 81). An obvious starting point for the compilation of a query is lexis denoting the entities, concepts, states, relations or processes that are to be investigated (e.g. Chowdhury 2004: 169). 
With regard to the particular project, the best starting point seemed to be the title and description of aims – which also settled the question of the source of the texts. In this light, two core query terms seemed to suggest themselves, refugee(s) and asylum seeker(s), leading to the following core query: ‘refugee* OR asylum seeker*’. The decision to use these two terms as the core query may be considered subjective; however, given the clearly defined purpose for the corpus compilation, their selection was at least inescapable, and arguably objective within the project parameters (issues of subjectivity/objectivity are revisited in sections 2 and 3). Of course, a corpus built using only this core query would yield very useful insights (see Baker and McEnery 2005), particularly given the ten-year span of the texts comprising the corpus. It is estimated that the core query alone, used on a database of twelve UK national newspapers from 1996 to 2005, would yield a corpus of 35–40 million words. However, since one of the aims of the project is to build on existing research, it seems appropriate to examine the feasibility of compiling a richer corpus.

One argument for a richer corpus is that some terms may have overlapping uses. Baker and McEnery (2005: 201) report that although the website of the Office of the United Nations High Commissioner for Refugees “is focussed around refugees … there were still a number of references to asylum seekers …, suggesting that the two identities share a common ground”. This observation seems to be supported by an aspect of measuring query term relevance (see section 2 and note 9). Terms may also be related in terms of sequence or change of state. For instance, the status of persons may change from asylum seekers to refugees, or vice versa, according to the definition adopted. 
Dictionary definitions present an asylum seeker as a refugee who has applied for asylum, and so imply the sequence ‘refugee → asylum seeker’, whereas the definitions of the Refugee Council3 imply the opposite sequence (see Table 1).

Table 1: Definitions of refugee and asylum seeker

Longman dictionary of contemporary English on CD-ROM (2003)
    refugee: Someone who has been forced to leave their country, especially during a war, or for political or religious reasons.
    asylum seeker: Someone who leaves their own country because they are in danger, especially for political reasons, and who asks the government of another country to allow them to live there.

Refugee Council
    refugee: Someone whose asylum application has been successful and who is allowed to stay in another country having proved they would face persecution back home.
    asylum seeker: Someone who has fled persecution in their homeland, has arrived in another country, made themselves known to the authorities and exercised the legal right to apply for asylum.

Conversely, terms with almost identical dictionary definitions may be used less interchangeably than expected, as is the case of immigrant and migrant. More importantly, the terms refugee(s) and asylum seeker(s) are frequently used interchangeably with the terms immigrant(s) and, less so, migrant(s) (e.g. Greenslade 2005: 5). Thus, one newspaper may describe a person as an asylum seeker, whereas another may refer to him/her as an (illegal) immigrant. The subsequent collocational analysis has shown considerable overlap of the collocates of refugees/asylum seekers and those of immigrants/migrants, which indicates an overlap in usage (Gabrielatos and Baker 2006). Wilson (2006: 13), employing critical discourse analysis to examine 242 articles from Scottish newspapers, came to the same conclusion. In that light, it seems worthwhile to add such related terms to the query. 
The addition of query terms relevant to the core terms seems to also be supported by the observed tendency for representations of groups in the press to “include or exclude social actors to suit their interests and purposes in relation to the readers for whom they are intended” (van Leeuwen 1996: 38). In other words, even if an article reports on or discusses issues related, directly or indirectly, to refugees or asylum seekers, these two groups may not necessarily be referred to explicitly. If, however, the query string includes as many other terms as possible referring to the same or similar groups, then it is expected to capture a large proportion of those articles in which the groups in question are not mentioned explicitly. Further support for the addition of relevant query terms comes from the methodology to be used in the data analysis. The analysis involves the examination of the collocations and resulting lexical networks of the core terms refugee(s) and asylum seeker(s), and the interrelations in meaning/use that they may reveal. Arguably, these interrelations will potentially become clearer if the study could also take into account the collocational patterns and lexical networks of the related terms. For example, terrorism registers as a very strong key word when two sample corpora drawn from the database using the core query are compared to the written BNC Sampler.4 That is, terrorism seems to be strongly associated with topics related to the terms refugee(s) or asylum seeker(s), or, at the very least, to be present in texts containing one or both of these core query terms. It would be helpful, therefore, to examine what other terms (i.e. entities, concepts, states or processes) terrorism tends to be associated with in the corpus. Another example is the case of asylum. 
As one of the groups in focus is those who seek asylum, it seems beneficial to examine its collocational networks in the corpus to be constructed, in order to examine possible links between its different uses. These relations can, of course, also be examined in a representative general corpus, but there are also arguments for examining such associations within the same corpus. The collocational relations established within the specialised corpus can yield additional insights, as they would reveal the use of the term terrorism not in a diverse (albeit representative) range of genres and text types, but in the same clearly specified range of texts in which the associations of the core terms themselves were also established (see McEnery 2006). To put it simply, the associations would be compared against the same background. In sum, additional query terms would ideally return articles which do not contain the core query terms, but are either about the groups denoted by the core terms, or about groups, processes, etc. which are treated as being related to them. However, if such related terms also return a disproportionate number of articles irrelevant to the core query terms, then their addition to the query would render the compilation of a specialised corpus unnecessarily time consuming, or, in the case of the present project, impracticable. For example, the addition of terrorism alone to the core query results in a six-fold increase in the size of a sample corpus spanning thirty days, which translates into a 50–100 per cent increase in the time needed to collect the documents (see also section 2). It seems clear, then, that, desirable as it may be, compiling a corpus containing all terms related, to any degree, to the core query is impossible under the circumstances. This brings us back to the issue of the principled selection of query terms, to which we will now turn. 

2 Query term selection

There are simple formulas which calculate the degree of precision and recall of a query, as Figures 1 and 2 show (Baeza-Yates and Ribeiro-Neto 1999: 75). |Ra| is the number of retrieved relevant documents, |A| is the number of retrieved documents, and |R| is the number of relevant documents in the database.

\[ \text{Precision} = \frac{|Ra|}{|A|} \]
Figure 1: Calculation of Precision

\[ \text{Recall} = \frac{|Ra|}{|R|} \]
Figure 2: Calculation of Recall

However, these formulas are not applicable to the present case, as the number of relevant documents in the database is unknown. The same applies to more complex models, such as best match searching and relevance feedback (see Chowdhury 2004: 180–182). Also, establishing the relevance of the additional documents retrieved by the candidate terms is exactly what is sought here. Assessing the relevance of each candidate term by reading (a sample of) the documents returned by the addition of each candidate term to the core query, as in the case of user relevance feedback (e.g. Buckley, Salton and Allan 1994: 292; Baeza-Yates and Ribeiro-Neto 1999: 75) introduces more subjective decisions, irrespective of whether a number of judges are involved (e.g. Belew and Hutton 1996), or, as in the vector processing model, the documents are returned in order of relevance, either based on the number of query terms in the returned documents, or on the indexing of documents in the database (Chowdhury 2004: 176–180). In fact, reliance on indexing can exclude relevant documents, not only because it is unlikely that the indexing was carried out with the particular project in mind, but also because even metaphorical uses of the core terms, which may not be indexed as relevant, are considered relevant for the purposes of the present project. 
For example, although refugees is a database index term, and documents are returned with a weight on the index term relevance, not all documents containing the word ‘refugees’ are so indexed, presumably because it was decided that this group was not one of the main topics in the document. Similarly, approaches to establishing the probability of relevance of query terms also rely on knowledge which, in the present case, was unavailable, or would be prohibitively time-consuming to acquire, such as the number of documents in the database, the number of words in the collection, the number of relevant documents for a given query term, or the frequency of each term in each document examined for relevance (e.g. Robertson and Sparck Jones 1976; Boughanem et al. 2006). The approaches outlined above would also be impractical in view of the interconnected project-specific constraints relating to the number of candidate terms (more than 100), as well as the available time, finances and human resources (see also Baroni and Bernardini 2003) – particularly as the corpus was a means to an end. Another reason why techniques developed within the field of information retrieval are not entirely helpful in this case may lie in the need to use Boolean queries, which the database interface operates with, as “the Boolean model is in reality more a data (instead of information) retrieval model” (Baeza-Yates and Ribeiro-Neto 1999: 26). The distinction between data and information seems pertinent to the project, as it is data that is sought, data which will be analysed for the information described in the project aims. That is, the corpus needs to contain articles relevant to refugees and asylum seekers (and related groups) without any selection bias regarding the content of articles (i.e. the information given or the stance adopted in them). Attempting to retrieve documents containing specific information can impose bias on the data collection and, consequently, on the study outcomes. 
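
The precision and recall ratios of Figures 1 and 2 can be illustrated with a minimal sketch. It assumes that the full set of relevant documents is known, which is exactly the condition that does not hold for the database discussed here; the function name and the toy document identifiers are hypothetical and serve only to make the arithmetic concrete.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a query.

    retrieved: ids of the documents the query returned (A)
    relevant:  ids of all relevant documents in the database (R)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    retrieved_relevant = retrieved & relevant  # Ra
    precision = len(retrieved_relevant) / len(retrieved) if retrieved else 0.0
    recall = len(retrieved_relevant) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: 4 documents retrieved, 5 relevant, 2 in common.
retrieved_ids = {1, 2, 3, 4}
relevant_ids = {2, 3, 5, 6, 7}
p, r = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case half of the retrieved documents are relevant, while fewer than half of the relevant documents are retrieved. Without knowledge of `relevant_ids`, neither ratio can be computed, which is the difficulty described above.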
A further reported drawback of Boolean queries is that they do not allow for relevance ranking of the retrieved documents (Chowdhury 2004: 174). However, this does not pose a problem for our purposes; on the contrary, it simplifies matters. Once a term is deemed relevant, any article containing it is also deemed relevant, and a document containing a single relevant query term is considered as relevant as one containing two or more (another reason why indexing is not helpful in this case). This is because even metaphorical or humorous uses of relevant terms are desirable, in that they can provide insights into the representation of the two groups.5

At this juncture, the literature on compiling corpora from the web seems worth investigating. The compilation of corpora from the world wide web involves the use of an API (Application Programming Interface), that is, a service which allows third-party software access to a search engine’s index of web pages. The first step is to decide on an initial set of terms (or seeds) which are expected to return relevant texts, irrespective of whether relevance is defined in terms of genre, topic or language (e.g. Ghani et al. 2001; Baroni and Bernardini 2004; Baroni and Sharoff 2005). These initial terms are combined randomly in equal sets (e.g. pairs or triplets) to be used as queries (Baroni and Bernardini 2004: 1314). The documents retrieved from each query (or a portion of them) are used to compile a pilot corpus. The corpus derived thus is compared to a reference corpus to establish keywords in the pilot corpus – Baroni and Sharoff (2005) suggest using the 40 top keywords. A random sub-set of these keywords is used to form new sets of queries. This procedure is repeated as required, although Baroni and Bernardini (2004: 1314) report that they did not have to repeat the procedure more than two or three times. 
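
The random combination of seeds into equal-sized query sets can be sketched as follows. This is not the BootCaT implementation; the function name, the seed list and the use of `AND` to join the terms are assumptions made for illustration only.

```python
import random


def random_seed_queries(seeds, tuple_size=2, n_queries=5, rng=None):
    """Combine seed terms at random into equal-sized sets, one set per query.

    The " AND " join is an assumption; the actual operator depends on the
    search interface being queried.
    """
    rng = rng or random.Random()
    queries = []
    for _ in range(n_queries):
        combo = rng.sample(seeds, tuple_size)  # distinct seeds within one query
        queries.append(" AND ".join(combo))
    return queries


# Hypothetical seed list, for illustration only.
seeds = ["refugee", "asylum", "immigrant", "migrant", "deportation"]
queries = random_seed_queries(seeds, tuple_size=2, n_queries=4, rng=random.Random(0))
for query in queries:
    print(query)
```

In a later round, the same function could be called with a random sub-set of the top keywords from the pilot corpus in place of the initial seeds.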
This technique seems to be an adaptation of relevance feedback. Instead of users or experts reading (a sample of) the documents to assign a relevance score, the decision is largely reached through successive keyword comparisons. However, Baroni and Bernardini (2003: 4, 2004: 1314) acknowledge that the number of initial terms, the cut-off point for the use of key words as interim query terms, and the selection of documents for each pilot corpus are subjective decisions, sometimes based on trial and error. In sum, the procedure may not be entirely objective; it is, however, free from the decisions of human readers. Given the availability of a set of software tools (BootCaT) which would automate the procedure (Baroni and Bernardini 2003; Baroni et al. 2006), this technique would be considered promising for our purposes. However, the tools cannot be applied to the particular database. The techniques used here (see section 3) adapt and combine elements of the procedures outlined above. Candidate terms are selected through a keyword comparison of a pilot corpus of database documents returned by the core query and a representative corpus of British English;6 however, introspectively selected candidate terms were also tested for relevance. As regards relevance, the focus was shifted from the relevance of documents to the relevance of additional query terms, that is, the degree to which they are found in the same documents containing one or more of the core query terms. In other words, reading-based decisions were replaced by an indicator (RQTR) reflecting the number of additional documents returned by the addition of each candidate term to the core query. 
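
The keyword comparison mentioned above ranks words by a keyness statistic. The sketch below assumes the log-likelihood statistic that is widely used for keyword analysis; the exact settings used in the project are not stated in this section, and the word frequencies in the example are invented. Only the two corpus sizes (UK1 and BNC-S) are taken from the text.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of a word in a study corpus vs a reference corpus.

    freq_*: occurrences of the word in each corpus
    size_*: total number of words in each corpus
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    if freq_study > 0:  # a zero frequency contributes nothing to the sum
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


# Corpus sizes from the text (UK1 and BNC-S); the frequencies are invented.
score = log_likelihood(freq_study=120, size_study=342_590,
                       freq_ref=40, size_ref=1_082_171)
print(round(score, 1))
```

A word with the same relative frequency in both corpora scores zero; the larger the score, the stronger the evidence that the word is over- or under-represented in the study corpus.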
The use of pilot corpora, rather than a corpus compiled by applying the core query to the whole sub-section of the database required for the project (in terms of newspapers and time span), was dictated by the constraints imposed by the database interface, which do not appear to be unique to the database used to compile the corpus, and are not dissimilar to those imposed by search engine APIs.7 The database interface imposed restrictions on the number of documents returned for each query, as well as the number of documents that could be downloaded at a time, and gave no information about the number of documents that would be retrieved in absence of the restrictions. Combined, these restrictions make it impossible to establish the number of documents when a query returns more documents than the limit, without breaking down the query time span into smaller units (a hit-and-miss affair). As an indication of the time investment that working under those restrictions would entail, consider the case of using only the core query, refugee* OR asylum seeker*, as initial seeds. Extrapolating from the frequency of the two terms and the number of documents in the corpus, it was estimated that repeating the procedure with combinations of the 40 top keywords even only twice, which is the minimum number of repetitions that Baroni and Bernardini (2004: 1314) report, would take up at least one-third of the time available for the project – clearly, an inordinate amount of time. Due to these restrictions, the pilot corpora used in the process of determining the query to be used for the corpus compilation were not based on all the texts available in the database over the period in question. 
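
The workaround of breaking a query's time span into smaller units can be sketched as a recursive bisection of the date range. The `search` callable below stands in for the database interface and is hypothetical: it is assumed to report at most `limit` hits for a date range and to give no indication of the true total when the cap is reached. The daily counts are invented.

```python
from datetime import date, timedelta


def count_documents(search, start, end, limit):
    """Estimate the true number of hits in [start, end] under a result cap."""
    hits = search(start, end)
    if hits < limit:
        return hits  # below the cap, so the reported figure is the true count
    if start == end:
        # A single day already reaches the cap; the true count is unknowable.
        raise ValueError(f"result cap reached for single day {start}")
    mid = start + (end - start) // 2
    return (count_documents(search, start, mid, limit)
            + count_documents(search, mid + timedelta(days=1), end, limit))


def make_fake_search(daily_counts, limit):
    """Build a stand-in for the capped interface from invented daily counts."""
    def search(start, end):
        total = sum(n for day, n in daily_counts.items() if start <= day <= end)
        return min(total, limit)
    return search


# Invented data: 10 matching documents per day over a 30-day span.
first_day = date(2005, 9, 11)
daily = {first_day + timedelta(days=i): 10 for i in range(30)}
fake_search = make_fake_search(daily, limit=100)
total = count_documents(fake_search, first_day, date(2005, 10, 10), limit=100)
print(total)
```

Each split costs additional queries, which is why the procedure becomes expensive when many sub-ranges still reach the cap.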
The first pilot corpus (henceforth UK1) contained articles published between 11 September and 10 October 2005 (342,590 words); the second (henceforth UK6) contained articles published during six random months spanning the duration of the intended corpus: October 1996, December 1998, February 2000, April 2002, June 2004, August 2005 (2,658,184 words). Both corpora comprised texts from twelve UK national newspapers returned using the core query string. UK6 is balanced more towards the present in order to preserve the balance between broadsheets and tabloids, as the database contains a higher proportion of broadsheets before 2000. The reference corpora used for the keyword comparison were derived from the BNC (Aston and Burnard 1998): the written BNC Sampler (henceforth BNC-S; 1,082,171 words), and the newspaper sub-corpus of the written BNC (henceforth BNC-N; 9,670,226 words).

A second reason for adapting the process used to build corpora from the world wide web is the ephemeral nature of many entities, events, etc. referred to in newspaper articles. For many of the top forty keywords it is not apparent that they have registered statistical significance because of their relation to the perception of the nature/status of refugees and asylum seekers. Rather, these terms seem to have been important during the time period covered in the pilot corpus (e.g. darfur, hurricane, iraq, orleans, wolfgang), referred to groups with a longstanding relation to the core terms (e.g. jewish, palestinian), referred to political entities relevant to the UK (e.g. blair, eu, labour, tony), or were not specifically related to the core terms (e.g. killed, police, war).8 In the same vein, keyword comparisons with a reference corpus that is not contemporary with the pilot corpus are bound to favour words referring to entities, concepts, etc. 
which were not current in the period represented by the reference corpus. In our case, the two corpora do not even overlap: the BNC contains documents up to 1994, whereas UK6 spans 1996–2005. We also need to consider whether, irrespective of their time-specificity, some words registered keyness not because they are related to refugees or asylum seekers, but because UK6 comprises newspaper texts, whereas BNC-S is a general corpus. To establish whether keywords are news-specific rather than specific to the core query terms, UK6 was also compared to BNC-N. The comparison showed considerable overlap between the two keyword lists: 60 per cent in the top 100 content keywords, rising to 80 per cent in the top 40. This seems to indicate that keyness is more the result of the key terms’ relation to the core terms than their specificity to newspaper articles. However, this does not diminish the possibility that the keyness of a large number of words was mainly due to the different time spans. Finally, a keyword analysis effectively treats the compared corpora as single texts. As a result, some words may register keyness because they have very high frequencies in a relatively small number of documents, even if this clustering is not representative of the majority of documents in the corpus. In addition, and particularly for this project, this characteristic also tends to boost the keyness of words in broadsheets, as, on average, articles in them are usually much longer than in tabloids (in the corpus, articles in broadsheets are on average 46.6% longer than in tabloids). A technique which can solve these problems is the calculation of key-keywords (Scott 2004: 115), that is, words which are key in a number of texts in a corpus, in order to establish associates, that is, “key-words associated with a key key-word” (ibid.: 109). 
This would be a helpful technique if it were not prohibitively time-consuming in the present context, as it would entail downloading one document at a time (UK6 contains almost 4,000 documents), whereas files were downloaded in batches of up to 200 documents. More importantly, the technique does not bypass the problem of time-specific keywords outlined above. These considerations suggest that keyness alone may not always be a good indicator of the suitability of candidate query terms. For instance, including any single one of the terms mentioned above (e.g. blair, hurricane, palestinian) would decrease precision and create an overlarge corpus, without necessarily increasing recall enough for the inclusion to be justified. Furthermore, using the iterative process described above to objectively discard these terms would require an investment in time which does not seem justified within the constraints of this project, particularly when considering the document download restrictions of the database. In this light, it might not seem unreasonable to opt for examining the lists of key n-grams and choosing those which are consistent with our subjective assessment of candidate term relevance. However, since introspection, on its own, is not a reliable indicator, it would be best to introduce a second objective method of measuring candidate query term relevance, which would then be used as a means to triangulate decisions regarding additions to the query.

3 Measuring candidate query term relevance

The procedure applies to decisions on the inclusion of additional query terms, after a core query has been formulated. The objective is to establish whether query terms can be added which will return a sufficient number of relevant documents not containing the core terms, without creating undue noise. The addition of query terms should return the minimum possible number of unrelated documents. 
The underlying principle is that helpful additional terms are those which can be shown to be associated with the core terms in a sufficient number of contexts; that is, the term should demonstrate preference for texts containing the core query. The first step in quantifying that preference is establishing the ratio of the number of texts returned by the query ‘core query AND9 candidate term’ (henceforth, CQ&T) to the number of texts returned by a query containing only the candidate term (henceforth, T). The query term relevance score (henceforth, QTR) is calculated as shown in Figure 3:

\[ \text{QTR} = \frac{\text{CQ\&T}}{\text{T}} \]
Figure 3: Calculation of Relevance

QTR is, in essence, a global technique, in that it examines word co-occurrences in a corpus in order to expand a given query (see Xu and Croft 1996: 4–5), albeit using a sample of the available documents. Also, the nature of QTR is not unlike that of scores in the vector processing model, in which “the similarity between two objects [i.e. documents] is computed as a function of the number of properties [i.e. index terms] that are assigned to both objects; in addition, the number of properties that is jointly absent from both the objects may also be taken into account” (Chowdhury 2004: 176). In the QTR score, these properties are the relative co-occurrence of candidate and core-query terms in documents, or, in other words, the relative frequency of the presence or absence of a candidate term in documents containing one or more of the core query terms. QTR also has aspects in common with the best match searching model, which is “a term weighting scheme that reflects the importance of a term” (Chowdhury 2004: 180). It must be clarified that, as it stands, the QTR score means very little on its own, and its main utility is to help establish the baseline score (see below). 
As will be seen later in this section, QTR scores are sensitive to the make-up of the pilot corpus. Therefore, QTR should be interpreted in relation to three other scores: that of clearly relevant terms, for example, those which are relevant by definition (in our case, the core query terms), that of clearly irrelevant ones (see below), and the baseline. The baseline for an acceptable level of relevance is indicated by the lowest QTR derived for one of the core terms when the rest are used as the core query. In other words, the threshold marking preference is set by the baseline score. At the same time, we should also take into consideration the distance between the scores of candidate terms and clearly unrelated terms. Let us use UK1 to demonstrate how the baseline relevance is calculated. Having established the core query ‘refugee* OR asylum seeker*’, we will now calculate QTR for each of its constituent terms, treating the other as a candidate term (see Tables 2 and 3).

Table 2: Relevance of asylum seeker* with refugee* as the core query

CQ&T (refugee* AND asylum seeker*)    39
T (asylum seeker*)                    125
QTR                                   0.312

Table 3: Relevance of refugee* with asylum seeker* as the core query

CQ&T (asylum seeker* AND refugee*)    39
T (refugee*)                          349
QTR                                   0.112

Following the premise that terms are deemed good candidates if their relevance score is at least equal to that of the lowest-scoring core term, the baseline for candidate term relevance would be QTR=0.112.10

If we take into account corpus-based research which strongly indicates that different forms of a lemma may enter into different collocational patterns and demonstrate different semantic prosodies/preferences11 (e.g. Sinclair 1991: 53–65, 154–156), then it seems appropriate to calculate the relevance score of the different forms of a candidate term separately, rather than only the *stem* (stem plus affix wildcards), as different forms may yield different scores. 
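
The QTR ratio and the baseline derived in Tables 2 and 3 can be reproduced with a short sketch. The document counts are those reported for UK1 in the two tables; the function and variable names are illustrative choices, not part of the original procedure.

```python
def qtr(cq_and_t, t):
    """Query term relevance: hits for 'core query AND term' over hits for the term."""
    if t == 0:
        raise ValueError("the candidate term returns no documents")
    return cq_and_t / t


# (CQ&T, T) document counts for UK1, as reported in Tables 2 and 3.
core_counts = {
    "asylum seeker*": (39, 125),  # refugee* treated as the core query
    "refugee*": (39, 349),        # asylum seeker* treated as the core query
}
core_scores = {term: qtr(*counts) for term, counts in core_counts.items()}

# The baseline is the lowest QTR obtained for any of the core terms.
baseline = min(core_scores.values())

for term, score in core_scores.items():
    print(f"{term}: QTR = {score:.3f}")
print(f"baseline = {baseline:.3f}")
```

Because the baseline is recomputed from whichever pilot corpus is queried, the same code applied to a different sample would be expected to yield a different threshold.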
On the other hand, it may be desirable to treat synonymous terms as a single query item. For example, although emigrant shows QTR below the baseline, its relevance increases if we calculate QTR for the query ‘*migrant’ (i.e. emigrant OR immigrant OR migrant), which almost equals the baseline (QTR=0.108). This seems to be consistent with the findings of Baker (2004), who suggests, in relation to keywords, that corpus-based research would be wise to also examine the keyness of groups of notionally related low-frequency keywords. Table 4 shows the result of the initial examination, which compared the keyness12 (or lack of it) and the QTR score of candidate terms, as well as three check terms, that is, clearly irrelevant terms: dvd, guitar, lemon.13 Keywords were derived from the comparison of UK1 and the written BNC Sampler. Some candidate terms were selected because they were among the strongest keywords, others because they were introspectively deemed to be closely related to the core terms (some of the latter are lower-ranking keywords, others are non-key). In the LL column, bold indicates that the term is one of the top 40 keywords in the comparison of unigrams, bigrams and trigrams (as appropriate), or, in the case of wildcarded terms, that it has a score which would place it among the top 40; the symbol indicates that a term is not key. In the QTR column, bold indicates that the relevance score is above the baseline. Candidate terms are listed in alphabetical order. 
Table 4: Keyness and relevance of candidate and check terms in UK1
Candidate term | LL | QTR
abuse | 28.1 | 0.015
abused | 22.4 | 0.017
blair | 396.3 | 0.015
deportation | 90.6 | 0.208
deported | 68.9 | 0.216
deportees | 30.4 | 0.292
deporting | 19.7 | 0.182
deport* [14] | 147.5 | 0.137
displace(s) | | 0
displaced | | 0.113
displacement | | 0.034
displacing | | 0.154
displac* | | 0.076
dvd | 34.2 | 0.008
emigrant | | 0.037
emigrated | | 0.081
emigration | | 0.063
emigr* | | 0.071
ethnic minorit* | | 0.064
evacuate | 22.8 | 0.062
evacuated | 23.6 | 0.051
evacuating | | 0.059
evacuation(s) | | 0.049
evacuee(s) | 62.7 | 0.082
evacu* | 80.5 | 0.045
expelled | | 0.048
expulsion | | 0.083
extradition OR expulsion | 36.8 | 0.040
extradition | 18.5 | 0.024
firm but fair | | 0
fugitive(s) | | 0.014
genocide | 125.4 | 0.087
guitar | | 0.052
human rights | 123.6 | 0.057
hurricane | 217.2 | 0.017
illegal alien(s) | | 0
illegal entry | | 0.154
illegal immigrant(s) | 60.8 | 0.150
immigr* | 457.7 | 0.098
immigr* OR emigr* | 442.0 | 0.097
immigr* OR emigr* OR migrant | 532.9 | 0.095
*migrant | 281.2 | 0.108
immigrant(s) | 191.6 | 0.114
immigrate* | | 0
immigration | 265.4 | 0.132
leave to remain | | 0.143
lemon | | 0.006
migrant(s) | 92.6 | 0.156
policy | | 0.018
settler(s) | | 0.160
stranded | 22.4 | 0.028
terrorism | 160.3 | 0.025
threat | 16.1 | 0.013
unemployed | | 0.024
unemployment | | 0.023

An initial observation is that keyness does not tend to coincide with relevance. Fewer than one-third of the top-40 key terms, and of all the key terms examined, establish relevance above the baseline (30.8% and 31% respectively), whereas almost one in five (19.2%) of non-key terms have relevance higher than B. This is quite interesting given the very low baseline score (B = 0.112). Conversely, lack of keyness does seem to coincide with lack of relevance (80.8% of the cases examined).
What is more, there are cases when keywords have a QTR score very close to that of clearly irrelevant terms, a discrepancy that becomes more striking when such terms are among the top-40 keywords (blair, hurricane, genocide, terrorism). At this point, we need to consider whether these discrepancies are due to the fact that the pilot corpus only spanned one month, as the relevance score need not be static, but may well be dynamic. That is, a candidate term may not have been closely related to the core terms towards the beginning of the period in question (i.e. 1996–2005), but it may have been increasingly treated as relevant in recent years (or vice versa). A further indicator, then, of the relevance of a candidate term is the consistency of its score over time. Similarly, the keyness of terms can also be expected to change over time, particularly the further the publication date of newspaper articles is removed from the period covered in the reference corpus.

For this reason, a second pilot corpus (UK6) was compiled, which included articles spanning the period 1996–2005. The objective was to establish whether, and to what extent, the keyness or relevance of candidate terms would be affected by the different composition of the two pilot corpora in terms of the publication dates of the texts they contained. It has to be clarified that what is of interest in this comparison is not so much the strength of keyness, but whether the keyness of a term, or the relation of QTR to the baseline, would be consistent in the two pilot corpora. As Tables 5 and 6 show, the calculation of baseline relevance in UK6 confirms that the QTR score is indeed sensitive to the make-up of the corpus, as the baseline score (B) is different from the one calculated for UK1.
Table 5: Relevance of asylum seeker* with refugee* as the core query in UK6
CQ&T (refugee* AND asylum seeker*) | T (asylum seeker*) | QTR
593 | 1403 | 0.423

Table 6: Relevance of refugee* with asylum seeker* as the core query in UK6
CQ&T (asylum seeker* AND refugee*) | T (refugee*) | QTR
593 | 2596 | 0.228

It is clear that the QTR score does not lend itself to comparisons between two corpora, and, consequently, it does not allow for reliability checks, as the baseline score changes with the corpus make-up. However, this can be easily remedied if instead of the absolute relevance we calculate the relative relevance (RQTR) of a term. The RQTR score measures the relative distance of the QTR score of a candidate term from the baseline (Figure 4). The introduction of the baseline score in the calculation of RQTR seems compatible with the best match searching model, in that “a best match search matches a set of query words against the set of words corresponding to each item in the database, calculates a measure of similarity between the query and the item, and then sorts the retrieved items in order of decreasing similarity” (Chowdhury 2004: 180). However, best match searching requires the database documents to be indexed, which would be undesirable, even if the database to be used did include pertinent index terms (see section 2). When comparing the RQTR scores of a candidate query term in two pilot corpora, we are examining whether, and to what extent, the term scores are higher or lower than B, as established in each corpus, effectively neutralising inter-corpus fluctuations in the baseline score.

\[ \mathrm{RQTR} = \frac{(\mathrm{QTR} - B) \times 100}{B} \]
Figure 4: Calculating RQTR from QTR and B

The utility of relative relevance is twofold. It allows comparisons of relevance between corpora, and it facilitates the comparison between the relevance of different candidate terms in a given corpus.
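To see how the formula in Figure 4 neutralises differences in the baseline, the following minimal sketch applies it to one and the same QTR score under the two baselines reported for UK1 and UK6. The QTR value of 0.150 is a hypothetical figure chosen for illustration, and the function name is an assumption rather than part of the original procedure.

```python
def rqtr(qtr_score: float, baseline: float) -> float:
    """Relative distance of a QTR score from the baseline, in per cent (Figure 4)."""
    if not 0 < baseline <= 1:
        raise ValueError("baseline must lie in (0, 1]")
    return (qtr_score - baseline) / baseline * 100


B_UK1, B_UK6 = 0.112, 0.228   # baselines reported for the two pilot corpora
hypothetical_qtr = 0.150      # illustrative value, not taken from the paper

print(round(rqtr(hypothetical_qtr, B_UK1), 1))  # above the UK1 baseline
print(round(rqtr(hypothetical_qtr, B_UK6), 1))  # below the UK6 baseline
```

The same raw score is positive against the UK1 baseline and negative against the UK6 baseline, which is why raw QTR values from different corpora should not be compared directly.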
However, as it stands, RQTR is only helpful for comparisons in cases of negative RQTR scores, as the minimum possible RQTR is always the same (-100). The minimum possible RQTR is obtained when QTR is zero, that is, when the candidate term is never found in the same database texts with the core query terms. In the case of positive scores, the maximum possible RQTR score depends on the baseline score (it decreases as the baseline rises), and can fluctuate widely. The maximum RQTR is obtained when QTR is 1, that is, when the candidate term is always found in the same database texts with the core query terms. For example, with B=0.228, the maximum RQTR score is 338.6, whereas with B=0.112, the maximum RQTR score is 792.9. Therefore, in order to be able to compare the distance (higher or lower) from the point where QTR=B, we need to normalise positive RQTR scores. We derive the normalised positive RQTR score (RQTRn) by calculating positive RQTR values as if the maximum possible RQTR were 100 (Figure 5):

\[ \mathrm{RQTR}_n = \frac{\mathrm{RQTR} \times 100}{\text{max.}\,\mathrm{RQTR}} \]
Figure 5: Calculating RQTRn

If we substitute \( \frac{(\mathrm{QTR} - B) \times 100}{B} \) for RQTR, and \( \frac{(1 - B) \times 100}{B} \) for max. RQTR (as the maximum possible QTR value is 1) in the above formula, then we have a formula for calculating RQTRn which makes use of the QTR and B scores, so there is no need to calculate RQTR and max. RQTR (Figure 6).

\[ \mathrm{RQTR}_n = \frac{(\mathrm{QTR} - B) \times 100}{1 - B} \]
Figure 6: Calculating RQTRn without RQTR

For ease of reference, RQTR will denote both negative and normalised positive scores, that is, when the RQTR score is positive it should be understood to have been normalised. The RQTR score indicates relevance on a bi-directional scale, by treating the baseline as the zero point: when QTR=B, then RQTR=0.
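The two-sided scale described above can be sketched as a single function: scores at or below the baseline use the formula in Figure 4, and scores above it use the normalised formula in Figure 6. The function names are illustrative assumptions; the QTR values in the usage lines are those listed for displaced and displacement in Table 4.

```python
def rqtr_scaled(qtr_score: float, baseline: float) -> float:
    """RQTR on the -100..+100 scale: Figure 4 below the baseline, Figure 6 above it."""
    if not 0 < baseline < 1:
        raise ValueError("baseline must lie strictly between 0 and 1")
    if not 0 <= qtr_score <= 1:
        raise ValueError("QTR must lie between 0 and 1")
    if qtr_score <= baseline:
        return (qtr_score - baseline) / baseline * 100
    return (qtr_score - baseline) / (1 - baseline) * 100


def max_unnormalised_rqtr(baseline: float) -> float:
    """Largest value Figure 4 can yield (QTR = 1); shows why normalisation is needed."""
    return (1 - baseline) / baseline * 100


B_UK1 = 0.112
print(round(rqtr_scaled(0.113, B_UK1), 1))   # displaced: just above the baseline
print(round(rqtr_scaled(0.034, B_UK1), 1))   # displacement: well below the baseline
print(round(max_unnormalised_rqtr(0.228), 1))
```

Whatever the baseline, the scaled score runs from -100 (QTR = 0) through 0 (QTR = B) to +100 (QTR = 1), so scores from different pilot corpora share one scale.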
The scores show whether the relevance of a candidate term is higher (positive) or lower (negative) than the baseline relevance, and also indicate the extent of the distance from the baseline. See Table 7 for details:

Table 7: Interpreting RQTR scores
RQTR | Interpretation
+100 | Full relevance: the candidate term is always found in database texts containing one or more of the core query terms.
0 | Baseline relevance: the candidate term has the same level of relevance as that set as the minimum for inclusion in the final query. [15]
-100 | No relevance: the candidate term is never found in database texts containing any of the core query terms.

The RQTR score is useful in two ways. First, it makes explicit the distance from the baseline score (either positive or negative), and thus facilitates the comparison of term relevance scores within the same corpus. Second, it enables the comparison of relevance between corpora derived from different time periods. [16] Interestingly, candidate terms with RQTR of +100 need not be added to the query, as the documents containing them will be returned by the core query alone.

Now we are able to carry out a more systematic comparison of keyness and query term relevance, particularly as the previous comparison only used a somewhat arbitrary list of items. The top-40 keywords in the comparisons between the two pilot corpora (UK1 and UK6) and the two reference corpora (BNC-S and BNC-N) were combined, producing a total of 74 distinct keywords. In order to create the most helpful list possible, certain words are not included: the core terms (refugee*, asylum, seeker*), function words, and frequent verbs (e.g. say). Table 9 compares the keyness (LL) and relative relevance (RQTR) of these terms.
For ease of comparison, and since the highest p value in all four comparisons is as small as $10^{-14}$ (see Table 8 for details), in Table 9, top-40 keyness is indicated by the symbol ‘!’ and non-keyness by ‘ ’; additionally, top-40 keyness and positive RQTR are indicated by shading.

Table 8: Lowest top-40 LL scores and highest p values in the keyword comparisons
Comparison | Lowest top-40 LL | Highest top-40 p
UK1*BNC-S | 107.8 | $p < 10^{-14}$
UK1*BNC-N | 171.1 | $p < 10^{-15}$
UK6*BNC-S | 323.6 | $p < 10^{-16}$
UK6*BNC-N | 1203.4 | $p < 10^{-18}$

Table 9: Comparison of LL and RQTR scores of top-40 keywords in UK1 and UK6
CQ: refugee* OR asylum seeker*
Columns: Top-40 keywords (n = 74) | Top-40 keyness UK1 | Top-40 keyness UK6 | RQTR UK1 | RQTR UK6
afghan ! -33.0 +3.5 afghanistan ! -59.8 -38.2 ! -72.3 -81.6 arafat ! -17.0 +8.7 ariel ! +1.9 +5.3 army ! -72.3 -69.7 ! -90.2 -87.3 al attacks ! ! bethlehem blair(‘s) ! blunkett ! -100 +12.7 ! -86.6 -85.5 ! -91.1 -47.4 -55.4 -84.6 bondi ! britain ! ! -83.9 -82.0 camp(s) ! ! -40.2 -43.9 ! -41.1 -29.4 -52.7 -19.3 -85.7 -85.1 +12.2 +15.4 -88.4 -81.6 -85.7 -89.5 ! +4.7 0 ! -88.4 -94.3 civilians congo ! ! country darfur ! eu ! family ! gaza ! gbp ! genocide ! -22.3 -39.9 hamas ! +15.5 +11.4 home ! ! -93.8 -88.2 human ! ! -72.3 -76.8 hurricane ! -84.8 -88.1 immigrant(s) ! ! +0.2 +3.2 immigration ! ! +2.2 +11.5 iraq ! ! -66.1 -83.8 israel(‘s) ! ! -48.2 -35.1 israeli(s) ! ! -28.6 -18.0 jack ! -81.3 -84.6 jenin ! -100 +63.2 jerusalem ! -44.6 -15.4 ! -7.1 -45.6 jewish ! jew(s) ! -34.8 -44.3 katrina ! -84.8 -79.8 killed ! ! -70.5 -79.8 ! -6.3 -22.8 kosovo labour ! -75.8 -86.8 libeskind ! +4.7 -100 louisiana ! -55.4 -80.7 migrant(s) ! +4.9 +7.9 nazi ! -20.5 -62.7 -82.1 -84.6 -63.4 -86.0 ! -6.3 -14.5 ! -72.3 -75.0 ! office orleans ! palestinian(s) ! peace police ! ! -83.9 -88.6 pounds ! ! -94.6 -95.2 powell ! -92.0 -70.2 ramallah ! -57.1 +21.6 ransome ! -61.6 -100 rwanda ! -12.5 +5.3 rwandan ! +1.9 +19.4 saddam ! -83.0 -85.5 samir ! +9.9 -34.2 -80.4 -82.5 -67.0 -97.8 ! secretary seth ! sharon ! -81.3 -45.2 soldiers ! -68.8 -63.2 suicide ! -20.5 -68.4 -12.5 -33.8 sudan ! taliban ! 65.2 -30.3 terror ! -75.9 -73.7 ! -77.7 -70.6 -88.4 -90.8 -76.8 -47.8 -64.3 96.5 -73.2 -77.6 terrorism ! tony ! ! un walter ! war ! wiesenthal ! -33.9 -47.4 wolfgang ! -21.4 -87.7 zarqawi ! -73.2 -94.7 zimbabwe ! -53.6 -86.8 !

The RQTR score shows a significantly higher consistency between different corpora than top-40 keyness. When scores in the two pilot corpora are compared, RQTR polarity coincides for the vast majority of terms (89.2%), whereas top-40 keyness only coincides in just over a quarter of cases (28.4%). Also, in both pilot corpora, top-40 keyness and positive RQTR coincide in only 12.2 per cent of the cases; that is, most of the top-40 keywords would be expected to return documents largely unrelated to the core query terms.

At this point, we need to revisit the introspectively selected terms examined above (Table 4) and examine the LL and RQTR scores in both UK1 and UK6 (see Table 10; bold indicates positive RQTR scores and keyness at top-40 level for each comparison; LL scores that are lower than 15.13 are indicated by ‘ ’; parentheses indicate keyness if the threshold is lowered to LL ≥ 6.63, p ≤ $10^{-2}$; for an explanation of why a lower keyness threshold was also examined, see below).
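The consistency check on RQTR polarity can be sketched as follows. The five score pairs are rows that can be read unambiguously from Table 9; they are a small subset chosen for illustration, so the result is not expected to reproduce the 89.2 per cent reported for all 74 keywords. Treating a score of exactly zero as non-positive is an assumption of this sketch.

```python
def share_same_polarity(scores: dict[str, tuple[float, float]]) -> float:
    """Percentage of terms whose RQTR is positive in both corpora or in neither."""
    if not scores:
        raise ValueError("no terms given")
    # A score of exactly 0 is counted as non-positive (assumption of this sketch).
    agree = sum((a > 0) == (b > 0) for a, b in scores.values())
    return agree / len(scores) * 100


# (UK1 RQTR, UK6 RQTR) for five rows of Table 9; a subset for illustration only.
sample = {
    "hamas": (15.5, 11.4),
    "immigration": (2.2, 11.5),
    "jenin": (-100.0, 63.2),
    "katrina": (-84.8, -79.8),
    "libeskind": (4.7, -100.0),
}
print(f"{share_same_polarity(sample):.1f}% of the sample keeps its polarity")
```

In this subset, jenin and libeskind change polarity between the two pilot corpora, so three of the five terms (60 per cent) are consistent.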
Table 10: Comparison of LL and RQTR scores of introspectively selected terms in UK1 and UK6
Columns: Candidate terms | UK1*BNC-S | UK1*BNC-N | UK1 RQTR | UK6*BNC-S | UK6*BNC-N | UK6 RQTR
abuse(s) 28.0 abused 22.3 deportation(s) 90.3 deported -86.6 103.7 -84.8 32.0 203.7 +10.8 126.2 637.9 +7.5 68.7 129.5 +11.7 132.6 552.4 +1.7 deportee(s) 30.3 32.4 +20.3 deporting 19.6 39.3 +7.9 21.0 89.2 +10.2 147.5 418.8 +2.8 381.3 1594.5 +5.4 deport* (6.63) displace(s) displaced (14.0) 31.4 +0.1 -69.6 displacing +4.7 dvd(s)(‘s) (9.1) 48.5 (11.9) ethnic minorit* 26 (7.3) (12.8) -89.0 +1.0 -87.7 59.6 279.9 (9.8) -9.2 -70.6 -83.3 -32.1 44.0 256.6 -44.7 114.8 -92.8 24.6 110.5 -96.5 (8.1) emigration emigr* (11.4) -87.3 28.3 emigrant(s) emigrated (8.3) -100 displacement displac* (12.3) 103.4 (7.7) 27.6 -66.9 -56.1 -27.7 -83.8 -43.8 -50.9 -36.6 16.5 39.8 -69.7 -42.8 20.3 102.0 -61.4 evacuate(s) 22.8 17.8 -44.6 evacuated 23.6 25.6 -54.4 evacuating evacuation(s) 34.8 (11.6) 67.4 -4.5 17.5 -66.7 -47.3 (12.3) 24.2 -56.2 -46.9 (12.8) 61.1 -56.6 evacuee(s) 62.7 149.9 -26.8 evacu* 80.5 158.1 -59.8 53.4 138.7 -55.2 -57.1 27.2 55.3 -75.0 -25.9 17.6 89.7 -56.6 -78.6 75.9 170.3 -69.7 -64.3 103.8 259.8 -66.7 expelled (11.8) expulsion(s) 18.6 extradition(s) 18.5 extradition(s) OR expulsion(s) 36.8 (9.4) 25.3 (11.3) 34.5 (8.4) -59.6 firm but fair -100 fugitive(s) -87.5 guitar(s)(‘s) -53.6 hijack(s) -80.2 187.9 721.3 +6.6 hijacker(s) -74.5 135.9 1149.2 +10.2 hijack* -91.9 487.4 2240.4 -27.19 -50.0 358.9 1809.6 -48.2 human rights 123.6 282.5 (7.8) (7.3) 23.6 -39.5 -75.0 -96.9 illegal alien(s) -100 (7.8) +44.4 illegal entry(ies) +4.7 (13.5) +13.6 illegal immigrant(s) 60.8 132.8 immigrate* +4.3 17.1 911.3 -100 +16.2 -100 immigr* 457.7 862.2 -12.5 1528.2 6481.6 +2.5 immigr* OR emigr* 442.0 772.2 -13.4 1502.4 5964.8 -4.8 immigr* OR emigr* OR migrant* 532.9 1000.7 -15.2 1569.1 6640.3 -7.0 *migrant(s) 281.5 579.7 -3.6 648.2 2901.6 +1.0 +3.5 37.6 162.6 +30.6 leave to remain (6.8) lemon(s)(‘s) -94.6 -96.0 massacre(s) 27.4 56.5 -48.9 170.6 754.9 -19.3 persecution(s) 24.2 62.1 +8.2 97.7 549.1 +11.4 persecut* 29.6 60.6 +4.0 149.5 665.8 +4.1 26.1 -83.9 156.0 -85.5 policy(ies) (11.6) racism 39.3 32.2 -55.6 186.3 561.9 -57.5 racis* 95.4 96.9 -58.3 470.6 1546.8 -57.0 69.9 +5.4 96.2 646.3 -44.7 -75.0 26.8 -88.4 99.9 settler(s) (14.7) stranded 22.4 threat(s) 18.1 (7.3) (10.7) 74.4 -87.7 -89.5 unemployed -78.6 -87.3 unemployment -79.5 -88.6

Again, keyness does not tend to correlate with relevance. Terms that are key in both comparisons have positive RQTR in just above one-third of the cases: 34.8 per cent in UK1 and 35.3 per cent in UK6. The correlation is much lower when we consider terms that are key in only one comparison, among which keyness coincides with positive RQTR in 18.2 per cent of the cases in UK1 and never in UK6. Overall, keyness in at least one comparison corresponds to relevance in just over a quarter of instances: 29.4 per cent (UK1) and 28.2 per cent (UK6). The discrepancy is rendered more striking if we consider that the baseline scores are rather low (0.112 for UK1 and 0.228 for UK6). That is, only 11.2 or 22.8 per cent of the documents containing a term need to also contain one or more core query terms for the term to register relevance. Consequently, it might be reasonably expected that high-ranking keywords would also register positive RQTR, or, in other words, that keyness would be better able to discriminate between relevant and non-relevant terms, but this is not the case here (explanations follow later in this section).

Conversely, lack of keyness correlates highly with lack of relevance. Of the terms that are non-key in both comparisons, 85.7 per cent (UK1) and 94.1 per cent (UK6) have negative RQTR.
However, this also suggests that if keyness were the sole criterion for further examining the suitability of query terms, then a not insignificant proportion of relevant terms would be left out (14.3% in UK1, 5.9% in UK6). This observation points towards a further utility of RQTR, namely the ability to test the relevance of introspectively selected non-key terms.

Regarding consistency between the two sample corpora, both keyness and RQTR show similar results, with RQTR being relatively more consistent. RQTR polarity is the same in the two sample corpora for 85.7 per cent of terms. [17] Keyness (or its lack) corresponds in both pilot corpora in 75 per cent and 78.6 per cent of the cases, in the keyword comparisons with BNC-S and BNC-N respectively.

However, we also need to examine whether the low correspondence between keyness and relevance is due to the threshold set for keyness (LL ≥ 15.13, p ≤ $10^{-4}$), as opposed to the more frequently used lower threshold of LL ≥ 6.63, p ≤ $10^{-2}$ (McEnery 2006: 233, n. 32). To this end, the same comparisons discussed above were carried out with the lower keyness value, in order to establish whether the lower threshold would increase the correspondence of keyness with relevance. As Tables 11 to 15 show, this does not seem to be the case. For candidate terms that are key in both, or at least one, comparison (Tables 11 and 13 respectively), the correspondence seems to mostly decline with the lower keyness threshold. It is significantly higher for terms that are key in only one comparison (Table 12); however, the increase from 18.2 per cent to 50 per cent in UK1 only reflects the correspondence in two terms. Also, in no instance does the correlation go above half of the cases, and it is overall no more than one-third (Table 13). Predictably, the correspondence between lack of keyness and lack of relevance increases with a lower threshold for keyness (Table 14).
Consistency of keyness between different reference corpora shows some increase in only one corpus, and is never higher than the consistency of RQTR (Table 15).

Table 11: Correlations: keyness in both comparisons and positive RQTR
Threshold | UK1 | UK6
LL ≥ 15.13 | 34.8% | 35.3%
LL ≥ 6.63 | 30.3% | 32.5%

Table 12: Correlations: keyness in one comparison and positive RQTR
Threshold | UK1 | UK6
LL ≥ 15.13 | 18.2% | 0%
LL ≥ 6.63 | 50% | 40%

Table 13: Correlations: keyness in at least one comparison and positive RQTR
Threshold | UK1 | UK6
LL ≥ 15.13 | 29.4% | 28.2%
LL ≥ 6.63 | 28.9% | 33.3%

Table 14: Correlations: lack of keyness and negative RQTR
Threshold | UK1 | UK6
LL ≥ 15.13 | 85.7% | 94.1%
LL ≥ 6.63 | 88.9% | 100%

Table 15: Consistency of keyness and relevance respectively
Threshold | Keyness (BNC-S) | Keyness (BNC-N) | Relevance
LL ≥ 15.13 | 75.0% | 78.6% | 85.7%
LL ≥ 6.63 | 85.7% | 78.6% | 85.7%

One reason for the discrepancy between the keyness and relevance of the terms examined is, arguably, that texts in the reference corpora predate those in the sample corpora, and, as a result, some words (e.g. names of politicians) will establish keyness irrespective of their relevance to the core query terms. Another possible reason is that some terms, although clearly associated with the core query (e.g. Palestinians), are not central to the perception/presentation of the groups in focus. More generally, discrepancies may also be the result of the nature of the two measures. Keyword analysis regards the pilot and reference corpora as single documents, whereas RQTR looks at co-occurrence within individual texts in the pilot corpus. In that respect, calculating RQTR is not unlike calculating key-keywords (see section 2). However, RQTR bypasses the problem of time- or genre-specific keywords, as it does not depend on a reference corpus.
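For readers who want to see what the two keyness thresholds operate on, here is a minimal sketch of the two-corpus log-likelihood statistic in its commonly used form (observed frequencies compared with frequencies expected from the combined corpus). It is an illustration under that assumption and is not guaranteed to match the implementation in WordSmith Tools; the frequencies and corpus sizes are hypothetical, and the sketch does not check whether the word is over- or under-represented in the study corpus.

```python
import math


def log_likelihood(freq_study: int, size_study: int, freq_ref: int, size_ref: int) -> float:
    """Two-corpus log-likelihood for one word (common formulation; see lead-in)."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def is_key(ll: float, threshold: float = 15.13) -> bool:
    """Apply a keyness cut-off; 15.13 and 6.63 are the two thresholds discussed."""
    return ll >= threshold


# Hypothetical counts: 100 hits in a 10,000-word pilot corpus,
# 50 hits in a 10,000-word reference corpus.
score = log_likelihood(100, 10_000, 50, 10_000)
print(round(score, 2), is_key(score), is_key(score, threshold=6.63))
```

A word with identical relative frequency in both corpora scores zero; lowering the threshold from 15.13 to 6.63 can only add terms to the list of keywords, never remove them.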
For most of the terms with negative RQTR, it may be argued that their unsuitability for the target corpus was self-evident, and that there was little justification in investing time in examining the merits of their inclusion in the query. However, if these terms are indeed patently irrelevant, then, given the fact that some of them were (strong) keywords, this should be regarded as clear testimony to RQTR being more successful than keyness as an objective indicator of the suitability of candidate terms.

In the light of the above, the RQTR score seems to be more reliable than keyness for purposes of query expansion. Nevertheless, the combination of the two measures seems advisable for two reasons. Neither measure is in itself entirely consistent when applied to different corpora. More importantly, as the process of calculating RQTR is itself an investment in time, and given the high correlation of non-keyness to non-relevance, keyword analyses can be employed to limit the number of candidate query terms, without excluding the possibility of also considering non-key candidate terms. Ideally, then, a successful candidate term would be a high-ranking keyword with a high positive RQTR, while also being introspectively plausible (i.e. consistent with our knowledge and experience). However, if an entirely suitable reference corpus is not available, then in cases of discrepancy (i.e. lack of keyness but positive RQTR, and vice versa) the relevance score carries more weight. In the same vein, we may also add introspectively plausible terms with a positive RQTR, irrespective of their keyness, such as hijack(s), hijacker(s), illegal alien(s), illegal entry, leave to remain.
Also, it seems reasonable to add all forms of a lemma or word family [18] if a good proportion of the forms are key words with a positive RQTR score: deport* (instead of only deportation, deported, deporting), immigr* (instead of only immigrant(s), immigration), as well as emigr*, because of its semantic similarity to immigr*. Finally, part of a compound or, more generally, a meaningful n-gram may substitute for the whole compound/n-gram if it has a positive RQTR score and is found in a large number of meaningful n-grams, preferably if a good number of them are key. For example, asylum can substitute for asylum seeker(s) (see Appendices 1 and 2). Table 16 summarises the main steps involved in formulating the final query:

Table 16: Summary of main steps taken for query formulation
• Selection of a minimum of two core query terms based on a clear definition of the content of the corpus to be compiled.
• Creation of a (sample) corpus using the core query terms linked by the Boolean operator ‘OR’. [19]
• Calculation of the baseline score using QTR.
• Keyword analysis using an appropriate reference corpus (if available).
• Selection of candidate query terms among the (high-ranking) keywords, as well as through introspection.
• Selection of clearly irrelevant terms.
• Calculation of RQTR for the candidate and irrelevant terms.
• Examination of RQTR scores for final decision.

It is recognised that a query built using the techniques discussed here may exclude some relevant texts, while including some irrelevant ones. However, it must also be stressed that the compilation of a corpus containing all and only the relevant texts in the database would require reading the retrieved documents, which, given the scope of the intended corpus, would be unrealistic. It is also recognised that the procedure is not entirely objective. What can be argued is that any subjective decisions are guided, if not constrained, by objective indicators.
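The last two steps of Table 16 can be sketched as a simple triage that labels each candidate by the two objective indicators. The QTR values and keyness status below are those given for four terms in Table 4 (UK1, B = 0.112); the function names and label wording are illustrative assumptions, and the final decision on each term remains with the corpus compiler.

```python
def rqtr_scaled(qtr_score: float, baseline: float) -> float:
    """RQTR on the -100..+100 scale (Figure 4 below the baseline, Figure 6 above)."""
    if qtr_score <= baseline:
        return (qtr_score - baseline) / baseline * 100
    return (qtr_score - baseline) / (1 - baseline) * 100


def triage(qtr_by_term: dict[str, float], key_terms: set[str], baseline: float) -> dict[str, str]:
    """Label each candidate by keyness and by the polarity of its RQTR."""
    labels = {}
    for term, score in qtr_by_term.items():
        relevant = rqtr_scaled(score, baseline) > 0
        key = term in key_terms
        if relevant and key:
            labels[term] = "key and relevant"
        elif relevant:
            labels[term] = "relevant only"
        elif key:
            labels[term] = "key only"
        else:
            labels[term] = "neither"
    return labels


# QTR scores and keyness status for four terms from Table 4 (UK1).
uk1_qtr = {"deportation": 0.208, "leave to remain": 0.143, "terrorism": 0.025, "lemon": 0.006}
uk1_key = {"deportation", "terrorism"}

for term, label in triage(uk1_qtr, uk1_key, baseline=0.112).items():
    print(f"{term}: {label}")
```

Following the discussion above, terms labelled "key and relevant" are the strongest candidates, "relevant only" terms may be added if they are introspectively plausible, and "key only" terms are the ones most likely to add noise.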
Also, the involvement of subjectivity is a characteristic shared with a large number of other techniques. For instance, selecting initial seeds, deciding on the number of top keywords to include in subsequent queries, defining and assigning document index terms, weighing the relevance of a document to a query by means of user/expert reading, or setting the p value that marks statistical significance are all largely subjective decisions. In the light of this, it must be clarified that neither the baseline nor the RQTR score is necessarily binding. Corpus compilers may choose to set a higher or lower baseline score to suit their purposes; for example, they may select as the baseline the highest rather than the lowest QTR score among core query terms. Similarly, they may decide to exclude candidate terms with (low) positive RQTR scores, or include terms with (low) negative scores, depending on their circumstances and aims.

4 Conclusion

The introduction of RQTR was intended as a means of triangulating decisions on query expansion by supplementing keyness as an objective indicator of candidate query term relevance, as well as providing a way of evaluating the relevance of introspectively selected candidate terms. It must be reiterated that RQTR requires that at least two clearly relevant terms can be selected, so that a baseline can be established. However, all query-expansion procedures have similar requirements. Furthermore, it seems unlikely that a good definition of the content of a specialised corpus will not suggest the requisite minimum of two clearly relevant terms. Therefore, it can be argued that the procedure discussed here is both principled, in that objective indicators are used, and conscious, in that the process is not fully automatic.
To be more precise, the term selection conforms to one or both of the specified objective requirements, while at the same time having introspective plausibility. An important consideration when constructing/expanding a query is that any additional terms should not add undue noise. Ideally, then, suitable candidate terms would be strong key words with positive RQTR. However, it seems reasonable to add to the query other forms of the lemma or word family that a relevant term belongs to if a good proportion of these forms, or their combinations, have positive RQTR. Also, in the same way that keyness is not an absolute criterion, but depends on the maximum p value that is considered acceptable for statistical significance, the baseline score can be adjusted according to the corpus compilers’ needs.

RQTR will also be a suitable technique on its own in other instances, particularly when an appropriate reference corpus (for the calculation of keywords) is not available. The RQTR score may be independent of reference corpora, but depends, to some extent, on the sample corpus; however, it is more consistent than keyness in that respect. Also, it disposes of the need to know the total number of documents in the database, and the need to manually examine retrieved documents. While RQTR bypasses the restrictions usually posed by database interfaces, the use of the technique is not limited to restricted-access text databases; on the contrary, the reliability of the RQTR score should increase as the access restrictions decrease. Finally, the procedures and calculations involved are expected to be accessible to all linguists or language educators who might want to build a specialised corpus drawing texts from a database.

Notes

* I am grateful to Paul Baker (Lancaster University) for his comments on a previous draft.
I would also like to thank Sebastian Hoffmann (Lancaster University) for the newspaper sub-corpus of the BNC, and the participants of the meeting of the Lancaster University Corpus Research Group on 30 October 2006 for their comments on an earlier version of this paper.

1. Funded by the ESRC (RES-000-22-1381); principal investigator: Paul Baker.

2. Keywords are those words which are statistically significantly more frequent in the corpus under analysis when compared to another corpus (Rayson and Garside 2000; Scott 2001).

3. http://www.refugeecouncil.org.uk/practice/basics/truth.htm

4. In fact, the term terrorism is one of the highest ranking keywords with LL scores of 160.3 ($p < 10^{-15}$) and 339.8 ($p < 10^{-16}$) in two pilot sub-corpora (see section 2 for details).

5. Examples of metaphorical uses of refugee(s): “After another half-hour, they packed us into a smaller train that crawled back almost to Falkirk before halting for another hour, because the station ahead was overrun by refugees from a London express unable to reach Edinburgh.” (The Daily Mail, 24 July 1998). “But Spurs – hopeless, hapless and complacent beyond belief – defended with all the savvy of refugees from a greasy spoon café” (The Mirror, 2 December 1999). “Ironically, extending the minimum wage to 16- and 17-year-olds may well keep them out of the workforce. Many employers will decide that illiterate refugees from our comprehensive schools give very poor value in comparison with Kurds, Poles and Africans” (The Daily Telegraph, 3 January 2004).

6. The keywords analysis was carried out using WordSmith Tools 4 (Scott 2004). Significance was calculated using the Log Likelihood statistic, with the minimum statistical significance set at p ≤ $10^{-4}$, LL ≥ 15.13 (see Rayson et al. 2004).

7. For example, the Google and Yahoo APIs allow a maximum of 1,000 and 5,000 queries per day respectively, and return up to 1,000 pages per query.
It is also possible that the current largely open access to databases of web pages compiled by search engines may be restricted in the future. A case in point is the recent announcement by Google that, as of 5 December 2006, they have stopped issuing new accounts for their SOAP search API. For details, see: http://code.google.com/apis/soapsearch/index.html (Google), and http://www.informit.com/articles/article.asp?p=382421&seqNum=2&rl=1 (Yahoo).

8. Forty out of the top hundred keywords are proper nouns or adjectives denoting ethnicity.

9. This is the Boolean ‘AND’.

10. The comparison of QTR scores seems to also have the potential to contribute to the semantic analysis of lexis, particularly when the meanings of the terms overlap. For example, the difference in the relevance scores of refugee(s) and asylum seeker(s) may be interpreted as indicating that people labelled as ‘asylum seekers’ tend to also be presented as, or conflated with, refugees (and, arguably, perceived as such) more often than people labelled as ‘refugees’ tend to be presented as, or conflated with, asylum seekers. More tentatively, it could be argued that the notion of ‘refugee’ is a semantic component of ‘asylum seeker’. This interpretation is supported by either of the two sets of definitions mentioned previously, as well as by the results of the collocational analysis of the two terms in the corpus (Gabrielatos and Baker 2006).

11. For a discussion of semantic prosody/preference, see also Louw (1993) and Stubbs (2002: 65–66).

12. The LL scores for word forms have been derived from WordSmith Tools; those of lemmas or groups of word forms (e.g. immigrant* and emigrant*) have been calculated manually using Paul Rayson’s online Log Likelihood Calculator (http://ucrel.lancs.ac.uk/llwizard.html).
13. Since the interest is in terms rather than word forms, and since the database interface also returns the genitive forms of nouns whichever the form used in the query (and this cannot be remedied through the use of Boolean operators), RQTR and LL scores of nouns reflect the relevance/keyness of singular, plural and genitive forms taken together. This does not influence the results, as all the combined forms registered appropriate keyness. Also, this allowed for the inclusion of more key terms.

14. The query excluded the words Deportivo (a football team) and deportment.

15. In this case, the baseline is the lowest-scoring core query term.

16. Provided, of course, that calculations refer to the same core query and database.

17. The majority (78.4%) of terms also show comparable RQTR scores.

18. “A word family consists of a base word and all its derived and inflected forms. … [T]he meaning of the base in the derived word must be closely related to the meaning of the base when it stands alone or occurs in other derived forms, for example, hard and hardly would not be members of the same word family” (Bauer and Nation 1993: 253).

19. The core query, and the queries used in establishing the baseline score, can be more complex than those appropriate for this paper. That is, what was treated as a term in the core query can itself be a Boolean query, as a wildcard is a shorthand for Boolean disjunctions. For instance, ‘refugee*’ is a shorthand for the query ‘refugee OR refugees OR refugee’s OR refugees’’. In this light, the core query ‘refugee* OR asylum seeker*’ can be more analytically written as follows: (refugee OR refugees OR refugee’s OR refugees’) OR (asylum seeker OR asylum seekers OR asylum seeker’s OR asylum seekers’). Furthermore, the brackets can contain not only disjunction (OR), but also conjunction (AND) or negation (NOT). For example, let us assume that the focus of examination was the representation of women refugees and asylum seekers.
A possible core query (using wildcards for brevity) might be the following: (refugee* AND wom*n) OR (asylum seeker* AND wom*n), which can be more simply formulated as: ‘refugee* OR asylum seeker* AND wom*n’.

References

Aston, Guy and Lou Burnard. 1998. The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.
Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. 1999. Modern information retrieval. London: Addison Wesley.
Baker, Paul. 2004. Querying keywords: Questions of difference, frequency, and sense in keywords analysis. Journal of English Linguistics 32(4): 346–359.
Baker, Paul and Tony McEnery. 2005. A corpus-based approach to discourses of refugees and asylum seekers in UN and newspaper texts. Journal of Language and Politics 4(2): 197–226.
Baroni, Marco and Silvia Bernardini. 2003. The BootCaT toolkit: Simple utilities for bootstrapping corpora and terms from the web, version 0.1.2. Available online: http://sslmit.unibo.it/~baroni/Readme.BootCaT-0.1.2.
Baroni, Marco and Silvia Bernardini. 2004. BootCaT: Bootstrapping corpora and terms from the web. LREC 2004 Proceedings, 1313–1316.
Baroni, Marco and Serge Sharoff. 2005. Creating specialized and general corpora using automated search engine queries. Paper presented at Corpus Linguistics 2005, Birmingham University, 14–17 July 2005. Available online: http://sslmit.unibo.it/~baroni/wac/serge_marco_wac_talk.slides.pdf.
Baroni, Marco, Adam Kilgarriff, Jan Pomikálek and Pavel Rychlý. 2006. WebBootCaT: Instant domain-specific corpora to support human translators. Proceedings of EAMT 2006, 247–252. Available online: http://corpora.fi.muni.cz/bootcat/publications/webbootcat_eamt2006.pdf.
Bauer, Laurie and Paul Nation. 1993. Word families. International Journal of Lexicography 6(4): 253–279.
Belew, Richard K. and John Hatton. 1996.
RAVE reviews: Acquiring relevance assessments from multiple users. In M. Hearst and H. Hirsh (eds.). Working notes of the AAAI Spring Symposium on Machine Learning in Information Access. Menlo Park, CA: AAAI Press.
Boughanem, Mohand, Yannick Loiseau and Henri Prade. 2006. Rank-ordering documents according to their relevance in information retrieval using refinements of ordered-weighted aggregations. In M. Detyniecki, J. M. Jose, A. Nürnberger and C. J. van Rijsbergen (eds.). Adaptive multimedia retrieval: User, context, and feedback. Third International Workshop, AMR 2005, Glasgow, UK, July 28–29, 2005: Revised selected papers, 44–54. Berlin: Springer. Also online: http://www.irit.fr/recherches/RPDMP/persos/Prade/Papers/BougLoiP_AMR.pdf.
Buckley, Chris, Gerard Salton and James Allan. 1994. The effect of adding relevance information in a relevance feedback environment. In W.B. Croft and C.J. van Rijsbergen (eds.). Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, 3–6 July 1994), 292–300. New York: Springer-Verlag.
Chowdhury, G.G. 2004. Introduction to modern information retrieval (2nd ed.). London: Facet Publishing.
Gabrielatos, Costas and Paul Baker. 2006. Representation of refugees and asylum seekers in UK newspapers: Towards a corpus-based analysis. Joint Annual Meeting of the British Association for Applied Linguistics and the Irish Association for Applied Linguistics (BAAL/IRAAL 2006): From Applied Linguistics to Linguistics Applied: Issues, Practices, Trends, University College, Cork, Ireland, 7–9 September 2006. Available online: http://eprints.lancs.ac.uk/265.
Ghani, Rayid, Rosie Jones and Dunja Mladenić. 2001. Mining the web to create minority language corpora. CIKM 2001, 279–286.
Greenslade, Roy. 2005. Seeking scapegoats: The coverage of asylum in the UK press. Asylum and immigration working paper 5. London: Institute for Public Policy Research.
Also available online: http://www.ippr.org/members/download.asp?f=%2Fecomm%2Ffiles%2Fwp5%5Fscapegoats%2Epdf.
Longman dictionary of contemporary English on CD-ROM. 2003. London: Longman.
Louw, William. 1993. Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies. In M. Baker, G. Francis and E. Tognini-Bonelli (eds.). Text and technology: In honour of John Sinclair, 157–176. Philadelphia and Amsterdam: John Benjamins.
McEnery, Tony. 2006. Swearing in English: Bad language, purity and power from 1586 to the present. London: Routledge.
Rayson, Paul and Roger Garside. 2000. Comparing corpora using frequency profiling. Proceedings of Workshop on Comparing Corpora (at ACL 2000), 1–6.
Rayson, Paul, Damon Berridge and Brian Francis. 2004. Extending the Cochran rule for the comparison of word frequencies between corpora. In G. Purnelle, C. Fairon and A. Dister (eds.). Le Poids des Mots. Proceedings of the 7th International Conference on Statistical Analysis of Textual Data (JADT 2004), Vol. 2, Louvain-la-Neuve, Belgium (March 10–12, 2004), 926–936. Louvain: Presses Universitaires de Louvain.
Robertson, Stephen and Karen Sparck Jones. 1976. Relevance weighting of search terms. Journal of the American Society for Information Science 27(3): 129–146.
Scott, Mike. 2001. Comparing corpora and identifying key words, collocations, and frequency distributions through the WordSmith Tools suite of computer programs. In M. Ghadessy, A. Henry and R.L. Roseberry (eds.). Small corpus studies and ELT: Theory and practice, 47–67. Amsterdam: Benjamins.
Scott, Mike. 2004. Oxford WordSmith Tools version 4. Oxford: Oxford University Press. Available online: http://www.lexically.net/downloads/version4/wordsmith.pdf.
Sinclair, John. 1991. Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, John. 2004. Appendix: How to build a corpus. In M.
Wynne (ed.). Developing linguistic corpora: A guide to good practice, 79–83. Oxford: Oxbow Books. Also online: http://www.ahds.ac.uk/creating/guides/linguistic-corpora/appendix.htm.
Stubbs, Michael. 2002. Words and phrases: Corpus studies of lexical semantics. Oxford: Blackwell.
van Leeuwen, Theo. 1996. The representation of social actors. In C-R. Caldas-Coulthard and M. Coulthard (eds.). Texts and practices. Readings in Critical Discourse Analysis, 32–70. London: Routledge.
Wilson, David. 2006. Asylum and the media in Scotland. A report on the portrayal of asylum in the Scottish media undertaken by the Oxfam Asylum Positive Images Network and Glasgow Caledonian University. Available online: http://oxfamgb.org/ukpp/resources/downloads/asylum_media_scotland.pdf.
Xu, Jinxi and W. Bruce Croft. 1996. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’96), Zurich, Switzerland (August 18–22, 1996). ACM Press, New York, NY, 4–11. DOI: http://doi.acm.org/10.1145/243199.243202.

Appendix 1: Meaningful asylum+noun bigrams in UK6 (in alphabetical order)

asylum+noun bigrams   Freq.
asylum abuses         4
asylum act            16
asylum advice         2
asylum appeals        9
asylum applicant      2
asylum applicants     18
asylum application    40
asylum applications   139
asylum backlog        6
asylum based          2
asylum bid            4
asylum bids           6
asylum bill           62
asylum camp           2
asylum campaign       2
asylum case           12
asylum cases          27
asylum centre         6
asylum centres        7
asylum chaos          2
asylum charities      2
asylum cheats         2
asylum children       2
asylum claim          35
asylum claimant       4
asylum claimants      3
asylum claims         79
asylum clampdown      2
asylum concerns       2
asylum control        2
asylum crisis         8
asylum debacle        2
asylum debate         3
asylum decisions      10
asylum detention      8
asylum door           2
asylum figures        7
asylum fraud          5
asylum fury           3
asylum hearing        2
asylum hearings       5
asylum incidents      3
asylum interview      2
asylum issue          7
asylum issues         3
asylum law            11
asylum laws           26
asylum lawyers        3
asylum league         2
asylum legislation    11
asylum myths          2
asylum option         3
asylum overhaul       2
asylum payouts        3
asylum plans          2
asylum pleas          3
asylum plot           2
asylum policies       11
asylum policy         47
asylum practices      2
asylum problem        5
asylum procedures     4
asylum process        8
asylum queue          2
asylum regime         4
asylum removal        6
asylum requests       13
asylum rights         3
asylum riot           3
asylum row            4
asylum rules          10
asylum scam           8
asylum seeking        4
asylum service        3
asylum shopping       4
asylum spongers       2
asylum statistics     2
asylum status         7
asylum support        16
asylum system         84
asylum system's       2
asylum tradition      3
asylum voucher        2
asylum-shopping       3

Appendix 2: Intransitive verb+asylum bigrams (in alphabetical order)

verb+asylum bigrams   Freq.
awaiting asylum       2
claimed asylum        58
claiming asylum       48
denied asylum         10
deny asylum           2
gain asylum           6
gaining asylum        2
give asylum           9
given asylum          8
gives asylum          3
grant asylum          9
granted asylum        45
granting asylum       2
grants asylum         3
have asylum           3
having asylum         2
obtained asylum       2
offer asylum          2
refuse asylum         4
refused asylum        23
requested asylum      9
seek asylum           38
seeking asylum        102
seeks asylum          2
sought asylum         23
want asylum           10
wins asylum           3