Raslan Skell
2 authors, including:
Vít Baisa
Masaryk University
All content following this page was uploaded by Vít Baisa on 15 April 2015.
1 Introduction
There are many websites for language learners: wordreference.com¹ and Using
English² are just two of many. Some of them, such as Linguee³, Wordnik⁴ and
bab.la⁵, use corpus tools or corpus data. They usually provide dictionary-like
features: definitions and translation equivalents in selected languages. Some of
them even provide examples from parallel corpora (Linguee).
We introduce here a new web interface aimed at teachers and students of
English. It offers functions similar to those of the above-mentioned tools, but
it is based on specially processed corpus data suitable for language
learning.
We call it SkELL: Sketch Engine for Language Learning. The Sketch Engine⁶
is a state-of-the-art web-based tool for building, managing and exploring large
1 http://www.wordreference.com/
2 http://www.usingenglish.com/
3 http://www.linguee.com/
4 http://www.wordnik.com/
5 http://en.bab.la/
6 http://www.sketchengine.co.uk
Aleš Horák, Pavel Rychlý (Eds.): Proceedings of Recent Advances in Slavonic Natural Language Processing,
RASLAN 2014, pp. 63–70, 2014. © NLP Consulting 2014
64 Vít Baisa and Vít Suchomel
2 Features of SkELL
SkELL offers three ways of exploring the SkELL corpus. The first is
the concordance: for a given word or phrase, it returns up to 40 example
sentences. The second is the word sketch, through which typical collocates of
a given word can be discovered. The third is similar words (thesaurus),
which lists words that are similar to, though not necessarily synonymous with,
a search word. The similar words are visualized with a word cloud. The web
interface is optimized for mobile and touch devices.
SkELL features are built upon the features of the Bonito corpus manager [5].
Like many other corpus managers, Bonito provides standard functions such as
concordancing, word list generation and context statistics, as well as some
advanced features such as a distributional thesaurus [6] and word sketches [7].
We have chosen these three: 1) concordance, 2) word sketch and 3) thesaurus
(similar words).
The search is case-insensitive, i.e. it yields the same results for rutherford
and Rutherford. Moreover, the results may contain the query (a word or a phrase)
in a derived word form. For mouse (a lemma), it also finds sentences with
mice. For mice, the result contains a different set of sentences: only
occurrences of mice.
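The matching behaviour described above can be illustrated with a minimal
sketch. It assumes that every token is stored together with its lemma; the toy
corpus, the Token type and the search function are hypothetical and are not
part of SkELL or Bonito. A query that equals a lemma matches every form of that
lemma, a query that is only a word form matches that form alone, and all
comparisons ignore case. The limit of 40 sentences follows the description of
the concordance.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # surface form as it appears in the text
    lemma: str  # basic (dictionary) form


# Hypothetical toy corpus: each sentence is a list of annotated tokens.
SENTENCES = [
    [Token("The", "the"), Token("mouse", "mouse"), Token("ran", "run")],
    [Token("Mice", "mouse"), Token("like", "like"), Token("cheese", "cheese")],
    [Token("Rutherford", "Rutherford"), Token("won", "win")],
]


def search(query: str, sentences=SENTENCES, limit: int = 40) -> list[str]:
    """Return up to `limit` sentences containing the query.

    A token matches if the lower-cased query equals its lemma or its
    word form, so a lemma query also finds derived forms.
    """
    q = query.lower()
    hits = []
    for sent in sentences:
        if any(q == t.lemma.lower() or q == t.word.lower() for t in sent):
            hits.append(" ".join(t.word for t in sent))
    return hits[:limit]


if __name__ == "__main__":
    print(search("mouse"))  # sentences with "mouse" and with "Mice"
    print(search("mice"))   # only the sentence with "Mice"
```

In this sketch a word that is both a lemma and a form of another lemma would
match on either criterion; how SkELL resolves such cases is not described here.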
Users do not need to specify the part of speech (PoS, e.g. noun, verb,
adjective, preposition, adverb etc.) of the search term. If you
search for book, it returns sentences with book as a verb and as a noun, both
in various word forms (booking, booked, books).
The word sketch function is very useful for discovering collocates and for
studying the contextual behaviour of words. Collocates of a word are words which
frequently occur together with the word: they “co-locate” with it. See
[7] for more information.
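As a rough illustration of what a collocate is, the following sketch counts the
words that occur near a headword in a toy corpus. This is a deliberate
simplification and not the word sketch algorithm: word sketches [7] group
collocates by kind and rank them, whereas this example only counts raw
co-occurrences within a fixed window. The toy sentences, the function name
collocates and the window size are assumptions made for the example.

```python
from collections import Counter


def collocates(sentences, headword, window=2):
    """Count words occurring within `window` tokens of `headword`.

    Simplification: whitespace tokenisation, no lemmatisation,
    raw counts instead of a statistical association score.
    """
    counts = Counter()
    head = headword.lower()
    for sent in sentences:
        tokens = [t.lower() for t in sent.split()]
        for i, tok in enumerate(tokens):
            if tok != head:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


# Hypothetical toy corpus.
TOY = [
    "click the left mouse button",
    "the mouse button is stuck",
    "a field mouse ate the cheese",
]

if __name__ == "__main__":
    print(collocates(TOY, "mouse").most_common(2))
    # [('the', 3), ('button', 2)]
```

Frequent function words such as "the" dominate raw counts, which is one reason
why practical tools rank collocates with an association score rather than by
frequency alone.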
For the query mouse, SkELL generates several tables containing collocates of
the headword mouse. The table headers describe what kind of collocates (always
in basic word form) they contain.
Clicking on a collocate shows a concordance with the headword and the
collocate highlighted. This way it can be seen how the two words are
usually used together in English.
By default, the most frequent PoS is shown in word sketches. If a word
(book, fast, key, ...) can have more than one PoS, links to the alternatives are
shown next to the headword (see Figure 2).
3 SkELL corpus
SkELL uses a large text collection, the SkELL corpus, gathered specially for
the purpose of English language learning. It consists of texts from news,
academic papers, Wikipedia articles, open-source fiction and non-fiction books,
web pages, discussion forums, blogs etc. There are more than 60 million
sentences in the SkELL corpus and more than one billion words in total. This
amount of textual data provides sufficient coverage of everyday, standard,
formal and professional English.
In the following subsections we describe the most important data resources
which have been used in building the SkELL corpus and the processing of the
data.
Project Gutenberg¹² (PG) focuses on gathering public domain texts in many
languages. The majority of the texts are in English. We have downloaded all
English texts using wget¹³ and converted the HTML files to plain text.
The largest texts in the English PG collection are The Memoires of Casanova, The
Bible (Douay-Rheims version), The King James Bible, Maupassant’s Original short
stories, Encyclopaedia Britannica, etc.
We have prepared two subsets of enTenTen14 [8], which was crawled
in 2014. The White (bigger) part contains only documents from web domains
listed in dmoz.org or in the whitelist of urlblacklist.com. The Superwhite
(smaller) part contains only documents from domains listed in the whitelist of
urlblacklist.com. It is a subset of White, prepared in case there is still some
spam in the larger part taken from dmoz.org.
The following categories from dmoz.org are allowed in the
Superwhite part: 1) blog: journal/diary websites, 2) childcare: sites to do with
childcare, 3) culinary: sites about cooking, 4) entertainment: sites that promote
movies, books, magazines, humor, 5) games: game related sites, 6) gardening:
gardening sites, 7) government: military and schools etc., 8) homerepair: sites
about home repair, 9) hygiene: sites about hygiene and other personal grooming
related stuff, 10) medical: medical websites, 11) news: news sites, 12) pets:
pet sites, 13) radio: non-news related radio and television, 14) religion: sites
promoting religion, 15) sportnews: sport news sites, 16) sports: all sport sites,
17) vacation: sites about going on holiday, 18) weather: weather news sites
and weather related and 19) whitelist: sites specifically 100% suitable for kids.
In the end, we decided to include the whole White part. It contained 1.6 billion
tokens.
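The split into the White and Superwhite parts can be pictured as a simple
domain filter. The following is only a sketch under stated assumptions: the two
domain sets are made-up stand-ins for the real lists from dmoz.org and
urlblacklist.com, the function names are hypothetical, and the actual
implementation used for the corpus is not described here.

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for the real domain lists.
DMOZ_DOMAINS = {"example.org", "news.example.com"}
URLBLACKLIST_WHITELIST = {"kids.example.net", "news.example.com"}


def domain_of(url: str) -> str:
    """Return the host name of a URL in lower case, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_superwhite(url: str) -> bool:
    """Superwhite: the domain is on the urlblacklist.com whitelist."""
    return domain_of(url) in URLBLACKLIST_WHITELIST


def in_white(url: str) -> bool:
    """White: the domain is in dmoz.org or on the urlblacklist.com whitelist."""
    domain = domain_of(url)
    return domain in DMOZ_DOMAINS or domain in URLBLACKLIST_WHITELIST


if __name__ == "__main__":
    for url in ("http://www.example.org/a", "https://kids.example.net/x",
                "http://spam.example.biz/"):
        print(url, in_white(url), in_superwhite(url))
```

By construction every Superwhite document is also a White document, which
mirrors the subset relation between the two parts.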
11 http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
12 http://www.gutenberg.org/
13 http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages
Since the SkELL corpus may change in the future (it may be further cleaned,
refined or updated), all references to particular SkELL results should be
accompanied by the current version. The web interface may also change
occasionally. That is why there is a version string at the bottom of the SkELL
page in the format version1-version2. The first part corresponds to the version
of the web interface and the second to the version of the SkELL corpus.
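When SkELL results are recorded for later citation, the version string can be
split into its two parts. This is a minimal sketch, assuming that the parts are
separated by the first hyphen and that neither part contains a hyphen itself.
The function name and the example value 1.2-3.4 are made up, and the real
version strings may look different.

```python
def parse_skell_version(text: str) -> tuple[str, str]:
    """Split 'version1-version2' into (interface_version, corpus_version)."""
    interface, sep, corpus = text.strip().partition("-")
    if not sep or not interface or not corpus:
        raise ValueError(f"not in version1-version2 format: {text!r}")
    return interface, corpus


if __name__ == "__main__":
    interface, corpus = parse_skell_version("1.2-3.4")  # made-up value
    print(f"interface {interface}, corpus {corpus}")
```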
References
1. Suchomel, V., Pomikálek, J., et al.: Efficient web crawling for large text corpora. In:
Proceedings of the seventh Web as Corpus Workshop (WAC7). (2012) 39–43
2. Michelfeit, J., Pomikálek, J., Suchomel, V.: Text Tokenisation Using unitok. In Horák,
A., Rychlý, P., eds.: 8th Workshop on Recent Advances in Slavonic Natural Language
Processing, Brno, Tribun EU (2014) 71–75
3. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings
of the international conference on new methods in language processing. Volume 12.,
Manchester, UK (1994) 44–49
4. Kilgarriff, A., Husák, M., McAdam, K., Rundell, M., Rychlý, P.: GDEX: Automatically
finding good dictionary examples in a corpus. In: Proceedings of EURALEX.
Volume 8. (2008)
5. Rychlý, P.: Manatee/Bonito - a modular corpus manager. In: 1st Workshop on
Recent Advances in Slavonic Natural Language Processing (2007) 65–70
6. Rychlý, P., Kilgarriff, A.: An efficient algorithm for building a distributional thesaurus
(and other sketch engine developments). In: Proceedings of the 45th Annual
Meeting of the ACL on Interactive Poster and Demonstration Sessions, Association
for Computational Linguistics (2007) 41–44
7. Kilgarriff, A., Rychlý, P., Smrž, P., Tugwell, D.: The Sketch Engine. Information
Technology 105 (2004)
8. Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., Suchomel, V., et al.: The tenten
corpus family. In: Proc. Int. Conf. on Corpus Linguistics. (2013)
9. Baroni, M., Kilgarriff, A., Pomikálek, J., Rychlý, P., et al.: Webbootcat: instant domain-
specific corpora to support human translators. In: Proceedings of EAMT. (2006)
10. Leech, G.: 100 million words of English: the British National Corpus (BNC). Language
Research 28(1) (1992) 1–13
11. Kilgarriff, A., Rychlý, P., Kovář, V., Baisa, V.: Finding multiwords of more than two
words. Proceedings of EURALEX2012 (2012)