Abstract
As language evolves over time, documents stored in long-term archives become inaccessible to users. Automatically detecting and handling language evolution will become a necessity to meet users' information needs. In this paper, we investigate the performance of modern tools and algorithms applied to modern English to find word senses that will later serve as a basis for finding evolution. We apply the curvature clustering algorithm to all nouns and noun phrases extracted from The Times Archive (1785–1985). We use natural language processors for part-of-speech tagging and lemmatization and report on the performance of these processors over the entire period. We evaluate our clusters using WordNet to verify whether they correspond to valid word senses. Because The Times Archive contains OCR errors, we investigate the effects of such errors on word sense discrimination results. Finally, we present a novel approach to correct OCR errors present in the archive and show that the coverage of the curvature clustering algorithm improves: the number of clusters increases by 24%. To verify our results, we use the New York Times corpus (1987–2007), a recent collection that is considered error-free, as a ground truth for our experiments. We find that after correcting OCR errors in The Times Archive, the performance of word sense discrimination applied to The Times Archive is comparable to the ground truth.
Acknowledgments
We would like to thank Times Newspapers Limited for providing the archive of The Times, London for our research. A special thanks to Gertrud Erbach for her valuable contributions. This work is partly funded by the European Commission under LiWA (IST 216267) and ARCOMEM (IST 270239).
Appendices
Appendix A: Calculating BigramBoost
When calculating the final score (see Formula 3, Sect. 5.6) of a correction proposal for a given term \(w\), we take the context of the term into consideration. This is done using the variable \(bigramBoost\).
First, we form two bigrams from the term \(w\) and its neighbours in the text: one with the preceding term (left bigram) and one with the following term (right bigram). We permute the bigrams using the correction proposals for the terms in the left and the right bigrams and add up their frequencies to find \(bigramBoost\) for \(w\).
Let \(w_i\) be the \(i\)-th word of the text. Furthermore, let \(c_j^i\) be the \(j\)-th correction proposal of the word \(w_i\) and \(|c^i|\) the number of correction proposals corresponding to \(w_i\). Then there exist two ways to form a bigram including the word \(w_i\):
1. The left bigrams \(b^{i-}_{jk}\) are formed using the proposals of the words \(w_{i-1}\) and \(w_i\).
2. The right bigrams \(b^{i+}_{jk}\) are formed using the proposals of the words \(w_i\) and \(w_{i+1}\).
The bigrams are defined as follows, using ␣ as white-space:
1. \(b^{i-}_{jk} = \{ c_{j}^{i-1} \circ \textvisiblespace \circ c_{k}^{i}, \ 1 \le j \le \min (5, |c^{i-1}|), \ 1 \le k \le \min (5, |c^{i}|)\}\)
2. \(b^{i+}_{jk} = \{c_{k}^{i} \circ \textvisiblespace \circ c_{j}^{i+1}, \ 1 \le k \le \min (5, |c^{i}|), \ 1 \le j \le \min (5, |c^{i+1}|)\}\)
To respect the context, bigrams are only generated if they are not blocked by punctuation marks. If \(w_{i-1}\) ends with a punctuation mark, the left bigrams do not contribute to the final score. Analogously, if \(w_{i}\) ends with a punctuation mark, the right bigrams do not contribute. We consider all non-digit and non-alphabetic characters as punctuation marks.
Once the bigrams are created, they are queried against the Anagram Hash to retrieve their occurrence count. The frequency of the bigram \(b\) in the Anagram Hash is denoted as \(f(b)\). The resulting \(bigramBoost^i_k\) for the correction proposal \(c^i_k\) corresponding to the word \(w_i\) is computed as follows:
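In line with the description above, where the frequencies of the permuted left and right bigrams are added up, the sum can be written as shown here. This is a reconstruction from the surrounding definitions and should be read as a sketch: it assumes that a side blocked by punctuation, or missing at a text boundary, simply contributes zero.

\[ bigramBoost^i_k = \sum_{j=1}^{\min (5, |c^{i-1}|)} f\bigl(b^{i-}_{jk}\bigr) + \sum_{j=1}^{\min (5, |c^{i+1}|)} f\bigl(b^{i+}_{jk}\bigr) \]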
The \(bigramBoost\) value is used to compute the final score of a correction proposal for a term \(w\) according to Formula 3.
Once the final proposal list is obtained, the highest-ranking term is chosen as the automatic replacement for the target term. The remaining terms in the list can be used for a semi-automatic correction strategy.
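The bigram step of Appendix A can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the function names, the plain dictionary `bigram_freq` standing in for the Anagram Hash lookup \(f(b)\), and the sample frequencies are all assumptions made for illustration. The sketch sums the frequencies of the left and right bigrams for each of the top five proposals and skips a side that is blocked by punctuation.

```python
from typing import Dict, List, Sequence

MAX_PROPOSALS = 5  # the appendix considers at most 5 proposals per word


def ends_with_punctuation(token: str) -> bool:
    """A trailing character that is neither a letter nor a digit counts as punctuation."""
    return bool(token) and not token[-1].isalnum()


def bigram_boost(
    tokens: Sequence[str],
    proposals: Sequence[List[str]],
    i: int,
    bigram_freq: Dict[str, int],
) -> List[int]:
    """Return one boost value per correction proposal of tokens[i].

    tokens      -- the words of the text as they appear (w_i)
    proposals   -- proposals[i] is the ranked proposal list for tokens[i]
    bigram_freq -- hypothetical stand-in for the Anagram Hash frequency lookup
    """
    own = proposals[i][:MAX_PROPOSALS]
    boosts = [0] * len(own)

    # Left bigrams: proposal of w_{i-1}, a space, proposal of w_i.
    # Skipped if w_{i-1} ends with a punctuation mark.
    if i > 0 and not ends_with_punctuation(tokens[i - 1]):
        for left in proposals[i - 1][:MAX_PROPOSALS]:
            for k, candidate in enumerate(own):
                boosts[k] += bigram_freq.get(f"{left} {candidate}", 0)

    # Right bigrams: proposal of w_i, a space, proposal of w_{i+1}.
    # Skipped if w_i itself ends with a punctuation mark.
    if i + 1 < len(tokens) and not ends_with_punctuation(tokens[i]):
        for right in proposals[i + 1][:MAX_PROPOSALS]:
            for k, candidate in enumerate(own):
                boosts[k] += bigram_freq.get(f"{candidate} {right}", 0)

    return boosts


# Invented example data: an OCR-damaged phrase and made-up bigram counts.
tokens = ["tbe", "houfe", "was,", "old"]
proposals = [["the", "tbe"], ["house", "horse"], ["was"], ["old"]]
bigram_freq = {"the house": 120, "the horse": 40, "house was": 60, "horse was": 15}

print(bigram_boost(tokens, proposals, 1, bigram_freq))  # [180, 55]
```

In this invented example, the proposal house receives the larger boost because both of its bigrams are frequent. For the last token, the left side is blocked by the comma after "was," and there is no right neighbour, so its boost is zero.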
Appendix B: Example clusters
We present some sample clusters from The Times Archive as well as the NYT corpus. Due to repetitions, the clusters shown in this Appendix are sampled from all clusters mentioning each term, and only a limited number of terms are shown for each cluster. In both cluster sets, we find that the number of terms in each cluster increases over time. Note that the clusters displayed here do not trace the evolution of each term as a whole, only its use as mentioned in The Times Archive.
In Table 3, we see clusters for the term flight. The displayed clusters show that flight has several senses and that clusters for the same sense are mostly grouped together. Between 1826 and 1832, there are six clusters (only two of them displayed here) that all refer to the company Flight & Robson, which built church (finger) organs. Three decades later, between 1867 and 1894, there are five clusters that all refer to hurdle races. Between 1938 and 1957, the clusters refer to cricket, with the terms in the clusters referring to the ball. Starting from 1973, the clusters correspond to the modern sense of flight as a means of travel, especially for holidays. The introduction of terms such as pocket money, visa and accommodation differentiates the later clusters from the earlier ones.
In Table 4, we see some selected clusters for the term computer. The first clusters reveal the computer as a tool for work, with terms like spreadsheet, database, printer and language translator. Over time, the clusters reveal the computer as an everyday tool for entertainment, with terms like game, home shopping, commercial, movie and communication. We can also find terms that are now much less frequently used, such as cdrom, vcr and qvc, and infer their meaning and context from the surrounding terms.
In Table 5, we see some selected clusters corresponding to the term travel. We can see that the concept of travel changes over time. In the 19th century, it referred primarily to books and was not an everyday activity for ordinary people. In the early 20th century, the concept changes and travel becomes more common. With the introduction of terms like sightseeing, full board, good hotel and fishing, as well as locations for travel, the concept of travel clearly becomes more concrete rather than something only available through books.
A similar shift in concept can also be seen in clusters concerning travellers. In Table 6, we see that the type of people who travelled changes. The first two clusters containing the term yellow admiral refer to the classic “The Wags, or the Camp of Pleasure” by Charles Dibdin. As with the senses of travel, the traveller transforms from being a salesman, clerk or merchant to being more concrete, with terms like visa, passport, ticket and commuter. In all our clusters, businessmen seem to be highly represented.
Cite this article
Tahmasebi, N., Niklas, K., Zenz, G. et al. On the applicability of word sense discrimination on 201 years of modern english. Int J Digit Libr 13, 135–153 (2013). https://doi.org/10.1007/s00799-013-0105-8
DOI: https://doi.org/10.1007/s00799-013-0105-8