Exploration of a Balanced Reference Corpus with a Wide Variety of Text Mining Tools

Authors:

Nicolas Turenne,

Xiaolin Zhu

ACAI '20: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence

Article No.: 60, Pages 1 - 9

https://doi.org/10.1145/3446132.3446192

Published: 09 March 2021

Abstract

Text mining techniques are generally compared on a single platform into which the user imports a text dataset. Another approach evaluates them against a gold standard for a specific task, but a balanced corpus of common language is rarely used. We choose the Corpus of Contemporary American English (COCA) as a balanced reference corpus and split it into categories, such as topics and genres, to which we apply families of feature extraction and machine learning algorithms. We found that the Stanford CoreNLP method was faster and more accurate than the NLTK method, and was also more reliable and easier to understand. The clustering results show that higher modularity influences interpretation. For genre and topic classification, all techniques achieved relatively high scores, though these were below the state-of-the-art scores reported on challenge text datasets. Naïve Bayes outperformed the other alternatives. We hope that balanced corpora from a variety of vernacular (or low-resource) languages can be used as references to determine the efficiency of the wide diversity of state-of-the-art text mining tools.
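
As an illustration of the kind of genre classification the abstract refers to, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing. It is not the authors' implementation: the function names `train_naive_bayes` and `predict`, the whitespace tokenization, and the four toy sentences with their genre labels are all assumptions made for the example and are not taken from COCA.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class (hypothetical helper)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # naive whitespace tokenization
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest smoothed log-posterior."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothed log likelihood
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data; a real experiment would use documents from the corpus.
TOY_DOCS = [
    "the senate passed the budget bill today",
    "the president signed the trade bill",
    "we measured the protein samples in the lab",
    "the experiment results confirm the hypothesis",
]
TOY_LABELS = ["newspaper", "newspaper", "academic", "academic"]

model = train_naive_bayes(TOY_DOCS, TOY_LABELS)
print(predict(model, "the senate debated the bill"))
print(predict(model, "the lab experiment measured samples"))
```

On this toy data the first query is assigned to the newspaper class and the second to the academic class. With real corpus documents, the feature extraction step (tokenization, stop-word handling, weighting) would matter far more than it does here.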

Published In

ACAI '20: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence

December 2020

576 pages

ISBN:9781450388115

DOI:10.1145/3446132

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

BNU HKBU United International College (UIC)

Conference

ACAI 2020

ACAI 2020: 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence

December 24 - 26, 2020

Sanya, China

Acceptance Rates

Overall Acceptance Rate 173 of 395 submissions, 44%
