research-article

Big data for Natural Language Processing

Authors: Rodrigo Agerri, Zuhaitz Beloki, Aitor Soroa

Knowledge-Based Systems, Volume 79, Issue C

Pages 36 - 42

https://doi.org/10.1016/j.knosys.2014.11.007

Published: 01 May 2015

Abstract

Requirements in computational power have grown dramatically in recent years. This is also the case in many language processing tasks, due to the overwhelming and ever-increasing amount of textual information that must be processed in a reasonable time frame. This scenario has led to a paradigm shift in the computing architectures and large-scale data processing strategies used in the Natural Language Processing field. In this paper we present a new distributed architecture and technology for scaling up text analysis by running a complete chain of linguistic processors on several virtual machines. We also describe a series of experiments carried out to analyze the scaling capabilities of the language processing pipeline used in this setting. We explore the use of Storm in a new approach to scalable distributed language processing across multiple machines, and evaluate its effectiveness and efficiency when processing documents on a medium and large scale. The experiments show that there is considerable room for improvement in language processing performance when adopting parallel architectures, and that we might expect even better results with large clusters with many processing nodes.
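The idea of running a complete chain of linguistic processors over many documents in parallel can be illustrated with a minimal sketch. It is not the authors' system and does not use Storm: the two processors (`tokenize`, `tag_capitalized`), the `PIPELINE` list and the `process_corpus` helper are hypothetical names chosen for illustration, and a local thread pool stands in for the several virtual machines mentioned in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for linguistic processors (hypothetical, not the paper's modules).
def tokenize(doc):
    # Split the raw text into whitespace-delimited tokens.
    return {**doc, "tokens": doc["text"].split()}

def tag_capitalized(doc):
    # Naive "entity" tagger: keep tokens that start with an uppercase letter.
    return {**doc, "entities": [t for t in doc["tokens"] if t[:1].isupper()]}

# The chain of processors; each stage adds one annotation layer to the document.
PIPELINE = [tokenize, tag_capitalized]

def run_pipeline(doc):
    # Apply every processor in order to a single document.
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

def process_corpus(texts, workers=4):
    # Documents are independent of each other, so the whole chain can be
    # applied to several of them at once. Threads are an assumption made to
    # keep the sketch self-contained; they stand in for separate machines.
    docs = [{"id": i, "text": t} for i, t in enumerate(texts)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input documents.
        return list(pool.map(run_pipeline, docs))
```

Calling `process_corpus(["Storm runs in Bilbao"])` returns one annotated document with a `tokens` layer and an `entities` layer. In a real deployment each stage would be a full linguistic processor and the work would be spread over several machines rather than local threads.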





Published In

Knowledge-Based Systems, Volume 79, Issue C

May 2015

116 pages

ISSN: 0950-7051

Copyright © Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands


