research-article

Big data for Natural Language Processing

Published: 01 May 2015

Abstract

Requirements in computational power have grown dramatically in recent years. This is also the case in many language processing tasks, due to the overwhelming and ever-increasing amount of textual information that must be processed in a reasonable time frame. This scenario has led to a paradigm shift in the computing architectures and large-scale data processing strategies used in the Natural Language Processing field. In this paper we present a new distributed architecture and technology for scaling up text analysis by running a complete chain of linguistic processors on several virtual machines. We also describe a series of experiments carried out to analyze the scaling capabilities of the language processing pipeline used in this setting. We explore the use of Storm in a new approach for scalable distributed language processing across multiple machines and evaluate its effectiveness and efficiency when processing documents on a medium and large scale. The experiments show that there is considerable room for improvement in language processing performance when adopting parallel architectures, and that we might expect even better results with the use of large clusters with many processing nodes.
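The idea of running a complete chain of processors over many documents in parallel can be illustrated with a minimal Python sketch. It is not the architecture described in the paper and does not use Storm: the three processors (`tokenize`, `tag_capitalized`, `count`) are hypothetical toy stand-ins for real linguistic modules, and a local thread pool stands in for the several virtual machines.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Toy stand-ins for linguistic processors (assumptions, not the paper's tools).
# Each one takes an annotation dict and returns an enriched copy, so the
# processors can be chained in a fixed order.
def tokenize(doc):
    return {**doc, "tokens": doc["text"].split()}

def tag_capitalized(doc):
    # Hypothetical "tagger": keeps tokens that start with an uppercase letter.
    return {**doc, "caps": [t for t in doc["tokens"] if t[:1].isupper()]}

def count(doc):
    return {**doc, "n_tokens": len(doc["tokens"]), "n_caps": len(doc["caps"])}

PIPELINE = [tokenize, tag_capitalized, count]

def run_pipeline(doc):
    """Apply every processor, in order, to a single document."""
    return reduce(lambda d, step: step(d), PIPELINE, doc)

def process_corpus(texts, workers=4):
    """Run the full chain on each document, spreading documents over workers."""
    docs = [{"id": i, "text": t} for i, t in enumerate(texts)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input documents.
        return list(pool.map(run_pipeline, docs))

results = process_corpus(
    ["Storm runs on several machines", "the chain is complete"]
)
```

Because each document passes through the chain independently of the others, the work can be divided by document. That independence is the property a distributed setup would exploit, with worker nodes taking the place of the threads used here.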

Published In

Knowledge-Based Systems, Volume 79, Issue C
May 2015
116 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Big data
  2. Distributed NLP architectures
  3. NLP tools
  4. Natural Language Processing
  5. Storm
