research-article

Computing n-gram statistics in MapReduce

Authors:

Klaus Berberich,

Srikanta BedathurAuthors Info & Claims

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

Pages 101 - 112

https://doi.org/10.1145/2452376.2452389

Published: 18 March 2013 Publication History

Abstract

Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-σ that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete Map-Reduce implementation, we provide insights on an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.

References

[1]

Apache Hadoop http://hadoop.apache.org/.

[2]

Apache OpenNLP http://opennlp.apache.org/.

[3]

Berkeley DB Java Edition http://www.oracle.com/products/berkeleydb/.

[4]

Boilerpipe http://code.google.com/p/boilerpipe/.

[5]

Google n-Gram Corpus http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html.

[6]

The ClueWeb09 Dataset http://lemurproject.org/clueweb09.

[7]

The New York Times Annotated Corpus http://corpus.nytimes.com.

[8]

R. Agrawal et al. Mining association rules between sets of items in large databases. SIGMOD 1993

Digital Library

[9]

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. VLDB 1994.

Digital Library

[10]

R. Agrawal and R. Srikant. Mining sequential patterns. ICDE 1995.

Digital Library

[11]

S. Banerjee and T. Pedersen. The design, implementation, and use of the ngram statistics package. CICLing 2003.

Digital Library

[12]

Y. Bernstein and J. Zobel. Accurate discovery of co-derivative documents via duplicate text detection. Inf. Syst., 31(7):595--609, 2006.

Digital Library

[13]

T. Brants et al. Large language models in machine translation. EMNLP-CoNLL 2007.

[14]

A. Ceglar and J. F. Roddick. Association mining. ACM Comput. Surv., 38(2), 2006.

Digital Library

[15]

H. Ceylan and R. Mihalcea. An efficient indexer for large n-gram corpora. ACL 2011.

Digital Library

[16]

F. Chierichetti et al. Max-cover in map-reduce. WWW 2010.

Digital Library

[17]

J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI 2004.

Digital Library

[18]

M. Federico et al. Irstlm: an open source toolkit for handling large scale language models. INTERSPEECH 2008.

[19]

V. Guralnik and G. Karypis. Parallel tree-projection-based sequence mining algorithms. Parallel Computing, 30(4):443--472, 2004.

Digital Library

[20]

J. Han et al. Frequent pattern mining: current status and future directions. DMKD, 15(1):55--86, 2007.

Digital Library

[21]

J. Han et al. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. DMKD, 8(1):53--87, 2004.

Digital Library

[22]

J.-W. Huang et al. Dpsp: Distributed progressive sequential pattern mining on the cloud. PAKDD 2010.

Digital Library

[23]

S. Huston et al. Efficient indexing of repeated n-grams. WSDM 2011.

Digital Library

[24]

S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. ASSP, 35(3):400--401, 1987.

[25]

C. Kohlschütter et al. Boilerplate detection using shallow text features. WSDM 2010.

Digital Library

[26]

H. Li et al. Pfp: parallel fp-growth for query recommendation. RecSys 2008.

Digital Library

[27]

D. Lin et al. New tools for web-scale n-grams. LREC 2010.

[28]

J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with mapreduce. SIGIR 2009.

Digital Library

[29]

J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers, 2010.

Digital Library

[30]

N. R. Mabroukeh and C. I. Ezeife. A taxonomy of sequential pattern mining algorithms. ACM Comput. Surv., 43(1):3, 2010.

Digital Library

[31]

U. Manber and E. W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935--948, 1993.

Digital Library

[32]

J.-B. Michel et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 2010.

[33]

G. D. F. Morales et al. Social content matching in mapreduce. PVLDB, 4(7):460--469, 2011.

Digital Library

[34]

P. Nguyen et al. Msrlm: a scalable language modeling toolkit. MSR-TR-2007-144

[35]

A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. SIGMOD 2011.

Digital Library

[36]

S. Parthasarathy et al. Parallel data mining for association rules on shared-memory systems. Knowl. Inf. Syst., 3(1):1--29, 2001.

Digital Library

[37]

J. Pei et al. Mining sequential patterns by pattern-growth: The prefixspan approach. TKDE 2004.

Digital Library

[38]

R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT 1996.

Digital Library

[39]

A. Stolcke. Srilm - an extensible language modeling toolkit. INTERSPEECH 2002.

[40]

K. Wang et al. An Overview of Microsoft Web N-gram Corpus and Applications. NAACL-HLT 2010.

Digital Library

[41]

T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2nd edition, 2010.

Digital Library

[42]

I. H. Witten et al. Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann, 1999.

Digital Library

[43]

M. Yamamoto and K. W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Comput. Linguist., 27:1--30, March 2001.

Digital Library

[44]

M. J. Zaki. Parallel sequence mining on shared-memory machines. J. Parallel Distrib. Comput., 61(3):401--426, 2001.

Digital Library

[45]

M. J. Zaki. Spade: An efficient algorithm for mining frequent sequences. ML, 42(1/2):31--60, 2001.

[46]

C. Zhai. Statistical language models for information retrieval: a critical review. FTIR, 2:137--213, 2008.

Digital Library

Cited By

Chai C(2022)Comparison of text preprocessing methodsNatural Language Engineering10.1017/S1351324922000213(1-45)Online publication date: 13-Jun-2022
https://doi.org/10.1017/S1351324922000213
Biemann CHeyer GQuasthoff UBiemann CHeyer GQuasthoff U(2022)Maschinelle Verarbeitung von TextWissensrohstoff Text10.1007/978-3-658-35969-0_3(73-130)Online publication date: 22-May-2022
https://doi.org/10.1007/978-3-658-35969-0_3
Zhang JBi CWang YZeng TLiao BChen L(2020)Efficient Mining Closed k-Mers from DNA and Protein Sequences2020 IEEE International Conference on Big Data and Smart Computing (BigComp)10.1109/BigComp48618.2020.00-51(342-349)Online publication date: Feb-2020
https://doi.org/10.1109/BigComp48618.2020.00-51
Show More Cited By

Index Terms

Computing n-gram statistics in MapReduce
1. Information systems
  1. Information retrieval
    1. Document representation

Recommendations

Character contiguity in N-gram-based word matching: the case for Arabic text searching

This work assesses the performance of two N-gram matching techniques for Arabic root-driven string searching: contiguous N-grams and hybrid N-grams, combining contiguous and non-contiguous. The two techniques were tested using three experiments ...
Statistical Language Models of Lithuanian Based on Word Clustering and Morphological Decomposition

This paper describes our research on statistical language modeling of Lithuanian. The idea of improving sparse n-gram models of highly inflected Lithuanian language by interpolating them with complex n-gram models based on word clustering and morphological ...
MapReduce: Review and open challenges

The continuous increase in computational capacity over the past years has produced an overwhelming flow of data or big data, which exceeds the capabilities of conventional processing tools. Big data signify a new era in data exploration and utilization. ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

March 2013

793 pages

ISBN:9781450315975

DOI:10.1145/2452376

General Chair:
Giovanna Guerrini
Università di Genova, Italy
,
Program Chair:
Norman W. Paton
University of Manchester, UK

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 March 2013

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

EDBT/ICDT '13

EDBT/ICDT '13: Joint 2013 EDBT/ICDT Conferences

March 18 - 22, 2013

Genoa, Italy

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

17
Total Citations
View Citations
520
Total Downloads

Downloads (Last 12 months)6
Downloads (Last 6 weeks)1

Reflects downloads up to 21 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Chai C(2022)Comparison of text preprocessing methodsNatural Language Engineering10.1017/S1351324922000213(1-45)Online publication date: 13-Jun-2022
https://doi.org/10.1017/S1351324922000213
Biemann CHeyer GQuasthoff UBiemann CHeyer GQuasthoff U(2022)Maschinelle Verarbeitung von TextWissensrohstoff Text10.1007/978-3-658-35969-0_3(73-130)Online publication date: 22-May-2022
https://doi.org/10.1007/978-3-658-35969-0_3
Zhang JBi CWang YZeng TLiao BChen L(2020)Efficient Mining Closed k-Mers from DNA and Protein Sequences2020 IEEE International Conference on Big Data and Smart Computing (BigComp)10.1109/BigComp48618.2020.00-51(342-349)Online publication date: Feb-2020
https://doi.org/10.1109/BigComp48618.2020.00-51
Huynh HNguyen LVo BYun UOplatková ZHong T(2020)Efficient algorithms for mining clickstream patterns using pseudo-IDListsFuture Generation Computer Systems10.1016/j.future.2020.01.034Online publication date: Jan-2020
https://doi.org/10.1016/j.future.2020.01.034
Beedkar KGemulla RMartens W(2019)A Unified Framework for Frequent Sequence Mining with Subsequence ConstraintsACM Transactions on Database Systems10.1145/332148644:3(1-42)Online publication date: 5-Jun-2019
https://dl.acm.org/doi/10.1145/3321486
Renz-Wieland ABertsch MGemulla R(2019)Scalable Frequent Sequence Mining with Flexible Subsequence Constraints2019 IEEE 35th International Conference on Data Engineering (ICDE)10.1109/ICDE.2019.00134(1490-1501)Online publication date: Apr-2019
https://doi.org/10.1109/ICDE.2019.00134
Akhmed-Zaki DMansurova MImankulov TLebedev DTurar ODaribayev BAubakirov SShomanov AAidarov K(2018)High-Performance Computational and Information Technologies for Numerical Models and Data ProcessingRecent Trends in Computational Science and Engineering10.5772/intechopen.73836Online publication date: 30-May-2018
https://doi.org/10.5772/intechopen.73836
Salah SAkbarinia RMasseglia F(2017)A highly scalable parallel algorithm for maximally informative k-itemset miningKnowledge and Information Systems10.1007/s10115-016-0931-250:1(1-26)Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.1007/s10115-016-0931-2
Goncalves CSilva JCunha J(2016)An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs2016 IEEE 12th International Conference on e-Science (e-Science)10.1109/eScience.2016.7870892(120-129)Online publication date: Oct-2016
https://doi.org/10.1109/eScience.2016.7870892
Beedkar KGemulla R(2016)DESQ: Frequent Sequence Mining with Subsequence Constraints2016 IEEE 16th International Conference on Data Mining (ICDM)10.1109/ICDM.2016.0092(793-798)Online publication date: Dec-2016
https://doi.org/10.1109/ICDM.2016.0092
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents