Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2452376.2452389acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article

Computing n-gram statistics in MapReduce

Published: 18 March 2013 Publication History

Abstract

Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-σ that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete Map-Reduce implementation, we provide insights on an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.

References

[1]
Apache Hadoop http://hadoop.apache.org/.
[2]
Apache OpenNLP http://opennlp.apache.org/.
[3]
Berkeley DB Java Edition http://www.oracle.com/products/berkeleydb/.
[4]
Boilerpipe http://code.google.com/p/boilerpipe/.
[5]
Google n-Gram Corpus http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html.
[6]
The ClueWeb09 Dataset http://lemurproject.org/clueweb09.
[7]
The New York Times Annotated Corpus http://corpus.nytimes.com.
[8]
R. Agrawal et al. Mining association rules between sets of items in large databases. SIGMOD 1993
[9]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. VLDB 1994.
[10]
R. Agrawal and R. Srikant. Mining sequential patterns. ICDE 1995.
[11]
S. Banerjee and T. Pedersen. The design, implementation, and use of the ngram statistics package. CICLing 2003.
[12]
Y. Bernstein and J. Zobel. Accurate discovery of co-derivative documents via duplicate text detection. Inf. Syst., 31(7):595--609, 2006.
[13]
T. Brants et al. Large language models in machine translation. EMNLP-CoNLL 2007.
[14]
A. Ceglar and J. F. Roddick. Association mining. ACM Comput. Surv., 38(2), 2006.
[15]
H. Ceylan and R. Mihalcea. An efficient indexer for large n-gram corpora. ACL 2011.
[16]
F. Chierichetti et al. Max-cover in map-reduce. WWW 2010.
[17]
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI 2004.
[18]
M. Federico et al. Irstlm: an open source toolkit for handling large scale language models. INTERSPEECH 2008.
[19]
V. Guralnik and G. Karypis. Parallel tree-projection-based sequence mining algorithms. Parallel Computing, 30(4):443--472, 2004.
[20]
J. Han et al. Frequent pattern mining: current status and future directions. DMKD, 15(1):55--86, 2007.
[21]
J. Han et al. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. DMKD, 8(1):53--87, 2004.
[22]
J.-W. Huang et al. Dpsp: Distributed progressive sequential pattern mining on the cloud. PAKDD 2010.
[23]
S. Huston et al. Efficient indexing of repeated n-grams. WSDM 2011.
[24]
S. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. ASSP, 35(3):400--401, 1987.
[25]
C. Kohlschütter et al. Boilerplate detection using shallow text features. WSDM 2010.
[26]
H. Li et al. Pfp: parallel fp-growth for query recommendation. RecSys 2008.
[27]
D. Lin et al. New tools for web-scale n-grams. LREC 2010.
[28]
J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with mapreduce. SIGIR 2009.
[29]
J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers, 2010.
[30]
N. R. Mabroukeh and C. I. Ezeife. A taxonomy of sequential pattern mining algorithms. ACM Comput. Surv., 43(1):3, 2010.
[31]
U. Manber and E. W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935--948, 1993.
[32]
J.-B. Michel et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 2010.
[33]
G. D. F. Morales et al. Social content matching in mapreduce. PVLDB, 4(7):460--469, 2011.
[34]
P. Nguyen et al. Msrlm: a scalable language modeling toolkit. MSR-TR-2007-144
[35]
A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. SIGMOD 2011.
[36]
S. Parthasarathy et al. Parallel data mining for association rules on shared-memory systems. Knowl. Inf. Syst., 3(1):1--29, 2001.
[37]
J. Pei et al. Mining sequential patterns by pattern-growth: The prefixspan approach. TKDE 2004.
[38]
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT 1996.
[39]
A. Stolcke. Srilm - an extensible language modeling toolkit. INTERSPEECH 2002.
[40]
K. Wang et al. An Overview of Microsoft Web N-gram Corpus and Applications. NAACL-HLT 2010.
[41]
T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2nd edition, 2010.
[42]
I. H. Witten et al. Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann, 1999.
[43]
M. Yamamoto and K. W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Comput. Linguist., 27:1--30, March 2001.
[44]
M. J. Zaki. Parallel sequence mining on shared-memory machines. J. Parallel Distrib. Comput., 61(3):401--426, 2001.
[45]
M. J. Zaki. Spade: An efficient algorithm for mining frequent sequences. ML, 42(1/2):31--60, 2001.
[46]
C. Zhai. Statistical language models for information retrieval: a critical review. FTIR, 2:137--213, 2008.

Cited By

View all

Index Terms

  1. Computing n-gram statistics in MapReduce

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Other conferences
    EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
    March 2013
    793 pages
    ISBN:9781450315975
    DOI:10.1145/2452376
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 March 2013

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. MapReduce
    2. n-grams

    Qualifiers

    • Research-article

    Conference

    EDBT/ICDT '13

    Acceptance Rates

    Overall Acceptance Rate 7 of 10 submissions, 70%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)6
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 21 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Comparison of text preprocessing methodsNatural Language Engineering10.1017/S1351324922000213(1-45)Online publication date: 13-Jun-2022
    • (2022)Maschinelle Verarbeitung von TextWissensrohstoff Text10.1007/978-3-658-35969-0_3(73-130)Online publication date: 22-May-2022
    • (2020)Efficient Mining Closed k-Mers from DNA and Protein Sequences2020 IEEE International Conference on Big Data and Smart Computing (BigComp)10.1109/BigComp48618.2020.00-51(342-349)Online publication date: Feb-2020
    • (2020)Efficient algorithms for mining clickstream patterns using pseudo-IDListsFuture Generation Computer Systems10.1016/j.future.2020.01.034Online publication date: Jan-2020
    • (2019)A Unified Framework for Frequent Sequence Mining with Subsequence ConstraintsACM Transactions on Database Systems10.1145/332148644:3(1-42)Online publication date: 5-Jun-2019
    • (2019)Scalable Frequent Sequence Mining with Flexible Subsequence Constraints2019 IEEE 35th International Conference on Data Engineering (ICDE)10.1109/ICDE.2019.00134(1490-1501)Online publication date: Apr-2019
    • (2018)High-Performance Computational and Information Technologies for Numerical Models and Data ProcessingRecent Trends in Computational Science and Engineering10.5772/intechopen.73836Online publication date: 30-May-2018
    • (2017)A highly scalable parallel algorithm for maximally informative k-itemset miningKnowledge and Information Systems10.1007/s10115-016-0931-250:1(1-26)Online publication date: 1-Jan-2017
    • (2016)An n-gram cache for large-scale parallel extraction of multiword relevant expressions with LocalMaxs2016 IEEE 12th International Conference on e-Science (e-Science)10.1109/eScience.2016.7870892(120-129)Online publication date: Oct-2016
    • (2016)DESQ: Frequent Sequence Mining with Subsequence Constraints2016 IEEE 16th International Conference on Data Mining (ICDM)10.1109/ICDM.2016.0092(793-798)Online publication date: Dec-2016
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media