DOI: 10.1145/1571941.1572106
poster

On single-pass indexing with MapReduce

Published: 19 July 2009

Abstract

Indexing is an important Information Retrieval (IR) operation, which must be parallelised to support large-scale document corpora. We propose a novel adaptation of the state-of-the-art single-pass indexing algorithm to the MapReduce programming model. We then experiment with this adaptation using the Hadoop MapReduce implementation. In particular, we explore the scale of improvements that can be achieved when using, firstly, more processing hardware and, secondly, larger corpora. Our results show that indexing speed increases in a close-to-linear fashion when scaling corpus size or the number of processing machines. This suggests that the proposed indexing implementation is viable for supporting upcoming large-scale corpora.
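
The abstract does not spell out the adaptation, so the following is only a minimal sketch of how single-pass indexing might be expressed as map and reduce steps, not the authors' implementation. It assumes that each map task builds posting lists in memory and flushes them as partial lists once a memory budget is reached, and that the reduce step merges the partial lists for each term. The names `map_documents`, `reduce_postings`, `run_job` and the parameter `max_postings_in_memory` are hypothetical, and the job is simulated in a single process rather than run on Hadoop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Posting = Tuple[int, int]  # (doc_id, term_frequency)
Document = Tuple[int, str]  # (doc_id, text)


def map_documents(
    docs: Iterable[Document], max_postings_in_memory: int
) -> Iterator[Tuple[str, List[Posting]]]:
    """Map step: build posting lists in memory, flushing when the budget is hit.

    The budget is counted in postings (an assumption made for simplicity);
    a real indexer would track bytes of memory instead.
    """
    postings: Dict[str, List[Posting]] = defaultdict(list)
    buffered = 0
    for doc_id, text in docs:
        counts: Dict[str, int] = defaultdict(int)
        for token in text.lower().split():
            counts[token] += 1
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
            buffered += 1
        if buffered >= max_postings_in_memory:
            # Flush: emit one (term, partial posting list) pair per term.
            yield from sorted(postings.items())
            postings = defaultdict(list)
            buffered = 0
    if postings:
        yield from sorted(postings.items())


def reduce_postings(
    term: str, partial_lists: Iterable[List[Posting]]
) -> Tuple[str, List[Posting]]:
    """Reduce step: merge all partial posting lists of a term, ordered by doc_id."""
    merged = [posting for partial in partial_lists for posting in partial]
    merged.sort()
    return term, merged


def run_job(
    splits: Iterable[Iterable[Document]], max_postings_in_memory: int = 1000
) -> Dict[str, List[Posting]]:
    """Simulate the job in one process: map each split, group by term, reduce."""
    shuffled: Dict[str, List[List[Posting]]] = defaultdict(list)
    for split in splits:
        for term, partial in map_documents(split, max_postings_in_memory):
            shuffled[term].append(partial)
    return dict(reduce_postings(t, lists) for t, lists in sorted(shuffled.items()))


if __name__ == "__main__":
    splits = [
        [(1, "map reduce map"), (2, "reduce index")],
        [(3, "index map")],
    ]
    for term, plist in run_job(splits, max_postings_in_memory=2).items():
        print(term, plist)
```

Emitting whole partial posting lists per flush, rather than one pair per token occurrence, keeps the number of intermediate records small; that is the design choice this sketch is meant to illustrate, under the assumptions stated above.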

    Published In

    SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
    July 2009
    896 pages
    ISBN: 9781605584836
    DOI: 10.1145/1571941

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. MapReduce
    2. indexing

    Qualifiers

    • Poster

    Conference

    SIGIR '09

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%
