poster

On single-pass indexing with MapReduce

Authors:

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Pages 742 - 743

https://doi.org/10.1145/1571941.1572106

Published: 19 July 2009

Abstract

Indexing is an important Information Retrieval (IR) operation, which must be parallelised to support large-scale document corpora. We propose a novel adaptation of the state-of-the-art single-pass indexing algorithm in terms of the MapReduce programming model. We then experiment with this adaptation, in the context of the Hadoop MapReduce implementation. In particular, we explore the scale of improvements that can be achieved when using firstly more processing hardware and secondly larger corpora. Our results show that indexing speed increases in a close to linear fashion when scaling corpus size or number of processing machines. This suggests that the proposed indexing implementation is viable to support upcoming large-scale corpora.
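
The abstract does not spell out how single-pass indexing is mapped onto MapReduce, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each map task builds posting lists in memory and flushes them as (term, partial posting list) pairs once a memory budget is reached, and that each reduce call merges the partial lists for one term. All names (`map_documents`, `shuffle`, `reduce_postings`, `build_index`, `max_postings_in_memory`) are hypothetical, and the code simulates the phases in a single process rather than calling Hadoop.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple

Posting = Tuple[int, int]          # (doc_id, term frequency)
Document = Tuple[int, str]         # (doc_id, raw text)


def map_documents(docs: Iterable[Document],
                  max_postings_in_memory: int) -> Iterator[Tuple[str, List[Posting]]]:
    """Map phase (assumed): index documents in one pass, flushing partial
    posting lists whenever the in-memory budget is reached."""
    buffer: Dict[str, List[Posting]] = defaultdict(list)
    buffered = 0
    for doc_id, text in docs:
        counts: Dict[str, int] = defaultdict(int)
        for token in text.lower().split():   # naive whitespace tokeniser
            counts[token] += 1
        for term, tf in counts.items():
            buffer[term].append((doc_id, tf))
            buffered += 1
        if buffered >= max_postings_in_memory:
            yield from sorted(buffer.items())  # emit one flush, sorted by term
            buffer = defaultdict(list)
            buffered = 0
    if buffer:                                 # emit whatever is left at the end
        yield from sorted(buffer.items())


def shuffle(pairs: Iterable[Tuple[str, List[Posting]]]) -> Dict[str, List[List[Posting]]]:
    """Stand-in for the framework's shuffle: group map output by term."""
    grouped: Dict[str, List[List[Posting]]] = defaultdict(list)
    for term, partial in pairs:
        grouped[term].append(partial)
    return grouped


def reduce_postings(partials: List[List[Posting]]) -> List[Posting]:
    """Reduce phase (assumed): merge partial lists into one list ordered by doc_id."""
    return sorted(chain.from_iterable(partials))


def build_index(splits: Iterable[Iterable[Document]],
                max_postings_in_memory: int = 1000) -> Dict[str, List[Posting]]:
    """Run one simulated map task per input split, then shuffle and reduce."""
    map_output = chain.from_iterable(
        map_documents(split, max_postings_in_memory) for split in splits
    )
    return {term: reduce_postings(parts) for term, parts in shuffle(map_output).items()}


if __name__ == "__main__":
    splits = [
        [(1, "map reduce map"), (2, "reduce index")],
        [(3, "index map")],
    ]
    for term, postings in sorted(build_index(splits, max_postings_in_memory=2).items()):
        print(term, postings)
```

In this sketch the flush threshold trades memory for the number of partial lists the reducers must merge. Because each split is processed independently, adding map tasks is the natural way to use more machines, which is consistent with the scaling behaviour the abstract reports, although the sketch itself makes no performance claim.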

Cited By

Mehta A, Vandriwala V, Patel J, Patel J (2021). Indexing on Healthcare Big Data. Soft Computing for Problem Solving, pgs. 271--283. Online publication date: 14-Oct-2021.
https://doi.org/10.1007/978-981-16-2712-5_23

Abdullahi A, Ahmad R, Zakaria N (2021). Big Data Analytics: Partitioned B+-Tree-Based Indexing in MapReduce. Machine Learning and Data Mining for Emerging Trend in Cyber Dynamics, pgs. 217--239. Online publication date: 2-Apr-2021.
https://doi.org/10.1007/978-3-030-66288-2_9

Abdullahi A, Ahmad R, Zakaria N (2021). Indexing in Big Data Mining and Analytics. Machine Learning and Data Mining for Emerging Trend in Cyber Dynamics, pgs. 123--143. Online publication date: 2-Apr-2021.
https://doi.org/10.1007/978-3-030-66288-2_5

Index Terms

On single-pass indexing with MapReduce
1. Information systems
  1. Information retrieval

Recommendations

MapReduce indexing strategies: Studying scalability and efficiency

In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora is still a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this ...
MapReduce: Review and open challenges

The continuous increase in computational capacity over the past years has produced an overwhelming flow of data or big data, which exceeds the capabilities of conventional processing tools. Big data signify a new era in data exploration and utilization. ...
GPU-based MapReduce for large-scale near-duplicate video retrieval

With the exponential growth of multimedia data, people are overwhelmed with massive amount of online videos, of which Near-Duplicate Videos (NDVs) occupy a large portion. In this paper, we present a novel framework for NDV retrieval, which explores the ...

Information & Contributors

Information

Published In

SIGIR '09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

July 2009

896 pages

ISBN:9781605584836

DOI:10.1145/1571941

General Chairs: James Allan (University of Massachusetts Amherst, USA), Javed Aslam (Northeastern University, USA)

Program Chairs: Mark Sanderson (University of Sheffield, UK), ChengXiang Zhai (University of Illinois at Urbana-Champaign, USA), Justin Zobel (University of Melbourne, Australia)

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2009

Qualifiers

Poster

Conference

SIGIR '09

Sponsor:

SIGIR '09: The 32nd International ACM SIGIR conference on research and development in Information Retrieval

July 19 - 23, 2009

Boston, MA, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 19
Total Downloads: 1,048
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 16 Nov 2024

Citations

Ahmad R, Zakaria M, Abdullahi A (2019). Hadoop MapReduce's InputSplit based Indexing for Join Query Processing. 2019 9th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), pgs. 131--135. Online publication date: Nov-2019.
https://doi.org/10.1109/ICCSCE47578.2019.9068559

Qiu L (2018). Research on Data Mining Algorithm of Association Rules Based on Hadoop. Computational Intelligence and Intelligent Systems, pgs. 288--297. Online publication date: 21-Jul-2018.
https://doi.org/10.1007/978-981-13-1648-7_25

Arab A, Abrishami S (2017). MDMP: A new algorithm to create inverted index files in BigData, using MapReduce. 2017 7th International Conference on Computer and Knowledge Engineering (ICCKE), pgs. 372--378. Online publication date: Oct-2017.
https://doi.org/10.1109/ICCKE.2017.8167907

Desani D, Gil-Costa V, Marcondes C, Senger H (2016). Black-Box Optimization of Hadoop Parameters Using Derivative-Free Optimization. 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), pgs. 43--50. Online publication date: Feb-2016.
https://doi.org/10.1109/PDP.2016.35

Crane M, Trotman A, Eyers D (2015). Improving Throughput of a Pipeline Model Indexer. Proceedings of the 20th Australasian Document Computing Symposium, pgs. 1--4. Online publication date: 8-Dec-2015.
https://dl.acm.org/doi/10.1145/2838931.2838943

Senger H, Gil‐Costa V, Arantes L, Marcondes C, Marín M, Sato L, da Silva F (2015). BSP cost and scalability analysis for MapReduce operations. Concurrency and Computation: Practice and Experience, 28:8, pgs. 2503--2527. Online publication date: 7-Oct-2015.
https://doi.org/10.1002/cpe.3628

Huston S, Culpepper J, Croft W (2014). Indexing Word Sequences for Ranked Retrieval. ACM Transactions on Information Systems, 32:1, pgs. 1--26. Online publication date: 1-Jan-2014.
https://dl.acm.org/doi/10.1145/2559168