Sorting Out the Document Identifier Assignment Problem

Fabrizio Silvestri¹

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4425)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

The compression of inverted file indexes in Web search engines has received considerable attention in recent years. Compressing the index not only reduces its space occupancy but also improves overall retrieval performance, because it allows better exploitation of the memory hierarchy. In this paper we show empirically that, for collections of Web documents, the performance of compression algorithms can be enhanced simply by assigning identifiers to documents according to the lexicographical ordering of their URLs. We validate this hypothesis by comparing several assignment techniques and several compression algorithms on a fairly large collection of about six million documents. The results are very encouraging: the compression ratio improves by up to 40% with an algorithm that takes about ninety seconds to complete and uses only 100 MB of main memory.
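
The idea behind URL-ordered identifier assignment can be illustrated with a minimal sketch. It is not the paper's implementation: the URLs are synthetic, the helper names (assign_ids, d_gaps, gamma_bits, posting_list_bits) are hypothetical, and Elias gamma coding is assumed as a stand-in for whichever compression algorithms the paper evaluates. The sketch compares the coded size of one posting list when identifiers follow an interleaved crawl order and when they follow the lexicographical order of the URLs, on the assumption that pages from the same host tend to share terms.

```python
# Toy illustration only: synthetic URLs, hypothetical helper names, and
# Elias gamma coding chosen as an assumed stand-in for a real index codec.

def assign_ids(urls_in_order):
    """Give each URL the docID equal to its position in the given order."""
    return {url: doc_id for doc_id, url in enumerate(urls_in_order)}

def d_gaps(doc_ids):
    """Convert docIDs into gaps between consecutive sorted IDs (all >= 1)."""
    ordered = sorted(doc_ids)
    if not ordered:
        return []
    # The first gap is measured from -1 so that docID 0 still yields a gap of 1.
    return [ordered[0] + 1] + [b - a for a, b in zip(ordered, ordered[1:])]

def gamma_bits(n):
    """Length in bits of the Elias gamma code of a positive integer n."""
    return 2 * (n.bit_length() - 1) + 1

def posting_list_bits(urls_with_term, id_of):
    """Total gamma-coded size of one posting list under a given ID assignment."""
    return sum(gamma_bits(g) for g in d_gaps(id_of[u] for u in urls_with_term))

# Three hypothetical hosts, crawled round-robin so their pages interleave.
hosts = ["a.example", "b.example", "c.example"]
crawl_order = [f"http://{h}/page{i:03d}" for i in range(100) for h in hosts]

# Assume some term occurs on every page of one host and nowhere else.
with_term = [u for u in crawl_order if "//b.example/" in u]

crawl_bits = posting_list_bits(with_term, assign_ids(crawl_order))
url_bits = posting_list_bits(with_term, assign_ids(sorted(crawl_order)))
print("crawl-order IDs:", crawl_bits, "bits; URL-order IDs:", url_bits, "bits")
```

Sorting the URLs places pages of the same host next to each other, so the gaps within this posting list shrink to 1 and cost fewer bits than the larger gaps produced by the interleaved order. The size of the effect in this toy setting says nothing about the 40% figure reported above, which comes from the paper's own experiments on a real collection.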

Author information

Authors and Affiliations

Institute for Information Science and Technologies, ISTI - CNR, via Moruzzi, 1, 56126 Pisa, Italy
Fabrizio Silvestri

Editor information

Giambattista Amati, Claudio Carpineto, Giovanni Romano

Cite this paper

Silvestri, F. (2007). Sorting Out the Document Identifier Assignment Problem. In: Amati, G., Carpineto, C., Romano, G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_12

DOI: https://doi.org/10.1007/978-3-540-71496-5_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71494-1
Online ISBN: 978-3-540-71496-5
eBook Packages: Computer Science, Computer Science (R0)
