Abstract
Recent studies demonstrated that it is possible to reduce Inverted Files (IF) sizes by reassigning the document identifiers of the original collection, as this lowers the distance between the positions of documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total amount of bits per document pointer. This paper presents an efficient solution to the reassignment problem, which consists in reducing the input data dimensionality using a SVD transformation, as well as considering it a Travelling Salesman Problem (TSP). We also present some efficient solutions based on clustering. Finally, we combine both the TSP and the clustering strategies for reordering the document identifiers. We present experimental tests and performance results in two text TREC collections, obtaining good compression ratios with low running times, and advance the possibility of obtaining scalable solutions for web collections based on the techniques presented here.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Bartell, B. T., Cottrel, G. W., & Belew, R. K. (1992). Latent semantic indexing is an optimal special case of multidimensional scaling. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 161–167).
Blanco, R., & Barreiro, A. (2005). Document identifier reassignment through dimensionality reduction. In Proceedings of the 27th European Conference on IR Research, ECIR 2005, LNCS 3408 (pp. 375–387).
Blandford, D., & Blelloch, G. (2002). Index compression through document reordering. In: Proceedings of the IEEE Data Compression Conference (DCC’02), pp. 342–351.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Dumais, S. T. (1994). Latent Semantic Indexing (LSI): TREC-3 Report. In NIST Special Publication 500-225: Proceedings of the Third Text REtrieval Conference (TREC-3), November 1994.
Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2), 194–203.
Gallager, R., & van Voorhis, D. (1975). Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information Theory, 21, 228–230.
Golomb, S. (1966). Run-length encodings. IEEE Transactions on Information Theory, 12, 399–401.
http://mg4j.dsi.unimi.it/MG4J (Managing Gigabytes for Java).
http://tedlab.mit.edu/~dr/SVDLIBC/ SVDLIBC.
http://www.cs.mu.oz.au.mg/ Managing Gigabytes.
Karypis, G., & Kumar, V. (1995). A Fast and High Quality Multilevel Scheme for Patitioning Irregular Graphs (Tech. Rep. No. TR 95-035). Department of Computer Science, University of Minnesota.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Kokiopoulou, E., & Saad, Y. (2004). Polynomial filtering in latent semantic indexing for information retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 104–111.
Moffat, A., & Turpin, A. (2002). Compression and coding algorithms, Kluwer.
Moffat, A., & Stuiver, L. (2000). Binary interpolative coding for effective index compression. Information Retrieval, 3(1), 25–47.
Rivest, R. (1992). RFC 1321: The md5 algorithm.
Shieh, W.-Y., Chen, T.-F., Shann, J. J.-J., & Chung, C.-P. (2003). Inverted file compression through document identifier reassignment. In Information Processing and Management, 39(1), 117–131.
Silvestri, F., Orlando, S., & Perego, R. (2004). Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 305–312.
Witten, I. H., Moffat, A., & Bell, T. C. (1999). Managing gigabytes—Compressing and indexing documents and images (2nd edn.). San Francisco: Morgan Kaufmann Publishing.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Blanco, R., Barreiro, Á. TSP and cluster-based solutions to the reassignment of document identifiers. Inf Retrieval 9, 499–517 (2006). https://doi.org/10.1007/s10791-006-6614-y
Received:
Revised:
Accepted:
Issue Date:
DOI: https://doi.org/10.1007/s10791-006-6614-y