research-article

Efficient immediate-access dynamic indexing

Authors:

Alistair Moffat,

Joel MackenzieAuthors Info & Claims

Volume 60, Issue 3

https://doi.org/10.1016/j.ipm.2022.103248

Published: 01 May 2023 Publication History

Abstract

In a dynamic retrieval system, documents must be ingested as they arrive, and be immediately findable by queries. Our purpose in this paper is to describe an index structure and processing regime that accommodates that requirement for immediate access, seeking to make the ingestion process as streamlined as possible, while at the same time seeking to make the growing index as small as possible, and seeking to make term-based querying via the index as efficient as possible. We describe a new compression operation and a novel approach to extensible lists which together facilitate that triple goal. In particular, the structure we describe provides incremental document-level indexing using as little as two bytes per posting and only a small amount more for word-level indexing; provides fast document insertion; supports immediate and continuous queryability; provides support for fast conjunctive queries and similarity score-based ranked queries; and facilitates fast conversion of the dynamic index to a “normal” static compressed inverted index structure. Measurement of our new mechanism confirms that in-memory dynamic document-level indexes for collections into the gigabyte range can be constructed at a rate of two gigabytes/minute using a typical server architecture, that multi-term conjunctive Boolean queries can be resolved in just a few milliseconds each on average even while new documents are being concurrently ingested, and that the net memory space required for all of the required data structures amounts to an average of as little as two bytes per stored posting, less than half the space required by the best previous mechanism.

Highlights

•

Dynamic indexes allow new documents to be added to a text collection.

•

In an immediate-access index they are immediately findable via keyword queries.

•

We develop a new method for constructing immediate-access indexes.

•

We include a novel compression technique that reduces the memory required.

•

We further describe an innovative approach to extensible lists.

•

We present comprehensive experimental results.

References

[1]

Allan, J., Aslam, J. A., Carterette, B., Pavlu, V., & Kanoulas, E. (2008). Million query track 2008 overview. In Proc. text retrieval conf.

[2]

Allan, J., Carterette, B., Aslam, J. A., Pavlu, V., Dachev, B., & Kanoulas, E. (2007). Million query track 2007 overview. In Proc. text retrieval conf.

[3]

Arroyuelo D., Oyarzún M., González S., Sepulveda V., Hybrid compression of inverted lists for reordered document collections, Information Processing & Management 54 (6) (2018) 1308–1324.

[4]

Asadi, N., Lin, J., & Busch, M. (2013). Dynamic memory allocation policies for postings in real-time Twitter search. In Proc. conf. on knowledge discovery and data mining (pp. 1186–1194).

[5]

Bai X., Arapakis I., Cambazoglu B.B., Freire A., Understanding and leveraging the impact of response latency on user behaviour in web search, ACM Transactions on Information Systems 36 (2) (2017) 1–42.

[6]

Bai X., Cambazoglu B.B., Impact of response latency on sponsored search, Information Processing & Management 56 (1) (2019) 110–129.

[7]

Brisaboa N.R., Farina A., Navarro G., Esteller M.F., ( S, C )-dense coding: An optimized compression code for natural language text databases, in: Proc. symp. on string processing and information retrieval, Springer, 2003, pp. 122–136.

[8]

Broder, A. Z., Carmel, D., Herscovici, M., Soffer, A., & Zien, J. (2003). Efficient query evaluation using a two-level retrieval process. In Proc. ACM int. conf. on information and knowledge management (pp. 426–434).

[9]

Brown, E. W., Callan, J. P., & Croft, W. B. (1994). Fast incremental indexing for full-text information retrieval. In Proc. conf. on very large databases (pp. 192–202).

[10]

Busch, M., Gade, K., Larson, B., Lok, P., Luckenbill, S., & Lin, J. (2012). Earlybird: Real-time search at Twitter. In Proc. int. conf. on data engineering (pp. 1360–1369).

[11]

Büttcher, S., & Clarke, C. L. A. (2005a). Indexing time vs. query time: Trade-offs in dynamic information retrieval systems. In Proc. ACM int. conf. on information and knowledge management (pp. 317–318).

[12]

Büttcher S., Clarke C.L.A., Memory management strategies for single-pass index construction in text retrieval systems, University of Waterloo, 2005, URL http://www.stefan.buettcher.org/papers/wumpus-tr-2005-02.pdf.

[13]

Büttcher S., Clarke C.L.A., Hybrid index maintenance for contiguous inverted lists, Information Retrieval 11 (3) (2008) 175–207.

[14]

Büttcher S., Clarke C.L.A., Cormack G.V., Information retrieval: Implementing and evaluating search engines, MIT Press, Cambridge, MA, 2010.

[15]

Carterette, B., Pavlu, V., Fang, H., & Kanoulas, E. (2009). Million query track 2009 overview. In Proc. text retrieval conf.

[16]

Chang M., Poon C.-K., Efficient phrase querying with common phrase index, Information Processing & Management 44 (2) (2008) 756–769.

[17]

Chowdhury G., An agenda for green information retrieval research, Information Processing & Management 48 (6) (2012) 1067–1077.

[18]

Clarke C.L.A., Cormack G.V., Burkowski F.J., Dynamic inverted indexes for a distributed full-text retrieval system, University of Waterloo, 1994, URL https://www.researchgate.net/publication/2699270_Fast_Inverted_Indexes_with_On-Line_Update.

[19]

Culpepper, J. S., & Moffat, A. (2005). Enhanced byte codes with restricted prefix properties. In Proc. symp. on string processing and information retrieval (pp. 1–12).

[20]

Culpepper J.S., Moffat A., Efficient set intersection for inverted indexing, ACM Transactions on Information Systems 29 (1) (2010) 1.1–1.25.

[21]

Cutting, D., & Pedersen, J. (1990). Optimizations for dynamic inverted index maintenance. In Proc. ACM int. conf. on research and development in information retrieval (pp. 405–411).

[22]

Daoud C.M., de Moura E.S., Carvalho A.L., da Silva A.S., Fernandes D., Rossi C., Fast top-k preserving query processing using two-tier indexes, Information Processing & Management 52 (5) (2016) 855–872.

Digital Library

[23]

Daoud C.M., de Moura E.S., Fernandes D., da Silva A.S., Rossi C., Carvalho A., Waves: A fast multi-tier top-k query processing algorithm, Information Retrieval 20 (3) (2017) 292–316.

[24]

de Moura E.S., Navarro G., Ziviani N., Baeza-Yates R., Fast and flexible word searching on compressed text, ACM Transactions on Information Systems 18 (2) (2000) 113–139.

Digital Library

[25]

Dean, J. (2009). Challenges in building large-scale information retrieval systems: Invited talk. In Proc. conf. on web search and data mining (p. 1).

[26]

Dhulipala, L., Kabiljo, I., Karrer, B., Ottaviano, G., Pupyrev, S., & Shalita, A. (2016). Compressing graphs and indexes with recursive graph bisection. In Proc. conf. on knowledge discovery and data mining (pp. 1535–1544).

[27]

Ding, S., & Suel, T. (2011). Faster top-k document retrieval using block-max indexes. In Proc. ACM int. conf. on research and development in information retrieval (pp. 993–1002).

[28]

Eades, P., Wirth, A., & Zobel, J. (2022). Immediate text search on streams using apoptosic indexes. In Proc. European conf. on information retrieval (pp. 157–169).

[29]

Faloutsos, C., & Jagadish, H. V. (1992). On B-tree indices for skewed distributions. In Proc. conf. on very large databases (pp. 363–374).

[30]

Fox E.A., Lee W.C., FAST-INV: A fast algorithm for building large inverted files, Virginia Tech., Blacksburg, Virginia, 1991, URL http://hdl.handle.net/10919/19663.

[31]

Harman D., Candela G.T., Retrieving records from a gigabyte of text on a mini-computer using statistical ranking, Journal of the American Society for Information Science 41 (8) (1990) 581–589.

[32]

Hawking, D., & Billerbeck, B. (2017). Efficient in-memory, list-based text inversion. In Proc. Australasian document computing symp (pp. 5:1–5:8).

[33]

Heaps H.S., Storage analysis of a compression coding for document data bases, INFOR. Information Systems and Operational Research 10 (1) (1972) 47–61.

[34]

Heinz S., Zobel J., Efficient single-pass index construction for text databases, Journal of the American Society for Information Science and Technology 54 (8) (2003) 713–729.

[35]

Kim, S., Lee, T., Hwang, S. w., & Elnikety, S. (2018). List intersection for web search: Algorithms, cost models, and optimizations. In Proc. conf. on very large databases, vol. 12, no. 1 (pp. 1–13).

[36]

Lemire D., Boytsov L., Decoding billions of integers per second through vectorization, Software Practice & Experience 41 (1) (2015) 1–29.

[37]

Lemire D., Kurz N., Rupp C., Stream VByte: Faster byte-oriented integer compression, Information Processing Letters 130 (2018) 1–6.

[38]

Lester N., Moffat A., Zobel J., Efficient on-line index construction for text databases, ACM Transactions on Database Systems 33 (3) (2008) 19.1–19.33.

[39]

Lester N., Zobel J., Williams H.E., Efficient online index maintenance for contiguous inverted lists, Information Processing & Management 42 (4) (2006) 916–933.

[40]

Lin J., Paniak L., Boerke G., The performance envelope of inverted indexing on modern hardware, 2019, arXiv:1910.11028v2.

[41]

Mackenzie, J., & Moffat, A. (2020). Examining the additivity of top-k query processing innovations. In Proc. ACM int. conf. on information and knowledge management (pp. 1085–1094).

[42]

Mackenzie J., Petri M., Moffat A., Tradeoff options for bipartite graph partitioning, IEEE Transactions on Knowledge and Data Engineering (2022),. (in press).

Digital Library

[43]

Mallia, A., Ottaviano, G., Porciani, E., Tonellotto, N., & Venturini, R. (2017). Faster BlockMax WAND with variable-sized blocks. In Proc. ACM int. conf. on research and development in information retrieval (pp. 625–634).

[44]

Mallia, A., Siedlaczek, M., Mackenzie, J., & Suel, T. (2019). PISA: Performant indexes and search for academia. In Proc. OSIRRC at SIGIR 2019 (pp. 50–56).

[45]

Mallia, A., Siedlaczek, M., & Suel, T. (2019). An experimental study of index compression and DAAT query processing methods. In Proc. European conf. on information retrieval (pp. 353–368b).

[46]

Mendoza M., Marín M., Gil-Costa V., Ferrarotti F., Reducing hardware hit by queries in web search engines, Information Processing & Management 52 (6) (2016) 1031–1052.

[47]

Moffat A., Economical inversion of large text files, Computing Systems 5 (2) (1992) 125–139.

[48]

Moffat A., Bell T.A.H., In-situ generation of compressed inverted files, Journal of the American Society for Information Science 46 (7) (1995) 537–550.

[49]

Moffat, A., & Mackenzie, J. (2022). Immediate-access indexing using space-efficient extensible arrays. In Proc. Australasian document computing symp. https://doi.org/10.1145/3572960.3572984, (in press).

[50]

Moffat A., Stuiver L., Binary interpolative coding for effective index compression, Information Retrieval 3 (1) (2000) 25–47.

[51]

Moffat A., Zobel J., Self indexing inverted files for fast text retrieval, ACM Transactions on Information Systems 14 (4) (1996) 349–379.

[52]

Pibiri G.E., Venturini R., Techniques for inverted index compression, ACM Computing Surveys 53 (6) (2021) 125.1–125.36.

[53]

Robertson S.E., Zaragoza H., The probabilistic relevance framework: BM25 and beyond, Foundations & Trends in Information Retrieval 3 (2009) 333–389.

[54]

Scholer, F., Williams, H. E., Yiannis, J., & Zobel, J. (2002). Compression of inverted indexes for fast query evaluation. In Proc. ACM int. conf. on research and development in information retrieval.

[55]

Shieh W.-Y., Chung C.-P., A statistics-based approach to incrementally update inverted files, Information Processing & Management 41 (2) (2005) 275–288.

[56]

Shoens, K., Tomasic, A., & Garćıa-Molina, H. (1994). Synthetic workload performance analysis of incremental updates. In Proc. ACM int. conf. on research and development in information retrieval (pp. 329–338).

[57]

Stepanov, A. A., Gangolli, A. R., Rose, D. E., Ernst, R. J., & Oberoi, P. S. (2011). SIMD-based decoding of posting lists. In Proc. ACM int. conf. on information and knowledge management (pp. 317–326).

[58]

Tomasic, A., Garćıa-Molina, H., & Shoens, K. (1994). Incremental updates of inverted lists for text document retrieval. In Proc. int. conf. on management of data (pp. 289–300).

[59]

Tonellotto N., Macdonald C., Ounis I., Efficient query processing for scalable web search, Foundations & Trends in Information Retrieval 12 (4–5) (2018) 319–500.

[60]

Trotman A., Compressing inverted files, Information Retrieval 6 (1) (2003) 5–19.

[61]

Trotman, A. (2014). Compression, SIMD, and postings lists. In Proc. Australasian document computing symp (pp. 50.50–50.57).

[62]

Trotman, A., Jia, X., & Crane, M. (2013). Managing short postings lists. In Proc. Australasian document computing symp (pp. 113–116).

[63]

Turtle H.R., Flood J., Query evaluation: Strategies and optimizations, Information Processing & Management 31 (6) (1995) 831–850.

[64]

Wang, J., Lin, C., He, R., Chae, M., Papakonstantinou, Y., & Swanson, S. (2017). MILC: Inverted list compression in memory. In Proc. conf. on very large databases, vol. 10, no. 8 (pp. 853–864).

[65]

Williams H.E., Zobel J., Compressing integers for fast file access, Computer Journal 42 (3) (1999) 193–201.

[66]

Zobel J., Moffat A., Inverted files for text search engines, ACM Computing Surveys 38 (2) (2006) 6:1–6:56.

Recommendations

Fast construction of the HYB index

As shown in a series of recent works, the HYB index is an alternative to the inverted index (INV) that enables very fast prefix searches, which in turn is the basis for fast processing of many other types of advanced queries, including autocompletion, ...
Efficient Update of Indexes for Dynamically Changing Web Documents

Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index ...
Optimizing positional index structures for versioned document collections
SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Versioned document collections are collections that contain multiple versions of each document. Important examples are Web archives, Wikipedia and other wikis, or source code and documents maintained in revision control systems. Versioned document ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Information Processing and Management: an International Journal

Information Processing and Management: an International Journal Volume 60, Issue 3

May 2023

1647 pages

ISSN:0306-4573

Issue’s Table of Contents

Elsevier Ltd.

Publisher

Pergamon Press, Inc.

United States

Publication History

Published: 01 May 2023

Author Tags

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 26 Sep 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents