AI6122 Topic 3.1 - Index
https://lucene.apache.org/
An example document collection: RCV1
• RCV1:
– One year of Reuters newswire (August 1996 – August 1997); not very large
– As an example for applying scalable index construction algorithms, we will use
the Reuters RCV1 collection.
– Related datasets:
http://archive.ics.uci.edu/ml/datasets/Reuters+RCV1+RCV2+Multilingual,+Multiview+Text+Categorization+Test+collection#
Reuters RCV1
Symbol  Statistic                                                    Value
N       Documents                                                    800,000
L       Average number of tokens per document                        200
M       Distinct terms (word types)                                  400,000
        Average number of bytes per token (incl. spaces/punct.)      6
        Average number of bytes per token (without spaces/punct.)    4.5
        Average number of bytes per term                             7.5
T       Number of tokens                                             100,000,000
Index Construction
• Build a separate (partial) index for each part of the collection that fits in memory.
• These separate indexes can then be merged into one big index for the
document collection.
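The merge step can be sketched in Python. This is a toy sketch over in-memory dicts mapping terms to sorted docID lists; merge_indexes is my own helper, not part of any library:

```python
def merge_indexes(idx1, idx2):
    """Merge two partial inverted indexes (term -> sorted docID list)."""
    merged = {}
    for term in set(idx1) | set(idx2):
        # Union the two postings lists and restore sorted docID order.
        merged[term] = sorted(set(idx1.get(term, [])) | set(idx2.get(term, [])))
    return merged

a = {"caesar": [1, 3], "came": [1]}
b = {"caesar": [2], "died": [2]}
merged = merge_indexes(a, b)
print(merged["caesar"])  # [1, 2, 3]
```

A real merger streams both sorted term lists from disk instead of holding everything in memory, but the per-term logic is the same.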
Inverted Index by SPIMI
Merging two inverted indexes
Multi-way merge?
[Figure: sorted runs on disk are read through input buffers 1..M in main
memory and merged through an output buffer back to disk; with M input
buffers, the same layout performs a multi-way merge.]
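Python's heapq.merge implements exactly this kind of buffered multi-way merge over sorted runs, keeping only one "head" item per run in memory. A sketch with tiny in-memory runs standing in for the disk files:

```python
import heapq

def multiway_merge(runs):
    """Merge several sorted runs of (term, docID) pairs into one sorted
    stream; heapq.merge mirrors the one-input-buffer-per-run layout."""
    yield from heapq.merge(*runs)

# tiny in-memory runs standing in for sorted runs on disk
runs = [
    [("brutus", 1), ("caesar", 1)],
    [("brutus", 2), ("calpurnia", 2)],
    [("caesar", 3)],
]
merged_run = list(multiway_merge(runs))
print(merged_run)
# [('brutus', 1), ('brutus', 2), ('caesar', 1), ('caesar', 3), ('calpurnia', 2)]
```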
If the document collection is huge: Distributed Indexing
Term-partitioned distributed indexing: MapReduce
Parsers and Inverters
• Parser
– reads a document at a time, and emits <term, docID> pairs
– Parser writes pairs into j partitions, each partition is for a range of terms’ first
letters (e.g., a-f, g-p, q-z) – here j = 3.
• An inverter
– collects all <term, docID> pairs for one term-partition (e.g., a-f)
– Sorts and writes to postings lists
Schema for index construction in MapReduce
Example for index construction
• Map:
– d1: C came, C c’ed.
– d2: C died.
→ <C,d1>, <came,d1>, <C,d1>, <c’ed,d1>, <C,d2>, <died,d2>
• Reduce:
– (<C,(d1,d2,d1)>, <died,(d2)>, <came,(d1)>, <c’ed,(d1)>)
→ (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <c’ed,(d1:1)>)
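The example can be replayed with a tiny single-machine simulation of the two phases (tokenization is simplified and the function names are mine):

```python
from collections import defaultdict

def map_phase(docs):
    """Parser role: emit one <term, docID> pair per token occurrence."""
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().replace(",", " ").replace(".", " ").split():
            pairs.append((token, doc_id))
    return pairs

def reduce_phase(pairs):
    """Inverter role: group pairs by term, counting occurrences per doc."""
    postings = defaultdict(lambda: defaultdict(int))
    for term, doc_id in pairs:
        postings[term][doc_id] += 1
    return {term: dict(counts) for term, counts in postings.items()}

docs = {"d1": "C came, C c'ed.", "d2": "C died."}
index = reduce_phase(map_phase(docs))
print(index["c"])  # {'d1': 2, 'd2': 1}
```

In real MapReduce the pairs would be partitioned across machines between the two phases; the grouping logic is the same.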
Dynamic indexing
• Collections change over time: documents are inserted, deleted, and modified.
• This means that the dictionary and postings lists have to be modified:
– Postings updates for terms already in dictionary
– New terms added to dictionary
Simplest approach for dynamic indexing
• Maintain a big main index; new docs go into a small auxiliary index
– Search across both indexes, merge the results
– Periodically re-index into one main index
• Deletions
– Invalidation bit-vector for deleted docs
– Filter search results by this invalidation bit-vector
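The deletion filter is one line in practice. A sketch, assuming the bit-vector is indexed by docID:

```python
def filter_deleted(result_doc_ids, invalidation_bits):
    """Drop deleted documents from a search result;
    invalidation_bits[d] == 1 marks docID d as deleted."""
    return [d for d in result_doc_ids if not invalidation_bits[d]]

bits = [0, 0, 1, 0, 1]                     # docs 2 and 4 are deleted
print(filter_deleted([0, 2, 3, 4], bits))  # [0, 3]
```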
Index Compression
• Dictionary compression
• Postings compression
Why compression (in general)?
Lossless vs. lossy compression
• Lossless compression: all information is preserved (what we mostly use in IR)
• Lossy compression: discard some information, e.g.,
– Prune postings entries that are unlikely to turn up in the top 𝑘 list for any query.
• Almost no loss of quality for the top 𝑘 list.
Why compression for inverted indexes?
• Dictionary
– Make it small enough to keep in main memory
– Make it so small that you can keep some postings lists in main memory too
• Postings file(s)
– Reduce disk space needed
– Decrease time needed to read postings lists from disk
– Large search engines keep a significant part of the postings in memory.
[Compression lets you keep more in memory]
Reuters RCV1 (recap)
Symbol  Statistic                                                    Value
N       Documents                                                    800,000
L       Average number of tokens per document                        200
M       Distinct terms (word types)                                  400,000
        Average number of bytes per token (incl. spaces/punct.)      6
        Average number of bytes per token (without spaces/punct.)    4.5
        Average number of bytes per term                             7.5
T       Number of tokens                                             100,000,000
Index parameters vs. index size
Vocabulary vs. collection size
• In practice, the vocabulary will keep growing with the collection size
– Especially with Unicode ☺
Vocabulary vs. collection size
• Heaps’ law: 𝑀 = 𝑘𝑇^𝑏
– 𝑀 is the size of the vocabulary
– 𝑇 is the number of tokens in the collection
– Typical values: 30 ≤ 𝑘 ≤ 100 and 𝑏 ≈ 0.5
Heaps’ Law: 𝑀 = 𝑘𝑇^𝑏
➢ For RCV1, the dashed line is the best least-squares fit:
– log10 𝑀 = 0.49 log10 𝑇 + 1.64
– i.e., 𝑀 = 10^1.64 𝑇^0.49
– 𝑘 = 10^1.64 ≈ 44 and 𝑏 = 0.49
– A good empirical fit for Reuters RCV1!
➢ Example:
– For the first 1,000,020 tokens, the law predicts 38,323 terms;
– Actual number: 38,365 terms
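The fit can be checked in two lines of Python, plugging in the slide's k = 44 and b = 0.49 (heaps_vocab is my own name):

```python
def heaps_vocab(T, k=44, b=0.49):
    """Heaps' law: predicted vocabulary size M = k * T**b."""
    return k * T ** b

pred = heaps_vocab(1_000_020)
print(round(pred))  # roughly 38,320, near the 38,365 terms actually observed
```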
Zipf’s law
• Zipf’s law: the 𝑖-th most frequent term has frequency proportional to 1/𝑖.
– 𝑐𝑓_𝑖 = 𝐾/𝑖, where 𝐾 is a normalizing constant
– 𝑐𝑓_𝑖 is the collection frequency:
• the number of occurrences of the term 𝑡_𝑖 in the collection
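For illustration (the value of K here is made up): if the most frequent term occurred 1,000,000 times, Zipf's law predicts the following counts for the next ranks:

```python
def zipf_cf(K, i):
    """Collection frequency of the i-th most frequent term: cf_i = K / i."""
    return K / i

K = 1_000_000  # hypothetical cf of the single most frequent term
print([round(zipf_cf(K, i)) for i in (1, 2, 3)])  # [1000000, 500000, 333333]
```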
Zipf consequences
Index Compression
• Now, we will consider compressing the space for the dictionary and
postings
– Basic Boolean index only
– Not considering positional indexes, etc.
– We will consider different compression schemes
Compressing the term list: Dictionary-as-a-String
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
Space for dictionary as a string
• Storage requirement:
– 4 bytes per term for the document frequency
– 4 bytes per term for the pointer into the postings file
– 3 bytes per term for the term pointer into the string
[Figure: dictionary array with columns Freq., Postings ptr., Term ptr.;
term pointers index into the long string of terms]
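Under the slide's per-term costs plus the RCV1 statistics (average term ≈ 7.5 bytes, rounded up to 8 in the string), the dictionary size works out as below; the constant names are mine, a back-of-envelope sketch:

```python
TERMS = 400_000       # distinct terms in RCV1
DF_BYTES = 4          # document frequency field
POST_PTR_BYTES = 4    # pointer into the postings file
TERM_PTR_BYTES = 3    # term pointer into the string
AVG_TERM_BYTES = 8    # avg bytes per term in the string (7.5 rounded up)

total_bytes = TERMS * (DF_BYTES + POST_PTR_BYTES + TERM_PTR_BYTES + AVG_TERM_BYTES)
print(total_bytes / 1e6)  # 7.6 (MB)
```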
Can we do better? → Blocking
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
– Store the length of each term in the string: lose 4 bytes per block of 4 terms
– But keep only one term pointer per block: save 9 bytes (3 pointers × 3 bytes) per block
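A quick check of the trade-off for block size k = 4 (my own arithmetic):

```python
TERMS = 400_000
K = 4                        # block size: one term pointer per K terms
blocks = TERMS // K          # 100,000 blocks
saved = (K - 1) * 3          # 3 term pointers (3 bytes each) dropped per block
lost = K * 1                 # 1 term-length byte added per term
net_bytes = blocks * (saved - lost)
print(net_bytes / 1e6)  # 0.5 (MB), i.e. 7.6 MB -> 7.1 MB
```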
Is blocking effective?
Front coding – more compression
• Sorted terms share long prefixes; store the shared prefix once and only the differing suffixes:
8automata 8automate 9automatic 10automation
→ 8automat*a 1e 2ic 3ion
(* marks the end of the shared prefix; each later entry stores extra-length + suffix)
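A toy decoder for the idea (front_decode is a hypothetical helper; real front coding also parses the length digits shown in the encoded string):

```python
def front_decode(prefix, first_suffix, rest_suffixes):
    """Rebuild a front-coded block: every term shares `prefix`; the first
    term contributes `first_suffix`, the rest their own suffixes."""
    return [prefix + first_suffix] + [prefix + s for s in rest_suffixes]

# decodes 8automat*a 1e 2ic 3ion
decoded = front_decode("automat", "a", ["e", "ic", "ion"])
print(decoded)  # ['automata', 'automate', 'automatic', 'automation']
```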
Postings compression
Postings: two conflicting forces
• A rare term like arachnocentric occurs in maybe one doc out of a million
– We would like to store that posting using about log2(1M) ≈ 20 bits
• A term like the occurs in virtually every doc, so 20 bits per posting is
too expensive.
– Prefer 0/1 bitmap vector in this case
Postings file entry
• Store the postings list of a term as a sequence of gaps between successive docIDs
– e.g., docIDs 33, 47, 154, 159, 202 → gaps 33, 14, 107, 5, 43
• Hope: most gaps can be encoded/stored with far fewer than 20 bits.
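The gap transformation in code, with illustrative docIDs (a minimal sketch; the helper names are mine):

```python
def to_gaps(doc_ids):
    """Encode a sorted postings list as the first docID followed by gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Recover the docIDs with a running sum over the gaps."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

postings = [283042, 283043, 283154, 283159]
gaps = to_gaps(postings)
print(gaps)                          # [283042, 1, 111, 5]
print(from_gaps(gaps) == postings)   # True
```

Large docIDs turn into small gaps for frequent terms, which is what makes variable-length encoding pay off.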
Variable length encoding
• Aim:
– For arachnocentric, we will use ~20 bits/gap entry.
– For the, we will use ~1 bit/gap entry.
Variable Byte code example (we skip details)
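Even though the slide skips the details, the scheme is short enough to sketch: each byte carries 7 payload bits, and the high bit marks the last byte of a number (this follows the usual textbook convention; the function names are mine):

```python
def vb_encode_number(n):
    """Variable byte code: 7 payload bits per byte; the high bit is set
    only on the last (least significant) byte of a number."""
    out = []
    while True:
        out.insert(0, n % 128)   # prepend the low 7 bits
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # mark the final byte with the high bit
    return bytes(out)

def vb_decode(data):
    """Decode a byte stream of variable-byte-coded numbers."""
    nums, n = [], 0
    for byte in data:
        if byte < 128:           # continuation byte
            n = 128 * n + byte
        else:                    # final byte of this number
            nums.append(128 * n + (byte - 128))
            n = 0
    return nums

print(vb_decode(vb_encode_number(5) + vb_encode_number(214577)))  # [5, 214577]
```

Small gaps cost one byte; a gap only needs more bytes as it grows, so frequent terms with tiny gaps compress well.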
RCV1 Index Compression
Data                                       Size in MB
dictionary, term pointers into string      7.6
  with blocking, k = 4                     7.1
  with blocking & front coding             5.9
Index compression summary
• We can now create an index for highly efficient Boolean retrieval that is
very space efficient
– However, we’ve ignored positional information