IIR chapters 4 and 5
1,Inversion
(1) Sort the termID-docID pairs;
(2) Collect all termID-docID pairs with the same termID into a postings list, where a posting is simply a docID.
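
The two inversion steps can be sketched in a few lines of Python. This is only an illustration: the function name `invert` and the sample pairs are made up for this note, and the whole input is assumed to fit in memory.

```python
from itertools import groupby

def invert(pairs):
    """Turn (termID, docID) pairs into {termID: postings list}."""
    # Step 1: sort by termID, then docID (duplicates are dropped first).
    sorted_pairs = sorted(set(pairs))
    # Step 2: collect all pairs with the same termID into one postings list.
    return {
        term_id: [doc_id for _, doc_id in group]
        for term_id, group in groupby(sorted_pairs, key=lambda p: p[0])
    }

pairs = [(2, 1), (1, 1), (2, 3), (1, 2), (2, 1)]
index = invert(pairs)
print(index)  # {1: [1, 2], 2: [1, 3]}
```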
2,BSBI (blocked sort-based indexing)
(1) segments the collection into parts of equal size,
(2) sorts the termID–docID pairs of each part in memory,
(3) stores intermediate sorted results on disk,
(4) merges all intermediate results into the final index.
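
A toy version of the four BSBI steps, as a rough sketch rather than a real indexer. The names `write_run`, `read_run`, and `bsbi_index` and the one-pair-per-line run format are assumptions made for this note. To keep the example short, the input pairs are a plain list and the final index is returned as an in-memory dict; a real implementation would stream both from and to disk.

```python
import heapq
import os
import tempfile
from contextlib import ExitStack

def write_run(block, path):
    # Steps 2 and 3: sort one block in memory and store it on disk.
    with open(path, "w") as f:
        for term_id, doc_id in sorted(block):
            f.write(f"{term_id} {doc_id}\n")

def read_run(f):
    # Lazily yield the pairs of a sorted run, one line at a time.
    for line in f:
        term_id, doc_id = line.split()
        yield int(term_id), int(doc_id)

def bsbi_index(pairs, block_size, tmp_dir):
    run_paths = []
    # Step 1: segment the pairs into blocks of equal size.
    for start in range(0, len(pairs), block_size):
        path = os.path.join(tmp_dir, f"run{len(run_paths)}.txt")
        write_run(pairs[start:start + block_size], path)
        run_paths.append(path)

    # Step 4: merge all sorted runs into the final index.
    index = {}
    with ExitStack() as stack:
        runs = [read_run(stack.enter_context(open(p))) for p in run_paths]
        for term_id, doc_id in heapq.merge(*runs):
            postings = index.setdefault(term_id, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index

with tempfile.TemporaryDirectory() as tmp:
    index = bsbi_index([(3, 1), (1, 1), (2, 2), (1, 2), (3, 3)],
                       block_size=2, tmp_dir=tmp)
print(index)  # {1: [1, 2], 2: [2], 3: [1, 3]}
```

`heapq.merge` reads the runs lazily, so only one pending pair per run is held in memory during the merge.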
3,SPIMI (single-pass in-memory indexing) uses terms instead of termIDs, writes
each block’s dictionary to disk, and then starts a new dictionary for the next
block. SPIMI can index collections of any size as long as there is enough disk
space available.
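
A minimal sketch of the SPIMI block loop, under some loud assumptions: the names `write_block` and `spimi_invert`, the JSON block format, and the `max_postings` threshold are all invented for this note. A real implementation would start a new block when memory runs out, not after a fixed number of postings, and the final merge of the blocks is left out here.

```python
import json
import os
import tempfile

def write_block(dictionary, tmp_dir, block_no):
    # Terms are sorted only at write time, so that blocks can later be
    # merged with a simple linear scan.
    path = os.path.join(tmp_dir, f"block{block_no}.json")
    with open(path, "w") as f:
        json.dump(sorted(dictionary.items()), f)
    return path

def spimi_invert(token_stream, max_postings, tmp_dir):
    """token_stream yields (term, docID); returns the block file paths."""
    block_paths = []
    dictionary, n_postings = {}, 0
    for term, doc_id in token_stream:
        # Postings are added directly to the term's list: no termIDs needed.
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            n_postings += 1
        if n_postings >= max_postings:
            block_paths.append(write_block(dictionary, tmp_dir, len(block_paths)))
            dictionary, n_postings = {}, 0  # new dictionary for the next block
    if dictionary:
        block_paths.append(write_block(dictionary, tmp_dir, len(block_paths)))
    return block_paths

tokens = [("b", 1), ("a", 1), ("b", 2), ("c", 2), ("a", 3)]
with tempfile.TemporaryDirectory() as tmp:
    blocks = []
    for path in spimi_invert(tokens, max_postings=3, tmp_dir=tmp):
        with open(path) as f:
            blocks.append(json.load(f))
print(blocks)
```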
4,Lossy compression:
Discards some information. Case folding, stemming, and stop word
elimination are forms of lossy compression.
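
A small illustration of how two of these steps shrink the vocabulary. The stop word list and the function name `normalize` are made up for this note, and stemming is left out because it would need an external stemmer.

```python
# Tiny illustrative stop word list, not a standard one.
STOP_WORDS = {"the", "a", "of", "and", "to"}

def normalize(tokens):
    """Case-fold every token and drop stop words."""
    folded = (t.lower() for t in tokens)
    return [t for t in folded if t not in STOP_WORDS]

print(normalize(["The", "Index", "of", "THE", "Web"]))  # ['index', 'web']
```

After normalization the original casing and the stop words cannot be recovered, which is what makes the compression lossy.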
5,Dictionary
The simplest data structure for the dictionary is an array of fixed-width
entries that stores the vocabulary in lexicographic order.
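
A sketch of such a fixed-width array, using the entry layout from the IIR example (20 bytes for the term, 4 for the document frequency, 4 for the pointer to the postings list). The names `ENTRY`, `build_dictionary`, and `lookup` are invented for this note, terms are assumed to be ASCII, and terms longer than 20 bytes would be silently truncated by `struct`.

```python
import struct

# One entry: term padded to 20 bytes, document frequency, postings pointer.
ENTRY = struct.Struct("<20sII")  # 28 bytes per entry

def build_dictionary(entries):
    """entries: iterable of (term, doc_freq, postings_ptr)."""
    buf = bytearray()
    for term, df, ptr in sorted(entries):  # lexicographic order
        buf += ENTRY.pack(term.encode("ascii"), df, ptr)
    return bytes(buf)

def lookup(buf, term):
    """Binary search over the fixed-width entries."""
    key = term.encode("ascii")
    lo, hi = 0, len(buf) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        raw, df, ptr = ENTRY.unpack_from(buf, mid * ENTRY.size)
        current = raw.rstrip(b"\0")  # strip the padding
        if current == key:
            return df, ptr
        if current < key:
            lo = mid + 1
        else:
            hi = mid
    return None

d = build_dictionary([("zulu", 3, 200), ("alpha", 10, 0), ("mike", 7, 80)])
print(len(d), lookup(d, "mike"), lookup(d, "nope"))  # 84 (7, 80) None
```

Because every entry has the same width, entry i starts at byte i * 28, which is what makes binary search possible. The cost is wasted space for short terms.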
6,Variable byte encoding uses an integral number of bytes to encode a gap. The
last 7 bits of a byte are "payload" and encode part of the gap. The
first bit of the byte is a continuation bit, set to 1 for the last byte of
the encoded gap and 0 otherwise.
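
A short sketch of the encoding and decoding, assuming the IIR convention that the continuation bit is 1 on the last byte of a gap and 0 on all earlier bytes. The function names are chosen for this note.

```python
def vb_encode_number(n):
    """Encode one gap as a sequence of bytes, 7 payload bits per byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # set the continuation bit on the last byte
    return bytes(out)

def vb_encode(gaps):
    return b"".join(vb_encode_number(g) for g in gaps)

def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:            # continuation bit 0: more bytes follow
            n = 128 * n + byte
        else:                     # continuation bit 1: last byte of this gap
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

encoded = vb_encode([824, 5, 214577])
print(list(encoded))       # [6, 184, 133, 13, 12, 177]
print(vb_decode(encoded))  # [824, 5, 214577]
```

For example, 824 = 6 * 128 + 56, so it takes two bytes: 6 with continuation bit 0, then 56 + 128 = 184 with continuation bit 1. The small gap 5 fits in the single byte 133.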