IS2140_Ruiting Yi: IS2140_Reading notes

IIR chapters 4 and 5

1, Inversion

(1) Sort the termID-docID pairs;

(2) Collect all termID-docID pairs with the same termID into a postings list, where a posting is simply a docID.
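The two inversion steps can be sketched in a few lines of Python. This is only an illustration: the function name invert and the sample pairs are assumptions, not taken from the book.

```python
from itertools import groupby

def invert(pairs):
    """Turn (termID, docID) pairs into {termID: sorted list of docIDs}."""
    # Step (1): sort by termID, then docID; set() drops duplicate pairs.
    sorted_pairs = sorted(set(pairs))
    # Step (2): collect pairs with the same termID into one postings list.
    return {
        term_id: [doc_id for _, doc_id in group]
        for term_id, group in groupby(sorted_pairs, key=lambda p: p[0])
    }

# Hypothetical sample data: (termID, docID)
pairs = [(2, 1), (1, 1), (2, 3), (1, 2), (2, 1)]
print(invert(pairs))  # {1: [1, 2], 2: [1, 3]}
```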

2, BSBI (blocked sort-based indexing)

(1) segments the collection into parts of equal size,

(2) sorts the termID–docID pairs of each part in memory,

(3) stores intermediate sorted results on disk,

(4) merges all intermediate results into the final index.
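The four steps above can be sketched as follows. This is a simplified illustration, not the book's algorithm verbatim: the function name bsbi_index, the pickle files used as "disk", and the block_size parameter are all assumptions, and the merge step reads whole runs back into memory for brevity where a real implementation would stream them.

```python
import heapq
import os
import pickle
import tempfile
from itertools import groupby

def bsbi_index(pairs, block_size):
    """Build {termID: [docIDs]} from (termID, docID) pairs, block by block."""
    with tempfile.TemporaryDirectory() as tmp:
        block_files = []
        # (1) Segment the pairs into blocks of equal size (the last may be shorter).
        for start in range(0, len(pairs), block_size):
            # (2) Sort the pairs of this block in memory.
            block = sorted(pairs[start:start + block_size])
            # (3) Store the sorted intermediate result on disk.
            path = os.path.join(tmp, f"block{len(block_files)}.pkl")
            with open(path, "wb") as f:
                pickle.dump(block, f)
            block_files.append(path)

        # (4) Merge all sorted runs into the final index.
        runs = []
        for path in block_files:
            with open(path, "rb") as f:
                runs.append(pickle.load(f))  # simplification: loads the whole run
        index = {}
        for term_id, group in groupby(heapq.merge(*runs), key=lambda p: p[0]):
            postings = []
            for _, doc_id in group:
                if not postings or postings[-1] != doc_id:  # skip duplicates
                    postings.append(doc_id)
            index[term_id] = postings
    return index

# Hypothetical sample data: (termID, docID)
pairs = [(3, 1), (1, 1), (2, 2), (1, 2), (3, 2), (1, 3)]
print(bsbi_index(pairs, block_size=2))  # {1: [1, 2, 3], 2: [2], 3: [1, 2]}
```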

3, SPIMI (single-pass in-memory indexing) uses terms instead of termIDs, writes each block's dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available.
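A rough sketch of the SPIMI idea, under stated assumptions: the names spimi_invert and merge_blocks are made up for illustration, a plain Python list stands in for the block files on disk, and max_postings_per_block is a stand-in for "memory is full".

```python
def spimi_invert(token_stream, max_postings_per_block):
    """Consume (term, docID) tokens and return a list of sorted blocks."""
    blocks = []      # stand-in for block files written to disk
    dictionary = {}  # term -> postings list, keyed by the term itself (no termID)
    count = 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])  # a new term is added directly
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            count += 1
        if count >= max_postings_per_block:
            # "Memory full": sort the terms, write the block, start a new dictionary.
            blocks.append(sorted(dictionary.items()))
            dictionary, count = {}, 0
    if dictionary:
        blocks.append(sorted(dictionary.items()))
    return blocks

def merge_blocks(blocks):
    """Merge the per-block dictionaries into one index."""
    index = {}
    for block in blocks:
        for term, postings in block:
            index.setdefault(term, []).extend(postings)
    return {term: sorted(set(p)) for term, p in sorted(index.items())}

# Hypothetical token stream: (term, docID)
tokens = [("b", 1), ("a", 1), ("b", 2), ("a", 2), ("c", 2)]
blocks = spimi_invert(tokens, max_postings_per_block=2)
print(merge_blocks(blocks))  # {'a': [1, 2], 'b': [1, 2], 'c': [2]}
```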

4, Lossy compression:

Discards some information. Case folding, stemming, and stop word elimination are forms of lossy compression.
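A small sketch of two of these steps, case folding and stop word elimination (stemming is left out because it needs a stemmer). The stop list and the function name normalize are assumptions for illustration only.

```python
# Hypothetical stop list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}

def normalize(tokens):
    """Case-fold tokens and drop stop words; the discarded detail cannot be recovered."""
    folded = (t.lower() for t in tokens)
    return [t for t in folded if t not in STOP_WORDS]

print(normalize(["The", "Index", "of", "THE", "Collection"]))  # ['index', 'collection']
```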

5, Dictionary

The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries.
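One way to picture the fixed-width array is the sketch below. The entry layout (20 bytes for the term, 4 bytes for the document frequency, 4 bytes for a pointer to the postings list) is an assumption chosen for illustration, as are the names build_dictionary and lookup. Terms longer than 20 bytes would be truncated in this sketch. Because the entries are sorted and all the same width, a term can be found by binary search.

```python
import struct

# Assumed layout: 20-byte term, 4-byte document frequency, 4-byte postings pointer.
ENTRY = struct.Struct("<20sII")

def build_dictionary(entries):
    """entries: iterable of (term, doc_freq, postings_ptr) with ASCII terms."""
    buf = bytearray()
    for term, df, ptr in sorted(entries):  # lexicographic order
        buf += ENTRY.pack(term.encode("ascii"), df, ptr)  # pads the term with NUL bytes
    return bytes(buf)

def lookup(buf, term):
    """Binary search over the fixed-width entries; returns (doc_freq, ptr) or None."""
    key = term.encode("ascii")
    lo, hi = 0, len(buf) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        raw, df, ptr = ENTRY.unpack_from(buf, mid * ENTRY.size)
        current = raw.rstrip(b"\x00")
        if current == key:
            return df, ptr
        if current < key:
            lo = mid + 1
        else:
            hi = mid
    return None

# Hypothetical entries: (term, document frequency, postings pointer)
buf = build_dictionary([("zulu", 3, 200), ("a", 656265, 0), ("aachen", 65, 100)])
print(lookup(buf, "aachen"))   # (65, 100)
print(lookup(buf, "missing"))  # None
```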

6, Variable byte encoding uses an integral number of bytes to encode a gap. The last 7 bits of a byte are the "payload" and encode part of the gap. The first bit of the byte is a continuation bit: it is set to 1 for the last byte of the encoded gap and to 0 otherwise.
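A minimal sketch of variable byte encoding and decoding, assuming the convention that the continuation bit is 1 on the last byte of each gap and 0 on the others. The function names are chosen for illustration.

```python
def vb_encode_number(n):
    """Encode one non-negative gap as a list of byte values."""
    out = []
    while True:
        out.insert(0, n % 128)  # take the low 7 bits as payload
        if n < 128:
            break
        n //= 128
    out[-1] += 128              # set the continuation bit on the last byte
    return out

def vb_encode(gaps):
    """Encode a sequence of gaps into one byte string."""
    return bytes(b for gap in gaps for b in vb_encode_number(gap))

def vb_decode(data):
    """Decode a byte string back into the list of gaps."""
    gaps, n = [], 0
    for b in data:
        if b < 128:             # continuation bit 0: more bytes follow
            n = 128 * n + b
        else:                   # continuation bit 1: last byte of this gap
            gaps.append(128 * n + (b - 128))
            n = 0
    return gaps

# docIDs 824, 829, 215406 give gaps 824, 5, 214577
encoded = vb_encode([824, 5, 214577])
print(list(encoded))       # [6, 184, 133, 13, 12, 177]
print(vb_decode(encoded))  # [824, 5, 214577]
```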

IS2140_Ruiting Yi

Thursday, January 23, 2014

IS2140_Reading notes_Unit 3
