Thursday, January 16, 2014

IS2140_Reading notes_Unit 2


IIR sections 1.2, chapters 2 and 3.
Date: 01/16/2014

Chapter 1 Boolean Retrieval

1,Boolean Retrieval Model
A model for information retrieval in which we can pose any query that is in the form of a Boolean expression of terms, that is, terms combined with the operators AND, OR, and NOT. The model views each document as a set of words.
2,Two key statistics to assess Effectiveness
(1) Precision
(2) Recall
3,Steps for building an inverted index
(1) Collect the documents to be indexed;
(2) Tokenize the text, turning each document into a list of tokens;
(3) Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms;
(4) Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings. ----sorting and grouping
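The four steps above can be sketched in a toy Python example (whitespace tokenization and lowercasing stand in for real linguistic preprocessing; the sample documents are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of docIDs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Steps 2-3: tokenize and normalize (here: split + lowercase only).
        for token in text.lower().split():
            index[token].add(doc_id)
    # Step 4: sort each postings list, as required for efficient merging.
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july"]
index = build_inverted_index(docs)
# index["home"] -> [0, 1, 2]; index["july"] -> [1, 2]
```

The dictionary here is the dict's key set and the postings are the sorted lists; a production index would store term statistics and compress the postings.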
 

Chapter 2 The term vocabulary and postings lists

1,Tokenization

Tokenization is the process of chopping character streams into tokens, perhaps at the same time throwing away certain characters, such as punctuation. Linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed.

2,Stop words

Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.
A stop list is typically built by sorting the terms by collection frequency (the total number of times each term appears in the collection) and taking the most frequent ones as stop words.
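Building a candidate stop list by collection frequency can be sketched with a `Counter` (the documents and the cutoff `n` are illustrative choices, not from the book):

```python
from collections import Counter

def stop_list(docs, n=3):
    """Return the n terms with highest collection frequency
    as a candidate stop list."""
    counts = Counter()
    for text in docs:
        counts.update(text.lower().split())
    return [term for term, _ in counts.most_common(n)]

docs = ["the cat sat on the mat", "the dog sat on the log"]
# "the" has the highest collection frequency (4 occurrences)
```

In practice the candidates are then vetted by hand, since some frequent terms (e.g. in phrase queries) are worth keeping.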

3, Token normalization
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the token.
[1] Most standard way: to implicitly create equivalence classes, which are normally named after one member of the set.
e.g. antidiscriminatory ← {antidiscriminatory, anti-discriminatory}
      [2] To maintain relations between unnormalized tokens. Two ways:

(1) To index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term.

      More flexible, but requires more processing at query time.

(2) To perform the expansion during index construction.

      Less flexible, but requires less processing at query time, at the cost of more space for storing postings.
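The equivalence-classing approach can be sketched as a mapping applied at both index and query time; this toy normalizer (my own simplification, not the book's) just lowercases and drops hyphens and periods:

```python
import re

def normalize(token):
    """Map superficially different tokens to one canonical class member:
    lowercase, then drop hyphens and periods. The class is implicitly
    named by the normalized form."""
    return re.sub(r"[-.]", "", token.lower())

# normalize("anti-discriminatory") == normalize("antidiscriminatory")
# normalize("U.S.A.") -> "usa"
```

Because the same function runs on documents and queries, a query for either spelling matches documents containing the other.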

4, Stemming and Lemmatization
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing related word forms to a common base correctly most of the time, and often includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word (the lemma).
Comparison:

(1)   Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words.

(2)   Particular domains may require special stemming rules.

(3)   Stemming increases recall while harming precision.

5,Porter's algorithm
   Porter's algorithm consists of 5 phases of word reductions, applied sequentially.
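The real algorithm is much more involved (each rule is guarded by a measure of the word's structure); the toy suffix stripper below only illustrates the flavor of one rule group from phase 1 and is not Porter's algorithm:

```python
def toy_stem(word):
    """Apply one group of phase-1-style suffix rules, longest match first:
    sses -> ss, ies -> i, ss -> ss, s -> (empty). A toy illustration only."""
    for suffix, repl in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

# toy_stem("caresses") -> "caress"; toy_stem("ponies") -> "poni"
```

Note that the rules must be tried in this order; matching the bare `s` rule first would mangle words like caress.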

6,Skip pointer

For a postings list of length P, use √P evenly spaced skip pointers.
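A sketch of postings-list intersection with √P-spaced skips (here the skip pointers are simulated by index jumps rather than stored in the list; the sample postings are illustrative):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists; when one list's current docID
    is behind, jump ahead by its skip width as long as the jump does not
    overshoot the other list's current docID."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1          # take the skip pointer
            else:
                i += 1              # ordinary step
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

p1 = [2, 4, 8, 16, 19, 23, 28, 43]
p2 = [1, 2, 3, 5, 8, 41, 51, 60, 71]
# intersection: [2, 8]
```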

7, Two approaches to supporting phrase queries and their combination

  [1] Biword indexes

       To consider every pair of consecutive terms in a document as a phrase. Longer phrase queries can be processed by breaking them down into biwords and intersecting their postings (which can yield false positives).
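A biword index and phrase query answered by AND-ing biword postings might look like this (sample documents are my own; note the possible false positives mentioned above would need verification against the documents):

```python
from collections import defaultdict

def build_biword_index(docs):
    """Treat each pair of consecutive terms as one dictionary entry."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            index[a + " " + b].add(doc_id)
    return index

def biword_phrase_query(index, phrase):
    """Break a longer phrase into biwords and intersect their postings."""
    tokens = phrase.lower().split()
    biwords = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    postings = [index[bw] for bw in biwords]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["stanford university palo alto", "palo alto university"]
idx = build_biword_index(docs)
# biword_phrase_query(idx, "stanford university palo alto") -> [0]
```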

[2] Positional indexes

For each term in the vocabulary, we store postings of the form docID:<position1,position2,...>, where each position is a token index in the document.
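A positional index in that form, with a two-term phrase query that checks for adjacent positions, can be sketched as (documents are illustrative):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {docID: [positions]}, position = token index in doc."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def positional_phrase_query(index, t1, t2):
    """Docs in which t2 occurs immediately after t1."""
    hits = []
    for doc_id, positions in index[t1].items():
        pos2 = set(index[t2].get(doc_id, []))
        if any(p + 1 in pos2 for p in positions):
            hits.append(doc_id)
    return sorted(hits)

docs = ["to be or not to be", "to do or not to do"]
index = build_positional_index(docs)
# positional_phrase_query(index, "to", "be") -> [0]
```

The same position lists also support proximity operators (e.g. within k words) by testing |p1 − p2| ≤ k instead of p2 = p1 + 1.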

 

Chapter 3 Dictionaries and tolerant retrieval

1,Two techniques for handling general wildcard queries

  [1] Permuterm indexes

(1) $ : to mark the end of a term

(2) To construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term.

    (3) Permuterm vocabulary: the set of rotated terms in the permuterm index.

       Disadvantage: the dictionary becomes quite large, since it includes all rotations of each term.
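The rotation trick can be sketched as follows: to answer a single-* wildcard, rotate the query so the * lands at the end, then prefix-match in the permuterm vocabulary (the sample vocabulary is made up; a real index would use a B-tree rather than a linear scan for the prefix match):

```python
def rotations(term):
    """All rotations of term + '$'; each one points back to term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocab):
    index = {}
    for term in vocab:
        for rot in rotations(term):
            index[rot] = term
    return index

def wildcard_lookup(index, query):
    """Handle one '*': rotate query$ so '*' is at the end, then
    prefix-match. E.g. m*n -> rotate to n$m* -> prefix 'n$m'."""
    before, after = query.split("*")
    prefix = after + "$" + before
    return sorted({term for rot, term in index.items()
                   if rot.startswith(prefix)})

vocab = ["man", "moron", "moon", "mace"]
idx = build_permuterm(vocab)
# wildcard_lookup(idx, "m*n") -> ["man", "moon", "moron"]
```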

 [2] k-gram indexes
A k-gram is a sequence of k characters. In a k-gram index, the dictionary contains all k-grams that occur in any term in the vocabulary.

With $ used to denote the beginning or end of a term, the full set of 3-grams generated for castle is: $ca, cas, ast, stl, tle, le$.

The 3-gram etr would point to vocabulary terms such as metric and retrieval. Disadvantage: it demands one further step of processing.

In the post-filtering step, the terms enumerated by the Boolean query on the 3-gram index are checked individually against the original query (e.g. red*). This is a simple string-matching operation that weeds out terms such as retired which do not match the original query. Terms that survive are then searched in the standard inverted index as usual.
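The whole pipeline for a trailing-* query over a k-gram index, including the post-filter, can be sketched as (vocabulary is illustrative; note retired is correctly weeded out):

```python
from collections import defaultdict

def kgrams(term, k=3):
    """All k-grams of '$term$', with $ marking term boundaries."""
    t = "$" + term + "$"
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def build_kgram_index(vocab, k=3):
    index = defaultdict(set)
    for term in vocab:
        for g in kgrams(term, k):
            index[g].add(term)
    return index

def wildcard(index, query, k=3):
    """Answer a trailing-* query: AND the k-grams of '$' + prefix
    (no closing $, since the term continues), then post-filter."""
    prefix = query.rstrip("*")
    t = "$" + prefix
    grams = [t[i:i + k] for i in range(len(t) - k + 1)]
    candidates = set.intersection(*(index[g] for g in grams))
    # Post-filtering: simple string match against the original pattern.
    return sorted(c for c in candidates if c.startswith(prefix))

vocab = ["red", "redo", "reduce", "retired", "metric"]
index = build_kgram_index(vocab)
# wildcard(index, "red*") -> ["red", "redo", "reduce"]; retired survives
# the 3-gram AND ($re AND red) but is removed by the post-filter
```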

2, Implementing spelling correction

(1)   Of the various alternative correct spellings for a misspelled query, choose the “nearest” one. This demands that we have a notion of nearness or proximity between a pair of queries.

     (2) When two correctly spelled queries are tied (or nearly tied), select the one that is more common. The simplest notion of more common is to consider the number of occurrences of the term in the collection.

   A special case: the web. The idea is to use the correction that is most common among queries typed in by other users.

3, Forms of spelling correction

[1] isolated-term correction: to correct a single query term at a time – even when we have a multiple-term query.

 Two techniques for addressing isolated-term correction:

(1)    edit distance

          The edit distance between two character strings s1 and s2 is the minimum number of edit operations required to transform s1 into s2.

       Edit operations :

(i)             insert a character into a string;

(ii)            delete a character from a string; and

(iii)           replace a character of a string by another character
cat → dog: edit distance = 3 (replace each of the three characters)
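The edit distance defined above is computed by the standard dynamic-programming recurrence, a sketch of which is:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: the minimum number of insertions,
    deletions, and replacements turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace / match
    return dp[m][n]

# edit_distance("cat", "dog") -> 3
```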
(2)    k-gram overlap

[2] Context-sensitive correction: to enumerate corrections of each of the query terms even though each query term may be correctly spelled on its own, then try substitutions of each correction in the phrase.

Rather than expanding all the alternatives, we retain only the most frequent combinations in the collection or in the query logs, which contain previous queries by users.

4,Phonetic correction

The main idea here is to generate, for each term, a “phonetic hash” so that similar-sounding terms hash to the same value.

Soundex algorithm, with various variants, built on the following scheme:

1. Turn every term to be indexed into a 4-character reduced form. Build an inverted index from these reduced forms to the original terms; call this the soundex index.

2. Do the same with query terms.

3. When the query calls for a soundex match, search this soundex index.
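A sketch of the reduced-form computation in one common Soundex variant (the exact treatment of h and w differs between variants; this one simply ignores them):

```python
def soundex(term):
    """One common Soundex variant: keep the first letter, map the rest
    to digit classes, skip vowels/h/w/y, collapse adjacent duplicate
    codes, and pad/truncate to 4 characters."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    term = term.lower()
    result = term[0].upper()
    prev = codes.get(term[0], "")
    for ch in term[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

# soundex("Robert") -> "R163"; similar-sounding Rupert hashes the same
```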
