IIR sections 1.2, chapters 2 and 3.
Date: 01/16/2014
Chapter 1 Boolean Retrieval
1, Boolean Retrieval Model
A model for information retrieval in which we can pose any query in the form of a Boolean expression of terms, in which terms are combined with the operators AND, OR, and NOT. The model views each document as a set of words.
2, Two key statistics to assess effectiveness
(1) Precision
(2) Recall
3, Steps for building an inverted index
(1) Collect the documents to be indexed;
(2) Tokenize the text, turning each document into a list of tokens;
(3) Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms;
(4) Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings. (The core of this step is sorting and grouping.)
Chapter 2 The term vocabulary and postings lists
1, Tokenization
Tokenization is the process of chopping character streams into tokens, perhaps at the same time throwing away certain characters, such as punctuation; linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed.
2, Stop words
Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. A common way to pick them is to sort the terms by collection frequency and take the most frequent ones.
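A minimal sketch of picking stop words by collection frequency (the toy documents and cutoff are my own illustration):

```python
from collections import Counter

def stop_words(tokenized_docs, k):
    """Return the k most frequent terms by collection frequency,
    i.e. total occurrences across all documents."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {term for term, _ in counts.most_common(k)}

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "and", "the", "cat"]]
print(stop_words(docs, 1))  # {'the'}
```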
3, Token normalization
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
[1] Most standard way: to implicitly create equivalence classes, which are normally named after one member of the set.
e.g. antidiscriminatory <- {antidiscriminatory, anti-discriminatory}
[2] To maintain relations between unnormalized tokens:
(1) To index unnormalized tokens and maintain a query expansion list of multiple vocabulary entries to consider for a certain query term. Requires more processing at query time.
(2) To perform the expansion during index construction. Requires more space for storing postings.
Both variants are more flexible than implicit equivalence classes, because the expansion lists can overlap.
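The query-time variant can be sketched as follows; the expansion table and toy index are hand-made for illustration, not from the book:

```python
# Query-time expansion: index unnormalized tokens, and at query time
# look each query term up in an expansion list of equivalent forms.
# This hand-made table is purely illustrative.
EXPANSIONS = {
    "car": ["car", "automobile"],
}

def expand_query(term, index):
    """Union the postings of every vocabulary entry equivalent to `term`."""
    result = set()
    for variant in EXPANSIONS.get(term, [term]):
        result |= set(index.get(variant, []))
    return sorted(result)

index = {"car": [1, 3], "automobile": [2, 3], "Windows": [4]}
print(expand_query("car", index))  # [1, 2, 3]
```

The index-time variant would instead add each docID of "automobile" to the postings of "car" (and vice versa) while building the index, trading index space for faster queries.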
4, Stemming and Lemmatization
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing related word forms to a common base correctly most of the time, and often includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word.
Comparison:
(1) Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words.
(2) Particular domains may require special stemming rules.
(3) Stemming increases recall while harming precision.
5, Porter's algorithm
Porter's algorithm consists of 5 phases of word reductions, applied sequentially.
6, Skip pointers
For a postings list of length P, use √P evenly spaced skip pointers.
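A sketch of postings intersection with √P skip pointers. Rather than storing explicit pointers, this illustration simulates them by letting each position jump ahead a fixed stride when the jump does not overshoot the other list's current posting:

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using sqrt(P) skip strides.

    From position i we jump ahead `skip` entries whenever the
    skipped-to posting is still <= the other list's current posting,
    so no matches can be skipped over.
    """
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1          # follow the skip pointer
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

p1 = [1, 2, 4, 8, 16, 32, 64, 128]
p2 = [8, 64, 128]
print(intersect_with_skips(p1, p2))  # [8, 64, 128]
```

Skips pay off when one list is much shorter than the other, since long stretches of the longer list can be stepped over.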
7, Two approaches to supporting phrase queries and their combination
[1] Biword indexes
Consider every pair of consecutive terms in a document as a phrase. Longer phrases can be processed by breaking them down into biwords.
[2] Positional indexes
For each term in the vocabulary, we store postings of the form docID: <position1, position2, ...>, where each position is a token index in the document.
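A minimal positional index and phrase query, with my own toy documents and naive whitespace tokenization:

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {docID: [positions]}, positions being token offsets."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return docIDs where the words of `phrase` occur consecutively."""
    terms = phrase.lower().split()
    postings = [index.get(t, {}) for t in terms]
    result = []
    # Only documents containing every term can contain the phrase.
    for doc_id in set(postings[0]).intersection(*postings[1:]):
        # The phrase matches if some occurrence of the first term is
        # followed by the remaining terms at consecutive positions.
        if any(all(p + k in postings[k][doc_id] for k in range(1, len(terms)))
               for p in postings[0][doc_id]):
            result.append(doc_id)
    return sorted(result)

docs = {1: "to be or not to be", 2: "be or not"}
print(phrase_query(build_positional_index(docs), "to be"))  # [1]
```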
Chapter 3 Dictionaries and tolerant retrieval
1, Two techniques for handling general wildcard queries
[1] Permuterm indexes
(1) $: to mark the end of a term.
(2) To construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term.
(3) Permuterm vocabulary: the set of rotated terms in the permuterm index.
Disadvantage: large dictionary.
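A sketch of the permuterm idea: rotate the query so the * lands at the end, then prefix-match against the rotated vocabulary. A real implementation would use a B-tree or sorted array for the prefix lookup; the linear scan below is only for illustration.

```python
def permuterm_rotations(term):
    """All rotations of term+'$'; each links back to the original term."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def build_permuterm_index(vocabulary):
    """rotation -> original vocabulary term."""
    return {rot: term for term in vocabulary
            for rot in permuterm_rotations(term)}

def wildcard_lookup(index, query):
    """Handle a single-* wildcard: rotate the query so * is at the end,
    then prefix-match against the permuterm vocabulary."""
    prefix, suffix = query.split("*")
    rotated = suffix + "$" + prefix          # m*n  ->  n$m (then match m*)
    return sorted({term for rot, term in index.items()
                   if rot.startswith(rotated)})

idx = build_permuterm_index(["man", "moron", "mood", "ten"])
print(wildcard_lookup(idx, "m*n"))  # ['man', 'moron']
```

The "large dictionary" disadvantage is visible here: every term of length L contributes L+1 rotations.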
[2] k-gram indexes
A k-gram is a sequence of k characters. In a k-gram index, the dictionary contains all k-grams that occur in any term in the vocabulary. Using $ to denote the beginning or end of a term, the full set of 3-grams generated for castle is: $ca, cas, ast, stl, tle, le$.
The 3-gram etr would point to vocabulary terms such as metric and retrieval.
Disadvantage: demands one further step of processing, a post-filtering step, in which the terms enumerated by the Boolean query on the 3-gram index are checked individually against the original query red*. This is a simple string-matching operation that weeds out terms such as retired that do not match the original query. Terms that survive are then searched in the standard inverted index as usual.
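A sketch of a 3-gram index answering red*, including the post-filtering step that removes retired; the small vocabulary is my own illustration:

```python
import re

def kgrams(term, k=3):
    """All k-grams of $term$, with $ marking the beginning and end."""
    augmented = "$" + term + "$"
    return {augmented[i:i + k] for i in range(len(augmented) - k + 1)}

def build_kgram_index(vocabulary, k=3):
    """k-gram -> set of vocabulary terms containing that k-gram."""
    index = {}
    for term in vocabulary:
        for gram in kgrams(term, k):
            index.setdefault(gram, set()).add(term)
    return index

def wildcard_query(index, query, k=3):
    """Answer a single-* wildcard such as red*: AND together the query's
    k-grams, then post-filter the candidates against the pattern."""
    prefix, suffix = query.split("*")
    padded = "$" + prefix + "*" + suffix + "$"
    grams = [padded[i:i + k] for i in range(len(padded) - k + 1)
             if "*" not in padded[i:i + k]]       # red* -> $re, red
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Post-filtering: weed out false positives such as 'retired' for red*,
    # which contains both $re and red but does not start with red.
    pattern = re.compile(re.escape(prefix) + ".*" + re.escape(suffix) + r"\Z")
    return sorted(t for t in candidates if pattern.match(t))

vocab = ["red", "redden", "retired"]
idx = build_kgram_index(vocab)
print(wildcard_query(idx, "red*"))  # ['red', 'redden']
```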
2, Implementing spelling correction
(1) Of various alternative correct spellings for a misspelled query, choose the "nearest" one. This demands that we have a notion of nearness or proximity between a pair of queries.
(2) When two correctly spelled queries are tied (or nearly tied), select the one that is more common. The simplest notion of more common is to consider the number of occurrences of the term in the collection.
Special case: the web. The idea is to use the correction that is most common among queries typed in by other users.
3, Forms of spelling correction
[1] Isolated-term correction: to correct a single query term at a time, even when we have a multiple-term query.
Two techniques for addressing isolated-term correction:
(1) Edit distance
The edit distance between two character strings s1 and s2 is the minimum number of edit operations required to transform s1 into s2. Edit operations:
(i) insert a character into a string;
(ii) delete a character from a string; and
(iii) replace a character of a string by another character.
e.g. cat → dog, edit distance = 3
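The definition above is computed by the standard dynamic-programming recurrence (Levenshtein distance), sketched here:

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions, and replacements
    needed to transform s1 into s2 (dynamic programming)."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete s1[i-1]
                          d[i][j - 1] + 1,        # insert s2[j-1]
                          d[i - 1][j - 1] + cost) # replace (or match)
    return d[m][n]

print(edit_distance("cat", "dog"))  # 3
```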
(2) k-gram overlap
[2] Context-sensitive correction: to enumerate corrections of each of the query terms even though each query term is correctly spelled, then try substitutions of each correction in the phrase. Rather than expand all the alternatives, we retain only the most frequent combinations in the collection or in the query logs, which contain previous queries by users.
4, Phonetic correction
The main idea here is to generate, for each term, a "phonetic hash" so that similar-sounding terms hash to the same value.
The Soundex algorithm, with various variants, is built on the following scheme:
1. Turn every term to be indexed into a 4-character reduced form. Build an inverted index from these reduced forms to the original terms; call this the soundex index.
2. Do the same with query terms.
3. When the query calls for a soundex match, search this soundex index.
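A sketch of one common Soundex variant (keep the first letter, code the remaining consonants as digits, collapse runs, drop vowels, pad to 4 characters); exact coding rules differ slightly between variants:

```python
def soundex(term):
    """Soundex phonetic hash: a 4-character code, first letter plus
    three digits, so that similar-sounding terms collide."""
    codes = {**dict.fromkeys("aeiouhwy", "0"),
             **dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    term = term.lower()
    digits = [codes.get(ch, "0") for ch in term]
    # Collapse runs of the same digit, then drop the zeros (vowels etc.).
    collapsed = [d for i, d in enumerate(digits)
                 if i == 0 or d != digits[i - 1]]
    tail = [d for d in collapsed[1:] if d != "0"]
    return (term[0].upper() + "".join(tail) + "000")[:4]

# Similar-sounding names hash to the same value, so a soundex index
# would return both originals for either query term.
print(soundex("hermann"), soundex("herman"))  # H655 H655
```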