Thursday, January 30, 2014

IS2140_Muddiest Points_Class3


Class 3

I still don't understand how to merge index blocks on slide page 51. Does it mean to store the word "apple" and record its positions in the document?

IS2140_Reading notes_Unit 4


IIR sections 1.3 and 1.4, chapter 6.

1, Zone: an arbitrary, unbounded amount of text, for instance, document titles and abstracts.

2, How to implement the computation of weighted zone scores?

The algorithm treats the case where the query q is a two-term query consisting of query terms q1 and q2, and the Boolean function is AND: 1 if both query terms are present in a zone and 0 otherwise.

    ZONESCORE(q1, q2)
      float scores[N] = [0]
      constant g[ℓ]
      p1 ← postings(q1)
      p2 ← postings(q2)
      // scores[] is an array with a score entry for each document, initialized to zero.
      // p1 and p2 are initialized to point to the beginning of their respective postings.
      // Assume g[] is initialized to the respective zone weights.
      while p1 ≠ NIL and p2 ≠ NIL
        do if docID(p1) = docID(p2)
             then scores[docID(p1)] ← WEIGHTEDZONE(p1, p2, g)
                  p1 ← next(p1)
                  p2 ← next(p2)
           else if docID(p1) < docID(p2)
             then p1 ← next(p1)
           else p2 ← next(p2)
      return scores
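The merge in the pseudocode above can be sketched in Python. The posting representation here (a docID paired with a per-zone presence map) and the zone names and weights are illustrative assumptions, not from the text:

```python
# Sketch of ZONESCORE for a two-term AND query. Each posting is a
# (docID, zone_hits) pair, where zone_hits maps zone name -> True if the
# term occurs in that zone. Zone names/weights below are made up.

def weighted_zone(hits1, hits2, g):
    """Sum of zone weights g[z] over zones where BOTH terms occur."""
    return sum(w for z, w in g.items() if hits1.get(z) and hits2.get(z))

def zone_score(postings1, postings2, g, n_docs):
    """Merge two sorted postings lists, scoring docs containing both terms."""
    scores = [0.0] * n_docs
    i = j = 0
    while i < len(postings1) and j < len(postings2):
        d1, hits1 = postings1[i]
        d2, hits2 = postings2[j]
        if d1 == d2:
            scores[d1] = weighted_zone(hits1, hits2, g)
            i += 1
            j += 1
        elif d1 < d2:
            i += 1
        else:
            j += 1
    return scores

g = {"title": 0.6, "body": 0.4}          # zone weights, summing to 1
p1 = [(0, {"title": True}), (2, {"body": True})]
p2 = [(0, {"title": True, "body": True}), (2, {"body": True})]
print(zone_score(p1, p2, g, 3))          # doc 0 matches in title, doc 2 in body
```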
 


3, Scoring has hinged on whether or not a query term is present in a zone within a document.

For each training example Φj we have Boolean values sT(dj, qj) and sB(dj, qj) that we use to compute a score from

 score(dj, qj) = g · sT(dj, qj) + (1 − g) · sB(dj, qj)

 

4, A document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score.

We assign to each term in a document a weight that depends on the number of occurrences of the term in the document, and compute a score between a query term t and a document d based on the weight of t in d.

5, Various frequency measures.

Term frequency: The simplest approach is to assign the weight to be equal to the number of occurrences of term t in document d

 Document frequency dft: defined to be the number of documents in the collection that contain a term t.

 Inverse document frequency (idf) of a term t: idft = log(N / dft)

Term frequency and inverse document frequency: the weighting scheme assigns to term t a weight in document d given by tf-idft,d = tft,d × idft.
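The tf, df, and idf definitions above can be sketched over a toy collection (the documents below are made up for illustration):

```python
import math

# Toy collection for illustrating tf, df, idf; documents are invented.
docs = [
    "the apple fell from the tree",
    "apple pie recipe",
    "the history of the tree",
]

N = len(docs)

def tf(term, doc):
    """Term frequency: number of occurrences of term in doc."""
    return doc.split().count(term)

def df(term):
    """Document frequency: number of documents containing term."""
    return sum(1 for d in docs if term in d.split())

def idf(term):
    """Inverse document frequency: idf_t = log(N / df_t)."""
    return math.log(N / df(term))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf("the", docs[0]))    # 2
print(df("apple"))           # 2
```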

 

6, The tf-idf weighting assigns to term t a weight in document d that is

(1). highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);

(2). lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

(3). lowest when the term occurs in virtually all documents. 

7, Overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d.

8, Document vector: the notion of a document vector captures the relative importance of the terms in a document.

Vector space model: the representation of a set of documents as vectors in a common vector space.

 

Thursday, January 23, 2014

IS2140_Reading notes_Unit 3


IIR chapters 4, and 5

1,Inversion

 (1) To sort termID-docID pairs;

 (2) To collect all termID-docID pairs with the same termID into a postings list, where a posting is simply a docID.

2, BSBI (blocked sort-based indexing)

(1)   segments the collection into parts of equal size,

(2)   sorts the termID–docID pairs of each part in memory,

(3)   stores intermediate sorted results on disk,

(4)   merges all intermediate results into the final index.
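The sort-then-merge idea behind steps (2) and (4) can be sketched with Python's heapq.merge, which lazily merges already-sorted runs. Disk I/O is replaced by in-memory lists, and the (termID, docID) pairs are made up:

```python
import heapq

# Toy sketch of BSBI: sort each block's (termID, docID) pairs in memory,
# then merge the sorted runs into the final ordering. Pairs are invented.
block1 = [(2, 1), (1, 3), (2, 7)]
block2 = [(3, 2), (1, 5)]

runs = [sorted(b) for b in (block1, block2)]   # step 2: sort each block
merged = list(heapq.merge(*runs))              # step 4: merge sorted runs
print(merged)   # [(1, 3), (1, 5), (2, 1), (2, 7), (3, 2)]
```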

 

3, SPIMI: single-pass in-memory indexing uses terms instead of termIDs, writes each block's dictionary to disk, and then starts a new dictionary for the next block. SPIMI can index collections of any size as long as there is enough disk space available.
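The in-memory inversion step of SPIMI can be sketched as below; block writing and merging are omitted, and the token stream is invented for illustration:

```python
from collections import defaultdict

# Minimal sketch of SPIMI's invert step: terms (not termIDs) key the
# dictionary, and postings lists grow as (term, docID) pairs stream in.

def spimi_invert(token_stream):
    """token_stream yields (term, docID) pairs; returns a sorted block index."""
    dictionary = defaultdict(list)
    for term, doc_id in token_stream:
        postings = dictionary[term]          # new terms get an empty list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    # sort terms so blocks can later be merged with a linear scan
    return dict(sorted(dictionary.items()))

stream = [("apple", 1), ("tree", 1), ("apple", 2), ("apple", 2)]
print(spimi_invert(stream))   # {'apple': [1, 2], 'tree': [1]}
```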

 

4, Lossy compression:

      To discard some information, such as case folding, stemming, and stop word elimination.

 

5, Dictionary

 The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries.

 

6, Variable byte encoding uses an integral number of bytes to encode a gap. The last 7 bits of a byte are "payload" and encode part of the gap. The first bit of the byte is a continuation bit.

 

Thursday, January 16, 2014

IS2140_Muddiest Points_Class2

Date:01/16/2014

1, In the "How are Phrases Used" part, in the precoordinate step, query constituents such as "white house" need to be replaced with the phrase "white_house". But how do we distinguish the phrase "White House", which stands for the executive mansion of the President of the United States, from another "white house" that simply means a house with white walls? Do we need another algorithm to distinguish their semantic meanings?

IS2140_Reading notes_Unit 2


IIR sections 1.2, chapters 2 and 3.
Date: 01/16/2014

Chapter 1 Boolean Retrieval

1,Boolean Retrieval Model
A model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, in which terms are combined with the operators AND, OR, and NOT. The model views each document as a set of words.
2,Two key statistics to assess Effectiveness
(1) Precision
(2) Recall
3,Steps for building an inverted index
(1) Collect the documents to be indexed;
(2) Tokenize the text, turning each document into a list of tokens;
(3) Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms;
(4) Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings. ----sorting and grouping
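The four steps above can be sketched on a toy collection; tokenization and normalization are reduced here to splitting on whitespace, stripping punctuation, and lowercasing, and the documents are invented:

```python
from collections import defaultdict

# Toy inverted index: term -> sorted list of docIDs it occurs in.
docs = {
    1: "Friends, Romans, countrymen",
    2: "So let it be with Caesar",
}

def normalize(token):
    """Crude linguistic preprocessing: strip punctuation, lowercase."""
    return token.strip(",.").lower()

index = defaultdict(list)
for doc_id, text in sorted(docs.items()):       # docIDs visited in order
    for token in text.split():                  # tokenize
        term = normalize(token)
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)          # postings stay sorted

print(sorted(index.items()))
```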
 

Chapter 2 The term vocabulary and postings lists

1,Tokenization

Tokenization is the process of chopping character streams into tokens, perhaps at the same time throwing away certain characters such as punctuation. Linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed.

2,Stop words

Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. One way to find them is to sort the terms by collection frequency and take the most frequent ones.

3, Token normalization
Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the token.
[1] Most standard way: to implicitly create equivalence classes, which are normally named after one member of the set, e.g. antidiscriminatory <antidiscriminatory, anti-discriminatory>.
      [2] To maintain relations between unnormalized tokens.

(1) To index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term.

      More flexible, but requires more processing at query time.

(2) To perform the expansion during index construction.

      Less flexible, but requires less processing at query time at the cost of more space for storing postings.

 4, Stemming and Lemmatization
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word.
Comparison:

(1)   Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words.

(2)   Particular domains may require special stemming rules.

(3)   Stemming increases recall while harming precision.

5,Porter's algorithm
   Porter's algorithm consists of 5 phases of word reductions, applied sequentially.

6,Skip pointer

For a postings list of length P, use √P evenly spaced skip pointers.
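A sketch of postings intersection with skip pointers, placing a skip every ⌊√P⌋ entries of a plain sorted docID list; the example postings are invented:

```python
import math

# Intersect two sorted docID lists, jumping ahead via skip pointers when
# the skip target still does not overshoot the other list's current docID.

def intersect_with_skips(p1, p2):
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1                  # safe: skip target <= p2[j]
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 32, 64], [1, 2, 3, 8, 64, 128]))
```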

7, Two approaches to supporting phrase queries and their combination

  [1] Biword indexes

       To consider every pair of consecutive terms in a document as a phrase. Longer phrases can be processed by breaking them down.

[2] Positional indexes

For each term in the vocabulary, we store postings of the form docID:<position1,position2,...>, where each position is a token index in the document.
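A positional index and a two-word phrase query can be sketched as below, checking for adjacent positions; the sample documents are invented:

```python
from collections import defaultdict

# Positional index: term -> {docID: [position1, position2, ...]}.
docs = {1: "the white house is white", 2: "a white wall in a house"}

index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, token in enumerate(text.split()):
        index[token][doc_id].append(pos)

def phrase_query(w1, w2):
    """Docs where w2 occurs at position p+1 for some occurrence p of w1."""
    hits = []
    for doc_id, positions in index[w1].items():
        later = index[w2].get(doc_id, [])
        if any(p + 1 in later for p in positions):
            hits.append(doc_id)
    return sorted(hits)

print(phrase_query("white", "house"))   # only doc 1 has "white house"
```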

 

Chapter 3 Dictionaries and tolerant retrieval

1,Two techniques for handling general wildcard queries

  [1] Permuterm indexes

(1) $ : to mark the end of a term

(2) To construct a permuterm index, in which the various rotations of each term (augmented with $) all link to the original vocabulary term.

    (3) Permuterm vocabulary: the set of rotated terms in the permuterm index.

       Disadvantage: a large dictionary.
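The rotations and the wildcard lookup can be sketched as below; the vocabulary and the single-* query are invented, and the linear scan over rotations stands in for a real prefix lookup in a sorted dictionary:

```python
# Permuterm index: append $ and generate all rotations of each term, so a
# wildcard query like m*n becomes a prefix lookup on the rotation n$m.

def permuterm_rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

vocab = ["man", "moon", "mean"]
index = {}
for term in vocab:
    for rot in permuterm_rotations(term):
        index.setdefault(rot, set()).add(term)

def wildcard(query):
    """Answer a single-* query, e.g. "m*n"."""
    head, tail = query.split("*")
    prefix = tail + "$" + head      # rotate so the * lands at the end
    return sorted({t for rot, ts in index.items()
                   if rot.startswith(prefix) for t in ts})

print(wildcard("m*n"))   # ['man', 'mean', 'moon']
```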

 [2] k-gram indexes
A k-gram is a sequence of k characters. In a k-gram index, the dictionary contains all k-grams that occur in any term in the vocabulary.

Using $ to denote the beginning or end of a term, the full set of 3-grams generated for castle is: $ca, cas, ast, stl, tle, le$.

The 3-gram etr would point to vocabulary terms such as metric and retrieval. Disadvantage: it demands one further step of processing.

Post-filtering step: the terms enumerated by the Boolean query on the 3-gram index are checked individually against the original query red*. This is a simple string-matching operation and weeds out terms such as retired that do not match the original query. Terms that survive are then looked up in the standard inverted index as usual.
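A 3-gram index with the post-filtering step can be sketched for a trailing-* query like red*; the vocabulary is invented, and Python's fnmatch stands in for the string-matching check:

```python
from collections import defaultdict
import fnmatch

# 3-gram index: each gram of $term$ points to the terms containing it.
vocab = ["red", "retired", "retrieval", "render"]

kgram = defaultdict(set)
for term in vocab:
    padded = "$" + term + "$"
    for i in range(len(padded) - 2):
        kgram[padded[i:i + 3]].add(term)

def wildcard(query):
    """Answer a trailing-* query, e.g. "red*" -> grams $re, red."""
    padded = "$" + query.replace("*", "")
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    candidates = set.intersection(*(kgram[g] for g in grams))
    # post-filtering: weed out terms (e.g. "retired") that contain the
    # grams but do not match the original wildcard query
    return sorted(t for t in candidates if fnmatch.fnmatchcase(t, query))

print(wildcard("red*"))   # "retired" passes the gram test but is weeded out
```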

2, Implementing spelling correction

(1)   Of various alternative correct spellings for a mis-spelled query, choose the “nearest” one. This demands that we have a notion of nearness or proximity between a pair of queries.

     (2) When two correctly spelled queries are tied (or nearly tied), select the one that is more common. The simplest notion of more common is to consider the number of occurrences of the term in the collection.

   A special case: the web. The idea is to use the correction that is most common among queries typed in by other users.

3, Forms of spelling correction

[1] isolated-term correction: to correct a single query term at a time – even when we have a multiple-term query.

 Two techniques for addressing isolated-term correction:

(1)    edit distance

          The edit distance between two character strings s1 and s2 is the minimum number of edit operations required to transform s1 into s2.

       Edit operations :

(i)             insert a character into a string;

(ii)            delete a character from a string and

(iii)           replace a character of a string by another character
cat → dog: edit distance = 3
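The standard dynamic-programming computation of this edit distance, with unit-cost insert, delete, and replace, looks like this:

```python
# Levenshtein edit distance via dynamic programming: dp[i][j] is the
# minimum number of edits turning s1[:i] into s2[:j].

def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                       # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j                       # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # replace / match
    return dp[m][n]

print(edit_distance("cat", "dog"))   # 3
```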
(2)    k-gram overlap

[2] Context-sensitive correction: to enumerate corrections of each of the query terms even when every query term is correctly spelled, and then try substitutions of each correction in the phrase.

Rather than expanding all the alternatives, we retain only the most frequent combinations in the collection or in the query logs, which contain previous queries by users.

4,Phonetic correction

The main idea here is to generate, for each term, a “phonetic hash” so that similar-sounding terms hash to the same value.

Soundex algorithm, with various variants, built on the following scheme:

1. Turn every term to be indexed into a 4-character reduced form. Build an inverted index from these reduced forms to the original terms; call this the soundex index.

2. Do the same with query terms.

3. When the query calls for a soundex match, search this soundex index.
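The simplified letter-to-digit scheme behind step 1 can be sketched as below: retain the first letter, map the remaining letters to digits, collapse consecutive identical digits, drop zeros, and pad to four characters. The example terms are illustrative:

```python
# Simplified Soundex hash: vowels and h/w/y map to 0, consonants to 1-6.
CODES = {}
for letters, digit in [("aeiouhwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(term):
    term = term.lower()
    digits = [CODES.get(ch, "0") for ch in term]
    # collapse consecutive identical digits
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    # keep the first letter, drop zeros from the rest, pad with zeros
    tail = "".join(d for d in collapsed[1:] if d != "0")
    return (term[0].upper() + tail + "000")[:4]

print(soundex("Hermann"))   # H655
```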

Friday, January 10, 2014

IS2140_Muddiest Points_Class1

Date: 01/10/2014


Information retrieval is hard, because users send keywords as queries to search engines and these keywords are usually imprecise. I wonder whether there is any criterion or guidance for users to improve a query's precision and thereby improve the accuracy and efficiency of the search results.

IS2140_Reading notes_Unit 1

Date: 01/10/2014

1, definition of FOA

The process of Finding Out About (FOA) is a series of cognitive activities that helps decision-makers focus on knowledge relevant to the topic of search and operate on large collections of linguistic objects with FOA searching techniques. The semantic meanings of linguistic objects are essential for FOA. We usually draw on others' opinions by communicating with them, reading valuable writings in the library, searching the Internet, and so on. Well-arranged linguistic expressions enable people to understand knowledge as a whole and to express their own opinions more clearly.

2, the process of FOA

The FOA conversion loop includes three phases:
(1) Asking a question;
  Users turn their information need into an external expression of a question, proposing a query.
(2) Constructing an answer;

   A search engine is a computer program that algorithmically performs the task of answering users' questions or queries.

(3) Assessing the answer.

This closes the FOA conversion loop. Making full use of the asker's relevance feedback provides more information about their reaction to each retrieved document.


3, Information Retrieval
IR is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data.

4, IR Applications:
(1) Web Search Engine
Web search engines are based on clusters of computers that work cooperatively to generate a ranked list of Web pages, free of redundant and duplicate pages, to satisfy the user's information need as embodied in the query. The ranking of the Web pages usually abides by the search engine's ranking algorithm, which balances several features, including the content and structure of pages, their relations to other pages, the content and structure of the Web as a whole, and characteristics of the user such as geographic location or past search behavior.

(2)Desktop and File System Search Engines
(3)Document Management and Search Service

(4)Digital Libraries and Other Specialized IR Systems

(5)Other IR Applications Associated with Storage, Manipulation, and Retrieval of Human-Language    Data
For example, news aggregators, e-mail systems, text clustering and categorization systems, summarization systems, information extraction systems, topic detection and tracking systems, expert search systems, question answering systems, and multimedia information retrieval systems.

5, Basic Information Retrieval Systems Architecture

The user constructs a topic based on the information need and issues a query, consisting of terms, to the IR system. A search engine accepts the user's queries and processes them by maintaining and manipulating an inverted index for a document collection, which forms the principal data structure used by the search engine. The search engine sorts relevant documents according to a score, or retrieval status value (RSV), computed for each document. After eliminating redundant and duplicate results, the search engine returns the ranked result list for further processing.

6, IR System Performance Evaluation

There are two aspects: efficiency and effectiveness.
Efficiency may be measured in terms of time and space.
Effectiveness may be measured in terms of relevance, depending on human judgment.


7,IR Problem
There are two crucial problems:
(1) How to exactly extract information from documents;
(2) How to decide relevance, which depends on the user's assessment, when answering a query.