IS2140_Ruiting Yi: IS2140_Reading notes

IIR sections 1.3 and 1.4, chapter 6.

1, Zone: an arbitrary, unbounded amount of text within a document, for instance the title or the abstract.

2, How to implement the computation of weighted zone scores?

The algorithm handles the case where the query q is a two-term query consisting of query terms q1 and q2, and the Boolean function is AND: the score for a zone is 1 if both query terms are present in that zone and 0 otherwise.

ZONESCORE(q1, q2)
    float scores[N] = [0]
    constant g[ℓ]
    p1 ← postings(q1)
    p2 ← postings(q2)
    // scores[] is an array with a score entry for each document, initialized to zero.
    // p1 and p2 are initialized to point to the beginning of their respective postings.
    // Assume g[] is initialized to the respective zone weights.
    while p1 ≠ NIL and p2 ≠ NIL
    do if docID(p1) = docID(p2)
          then scores[docID(p1)] ← WEIGHTEDZONE(p1, p2, g)
               p1 ← next(p1)
               p2 ← next(p2)
          else if docID(p1) < docID(p2)
                  then p1 ← next(p1)
                  else p2 ← next(p2)
    return scores
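The merge above can be sketched in Python. This is only an illustration under assumptions that are not in the notes: each postings list is a docID-sorted list of (doc_id, zones) pairs, where zones is the set of zone names containing the term, and g is a dict from zone name to weight. The helper weighted_zone is a hypothetical stand-in for WEIGHTEDZONE.

```python
def weighted_zone(zones1, zones2, g):
    """Sum the weights of the zones that contain both query terms (Boolean AND)."""
    return sum(weight for zone, weight in g.items()
               if zone in zones1 and zone in zones2)


def zone_score(postings1, postings2, g, num_docs):
    """Walk two docID-sorted postings lists in step and score shared documents."""
    scores = [0.0] * num_docs  # one entry per document, initialized to zero
    i = j = 0
    while i < len(postings1) and j < len(postings2):
        doc1, zones1 = postings1[i]
        doc2, zones2 = postings2[j]
        if doc1 == doc2:
            scores[doc1] = weighted_zone(zones1, zones2, g)
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return scores


# Toy data (assumed for illustration): two zones and four documents.
g = {"title": 0.25, "body": 0.75}
p1 = [(0, {"title", "body"}), (2, {"body"}), (3, {"title"})]
p2 = [(0, {"body"}), (1, {"title"}), (3, {"title", "body"})]
scores = zone_score(p1, p2, g, num_docs=4)
```

Only documents 0 and 3 appear in both postings lists, so only they can receive a nonzero score; documents 1 and 2 keep their initial score of zero.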

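3, So far, scoring has hinged on whether or not a query term is present in a zone within a document.

For each training example $\Phi_j$ we have Boolean values $s_T(d_j, q_j)$ and $s_B(d_j, q_j)$, indicating whether the query $q_j$ matches the title zone and the body zone of document $d_j$. With a weight $g$ between 0 and 1, we use them to compute a score from

$\mathrm{score}(d_j, q_j) = g \cdot s_T(d_j, q_j) + (1 - g) \cdot s_B(d_j, q_j)$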
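As a rough illustration of how this score could be used to pick g, the sketch below evaluates the formula on a handful of made-up training examples and keeps the candidate g with the smallest total squared error against 0/1 relevance judgments. The training data, the squared-error objective, the grid of candidate values, and the function names are all assumptions made for this example.

```python
def weighted_score(g, s_title, s_body):
    """score = g * s_T + (1 - g) * s_B for Boolean (0/1) zone matches."""
    return g * s_title + (1 - g) * s_body


def total_error(g, examples):
    """Sum of squared differences between relevance and score over all examples."""
    return sum((relevance - weighted_score(g, s_title, s_body)) ** 2
               for s_title, s_body, relevance in examples)


# Each hypothetical example is (s_T, s_B, relevance), all 0 or 1.
examples = [
    (1, 1, 1),
    (0, 1, 1),
    (0, 1, 0),
    (1, 0, 1),
    (0, 0, 0),
]

# Try g = 0.0, 0.1, ..., 1.0 and keep the value with the lowest total error.
candidates = [k / 10 for k in range(11)]
best_g = min(candidates, key=lambda g: total_error(g, examples))
```

A grid search is used here only because it is easy to read; any one-dimensional minimization over g would serve the same purpose.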

4, A document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score.

To do this, assign to each term in a document a weight that depends on the number of occurrences of the term in that document. The score between a query term t and a document d is then based on the weight of t in d.

5, Frequency-based weights.

Term frequency $\mathrm{tf}_{t,d}$: the simplest approach is to set the weight equal to the number of occurrences of term t in document d.

Document frequency $\mathrm{df}_t$: the number of documents in the collection that contain term t.

Inverse document frequency (idf) of a term t: $\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$, where N is the total number of documents in the collection.

Tf-idf: the weighting scheme assigns to term t a weight in document d given by $\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$.
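The four quantities above can be computed on a toy collection as follows. This is a minimal sketch: the three documents are invented, tokenization is a plain whitespace split, and base-10 logarithms are an assumption, since the notes only say "log".

```python
import math
from collections import Counter

# Hypothetical three-document collection, already tokenized.
docs = {
    "d1": "car insurance auto insurance".split(),
    "d2": "auto car best car".split(),
    "d3": "best auto".split(),
}
N = len(docs)

# tf[d][t]: number of occurrences of term t in document d.
tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

# df[t]: number of documents that contain term t at least once.
df = Counter(term for tokens in docs.values() for term in set(tokens))

# idf[t] = log10(N / df[t]); the base of the logarithm is assumed here.
idf = {term: math.log10(N / df_t) for term, df_t in df.items()}


def tf_idf(term, doc_id):
    """tf-idf weight of a term in a document (0 if the term is unseen)."""
    return tf[doc_id][term] * idf.get(term, 0.0)
```

In this collection "auto" occurs in every document, so its idf and all of its tf-idf weights are zero, while "insurance" occurs twice in a single document and gets the largest weight there.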

6, The $\text{tf-idf}_{t,d}$ weighting assigns to term t a weight in document d that is

(1). highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);

(2). lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);

(3). lowest when the term occurs in virtually all documents.

7, Overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d.
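The refined overlap score, a sum of tf-idf weights over the query terms, might look like this in Python. The collection, the query, the base-10 logarithm, and the helper names build_index and score are assumptions for illustration.

```python
import math
from collections import Counter


def build_index(docs):
    """Return per-document term counts and per-term idf for a tokenized collection."""
    n = len(docs)
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log10(n / df_t) for term, df_t in df.items()}
    return tf, idf


def score(query, doc_id, tf, idf):
    """Sum the tf-idf weight in the document of every query term."""
    return sum(tf[doc_id][term] * idf.get(term, 0.0) for term in query)


# Hypothetical collection and query.
docs = {
    "d1": ["car", "insurance", "auto", "insurance"],
    "d2": ["auto", "car", "best", "car"],
    "d3": ["best", "auto"],
}
tf, idf = build_index(docs)
query = ["car", "insurance"]

# Rank documents by decreasing score for the query.
ranking = sorted(docs, key=lambda d: score(query, d, tf, idf), reverse=True)
```

Query terms that never occur in a document contribute zero, so a document sharing no terms with the query scores zero and sorts last.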

8, Document vector: a vector with one component per term, whose weights capture the relative importance of the terms in the document.

Vector space model: the representation of a set of documents as vectors in a common vector space.
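To make the idea concrete, the sketch below maps each document of a toy collection to a vector over a shared, sorted vocabulary. Raw term counts are used as the component weights purely for simplicity; tf-idf weights could be substituted. The collection and the helper to_vector are assumptions for this example.

```python
from collections import Counter

# Hypothetical tokenized collection.
docs = {
    "d1": ["car", "insurance", "auto", "insurance"],
    "d2": ["auto", "car", "best", "car"],
    "d3": ["best", "auto"],
}

# One axis per distinct term; sorting fixes the order of the components.
vocabulary = sorted({term for tokens in docs.values() for term in tokens})


def to_vector(tokens, vocabulary):
    """Represent a document as term counts along each vocabulary axis."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vectors = {doc_id: to_vector(tokens, vocabulary) for doc_id, tokens in docs.items()}
```

Because every vector uses the same vocabulary order, all documents live in the same space and their components can be compared position by position.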

Thursday, January 30, 2014

IS2140_Reading notes_Unit 4