IIR sections 1.3 and 1.4, chapter 6.
1, Zone: a part of a document whose content can be an arbitrary, unbounded amount of text, for instance document titles and abstracts.
2, How to implement the computation of weighted zone scores?
    ZONESCORE(q1, q2)
      float scores[N] = [0]
      constant g[ℓ]
      p1 ← postings(q1)
      p2 ← postings(q2)
      // scores[] is an array with a score entry for each document, initialized to zero.
      // p1 and p2 are initialized to point to the beginning of their respective postings.
      // Assume g[] is initialized to the respective zone weights.
      while p1 ≠ NIL and p2 ≠ NIL
      do if docID(p1) = docID(p2)
           then scores[docID(p1)] ← WEIGHTEDZONE(p1, p2, g)
                p1 ← next(p1)
                p2 ← next(p2)
           else if docID(p1) < docID(p2)
                  then p1 ← next(p1)
                  else p2 ← next(p2)
      return scores
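A minimal Python sketch of the same merge, to make the pointer walk concrete. The postings layout (a list of `(doc_id, zones)` pairs sorted by `doc_id`) and the body of `weighted_zone` are assumptions made for illustration; the notes do not define WEIGHTEDZONE, so here it simply adds up the weights of the zones in which both terms occur.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical postings layout: (doc_id, zones) pairs sorted by doc_id,
# where zones is the set of zone names in which the term occurs.
Posting = Tuple[int, Set[str]]


def weighted_zone(zones1: Set[str], zones2: Set[str], g: Dict[str, float]) -> float:
    """Assumed WEIGHTEDZONE: sum the weights of zones that contain both terms."""
    return sum(w for zone, w in g.items() if zone in zones1 and zone in zones2)


def zone_score(p1: List[Posting], p2: List[Posting],
               g: Dict[str, float], n_docs: int) -> List[float]:
    scores = [0.0] * n_docs  # one entry per document, initialized to zero
    i = j = 0                # cursors at the start of each postings list
    while i < len(p1) and j < len(p2):
        doc1, zones1 = p1[i]
        doc2, zones2 = p2[j]
        if doc1 == doc2:
            # Both terms occur in this document: score it, advance both cursors.
            scores[doc1] = weighted_zone(zones1, zones2, g)
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return scores


# Illustrative zone weights and postings (made-up data).
g = {"title": 0.6, "body": 0.4}
p1 = [(0, {"title", "body"}), (2, {"body"}), (3, {"title"})]
p2 = [(0, {"body"}), (1, {"title"}), (3, {"title", "body"})]
scores = zone_score(p1, p2, g, n_docs=4)
```

Only documents 0 and 3 appear in both postings lists, so only they receive a nonzero score; documents 1 and 2 are skipped by the merge.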
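3, So far, scoring has hinged on whether or not a query term is present in a zone within a document.
For each training example Φ_j we have Boolean values s_T(d_j, q_j) and s_B(d_j, q_j), indicating whether the query q_j matches the title zone and the body zone of document d_j, that we use to compute a score from
score(d_j, q_j) = g · s_T(d_j, q_j) + (1 − g) · s_B(d_j, q_j)
where g is a weight between 0 and 1.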
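As a quick worked example of the score formula above, the sketch below evaluates it for all four combinations of the two Boolean values. The function name `weighted_zone_score` and the value g = 0.75 are assumptions chosen for illustration only.

```python
def weighted_zone_score(s_t: bool, s_b: bool, g: float) -> float:
    """score = g * s_T + (1 - g) * s_B, with s_T and s_B the Boolean zone matches."""
    return g * int(s_t) + (1 - g) * int(s_b)


g = 0.75  # illustrative weight, not a learned value
table = {
    (s_t, s_b): weighted_zone_score(s_t, s_b, g)
    for s_t in (False, True)
    for s_b in (False, True)
}
```

With this g, a match in the first zone only scores 0.75, a match in the second zone only scores 0.25, a match in both scores 1.0, and no match scores 0.0.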
4, A document or zone that mentions a query term more often has more to do with that query and therefore should receive a higher score.
We therefore assign to each term in a document a weight that depends on the number of occurrences of the term in the document, and compute a score between a query term t and a document d based on the weight of t in d.
Term frequency: the simplest approach is to set the weight equal to the number of occurrences of term t in document d, written tf_{t,d}.
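Term frequency and inverse document frequency: the weighting scheme assigns to term t a weight in document d given by
tf-idf_{t,d} = tf_{t,d} × idf_t
where idf_t = log(N / df_t), N is the number of documents in the collection, and df_t is the number of documents that contain t.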
6, tf-idf_{t,d} assigns to term t a weight in document d that is
(1) highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
(2) lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
(3) lowest when the term occurs in virtually all documents.
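The three cases can be checked on a toy collection. This is a sketch under stated assumptions: the three documents are made up, idf is taken as log10(N / df_t) (the base of the logarithm is a choice and only rescales the weights), and the helper names `idf` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Made-up toy collection: document id -> list of tokens.
docs = {
    "d1": "the car insurance insurance".split(),
    "d2": "the car the road".split(),
    "d3": "the best road".split(),
}

N = len(docs)
# tf[doc_id][term] = number of occurrences of term in that document.
tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
# df[term] = number of documents that contain term.
df = Counter(term for tokens in docs.values() for term in set(tokens))


def idf(term: str) -> float:
    """Assumed definition: log10(N / df_t); 0.0 for terms not in the collection."""
    return math.log10(N / df[term]) if df[term] else 0.0


def tf_idf(term: str, doc_id: str) -> float:
    return tf[doc_id][term] * idf(term)


rare_and_frequent = tf_idf("insurance", "d1")  # twice in d1, in 1 of 3 documents
less_rare = tf_idf("car", "d1")                # once in d1, in 2 of 3 documents
everywhere = tf_idf("the", "d2")               # twice in d2, but in every document
```

Here `rare_and_frequent` is the largest weight, `less_rare` is smaller, and `everywhere` is exactly zero because idf is log10(3/3) = 0 for a term that occurs in all documents.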
7, Overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d.
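To see what the refinement changes, the sketch below computes both versions of the score for one query. The toy documents, the function names, and the choice idf_t = log10(N / df_t) are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

# Made-up toy collection: document id -> list of tokens.
docs = {
    "d1": "the car insurance insurance".split(),
    "d2": "the car the road".split(),
    "d3": "the best road".split(),
}


def overlap_scores(query: List[str], docs: Dict[str, List[str]]) -> Dict[str, int]:
    """Sum of raw occurrence counts of the query terms in each document."""
    return {d: sum(Counter(tokens)[t] for t in query) for d, tokens in docs.items()}


def tf_idf_scores(query: List[str], docs: Dict[str, List[str]]) -> Dict[str, float]:
    """Sum of tf-idf weights of the query terms, with idf_t = log10(N / df_t)."""
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    scores = {}
    for d, tokens in docs.items():
        tf = Counter(tokens)
        # Query terms absent from the collection (df == 0) contribute nothing.
        scores[d] = sum(tf[t] * math.log10(n / df[t]) for t in query if df[t])
    return scores


query = ["the", "road"]
raw = overlap_scores(query, docs)
weighted = tf_idf_scores(query, docs)
```

With raw counts, d2 outranks d3 only because it repeats "the". With tf-idf weights, "the" contributes nothing, so d2 and d3 tie on their single occurrence of "road" and d1 scores zero.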
8, Document vector: a vector, with one component per dictionary term, that captures the relative importance of the terms in a document.
Vector space model: the representation of a set of documents as vectors in a common vector space.
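A small sketch of what "a common vector space" means in practice: every document is mapped to a vector with one component per dictionary term, in the same fixed term order. The toy documents, the use of tf-idf with idf_t = log10(N / df_t) as the component weight, and the name `document_vector` are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List

# Made-up toy collection: document id -> list of tokens.
docs = {
    "d1": "the car insurance insurance".split(),
    "d2": "the car the road".split(),
    "d3": "the best road".split(),
}

# The dictionary fixes the axes shared by all document vectors.
vocabulary = sorted({t for tokens in docs.values() for t in tokens})
n = len(docs)
df = Counter(t for tokens in docs.values() for t in set(tokens))


def document_vector(tokens: List[str]) -> List[float]:
    """One tf-idf component per dictionary term; 0.0 for terms not in the document."""
    tf = Counter(tokens)
    return [tf[t] * math.log10(n / df[t]) for t in vocabulary]


vectors = {doc_id: document_vector(tokens) for doc_id, tokens in docs.items()}
```

All three vectors have the same length as `vocabulary`, so components at the same index refer to the same term and the vectors can be compared directly.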