Thursday, April 17, 2014
Thursday, April 10, 2014
IS2140_Muddiest Points_Class12
Class 12
Integrating users’ clickthrough data is often the most effective approach. Does clickthrough data mean the total number of clicks on one specific link that is relevant to the query, or the clicks on all relevant links?
Thursday, April 3, 2014
IS2140_Reading notes_Unit 12
1, Personal email sorting
A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder.
2, Topic-specific or vertical search
Vertical search engines restrict searches to a particular topic. For example, the query computer science on a vertical search engine for the topic China will return a list of Chinese computer science departments with higher precision and recall than the query computer science China on a general-purpose search engine.
3, Labeling
Labeling refers to the process of annotating each document with its class, but labeling is arguably an easier task than writing rules.
4, Mutual information
Mutual information measures how much information – in the information-theoretic sense – a term contains about the class. If a term’s distribution is the same in the class as it is in the collection as a whole, then I(U; C) = 0. MI reaches its maximum value if the term is a perfect indicator for class membership, that is, if the term is present in a document if and only if the document is in the class.
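To make the two limiting cases above concrete, here is a small sketch of the mutual-information computation from the term/class contingency table (the four counts are hypothetical, not from the reading):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between term occurrence and class membership, computed from
    the four cells of the term/class contingency table:
    n11 = docs in the class containing the term, n10 = docs outside the
    class containing the term, n01/n00 = the same for absent terms."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),   # term present, in class
        (n10, n11 + n10, n10 + n00),   # term present, not in class
        (n01, n01 + n00, n11 + n01),   # term absent, in class
        (n00, n01 + n00, n10 + n00),   # term absent, not in class
    ]:
        if n_tc > 0:                   # 0 * log(...) contributes 0
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi
```

With equal counts in every cell the term is distributed identically in the class and the collection, so MI is 0; a perfect indicator (term present exactly in the class documents) gives the maximum.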
IS2140_Muddiest Points_Class11
In query translation, the searcher expresses queries in the document language. I wonder whether the web pages that rank high will be translated into the searcher's language or just stay in the document language.
Thursday, March 27, 2014
IS2140_Reading notes_Unit 11
1, Parallel Query Processing
There are many ways in which parallelism can
help a search engine process queries faster. The
two most popular approaches are index partitioning and replication.
(1) Inter-query parallelism
Multiple queries can be processed in parallel
but each individual query is processed sequentially. By creating n replicas of the index and assigning each replica to a separate node, we
can realize an n-fold increase of the
search engine’s service rate (its theoretical throughput) without affecting the
time required to process a single query.
(2) Intra-query parallelism
The index is split into n parts, and each node works only on its own small part of the index. Each query is processed by multiple servers in parallel.
2, Index partitioning schemes
The two predominant index partitioning
schemes are document partitioning and term partitioning. In a
document-partitioned index, each node holds an index
for a subset of the documents in the
collection. In a term-partitioned index, each node is responsible for a subset
of the terms in the collection.
3, Document Partitioning
In a document-partitioned index, each index
server is responsible for a subset of the documents
in the collection. Each incoming user query
is received by a frontend server, the receptionist,
which forwards it to all n index nodes, waits for them
to process the query, merges the search
results received from the index nodes, and
sends the final list of results to the user.
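The receptionist's forward-then-merge flow can be sketched with toy in-memory "nodes" (the scoring by term overlap and the data are made up for illustration; a real engine would run real ranking on each node):

```python
def make_node(doc_subset):
    """An index node holding only its own document subset: a toy index
    mapping doc id -> set of terms, scored here by query-term overlap."""
    def process(query_terms):
        return [(doc, len(query_terms & terms))
                for doc, terms in doc_subset.items()
                if query_terms & terms]
    return process

def receptionist(nodes, query_terms, k=3):
    """Forward the query to all n index nodes, merge their partial
    result lists, and return the final top-k list for the user."""
    results = []
    for node in nodes:
        results.extend(node(query_terms))
    results.sort(key=lambda r: (-r[1], r[0]))   # merge by score
    return results[:k]

nodes = [
    make_node({1: {"parallel", "index"}, 2: {"query"}}),
    make_node({3: {"index", "query"}, 4: {"cache"}}),
]
top = receptionist(nodes, {"index", "query"})
```

Each node touches only its own subset, so the per-node work shrinks as nodes are added, while the receptionist's merge stays cheap.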
4, Term Partitioning
Term partitioning addresses the disk seek
problem by splitting the collection into sets of terms
instead of sets of documents. Each index
node vi is responsible for a
certain term set Ti and
is involved in processing a given query only
if one or more query terms are members of Ti.
5, Replication
We can maintain multiple copies of the same
index node, as described in the previous section on hybrid partitioning
schemes, and have them process queries in parallel. If one of the r replicas for a given index
node fails, the remaining r−1 will take over the load of the missing replica.
6, Partial Replication.
Instead of replicating the entire index r times, we may choose to
replicate index information only for important documents. The rationale behind
this strategy is that most search results aren’t vital and can easily be
replaced by an equally relevant document.
7, Dormant Replication.
Suppose the search engine comprises a total
of n nodes. We can divide the
index found on each node vi into n − 1
fragments and distribute them evenly among the n−1 remaining nodes, but
leave them dormant (on disk) and not use them for query processing. Only when vi fails will the
corresponding n−1
fragments be activated inside the remaining nodes and loaded into memory for
query processing.
8, MapReduce
MapReduce was inspired by the map and reduce functions found in
functional programming
languages, such as Lisp. The map function takes as its arguments a function f and a list of elements l = ⟨l1, l2, . . . , ln⟩. It returns a new list map(f, l) = ⟨f(l1), f(l2), . . . , f(ln)⟩.
The reduce function (also known as fold or accumulate) takes a function g and a list of elements l = ⟨l1, l2, . . . , ln⟩. It returns a new element l′, such that l′ = reduce(g, l) = g(l1, g(l2, g(l3, . . .))).
MapReduce computations are highly parallelizable,
because both map and reduce can be executed in
parallel on many different machines.
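The two building blocks above exist directly in Python, which makes the definitions easy to try (toy data; note that functools.reduce is a left fold, which matches the nested g(l1, g(l2, …)) form whenever g is associative, as with addition here):

```python
from functools import reduce

# map applies f to each element; reduce folds the list with g.
l = [1, 2, 3, 4]
mapped = list(map(lambda x: x * x, l))       # ⟨f(l1), ..., f(ln)⟩
total = reduce(lambda a, b: a + b, mapped)   # g(l1, g(l2, g(l3, l4)))
```

Because each f(li) is independent, the map phase can be sharded across machines; the reduce phase can likewise be parallelized per key in the MapReduce framework.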
Thursday, March 20, 2014
IS2140_Reading notes_Unit 10
1, Power law:
In a power law, the total number of web pages with in-degree i is proportional to 1/i^α. The value of α typically reported by studies is 2.1.
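A quick numeric check of what α = 2.1 implies (toy truncation at 10,000, not from the reading): normalizing 1/i^α shows how heavily the probability mass concentrates at the smallest in-degrees.

```python
# Proportion of pages with each in-degree under a power law with
# exponent alpha = 2.1, truncated to in-degrees 1..10000.
alpha = 2.1
weights = [i ** -alpha for i in range(1, 10001)]
z = sum(weights)                      # normalizing constant
frac_degree_1 = weights[0] / z        # share of pages with in-degree 1
```

Under this law well over half of all pages have in-degree 1, which is why the in-degree distribution is described as heavy-tailed.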
2, Spam
First-generation spam in the context of web search is the manipulation of web page content for the purpose of appearing high up in search results for selected keywords. To avoid irritating users with these repetitions, sophisticated spammers resorted to tricks such as rendering the repeated terms in the same color as the background. More complex spamming techniques involve manipulation of the metadata related to a page, including the links into a web page.
3, Search Engine Optimizers
SEOs provide
consultancy services for clients who seek to have their web pages rank highly
on selected keywords.
4, Cost Per Mille (CPM) basis: the cost to the company of having its banner advertisement displayed 1000 times. An advertisement (1) is priced by the number of times it is displayed (impressions); or (2) is priced by the number of times it is clicked on by the user (cost-per-click model).
5, Several aspects of Goto’s model are worth highlighting.
(1) A user typing the query q into
Goto’s search interface was actively expressing an interest and intent related
to the query q.
(2) Goto only got compensated when
a user actually expressed interest in an advertisement – as evinced by the user
clicking the advertisement.
6, Search Engine Marketing (SEM):
For advertisers, understanding how search engines do this ranking and how to allocate marketing campaign budgets to different keywords and to different sponsored search engines has become a profession known as search engine marketing (SEM).
7, How do search engines differentiate themselves and grow their traffic? Google identified two principles that helped it grow at the expense of its competitors:
(1) a focus on relevance, specifically precision rather than recall in the first few results;
(2) a user experience that is lightweight, meaning that both the search query page and the search results page are uncluttered and almost entirely textual, with very few graphical elements.
8, Three categories of web search queries:
(1) Informational queries seek general information on a broad topic.
(2) Navigational queries seek the website or home page of a single entity that the user has in mind, say Lufthansa airlines.
(3) A transactional query is one that is a prelude to the user performing a transaction on the Web – such as purchasing a product, downloading a file, or making a reservation.
9, Teleport operation:
In the teleport operation the surfer jumps
from a node to any other node in the web graph.
10, A Markov chain is a discrete-time stochastic process: a process that occurs in a series of time-steps, in each of which a random choice is made. A Markov chain consists of N states. Each web page will correspond to a state in the Markov chain we will formulate.
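Tying points 9 and 10 together, a random surfer on a tiny made-up 3-page graph (teleport probability 0.15, a common illustrative value) can be simulated directly; the fraction of visits to each page approximates that page's stationary probability in the Markov chain.

```python
import random

# Toy web graph: page -> list of pages it links to (made-up example).
links = {0: [1], 1: [0, 2], 2: [0]}
random.seed(7)

def step(page, teleport=0.15):
    """One move of the surfer: teleport to any node with prob. 0.15,
    otherwise follow a random out-link of the current page."""
    if random.random() < teleport or not links[page]:
        return random.randrange(len(links))
    return random.choice(links[page])

visits = [0, 0, 0]
page = 0
for _ in range(100000):
    page = step(page)
    visits[page] += 1
```

Page 2 has only one incoming link here, so it accumulates the fewest visits; page 0, linked from both other pages, accumulates the most probability mass.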
Thursday, February 27, 2014
IS2140_Reading notes_Unit 8
1, Information visualization
Information visualization attempts to provide visual
depictions of very large information spaces. The main information
visualization techniques include brushing and linking, panning and zooming, focus-plus-context,
magic lenses, and the use of animation to retain context and help make occluded
information visible.
2, Tools of computer interface design: windows, menus,
icons, dialog boxes, and so on.
3, Brushing and linking refers to the connecting of two or
more views of the same data, such that a change to the representation in one
view affects the representation in the other views as well.
Panning and zooming refers to the actions of a movie camera
that can scan sideways across a scene (panning) or move in for a close up or
back away to get a wider view (zooming).
Magic lenses are directly manipulable transparent windows
that, when overlapped on some other data type, cause a transformation to be
applied to the underlying data, thus changing its appearance.
4, Important differences for information access interfaces
include relative spatial ability and memory, reasoning abilities, verbal
aptitude, and (potentially) personality differences.
5, Models of Interaction
Most accounts of the information access process assume an
interaction cycle consisting of query specification, receipt and examination of
retrieval results, and then either stopping or reformulating the query and
repeating the process until a perfect result set is found.
Thursday, February 20, 2014
IS2140_Reading notes_Unit 7
IIR chapter 9
1. Relevance feedback (RF) involves the user in the retrieval process so as to improve the final result set.
The basic procedure is:
• The user issues a (short, simple) query.
• The system returns an initial set of retrieval
results.
• The user marks some returned documents as relevant
or non-relevant.
• The system computes a better representation of the
information need based on the user feedback.
• The system displays a revised set of retrieval
results.
2. The Rocchio Algorithm
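My notes stopped at the header, so here is a hedged sketch of the idea: Rocchio forms the modified query as α times the original query vector, plus β times the centroid of the known relevant documents, minus γ times the centroid of the known non-relevant ones (the toy vectors and the default weights below are illustrative, not from my notes).

```python
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q and the documents are equal-length term-weight vectors."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(q)
        return [sum(col) / len(docs) for col in zip(*docs)]
    r, nr = centroid(relevant), centroid(nonrelevant)
    # Negative term weights are not useful for retrieval, so clip at 0.
    return [max(0.0, alpha * qi + beta * ri - gamma * ni)
            for qi, ri, ni in zip(q, r, nr)]
```

The update drags the query toward the relevant region of vector space; the clipping reflects that positive feedback (β) is usually weighted more heavily than negative feedback (γ).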
3. Cases where relevance feedback alone is not
sufficient include:
Misspellings.
Cross-language information retrieval.
Mismatch of searcher’s vocabulary versus collection
vocabulary.
4. Pseudo relevance feedback, also known as blind relevance feedback, provides a method for automatic local analysis. It automates the manual part of relevance feedback, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, then assume that the top k ranked documents are relevant, and finally do relevance feedback as before under this assumption.
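The retrieve / assume-top-k-relevant / expand loop can be sketched end to end (the documents, k, and the frequency-based expansion are toy choices for illustration; a real system would feed the top-k documents into something like Rocchio):

```python
from collections import Counter

# Made-up toy collection: each document is a list of terms.
docs = [["ir", "ranking", "bm25"],
        ["ir", "ranking", "lm"],
        ["cooking", "pasta"]]

def retrieve(query, collection):
    """Normal retrieval: score documents by query-term overlap."""
    scored = [(d, len(set(query) & set(d))) for d in collection]
    scored.sort(key=lambda x: -x[1])
    return [d for d, s in scored if s > 0]

def prf_expand(query, collection, k=2, n_terms=2):
    """Assume the top k results are relevant and expand the query
    with the most frequent new terms from those documents."""
    top_k = retrieve(query, collection)[:k]
    counts = Counter(t for d in top_k for t in d if t not in query)
    return list(query) + [t for t, _ in counts.most_common(n_terms)]

expanded = prf_expand(["ir"], docs)
```

The user never marks anything; the top-ranked documents stand in for explicit relevance judgments, which is exactly where pseudo-RF can go wrong when the initial results are off-topic.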
5. Implicit feedback
While users are often reluctant to provide explicit feedback, it is easy to collect implicit feedback in large quantities for a high-volume system, such as a web search engine.
6. Three global methods for expanding a query:
(1) by simply aiding the user in doing so,
(2) by using a manual thesaurus,
(3) by building a thesaurus automatically.
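Method (2) is simple enough to sketch with a tiny hand-built thesaurus (the entries below are made up for illustration): each query term is expanded with its listed synonyms.

```python
# Toy manual thesaurus: term -> synonyms (made-up entries).
thesaurus = {"car": ["automobile", "vehicle"], "cheap": ["inexpensive"]}

def expand(query_terms):
    """Append each query term's thesaurus synonyms, skipping duplicates."""
    out = list(query_terms)
    for t in query_terms:
        out += [s for s in thesaurus.get(t, []) if s not in out]
    return out
```

Expansion of this kind tends to raise recall at some cost in precision, since synonyms may shift the query's meaning.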
IS2140_Muddiest Points_Class6
Assume there are 100 documents in the collection: 20 documents contain the term apple (fruit), 20 documents contain the term apple (company), and 20 documents contain the term iPad, and no document contains apple and iPad at the same time. A user searches for the term apple (company). Since the iPad is a product of the Apple company, the 20 documents containing the term iPad are relevant to the query, so S equals 40 and s equals 20, right?
Thursday, February 13, 2014
IS2140_Reading notes_Unit 6
IIR
chapter 8.
Chapter
8 Evaluation in information retrieval
1, Test Collection
(1) A document collection
(2) A test suite of information needs, expressible as queries
(3) A set of relevance judgments, standardly a binary assessment of either relevant or non-relevant for each query-document pair.
Relevance is assessed relative to an
information need.
2, Standard test collections
(1) Cranfield collection : allowing precise quantitative
measures of information retrieval effectiveness.
(2) Text Retrieval Conference(TREC)
(3) NII Test Collections for IR System(NTCIR): has built
various test collections of similar sizes to the TREC collections, focusing on
East Asian language and cross-language information retrieval.
(4) Cross Language Evaluation Forum(CLEF): concentrating on
European languages and cross-language information retrieval.
(5) Reuters-21578 and Reuters-RCV1
(6) 20 Newsgroups
3, Evaluation of unranked retrieval sets
(1) Precision: the fraction of retrieved documents that are relevant
(2) Recall: the fraction of relevant documents that are retrieved
P = tp/(tp + fp)
R = tp/(tp + fn)
accuracy = (tp + tn)/(tp + fp + fn + tn)
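The three formulas are direct to code; the counts below (tp = 8, fp = 2, fn = 4, tn = 86) are made up to show why accuracy can look flattering when non-relevant documents dominate:

```python
def precision(tp, fp):
    return tp / (tp + fp)           # P = tp/(tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)           # R = tp/(tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fp + fn + tn)
```

With these toy counts, accuracy is 0.94 even though recall is only 2/3, because the many true negatives inflate it; this is why IR evaluation prefers precision and recall.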
4, ROC Curve
An ROC curve plots the true positive rate, or sensitivity, against the false positive rate, or (1 − specificity).
Sensitivity is just another term for recall.
The false positive rate = fp/(fp + tn).
Specificity = tn/(fp + tn).
5, Kappa statistic
The kappa statistic is a common measure of agreement between judges; it is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.
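For two judges making binary relevance judgments, kappa = (P(A) − P(E)) / (1 − P(E)), where P(A) is the observed agreement rate and P(E) is the chance agreement rate from the judges' pooled marginals. A sketch from the 2x2 count table (the example counts below are illustrative):

```python
def kappa(n11, n10, n01, n00):
    """Two-judge kappa from counts: n11 = both say relevant,
    n00 = both say non-relevant, n10/n01 = the two disagreement cells.
    Chance agreement is computed from pooled marginal proportions."""
    n = n11 + n10 + n01 + n00
    p_agree = (n11 + n00) / n
    # Pooled proportion of "relevant" judgments across both judges.
    p_rel = (2 * n11 + n10 + n01) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)
```

Perfect agreement gives kappa = 1, and agreement no better than chance gives kappa = 0, which is what the correction term is for.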
Thursday, February 6, 2014
IS2140_Muddiest Points_Class4
I'm confused about how to draw the d1, d2, d3 and q lines on slide 51.
IS2140_Reading notes_Unit 5
IIR chapters 11 and 12
Chapter 11
1, Probability basics
Chain rule: P(A, B) = P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Partition rule: P(B) = P(A, B) + P(Ā, B)
2, The Probability Ranking Principle
(1) The 1/0 loss case
Probability Ranking Principle (PRP):
For
a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given
query q. That is, it takes on a
value of 1 when the document is relevant and 0 otherwise. Using a probabilistic
model, the obvious order in which to present documents to the user is to rank
documents by their estimated probability of relevance with respect to the
information need: P(R = 1|d, q).
1/0 loss: a binary situation where you are evaluated on your accuracy.
d is relevant iff P(R = 1|d, q) > P(R = 0|d, q) (Eq. 11.6)
(2) The PRP with retrieval cost
C0 · P(R = 0|d) − C1 · P(R = 1|d) ≤ C0 · P(R = 0|d′) − C1 · P(R = 1|d′)
C1: the cost of not retrieving a relevant document.
C0: the cost of retrieving a non-relevant document.
The Probability Ranking Principle then says that if this inequality holds for a specific document d and for all documents d′ not yet retrieved, then d is the next document to be retrieved.
3, The Binary Independence Model
The Binary Independence Model (BIM) we present in this section is the model that has traditionally been used with the PRP. It introduces some simple assumptions, which make estimating the probability function P(R|d, q) practical.
A document d is represented by the vector x = (x1, . . . , xM), where xt = 1 if term t is present in document d and xt = 0 if t is not present in d.
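The binary representation is easy to illustrate (the vocabulary and document below are made up): each document becomes a 0/1 vector over the vocabulary, and repeated occurrences of a term make no difference.

```python
# Toy BIM-style representation: x_t = 1 iff term t occurs in the doc.
vocab = ["information", "retrieval", "model", "probability"]

def to_binary_vector(doc_terms):
    terms = set(doc_terms)                 # multiplicity is discarded
    return [1 if t in terms else 0 for t in vocab]

x = to_binary_vector(["probability", "retrieval", "retrieval"])
```

Discarding term frequency is one of the "simple assumptions" the BIM makes to keep P(R|d, q) estimable.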
Chapter 12
1, Finite automata and language models
The finite automaton can generate strings that include the examples. The full set of strings that can be generated is called the language of the automaton.
2, The query likelihood model
In the query likelihood model, we construct from each document d in the collection a language model Md. We then rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query.
P(d|q) = P(q|d)P(d)/P(q)
The language modeling approach thus attempts to model the query generation process: documents are ranked by the probability that a query would be observed as a random sample from the respective document model. The most common choice is the multinomial unigram language model, which is equivalent to a multinomial Naive Bayes model.
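A minimal sketch of query-likelihood ranking with a smoothed unigram model (the two documents, the query, and the interpolation weight λ = 0.5 are toy choices; smoothing with the collection model avoids zero probabilities for unseen terms):

```python
import math

# Made-up two-document collection.
docs = {
    "d1": ["click", "shears", "shears"],
    "d2": ["click", "metal"],
}
coll = [t for terms in docs.values() for t in terms]

def p_term(t, terms, lam=0.5):
    """P(t | Md) by linear interpolation of the document model
    with the collection model (Jelinek-Mercer smoothing)."""
    p_doc = terms.count(t) / len(terms)
    p_coll = coll.count(t) / len(coll)
    return lam * p_doc + (1 - lam) * p_coll

def score(query, terms):
    """log P(q | Md) under the multinomial unigram model."""
    return sum(math.log(p_term(t, terms)) for t in query)

ranking = sorted(docs, key=lambda d: score(["shears", "click"], docs[d]),
                 reverse=True)
```

Documents under whose model the query is the more probable random sample rank first; here d1 wins because it actually contains "shears".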