IS2140_Ruiting Yi
Thursday, April 17, 2014
Thursday, April 10, 2014
IS2140_Muddiest Points_Class12
Class 12
Integrating users’ clickthrough
data is often the most effective approach. Does the clickthrough data mean the
number of total clicks on one specific link which is relevant to the query or
all relevant links?
Thursday, April 3, 2014
IS2140_Reading notes_Unit 12
1, Personal
email sorting
2, Topic-specific or vertical search
3, labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules.
4,
Mutual information measures
how much information – in the information theoretic sense – a term contains
about the class. If a term’s distribution is the same in the class as it is in
the collection as a whole, then I(U;
C) = 0. MI reaches
its maximum value if the term is a perfect indicator for class membership, that
is, if the term is present in a document if and only if the document is in the
class.
A user may have folders like talk
announcements, electronic
bills, email from
family and friends, and so on, and may want a classifier to
classify each incoming email and automatically move it to the appropriate
folder.
2, Topic-specific or vertical search
Vertical
search engines restrict searches to
a particular topic. For example, the query computer science on
a vertical search engine for the topic China will
return a list of Chinese computer science departments with higher precision and
recall than the query computer
science
China on a general purpose search engine.
3, labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules.
IS2140_Muddiest Points_Class11
In query translation, the searcher expresses queries in document language. I wonder that the web pages which rank high will be translated into searcher's language or just stay in document language.
Thursday, March 27, 2014
IS2140_Reading notes_Unit 11
1, Parallel
Query Processing
There are many ways in which parallelism can
help a search engine process queries faster. The
two most popular approaches are index partitioning and replication.
(1)
inter-query parallelism
Multiple queries can be processed in parallel
but each individual query is processed sequentially. By creating n replicas of the index and assigning each replica to a separate node, we
can realize an n-fold increase of the
search engine’s service rate (its theoretical throughput) without affecting the
time required to process a single query.
(2)
intra-query parallelism
To split the index into n parts and have
each node work only on its own small part of the index. Each query is processed
by multiple servers in parallel.
2, Index partitioning schemes
The two predominant index partitioning
schemes are document partitioning and term partitioning.In a
document-partitioned index, each node holds an index
for a subset of the documents in the
collection. In a term-partitioned index, each node is responsible for a subset
of the terms in the collection.
3,Document Partitioning
In a document-partitioned index, each index
server is responsible for a subset of the documents
in the collection. Each incoming user query
is received by a frontend server, the receptionist,
which forwards it to all n index nodes, waits for them
to process the query, merges the search
results received from the index nodes, and
sends the final list of results to the user.
4, Term Partitioning
Term partitioning addresses the disk seek
problem by splitting the collection into sets of terms
instead of sets of documents. Each index
node vi is responsible for a
certain term set Ti and
is involved in processing a given query only
if one or more query terms are members of Ti.
5,Replication.
We can maintain multiple copies of the same
index node, as described in the previous section on hybrid partitioning
schemes, and have them process queries in parallel. If one of the r replicas for a given index
node fails, the remaining r−1 will take over the load of the missing replica.
6, Partial Replication.
Instead of replicating the entire index r times, we may choose to
replicate index information only for important documents. The rationale behind
this strategy is that most search results aren’t vital and can easily be
replaced by an equally relevant document.
7, Dormant Replication.
Suppose the search engine comprises a total
of n nodes. We can divide the
index found on each node vi into n − 1
fragments and distribute them evenly among the n−1 remaining nodes, but
leave them dormant (on disk) and not use them for query processing. Only when vi fails will the
corresponding n−1
fragments be activated inside the remaining nodes and loaded into memory for
query processing.
8, MapReduce
MapReduce was inspired by the map and reduce functions found in
functional programming
languages, such as Lisp. The map function
takes as its arguments a function f and a list of
elements l = hl1, l2, . . . , lni. It
returns a new list map(f, l) = h f(l1), f(l2), . . . , f(ln) i.
The reduce function (also known as fold or accumulate) takes a function g and a list of elements
l = hl1, l2, . . . , lni. It
returns a new element l′, such thatl′ = reduce(g, l) = g(l1, g(l2, g(l3, . . .))).
MapReduces are highly parallelizable,
because both map and reduce can be executed in
parallel on many different machines.
Thursday, March 20, 2014
IS2140_Reading notes_Unit 10
1, Power
law:
5, Several aspects of Goto’s model are worth highlighting.
6 , Search Engine Marketing (SEM):
7 ,
How do search engines differentiate themselves and grow their traffic? Google
identified two principles that helped it grow at the expense of its
competitors:
8 , Three categories of web search queries:
In power law, the total number of web pages
with in-degree i is proportional to1/iα. The value
of αtypically reported by studies is 2.1.
2, Spam
First
generation of spam in the context of web search is the manipulation of web page
content for the purpose of appearing high up in search results for selected
keywords. To avoid irritating users with these repetitions, sophisticated
spammers restored to such tricks as rendering these repeated terms in the same
color as the background. More complex
spamming techniques involve manipulation of the metadata related to a page
including the links into a web page.
3, Search
Engine Optimizers
SEOs provide
consultancy services for clients who seek to have their web pages rank highly
on selected keywords.
4, Cost Per
Mil(CPM) basis: the cost to the company of having its banner advertisement
displayed 1000 times. An advertisement (1) is priced by the number of times it
is displayed (impressions); (2)is priced by the number of times it was clicked
on by the user(cost per click model). 5, Several aspects of Goto’s model are worth highlighting.
(1) A user typing the query q into
Goto’s search interface was actively expressing an interest and intent related
to the query q.
(2) Goto only got compensated when
a user actually expressed interest in an advertisement – as evinced by the user
clicking the advertisement.
For advertisers, understanding how
search engines do this ranking and how to allocate marketing campaign budgets
to different keywords and to different sponsored search engines has become a
profession known as search engine marketing (SEM).
(1) a focus on relevance,
specifically precision rather than recall in the first few results;
(2) a user experience that is
lightweight, meaning that both the search query page and the search results page
are uncluttered and almost entirely textual, with very few graphical elements.
(1) Informational queries seek
general information on a broad topic
(2)Navigational queries seek the website or home page of a
single entity that the user has in mind, say Lufthansa airlines.
(3)A
transactional query is one that is a prelude to the
user performing a transaction on the Web – such as purchasing a product,
downloading a file or making a reservation.
9, Teleport operation:
In the teleport operation the surfer jumps
from a node to any other node in the web graph.
10,A
Markov chain is a discrete-time stochastic process: a process that
occurs in a series of time-steps in each of
which a random choice is made. A Markov chain consists of N states.
Each web page will correspond to a state in the Markov chain we will formulate.
Subscribe to:
Comments (Atom)