IS2140_Ruiting Yi

Thursday, April 17, 2014

IS2140_Muddiest Points_Class13

No muddiest points.

Thursday, April 10, 2014

IS2140_Muddiest Points_Class12

Class 12

Integrating users’ clickthrough data is often the most effective approach. Does the clickthrough data mean the number of total clicks on one specific link which is relevant to the query or all relevant links?

Thursday, April 3, 2014

IS2140_Reading notes_Unit 12

1， Personal email sorting

A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder.

2， Topic-specific or vertical search

Vertical search engines restrict searches to a particular topic. For example, the query computer science on a vertical search engine for the topic China will return a list of Chinese computer science departments with higher precision and recall than the query computer science China on a general purpose search engine.

3， labeling refers to the process of annotating each document with its class. But labeling is arguably an easier task than writing rules.

4， Mutual information measures how much information – in the information theoretic sense – a term contains about the class. If a term’s distribution is the same in the class as it is in the collection as a whole, then I(U; C) = 0. MI reaches its maximum value if the term is a perfect indicator for class membership, that is, if the term is present in a document if and only if the document is in the class.

IS2140_Muddiest Points_Class11

In query translation, the searcher expresses queries in document language. I wonder that the web pages which rank high will be translated into searcher's language or just stay in document language.

Thursday, March 27, 2014

IS2140_Reading notes_Unit 11

1, Parallel Query Processing

There are many ways in which parallelism can help a search engine process queries faster. The

two most popular approaches are index partitioning and replication.

(1) inter-query parallelism

Multiple queries can be processed in parallel but each individual query is processed sequentially. By creating n replicas of the index and assigning each replica to a separate node, we can realize an n-fold increase of the search engine’s service rate (its theoretical throughput) without affecting the time required to process a single query.

(2) intra-query parallelism

To split the index into n parts and have each node work only on its own small part of the index. Each query is processed by multiple servers in parallel.

2, Index partitioning schemes

The two predominant index partitioning schemes are document partitioning and term partitioning.In a document-partitioned index, each node holds an index

for a subset of the documents in the collection. In a term-partitioned index, each node is responsible for a subset of the terms in the collection.

3,Document Partitioning

In a document-partitioned index, each index server is responsible for a subset of the documents

in the collection. Each incoming user query is received by a frontend server, the receptionist,

which forwards it to all n index nodes, waits for them to process the query, merges the search

results received from the index nodes, and sends the final list of results to the user.

4, Term Partitioning

Term partitioning addresses the disk seek problem by splitting the collection into sets of terms

instead of sets of documents. Each index node vi is responsible for a certain term set Ti and

is involved in processing a given query only if one or more query terms are members of Ti.

5,Replication.

We can maintain multiple copies of the same index node, as described in the previous section on hybrid partitioning schemes, and have them process queries in parallel. If one of the r replicas for a given index node fails, the remaining r−1 will take over the load of the missing replica.

6, Partial Replication.

Instead of replicating the entire index r times, we may choose to replicate index information only for important documents. The rationale behind this strategy is that most search results aren’t vital and can easily be replaced by an equally relevant document.

7, Dormant Replication.

Suppose the search engine comprises a total of n nodes. We can divide the index found on each node vi into n − 1 fragments and distribute them evenly among the n−1 remaining nodes, but leave them dormant (on disk) and not use them for query processing. Only when vi fails will the corresponding n−1 fragments be activated inside the remaining nodes and loaded into memory for query processing.

8, MapReduce

MapReduce was inspired by the map and reduce functions found in functional programming

languages, such as Lisp. The map function takes as its arguments a function f and a list of

elements l = hl1, l2, . . . , lni. It returns a new list map(f, l) = h f(l1), f(l2), . . . , f(ln) i.

The reduce function (also known as fold or accumulate) takes a function g and a list of elements

l = hl1, l2, . . . , lni. It returns a new element l′, such thatl′ = reduce(g, l) = g(l1, g(l2, g(l3, . . .))).

MapReduces are highly parallelizable, because both map and reduce can be executed in

parallel on many different machines.

IS2140_Muddiest Points_Class10

No muddiest points.

Thursday, March 20, 2014

IS2140_Reading notes_Unit 10

1, Power law:

In power law, the total number of web pages with in-degree i is proportional to1/i α. The value of αtypically reported by studies is 2.1.

2, Spam

First generation of spam in the context of web search is the manipulation of web page content for the purpose of appearing high up in search results for selected keywords. To avoid irritating users with these repetitions, sophisticated spammers restored to such tricks as rendering these repeated terms in the same color as the background. More complex spamming techniques involve manipulation of the metadata related to a page including the links into a web page.

3, Search Engine Optimizers

SEOs provide consultancy services for clients who seek to have their web pages rank highly on selected keywords.

4, Cost Per Mil(CPM) basis: the cost to the company of having its banner advertisement displayed 1000 times. An advertisement (1) is priced by the number of times it is displayed (impressions); (2)is priced by the number of times it was clicked on by the user(cost per click model).

5, Several aspects of Goto’s model are worth highlighting.

(1) A user typing the query q into Goto’s search interface was actively expressing an interest and intent related to the query q.

(2) Goto only got compensated when a user actually expressed interest in an advertisement – as evinced by the user clicking the advertisement.

6, Search Engine Marketing (SEM):

For advertisers, understanding how search engines do this ranking and how to allocate marketing campaign budgets to different keywords and to different sponsored search engines has become a profession known as search engine marketing (SEM).

7, How do search engines differentiate themselves and grow their traffic? Google identified two principles that helped it grow at the expense of its competitors:

(1) a focus on relevance, specifically precision rather than recall in the first few results;

(2) a user experience that is lightweight, meaning that both the search query page and the search results page are uncluttered and almost entirely textual, with very few graphical elements.

8, Three categories of web search queries:

(1) Informational queries seek general information on a broad topic

(2)Navigational queries seek the website or home page of a single entity that the user has in mind, say Lufthansa airlines.

(3)A transactional query is one that is a prelude to the user performing a transaction on the Web – such as purchasing a product, downloading a file or making a reservation.

9， Teleport operation：

In the teleport operation the surfer jumps from a node to any other node in the web graph.

10,A Markov chain is a discrete-time stochastic process: a process that occurs in a series of time-steps in each of which a random choice is made. A Markov chain consists of N states. Each web page will correspond to a state in the Markov chain we will formulate.