Thursday, April 17, 2014
Thursday, April 10, 2014
IS2140_Muddiest Points_Class12
Class 12
Integrating users’ clickthrough data is often the most effective approach. Does clickthrough data mean the total number of clicks on one specific link that is relevant to the query, or the clicks on all relevant links?
Thursday, April 3, 2014
IS2140_Reading notes_Unit 12
1, Personal email sorting
A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder.
2, Topic-specific or vertical search
Vertical search engines restrict searches to a particular topic. For example, the query computer science on a vertical search engine for the topic China will return a list of Chinese computer science departments with higher precision and recall than the query computer science China on a general-purpose search engine.
3, Labeling
Labeling refers to the process of annotating each document with its class, but labeling is arguably an easier task than writing rules.
4, Mutual information
Mutual information measures how much information – in the information-theoretic sense – a term contains about the class. If a term’s distribution is the same in the class as it is in the collection as a whole, then I(U; C) = 0. MI reaches its maximum value if the term is a perfect indicator for class membership, that is, if the term is present in a document if and only if the document is in the class.
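To make the two limiting cases above concrete, here is a small sketch of the mutual-information computation from the term/class contingency table (the four counts are hypothetical, not from the reading):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between term occurrence and class membership, computed from
    the four cells of the term/class contingency table:
    n11 = docs in the class containing the term, n10 = docs outside the
    class containing the term, n01/n00 = the same for absent terms."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),   # term present, in class
        (n10, n11 + n10, n10 + n00),   # term present, not in class
        (n01, n01 + n00, n11 + n01),   # term absent, in class
        (n00, n01 + n00, n10 + n00),   # term absent, not in class
    ]:
        if n_tc > 0:                   # 0 * log(...) contributes 0
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi
```

With equal counts in every cell the term is distributed identically in the class and the collection, so MI is 0; a perfect indicator (term present exactly in the class documents) gives the maximum.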
IS2140_Muddiest Points_Class11
In query translation, the searcher expresses queries in the document language. I wonder whether the web pages that rank high will be translated into the searcher's language or just stay in the document language.
Thursday, March 27, 2014
IS2140_Reading notes_Unit 11
1, Parallel Query Processing
There are many ways in which parallelism can
help a search engine process queries faster. The
two most popular approaches are index partitioning and replication.
(1) Inter-query parallelism
Multiple queries can be processed in parallel
but each individual query is processed sequentially. By creating n replicas of the index and assigning each replica to a separate node, we
can realize an n-fold increase of the
search engine’s service rate (its theoretical throughput) without affecting the
time required to process a single query.
(2) Intra-query parallelism
The index is split into n parts, and each node works only on its own small part of the index. Each query is processed by multiple servers in parallel.
2, Index partitioning schemes
The two predominant index partitioning
schemes are document partitioning and term partitioning. In a
document-partitioned index, each node holds an index
for a subset of the documents in the
collection. In a term-partitioned index, each node is responsible for a subset
of the terms in the collection.
3, Document Partitioning
In a document-partitioned index, each index
server is responsible for a subset of the documents
in the collection. Each incoming user query
is received by a frontend server, the receptionist,
which forwards it to all n index nodes, waits for them
to process the query, merges the search
results received from the index nodes, and
sends the final list of results to the user.
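The receptionist's forward-then-merge flow can be sketched with toy in-memory "nodes" (the scoring by term overlap and the data are made up for illustration; a real engine would run real ranking on each node):

```python
def make_node(doc_subset):
    """An index node holding only its own document subset: a toy index
    mapping doc id -> set of terms, scored here by query-term overlap."""
    def process(query_terms):
        return [(doc, len(query_terms & terms))
                for doc, terms in doc_subset.items()
                if query_terms & terms]
    return process

def receptionist(nodes, query_terms, k=3):
    """Forward the query to all n index nodes, merge their partial
    result lists, and return the final top-k list for the user."""
    results = []
    for node in nodes:
        results.extend(node(query_terms))
    results.sort(key=lambda r: (-r[1], r[0]))   # merge by score
    return results[:k]

nodes = [
    make_node({1: {"parallel", "index"}, 2: {"query"}}),
    make_node({3: {"index", "query"}, 4: {"cache"}}),
]
top = receptionist(nodes, {"index", "query"})
```

Each node touches only its own subset, so the per-node work shrinks as nodes are added, while the receptionist's merge stays cheap.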
4, Term Partitioning
Term partitioning addresses the disk seek
problem by splitting the collection into sets of terms
instead of sets of documents. Each index
node vi is responsible for a
certain term set Ti and
is involved in processing a given query only
if one or more query terms are members of Ti.
5, Replication
We can maintain multiple copies of the same
index node, as described in the previous section on hybrid partitioning
schemes, and have them process queries in parallel. If one of the r replicas for a given index
node fails, the remaining r−1 will take over the load of the missing replica.
6, Partial Replication.
Instead of replicating the entire index r times, we may choose to
replicate index information only for important documents. The rationale behind
this strategy is that most search results aren’t vital and can easily be
replaced by an equally relevant document.
7, Dormant Replication.
Suppose the search engine comprises a total
of n nodes. We can divide the
index found on each node vi into n − 1
fragments and distribute them evenly among the n−1 remaining nodes, but
leave them dormant (on disk) and not use them for query processing. Only when vi fails will the
corresponding n−1
fragments be activated inside the remaining nodes and loaded into memory for
query processing.
8, MapReduce
MapReduce was inspired by the map and reduce functions found in
functional programming
languages, such as Lisp. The map function takes as its arguments a function f and a list of elements l = ⟨l1, l2, . . . , ln⟩. It returns a new list map(f, l) = ⟨f(l1), f(l2), . . . , f(ln)⟩.
The reduce function (also known as fold or accumulate) takes a function g and a list of elements l = ⟨l1, l2, . . . , ln⟩. It returns a new element l′, such that l′ = reduce(g, l) = g(l1, g(l2, g(l3, . . .))).
MapReduce computations are highly parallelizable,
because both map and reduce can be executed in
parallel on many different machines.
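The two building blocks above exist directly in Python, which makes the definitions easy to try (toy data; note that functools.reduce is a left fold, which matches the nested g(l1, g(l2, …)) form whenever g is associative, as with addition here):

```python
from functools import reduce

# map applies f to each element; reduce folds the list with g.
l = [1, 2, 3, 4]
mapped = list(map(lambda x: x * x, l))       # ⟨f(l1), ..., f(ln)⟩
total = reduce(lambda a, b: a + b, mapped)   # g(l1, g(l2, g(l3, l4)))
```

Because each f(li) is independent, the map phase can be sharded across machines; the reduce phase can likewise be parallelized per key in the MapReduce framework.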
Thursday, March 20, 2014
IS2140_Reading notes_Unit 10
1, Power law:
In a power law, the total number of web pages with in-degree i is proportional to 1/i^α. The value of α typically reported by studies is 2.1.
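A quick numeric check of what α = 2.1 implies (toy truncation at 10,000, not from the reading): normalizing 1/i^α shows how heavily the probability mass concentrates at the smallest in-degrees.

```python
# Proportion of pages with each in-degree under a power law with
# exponent alpha = 2.1, truncated to in-degrees 1..10000.
alpha = 2.1
weights = [i ** -alpha for i in range(1, 10001)]
z = sum(weights)                      # normalizing constant
frac_degree_1 = weights[0] / z        # share of pages with in-degree 1
```

Under this law well over half of all pages have in-degree 1, which is why the in-degree distribution is described as heavy-tailed.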
2, Spam
First-generation spam in the context of web search is the manipulation of web page content for the purpose of appearing high up in search results for selected keywords. To avoid irritating users with these repetitions, sophisticated spammers resorted to tricks such as rendering the repeated terms in the same color as the background. More complex spamming techniques involve manipulation of the metadata related to a page, including the links into a web page.
3, Search Engine Optimizers
SEOs provide
consultancy services for clients who seek to have their web pages rank highly
on selected keywords.
4, Cost Per Mille (CPM) basis: the cost to the company of having its banner advertisement displayed 1000 times. An advertisement (1) is priced by the number of times it is displayed (impressions); or (2) is priced by the number of times it is clicked on by the user (cost-per-click model).
5, Several aspects of Goto’s model are worth highlighting.
(1) A user typing the query q into
Goto’s search interface was actively expressing an interest and intent related
to the query q.
(2) Goto only got compensated when
a user actually expressed interest in an advertisement – as evinced by the user
clicking the advertisement.
6, Search Engine Marketing (SEM):
For advertisers, understanding how search engines do this ranking and how to allocate marketing campaign budgets to different keywords and to different sponsored search engines has become a profession known as search engine marketing (SEM).
7, How do search engines differentiate themselves and grow their traffic? Google identified two principles that helped it grow at the expense of its competitors:
(1) a focus on relevance, specifically precision rather than recall in the first few results;
(2) a user experience that is lightweight, meaning that both the search query page and the search results page are uncluttered and almost entirely textual, with very few graphical elements.
8, Three categories of web search queries:
(1) Informational queries seek general information on a broad topic.
(2) Navigational queries seek the website or home page of a single entity that the user has in mind, say Lufthansa airlines.
(3) A transactional query is one that is a prelude to the user performing a transaction on the Web – such as purchasing a product, downloading a file, or making a reservation.
9, Teleport operation:
In the teleport operation the surfer jumps
from a node to any other node in the web graph.
10, A Markov chain is a discrete-time stochastic process: a process that occurs in a series of time-steps, in each of which a random choice is made. A Markov chain consists of N states. Each web page will correspond to a state in the Markov chain we will formulate.
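Tying points 9 and 10 together, a random surfer on a tiny made-up 3-page graph (teleport probability 0.15, a common illustrative value) can be simulated directly; the fraction of visits to each page approximates that page's stationary probability in the Markov chain.

```python
import random

# Toy web graph: page -> list of pages it links to (made-up example).
links = {0: [1], 1: [0, 2], 2: [0]}
random.seed(7)

def step(page, teleport=0.15):
    """One move of the surfer: teleport to any node with prob. 0.15,
    otherwise follow a random out-link of the current page."""
    if random.random() < teleport or not links[page]:
        return random.randrange(len(links))
    return random.choice(links[page])

visits = [0, 0, 0]
page = 0
for _ in range(100000):
    page = step(page)
    visits[page] += 1
```

Page 2 has only one incoming link here, so it accumulates the fewest visits; page 0, linked from both other pages, accumulates the most probability mass.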
Thursday, February 27, 2014
IS2140_Reading notes_Unit 8
1, Information visualization
Information visualization attempts to provide visual
depictions of very large information spaces. The main information
visualization techniques include brushing and linking, panning and zooming, focus-plus-context,
magic lenses, and the use of animation to retain context and help make occluded
information visible.
2, Tools of computer interface design: windows, menus,
icons, dialog boxes, and so on.
3, Brushing and linking refers to the connecting of two or
more views of the same data, such that a change to the representation in one
view affects the representation in the other views as well.
Panning and zooming refers to the actions of a movie camera
that can scan sideways across a scene (panning) or move in for a close up or
back away to get a wider view (zooming).
Magic lenses are directly manipulable transparent windows
that, when overlapped on some other data type, cause a transformation to be
applied to the underlying data, thus changing its appearance.
4, Important differences for information access interfaces
include relative spatial ability and memory, reasoning abilities, verbal
aptitude, and (potentially) personality differences.
5, Models of Interaction
Most accounts of the information access process assume an
interaction cycle consisting of query specification, receipt and examination of
retrieval results, and then either stopping or reformulating the query and
repeating the process until a perfect result set is found.
Thursday, February 20, 2014
IS2140_Reading notes_Unit 7
IIR chapter 9
1. Relevance feedback (RF) involves the user in the retrieval process so as to improve the final result set.
The basic procedure is:
• The user issues a (short, simple) query.
• The system returns an initial set of retrieval
results.
• The user marks some returned documents as relevant
or non-relevant.
• The system computes a better representation of the
information need based on the user feedback.
• The system displays a revised set of retrieval
results.
2. The Rocchio Algorithm
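My notes stopped at the header, so here is a hedged sketch of the idea: Rocchio forms the modified query as α times the original query vector, plus β times the centroid of the known relevant documents, minus γ times the centroid of the known non-relevant ones (the toy vectors and the default weights below are illustrative, not from my notes).

```python
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q and the documents are equal-length term-weight vectors."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(q)
        return [sum(col) / len(docs) for col in zip(*docs)]
    r, nr = centroid(relevant), centroid(nonrelevant)
    # Negative term weights are not useful for retrieval, so clip at 0.
    return [max(0.0, alpha * qi + beta * ri - gamma * ni)
            for qi, ri, ni in zip(q, r, nr)]
```

The update drags the query toward the relevant region of vector space; the clipping reflects that positive feedback (β) is usually weighted more heavily than negative feedback (γ).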
3. Cases where relevance feedback alone is not
sufficient include:
Misspellings.
Cross-language information retrieval.
Mismatch of searcher’s vocabulary versus collection
vocabulary.
4. Pseudo relevance feedback, also known as blind relevance feedback, provides a method for automatic local analysis. It automates the manual part of relevance feedback, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, then assume that the top k ranked documents are relevant, and finally do relevance feedback as before under this assumption.
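The retrieve / assume-top-k-relevant / expand loop can be sketched end to end (the documents, k, and the frequency-based expansion are toy choices for illustration; a real system would feed the top-k documents into something like Rocchio):

```python
from collections import Counter

# Made-up toy collection: each document is a list of terms.
docs = [["ir", "ranking", "bm25"],
        ["ir", "ranking", "lm"],
        ["cooking", "pasta"]]

def retrieve(query, collection):
    """Normal retrieval: score documents by query-term overlap."""
    scored = [(d, len(set(query) & set(d))) for d in collection]
    scored.sort(key=lambda x: -x[1])
    return [d for d, s in scored if s > 0]

def prf_expand(query, collection, k=2, n_terms=2):
    """Assume the top k results are relevant and expand the query
    with the most frequent new terms from those documents."""
    top_k = retrieve(query, collection)[:k]
    counts = Counter(t for d in top_k for t in d if t not in query)
    return list(query) + [t for t, _ in counts.most_common(n_terms)]

expanded = prf_expand(["ir"], docs)
```

The user never marks anything; the top-ranked documents stand in for explicit relevance judgments, which is exactly where pseudo-RF can go wrong when the initial results are off-topic.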
5. Implicit feedback
While users are often reluctant to provide explicit feedback, it is easy to collect implicit feedback in large quantities for a high-volume system, such as a web search engine.
6. Three global methods for expanding a query:
(1) by simply aiding the user in doing so,
(2) by using a manual thesaurus,
(3) by building a thesaurus automatically.
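Method (2) is simple enough to sketch with a tiny hand-built thesaurus (the entries below are made up for illustration): each query term is expanded with its listed synonyms.

```python
# Toy manual thesaurus: term -> synonyms (made-up entries).
thesaurus = {"car": ["automobile", "vehicle"], "cheap": ["inexpensive"]}

def expand(query_terms):
    """Append each query term's thesaurus synonyms, skipping duplicates."""
    out = list(query_terms)
    for t in query_terms:
        out += [s for s in thesaurus.get(t, []) if s not in out]
    return out
```

Expansion of this kind tends to raise recall at some cost in precision, since synonyms may shift the query's meaning.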
IS2140_Muddiest Points_Class6
Assume there are 100 documents in the collection: 20 documents contain the term apple (fruit), 20 documents contain the term apple (company), and 20 documents contain the term iPad, and no document contains apple and iPad at the same time. A user searches for the term apple (company). Since the iPad is a product of the Apple company, the 20 documents containing the term iPad are relevant to the query, so S equals 40 and s equals 20, right?
Thursday, February 13, 2014
IS2140_Reading notes_Unit 6
IIR
chapter 8.
Chapter
8 Evaluation in information retrieval
1, Test Collection
(1) A document collection
(2) A test suite of information needs, expressible as queries
(3) A set of relevance judgments, standardly a binary assessment of either relevant or non-relevant for each query-document pair.
Relevance is assessed relative to an
information need.
2, Standard test collections
(1) Cranfield collection : allowing precise quantitative
measures of information retrieval effectiveness.
(2) Text Retrieval Conference(TREC)
(3) NII Test Collections for IR System(NTCIR): has built
various test collections of similar sizes to the TREC collections, focusing on
East Asian language and cross-language information retrieval.
(4) Cross Language Evaluation Forum(CLEF): concentrating on
European languages and cross-language information retrieval.
(5) Reuters-21578 and Reuters-RCV1
(6) 20 Newsgroups
3, Evaluation of unranked retrieval sets
(1) Precision: the fraction of retrieved documents that are relevant
(2) Recall: the fraction of relevant documents that are retrieved
P = tp/(tp + fp)
R = tp/(tp + fn)
accuracy = (tp + tn)/(tp + fp + fn + tn)
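The three formulas are direct to code; the counts below (tp = 8, fp = 2, fn = 4, tn = 86) are made up to show why accuracy can look flattering when non-relevant documents dominate:

```python
def precision(tp, fp):
    return tp / (tp + fp)           # P = tp/(tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)           # R = tp/(tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fp + fn + tn)
```

With these toy counts, accuracy is 0.94 even though recall is only 2/3, because the many true negatives inflate it; this is why IR evaluation prefers precision and recall.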
4, ROC Curve
An ROC curve plots the true positive rate, or sensitivity, against the false positive rate, or (1 − specificity).
Sensitivity is just another term for recall.
The false positive rate = fp/(fp + tn).
Specificity = tn/(fp + tn).
5, Kappa statistic
The kappa statistic is a common measure of agreement between judges; it is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.
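For two judges making binary relevance judgments, kappa = (P(A) − P(E)) / (1 − P(E)), where P(A) is the observed agreement rate and P(E) is the chance agreement rate from the judges' pooled marginals. A sketch from the 2x2 count table (the example counts below are illustrative):

```python
def kappa(n11, n10, n01, n00):
    """Two-judge kappa from counts: n11 = both say relevant,
    n00 = both say non-relevant, n10/n01 = the two disagreement cells.
    Chance agreement is computed from pooled marginal proportions."""
    n = n11 + n10 + n01 + n00
    p_agree = (n11 + n00) / n
    # Pooled proportion of "relevant" judgments across both judges.
    p_rel = (2 * n11 + n10 + n01) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)
```

Perfect agreement gives kappa = 1, and agreement no better than chance gives kappa = 0, which is what the correction term is for.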
Thursday, February 6, 2014
IS2140_Muddiest Points_Class4
I'm confused about how to draw the d1, d2, d3 and q lines on slide 51.
IS2140_Reading notes_Unit 5
IIR chapters 11 and 12
Chapter 11
1, Probability basics
Chain rule: P(A, B) = P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
Partition rule: P(B) = P(A, B) + P(Ā, B)
2, The Probability Ranking Principle
(1) The 1/0 loss case
Probability Ranking Principle (PRP):
For
a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given
query q. That is, it takes on a
value of 1 when the document is relevant and 0 otherwise. Using a probabilistic
model, the obvious order in which to present documents to the user is to rank
documents by their estimated probability of relevance with respect to the
information need: P(R = 1|d, q).
1/0 loss: a binary situation where you are evaluated on your accuracy.
d is relevant iff P(R = 1|d, q) > P(R = 0|d, q) (Eq. 11.6)
(2) The PRP with retrieval cost
C0 · P(R = 0|d) − C1 · P(R = 1|d) ≤ C0 · P(R = 0|d′) − C1 · P(R = 1|d′)
C1: the cost of not retrieving a relevant document.
C0: the cost of retrieving a non-relevant document.
The Probability Ranking Principle then says that if this inequality holds for a specific document d and for all documents d′ not yet retrieved, then d is the next document to be retrieved.
3, The Binary Independence Model
The Binary Independence Model (BIM) we present in this section is the model that has traditionally been used with the PRP. It introduces some simple assumptions, which make estimating the probability function P(R|d, q) practical.
A document d is represented by the vector x = (x1, . . . , xM), where xt = 1 if term t is present in document d and xt = 0 if t is not present in d.
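The binary representation is easy to illustrate (the vocabulary and document below are made up): each document becomes a 0/1 vector over the vocabulary, and repeated occurrences of a term make no difference.

```python
# Toy BIM-style representation: x_t = 1 iff term t occurs in the doc.
vocab = ["information", "retrieval", "model", "probability"]

def to_binary_vector(doc_terms):
    terms = set(doc_terms)                 # multiplicity is discarded
    return [1 if t in terms else 0 for t in vocab]

x = to_binary_vector(["probability", "retrieval", "retrieval"])
```

Discarding term frequency is one of the "simple assumptions" the BIM makes to keep P(R|d, q) estimable.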
Chapter 12
1, Finite automata and language models
The finite automaton can generate strings that include the examples. The full set of strings that can be generated is called the language of the automaton.
2, The query likelihood model
In the query likelihood model, we construct from each document d in the collection a language model Md. We then rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query.
P(d|q) = P(q|d)P(d)/P(q)
The language modeling approach thus attempts to model the query generation process: documents are ranked by the probability that a query would be observed as a random sample from the respective document model. The most common choice is the multinomial unigram language model, which is equivalent to a multinomial Naive Bayes model.
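A minimal sketch of query-likelihood ranking with a smoothed unigram model (the two documents, the query, and the interpolation weight λ = 0.5 are toy choices; smoothing with the collection model avoids zero probabilities for unseen terms):

```python
import math

# Made-up two-document collection.
docs = {
    "d1": ["click", "shears", "shears"],
    "d2": ["click", "metal"],
}
coll = [t for terms in docs.values() for t in terms]

def p_term(t, terms, lam=0.5):
    """P(t | Md) by linear interpolation of the document model
    with the collection model (Jelinek-Mercer smoothing)."""
    p_doc = terms.count(t) / len(terms)
    p_coll = coll.count(t) / len(coll)
    return lam * p_doc + (1 - lam) * p_coll

def score(query, terms):
    """log P(q | Md) under the multinomial unigram model."""
    return sum(math.log(p_term(t, terms)) for t in query)

ranking = sorted(docs, key=lambda d: score(["shears", "click"], docs[d]),
                 reverse=True)
```

Documents under whose model the query is the more probable random sample rank first; here d1 wins because it actually contains "shears".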