Thursday, February 27, 2014

IS2140_Reading notes_Unit 8


1, Information visualization

Information visualization attempts to provide visual depictions of very large information spaces. The main information visualization techniques include brushing and linking, panning and zooming, focus-plus-context, magic lenses, and the use of animation to retain context and help make occluded information visible.

2, Tools of computer interface design: windows, menus, icons, dialog boxes, and so on.

3, Brushing and linking refers to the connecting of two or more views of the same data, such that a change to the representation in one view affects the representation in the other views as well.

Panning and zooming refers to the actions of a movie camera that can scan sideways across a scene (panning) or move in for a close up or back away to get a wider view (zooming).

Magic lenses are directly manipulable transparent windows that, when overlapped on some other data type, cause a transformation to be applied to the underlying data, thus changing its appearance.

4, Important individual differences among users of information access interfaces include relative spatial ability and memory, reasoning abilities, verbal aptitude, and (potentially) personality differences.

5, Models of Interaction

Most accounts of the information access process assume an interaction cycle consisting of query specification, receipt and examination of retrieval results, and then either stopping or reformulating the query and repeating the process until a perfect result set is found.

IS2140_Muddiest Points_Class7

No muddiest points.

Thursday, February 20, 2014

IS2140_Reading notes_Unit 7


IIR chapter 9

1. The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set.

 

The basic procedure is:

• The user issues a (short, simple) query.

• The system returns an initial set of retrieval results.

• The user marks some returned documents as relevant or non-relevant.

• The system computes a better representation of the information need based on the user feedback.

• The system displays a revised set of retrieval results.

 

2. The Rocchio algorithm: the classic algorithm for implementing relevance feedback, which forms a new query vector by moving the original query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant documents.
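A minimal dict-based sketch of the Rocchio update (the weights alpha = 1.0, beta = 0.75, gamma = 0.15 and the toy term vectors are illustrative assumptions, not values from the course):

```python
# Rocchio query refinement: move the query vector toward the centroid of
# relevant documents and away from the centroid of non-relevant ones:
#   q_m = alpha*q0 + beta*mean(relevant) - gamma*mean(non-relevant)
# Vectors are dicts mapping term -> weight.

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(q0)
    for d in relevant + nonrelevant:
        terms |= set(d)
    qm = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        non = sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        w = alpha * q0.get(t, 0.0) + beta * rel - gamma * non
        qm[t] = max(w, 0.0)  # negative weights are conventionally clipped to 0
    return qm

q = {"apple": 1.0}
rel = [{"apple": 0.8, "ipad": 0.6}]
non = [{"apple": 0.5, "pie": 0.9}]
print(rocchio(q, rel, non))
```

Terms seen in the marked-relevant document ("ipad") gain weight, while terms seen only in the non-relevant one ("pie") are pushed to zero.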

3.Cases where relevance feedback alone is not sufficient include:

Misspellings.

Cross-language information retrieval.

Mismatch of searcher’s vocabulary versus collection vocabulary.

 

4.Pseudo relevance feedback, also known as blind relevance feedback, provides a method for automatic local analysis. It automates the manual part of relevance feedback, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, to then assume that the top k ranked documents are relevant, and finally to do relevance feedback as before under this assumption.
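A sketch of that loop, with a toy overlap-based `search` standing in for a real retrieval function (all function names, documents, and parameter values here are invented for illustration):

```python
# Pseudo (blind) relevance feedback: run the query, assume the top-k
# results are relevant, expand the query with their most frequent terms,
# and search again. `search` is a toy stand-in that ranks documents by
# term overlap with the query.
from collections import Counter

def search(query, docs):
    return sorted(docs, key=lambda d: len(set(query) & set(d.split())), reverse=True)

def pseudo_rf(query, docs, k=2, n_terms=3):
    top_k = search(query, docs)[:k]            # assumed relevant
    counts = Counter(t for d in top_k for t in d.split())
    expansion = [t for t, _ in counts.most_common(n_terms) if t not in query]
    return search(query + expansion, docs)     # re-run with expanded query

docs = ["apple ipad tablet", "apple fruit pie", "ipad tablet review"]
print(pseudo_rf(["apple"], docs))
# → ['apple ipad tablet', 'ipad tablet review', 'apple fruit pie']
```

Note that "ipad tablet review", which shares no term with the original query, is promoted by the expansion terms drawn from the assumed-relevant top documents.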

 

5.Implicit feedback

While users are often reluctant to provide explicit feedback, it is easy to collect implicit feedback in large quantities for a high-volume system, such as a web search engine.

 

6. Three global methods for expanding a query:

 (1)by simply aiding the user in doing so,

 (2) by using a manual thesaurus,

 (3)through building a thesaurus automatically.

 

IS2140_Muddiest Points_Class6


Assume there are 100 documents in the collection: 20 contain the term apple (fruit), 20 contain the term apple (company), and 20 contain the term iPad, and no document contains both apple and iPad. A user searches for the term apple (company). Since the iPad is a product of the Apple company, the 20 documents containing the term iPad are also relevant to the query, so S equals 40 and s equals 20, right?

Thursday, February 13, 2014

IS2140_Muddiest Points_Class5


Is there any tool we can use to calculate models?

IS2140_Reading notes_Unit 6


IIR chapter 8.

Chapter 8 Evaluation in information retrieval

 

1, Test Collection

  (1). A document collection

  (2). A test suite of information needs, expressible as queries

  (3). A set of relevance judgments, standardly a binary assessment of either relevant or non-relevant for each query-document pair.

         Relevance is assessed relative to an information need.

 

2,Standard test collections

(1)   Cranfield collection: the pioneering test collection, allowing precise quantitative measures of information retrieval effectiveness.

(2)   Text Retrieval Conference (TREC)

(3)   NII Test Collections for IR Systems (NTCIR): has built various test collections of similar sizes to the TREC collections, focusing on East Asian languages and cross-language information retrieval.

(4)   Cross-Language Evaluation Forum (CLEF): concentrates on European languages and cross-language information retrieval.

(5)   Reuters-21578 and Reuters-RCV1

(6)   20 Newsgroups

 

3, Evaluation of unranked retrieval sets

(1)   Precision: the fraction of retrieved documents that are relevant

P = tp / (tp + fp)

(2)   Recall: the fraction of relevant documents that are retrieved

R = tp / (tp + fn)

(3)   Accuracy: the fraction of all classifications (relevant and non-relevant) that are correct

accuracy = (tp + tn) / (tp + fp + fn + tn)
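The three measures as a small Python sketch (the tp/fp/fn/tn counts in the example are made-up numbers):

```python
# Precision, recall, and accuracy from the four contingency counts:
# true positives (tp), false positives (fp), false negatives (fn),
# and true negatives (tn).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Toy example: 8 relevant docs retrieved, 2 non-relevant retrieved,
# 4 relevant missed, 86 non-relevant correctly left out.
print(precision(8, 2))        # 0.8
print(recall(8, 4))           # ≈ 0.667
print(accuracy(8, 2, 4, 86))  # 0.94
```

The example also shows why accuracy alone is misleading in IR: the many true negatives dominate the count.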

 

4, ROC Curve

An ROC curve plots the true positive rate or sensitivity against the false positive rate or (1 − specificity). Sensitivity is just another term for recall.

The false positive rate = fp / (fp + tn)

Specificity = tn / (fp + tn)

 

5, kappa statistic

A common measure of agreement between judges; it is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.
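A sketch of the kappa computation for two judges making binary relevance judgments, estimating chance agreement from the pooled marginals (the judgment lists are invented toy data):

```python
# Kappa statistic for two judges:
#   kappa = (P(A) - P(E)) / (1 - P(E))
# where P(A) is the observed agreement rate and P(E) the agreement
# expected by chance, estimated from pooled marginal probabilities.

def kappa(j1, j2):
    n = len(j1)
    p_agree = sum(a == b for a, b in zip(j1, j2)) / n
    # pooled probability of a "relevant" (1) judgment across both judges
    p_rel = (sum(j1) + sum(j2)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

j1 = [1, 1, 0, 0, 1, 0, 1, 0]
j2 = [1, 1, 0, 1, 1, 0, 0, 0]
print(kappa(j1, j2))  # 0.5
```

Here the judges agree on 6 of 8 documents (0.75), but half of that agreement is expected by chance, giving kappa = 0.5.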

 

 


Thursday, February 6, 2014

IS2140_Muddiest Points_Class4

I'm confused about how to draw the d1, d2, d3 and q lines on slide 51.

IS2140_Reading notes_Unit 5


IIR chapters 11 and 12

Chapter 11

1, Probability basics

chain rule: P(A, B) = P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)

partition rule: P(B) = P(A, B) + P(Ā, B)
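Both rules can be checked numerically on a toy joint distribution over two binary events (the probability values are arbitrary):

```python
# Numeric check of the chain and partition rules for binary events A, B.
# joint[(a, b)] = P(A=a, B=b); the values are made up for illustration.
joint = {(1, 1): 0.2, (1, 0): 0.3, (0, 1): 0.1, (0, 0): 0.4}

p_B = joint[(1, 1)] + joint[(0, 1)]   # partition: P(B) = P(A,B) + P(not A, B)
p_A = joint[(1, 1)] + joint[(1, 0)]
p_A_given_B = joint[(1, 1)] / p_B     # P(A|B) = P(A,B) / P(B)
p_B_given_A = joint[(1, 1)] / p_A     # P(B|A) = P(A,B) / P(A)

# chain rule: both products recover P(A,B)
print(p_A_given_B * p_B)  # ≈ 0.2
print(p_B_given_A * p_A)  # ≈ 0.2
```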

 

2,The Probability Ranking Principle

(1)    The 1/0 loss case

Probability Ranking Principle (PRP):

 

For a query q and a document d in the collection, let Rd,q be an indicator random variable that says whether d is relevant with respect to a given query q. That is, it takes on a value of 1 when the document is relevant and 0 otherwise. Using a probabilistic model, the obvious order in which to present documents to the user is to rank documents by their estimated probability of relevance with respect to the information need: P(R = 1|d, q).

 

         1/0 loss: a binary situation where you are evaluated on your accuracy

d is relevant iff P(R = 1|d, q) > P(R = 0|d, q)   (11.6)

(2)    The PRP with retrieval cost

  C0 · P(R = 0|d) − C1 · P(R = 1|d) ≤ C0 · P(R = 0|d′) − C1 · P(R = 1|d′)

        C1: the cost of not retrieving a relevant document

        C0: the cost of retrieval of a non-relevant document. Then the Probability Ranking Principle says that if, for a specific document d, the inequality above holds for all documents d′ not yet retrieved, then d is the next document to be retrieved.
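A sketch of cost-based ranking under this principle (the cost values and relevance probabilities are illustrative assumptions):

```python
# PRP with retrieval costs: retrieve next the document minimizing
# C0*P(R=0|d) - C1*P(R=1|d), where C1 is the cost of missing a relevant
# document and C0 the cost of retrieving a non-relevant one.

def expected_cost(p_rel, c0=1.0, c1=2.0):
    return c0 * (1 - p_rel) - c1 * p_rel

# toy estimated probabilities of relevance
p_rel = {"d1": 0.9, "d2": 0.3, "d3": 0.6}
ranking = sorted(p_rel, key=lambda d: expected_cost(p_rel[d]))
print(ranking)  # ['d1', 'd3', 'd2']
```

With these costs the ordering coincides with ranking by P(R = 1|d) alone; differential costs matter when C0 and C1 vary per situation.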

 

3, The Binary Independence Model

The Binary Independence Model (BIM) we present in this section is the model that has traditionally been used with the PRP. It introduces some simple assumptions, which make estimating the probability function P(R|d, q) practical.

A document d is represented by the vector x = (x1, . . . , xM), where xt = 1 if term t is present in document d and xt = 0 if t is not present in d.
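A sketch of this binary term-incidence representation (the vocabulary and documents are toy examples):

```python
# Binary term-incidence vector for the BIM: x_t = 1 if term t occurs in
# the document, 0 otherwise. Term frequency is deliberately ignored.
vocab = ["apple", "ipad", "pie", "tablet"]

def binary_vector(doc, vocab=vocab):
    terms = set(doc.split())
    return [1 if t in terms else 0 for t in vocab]

print(binary_vector("apple pie apple"))  # [1, 0, 1, 0]
```

Note that the repeated "apple" still yields a 1, not a 2: the model records only presence or absence.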

 

 

Chapter 12 

1, Finite automata and language models

A finite automaton can generate strings that include the examples; the full set of strings that can be generated is called the language of the automaton.

 

2, The query likelihood model

In the query likelihood model, we construct from each document d in the collection a language model Md, and then rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query.

P(d|q) = P(q|d)P(d)/P(q)

The Language Modeling approach thus attempts to model the query generation process: documents are ranked by the probability that a query would be observed as a random sample from the respective document model. The most common way is the multinomial unigram language model, which is equivalent to a multinomial Naive Bayes model.
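A deliberately unsmoothed maximum-likelihood sketch of the query likelihood score (the documents and query are toy data; real systems smooth these estimates so that unseen query terms do not zero out the whole score):

```python
# Query likelihood with a multinomial unigram language model:
# P(q|d) = product over query terms t of P(t|Md), where P(t|Md) is
# estimated here by maximum likelihood from the document's term counts.
from collections import Counter

def query_likelihood(query, doc):
    counts = Counter(doc.split())
    total = sum(counts.values())
    p = 1.0
    for t in query.split():
        p *= counts[t] / total  # unseen terms give 0 (no smoothing)
    return p

docs = ["apple ipad apple", "apple fruit pie"]
scores = {d: query_likelihood("apple ipad", d) for d in docs}
print(max(scores, key=scores.get))  # apple ipad apple
```

The second document scores exactly 0 because it never contains "ipad", which is precisely the weakness that smoothing addresses.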