IS2140_Ruiting Yi: IS2140_Reading notes

1, Parallel Query Processing

There are many ways in which parallelism can help a search engine process queries faster. The

two most popular approaches are index partitioning and replication.

Multiple queries can be processed in parallel but each individual query is processed sequentially. By creating n replicas of the index and assigning each replica to a separate node, we can realize an n-fold increase of the search engine’s service rate (its theoretical throughput) without affecting the time required to process a single query.

(2) intra-query parallelism

To split the index into n parts and have each node work only on its own small part of the index. Each query is processed by multiple servers in parallel.

2, Index partitioning schemes

The two predominant index partitioning schemes are document partitioning and term partitioning.In a document-partitioned index, each node holds an index

for a subset of the documents in the collection. In a term-partitioned index, each node is responsible for a subset of the terms in the collection.

3,Document Partitioning

In a document-partitioned index, each index server is responsible for a subset of the documents

in the collection. Each incoming user query is received by a frontend server, the receptionist,

which forwards it to all n index nodes, waits for them to process the query, merges the search

results received from the index nodes, and sends the final list of results to the user.

4, Term Partitioning

Term partitioning addresses the disk seek problem by splitting the collection into sets of terms

instead of sets of documents. Each index node vi is responsible for a certain term set Ti and

is involved in processing a given query only if one or more query terms are members of Ti.

5,Replication.

We can maintain multiple copies of the same index node, as described in the previous section on hybrid partitioning schemes, and have them process queries in parallel. If one of the r replicas for a given index node fails, the remaining r−1 will take over the load of the missing replica.

6, Partial Replication.

Instead of replicating the entire index r times, we may choose to replicate index information only for important documents. The rationale behind this strategy is that most search results aren’t vital and can easily be replaced by an equally relevant document.

7, Dormant Replication.

Suppose the search engine comprises a total of n nodes. We can divide the index found on each node vi into n − 1 fragments and distribute them evenly among the n−1 remaining nodes, but leave them dormant (on disk) and not use them for query processing. Only when vi fails will the corresponding n−1 fragments be activated inside the remaining nodes and loaded into memory for query processing.

8, MapReduce

MapReduce was inspired by the map and reduce functions found in functional programming

languages, such as Lisp. The map function takes as its arguments a function f and a list of

elements l = hl1, l2, . . . , lni. It returns a new list map(f, l) = h f(l1), f(l2), . . . , f(ln) i.

The reduce function (also known as fold or accumulate) takes a function g and a list of elements

l = hl1, l2, . . . , lni. It returns a new element l′, such thatl′ = reduce(g, l) = g(l1, g(l2, g(l3, . . .))).

MapReduces are highly parallelizable, because both map and reduce can be executed in

parallel on many different machines.

IS2140_Ruiting Yi

Thursday, March 27, 2014

IS2140_Reading notes_Unit 11

No comments:

Post a Comment