Thursday, March 27, 2014

IS2140_Reading notes_Unit 11


1, Parallel Query Processing
There are many ways in which parallelism can help a search engine process queries faster. The
two most popular approaches are index partitioning and replication.
(1)   inter-query parallelism
 
 Multiple queries can be processed in parallel but each individual query is processed sequentially.  By creating n replicas of the index and assigning each replica to a separate node, we can realize an n-fold increase of the search engine’s service rate (its theoretical throughput) without affecting the time required to process a single query.
 
(2)   intra-query parallelism
 
To split the index into n parts and have each node work only on its own small part of the index. Each query is processed by multiple servers in parallel.
 
2, Index partitioning schemes
 
The two predominant index partitioning schemes are document partitioning and term partitioning.In a document-partitioned index, each node holds an index
for a subset of the documents in the collection. In a term-partitioned index, each node is responsible for a subset of the terms in the collection.
 
3,Document Partitioning
In a document-partitioned index, each index server is responsible for a subset of the documents
in the collection. Each incoming user query is received by a frontend server, the receptionist,
which forwards it to all n index nodes, waits for them to process the query, merges the search
results received from the index nodes, and sends the final list of results to the user.
 
 
4, Term Partitioning
Term partitioning addresses the disk seek problem by splitting the collection into sets of terms
instead of sets of documents. Each index node vi is responsible for a certain term set Ti and
is involved in processing a given query only if one or more query terms are members of Ti.
 
5,Replication.
We can maintain multiple copies of the same index node, as described in the previous section on hybrid partitioning schemes, and have them process queries in parallel. If one of the r replicas for a given index node fails, the remaining r1 will take over the load of the missing replica.
 
6, Partial Replication.
Instead of replicating the entire index r times, we may choose to replicate index information only for important documents. The rationale behind this strategy is that most search results aren’t vital and can easily be replaced by an equally relevant document.
 
7, Dormant Replication.
Suppose the search engine comprises a total of n nodes. We can divide the index found on each node vi into n 1 fragments and distribute them evenly among the n1 remaining nodes, but leave them dormant (on disk) and not use them for query processing. Only when vi fails will the corresponding n1 fragments be activated inside the remaining nodes and loaded into memory for query processing.
 
8, MapReduce
MapReduce was inspired by the map and reduce functions found in functional programming
languages, such as Lisp. The map function takes as its arguments a function f and a list of
elements l = hl1, l2, . . . , lni. It returns a new list map(f, l) = h f(l1), f(l2), . . . , f(ln) i.
 
The reduce function (also known as fold or accumulate) takes a function g and a list of elements
l = hl1, l2, . . . , lni. It returns a new element l, such thatl= reduce(g, l) = g(l1, g(l2, g(l3, . . .))).
 
MapReduces are highly parallelizable, because both map and reduce can be executed in
parallel on many different machines.
 

No comments:

Post a Comment