1, Parallel
Query Processing
There are many ways in which parallelism can
help a search engine process queries faster. The
two most popular approaches are index partitioning and replication.
(1)
inter-query parallelism
Multiple queries can be processed in parallel
but each individual query is processed sequentially. By creating n replicas of the index and assigning each replica to a separate node, we
can realize an n-fold increase of the
search engine’s service rate (its theoretical throughput) without affecting the
time required to process a single query.
(2)
intra-query parallelism
To split the index into n parts and have
each node work only on its own small part of the index. Each query is processed
by multiple servers in parallel.
2, Index partitioning schemes
The two predominant index partitioning
schemes are document partitioning and term partitioning.In a
document-partitioned index, each node holds an index
for a subset of the documents in the
collection. In a term-partitioned index, each node is responsible for a subset
of the terms in the collection.
3,Document Partitioning
In a document-partitioned index, each index
server is responsible for a subset of the documents
in the collection. Each incoming user query
is received by a frontend server, the receptionist,
which forwards it to all n index nodes, waits for them
to process the query, merges the search
results received from the index nodes, and
sends the final list of results to the user.
4, Term Partitioning
Term partitioning addresses the disk seek
problem by splitting the collection into sets of terms
instead of sets of documents. Each index
node vi is responsible for a
certain term set Ti and
is involved in processing a given query only
if one or more query terms are members of Ti.
5,Replication.
We can maintain multiple copies of the same
index node, as described in the previous section on hybrid partitioning
schemes, and have them process queries in parallel. If one of the r replicas for a given index
node fails, the remaining r−1 will take over the load of the missing replica.
6, Partial Replication.
Instead of replicating the entire index r times, we may choose to
replicate index information only for important documents. The rationale behind
this strategy is that most search results aren’t vital and can easily be
replaced by an equally relevant document.
7, Dormant Replication.
Suppose the search engine comprises a total
of n nodes. We can divide the
index found on each node vi into n − 1
fragments and distribute them evenly among the n−1 remaining nodes, but
leave them dormant (on disk) and not use them for query processing. Only when vi fails will the
corresponding n−1
fragments be activated inside the remaining nodes and loaded into memory for
query processing.
8, MapReduce
MapReduce was inspired by the map and reduce functions found in
functional programming
languages, such as Lisp. The map function
takes as its arguments a function f and a list of
elements l = hl1, l2, . . . , lni. It
returns a new list map(f, l) = h f(l1), f(l2), . . . , f(ln) i.
The reduce function (also known as fold or accumulate) takes a function g and a list of elements
l = hl1, l2, . . . , lni. It
returns a new element l′, such thatl′ = reduce(g, l) = g(l1, g(l2, g(l3, . . .))).
MapReduces are highly parallelizable,
because both map and reduce can be executed in
parallel on many different machines.
No comments:
Post a Comment