I want to know whether the MapReduce paradigm is better than MPI (Message Passing Interface). Which type of parallelism, i.e. data parallelism or task parallelism, does each of MPI and MapReduce follow?
MPI is a message passing library interface specification for parallel programming.
MapReduce is a Google parallel computing framework. It is based on user-specified map and reduce functions.
It also says: "In general, MapReduce is suitable for non-iterative algorithms where nodes require little data exchange to proceed (non-iterative and independent); MPI is appropriate for iterative algorithms where nodes require data exchange to proceed (iterative and dependent)."
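To make the "user-specified map and reduce functions" concrete, here is a minimal single-machine sketch of the MapReduce model in plain Python (a toy word count; the function names and the in-memory "shuffle" are illustrative, not Hadoop's actual API):

```python
# Toy MapReduce: the user supplies a map function and a reduce function;
# the "framework" handles grouping pairs by key (the shuffle) in between.
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Emit (key, value) pairs: here, one (word, 1) per word.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Combine all values emitted for one key.
    return (key, sum(values))

def mapreduce(documents, map_fn, reduce_fn):
    # Map phase, then sort/group by key (the shuffle), then reduce phase.
    pairs = sorted(kv for doc in documents for kv in map_fn(doc))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

counts = dict(mapreduce(["a b a", "b c"], map_fn, reduce_fn))
print(counts)  # -> {'a': 2, 'b': 2, 'c': 1}
```

In a real cluster the map calls run on many nodes in parallel over partitions of the input, which is why this model is a form of data parallelism.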
In my mind, MapReduce is used to run a single query (or update) on a large data set (big data), accessed by multiple nodes simultaneously. So if you're searching for something within a large dataset, MapReduce is what you want to use.
The easiest way of contrasting these is that with MPI on traditional clusters,
a copy of the same MPI program runs on each processor core of the "compute nodes", and the data flows from "I/O nodes" to the "compute nodes".
With Hadoop, the compute nodes and the I/O nodes are the same (each node
has a copy of some of the data). So you could write an MPI version of
a Hadoop program and run it on the I/O nodes and it would work.
The Hadoop filesystem (HDFS) has some advantages over this
"MPI on the I/O servers" approach. HDFS has a namenode
which works as a metadata server, similar to Lustre, and HDFS has
redundant copies of the data for failover. These are automatically
replicated on another fileserver if one of the Hadoop nodes goes down,
so the filesystem is resilient in the face of hardware failure, though the
namenode is a single point of failure. Hadoop should alleviate some
of this problem with better support for a standby namenode.
Assuming the work in the map function in map-reduce programming style
is proportional to the amount of data processed,
(or some function of the data size) this just means you want the
same amount of data on each server. If you are the only one with data in HDFS,
this should be done automatically.
Now if one of the Hadoop nodes fails, that will affect performance, likely throwing off the load balance of the map phase, but the map-reduce program should still run to completion, since the data is replicated on another node.
Interesting discussion, guys. I would like to know whether MPI has an automatic fault-tolerance system similar to MapReduce's, or whether the programmer has to implement it.
MPI is not fault tolerant at all. For every receive there needs to be a matching send; otherwise your program will block. There are test functions to check whether something is ready to be received, and also some non-blocking functions. However, no sent message should be lost: the library handles delivery reliably (e.g. over a TCP connection), so if a message appears to get lost it is probably a programming error on your side. The most common such error is that one process never reaches the matching send because it is in a conditional branch, while another process is waiting to receive; this blocks indefinitely. Or the expected sender has crashed and will never send. It is very hard (if not impossible) to recover from this.
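The matched send/receive requirement can be illustrated without an MPI installation using threads and queues (a sketch only; real MPI code would use `MPI_Send`/`MPI_Recv`, and the `send`/`recv` helpers here are made-up stand-ins):

```python
# Toy model of MPI point-to-point messaging: each "rank" has an inbox,
# recv blocks until a matching send arrives, just as MPI_Recv does.
import queue
import threading

inbox = {0: queue.Queue(), 1: queue.Queue()}
results = {}

def send(dest, msg):
    inbox[dest].put(msg)

def recv(rank, timeout=5.0):
    # Blocks like MPI_Recv; times out here only so a bug can't hang the demo.
    return inbox[rank].get(timeout=timeout)

def rank0():
    send(1, "ping")        # the matching send for rank 1's recv below
    results[0] = recv(0)   # waits for rank 1's reply

def rank1():
    msg = recv(1)          # would block forever if rank 0 skipped its send
    send(0, msg + "/pong")

t0 = threading.Thread(target=rank0)
t1 = threading.Thread(target=rank1)
t0.start(); t1.start()
t0.join(); t1.join()
print(results[0])  # -> ping/pong
```

If `rank0` took a conditional branch that skipped `send(1, "ping")`, `rank1` would wait forever in `recv` — which is exactly the unmatched-receive deadlock described above.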