Is there any performance benefit to processing a large dataset with Hadoop on AWS rather than on a real, physical multi-node cluster set up on a series of machines with sequential IP addresses?
Trying Hadoop on AWS is useful only if you don't have a physical cluster and want to learn Hadoop. Even then, there is performance degradation due to the stacked layers. Installing Hadoop on the cloud is not a good option at all, because Hadoop is designed to provide reliability and fault tolerance on commodity hardware. The cloud, on the other hand, already provides these features itself, and cloud services are delivered on reliable hardware. So my conclusion is that one can trust the services offered by AWS. Keeping extra layers on top is therefore only a waste of computing resources and will lead to performance degradation.
The best method to use on the cloud for data analytics is OpenMPI [though one can fault it for offering no fault tolerance on commodity hardware]. You can use StarCluster to run OpenMPI programs on Amazon EC2.
Thanks, Dinesh, for your contribution. But I just want to know which one is more acceptable: using Hadoop on AWS, or a physical setup with 3 or 4 machines?