Some workloads, or even particular inputs, perform well on GPUs, while others perform well on multicores. How do we decide which machine to buy for optimal performance on a general mix of problems? Cost is NOT taken as a factor here.
Setting cost aside, there are still many factors you have to consider.
First, you are asking which hardware to use for a given algorithm or implementation, which I think is not quite the right question, because a parallel algorithm is developed with the target hardware in mind.
I'm not going to take the same approach if my solution is for a cluster using MPI, a multicore processor using OpenMP, or a manycore processor using CUDA.
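As a rough illustration of how the approach changes with the target (a hypothetical vector-addition example, not something from the question itself), here is a minimal sketch contrasting a multicore OpenMP loop with a CUDA kernel for the same operation:

```cuda
// Multicore (OpenMP): a few coarse-grained threads iterate over shared main memory.
void add_openmp(const float* a, const float* b, float* c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Manycore (CUDA): thousands of fine-grained threads, one element each,
// and the data must first be copied into the GPU's own memory.
__global__ void add_cuda(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
```

The decomposition, the data movement, and even the tuning knobs are different in each case, which is why the hardware choice cannot be made after the fact.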
So, before deciding on the hardware and the algorithm you have to analyze the problem (you may be interested in looking into Foster's methodology). How can it be decomposed (many independent tasks, a few coarse-grained tasks, etc.)? Is it regular (in its memory access pattern, in the operations done on the data)? What is the size of the data (can it fit in GPU memory, or only in main memory)? Is it memory-bound or compute-bound?
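One concrete way to answer the memory-bound question is to estimate arithmetic intensity (FLOPs per byte moved) and compare it with the machine's balance point. A minimal sketch for a hypothetical SAXPY-like update y[i] = a*x[i] + y[i]; the balance numbers are illustrative assumptions, not measurements of any real device:

```cuda
#include <cstdio>

int main() {
    // SAXPY-like element: one multiply + one add, and 12 bytes of traffic
    // (read x, read y, write y, all single-precision floats).
    const double flops_per_elem = 2.0;
    const double bytes_per_elem = 12.0;
    const double intensity = flops_per_elem / bytes_per_elem;

    // Rough machine balance (peak FLOP/s divided by peak bytes/s).
    // Illustrative assumed values only.
    const double gpu_balance = 10.0;  // FLOPs per byte a typical GPU needs to stay busy
    const double cpu_balance = 5.0;   // same idea for a multicore CPU

    printf("arithmetic intensity: %.2f FLOP/byte\n", intensity);
    printf("memory-bound on GPU? %s\n", intensity < gpu_balance ? "yes" : "no");
    printf("memory-bound on CPU? %s\n", intensity < cpu_balance ? "yes" : "no");
    return 0;
}
```

An intensity this low means the kernel will be limited by memory bandwidth on either machine, so raw FLOP ratings tell you little about which one wins.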
After this analysis you can decide on the hardware, and only then start developing the program.
Finally, once you have a working program, you should go through a performance-tuning process to maximize the performance indexes you care about (speedup, efficiency, power consumption, throughput, etc.).
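For the tuning step, the usual indexes can be computed directly from your timings. A minimal sketch with made-up numbers (substitute your own measurements):

```cuda
#include <cstdio>

int main() {
    // Illustrative timings only (seconds).
    const double t_serial   = 120.0;  // best sequential run
    const double t_parallel = 10.0;   // parallel run on p processing units
    const int    p          = 16;

    const double speedup    = t_serial / t_parallel;  // how much faster
    const double efficiency = speedup / p;            // fraction of ideal scaling

    printf("speedup: %.2fx, efficiency: %.0f%%\n", speedup, efficiency * 100.0);
    return 0;
}
```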
Dear Ahmad, I think it really depends on the kind of problem you are going to work on.
A GPU is a great way to improve overall performance, but if the threads have many dependencies among them, the resulting synchronization overhead will probably cost you performance.
GPUs are well suited to problems with many independent operations to perform, but keep an eye on that overhead.
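To make the dependency point concrete, here is a hypothetical sketch: an element-wise update maps naturally onto GPU threads, while a loop-carried recurrence does not, because each step needs the previous result.

```cuda
// Independent elements: every thread works alone, a good fit for a GPU.
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

// Loop-carried dependency: x[i] needs x[i-1], so the work is inherently
// sequential. A naive one-thread-per-element GPU version gains nothing
// (parallel scan algorithms exist, but they bring their own overhead).
void prefix_sum_cpu(float* x, int n) {
    for (int i = 1; i < n; ++i)
        x[i] += x[i - 1];
}
```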