I have a program with 6 nested loops in C. I need to rewrite the code for parallelization and map the loop variables into two new variables in order to minimize atomic operations and force independent thread execution. Is there a tool that can help?
I think the best way is to rewrite your loops as a single loop. In CUDA you can use up to three indexes for parallelization when building your thread grid, but if possible limit yourself to a 1D grid. You need to find a relation between your loop indexes and a single index. It is not difficult. In case of two indexes, 0≤i
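To illustrate the single-index idea for the six loops in the question, here is a minimal sketch in plain C. It assumes the loops are perfectly nested, each running from 0 to a fixed extent, with the last index varying fastest. The names `NDIMS`, `unflatten`, `flatten` and `total_iterations` are made up for this example and are not from any library.

```c
#include <assert.h>
#include <stddef.h>

#define NDIMS 6  /* six nested loops, as in the question (assumption: rectangular bounds) */

/* Decode one flat index into NDIMS loop indices; idx[NDIMS-1] varies fastest. */
void unflatten(size_t flat, const size_t extent[NDIMS], size_t idx[NDIMS])
{
    for (int d = NDIMS - 1; d >= 0; --d) {
        assert(extent[d] > 0);
        idx[d] = flat % extent[d];  /* position inside loop d */
        flat /= extent[d];          /* carry the rest to the outer loops */
    }
}

/* Inverse mapping: combine NDIMS loop indices into one flat index. */
size_t flatten(const size_t idx[NDIMS], const size_t extent[NDIMS])
{
    size_t flat = 0;
    for (int d = 0; d < NDIMS; ++d) {
        assert(idx[d] < extent[d]);
        flat = flat * extent[d] + idx[d];
    }
    return flat;
}

/* Total number of iterations of the fused loop (product of all extents). */
size_t total_iterations(const size_t extent[NDIMS])
{
    size_t total = 1;
    for (int d = 0; d < NDIMS; ++d)
        total *= extent[d];
    return total;
}
```

With this mapping the six loops become one loop over `flat` from 0 to `total_iterations(extent) - 1`, and each value of `flat` could be handed to one thread of a 1D grid. Whether the iterations are really independent still depends on what the loop body writes.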
Second, the problem you are dealing with is not clear to me. For example, if you want to manipulate each pixel in an image independently, then you can parallelize the problem using indexing as Gianpeiro has explained.
However, in situations like matrix multiplication, you might need to include a for-loop within your kernel.
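As a rough illustration of that matrix multiplication case, the sketch below is written in plain C rather than CUDA. The function `matmul_element` stands in for the body of a kernel: one "thread" per output element, with a for-loop over the shared dimension inside it. The function names and the row-major layout are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Work of one "thread": compute one element of C = A * B.
 * A is n x m, B is m x p, C is n x p, all stored row-major (assumed layout). */
void matmul_element(size_t tid, const float *A, const float *B, float *C,
                    size_t n, size_t m, size_t p)
{
    assert(tid < n * p);
    size_t row = tid / p;  /* flat index -> output row */
    size_t col = tid % p;  /* flat index -> output column */
    float sum = 0.0f;
    /* This loop stays inside the kernel: each thread owns its whole dot product. */
    for (size_t k = 0; k < m; ++k)
        sum += A[row * m + k] * B[k * p + col];
    C[tid] = sum;  /* only this thread writes C[tid] */
}

/* Serial stand-in for a 1D kernel launch: one call per output element. */
void matmul(const float *A, const float *B, float *C,
            size_t n, size_t m, size_t p)
{
    for (size_t tid = 0; tid < n * p; ++tid)
        matmul_element(tid, A, B, C, n, m, p);
}
```

Because every output element is written by exactly one thread, this arrangement needs no atomic operations. Parallelizing over `k` as well would make several threads accumulate into the same element.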
Please note that having loops inside the kernel is not bad and, if used correctly, can give better performance (persistent programming).
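One common way to use a loop inside a kernel is a strided loop, where a fixed number of threads each process many elements instead of launching one thread per element. The plain C sketch below imitates that pattern. It is only one possible reading of the remark above, and the names `scale_strided` and `scale_all` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Work of one "thread" out of nthreads: handle elements tid, tid + nthreads, ... */
void scale_strided(size_t tid, size_t nthreads, float *data, size_t n, float factor)
{
    assert(nthreads > 0 && tid < nthreads);
    for (size_t i = tid; i < n; i += nthreads)
        data[i] *= factor;  /* each element is touched by exactly one thread */
}

/* Serial stand-in for launching nthreads threads over n elements. */
void scale_all(float *data, size_t n, float factor, size_t nthreads)
{
    for (size_t tid = 0; tid < nthreads; ++tid)
        scale_strided(tid, nthreads, data, n, factor);
}
```

The number of threads is chosen independently of the problem size, so the same code handles any `n`.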
I would be able to give you more advice if you could post your problem in full. Also, parallelizing is not a big deal, but extracting performance is!