There are a few tools available; see http://en.wikipedia.org/wiki/Automatic_parallelization_tool for a list. Additionally, some proprietary compilers have options for automatic parallelization - though anything machine-generated will only be a starting point.
OpenMP directives can easily be added to loops by hand so that each iteration runs as a separate concurrent thread. The caveat is that updating a variable shared between the threads becomes problematic: the required synchronization will effectively re-serialize the code, even though it is running as separate threads.
There is a lot of research in this area, because the growth of multicore/manycore chips is forcing a rethinking of key algorithms currently in use. Simply making them multi-threaded may not be enough to guarantee correct results - the algorithm itself may have to be rewritten.
In the case of nested loops, you can help automatic or manual parallelization by choosing the correct loop order where possible (the outer loop should be the candidate for parallelization).
There are books written on this topic, and it all depends on your algorithm and how you are implementing it. In a nutshell, to get high performance, 1) don't compute what you don't have to, and 2) do things in parallel as much as you can to use all your CPU cores.
For a traditional compute-intensive application, focus on the main loops in your code. Make the iterations as independent of each other as possible, so that each iteration can run without waiting on the others. Automatic parallelization has come a long way, but I would suggest explicit parallelization using tools such as OpenMP to extract better performance.
After parallelizing the code, you should tune it (change the number of loop iterations, as well as the amount of processing in each loop) for best performance on your target hardware.
There is no easy way to do that. If you don't have time to re-design your code, here are a few things you can still do:
1. Avoid vector dependencies so that the compiler can use SIMD instructions in loops. Use "#pragma ivdep" to inform the compiler that there are no dependencies. Use vectorization reports (-ftree-vectorizer-verbose in gcc, -vec-report in icc) to confirm that a piece of code was vectorized.
2. OpenMP is a good way to turn non-parallel code into parallel code, but it gives you little fine-grained threading control. TBB gives you more control, but you still need a separate tool for vectorization (don't count on auto-vectorization alone).
3. Data organization, data organization, data organization. With poor data organization you end up with one thread trying to access data that lives in another thread's memory. Most cases I have seen where parallelization DECREASED performance were caused by threads starving for resources and by cache misses.
Some compilers, or solutions such as OpenMP, can help with automatic code parallelization, using pragmas in the case of OpenMP. However, all of these solutions tend to produce low-performance parallel code. This is mainly due to the unnecessary automatic insertion of mutexes that serialize the code, which happens because OpenMP and other parallel compilers do not understand the context of the code during compilation.
So the best way to obtain parallel code with very good performance is to write it by hand rather than rely on automatic tools. This obviously requires thinking in parallel, which is not an easy task and takes experience to do well. Most of the time, parallelizing a given piece of code as-is is a bad idea, and re-designing the initial algorithm is a much better approach.
From the point of view of changes to your source code, OpenMP is a very good choice, and it is also very efficient. But OpenMP is mainly for the shared-memory model, or for a CPU with an accelerator (GPU or Xeon Phi).
But if you are planning a migration to a distributed-memory system, such as supercomputers or clusters, MPI is the better way. Unfortunately, MPI imposes significant changes on the source code.
If you intend to use a GPU, I suggest CUDA. Look for loops where you can exploit parallelism. Anyway, you may contact me if you need help: [email protected]
You could try several things in C/C++; for example:
1. Look for special compiler flags that enable vector code such as AVX or AVX2 (for instance -mavx in GCC or -xavx in ICC).
2. Use easy programming models like CilkPlus to create parallel and vector code with very simple statements and a few keywords such as cilk_spawn or cilk_for. This programming model is integrated into GCC in its latest release.
3. Use OpenMP for multithreading, or OpenACC for parallelism on heterogeneous systems such as CPU-GPU.
In Java there are several options, for example:
1. There is a new API in JDK 8 in java.util.stream where you can use parallel patterns such as map-reduce with lambda expressions. Internally, if you call the parallel method ( IntStream.range(0, n).parallel().map( /* lambda */ ) ), Java launches a set of threads and the lambda computation is processed in parallel.
2. There is a set of classes and utilities in java.util.concurrent. This package contains classes that make multi-threaded programming easier.
3. You can use a Java binding for CUDA or OpenCL, such as jcuda or javacl.