It is generally better to use 128 or 256 threads per block. There is some calculation involved in finding the most suitable number of threads per block; the most important factors are:
Maximum number of active threads (depends on the GPU)
Number of warp schedulers of the GPU
Number of active blocks per streaming multiprocessor, etc.
However, according to the CUDA manuals, 128 or 256 threads per block is a good choice if you do not want to worry about the deeper details of GPGPUs.
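For reference, the limits in that list can be read straight from the CUDA runtime. A minimal sketch (assuming device 0 and the standard runtime API):

```cuda
// Minimal sketch: print the per-device limits mentioned above.
// Compile with e.g. `nvcc query_limits.cu -o query_limits`.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed here

    printf("Device:                         %s\n", prop.name);
    printf("Warp size:                      %d\n", prop.warpSize);
    printf("Max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Multiprocessor count:           %d\n", prop.multiProcessorCount);
    return 0;
}
```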
Conference paper: Meta-programming and auto-tuning in the search for high perf...
We performed experiments varying these numbers and generated heat maps for a couple of simple kernels. One surprise was how large the hot areas were. Another surprise was that the occupancy calculator from NVIDIA seemed to be off target for some of these kernels.
I think that may be hard to answer. But maybe you are approaching this from the wrong direction. Think about the algorithm you are implementing instead: what is the optimal decomposition of that computation onto the resources offered by the GPU? I really don't think there is one single answer to these questions that is applicable in all situations. Some experimentation is probably always needed.
Thank you again for your reply. Actually, I found by experiment that 256 threads per block is the optimal option for my program, but a reviewer asked me to give an explanation or justification for this, not just prove it by experiments.
I am fond of the experimental + tuning approach. If you want to show this by pointing at architectural specifics of the individual GPU you use, it will be very particular to that one setup. If someone else tries your code on a different GPU, they in turn may need to tweak it for their specifics. Experiments I did at one point showed that the "good"-performing setting can differ between GPUs (of course). I would like to run more such experiments over a larger selection of GPUs.
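For what it is worth, the kind of sweep I have in mind looks roughly like the sketch below; the `scale` kernel is just a placeholder for whatever kernel is actually being tuned, and the problem size and repetition count are arbitrary:

```cuda
// Rough sketch of a block-size sweep timed with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 24;               // arbitrary problem size
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int block = 32; block <= 1024; block *= 2) {
        int grid = (n + block - 1) / block;        // round up to cover all elements

        cudaEventRecord(start);
        for (int rep = 0; rep < 100; ++rep)        // repeat for a more stable timing
            scale<<<grid, block>>>(d_x, 1.0001f, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block = %4d  ->  %.3f ms\n", block, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

Running something like this on different GPUs is exactly where the "good" setting starts to move around.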
Did you run your parameters through the NVIDIA occupancy calculator as a starting point? I think, though, that even when using the occupancy calculator you cannot be sure to get the best-performing setting. But if the occupancy calculator points at 256 being great, then you have the answer from the horse's mouth (the GPU manufacturer).
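As a side note, the CUDA runtime also exposes the same occupancy heuristic programmatically via cudaOccupancyMaxPotentialBlockSize; a minimal sketch, where `myKernel` is just a placeholder:

```cuda
// Ask the runtime which block size maximizes occupancy for a given kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // 0 bytes of dynamic shared memory, no upper limit on the block size.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    printf("Occupancy-suggested block size: %d (min grid size %d)\n",
           blockSize, minGridSize);
    return 0;
}
```

As with the calculator spreadsheet, the suggestion maximizes occupancy, which is not guaranteed to be the fastest setting for every kernel.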
Thank you very much, dear Bo Joel Svensson. I now understand the point. I will use the occupancy calculator to be sure of what I get. I agree with you: performance depends on each machine, and I cannot always generalize what I got on my machine to others.
You want to use a multiple of your warp size and your number of cores per MP, e.g. 32, 64, etc. Then you want to use as large a number of threads as possible, since the architecture is optimized for hiding memory latency by swapping out threads that are waiting on memory reads.
In my experience, though I have not tested on the latest architectures, I would start with the largest allowed number of threads, e.g. 512. I would get a good speedup with this, but I found that the optimum was less than the maximum, i.e. 256 threads.
The last architecture I tested for this was Pascal, and the best speedup was with 256 threads. I imagine this might increase in the future as the architecture scales up. This could also change for an application doing more calculation relative to communication at the kernel level; however, if your calculations are that intensive per core, you are probably not utilizing your memory to its full potential.
So, I would go with 256 threads for most applications.
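As a concrete illustration of that default, a launch sketch with 256 threads per block; the kernel name and problem size are placeholders:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; the real work would go here.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i];
}

void launch(float *d_data, int n) {
    const int threadsPerBlock = 256;  // a multiple of the 32-thread warp size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    process<<<blocks, threadsPerBlock>>>(d_data, n);
}
```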
We recently published an article testing different configurations of the number of threads per block on the NPB benchmarks. You can find it here: NAS Parallel Benchmarks with CUDA and beyond