I want to synchronize all threads of a child kernel before executing other operations in the parent kernel in CUDA. How can I do this? I have many threads in many blocks. I use 1D blocks and a 1D grid.
Thanks, Maheshya. Does that mean all threads in the child kernel have terminated before the parent kernel executes its next instructions, even without using __syncthreads()?
Thank you, Mr Mohammed, for the PDF. If I understand correctly, __syncthreads() only synchronizes threads within the same block. For this reason, I need to use cudaDeviceSynchronize() to be sure that all threads of the child kernel have finished their work before continuing with other instructions in the parent kernel. I also used device global arrays and atomic operations.
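For what it's worth, here is a minimal sketch of one way to get "child finishes, then the rest runs" without calling cudaDeviceSynchronize() inside the parent kernel. As far as I know, the device-side call was deprecated in CUDA 11.6 and is not available in newer toolkits, so this sketch swaps in a different technique: the instructions that must come after the child are moved into a second kernel, and both are launched by the same thread into the same (default) device stream, where launches run in order. The kernel names child, afterChild and parent, the sizes, and the use of managed memory are all my own assumptions, not something from this thread. It assumes a GPU that supports dynamic parallelism and a build along the lines of nvcc -rdc=true file.cu -lcudadevrt.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical child kernel: every thread increments one element.
__global__ void child(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&data[i], 1);
}

// Hypothetical continuation: the work that must see the child's results.
__global__ void afterChild(const int *data, int n, int *sum) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(sum, data[i]);
}

__global__ void parent(int *data, int n, int *sum) {
    // A single thread launches both grids. Launches issued by the same
    // block into the same stream execute in order, so afterChild should
    // start only after every thread of child has finished.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        child<<<blocks, threads>>>(data, n);
        afterChild<<<blocks, threads>>>(data, n, sum);
    }
}

int main() {
    const int n = 1 << 16;
    int *data = nullptr;
    int *sum = nullptr;
    cudaMallocManaged(&data, n * sizeof(int));
    cudaMallocManaged(&sum, sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = 0;
    *sum = 0;

    parent<<<1, 32>>>(data, n, sum);

    // Host-side wait: the parent grid is not complete until its child
    // grids are complete, so this covers all three kernels.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // If the ordering holds, sum should equal n.
    printf("sum = %d (n = %d)\n", *sum, n);

    cudaFree(data);
    cudaFree(sum);
    return 0;
}
```

The trade-off is that the parent kernel itself never sees the child's results; only the continuation kernel does. If the later instructions really have to stay inside the parent, please check the dynamic parallelism chapter of the programming guide for your toolkit version.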
Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is the default behaviour) are executed sequentially. When you want your GPU to start processing some data, you typically do a kernel invocation. When you do so, your device (the GPU) will start doing whatever you told it to do. However, unlike in a normal sequential program, your host (the CPU) does not wait for it: the host will continue to execute the next lines of code in your program. cudaDeviceSynchronize makes the host (the CPU) wait until the device (the GPU) has finished executing ALL the threads you have started, and thus your program will continue as if it were a normal sequential program.
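To make the host-side behaviour described above concrete, here is a minimal sketch. The kernel name fill, the array size, and the use of managed memory are assumptions of mine for illustration only. The launch returns to the CPU right away, and cudaDeviceSynchronize is what makes it safe for the host to read the results.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes twice its global index.
__global__ void fill(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * i;
}

int main() {
    const int n = 1 << 20;
    int *data = nullptr;
    // Managed memory is used so the host can read the array directly.
    cudaMallocManaged(&data, n * sizeof(int));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Asynchronous: control returns to the host immediately,
    // possibly before the GPU has written anything.
    fill<<<blocks, threads>>>(data, n);

    // Catch launch-configuration errors, then block the host until
    // all previously issued GPU work has finished.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Only now is it safe for the host to read the results.
    printf("data[%d] = %d\n", n - 1, data[n - 1]);

    cudaFree(data);
    return 0;
}
```

If the results were copied back with a plain cudaMemcpy instead, that call would itself wait for earlier work in the default stream, so the explicit synchronize matters most with managed memory, asynchronous copies, or timing code.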