I am implementing an algorithm on a GPU. The scatter version of the algorithm uses atomic operations extensively, distributing N^2*h*w values over N^4 output locations, where N ranges from 5 to 40. Converting it to a gather algorithm was expected to speed up the solution by removing the atomic operations. Can anyone offer some clarification?
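
To make the two formulations concrete, here is a minimal serial sketch of the scatter and gather forms of the same accumulation. The 1D three-point mapping, the `WEIGHTS` table, and the function names are all hypothetical stand-ins, since the real mapping from the N^2*h*w values to the N^4 outputs is not shown here; the comments only mark where a GPU version would or would not need an atomic add.

```python
# Hypothetical 1D stand-in for the real mapping: input i contributes to
# outputs i-1, i and i+1 with integer weights 1, 2, 1.
WEIGHTS = {-1: 1, 0: 2, 1: 1}


def scatter(values):
    """Loop over inputs; each input adds into several outputs."""
    out = [0] * len(values)
    for i, v in enumerate(values):        # on a GPU: one thread per input
        for d, w in WEIGHTS.items():
            o = i + d
            if 0 <= o < len(out):
                out[o] += w * v           # concurrent threads collide here,
                                          # so a GPU version needs an atomic add
    return out


def gather(values):
    """Loop over outputs; each output reads the inputs that feed it."""
    out = [0] * len(values)
    for o in range(len(out)):             # on a GPU: one thread per output
        acc = 0
        for d, w in WEIGHTS.items():
            i = o - d                     # inverse of the scatter mapping
            if 0 <= i < len(values):
                acc += w * values[i]
        out[o] = acc                      # one plain write per output, no atomic
    return out
```

Both functions return the same result for the same input. The gather form only works when the mapping can be inverted cheaply, that is, when each output can enumerate the inputs that contribute to it.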