I am trying to optimize my simulator by leveraging run-time compilation. My code is long and complex, but I have identified a specific __device__ function whose performance could be greatly improved by removing all of its global memory accesses.
Does CUDA allow the dynamic compilation and linking of a single __device__ function (not __global__), in order to "override" an existing function?
Additional information:
- The function is a normal __device__ function.
- It is not part of a class nor structure.
- The variants differ in the computation, not in the data types, so I cannot rely on templates.
- I must actually change the calculations performed in the function (i.e., the propensity calculations) according to the model that I am simulating.
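To make the question more concrete, here is a minimal sketch of the kind of function I mean. It is not my actual code: the names (`propensity_generic`, `propensity_specialized`, `rate_constants`, `reactant_index`, `state`) and the simple first-order rate law are assumptions made up for illustration. The generic version fetches the model parameters from memory on every call, while the specialized version, which I would like to generate at run time for each model, has them baked into the code.

```cpp
// Let the sketch also compile as plain C++ when nvcc is not used.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Generic version (hypothetical): the rate constant and the reactant index
// are read from arrays that would live in global memory on the GPU.
__host__ __device__ inline double propensity_generic(
    const double* rate_constants, const int* reactant_index,
    const double* state, int reaction)
{
    return rate_constants[reaction] * state[reactant_index[reaction]];
}

// Specialized version (hypothetical) for one particular two-reaction model:
// constants and indices are hard-coded, so only the state is read.
// This is the function body I would like to generate and swap in at run time.
__host__ __device__ inline double propensity_specialized(
    const double* state, int reaction)
{
    switch (reaction) {
        case 0:  return 0.5 * state[0];
        case 1:  return 2.0 * state[1];
        default: return 0.0;  // unknown reaction: zero propensity
    }
}
```

Both versions compute the same values for that model; the only difference is where the parameters come from, which is why templates on the data type do not help here.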
Thank you very much indeed for your answers.