Since there is a latency gap between the host (CPU), the device (GPU), and main memory (RAM), the performance of GPGPU computing is sometimes limited by data movement rather than computation.
How can this data input/output throughput problem in GPGPU computing be addressed, other than by optimising the data transfer itself (as shown in the link below)?
Can anyone point me to good academic papers (conference or journal) on this problem or its solutions?
http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-fortran/
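For context, the transfer optimisation I mean is the one the linked article describes: pinned (page-locked) host memory plus asynchronous copies on CUDA streams, so that transfers of one chunk overlap with computation on another. A minimal CUDA C sketch of that pattern (the `scale` kernel, sizes, and stream count are illustrative assumptions, not from the article):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Trivial illustrative kernel: multiply each element by a factor.
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 22;
    const int NSTREAMS = 4;          // assumed chunk/stream count
    const int chunk = N / NSTREAMS;

    float *h, *d;
    // Pinned host memory is required for truly asynchronous DMA transfers.
    cudaMallocHost(&h, N * sizeof(float));
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t s[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&s[i]);

    // Split the array into chunks so the copy of chunk k+1 can
    // overlap with the kernel running on chunk k.
    for (int i = 0; i < NSTREAMS; ++i) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);

    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

This hides part of the PCIe transfer cost behind computation, but it does not remove the underlying bandwidth/latency gap, which is why I am asking about other approaches.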