After reading the proceedings "Best Practices in Running Collaborative GPU Hackathons: Advancing Scientific Applications with a Sustained Impact", I came across the paper "Porting the MPI Parallelized LES Model PALM to Multi-GPU Systems – An Experience Report". GPUs follow a single-instruction, multiple-thread (SIMT) model and are therefore not a cache-optimized architecture. My experience is that unless the whole kernel is restructured, the hybrid MPI+GPU approach hurts performance considerably. Nevertheless, there is a boom of GPUs in HPC. In some cases pure MPI performs better than the hybrid approach, simply because no data has to be moved back and forth between CPU and GPU. We have various good practices, but there is no standard or reference that we can always rely on. The picture becomes even worse when high-order schemes are used.
1) What is your usual approach to these issues in CFD?
2) Should we keep the source code structured as it is, whether in a functional style or as imperative code with a bunch of do .. end do loops in each subroutine, even though this structure hurts performance considerably? (See the first sketch below.)
3) Should we use a data region, where the data stays resident on the GPU and all the computations are packed together, even though this hurts the readability of the source code? (See the second sketch below.)
4) Should we update the ghost cells at every time step? (See the third sketch below.)
Again, the focus is only on CFD.
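To make question 2) concrete, here is a minimal sketch of the kind of loop nest I mean, written in Fortran with OpenACC (the programming model used in the PALM port, as far as I understand). The routine, array names and bounds (diffuse, u, tend, nx/ny/nz) are made up for illustration and are not taken from PALM; it is just a simplified diffusion-like stencil:

subroutine diffuse(u, tend, nx, ny, nz, nu)
   implicit none
   integer, intent(in)    :: nx, ny, nz
   real,    intent(in)    :: nu
   real,    intent(in)    :: u(0:nx+1, 0:ny+1, 0:nz+1)
   real,    intent(inout) :: tend(0:nx+1, 0:ny+1, 0:nz+1)
   integer :: i, j, k

   ! "present" asserts that the arrays already live on the device, i.e. a
   ! surrounding data region (second sketch) owns all host/device transfers
   !$acc parallel loop collapse(3) present(u, tend)
   do k = 1, nz
      do j = 1, ny
         do i = 1, nx
            tend(i,j,k) = tend(i,j,k) + nu *                           &
                 ( u(i+1,j,k) + u(i-1,j,k) + u(i,j+1,k) + u(i,j-1,k)   &
                 + u(i,j,k+1) + u(i,j,k-1) - 6.0*u(i,j,k) )
         end do
      end do
   end do
end subroutine diffuse

In my experience the existing do .. end do structure can often be kept and simply annotated like this; what really decides the performance is whether the data is already resident on the device when the loop runs.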
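And this is the data-region pattern I am referring to in question 3), as a self-contained toy driver around the routine above. The fields live on the GPU for the whole time loop and are copied back to the host only when output is actually needed. All names and parameters are again illustrative:

program data_region_sketch
   implicit none
   integer, parameter :: nx = 64, ny = 64, nz = 64
   integer, parameter :: nsteps = 100, output_interval = 25
   real,    parameter :: nu = 0.1
   real    :: u(0:nx+1, 0:ny+1, 0:nz+1), tend(0:nx+1, 0:ny+1, 0:nz+1)
   integer :: step

   u    = 1.0
   tend = 0.0

   ! copy the fields to the device once; they stay resident across all steps
   !$acc enter data copyin(u, tend)
   do step = 1, nsteps
      call diffuse(u, tend, nx, ny, nz, nu)
      ! in the MPI case the halo exchange would go here (third sketch);
      ! a real model would of course also advance u from the tendency
      if (mod(step, output_interval) == 0) then
         !$acc update host(tend)      ! transfer to the host only when output is due
         print *, 'step', step, 'max tendency', maxval(tend)
      end if
   end do
   !$acc exit data copyout(u, tend)
end program data_region_sketch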
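Finally, for question 4), this is the kind of halo update I have in mind: only the ghost planes are packed, exchanged and unpacked, and with a CUDA-aware MPI the buffers never have to touch the host at all. The names (exchange_ghost_x, left, right, comm) are hypothetical, the routine only handles the x direction, and at non-periodic boundaries the neighbour ranks can simply be MPI_PROC_NULL:

subroutine exchange_ghost_x(u, nx, ny, nz, left, right, comm)
   use mpi
   implicit none
   integer, intent(in)    :: nx, ny, nz, left, right, comm
   real,    intent(inout) :: u(0:nx+1, 0:ny+1, 0:nz+1)
   real    :: sendl((ny+2)*(nz+2)), sendr((ny+2)*(nz+2))
   real    :: recvl((ny+2)*(nz+2)), recvr((ny+2)*(nz+2))
   integer :: j, k, m, n, ierr, stat(MPI_STATUS_SIZE)

   n = (ny+2)*(nz+2)

   !$acc data create(sendl, sendr, recvl, recvr) present(u)

   ! pack the two x-boundary planes into contiguous device buffers
   !$acc parallel loop collapse(2) private(m)
   do k = 0, nz+1
      do j = 0, ny+1
         m = k*(ny+2) + j + 1
         sendl(m) = u(1,  j, k)
         sendr(m) = u(nx, j, k)
      end do
   end do

   ! with a CUDA-aware MPI, host_data passes device addresses straight to MPI;
   ! without it, replace this region by "!$acc update host/device" on the small
   ! buffers, which is still far cheaper than moving the full field
   !$acc host_data use_device(sendl, sendr, recvl, recvr)
   call MPI_Sendrecv(sendr, n, MPI_REAL, right, 0,               &
                     recvl, n, MPI_REAL, left,  0, comm, stat, ierr)
   call MPI_Sendrecv(sendl, n, MPI_REAL, left,  1,               &
                     recvr, n, MPI_REAL, right, 1, comm, stat, ierr)
   !$acc end host_data

   ! unpack the received planes into the ghost layers
   !$acc parallel loop collapse(2) private(m)
   do k = 0, nz+1
      do j = 0, ny+1
         m = k*(ny+2) + j + 1
         u(0,    j, k) = recvl(m)
         u(nx+1, j, k) = recvr(m)
      end do
   end do

   !$acc end data
end subroutine exchange_ghost_x

With a pattern like this, whether the ghost cells must be refreshed in every time step becomes a question of the numerical scheme (stencil width, sub-stepping) rather than of the GPU port itself.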