I'm trying to evaluate the efficiency of my D2Q9 solver implemented with ArrayFire, which uses JIT compilation and other optimization techniques behind the scenes to generate near-optimal CUDA/OpenCL code.
On a lid-driven cavity test with a 3000x3000 domain, the log below reports around 3500 MLUPS. For the MLUPS calculation I was using this formula:
float mlups = (total_nodes * iter * 10e-6) / total_time;
However, 10e-6 is 1e-5, not 1e-6, so every reported figure is inflated tenfold; the conversion to millions of lattice updates per second should be
float mlups = (double)total_nodes * iter * 1e-6 / total_time; // cast first if total_nodes is a 32-bit int: 9e6 nodes x 1000 iterations overflows it
which puts the real throughput around 350 MLUPS.
This is how the first 1000 iterations look:
100 iterations completed, 2s elapsed (4645.152 MLUPS).
200 iterations completed, 5s elapsed (3716.1216 MLUPS).
300 iterations completed, 8s elapsed (3483.864 MLUPS).
400 iterations completed, 10s elapsed (3716.1216 MLUPS).
500 iterations completed, 13s elapsed (3573.1936 MLUPS).
600 iterations completed, 16s elapsed (3483.864 MLUPS).
700 iterations completed, 19s elapsed (3422.7437 MLUPS).
800 iterations completed, 21s elapsed (3539.1633 MLUPS).
900 iterations completed, 24s elapsed (3483.864 MLUPS).
1000 iterations completed, 27s elapsed (3440.853 MLUPS).
I want to work out how far this is from the theoretical maximum and how efficient the memory layout is.