I think you will have to try out different MPI rank and OpenMP thread (-ntomp) settings and benchmark them on a short subset of the simulation. At least for a non-GPU simulation, that can make a huge difference.
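Something like the following could automate that scan. It is only a rough sketch under a few assumptions: a thread-MPI build where ranks are set with -ntmpi (with a real MPI build you would launch through mpirun instead), a run input file called topol.tpr, and a 10000-step subset. The helper names are made up for illustration. It tries every rank/thread split that fills the given core count and reads the ns/day figure from the "Performance:" line of each log.

```python
import re
import subprocess
from pathlib import Path


def rank_thread_splits(total_cores):
    """Return every (ntmpi, ntomp) pair whose product equals total_cores."""
    return [(ranks, total_cores // ranks)
            for ranks in range(1, total_cores + 1)
            if total_cores % ranks == 0]


def bench_name(ntmpi, ntomp):
    """Output file prefix for one benchmark run (hypothetical naming scheme)."""
    return f"bench_mpi{ntmpi}_omp{ntomp}"


def mdrun_command(tpr, ntmpi, ntomp, nsteps=10000):
    """Build a short mdrun benchmark command for one rank/thread split."""
    return ["gmx", "mdrun",
            "-s", tpr,                      # run input file (assumed name)
            "-deffnm", bench_name(ntmpi, ntomp),
            "-ntmpi", str(ntmpi),           # thread-MPI ranks
            "-ntomp", str(ntomp),           # OpenMP threads per rank
            "-nsteps", str(nsteps),         # run only a short subset
            "-resethway",                   # reset timers halfway through
            "-noconfout"]                   # skip writing the final frame


def parse_ns_per_day(log_text):
    """Return the ns/day value from an mdrun log, or None if it is missing."""
    match = re.search(r"^Performance:\s+([\d.]+)", log_text, flags=re.MULTILINE)
    return float(match.group(1)) if match else None


def run_benchmarks(tpr, total_cores, nsteps=10000):
    """Run each split in turn and collect ns/day keyed by (ntmpi, ntomp)."""
    results = {}
    for ntmpi, ntomp in rank_thread_splits(total_cores):
        # Some splits may be rejected by mdrun, so do not abort on failure.
        subprocess.run(mdrun_command(tpr, ntmpi, ntomp, nsteps), check=False)
        log = Path(bench_name(ntmpi, ntomp) + ".log")
        if log.exists():
            results[(ntmpi, ntomp)] = parse_ns_per_day(log.read_text())
    return results
```

Calling run_benchmarks("topol.tpr", 8) would run the four splits 1x8, 2x4, 4x2 and 8x1 one after another; whichever gives the highest ns/day is the one to use for the production run.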
I would say that GROMACS automatically does a pretty good job in most cases. GPU utilization depends a lot on the overall size and kind of system you are simulating: it will be higher if the system is large and dominated by nonbonded interactions. I have yet to see GPU utilization above 50-60 percent.