I would like to run GROMACS 2016.4 on GPUs, but I could not set up the appropriate mdrun options, and the simulation terminates immediately.

I was given 2 nodes for the calculation, each consisting of 20 cores and 2 GPUs.

I have attached the job script and error file for more clarity.

My command to run the simulation is:

>module load compiler/intel-2017 cuda/9.0 cudampi/mvapich2-2.2 applic/v100/gromacs-2016.4

>srun /applic/applications/gromacs/2016.4/bin/gmx_mpi mdrun -v -deffnm step7_production -pin on -nb gpu

The mdrun log reports:

>Running on 2 nodes with total 40 cores, 40 logical cores, 4 compatible GPUs
>Cores per node: 20
>Logical cores per node: 20
>Compatible GPUs per node: 2
>All nodes have identical type(s) of GPUs
>Using 2 MPI processes
>Using 20 OpenMP threads per MPI process
>On host tesla18 2 compatible GPUs are present, with IDs 0,1
>On host tesla18 1 GPU auto-selected for this run
>Mapping of the GPU ID to the 1 PP rank in this node: 0
>Note: potentially sub-optimal launch configuration, gmx mdrun started with less PP MPI process per node than GPUs available.
>Each PP MPI process can use only one GPU, 1 GPU per node will be used.

This resulted in an error file saying something like the following:

>Fatal error: Your choice of number of MPI ranks and amount of resources results in using 20 OpenMP threads per rank, which is most likely inefficient. The optimum is usually between 2 and 6 threads per rank. If you want to run with this setup, specify the -ntomp option. But we suggest to change the number of MPI ranks.
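From the error and the log note above, I suspect the fix is to launch one PP MPI rank per GPU (4 ranks total) so each rank gets 10 OpenMP threads instead of 20. A sketch of the arithmetic, assuming the 2-node / 20-core / 2-GPU layout described above (the commented launch line is hypothetical; the exact srun flags depend on the site's Slurm configuration):

```shell
# Assumed hardware layout from the question: 2 nodes, 20 cores and 2 GPUs each.
NODES=2
CORES_PER_NODE=20
GPUS_PER_NODE=2

# One PP MPI rank per GPU, as the mdrun note suggests.
RANKS=$((NODES * GPUS_PER_NODE))            # 2 * 2 = 4 ranks
NTOMP=$((NODES * CORES_PER_NODE / RANKS))   # 40 / 4 = 10 OpenMP threads per rank

echo "ranks=$RANKS ntomp=$NTOMP"

# Hypothetical launch line (untested; flags depend on the cluster setup):
# srun -n $RANKS --ntasks-per-node=$GPUS_PER_NODE \
#   /applic/applications/gromacs/2016.4/bin/gmx_mpi mdrun \
#   -v -deffnm step7_production -ntomp $NTOMP -pin on -nb gpu
```

Is this the right direction, or is there a better rank/thread split?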

Could an expert in GPU computing suggest an appropriate setting for GROMACS?

Thank you in advance.
