29 September 2024

Dear Scientific Community,

I am currently running a Delft3D FM model of the Mekong Delta in parallel using DIMR. The model has about 180,000 nodes, and I have tried running it on 20 to 256 processors. It runs fine for around 6 to 6.5 months of simulated time, but after that it crashes without writing any results or giving an informative error message.

If the model were unstable, I would expect to see strange or negative values before the crash, but instead it fails suddenly. Could this be related to memory issues, resource limits, or bugs in the code?

Any advice on what might be causing this problem would be really helpful!

The only error message I get is:

"x1001c1s2b1n1.hostmgmt2001.cm.asp2a.nscc.sg: rank 17 died from signal 11 and dumped core

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Resource Usage on 2024-09-29 13:08:28.377466:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
JobId: 8259853.pbs101
Project: personal-123
Exit Status: 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NCPUs: Requested(20), Used(20)
CPU Time Used: 788:26:58
Memory: Requested(200gb), Used(55744524kb)
Vmem Used: 118512128kb
Walltime: Requested(120:00:00), Used(39:38:21)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Execution Nodes Used: (x1001c1s2b1n1:ncpus=20:mem=209715200kb)"
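As a rough sanity check on the memory question, the figures in the resource summary can be converted to common units and compared. The sketch below is only illustrative: `parse_summary` and `LOG` are hypothetical names, and it assumes the scheduler reports binary units, which is consistent with 200gb appearing as 209715200kb in the "Execution Nodes Used" line. It also looks up the name of signal 11, which is SIGSEGV (a segmentation fault) on Linux.

```python
import re
import signal

# Lines copied from the job output quoted in this post.
LOG = """\
rank 17 died from signal 11 and dumped core
Memory: Requested(200gb), Used(55744524kb)
Vmem Used: 118512128kb
"""

# Assumption: binary units, i.e. 1 gb = 1024**2 kb
# (200gb == 209715200kb in the node line of the log).
KB_PER_GB = 1024 ** 2


def parse_summary(log: str) -> dict:
    """Extract the signal number and memory figures from the job output."""
    sig = int(re.search(r"died from signal (\d+)", log).group(1))
    mem = re.search(r"Memory: Requested\((\d+)gb\), Used\((\d+)kb\)", log)
    requested_gb = int(mem.group(1))
    used_gb = int(mem.group(2)) / KB_PER_GB
    vmem_gb = int(re.search(r"Vmem Used: (\d+)kb", log).group(1)) / KB_PER_GB
    return {
        "signal": sig,
        "signal_name": signal.Signals(sig).name,
        "mem_requested_gb": requested_gb,
        "mem_used_gb": used_gb,
        "mem_used_fraction": used_gb / requested_gb,
        "vmem_used_gb": vmem_gb,
    }


summary = parse_summary(LOG)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under these assumptions, the physical memory reported as used is well below the 200 GB requested, so the job does not appear to have hit its memory request. That does not rule out an invalid memory access inside a single rank, which is what a segmentation fault usually indicates.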
