Dear Scientific Community,
I am running a Delft3D FM model of the Mekong Delta in parallel using DIMR. The model has about 180,000 nodes, and I have tried running it on 20 to 256 processors. It runs fine for about 6 to 6.5 months of simulated time, then crashes without producing any results or a useful error message.
If the model were unstable, I would expect to see strange or negative values before the crash, but instead it fails suddenly. Could this be related to memory issues, resource limits, or a bug in the code?
Any advice on what might be causing this problem would be really helpful!
The only error message I get is: " x1001c1s2b1n1.hostmgmt2001.cm.asp2a.nscc.sg: rank 17 died from signal 11 and dumped core
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Resource Usage on 2024-09-29 13:08:28.377466:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
JobId: 8259853.pbs101
Project: personal-123
Exit Status: 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NCPUs: Requested(20), Used(20)
CPU Time Used: 788:26:58
Memory: Requested(200gb), Used(55744524kb)
Vmem Used: 118512128kb
Walltime: Requested(120:00:00), Used(39:38:21)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Execution Nodes Used: (x1001c1s2b1n1:ncpus=20:mem=209715200kb)
"
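As a quick sanity check on the memory question, here is a minimal Python sketch that converts the figures from the report above into more readable units. The variable names are my own, and it assumes that the scheduler's "kb" means 1024 bytes (consistent with the report listing 200gb as 209715200kb) and that signal numbers follow the usual Linux convention.

```python
import signal

# Assumption: 1 gb = 1024 * 1024 kb, as implied by 200gb == 209715200kb in the report.
KB_PER_GB = 1024 ** 2


def kb_to_gb(kb):
    """Convert a kb value from the job report to gb."""
    return kb / KB_PER_GB


# Values copied from the resource usage report.
mem_used_kb = 55744524
vmem_used_kb = 118512128
mem_requested_kb = 209715200

mem_used_gb = kb_to_gb(mem_used_kb)
vmem_used_gb = kb_to_gb(vmem_used_kb)
mem_requested_gb = kb_to_gb(mem_requested_kb)
mem_fraction = mem_used_kb / mem_requested_kb

# Look up the symbolic name of signal 11 on this platform.
signal_name = signal.Signals(11).name

print(f"Memory used:      {mem_used_gb:.1f} gb of {mem_requested_gb:.0f} gb "
      f"({mem_fraction:.0%})")
print(f"Virtual memory:   {vmem_used_gb:.1f} gb")
print(f"Signal 11 means:  {signal_name}")
```

By this reading, the memory used is roughly a quarter of what was requested, and signal 11 is a segmentation fault. That would suggest the job did not run into its requested memory, although these numbers alone do not rule out other limits.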