Some differences to consider: OS, build tools and process, MPI infrastructure, and word size.
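One way to make those differences concrete is to record them on each machine and compare the output side by side. This is only a minimal sketch: the function name `describe_machine` is made up for illustration, and the word size it reports is the pointer size of the Python interpreter, which I am assuming matches the build you run the simulation with. It does not inspect the MPI installation or the build tools.

```python
import platform
import struct
import sys


def describe_machine():
    """Collect basic facts about the machine this script runs on."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        # Pointer size of this interpreter, in bits (assumed to match the simulation build).
        "word_size_bits": struct.calcsize("P") * 8,
        "byte_order": sys.byteorder,
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    for key, value in describe_machine().items():
        print(f"{key}: {value}")
```

Run it once per machine and diff the results.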
Have you tried the code on a network of PCs?
You could try modifying your program to generate a log while it is processing. This should not require any modification to LAMMPS itself. How does that log differ when the code is run on the various machines?
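As a rough idea of what such a log could look like, here is a small sketch, assuming your driver code is in Python or can call out to it. The helper name `log_progress`, the file name, and the line format are all hypothetical. The point is that each entry is flushed to disk immediately, so the last line in the file shows how far the run got before it stalled.

```python
import os
import socket
import time


def log_progress(path, step, message):
    """Append one timestamped line to the log and force it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    line = f"{stamp} host={socket.gethostname()} pid={os.getpid()} step={step} {message}\n"
    with open(path, "a") as handle:
        handle.write(line)
        handle.flush()
        # fsync so the entry survives even if the job hangs or is killed.
        os.fsync(handle.fileno())
    return line
```

For example, calling `log_progress("progress.log", step, "before dump")` and then `log_progress("progress.log", step, "after dump")` around the output stage would show whether the run stops between the two.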
Does your simulation pause while the job still shows as "running"? If so, the hang may be happening at the timestep when a dump is being written. In that case, depending on the supercomputer architecture, it is a matter of experimenting with the -ncpus value you use until the job no longer gets stuck at that dumping timestep.