Hi there,
I was clustering a silent file containing my 10% lowest-energy decoys when cluster.mpi.linuxgccrelease simply stopped and printed the following on screen:
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 19956 on node compute-1-5 exited on signal 9 (Killed).
--------------------------------------------------------------------------
Well, this looks like a problem with MPIRUN rather than with the cluster.mpi.linuxgccrelease binary.
I'm running cluster.mpi.linuxgccrelease with the following command line:
mpirun -x LD_LIBRARY_PATH=$LIB --mca btl_tcp_if_include eth0 -np 20 --host compute-1-11,compute-1-12,compute-1-13,compute-1-14,compute-1-15,compute-1-16,compute-1-17,compute-1-18,compute-1-19,compute-1-20 $BIN/cluster.mpi.linuxgccrelease -in:file:fullatom -in:file:silent_struct_type binary -in:file:silent ecut_10.out -cluster:radius -1
Did I miss some special MPIRUN option?
Thanks in advance.
The clustering code was never multi-processor-ized to my knowledge. I don't think it should actually fail in MPI, but it certainly won't work better than the non-MPI build.
Hi smlewis,
Thanks for your reply. Judging by the output on screen, the MPI version seems to work reasonably well, but it doesn't write the expected clusters before it dies. So I thought I had missed some MPIRUN option. Well, if you don't use the MPI build of cluster, who am I to use it? Thanks for sharing.
Best.
EDIT: the information below might be useful to other users and/or the authors.
Feb 25 14:59:59 compute-1-20 kernel: Out of memory: Kill process 25129 (cluster.mpi.lin) score 445 or sacrifice child
Feb 25 14:59:59 compute-1-20 kernel: Killed process 25129, UID 1006, (cluster.mpi.lin) total-vm:7986656kB, anon-rss:7777940kB, file-rss:2620kB
For some reason the process was killed by the kernel with the status "Out of memory". The same job completed with the non-MPI build, cluster.default.linuxgccrelease.
This problem was solved by decreasing the number of processes per work node.
See https://www.rosettacommons.org/node/3619
Hope it helps.
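For anyone hitting the same "signal 9 (Killed)" message, here is a rough sketch of how the kernel line above can be turned into a processes-per-node estimate. It assumes every MPI process needs about as much memory as the one that was killed, which seems plausible if the clustering code is not parallelized and each process loads the whole silent file, but I have not verified that. The 16 GiB node size, the 0.9 headroom factor and the function names are made up for illustration; substitute the values for your own cluster.

```python
import re

# Kernel OOM-killer line copied from the log above.
LOG_LINE = (
    "Feb 25 14:59:59 compute-1-20 kernel: Killed process 25129, UID 1006, "
    "(cluster.mpi.lin) total-vm:7986656kB, anon-rss:7777940kB, file-rss:2620kB"
)


def rss_kb(line):
    """Return anon-rss + file-rss (in kB) from a 'Killed process' kernel line."""
    anon = re.search(r"anon-rss:(\d+)kB", line)
    file_ = re.search(r"file-rss:(\d+)kB", line)
    if anon is None or file_ is None:
        raise ValueError("not an OOM 'Killed process' line")
    return int(anon.group(1)) + int(file_.group(1))


def max_procs_per_node(node_ram_kb, per_proc_kb, headroom=0.9):
    """Number of processes of size per_proc_kb that fit in a fraction
    `headroom` of the node's RAM (the rest is left for the OS)."""
    return int(node_ram_kb * headroom) // per_proc_kb


per_proc = rss_kb(LOG_LINE)
print(f"resident memory per process: {per_proc / 1024**2:.2f} GiB")

# ASSUMPTION: 16 GiB of RAM per node (hypothetical); check /proc/meminfo.
node_ram = 16 * 1024**2
print("processes per node that fit:", max_procs_per_node(node_ram, per_proc))
```

Note that the resident size in the log is only what the process had reached when it was killed, so the real requirement may be higher. With Open MPI the per-node count can usually be capped with the -npernode option of mpirun, but check mpirun --help for your version.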