score_jd2.mpi.linuxgccrelease failed

Hi there,
I was trying to score 100k decoys in a silent file using Rosetta's score_jd2.mpi.linuxgccrelease with the command line:
> score_jd2.mpi.linuxgccrelease -in:file:fullatom -in:file:silent query_silent.out -out:file:silent_struct_type binary -out:file:silent scored_silent.out

I got the following on screen:

core.pack.dunbrack: (12) Reading /cluster/apps/fred/rosetta_2014wk05/main/database/rotamer/ExtendedOpt1-5/asp.bbdep.rotamers.lib
core.pack.dunbrack: (12) Reading /cluster/apps/fred/rosetta_2014wk05/main/database/rotamer/ExtendedOpt1-5/asp.bbdep.densities.lib
core.pack.dunbrack: (12) Reading /cluster/apps/fred/rosetta_2014wk05/main/database/rotamer/ExtendedOpt1-5/asp.bbind.chi2.Definitions.lib
core.pack.dunbrack: (6) Reading /cluster/apps/fred/rosetta_2014wk05/main/database/rotamer/ExtendedOpt1-5/thr.bbdep.rotamers.lib
core.pack.dunbrack: (6) Reading /cluster/apps/fred/rosetta_2014wk05/main/database/rotamer/ExtendedOpt1-5/val.bbdep.rotamers.lib
mpirun noticed that process rank 2 with PID 17664 on node compute-1-22 exited on signal 9 (Killed).

Several other MPI Rosetta applications are running just fine, so the error does not seem to be related to the mpirun launcher or to the command line itself. The same command works with the non-parallel version of score_jd2.
Any comments are welcome.

EDIT: perhaps this additional information could help. I sent this job to 3 compute nodes, and the two lines below are what I got from the system log. It seems node 22 ran out of memory.

Mar 26 17:21:33 compute-1-22 kernel: Out of memory: Kill process 17664 (score_jd2.mpi.l) score 125 or sacrifice child
Mar 26 17:21:33 compute-1-22 kernel: Killed process 17664, UID 1006, (score_jd2.mpi.l) total-vm:2396372kB, anon-rss:2141644kB, file-rss:264kB

Wed, 2014-03-26 13:54

It looks like your job distribution system or kernel is killing your jobs because they are exhausting the available memory. As a rough rule of thumb, each Rosetta process needs about 1 GB of memory in the default state. (This depends highly on which protocol you're running, how big your proteins are, etc. Just scoring is going to be less memory intensive than other protocols, but I'd estimate at least 0.75 GB per process.) So if you're running MPI on one node with 12 processors, you're looking at around 12 GB of memory usage. If that node only has, say, 8 GB of memory, you'll run out of memory and the jobs may get killed.

There are several ways of reducing memory usage. Probably the easiest is to disable patches you're not using. Adding "--chemical:exclude_patches LowerDNA UpperDNA Cterm_amidation SpecialRotamer VirtualBB ShoveBB VirtualDNAPhosphate VirtualNTerm CTermConnect sc_orbitals pro_hydroxylated_case1 pro_hydroxylated_case2 ser_phosphorylated thr_phosphorylated tyr_phosphorylated tyr_sulfated lys_dimethylated lys_monomethylated lys_trimethylated lys_acetylated glu_carboxylated cys_acetylated tyr_diiodinated N_acetylated C_methylamidated MethylatedProteinCterm" to your command line or flags file will save you about 200-250 MB (edit the list as appropriate if you need hydroxylated proline, say). If you want to go further, the main/database/chemical/residue_type_sets/fa_standard/residue_types/ directory contains "*.slim" versions of residue_types.txt and patches.txt which, if used to replace the non-slim versions, will likely net you another ~200 MB of memory savings. You'll then only be able to deal with simple protein-only structures, though. Even with those changes, you'll still need somewhere around 500 MB of memory for each process.

The easiest way to fix things, though, is simply to run fewer processes on each node: choose the number of processes per node based on available memory rather than available processors, and, if more nodes are available, spread your MPI run out across them.
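
If it helps, the sizing logic can be sketched in a few lines of Python. This is only a back-of-the-envelope calculation, not anything shipped with Rosetta: the function names and the reserve_gb parameter are made up for illustration, and the per-process memory figure is whatever you estimate or measure for your own protocol (the ~1 GB rule of thumb above is just a starting point).

```python
import math


def max_procs_per_node(node_mem_gb, per_proc_gb, cores, reserve_gb=0.0):
    """Largest number of processes that fits in both memory and cores.

    reserve_gb is a hypothetical allowance for the OS and other jobs.
    """
    if per_proc_gb <= 0:
        raise ValueError("per_proc_gb must be positive")
    usable_gb = max(node_mem_gb - reserve_gb, 0.0)
    by_memory = math.floor(usable_gb / per_proc_gb)
    # Never exceed the core count, even if memory would allow more.
    return min(by_memory, cores)


def nodes_needed(total_procs, procs_per_node):
    """Number of nodes required to place total_procs processes."""
    if procs_per_node <= 0:
        raise ValueError("procs_per_node must be positive")
    return math.ceil(total_procs / procs_per_node)


if __name__ == "__main__":
    # The example above: a 12-processor node with 8 GB, ~1 GB per process.
    per_node = max_procs_per_node(node_mem_gb=8, per_proc_gb=1.0, cores=12)
    print("processes per node:", per_node)
    print("nodes for 12 processes:", nodes_needed(12, per_node))
```

With the numbers from the example above, memory rather than core count is the limit, so a 12-process run would have to be spread over more than one such node. If your processes turn out to need more than the rule of thumb suggests, raise per_proc_gb to match what you actually observe.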

Thu, 2014-03-27 08:36

Thanks for your reply. You are right.
The cluster I'm running on has 8 processors and 16 GB of RAM per node. In the situation described in my first post, I launched 24 processes on 3 nodes. I thought 2 GB per process would be plenty.
In the meantime, I got the jobs to finish by launching 9 processes on the same 3 nodes.
This is a typical situation where less is more.

Tue, 2014-04-01 08:24