# jd2 quits before reaching -nstruct # in rosetta_scripts.mpi.linuxgccrelease

Dear Rosetta community,

I am looking for the reason why my simulations stop before -nstruct is reached. This happens when I use rosetta_scripts.mpi.linuxgccrelease for some flexible docking. As many decoys don’t pass the Rosetta filters, I set -ntrials to 100.

And then, with:

-nstruct 500

-ntrials 100

I get exactly 500 decoys (which I anticipate), but for:

-nstruct 30000

-ntrials 100

the simulation stops after ~700 decoys. When I restart it, ~700 new decoys are added to the silent file but then it stops again. I attached the last lines of the log file.

What may be the problem? Is there any time limit that may be exceeded, or can this be a hardware problem? I am running the simulations on 32 CPUs with 128 GB RAM, Ubuntu 14.04.2 LTS, mpiexec (OpenRTE) 1.6.5, Rosetta_2017.45.59812.

Best regards,

Filip

Mon, 2017-11-27 02:55

As I'm sure you saw from the log file, the error is either not getting caught because of log file buffering, or it's hard to find because of multiplexing from the N processes.

1) "The simulation stops" - do you know any more than that?  Perhaps try running it in a screen/tmux to capture an error message from the system (segfault, or out of memory, or disk full, or whatever)?

2) You can use -mpi_tracer_to_file $FILESTEM and it will de-multiplex the output - each process will write to $FILESTEM_$RANK.  This is still susceptible to the buffering problem so I'm not sure it will help.  Combining ideas 1 and 2 will give you clean logs in the log files and probably a more interpretable error in the screen/tmux.
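For idea 2, a small helper can show where each rank stopped writing. This is only a sketch: it assumes the per-rank files really are named $FILESTEM_$RANK as described above, and the function name tail_rank_logs is made up for illustration. Because of buffering, the last lines on disk may not be the last thing a process actually did.

```python
from collections import deque
from glob import escape, glob


def tail_rank_logs(filestem, n_lines=5):
    """Return the last n_lines of every file matching <filestem>_*.

    Assumption: -mpi_tracer_to_file writes one file per MPI rank, named
    <filestem>_<rank>. Adjust the pattern if your files are named differently.
    """
    tails = {}
    for path in sorted(glob(escape(filestem) + "_*")):
        with open(path, errors="replace") as handle:
            # deque with maxlen keeps only the final n_lines while streaming
            last = deque(handle, maxlen=n_lines)
        tails[path] = [line.rstrip("\n") for line in last]
    return tails


def print_tails(filestem, n_lines=5):
    """Print the tail of each per-rank log, one block per file."""
    for path, lines in tail_rank_logs(filestem, n_lines).items():
        print("==> " + path)
        for line in lines:
            print("    " + line)
```

Calling print_tails("docking_log") after the run dies (with docking_log being whatever you passed to -mpi_tracer_to_file) makes it easier to see whether all ranks stopped at the same stage or only one of them did.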

I agree with you that it smells like a memory leak, but I'm not aware of any memory leaks on OUTPUT.  There is a known leak on input; -jd2:delete_old_poses fixes it, but I don't think it will help here.  If it's a leak in some sub-component of your RosettaScript it will be interesting but hard to identify.

Mon, 2017-11-27 11:01
smlewis

Thank you for the tips smlewis!

First of all, I updated Ubuntu to 16.04 and OpenMPI to 3.0.0. This didn’t help. (Btw. the problem also existed with the older version of Rosetta: 2016.32.58837)

Screen terminates without any error, just like when the job is complete. The log from a single core doesn’t bring more information. The simulations simply stop without an error (last lines from 2 cores attached).

It seems to be a memory leak on the output though; the memory consumption gradually increases (top screenshots at 2 time points attached). If I change -ntrials to 1, I get to higher -nstruct numbers, but the number of generated decoys is still ~700. -jd2:delete_old_poses indeed doesn't help (there is only one pdb file as input).

Do you have any further ideas...? At the moment, I can simply use a bash script to restart the simulation, but this is perhaps not the most elegant solution…
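The restart workaround could look roughly like the sketch below, written in Python rather than bash. It relies on the behaviour described earlier in the thread, namely that rerunning the same command appends new decoys to the existing silent file. The helper names are hypothetical, and the decoy count assumes that every decoy contributes one line starting with "SCORE:" and that the column header line ends with "description"; check this against your own silent file.

```python
import subprocess
from pathlib import Path


def count_decoys(silent_file):
    """Count decoys as SCORE: lines, skipping the column header line.

    Assumption: the header SCORE: line ends with the word 'description'.
    """
    path = Path(silent_file)
    if not path.exists():
        return 0
    count = 0
    with path.open(errors="replace") as handle:
        for line in handle:
            if line.startswith("SCORE:") and not line.rstrip().endswith("description"):
                count += 1
    return count


def run_until_done(command, silent_file, target, max_restarts=100,
                   runner=subprocess.call):
    """Rerun the same command until target decoys exist or progress stops."""
    done = count_decoys(silent_file)
    restarts = 0
    while done < target and restarts < max_restarts:
        runner(command)  # exit status is ignored: the run may die silently
        now = count_decoys(silent_file)
        if now == done:
            break  # a run that adds nothing suggests a different problem
        done = now
        restarts += 1
    return done
```

A call might look like run_until_done(["mpiexec", "-np", "32", "rosetta_scripts.mpi.linuxgccrelease", "@flags"], "out.silent", 30000), where the flags file and silent file names are placeholders for your own. The no-progress check avoids looping forever if the remaining jobs can never pass the filters.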

Thanks for the help!

Filip

Thu, 2017-11-30 02:22

Generally speaking, a memory budget of 1-2 GB per running Rosetta process is normal. You're on the high side of that in the second screen, but depending on what you're doing that's not necessarily too bad. How close to the 700 limit was the second screen taken? What does that amount to in output structures per processor? Also how many processors are you running?

Debugging memory leaks is rather hard. If you're willing to bundle up everything we'd need to reproduce the run on our machines, we can take a look at it. Short of that, I'd probably recommend just doing the restart procedure -- Rosetta should be relatively robust to restarting runs.

Fri, 2017-12-01 12:32
rmoretti

700 decoys are reached after ~1000 min, which indeed corresponds to the memory limit (%MEM vs. time attached). I am using MPI with 32 cores, so I guess 31 are calculating trajectories. A single core dumps roughly 23 decoys. Shouldn’t the memory be released once the pdb is dumped...?
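For reference, the numbers in this thread can be put together as a quick back-of-the-envelope calculation. It is only a sketch: it takes the guess of 31 working ranks at face value, assumes every restart behaves like the first run (~700 decoys in ~1000 min), and treats an even split of RAM across processes as the per-process budget.

```python
import math


def restart_estimate(target_decoys, decoys_per_run, minutes_per_run):
    """Number of runs and wall-clock days needed if every run is alike."""
    runs = math.ceil(target_decoys / decoys_per_run)
    days = runs * minutes_per_run / (60 * 24)
    return runs, days


mpi_processes = 32
workers = mpi_processes - 1           # assumption: one rank only coordinates
decoys_per_worker = 700 / workers     # decoys each worker writes before the stop
ram_share_gb = 128 / mpi_processes    # even share of RAM per MPI process

runs, days = restart_estimate(30000, 700, 1000)

print(f"decoys per worker before the stop: {decoys_per_worker:.1f}")
print(f"even RAM share per process: {ram_share_gb:.1f} GB")
print(f"runs needed for 30000 decoys: {runs} (~{days:.1f} days)")
```

Under these assumptions each worker writes about 23 decoys before the run dies, each process has a 4 GB share of RAM, and reaching 30000 decoys takes 43 runs, or roughly 30 days of wall-clock time.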

Tue, 2017-12-05 01:49