setting timeout option in MPI run
I'm trying to set a timeout option for my Rosetta MPI run, so that after a certain period of time the whole process is stopped automatically.

I run with the following command:

mpirun -np 10 rosetta_scripts.mpi.linuxgccrelease @options

where the options file contains:

-in:file:s crystal_complex.pdb
-in:file:extra_res_fa LIG.params
-out:path:all dock_res
-nstruct 100
-packing:no_optH false
-packing:flip_HNQ true
-packing:ignore_ligand_chi true
-parser:protocol dock.xml
-mistakes:restore_pre_talaris_2013_behavior true
-ignore_zero_occupancy false
-qsar:max_grid_cache_size 1

Are there any specific options I should include in the options file to set a timeout?


Mon, 2024-02-26 01:09

There's an option `-run::maxruntime` which should set an upper limit on the time Rosetta will run jobs. Note that this is the time (in seconds) after which Rosetta will stop launching new output-structure runs. Rosetta will not (cannot) stop structures which are still in the middle of running when that timeout is reached, so you'll typically run a bit longer than the nominal maxruntime.

To help compensate for this somewhat, there's also a `-run::maxruntime_bufferfactor` option. When enabled, this keeps track of how long it usually takes to output a structure, and will not start new outputs if you're within the bufferfactor of the average runtime. (The value is a multiplier for the average time: 1.5 means don't start a new job if you're within one and a half times the average runtime of the -run::maxruntime limit.)
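For illustration, the two flags could be added to the options file like this (the numbers here are example values, not recommendations -- pick a maxruntime that fits your queue limits):

-run:maxruntime 3600
-run:maxruntime_bufferfactor 1.5

With these settings, Rosetta stops handing out new outputs after 3600 seconds, and also stops early if a new output would be expected to finish more than 1.5 average-job-times past that limit.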

Mon, 2024-02-26 07:48

Thanks for your reply

When I set '-run::maxruntime', Rosetta does seem to stop launching new output structures, as you wrote. But the whole Rosetta process is still hanging (e.g. when I set maxruntime == 50 sec, the processes hang for over 5 min!).

I'm running Rosetta with MPI. Is this problem related to MPI, or is there anything else I should consider?



Mon, 2024-02-26 16:29

It could be due to MPI.

As mentioned, the timeout will only stop Rosetta from launching new jobs, but won't affect anything that's already running on the other MPI nodes. So if each run needs a particularly long time to complete, you'll see issues. (How long are outputs taking to complete? There should be some "XYZ reported success in XYZ seconds" messages printed which should give you a sense of how long each output takes. If you're not seeing those, the jobs may be taking longer than you expected.)

One thing that can happen with MPI runs is that the worker nodes can "hang" -- they've completed their calculations but the communication with the coordinating node has broken down, so the "job is over" message never actually gets received. That's hard to fix, but there is a `-mpi_timeout_factor` option which causes Rosetta to be more aggressive in shutting down, even when worker nodes are not responding. (It takes a value which is a multiplicative factor of the average job runtime.)
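As an illustration, it would go in the options file as a single line; the 1.5 here is just an example multiplier, not a recommendation:

-jd2:mpi_timeout_factor 1.5

This tells Rosetta to give up on an unresponsive worker once 1.5 times the average job runtime has elapsed without a response.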

Tue, 2024-02-27 11:10

when I use mpi_timeout_factor, I get the following warning:

protocols.jd2.JobDistributor: (1) [ WARNING ] The following options have been set, but have not yet been used:
    -jd2:mpi_timeout_factor 1.5

It seems I didn't use mpi_timeout_factor properly.

Is there something I should consider?

And when I use the maxruntime option, I see the following at the end of the log file:

Error: (9) [ ERROR ] Run terminating because runtime of 75 s exceeded maxruntime of 60 s
protocols.jd2.JobDistributor: (9) 100 jobs considered, 3 jobs attempted in 75 seconds

In a normal run, the log file ends with the following:
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Finished sending spin down signals to slaves
protocols.jd2.JobDistributor: (2) 50 jobs considered, 3 jobs attempted in 91 seconds

The process with the maxruntime option never seems to stop, and it does not produce any structures after the run-terminating signal.

What could be the problem?

Tue, 2024-02-27 17:16

Hmm, Rosetta has a number of different MPI job distributors. The -mpi_timeout_factor option is specific to one of them (MPIFileBufJobDistributor), which is the default, but apparently only if you're outputting silent files (not PDB files). For PDB output, the MPIWorkPoolJobDistributor is used instead. (Apologies for not recognizing that earlier.)

For debugging why things aren't working properly, you'll want to look at the end of the logs for each node. The nodes are labeled with the node number: `(9)` for node 9, `(2)` for node 2, and `(0)` for node zero. For the MPIWorkPoolJobDistributor, node 0 is the "master" node, the one which hands out jobs to the other worker nodes. For each of the worker nodes, you should (ideally) see lines like "Run terminating because runtime of 75 s exceeded maxruntime of 60 s" near the end of the run.

Then for node zero, you'll see a bunch of "Master Node: Waiting for job requests..." and "Master Node: Received message from ... with tag ..." messages. Once everything is done, you'll see a "Master Node: Finished handing out jobs" line and then a bunch of "Master Node: Waiting for NNN slaves to finish jobs" messages; finally you'll see the "Master Node: Finished sending spin down signals to slaves" message once all the worker nodes have successfully completed.

What might be happening is that the MPIWorkPoolJobDistributor might not be working well with the timeout factor, such that all the worker nodes terminate their runs, but node 0 doesn't properly recognize that, which means that it's stuck at the "Master Node: Waiting for job requests..." state as the last message. As currently programmed, it will sit at that state forever.

One thing to try is to switch to the MPIFileBufJobDistributor instead, which would require you to put something like `-out:file:silent dock_res.out` on the command line. This will output the structures as a Rosetta silent file instead of PDBs directly, but that should work for ligand docking results. You can then extract the score file from the silent file with something like `grep SCORE: dock_res.out >`, and then there's an extract_pdbs application which will allow you to extract all (or just a subset of) the structures from the silent file. ("tags" here is the name in the description field of the scorefile -- also remember to pass the same -extra_res_fa option to extract_pdbs so that it's aware of your ligand.)
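Putting that silent-file workflow together, the commands would look roughly like the following. (File names such as dock_res.sc are hypothetical placeholders, and the exact extract_pdbs binary suffix depends on your build -- adjust to match your install.)

# run with silent-file output so the MPIFileBufJobDistributor is used
mpirun -np 10 rosetta_scripts.mpi.linuxgccrelease @options -out:file:silent dock_res.out

# pull the score lines out into their own score file (name is illustrative)
grep "SCORE:" dock_res.out > dock_res.sc

# extract structures from the silent file, passing the ligand params again
extract_pdbs.linuxgccrelease -in:file:silent dock_res.out -in:file:extra_res_fa LIG.params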

Wed, 2024-02-28 09:20