
Rosetta parallel running


Dear Rosetta people,

I have installed Rosetta 3.9 on our local cluster. After copying the $ROSETTA_PATH/main/source/tools/build/site.settings.topsail file to $ROSETTA_PATH/main/source/tools/build/site.settings, I compiled Rosetta with MPI support using the following command:

./scons.py mode=release extras=mpi bin -j 20

Compilation finished successfully without any error messages. I was able to run a job with the relax.mpi.linuxgccrelease executable successfully, but the behavior is unclear when I use the rosetta_scripts.mpi.linuxgccrelease one.

I run the following command:

mpiexec -np 10 rosetta_scripts.mpi.linuxgccrelease @Rosetta_flags -mpi_tracer_to_file logdir

When -nstruct is set to 100, ten output files are created. Each of them seems to contain the output of one processor, and 10 jobs are assigned to each processor. The run terminates when the first processor finishes its 10 assigned jobs, while none of the other processors complete theirs. The following message appears at the end of the output file of the processor that completed its 10 assigned jobs:

protocols.jd2.JobDistributor: (6) 100 jobs considered, 10 jobs attempted in 1029 seconds
Error: (6) [ ERROR ] Exception caught by rosetta_scripts application:

File: src/protocols/jd2/JobDistributor.cc:329
10 jobs failed; check output for error messages
Error: (6) [ ERROR ]

However, the last few lines of the other output files are different. For example:

protocols.docking.DockingLowRes: (7) ////////////////////////////////////////////////////////////////////////////////
protocols.docking.DockingLowRes: (7) ///                       Docking Low Res Protocol                           ///
protocols.docking.DockingLowRes: (7) ///                                                                          ///
protocols.docking.DockingLowRes: (7) /// Centroid Inner Cycles: 50                                                ///
protocols.docking.DockingLowRes: (7) /// Centroid Outer Cycles: 10                                                ///
protocols.docking.DockingLowRes: (7) /// Scorefunction:                                                           ///
protocols.docking.DockingLowRes: (7) ScoreFunction::show():
weights: (interchain_pair 1) (interchain_vdw 1) (interchain_env 1) (interchain_contact 2) (backbone_stub_linear_constraint 10)
energy_method_options: EnergyMethodOptions::show: aa_composition_setup_files:

or,

protocols.docking.DockingLowRes: (0) EnergyMethodOptions::show: voids_penalty_energy_voxel_grid_padding_: 1
protocols.docking.DockingLowRes: (0) EnergyMethodOptions::show: voids_penalty_energy_voxel_size_: 0.5
protocols.docking.DockingLowRes: (0) EnergyMethodOptions::show: voids_penalty_energy_disabled_except_during_packing_: TRUE
protocols.docking.DockingLowRes: (0) EnergyMethodOptions::show: hbnet_bonus_ramping_function_: "quadratic"

I expected each processor to complete all 10 of its assigned jobs, and the output of all files to be the same. Also, when -nstruct is set to 1000, none of the processors complete their assigned jobs, and the run terminates suddenly without any error message.
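
One way to check which processors actually reached the end of their job list is to scan each per-processor output file for the JobDistributor summary line quoted above. The following is a minimal sketch, not part of Rosetta: the regular expression is written only against the one summary line shown in this post, and the names `SUMMARY` and `finished_ranks` are hypothetical helpers introduced for illustration.

```python
import re

# Pattern for the summary line quoted in this post, e.g.
# "protocols.jd2.JobDistributor: (6) 100 jobs considered, 10 jobs attempted in 1029 seconds"
# Assumption: other Rosetta builds may word this line differently.
SUMMARY = re.compile(
    r"protocols\.jd2\.JobDistributor: \((?P<rank>\d+)\) "
    r"(?P<considered>\d+) jobs considered, "
    r"(?P<attempted>\d+) jobs attempted in (?P<seconds>\d+) seconds"
)


def finished_ranks(lines):
    """Return {rank: (considered, attempted, seconds)} for each summary line found."""
    found = {}
    for line in lines:
        match = SUMMARY.search(line)
        if match:
            found[int(match["rank"])] = (
                int(match["considered"]),
                int(match["attempted"]),
                int(match["seconds"]),
            )
    return found


# Two lines taken from the output excerpts in this post (the first one shortened).
sample = [
    "protocols.docking.DockingLowRes: (7) /// Centroid Inner Cycles: 50 ///",
    "protocols.jd2.JobDistributor: (6) 100 jobs considered, 10 jobs attempted in 1029 seconds",
]
summary = finished_ranks(sample)
print(summary)
```

To use it on a real run, pass the lines of each per-processor output file to `finished_ranks`; any processor number missing from the result never printed its summary line.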

May I ask how I can solve this problem?

Best Regards

Bahareh Bamdad

Mon, 2018-07-23 21:28
bahareh

For runs terminated by that particular error message ("File: src/protocols/jd2/JobDistributor.cc:329"), you should be able to add the option `-jd2:failed_job_exception false` to the command line to keep Rosetta from exiting if any of the jobs fail. (Though for MPI, that condition shouldn't trigger.)
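
As a rough illustration, the command from the original post with that option appended could be assembled as below. This is only a sketch: `build_command` and its parameters are hypothetical names, and the executable, flags file, and log prefix are copied from the post rather than checked against any particular installation.

```python
import shlex


def build_command(nproc, flags_file, log_prefix, keep_going=True):
    """Assemble the mpiexec command line from the original post as an argument list."""
    cmd = [
        "mpiexec", "-np", str(nproc),
        "rosetta_scripts.mpi.linuxgccrelease",
        f"@{flags_file}",
        "-mpi_tracer_to_file", log_prefix,
    ]
    if keep_going:
        # Option suggested in this reply: do not raise the exception when jobs fail.
        cmd += ["-jd2:failed_job_exception", "false"]
    return cmd


command = build_command(10, "Rosetta_flags", "logdir")
print(shlex.join(command))
```

The printed line can be pasted into a shell or job script. The option only stops Rosetta from exiting on failed jobs; the per-processor output still has to be checked to find out why those jobs failed.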

Thu, 2018-09-20 08:28
rmoretti