Rosetta3.2.1-LAM-MPI Run-Problem-More than 2 processor Jobs


Hi All:

I am a new Rosetta user. I have just finished compiling parallel Rosetta (rosetta3.2.1/gcc-4.4.3/LAM-MPI-7.1.4)
and can run two-processor jobs without issues, but jobs with more than two processors fail (error message below).
Any help would be greatly appreciated.

Thanks

Ravi

Linux System:
-------------

Linux node2n29 2.6.32-29-server #58-Ubuntu SMP Fri Feb 11 21:06:51 UTC 2011 x86_64 GNU/Linux

The parallel version of Rosetta was compiled with GCC 4.4.3 and LAM-MPI 7.1.4.

Run Command (memory used was 8 GB)
-----------------------------------

bin/AbinitioRelax.mpi.linuxgccrelease @flags

----flags-------

-in:file:native inputs/1l2y.pdb
-in:file:frag3 inputs/aa1l2yA03_05.200_v1_3
-in:file:frag9 inputs/aa1l2yA09_05.200_v1_3
-out:nstruct 1
-out:file:silent 1l2y_silent.out
-no_prof_info_in_silentout
-mute core.io.database
-run:constant_seed
-run:jran 1111111
-database /opt/nasapps/build/Rosetta/rosetta_database
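
For completeness, the job was launched via mpirun under LAM-MPI. A minimal sketch of the launch (the hostfile name and process count here are illustrative rather than taken from my actual run):

----illustrative LAM-MPI launch (sketch)----
lamboot -v lamhosts    # start the LAM runtime on the nodes listed in "lamhosts"
mpirun -np 4 bin/AbinitioRelax.mpi.linuxgccrelease @flags
lamhalt                # shut the LAM runtime down afterwards
--------------------------------------------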

Error Message:

---------------------------------------
...................................
Stage 2
Folding with score1 for 2000
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 11933 failed on node n0 (129.43.63.71) due to signal 11.

Tue, 2011-05-24 08:24
ravichandrans

A) The AbinitioRelax executable is, non-obviously, not MPI compatible. It shouldn't crash, but it won't actually work under MPI; it just runs concurrent non-MPI jobs (which overwrite each other's output).

B) I suspect it is crashing because the filesystem is getting angry at files overwriting each other; it isn't giving me a Rosetta error message to work with.

There is an abinitio MPI patch for 3.2 floating around. Would you like me to email it to the address you gave when you signed up for the message boards?
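
In the meantime, a rough workaround is to run independent non-MPI jobs that each write their own silent file and use their own seed, so the outputs don't collide. A minimal sketch, assuming the non-MPI executable name (AbinitioRelax.linuxgccrelease) and assuming that flags given on the command line override the ones in your @flags file; the job count and file names are illustrative:

----workaround sketch----
# Four independent AbinitioRelax jobs, each with its own silent file and seed.
for i in 1 2 3 4; do
  bin/AbinitioRelax.linuxgccrelease @flags \
    -out:file:silent 1l2y_silent_${i}.out \
    -run:jran $((1111111 + i)) &
done
wait
-------------------------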

Tue, 2011-05-24 08:39
smlewis

Thanks for your reply. Yes, please email me the fix. Thanks.

Tue, 2011-05-24 08:45
ravichandrans

It's on the way.

Tue, 2011-05-24 08:47
smlewis

Hi Ravi, I am using version 3.3 and would also like to run AbinitioRelax under MPI. Could you please forward me the patch and tell me how to do the run with it?

Thanks

Thu, 2011-12-08 21:48
Gaurav_kumar

On the way.
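
In short, the steps are: drop the app source into src/apps/public, list the app in src/apps.src.settings, rebuild with the MPI extras, and run the resulting binary through mpirun. A sketch, assuming the patch file is named AbInitio_MPI.cc and using the paths that appear later in this thread (process count illustrative):

----patch install sketch----
cp AbInitio_MPI.cc rosetta_source/src/apps/public/            # 1. put the app source in place
# 2. add "AbInitio_MPI" to the "public" list in rosetta_source/src/apps.src.settings
cd rosetta_source && ./scons.py bin mode=release extras=mpi   # 3. rebuild with MPI enabled
mpirun -np 4 bin/AbInitio_MPI.mpi.linuxgccrelease @flags      # 4. run the MPI binary
----------------------------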

Fri, 2011-12-09 08:07
smlewis

Hi Lewis, I have put the AbInitio_MPI.cc file in the src/apps/public directory and opened the src/apps.src.settings file:

sources = {
"" : [],

"curated": [],
"benchmark": [ "benchmark" ],
"benchmark/scientific": [
"design_contrast_and_statistic",
"ddg_benchmark",
"rotamer_recovery",
],
"public/bundle" : [ "minirosetta", "minirosetta_graphics" ],
"public/ligand_docking" : [
"ligand_rpkmin",
"ligand_dock",
"extract_atomtree_diffs",
],
"public/docking" : [
"docking_protocol",
"docking_prepack_protocol",
],
"public/flexpep_docking" : [ "FlexPepDocking" ], # /* Barak,doc/apps/public/flexpep_docking/barak/FlexPepDocking.dox, test/integration/tests/flexpepdock/ */
"public/enzdes" : [
"enzyme_design",
"CstfileToTheozymePDB"
],
"public/rosettaDNA" : [ "rosettaDNA" ],
"public/design" : ["fixbb"],
"public/loop_modeling" : [ "loopmodel" ],
"public/match" : [
"match",
"gen_lig_grids",
"gen_apo_grids"
],
"public/membrane_abinitio" : [ "membrane_abinitio2" ],

"public/comparative_modeling" : [
"score_aln",
"super_aln",
"full_length_model",
"cluster_alns",
],

"public/electron_density" : [
"mr_protocols",
"loops_from_density",
],

"public" : [
"score_jd2",
"relax",
"idealize",
"idealize_jd2",
"cluster",
"combine_silent",
"extract_pdbs",
"AbinitioRelax",
"AbInitio_MPI",
"backrub",
"sequence_tolerance",
"SymDock"
],
"public/rosetta_scripts" : [
"rosetta_scripts",
"revert_design_to_native"
],
"public/scenarios" : [
"FloppyTail", # /* Steven Lewis, doc/apps/public/scenarios/FloppyTail.dox, test/integration/tests/FloppyTail/ */
# "FloppyTailACAT", # /* Barak Raveh */
"ca_to_allatom", # /* Frank DiMaio, doc/apps/public/scenarios/ca_to_allatom.dox */
],
}
include_path = [ ]
library_path = [ ]
libraries = [ ]
subprojects = [ "devel", "protocols", "core", "numeric", "utility", "ObjexxFCL", "z" ]

Now I am just wondering where to add the line you mentioned for "AbInitio_MPI" compilation.

Do I need any extra flag to build this along with the other programs in the rosetta_source directory?

Tue, 2011-12-13 20:43
Gaurav_kumar

You already did add it to the "public" group.

Editing this file tells SCons to compile it, so you don't need a flag; just recompile.
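
For example, from the rosetta_source directory (this is the same invocation that appears in the build log further down the thread):

----rebuild (sketch)----
cd rosetta_source
./scons.py bin mode=release extras=mpi    # rebuilds the apps listed in apps.src.settings, MPI-enabled
------------------------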

Wed, 2011-12-14 06:59
smlewis

Hi Steven:

I have applied the patch that you sent me.

---------BUILD OUTPUT (after applying the patch)-----
[ravi@torkv rosetta_source]$ ./scons.py bin mode=release extras=mpi
scons: Reading SConscript files ...
svn: '.' is not a working copy
scons: done reading SConscript files.
scons: Building targets ...
mpiCC -o build/src/release/linux/2.6/64/x86/gcc/mpi/apps/public/AbInitio_MPI.o -c -std=c++98 -pipe -ffor-scope -W -Wall -pedantic -Wno-long-long -O3 -ffast-math -funroll-loops -finline-functions -finline-limit=20000 -s -Wno-unused-variable -DNDEBUG -DUSEMPI -Isrc -Iexternal/include -Isrc/platform/linux/64/gcc -Isrc/platform/linux/64 -Isrc/platform/linux -Iexternal/boost_1_38_0 -I/usr/local/include -I/usr/include src/apps/public/AbInitio_MPI.cc
mpiCC -o build/src/release/linux/2.6/64/x86/gcc/mpi/AbInitio_MPI.linuxgccrelease -Wl,-rpath=/opt/nasapps/build/Rosetta/rosetta_source/build/src/release/linux/2.6/64/x86/gcc/mpi build/src/release/linux/2.6/64/x86/gcc/mpi/apps/public/AbInitio_MPI.o -Llib -Lexternal/lib -Lbuild/src/release/linux/2.6/64/x86/gcc/mpi -Lsrc -L/usr/local/lib -L/usr/lib -L/lib -L/lib64 -ldevel -lprotocols -lcore -lnumeric -lutility -lObjexxFCL -lz
Install file: "build/src/release/linux/2.6/64/x86/gcc/mpi/AbInitio_MPI.linuxgccrelease" as "bin/AbInitio_MPI.linuxgccrelease"
mpiCC -o build/src/release/linux/2.6/64/x86/gcc/mpi/AbInitio_MPI.mpi.linuxgccrelease -Wl,-rpath=/opt/nasapps/build/Rosetta/rosetta_source/build/src/release/linux/2.6/64/x86/gcc/mpi build/src/release/linux/2.6/64/x86/gcc/mpi/apps/public/AbInitio_MPI.o -Llib -Lexternal/lib -Lbuild/src/release/linux/2.6/64/x86/gcc/mpi -Lsrc -L/usr/local/lib -L/usr/lib -L/lib -L/lib64 -ldevel -lprotocols -lcore -lnumeric -lutility -lObjexxFCL -lz
Install file: "build/src/release/linux/2.6/64/x86/gcc/mpi/AbInitio_MPI.mpi.linuxgccrelease" as "bin/AbInitio_MPI.mpi.linuxgccrelease"
scons: done building targets.
---------------------------------------------------------------------

---------MPIRUN with NP 2 WORKS FINE-----
mpirun -np 2 $rosetta_home/bin/AbinitioRelax.mpi.linuxgccrelease @flags

......
......
......
Total weighted score: 24.862

===================================================================
Finished Abinitio

protocols.abinitio.AbrelaxApplication: (1) Finished _0001 in 7 seconds.
protocols::checkpoint: (1) Deleting checkpoints of ClassicAbinitio
protocols::checkpoint: (1) Deleting checkpoints of Abrelax
protocols.jobdist.JobDistributors: (1) Node: 1 next_job()
protocols.jobdist.JobDistributors: (1) Slave Node 1 -- requesting job from master node; tag_ 1
protocols.jobdist.JobDistributors: (0) Master Node --available job? 0
protocols.jobdist.JobDistributors: (0) Master Node -- Spinning down node 1
protocols.jobdist.JobDistributors: (0) Node 0 -- ready to call mpi finalize
protocols.jobdist.JobDistributors: (1) Node 1 -- ready to call mpi finalize
protocols::checkpoint: (0) Deleting checkpoints of ClassicAbinitio
protocols::checkpoint: (0) Deleting checkpoints of Abrelax
protocols::checkpoint: (1) Deleting checkpoints of ClassicAbinitio
protocols::checkpoint: (1) Deleting checkpoints of Abrelax
----------------------------------------------------------

--MPIRUN with NP >2 FAILS----------------------------------
===================================================================
Stage 2
Folding with score1 for 2000
-----------------------------------------------------------------------------
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 7798 failed on node n0 (129.43.63.50) due to signal 11.
-----------------------------------------------------------------------------

Could LAM-7.1.4 be an issue?
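
One thing I could try, to rule the MPI layer itself in or out, is a quick sanity check of the LAM install outside Rosetta. A sketch (the hostfile name is illustrative; lamboot, laminfo, and lamhalt are the standard LAM utilities):

----LAM sanity check (sketch)----
lamboot -v lamhosts     # start the LAM runtime on the listed nodes
laminfo                 # report the LAM version (should show 7.1.4) and compiler wrappers
mpirun -np 4 hostname   # trivial 4-process launch; expect one line of output per process
lamhalt                 # shut the runtime down
---------------------------------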

Thanks

Ravi

Tue, 2011-05-24 09:44
ravichandrans

Steven:

I forgot to mention that I am using Python 2.7.

Thanks

Tue, 2011-05-24 09:49
ravichandrans

I hate taking this off the boards, but you and another user are reporting similar problems; I've emailed you both to try to figure out whether there's a shared root cause.

Wed, 2011-05-25 13:28
smlewis