# mpi / jd2 with AbinitioRelax and relax (3.2)

Hi,

I have a working mpi compile of Rosetta 3.2, and intend to use AbinitioRelax and relax on a cluster.

Things are currently working:
- with NO EXTRA PARAMETERS in my flags file (so same flags file as for single processor). Note that I have "-run:constant_seed" and "-run:jran 1111111" in my flags file.
- using "mpirun -np 8 scriptfile" (looks like one core is used for management, and 7 used for computations, and speed improvement is ~6X versus non-mpi version)

My question relates to potential JD2 parameters I could use in my flags file.

From the 3.2 "relax documentation" : "All JD2 options apply (see JD2 Documentation for more details)"
From the 3.2 "AbinitioRelax documentation" : there is actually no mention of JD2
From the 3.2 "How To Use The New Job Distributor" :
"There are four MPI flavors to choose from.
1) MPIWorkPartitionJobDistributor: this one is MPI in name only. Activate it with jd2::mpi_work_partition_job_distributor. Here, each processor looks at the job list and determines which jobs are its own from how many other processors there are (in other words, divide number of jobs by number of processors; each processor does that many). Its files contain an example. This JD has the best potential efficiency with MPI but isn't useful for big stuff, because you have to load-balance yourself by setting it up so that (njobs % nprocs) is either zero or close to nprocs. It is recommended only for small jobs, or jobs with tightly controlled runtimes. I use it for generating a small number of trajectories (100 jobs on 100 processors) to determine where to set my filters, then run large jobs under the next jd.
2) MPIWorkPoolJobDistributor: this is the default in MPI mode. Here, one node does no Rosetta work and instead controls all the other nodes. All other nodes request jobs from the head node, do them, report, and then get a new one until all the jobs are done. If you only have one processor, then you get only a head node, and no work is done: make sure you're using MPI properly! This one has "blocking" output - only one job can write to disk at a time, to prevent silent files from being corrupted. It's a good choice if the runtime of each individual job is relatively large, so that output happens sparsely (because then the blocking is a non-issue). If your jobs are very short and write to disk often, or you have a huge number of processors and they write often just because there's a ton of them, this job distributor will be inefficient. As of this writing, this is the job distributor of choice when writing PDB files.
3) MPIFileBufJobDistributor: this is the other default for MPI. It allocates two nodes to non-rosetta jobs. One head node as before and one node dedicated to managing output. It is nonblocking, so it is the best choice if job completion occurs often relative to how long filesystem writes take (short job times or huge numbers of processors or both). At the moment it works only with silent file output. It is the best choice if you have short job times or large (many thousands) of processors."
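To make the load-balancing remark about MPIWorkPartitionJobDistributor concrete, here is a minimal Python sketch of a static split. It is not Rosetta's code: the round-robin rule and the function names `jobs_for_rank` and `load_per_rank` are assumptions made purely for illustration, and the real job distributor may assign jobs differently. The point is only that every processor can work out its own share without talking to the others, and that an uneven (njobs % nprocs) leaves some processors with one more job than the rest.

```python
def jobs_for_rank(njobs: int, nprocs: int, rank: int) -> list:
    """Return the 1-based job indices a given rank would run.

    Hypothetical round-robin rule: rank r takes jobs r+1, r+1+nprocs, ...
    Every rank can compute this on its own, so no communication is needed.
    """
    return list(range(rank + 1, njobs + 1, nprocs))


def load_per_rank(njobs: int, nprocs: int) -> list:
    """Number of jobs each rank ends up with under the rule above."""
    return [len(jobs_for_rank(njobs, nprocs, r)) for r in range(nprocs)]


if __name__ == "__main__":
    for njobs, nprocs in [(100, 100), (100, 8), (104, 8)]:
        loads = load_per_rank(njobs, nprocs)
        print(f"{njobs} jobs on {nprocs} procs: "
              f"min {min(loads)}, max {max(loads)} jobs per rank")
```

With 100 jobs on 8 processors, four ranks get 13 jobs and four get 12, so part of the machine idles while the last jobs finish. With 100 jobs on 100 processors, or 104 on 8, the split is even.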

1. I assume I am using MPIWorkPoolJobDistributor (based on documentation and the fact that I get 7 silent files with "mpirun -np 8")

2. Do flags like "jd2::mpi_work_partition_job_distributor" apply to AbinitioRelax and relax?

3. If so, what are the other flags I could try? (i.e. where do I find additional docs... there is no mention of available jd2 flags in "relax -help")

4. What is the fourth MPI flavour mentioned in the docs ? ( "How To Use The New Job Distributor" )

5. Any suggestions for improved setting of distributed Abinitio and relax runs (I have attached my flags and mpi scripts) ?

Cheers,

Stéphane

Wed, 2011-02-09 10:39
smg3d

Indeed, Rosetta JD2 defaults to the WorkPool implementation, and indeed, it uses node 0 for management, and nodes 1-N for computation.
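As a rough illustration of the WorkPool layout (one managing rank, the rest computing), here is a small Python simulation. It is a sketch under stated assumptions, not Rosetta's implementation: the function name `simulate_work_pool` and the job runtimes are made up, and communication and output blocking are ignored. It shows why "mpirun -np 8" can give at most a 7X speedup, which is consistent with the ~6X reported in the first post.

```python
import heapq


def simulate_work_pool(job_times, nprocs):
    """Simulate a master/worker pool and return the total wall time.

    Rank 0 only hands out jobs, so nprocs - 1 ranks do the work.
    Each worker that becomes free takes the next job in the list.
    """
    nworkers = nprocs - 1
    if nworkers < 1:
        raise ValueError("need at least 2 processes: 1 master + 1 worker")
    # Min-heap of the times at which each worker becomes free.
    free_at = [0.0] * nworkers
    heapq.heapify(free_at)
    for t in job_times:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + t)  # busy until start + t
    return max(free_at)


if __name__ == "__main__":
    jobs = [10.0] * 70                      # 70 equal jobs, made-up runtimes
    serial = sum(jobs)
    wall = simulate_work_pool(jobs, nprocs=8)
    print(f"speedup with -np 8: {serial / wall:.1f}x (upper bound 7x)")
```

Calling it with nprocs=1 raises an error, which mirrors the documentation's warning that a single processor gives you only a head node and no work gets done.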

1. I assume I am using MPIWorkPoolJobDistributor (based on documentation and the fact that I get 7 silent files with "mpirun -np 8")

Yes, you are, if you are using JD2.

2. Do flags like "jd2::mpi_work_partition_job_distributor" apply to AbinitioRelax and relax?

All JD2 applications accept this flag. I can describe how it works if you care. Relax definitely uses jd2.

AbinitioRelax is harder to answer. I'm 100% positive that there ARE abinitio methods using JD2, because I've communicated regularly with their developer, but I can't tell if any of them are in the release. It appears to me that vanilla AbinitioRelax STILL (ARGH!) uses the older job distributor (to which none of these flags or documentation applies). I don't know how the older one works but I think it's analogous to WorkPool.

I think you can access the jd2 versions of abinitio via the minirosetta executable (which is what goes out on BOINC?) None of this organization makes sense to me.

3. If so, what are the other flags I could try? (i.e. where do I find additional docs... there is no mention of available jd2 flags in "relax -help")

The only ones I ever use are jd2::mpi_work_partition_job_distributor (because I schedule jobs such that a master node is not needed; this is less true for fast abinitio than my slow code) and mpi_tracer_to_file, which ensures that you get non-garbled output (one output log file per CPU). The job distributor itself doesn't have too many flags. What do you want options for...?

4. What is the fourth MPI flavour mentioned in the docs ? ( "How To Use The New Job Distributor" )

There are a bunch of MPI job distributors specialized for ultra-large BlueGenes running super-fast-outputting abinitio. I don't know how any of them work. It sure sounds like they're relevant to you so I'll try to get Oliver (their author) interested...

5. Any suggestions for improved setting of distributed Abinitio and relax runs (I have attached my flags and mpi scripts) ?

Same as 4), I'll see what Oliver thinks.

I know that his job distributors are specialized for silent-file-only output. I'm a little confused by your options files, where you seem to request both silent and PDB output. Surely you aren't getting both?

Wed, 2011-02-09 11:11
smlewis

Steven is correct: use minirosetta to get jd2 version of abrelax.

Thu, 2011-02-10 03:24
olange

(that would be with the run:broker flag)

Thu, 2011-02-10 07:08
smlewis

Unfortunately, I have not found any info on run:broker... searching RosettaCommons, the only occurrence appears to be in the above comment...

Thu, 2011-02-10 11:05
smg3d

Regarding minirosetta...

1. I had never really looked into it. I just read quickly about it... so it is an all-in-one app. Somewhere I read it is the "developer" version of the protocols, whereas the individual apps are the "release" version... is that true?

2. I tried minirosetta, but right now it does not look like it is using jd2 for abinitio (see below).

3. I looked into the output logs of AbinitioRelax, relax and minirosetta for JobDistributor statements.

AbinitioRelax has only the following statement:
protocols.jobdist.JobDistributors:

relax has the following statements :
protocols.jd2.JobDistributor:
protocols.jd2.MPIWorkPoolJobDistributor:
protocols.jd2.PDBJobInputter:

minirosetta has only the following statement (ran with the same flags as I used for AbinitioRelax, see attachment):
protocols.jobdist.JobDistributors:

So, unless I am doing something wrong, minirosetta -abinitio DOES NOT use jd2 (i.e. to me, it looks like "AbinitioRelax -abinitio" or "minirosetta -abinitio" are using the same code...)

Cheers,

Stéphane

Thu, 2011-02-10 11:02
smg3d

Oliver says to use broker to get JD2 abrelax.

You are correct that minirosetta + -abinitio runs the same code as the AbinitioRelax executable.

I don't know quite what minirosetta is, I've never used it. I sort of wish it would go away since it breaks the pattern of one-function-per-executable. I think it's actually the BOINC-distributed executable (which has to be monolithic), so everything that someone wants to run on Rosetta@home has to set up a way to let it go through minirosetta.cc.

I will email Oliver and ask him to let us know how to USE the broker mode - hopefully it's just the same flags as abinitio.

Fri, 2011-02-11 13:07
smlewis

Thanks for the very useful information.

3. I was not looking for anything in particular. Just looking for what is available since not everything is readily available in the documentation (I am not complaining... I fully understand that in projects like Rosetta, we cannot expect 100% complete docs). Thanks to forums, we can fill the gaps and be in touch with people involved.

5. Yes, I realize there is no real need to output both PDB and silent. My initial intention was to have them just in case... (yes, I am getting both, but no, I am not using both... just using the PDB so far...). But your point is good... I should just remove that line...

Thu, 2011-02-10 10:33
smg3d

Hi all,

I am new to Rosetta and I am not exactly on the right wavelength to understand all of the posts in this thread. I am just wondering if there is a quick way for me to run AbinitioRelax on multiple processors. I would appreciate it if the explanation could be delivered in plain Rosetta-beginner English :D.

I have Rosetta installed on our server but I don't think it is an MPI-compiled version. Also, I read that condor can be used for running parallel jobs. Is this correct? Any suggestion is highly appreciated. Thank you very much in advance.

Surya

Tue, 2011-03-01 19:38
ssetiyaputra

There is an MPI-capable abinitio relax application for Rosetta3.2, activated by using the "minirosetta" application, using the "broker" flag.

There is a not-MPI-capable abinitio/relax application at AbinitioRelax.

There is no speed benefit to running in MPI compared with launching the same number of independent single-processor jobs; the only benefits are organizational (you can get all your output in one folder) and sysadmin (your sysadmin might force you to use MPI).

I can also hand you a patch to 3.1 (not 3.2) which crams the AbinitioRelax executable into MPI compatibility.

Compiling in MPI is simple with respect to rosetta: you just add extras=mpi to your scons command line.

Condor can be used for running parallel jobs. How to do so is a question for condor, not a question for Rosetta. Some of the Rosetta labs (not mine) do use condor, so Rosetta is compatible with it.

Wed, 2011-03-02 07:26
smlewis

Hi smlewis,

Thank you for your reply. If I want to use minirosetta for running MPI-capable abinitio relax, how should I execute the minirosetta? I have it installed and it is working, although I am not entirely sure what this broker flag does. Should I just include -broker in my flags file?

I tried the abinitiorelax.script that smg3d posted above. It worked, except that the output file has multiple entries with the same tag, e.g. I have multiple S_0000001.pdb structures in my silent file.

I really hope this post makes sense. Thank you for your assistance.

Surya

Wed, 2011-03-02 21:32
ssetiyaputra

I have no idea how to use the broker code either. I think it obeys the same flags as abinitiorelax, but its author has declined to comment. Perhaps smg3d figured it out...

The option is specifically

-run:protocol broker

"I tried the abinitiorelax.script that smg3d posted above. It worked, except that the output file has multiple entries with the same tag, e.g. I have multiple S_0000001.pdb structures in my silent file."

This is what is expected to happen when you run non-MPI code in an MPI setup. The MPI in Rosetta only communicates to prevent the outputs from overlapping - one processor takes S_001, one takes S_002, etc. With no MPI, there's no communication, so all N processors create S_001 and overwrite each other's work.
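The tag collision described above can be mimicked in a few lines of Python. This is only a toy model: the function names and the exact tag format are assumptions for illustration, and no Rosetta or MPI code is involved. Each uncoordinated process counts from 1 by itself, so the same tags appear once per process; a single shared counter hands out each tag exactly once.

```python
from collections import Counter


def tags_without_communication(nprocs, nstruct):
    """Every process numbers its own outputs from 1, unaware of the others."""
    return [f"S_{i:08d}"
            for _rank in range(nprocs)
            for i in range(1, nstruct + 1)]


def tags_with_communication(nstruct):
    """One shared counter: each tag is handed out exactly once."""
    return [f"S_{i:08d}" for i in range(1, nstruct + 1)]


def duplicated(tags):
    """Return the tags that occur more than once, sorted."""
    return sorted(t for t, n in Counter(tags).items() if n > 1)


if __name__ == "__main__":
    clash = tags_without_communication(nprocs=4, nstruct=2)
    print(len(clash), "outputs,", len(set(clash)), "distinct tags")
    print("duplicated:", duplicated(clash))
    print("coordinated duplicates:", duplicated(tags_with_communication(8)))
```

In the uncoordinated case, 4 processes asked for 2 structures each produce 8 outputs but only 2 distinct tags, which is the pattern of repeated S_0000001 entries reported above.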

Thu, 2011-03-03 11:11
smlewis

I got the following error message after adding -run:protocol broker to my flags. Does anyone have any idea? It seems that the AbrelaxMover is not working properly.

Thank you for all the pointers. They're highly appreciated. I think I understand more about what you guys were talking about in the older posts too.

Run script (I copied from smg3d and omitted the line that says module load) -> wondering if this causes the crash?:

#!/bin/tcsh
#$ -N centr-17
#$ -P eer-775-aa
#$ -l h_rt=00:40:00
#$ -cwd
#$ -o script1.log
#$ -e script1.err
#$ -pe default 8

time mpirun -np 15 /programs/nmr/x86_64/rosetta/rosetta-3.2-mpi/bin/minirosetta.linuxgccrelease @flags

Error message:
........
protocols.general_abinitio: AbrelaxMover: S_00001
core.scoring.constraints: Constraint choice: ./input/kf2.cen_cst
core.scoring.constraints: Constraint choice: ./input/kf2.cen_cst
core.scoring.constraints: Constraint choice: ./input/kf2.tetraL.fa_cst
protocols.general_abinitio: AbrelaxMover: S_00001
core.scoring.constraints: Constraint choice: ./input/kf2.cen_cst
core.scoring.constraints: Constraint choice: ./input/kf2.cen_cst
core.scoring.constraints: Constraint choice: ./input/kf2.tetraL.fa_cst
--------------------------------------------------------------------------
mpirun noticed that process rank 13 with PID 27046 on node bombobear.mmb.usyd.edu.au exited on signal 11 (Segmentation fault).

Thu, 2011-03-03 23:27
ssetiyaputra

A) bombobear is an awesome name for a node

B) a segfault is likely to be an error internal to rosetta, not an error related to MPI - try running without MPI (one processor) and seeing if it is duplicable. There's not enough information here to say anything else about it.

C) Try running the mpi version directly instead of the symlink. Instead of using /programs/nmr/x86_64/rosetta/rosetta-3.2-mpi/bin/minirosetta.linuxgccrelease, use /programs/nmr/x86_64/rosetta/rosetta-3.2-mpi/build/src/release/a/bunch/of/folders/mpi/minirosetta.mpi.linuxgccrelease. On my particular system, a/bunch/of/folders becomes linux/2.6/64/x86.
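If you would rather not walk through the build tree by hand, a short Python sketch like the following can list the candidate MPI builds. It assumes only what is described above: that the real executables live somewhere under the build/ directory and are named like minirosetta.mpi.linuxgccrelease. The function name `find_mpi_executables` is made up for this example, and the layout on your system may differ.

```python
import sys
from pathlib import Path


def find_mpi_executables(rosetta_root, app="minirosetta"):
    """List files named '<app>.mpi.*' anywhere under <rosetta_root>/build."""
    build_dir = Path(rosetta_root) / "build"
    if not build_dir.is_dir():
        return []
    return sorted(p for p in build_dir.rglob(f"{app}.mpi.*") if p.is_file())


if __name__ == "__main__":
    # Pass the Rosetta source directory as the first argument (default: cwd).
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    hits = find_mpi_executables(root)
    if not hits:
        print("no MPI build of minirosetta found under", Path(root) / "build")
    for path in hits:
        print(path)
```

An empty result suggests that no MPI build exists under that directory, in which case the executable in bin/ cannot be an MPI one either.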

Fri, 2011-03-04 08:08
smlewis

Lol... we have four nodes which are not yet connected to one another called rupert, tofu, bombo and superbears... :D Our IT guy was very creative. I will try the recommendations and see if I could get my run to work..

Fri, 2011-03-04 12:12
ssetiyaputra

It seems that the addition of -run:protocol broker to my flags file is the cause of the segmentation fault. Any idea how I should go about fixing that issue?

smlewis, I would like to try your patch for version 3.1 to get AbinitioRelax working with mpi. Would you send it to my email please? My email address is ssetiyaputra@gmail.com. Thank you.

Sun, 2011-03-06 16:50
ssetiyaputra

I don't know how any of the broker code works or what's wrong with it. Obviously someone who does know needs to write documentation for it for the 3.3 release (grumble, grumble).

I have emailed you the patch.

Tue, 2011-03-08 11:15
smlewis

Hi smlewis,

The patch works like a magic bullet. Our IT tech guy even managed to get it running smoothly on version 3.2 by doing some minor hacking. I will ask him to write up what he did so that it might be useful for other people too. I'll send it to your email once he has it written down.

Btw, how do you normally deal with the multiple out files? You have probably mentioned this before. Do you just need to combine all the out files into one? Thanks.

Mon, 2011-03-14 17:06
ssetiyaputra

Do you mean multiple silent files? I think there's an application called combine_silent (in 3.1 or maybe 3.2?) I think they can mostly be concatenated directly, but I've never tried. (For what it's worth, it's not MY patch, Grant Murphy made it).

Score files can certainly just be concatenated.

cat *sc > all_scores.sc
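For anyone who wants a slightly more careful version of that concatenation, here is a Python sketch. It rests on one assumption that you should check against your own files: each score file begins with a column header line, and that line is identical across files. The function name `combine_score_files` is hypothetical. The sketch keeps the header once, and it skips the output file if it happens to match the input pattern (which can happen when the combined file from an earlier run is still in the folder).

```python
from pathlib import Path


def combine_score_files(inputs, output):
    """Concatenate score files into `output`, keeping one header line.

    Assumption: the first line of each file is a column header that is
    identical across files. Returns the number of lines written.
    """
    output = Path(output)
    header = None
    merged = []
    for path in sorted(Path(p) for p in inputs):
        if path.resolve() == output.resolve():
            continue                      # never read the file being written
        lines = path.read_text().splitlines()
        if not lines:
            continue
        if header is None:
            header = lines[0]             # first file: keep everything
            merged.extend(lines)
        elif lines[0] == header:
            merged.extend(lines[1:])      # drop the repeated header
        else:
            merged.extend(lines)          # unexpected layout: keep all lines
    output.write_text(("\n".join(merged) + "\n") if merged else "")
    return len(merged)
```

A typical call would be combine_score_files(Path(".").glob("*.sc"), "all_scores.sc"). Everything is read before anything is written, so rerunning it in the same folder is safe.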

Tue, 2011-03-15 07:01
smlewis

In my experience, Segmentation faults mean that the Rosetta code made an assumption that didn't hold for your run. (Occasionally it's something as simple as an extra blank line in one of the input files.)

If you are able to easily do so, running a debug mode compile should give you more information. (For debug mode, omit the mode=release on the scons line, and use the resulting *.linuxgccdebug executables - I'm not sure how well debug mode plays with MPI, though.) Even without running it in a debugger, the debug compile has extra "sanity checks" that, while slowing the execution down, should give you more information about what went wrong. (i.e. The program should exit with an error message, rather than giving a segmentation fault.)

Fri, 2011-03-04 10:32
rmoretti

Hi smlewis,

I sent you a patch file for mpi to work on version 3.2. Hope you can get it to work on yours.

Thu, 2011-04-07 21:14
ssetiyaputra

Received - I'll pass it along if anyone needs it. Thanks!

Sat, 2011-04-09 11:05
smlewis

I want to use the 4 cores on my computer to finish a rosetta relaxation protocol ~4 times as fast. I ran >mpirun -np 4 relax.linuxgccrelease -s input.pdb @flags. NB the -overwrite flag was not present. I had scores of 4 structures named input_0001 and input_0002 (with different scores), but only one copy structure (presumably the last one, the others were overwritten).

Perhaps I need to install the mpi executables?

I would like the mpi patch for 3.2 - 3.3 is not working as well at this point. I tried to recompile rosetta to include protocols like relax.mpi.linuxgccrelease (right now I just have relax.linuxgccrelease, relax.default.linuxgccrelease, relax.linuxgccdebug) by running
>scons mode=release bin extras=mpi

however I got the following error

svn: '.' is not a working copy
scons: Building targets ...
mpiCC -o build/src/release/linux/2.6/64/x86/gcc/mpi/apps/public/AbinitioRelax.o -c -std=c++98 -pipe -ffor-scope -W -Wall -pedantic -Wno-long-long -O3 -ffast-math -funroll-loops -finline-functions -finline-limit=20000 -s -Wno-unused-variable -DNDEBUG -DUSEMPI -Isrc -Iexternal/include -Isrc/platform/linux/64/gcc -Isrc/platform/linux/64 -Isrc/platform/linux -Iexternal/boost_1_38_0 -I/usr/local/include -I/usr/include src/apps/public/AbinitioRelax.cc
scons: *** [build/src/release/linux/2.6/64/x86/gcc/mpi/apps/public/AbinitioRelax.o] Error 127
scons: building terminated because of errors.

Fri, 2011-07-29 16:33
gw

"I had scores of 4 structures named input_0001 and input_0002 (with different scores), but only one copy structure (presumably the last one, the others were overwritten). Perhaps I need to install the mpi executables? "

You have correctly identified the error and the solution.

"I would like the mpi patch for 3.2 - 3.3 is not working as well at this point."
The MPI patch discussed in this thread will have no effect on your current problem. It creates a new abinitio-relax executable; it won't affect the relax executable at all. I can forward it along to the email address to which your forum account is registered, if you'd like.

This error means you either don't have MPI installed on your computer, or don't have it in a place that SCons can path to. If it's not installed, try installing packages "mpich-bin" and "libmpich1.0-dev" or newer versions, or a different MPI of your choice. If it's not pathing properly, try "which mpiCC" to find where it is, then figure out how to get SCons to recognize it. You may have success by copying the file tools/build/site.settings.topsail to tools/build/site.settings; the topsail settings' purpose is to make SCons path to the MPI compilers on a particular system; maybe it will work for yours.
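The "Error 127" in the SCons output is the shell's usual exit status for "command not found", which fits the diagnosis above. As a quick check before rebuilding, a Python sketch along these lines does the same job as "which mpiCC". The function name `find_mpi_compiler` is made up, and the extra wrapper names mpicxx and mpic++ are common alternatives that may or may not exist on your system. Note that it only tells you what your own shell can see, which is not necessarily what SCons sees.

```python
import shutil


def find_mpi_compiler(candidates=("mpiCC", "mpicxx", "mpic++")):
    """Return (name, full_path) of the first MPI C++ wrapper on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return name, path
    return None


if __name__ == "__main__":
    found = find_mpi_compiler()
    if found is None:
        print("no MPI C++ compiler wrapper found on PATH")
    else:
        name, path = found
        print(f"{name} found at {path}")
```

If this finds a wrapper but the build still fails with Error 127, the compiler is installed and the remaining problem is getting SCons to use the same path.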

Mon, 2011-08-01 14:13
smlewis

It worked. I already had mpiCC but I needed to direct scons to it.

In the end I did the following things:

I found the path with ">which mpiCC"
I moved tools/build/site.settings.topsail to tools/build/site.settings AND commented out the line ""include_path" : os.environ["INCLUDE"].split(":")," since the INCLUDE variable was giving me problems.
I ran ">scons mode=release bin -j3 extras=mpi MPICXX=mpiCC MPI_INCLUDE=/usr/lib64/mpi/gcc/openmpi/include/ MPI_LIBDIR=/usr/lib64/mpi/gcc/openmpi/lib64/ MPI_LIB=mpi"
This website was useful for knowing what set the MPICXX MPI_INCLUDE etc variables as: http://xmipp.cnb.csic.es/twiki/bin/view/Xmipp/BuildingWithSCons#mpi

Tue, 2011-08-02 16:00
gw

Hi Gw, I am a new user of Rosetta and want to use MPI to run multiple jobs at the same time with a Rosetta application. I can't make head or tail of it. Can you please tell me how to launch my jobs for multiple processing, and what I need to do beforehand to set things up?

Please guide me, bearing in mind that I am a novice in this field.

Thanks

Thu, 2011-12-08 18:59
Gaurav_kumar