Did RosettaScripts recently switch to fully parallel MPI?

Hi, I'm trying to figure out whether something isn't working properly or whether it's working as intended and I missed a change. I was using the Topology Broker via RosettaScripts from the 2016/Week 46 build of Rosetta, and when I ran my script via MPI, it would create one nstruct per core, not including the controlling core. I could see this in the log: different messages were marked with (0), (1), (2), etc., presumably indicating which core the message came from. So if my command was:

mpiexec -np 4 rosetta_scripts @flags -nstruct 10

It would output 30 structures to my silent file. However, I recently upgraded to the 2017/Week 18 version of Rosetta and have been trying to make use of the Hybridize protocol. With this version and script, Rosetta now outputs only 10 structures; I only see messages from core (0), and although I THINK it might be sending multiple repeat messages, in the log I only see it export each nstruct once.

However, if I run with -parser:view, run on four cores, and only request -nstruct 3, the first nstruct brings up 4 unique structures... so that extra rogue structure has me suspicious! I'm worried the script is wasting energy making a structure per core, but then only outputting one per nstruct, which gets overwritten as each core finishes. 

One possibility I considered is my hybridize protocol is outputting PDB files, which are getting overwritten, whereas the topology broker output to a silent file, from which I was able to extract unique PDBs when they had the same name. I'm going to do some more testing, but wanted to get this out there to see if anyone on the forum might have any ideas. 

Thu, 2017-07-13 09:02
beowulfey

Are you running the version of Rosetta compiled with MPI? The default compilation of Rosetta doesn't have MPI support - you have to compile a special "extras" build of Rosetta to get MPI support. If you attempt to run the non-MPI version of Rosetta under MPI, then you'll get multiple independent runs, each unaware of each other, and yes, they'll have issues because they're all attempting to overwrite each other's output. (And output to silent files will mask this somewhat, as structures will be appended to the silent file.)
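One quick way to check this is to see whether the executable is dynamically linked against an MPI library at all. The sketch below assumes a Linux system with `ldd`; the helper name `links_mpi`, the canned input lines, and the example path are illustrative assumptions, not part of Rosetta. A statically linked build would not show up this way.

```shell
#!/bin/sh
# links_mpi: hypothetical helper, not part of Rosetta.
# Reads `ldd` output on stdin and reports whether any MPI shared library
# (e.g. libmpi.so, or the older libmpich.so) is listed.
links_mpi() {
    if grep -qi 'libmpi'; then
        echo "yes"
    else
        echo "no"
    fi
}

# Canned ldd-style lines, made up for illustration only:
printf '%s\n' 'libmpi.so.40 => /usr/lib/x86_64-linux-gnu/libmpi.so.40' | links_mpi
printf '%s\n' 'libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6' | links_mpi

# Real usage (the path is a placeholder for your own build):
#   ldd /path/to/rosetta_scripts.mpi.linuxgccrelease | links_mpi
```

If this reports "no" for the binary you are launching under mpiexec, you are most likely running the default (non-MPI) build.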

The fact that you're getting a '(0)' in the tracer output argues against this, though, as that should normally only be present for MPI runs.  

What does your flags file look like? I'm a little confused as to why you're getting 30 output structures with an -nstruct of 10 -- with a properly working MPI run, the number of output structures per input structure should be the same as the -nstruct. That is, the nstruct is for the MPI run as a whole, not for each processor. Getting a better sense of your flags would help make sense of your observations.
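To make the arithmetic concrete, here is a small sketch of the two scenarios described above for `mpiexec -np 4 ... -nstruct 10`. The function names are hypothetical and only restate the counting argument; they do not call Rosetta.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Rosetta) that restate the counting
# argument for `mpiexec -np NP <app> -nstruct N`.

# MPI-aware build: -nstruct applies to the run as a whole.
expected_mpi_outputs() {
    nstruct=$1
    echo "$nstruct"
}

# Non-MPI build launched under mpiexec: NP independent copies each
# attempt all N structures, unaware of one another.
expected_independent_attempts() {
    nstruct=$1
    np=$2
    echo $(( nstruct * np ))
}

expected_mpi_outputs 10              # prints 10
expected_independent_attempts 10 4   # prints 40
```

In the second case the 40 attempts would collide on the same 10 PDB file names, whereas a silent file would simply accumulate them. Note that the 30 structures reported in the original post match neither figure.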

 

Thu, 2017-07-13 09:11
rmoretti

Hi rmoretti
I came across your reply here when looking for a solution to my own little problem, and I'm hoping you can advise (see https://www.rosettacommons.org/node/10669).
Basically, I'm getting 12 PDB files (nstruct is 15, but I hit the walltime and the job was killed), but the score file has a lot more lines of output than this: 353 scores, to be exact. Is it possible that the parallel processes are generating lots of structures and simply overwriting each other's PDBs? The command is below:
 

mpiexec $HOME/rosetta_src_2019.22.60749_bundle/main/source/bin/fixbb.mpi.linuxgccrelease -s filename.pdb -ex1 -ex2 -resfile resfile.txt -nstruct 15 -overwrite -linmem_ig 10

 

thanks

dan

Mon, 2019-11-04 08:42
dantimatter

Responded in other thread.

Mon, 2019-11-04 11:56
rmoretti

RosettaScripts remains parallelized only at the "embarrassingly parallel" level where separate processes run separate PDBs.

If your old code said "nstruct 10" and produced 30 structures - that's a bug, not a feature.  It should only produce nstruct, not nstruct x number of processes.  I would assume that it's not actually running in MPI, and the silent file machinery is being nice to you and automatically remangling the duplicated structure names (although that's not consistent with you seeing more than one process ID number in the tracers).

If your current MPI code only ever has reports from process 0 (the head node) - most likely every process thinks it is the head node and it's not properly in MPI - although, again, I would expect to see duplicate lines from the job distributor.

I don't know enough about what the view option does to comment.

You have not mentioned that you COMPILED in MPI - only running in MPI.  It's definitely an "is it plugged in" question, but it's worth asking.  

 

Thu, 2017-07-13 09:15
smlewis

I did compile Rosetta with MPI, but I just realized I never specified a site.settings file. Actually, that might be a good explanation (can it find the headers if they are in default locations?), although SCons didn't complain when I tried to build! This is true for both versions I've used, so I'm surprised that some sort of MPI seemed to be happening with the first one.

I attached two sets of flag files here: the "topo" file was used with my previous rosetta build, and the Topology Broker protocol. The "hybrid" file was used with the more recent version, designed for the Hybridization protocol. Let me know if anything stands out.

I also thought the fact that I was getting [nstruct * # of cores] was pretty weird, but technically speaking neither of these runs really worked properly. As @rmoretti pointed out, at least for this latest version it seems that it's running with pseudo-MPI, but is actually a bunch of independent runs that aren't aware of each other. I'll try recompiling it with a proper site.settings file real quick -- sometimes the simplest solution is the right one...

Thu, 2017-07-13 10:56
beowulfey

As an update, I think I found the problem. It wasn't the missing site.settings; instead, I think it dates from an Ubuntu upgrade, when I installed mpich without realizing I still had OpenMPI on the same machine. Rosetta was compiling against the mpich libraries but perhaps was getting confused (or maybe mpich doesn't work as well? not sure). Anyway, I removed mpich and recompiled with OpenMPI, and now it behaves as intended. It's properly splitting the number of nstructs by the number of nodes, not including the master node. I can see the (1), (2), (3), etc. in the trace as well.

Anyway, thanks for the help!

Thu, 2017-07-13 14:21
beowulfey

That would explain it -- if you're compiling with MPICH but trying to use the OpenMPI launcher (or vice versa), the MPI framework won't be set up appropriately, and it might appear to Rosetta that you're running multiple independent serial jobs. (Though you'd still get the '(0)' in the tracer output.)
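To spot this kind of mismatch, you can compare which MPI implementation the launcher on your PATH belongs to with the MPI library the binary is linked against. The sketch below is only a heuristic: `mpi_flavor` is a hypothetical helper, the strings it matches are ones these launchers commonly print in their `--version` output, and the path in the usage comment is a placeholder.

```shell
#!/bin/sh
# mpi_flavor: hypothetical helper, not part of Rosetta or any MPI package.
# Reads text on stdin (e.g. from `mpiexec --version`) and guesses the
# MPI implementation from strings commonly found in that output.
mpi_flavor() {
    text=$(cat)
    case "$text" in
        *"Open MPI"*|*OpenRTE*) echo "openmpi" ;;
        *HYDRA*|*MPICH*)        echo "mpich" ;;
        *)                      echo "unknown" ;;
    esac
}

# Canned input, for illustration only:
printf '%s\n' 'HYDRA build details:' | mpi_flavor

# Real usage, assuming mpiexec is on PATH:
#   mpiexec --version | mpi_flavor
#   ldd /path/to/rosetta_scripts.mpi.linuxgccrelease | grep -i mpi
```

If the launcher reports one implementation while the library paths from `ldd` point at the other, recompiling against the implementation you launch with (or launching with the matching mpiexec) is the thing to try.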

Thu, 2017-07-13 14:47
rmoretti