I am new to MPI, so please correct any misuse of terms. My question is about optimizing an MPI Rosetta run on an HPC; a second (preferred) solution would be to optimize my RAM usage so that I don't have to run MPI at all, since running in serial would actually improve the overall models produced anyway.
First, a very simple question: in the Rosetta output, how do I know how many cores are being used by MPI? I see (0), (1), and (2) at the beginning of each line, so I'm assuming these mean cores 1, 2, and 3 produced that output line; is this correct? Basically, if I'm using 2 cores with MPI, how do I tell from the output that they actually got used?
I'm trying to optimize my Rosetta MPI run on an HPC. It is currently running 3X slower than expected, and from reading the documentation I think it might just be an option incompatibility, so I'm hoping someone with experience can advise. I have a lot of experience with HPCs that charge by the CPU, but I'm using Stampede2, which charges by the node, so things get complicated. Normally I'd prefer to run completely in serial, but the node has at most 96 GB of RAM and my system uses 3 GB per run, limiting me to about 35 runs per node (empirically tested). However, there are 68 available cores, hence the desire to include MPI to take advantage of the 36 idle cores (assuming I play it safe and run only 32 runs per node). Again, on this HPC you can only request an entire node, not individual CPUs.
So this is my plan of attack:
Request 1 node, which has 96 GB of RAM and 68 Xeon Phi cores. Run 32 independent jobs to avoid exceeding RAM, but give each job 2 cores working in MPI.
Programs I'm using to submit (screenshots attached):
I have 4 scripts that pass things along. First is my sbatch submission script, which uses the launcher module; its main objective is to request 32 tasks at a time with 2 CPUs per task. Next is a jobfile that lists the total number of commands (tasks) I'd like to run; for example, 320 tasks would be worked through as 10 successive rounds of 32 Rosetta runs on 32 (× 2 for MPI) CPUs, in parallel and independently. Each jobfile line runs the Rosetta command file, which must use the mpiexec (not mpirun) command, explained below. Finally, that calls rosetta.xml; this doesn't really play a role in the sbatch issue but is included for completeness.
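Since the screenshots may not come through, here is a simplified sketch of the submission script as I understand the chain (job name, time limit, and file names are generalized placeholders; `LAUNCHER_JOB_FILE` and `paramrun` are the launcher module's conventions as I've picked them up, so correct me if I'm misusing them):

```shell
#!/bin/bash
# submit.slurm -- sketch of the sbatch script using the launcher module.
#SBATCH -J rosetta_launcher
#SBATCH -N 1                  # one whole node (Stampede2 allocates full nodes)
#SBATCH -n 32                 # 32 launcher tasks running at a time
#SBATCH --cpus-per-task=2     # 2 cores per task, intended for 2 MPI ranks
#SBATCH -t 24:00:00

module load launcher
export LAUNCHER_WORKDIR=$PWD
export LAUNCHER_JOB_FILE=jobfile   # 320 lines, one Rosetta command per line

$LAUNCHER_DIR/paramrun             # launcher works through the jobfile
```

Each line of `jobfile` then invokes the Rosetta run script shown further down.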
Double checking run:
When I check top on the node the job is submitted to, 65 CPUs are in use (64 for Rosetta jobs and 1 for node management?). However, top also reports 2474 total tasks, 61 running and 2413 sleeping, and I'm not sure what these refer to; maybe threads? (NOTE: these numbers are from my latest 30-CPU test, not the 32-CPU setup I'm describing.) Unfortunately, top is the only monitoring program available (I normally use htop), so if there is a command that can officially check which CPUs are being used, I'd appreciate it. Also, the Rosetta output has a (0) and a (1) at the start of each line; I'm assuming this refers to 1 or 2 CPUs being used, but please clarify if I'm wrong. When I tested with 4 CPUs, the maximum number I saw was (2), so maybe a solution is to run 22 tasks with 3 MPI CPUs each?
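In case it helps a responder point me in the right direction, this is what I can run on the node; I assume standard tools like ps and taskset can show per-process CPU placement, but I may be misreading their output:

```shell
# Show, for each process, the CPU it last ran on (PSR column);
# on the compute node I would pipe this through grep for rosetta.
ps -eo pid,psr,stat,comm | head -5

# Show the CPU-affinity list of one process; $$ (this shell) is just
# a placeholder -- substitute a rosetta PID on the node.
taskset -cp $$

# Compare thread count vs. process count, to see which one matches
# top's "Tasks: 2474 total" line.
echo "threads:   $(ps -eLf | tail -n +2 | wc -l)"
echo "processes: $(ps -ef  | tail -n +2 | wc -l)"
```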
Output files (probably not helpful for troubleshooting, but maybe helpful for advising where to start looking):
The jobfile also saves the Rosetta terminal output to a file whose name embeds a unique Stampede2 job ID, a launcher job number (1-320), and a launcher task ID, which I believe is the ID of the CPU running the script. Each Rosetta silent file also carries these unique embedded codes, and all silent files are saved to a unique results directory.
So everything runs, and I believe it is doing MPI the way I'd like it to, but I don't know how to check other than the (x) at the start of each output line mentioned above. However, it is very inefficient: about 13 hours per model versus 4-6 hours on my laptop. I'm doing a homology model with symmetry, membrane, and loops of a protein that is 2100 AA in total, and I've spent a lot of time optimizing that part. My preference is obviously to run completely in serial, but the RAM limitation makes that impossible. I'd assume MPI would be faster than serial, not 3X slower, so I'm wondering if I can optimize the options here. I read the passage below about the "bind to none" option in OpenMPI, but again, I've never dealt with MPI, so the terminology is difficult for me. And again, top shows a lot of sleeping tasks, which is why I think this could be optimized. I've attached screenshots of the slurm submission script, the jobfile script, and the mpiexec Rosetta run script.
I've played around with options, but because of the queue and the 13 hours per model, it is taking days to weeks to get any meaningful results. I did test with a shorter 5-minute run, but it didn't scale. I'm continuing to try new things but thought I could save time by asking for advice. I have a feeling the problem is either the launcher slurm script or the way I'm calling MPI. The launcher slurm script requests -n 32 (number of tasks) and --cpus-per-task=2. In the Rosetta MPI run file I have this line:
mpiexec.hydra -genv I_MPI_FALLBACK=1 -n 2 rosetta_scripts.cxx11mpi.linuxiccrelease \
-genv I_MPI_FALLBACK=1 # must be included or the job fails with something about fabric and fallback being disabled; I don't understand this bit but include it because it works.
-n 2 # the number of CPUs per task? But the documentation I read suggested this is how many times the command is run, so I'm confused; maybe it should be 32?
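My current understanding, which I'd like confirmed, is that `-n` sets the number of MPI ranks (processes) launched for that one command, not the number of cores per rank; so with the launcher starting 32 copies of this script, the node runs 32 × 2 = 64 ranks. Here is the per-task run script as I have it, with that interpretation in comments (file and variable names are generalized from my setup, so treat them as placeholders):

```shell
#!/bin/bash
# run_rosetta.sh -- invoked once per jobfile line by the launcher.
# If my reading of the docs is right, -n 2 starts two MPI ranks of
# this single rosetta_scripts job (the ranks cooperate on one job,
# rather than the job being run twice).
mpiexec.hydra -genv I_MPI_FALLBACK=1 -n 2 \
    rosetta_scripts.cxx11mpi.linuxiccrelease \
    -parser:protocol rosetta.xml \
    -out:file:silent result_${SLURM_JOB_ID}_${LAUNCHER_TSK_ID}.silent
```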
There is also this passage:
Please note that mpirun automatically binds processes as of the start of the v1.8 series. Three binding patterns are used in the absence of any further directives:
- Bind to core:
- when the number of processes is <= 2
- Bind to socket:
- when the number of processes is > 2
- Bind to none:
- when oversubscribed
- It sounds like I should bind to core since each task runs with 2 processes, but really I'm running 32 × 2 processes on the node, so should I bind to none?
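One thing I'm unsure about: the quoted passage is about OpenMPI's mpirun, but I'm using Intel MPI's mpiexec.hydra, which as far as I can tell controls binding through I_MPI_PIN environment variables rather than a --bind-to flag. These are the two variants I'm planning to test (flag names taken from the respective docs, so please correct me if they don't apply here):

```shell
# Intel MPI (mpiexec.hydra): disable process pinning entirely and
# let the OS scheduler place the two ranks.
mpiexec.hydra -genv I_MPI_PIN=off -n 2 rosetta_scripts.cxx11mpi.linuxiccrelease ...

# OpenMPI equivalent of "bind to none", per the quoted passage:
mpirun --bind-to none -n 2 rosetta_scripts.cxx11mpi.linuxiccrelease ...
```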
- mpirun documentation:
- Sorry if this isn't so much a problem as a request to help me improve my setup; unfortunately, no one in my network has experience running Rosetta with MPI.
Thank you for your help, and please let me know if more information or clarification is needed.