I'm quite new to Rosetta (and computational approaches in general; I've only been using Linux and bash-based interfaces for about six months), and I've spent a few weeks getting to grips with the docking process, which I now think I understand fairly well. I've run into a problem that I hope you can help me with. Apologies in advance for any naivety, confusion, or incorrect terminology; I'm still very much an amateur, so I might not explain myself in the best way.
What I want to do
I've organised my docking flags file and relaxed my input structure, and I'm now at the stage where I want to do a global docking production run of ~100,000 models or more across 100 CPU cores. For this I'm using my university's High Performance Computing cluster, which uses SLURM as the resource manager and supports MPI; the IT team have installed and compiled Rosetta with MPI enabled. I'm attempting to run my simple global docking job over a number of CPUs and nodes. I've read the Rosetta MPI documentation, and as far as I understand it, it should be as straightforward as executing the MPI build of the docking executable and giving SLURM the relevant directives to allocate CPU/node resources.
The problem I'm having
The problem is that after submitting, I can see the resources have been allocated, but I don't think the CPUs are actually being utilised. I've done a few trial runs producing 10 models on 1 node (which has 28 CPU cores), with each run assigning more tasks to the node (1, 2, 4, 6, 8 and 10 tasks per node across 6 runs, with 1 CPU core allocated to each task). I'm not seeing anything like a linear reduction in processing time as the task count increases; I would expect a node running 10 MPI tasks to take roughly 1/10th the time of the same node running 1 task. In fact, I'm seeing barely any improvement at all, so I suspect there's an issue with my setup. I'm fairly sure the SLURM submission script is fine, because I can see (for example, in the 10-task test run) that 10 CPUs were allocated to the job, which makes me suspect that the Rosetta side isn't working as intended. If anyone could have a look and suggest what I might be doing wrong, I'd be forever grateful! I'll leave the tasks-per-node vs processing-time data and my script information down below.
Processing time relative to tasks allocated per job
| Tasks per node | Process time |
| --- | --- |
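To put numbers on the scaling I'm (not) seeing, I've been comparing wall times between runs. A small sketch of that comparison, where `t1` and `t10` are made-up placeholder timings (the real values come from the Elapsed column of `sacct -j <jobid> --format=Elapsed`):

```bash
# Compare a 1-task run against an n-task run. The timings below are
# hypothetical placeholders; substitute the Elapsed wall times reported
# by sacct for the real jobs. Ideal MPI scaling would give speedup ~= n
# (efficiency ~= 100%).
t1=600    # wall time (s) of the 1-task run  (placeholder)
t10=480   # wall time (s) of the 10-task run (placeholder)
ntasks=10

awk -v t1="$t1" -v tn="$t10" -v n="$ntasks" \
    'BEGIN { s = t1 / tn; printf "speedup: %.2fx  efficiency: %.1f%%\n", s, 100 * s / n }'
# prints: speedup: 1.25x  efficiency: 12.5%
```

With the placeholder numbers that's 12.5% efficiency, i.e. roughly the flat scaling I'm describing, rather than the ~100% I'd hoped for.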
SLURM submission script (saved as test_script.sh)
```bash
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=<either 1, 2, 4, 6, 8 or 10>
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1000M

module load apps/rosetta/2018.33
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

srun docking_protocol.mpi.linuxgccrelease @test_flag
```
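In case it matters, this is how I've been varying the task count between runs: rather than editing the script each time, I override the directive on the command line, since sbatch options given on the command line take precedence over the matching `#SBATCH` lines in the script. Sketched here as a dry run that just prints each command (drop the leading `echo` to actually submit):

```bash
# Dry-run sketch of the task-count sweep: print one sbatch command per
# task count. Command-line sbatch options override the corresponding
# #SBATCH directives inside test_script.sh.
for n in 1 2 4 6 8 10; do
    echo sbatch --ntasks-per-node="$n" test_script.sh
done
```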
Docking flag (saved as test_flag)
```
-in:file:s F1_didomain_global.pdb
-nstruct 10
-partners A_B
-dock_pert 3 24
-spin
-randomize1
-randomize2
-ex1
-ex2aro
-out:suffix _test
-score:docking_interface_score 1
```
Files in my working directory
```
[n00baccount@topsecretHPC tester]$ ls -lct
total 512
-rw-r--r-- 1 n00baccount bioc    292 Aug 17 10:22 test_script.sh
-rw-r--r-- 1 n00baccount bioc    175 Aug 17 09:34 test_flag
-rw-r--r-- 1 n00baccount bioc 359148 Aug 16 22:05 F1_didomain_global.pdb
```
Example of submission
```
[n00baccount@topsecretHPC tester]$ sbatch test_script.sh
Submitted batch job 3961469
```
Example of SLURM showing resources being allocated for two different jobs (1 task per node vs 10 tasks per node with 1 CPU per task)
```
[n00baccount@topsecretHPC tester]$ sacct -u n00baccount
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3961468        test_job       test    default          1    RUNNING      0:0
3961468.0    docking_p+               default          1    RUNNING      0:0
3961469        test_job       test    default         10    RUNNING      0:0
3961469.0    docking_p+               default         10    RUNNING      0:0
```
Any help would be very much appreciated.
Thanks very much for reading!