
Mpirun launches processes but does not run


Dear Rosetta Community,

I have been frequently and successfully using the MPI build of rosetta_scripts on my institute's high-performance computing cluster for the past couple of years. Recently, however, I have been running into parallelisation issues. The command I use to submit jobs on our cluster via PBS is as follows:

qsub -P chemical -N rosetta -l select=200:ncpus=1 -l walltime=168:00:00 design.sh

The above command requests 200 cores (not necessarily on the same node) for 168 hours and runs a script called design.sh. The script is as follows:

## design
cd $PBS_O_WORKDIR                               # cd to the directory the job was submitted from
module load apps/Rosetta/2020.03/intel2019      # Load rosetta module

folder=outputs
nstruct=9999

rm -f tracer*                                   # remove old tracer logs (-f: no error if none exist)
rm -rf $folder
mkdir $folder

mpirun -np $PBS_NTASKS $ROSETTA_BIN/rosetta_scripts.mpi.linuxiccrelease -parser:protocol design.xml \
-s complex.pdb \
-nstruct $nstruct \
-overwrite -write_all_connect_info \
-jd2:failed_job_exception false \
-out:path:pdb ./$folder/ \
-out:file:scorefile design.fasc \
-mpi_tracer_to_file tracer.log

This script instructs mpirun to launch as many processes as were requested (here, 200). These scripts worked fine for many months, but over the past few weeks the tracer.log_* files are no longer being produced, even though the "rosetta_scripts" processes can be seen running on the compute nodes with zero memory and CPU utilisation. I have attached a screenshot to demonstrate this: it shows five "rosetta_scripts" processes launched on a node named csky111 without actually utilising any resources. Most importantly, the issue is intermittent: the same script sometimes works perfectly and quite often fails, and I am unable to find any pattern to why it fails or succeeds.
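For reference, one thing I can do is log what PBS actually hands to mpirun at job start. A minimal sketch, assuming a PBS environment that sets PBS_NODEFILE (the lines would go in design.sh just before the mpirun call; outside a PBS job they simply print "unset"):

```shell
# Log the allocation before launching Rosetta
echo "PBS_NTASKS   = ${PBS_NTASKS:-unset}"
echo "PBS_NODEFILE = ${PBS_NODEFILE:-unset}"
# If the nodefile exists, show each host and how many slots it contributes
[ -n "${PBS_NODEFILE:-}" ] && sort "$PBS_NODEFILE" | uniq -c
```

Comparing this output between a failing and a succeeding run would show whether the allocation itself differs.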

Some observations that might help with resolving the issue:

  1. Scripts work fine when running on < 50 cores.
  2. Memory allocated to each process was not a limiting factor.
  3. -l select=1:ncpus=96 (all 96 cores on the same node) is more likely to succeed than -l select=96:ncpus=1 (cores possibly scattered across nodes).
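Given observation 3, it might help to separate MPI launch problems from Rosetta itself. A minimal sanity check I could submit with the same resource request (a sketch only: "hostname" stands in for rosetta_scripts, and mpi_test.sh is a hypothetical script name):

```shell
## mpi_test.sh - same allocation, trivial payload
cd $PBS_O_WORKDIR
module load apps/Rosetta/2020.03/intel2019   # same environment as design.sh

# If every rank prints its host, the multi-node launch itself is healthy;
# if this also stalls on scattered nodes, the problem is MPI/fabric, not Rosetta.
mpirun -np $PBS_NTASKS hostname
```

Submitted with, e.g., qsub -P chemical -N mpitest -l select=200:ncpus=1 -l walltime=00:10:00 mpi_test.sh.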

 

What could be the likely issue, and how might it be resolved? Would you recommend re-installing the application? Any thoughts on this would be really helpful.

Thank you,

Akshay

Attachment: hpc_jan22_error.png (133.03 KB)
Wed, 2022-02-16 03:09
chenna