
UnfoldedStateEnergyCalculator MPI Error


Hi all,

I'm trying to use UnfoldedStateEnergyCalculator for some nonstandard amino acids, and I'm hitting a persistent MPI_Recv error when running in parallel mode. I'm using Rosetta 3.8 compiled with icc 17.0.4 and impi 17.0.3, running on TACC Stampede2. The error happens for every input I've tried (different amino acids, stock and modified databases, smaller sets of input PDBs), and I've been able to run the Rosetta 3.4 version of UFSEC successfully with the same inputs on an older cluster (Stampede). UFSEC 3.8 does run in serial mode, but it seems to take much longer than it should (5-10 min per PDB). The sysadmins are stumped, and I've tried all of the easily googleable solutions without any luck. As far as I can tell, one of the MPI_Recv() calls in UnfoldedStateEnergyCalculatorMPIWorkPoolJobDistributor is choking (see error output below).

Is this a bug, or can someone suggest a way to get this working?

Running with command:

ibrun $TACC_ROSETTA_BIN/UnfoldedStateEnergyCalculator.cxx11mpi.linuxiccrelease -database /work/02984/cwbrown/stampede2/Data/Rosetta/rosetta3.8_database_nsAAmod -ignore_unrecognized_res -ex1 -ex2 -extrachi_cutoff 0 -l /work/02984/cwbrown/stampede2/Data/Rosetta/ncAA_rotamer_libs/scripts/cullpdb_list.txt -mute all -unmute devel.UnfoldedStateEnergyCalculator -unmute protocols.jd2.PDBJobInputer -residue_name NBY -no_optH true -detect_disulf false > ufsec_log_NBY.txt&

===========

Error:

===========

...

protocols.jd2.MPIWorkPoolJobDistributor: (2) Slave Node 2: Requesting new job id from master

protocols.jd2.MPIWorkPoolJobDistributor: (3) Slave Node 3: Requesting new job id from master

protocols.jd2.MPIWorkPoolJobDistributor: (4) Slave Node 4: Requesting new job id from master

protocols.jd2.MPIWorkPoolJobDistributor: (5) Slave Node 5: Requesting new job id from master

protocols.jd2.MPIWorkPoolJobDistributor: (1) Slave Node 1: Requesting new job id from master

protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Getting next job to assign from list id 1 of 5

protocols.UnfoldedStateEnergyCalculator.UnfoldedStateEnergyCalculatorMPIWorkPoolJobDistributor: (0) Master Node: Waiting for job requests...

TACC:  MPI job exited with code: 14 

TACC:  Shutdown complete. Exiting. 

Fatal error in MPI_Recv: Message truncated, error stack:

MPI_Recv(224)...........................: MPI_Recv(buf=0x7ffd7a5fb534, count=1, MPI_INT, src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0x7ffd7a5fb520) failed

MPIDI_CH3_PktHandler_EagerShortSend(455): Message from rank 1 and tag 10 truncated; 8 bytes received but buffer size is 4
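
For what it's worth, the error text itself just says that rank 1 sent an 8-byte message with tag 10 while the receiving side had only posted a single MPI_INT (4-byte) buffer. Here's a minimal standalone sketch (my own illustration, not the Rosetta code; the file name truncation_demo.cpp is just made up) that reproduces the same "Message truncated" signature:

// truncation_demo.cpp -- illustration only, not Rosetta code.
// The sender ships more bytes than the receiver's posted buffer can hold,
// which trips MPI's fatal "Message truncated" error.
// Build: mpicxx truncation_demo.cpp -o truncation_demo
// Run:   mpirun -np 2 ./truncation_demo
#include <mpi.h>
#include <cstdio>

int main( int argc, char ** argv ) {
    MPI_Init( &argc, &argv );
    int rank = 0;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    if ( rank == 1 ) {
        // Sender: two ints (8 bytes) with tag 10.
        int payload[2] = { 42, 43 };
        MPI_Send( payload, 2, MPI_INT, 0 /*dest*/, 10 /*tag*/, MPI_COMM_WORLD );
    } else if ( rank == 0 ) {
        // Receiver: room for only one int (4 bytes), so the default
        // MPI_ERRORS_ARE_FATAL handler aborts with
        // "Message truncated; 8 bytes received but buffer size is 4".
        int job_id = 0;
        MPI_Status status;
        MPI_Recv( &job_id, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &status );
        std::printf( "received %d\n", job_id );
    }

    MPI_Finalize();
    return 0;
}

So the master's MPI_Recv buffer and whatever the slave sends back seem to disagree on size or type; whether that's a genuine bug in the job distributor or something specific to impi 17 on Stampede2, I can't tell.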

Mon, 2017-11-13 09:43
colin.walsh.brown

There was a bugfix for this code - although not necessarily for this ISSUE - in mid-May (after 3.8). I would suggest you try the most recent weekly release to see if the problem goes away. I'll also tag Doug (the author of this code).

Mon, 2017-11-13 10:43
smlewis