AbInitioRelax.mpi Hangs - Waiting for Job Request


Hi guys,

 

I recently downloaded and compiled Rosetta with MPI capabilities to take advantage of the 32 core processor we have on our workstation. Compilation went well, and I can call protocols - but they all seem to hang.

 

To help narrow things down, I am working out of the DeNovo Structure Prediction tutorial demo directory - I can call the protocol and it seems to start running as normal:

 

mpirun -n 32 $ROSETTA_MPI/main/source/bin/AbinitioRelax.mpi.linuxgccrelease @input_files/options

 

Everything starts up as normal, but it always ends up hanging on this output:

 

~$: protocols.jobdist.JobDistributors: (0) Master Node -- Waiting for job request; tag_ = 1

 

I dug around these forums, and it seems the code is still trying to run on only one core, though I'm not sure why. Is there a way to specify that I want to run on many cores? I thought that was the whole point of compiling the binaries with extras=mpi.
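
 

For reference, I built it more or less like this (from memory, so the exact flags may be slightly off; -j 32 just matches our core count):

cd $ROSETTA_MPI/main/source
./scons.py -j 32 mode=release extras=mpi bin

That finished without complaint and produced the *.mpi.linuxgccrelease binaries I'm calling above.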

 

I looked into the code where it gets stuck, and it seems like it is waiting forever on a return from the MPI_Recv() function. I could be wrong, though - I can't read C++ all that well:

(From protocols.jobdist.JobDistributors, around lines 418-455)

while ( true ) {
	int node_requesting_job( 0 );

	JobDistributorTracer << "Master Node -- Waiting for job request; tag_ = " << tag_ << std::endl;
	MPI_Recv( & node_requesting_job, 1, MPI_INT, MPI_ANY_SOURCE, tag_, MPI_COMM_WORLD, & stat_ );
	bool const available_job_found = find_available_job();

	JobDistributorTracer << "Master Node --available job? " << available_job_found << std::endl;

	Size job_index = ( available_job_found ? current_job_ : 0 );
	int struct_n  = ( available_job_found ? current_nstruct_ : 0 );
	if ( ! available_job_found ) {
		JobDistributorTracer << "Master Node -- Spinning down node " << node_requesting_job << std::endl;
		MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
		break;
	} else {
		JobDistributorTracer << "Master Node -- Assigning job " << job_index << " " << struct_n << " to node " << node_requesting_job << std::endl;
		MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
		MPI_Send( & struct_n,  1, MPI_INT, node_requesting_job, tag_, MPI_COMM_WORLD );
		// ++current_nstruct_; handled now by find_available_job
	}
}

// we've just told one node to spin down, and
// we don't have to spin ourselves down.
Size nodes_left_to_spin_down( mpi_nprocs() - 1 - 1 );

while ( nodes_left_to_spin_down > 0 ) {
	int node_requesting_job( 0 );
	int recieve_from_any( MPI_ANY_SOURCE );
	MPI_Recv( & node_requesting_job, 1, MPI_INT, recieve_from_any, tag_, MPI_COMM_WORLD, & stat_ );
	Size job_index( 0 ); // No job left.
	MPI_Send( & job_index, 1, MPI_UNSIGNED_LONG, node_requesting_job, tag_, MPI_COMM_WORLD );
	JobDistributorTracer << "Master Node -- Spinning down node " << node_requesting_job << " with " << nodes_left_to_spin_down << " remaining nodes." << std::endl;
	--nodes_left_to_spin_down;
}

}

 

Any help is appreciated!

 

Thanks!
 

Nathan

Thu, 2019-10-31 07:30
nleroy

In your output, are you getting any '(1)' or other such (non-zero) labels?

The other thing I would double check is that the MPI libraries you compiled against are the proper "flavor" and version to go with the mpirun command you're using. If you have a "flavor" mismatch (e.g. running a Rosetta compiled against OpenMPI under an MPICH2 mpirun), you might have issues getting Rosetta to recognize that it's running under MPI.
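
For example, something along these lines (on Linux, using the paths from your post) should tell you which flavor and version each piece comes from:

# Which MPI launcher you're actually invoking, and its flavor/version
mpirun --version

# What the MPI compiler wrapper expands to (OpenMPI accepts --showme; MPICH uses -show)
mpicc --showme

# Which libmpi the Rosetta binary is linked against (assuming the default dynamic link)
ldd $ROSETTA_MPI/main/source/bin/AbinitioRelax.mpi.linuxgccrelease | grep -i mpi

If mpirun reports one flavor but the binary links against the other flavor's libmpi, that mismatch would be my first suspect.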

Mon, 2019-11-04 13:02
rmoretti

I just ran it again, and it appears that all outputs have '(0)' as a label - no non-zero labels.

 

I need to double-check the MPI libraries. Do you have a suggestion as to how I can check that? I am running the protocols with mpirun, I have OpenMPI installed, and when I compiled Rosetta it was calling mpicc to compile the source. One thing that may or may not be relevant: to get the code to compile with extras=mpi, I had to comment out all of the header-file environment variables in the site.settings file. Both the INCLUDE and LD_LIBRARY_PATH environment variables were empty when I compiled, and the build only succeeded after I told it to ignore them.

 

I am not sure if this is sufficient information! Let me know... Thank you!

Fri, 2019-11-08 12:44
nleroy