
MPI nodes hanging and output log incomplete


I'm using MPI with the PeptideDeriver pilot app (which I reworked to support JD2). I am running into some problems:

1. Nodes that have finished their jobs stall at 100% CPU usage while the overall MPI run is still incomplete.
2. Some tracer.log files don't contain all of the output. For instance, the master node may have received a success message from a worker node, yet not all of that job's output appears in the tracer.log file (the last part is missing).
3. Some jobs aren't finishing.

While (3) might be a problem with my protocol, (1) and (2) are clearly problems with either MPI or the way I am using it.

It's possible this has something to do with my source code, but since I'm just using JD2's go() method, and since all but the last job's output appears normally in the log file, I don't think that's the cause. The place where I produce output in the code is the report() method of a filter that I wrote. I created a FilterReporterMover (similar to FilterMover; actually, one that encapsulates it and just calls report() right after apply()) and pass it as the argument to go(). Perhaps I *should* call Tracer::flush() at the end of each report()?
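To make the setup concrete, here is roughly what the mover does. This is a simplified sketch only; the member names, includes, and exact structure here are illustrative, not my actual code:

```cpp
#include <core/pose/Pose.hh>
#include <protocols/moves/Mover.hh>
#include <protocols/filters/Filter.hh>
#include <basic/Tracer.hh>

static basic::Tracer TR( "FilterReporterMover" );

class FilterReporterMover : public protocols::moves::Mover {
public:
	virtual void apply( core::pose::Pose & pose ) {
		mover_->apply( pose );        // run the wrapped mover on the pose
		filter_->report( TR, pose );  // write the filter's report to the tracer
		// Is this where I should add TR.flush() (or end with std::endl),
		// so the report reaches tracer.log even if the run later hangs?
	}

	virtual std::string get_name() const { return "FilterReporterMover"; }

private:
	protocols::moves::MoverOP mover_;
	protocols::filters::FilterOP filter_;
};
```

The wrapped mover and filter are set up elsewhere; the point is just that the report is written through a Tracer, which is where I suspect the buffering question comes in.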

I was thinking the cause of (2) could be an OS buffering issue: if the process never finishes (because of (1)), the output might sit in a buffer before it is written to the file, and when I kill my MPI processes, that buffered data is lost. But I'm not sure.
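As a generic illustration of what I suspect is happening (plain C++, nothing Rosetta-specific): output still sitting in a stream buffer when the process is killed abnormally never makes it to disk.

```cpp
#include <cstdlib>
#include <fstream>

int main() {
	std::ofstream out( "demo.log" );
	out << "this line sits in the stream's buffer";
	// out.flush();  // uncommenting this writes the line to disk immediately
	std::abort();    // stands in for the process being killed mid-run:
	                 // the destructor never runs, so unflushed output is lost
}
```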

I can't figure out the reason for (1), and I think that solving it might solve most of my problems.

I'm running with OpenMPI 1.4.5 on a 15-node cluster. The command line (copied from the log file) is:

/path/to/bin/PeptideDeriver.mpi.linuxgccrelease -database /path/to/database -in:path /path/to/pdbs -in:file:l /path/to/pdblist -nstruct 1 -out:mpi_tracer_to_file tracer.out.48 -out:chtimestamp -randomize_missing_coords -peptide_deriver:length_to_derive 10 -ignore_unrecognized_res


I took one of the structures that apparently finished but wasn't output correctly, and it indeed finished fine when run locally with the non-MPI executable.

N.B. this is quite similar to what happened in this post: https://www.rosettacommons.org/node/3460

I'd appreciate any help anyone has to offer.
Thanks,
Yuval

Wed, 2014-06-18 05:40
yuvals

Some updates --

I have debugged some of the processes, and they all seem to be hanging in a loop in libmpi that continuously calls opal_cr_test_if_checkpoint_ready and opal_progress.

However, I tried running the same MPI setup (with the same PDB list) with score_jd2, and it didn't hang.

So I'm not sure whether the problem is in my implementation (my Mover/Filter/JD2 usage) or in the libraries themselves.

Sat, 2014-06-21 02:17
yuvals