Inquiry about MPI-Rosetta: Issue with -nstruct Parameter

Subject: Inquiry about MPI-Rosetta: Issue with -nstruct Parameter

Hello Rosetta community,

My name is Serena, and I have a question regarding MPI-Rosetta. I hope to articulate my issue clearly.

I am using a CPU with 14 cores and 28 threads. I have downloaded the mpi-rosetta application. When running an app with the -nstruct parameter, such as the antibody_h3 app, I often encounter a problem. After the run completes, there are no apparent error messages, but despite specifying -nstruct as 100, I only get around 20+ pdb files generated.

The runtime also seems off. According to the literature, the expected runtime should be several days, but my runs typically complete in approximately 1-2 hours. I have experimented with various -nstruct values, including 10, 20, 100, and 1000. With values up to 20, the results seem somewhat normal, although I can barely obtain around 10 structure files. However, when I increase the value beyond 20, the runs essentially finish with a maximum of 20+ structures. Is it a memory issue, or could there be another problem? I checked the Rosetta forum but couldn't find a specific reason.

I am curious to know if this is normal behavior or if there might be a specific reason causing this discrepancy. Your assistance in understanding and resolving this issue would be greatly appreciated.

Thank you,

Post Situation: 
Wed, 2024-01-24 00:59

If you're telling mpirun to run ~28 processes and you only end up with ~20 output structures, my guess is that some issue causes premature termination after ~1 output structure, which crashes the whole run before all the processes can move on to the next structure. That's supported by the fact that you're seeing it finish after 1-2 hours, rather than running for the full expected time.
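
One quick way to test that guess is to compare the number of structures on disk with both the requested -nstruct and the number of MPI processes. The following is only a minimal Python sketch: the function name `count_structures` and the assumption that all models land in a single output directory as `.pdb` or `.pdb.gz` files are mine, not something taken from Rosetta itself.

```python
from pathlib import Path


def count_structures(output_dir, nstruct):
    """Return (found, missing) for structure files in output_dir.

    Assumption: every model is written as a .pdb or .pdb.gz file
    directly inside output_dir (no subdirectories are searched).
    """
    found = sum(
        1
        for path in Path(output_dir).iterdir()
        if path.is_file() and path.name.lower().endswith((".pdb", ".pdb.gz"))
    )
    # Never report a negative number of missing structures.
    missing = max(nstruct - found, 0)
    return found, missing
```

For example, `count_structures("output", 100)` returning something like `(22, 78)` on a 28-process run would fit the picture of each process stopping after roughly one structure.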

One would hope that something is being printed to the tracer output. If it's a hard crash, though, the program might be killed before it is able to flush all the output to the file. One way of accounting for this is to run the program with output going directly to the terminal, without redirecting it at all. This should cause all the output to be printed immediately, so if there are any error messages, they'll be visible. Rosetta should print something when it crashes, and if it's exiting "successfully" but prematurely, there should be a clear message to that effect as well.
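
With 28 processes writing to one terminal, the interesting lines are easy to miss. If you save a copy of that terminal output, a small filter like the sketch below can pull out suspicious lines. The keyword list is purely my assumption about what a failure might look like, not a list of Rosetta's or your MPI launcher's actual messages, so adjust it to what you really see.

```python
import re

# Assumed keywords only; these are guesses, not documented Rosetta messages.
SUSPICIOUS = re.compile(
    r"error|exception|segmentation fault|killed|abort", re.IGNORECASE
)


def find_suspicious_lines(log_text, pattern=SUSPICIOUS):
    """Return (line_number, line) pairs for lines that match pattern."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if pattern.search(line)
    ]
```

Calling `find_suspicious_lines(open("run.log").read())` (where `run.log` is a hypothetical file name for the saved output) gives the line numbers to start reading from. The lines just before the first hit are often the most informative.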

The other thing to check is your MPI system and job-running system, to see if there's any way you can get extra diagnostic information from them. It could be the MPI job-launching system that is cancelling your job.


A final note, likely unrelated: Rosetta tends to be CPU-limited. Hyperthreading (14 physical cores but 28 threads) tends to work best when the processes are either IO-limited or doing different sorts of work (e.g. one job doing floating-point calculations while the other does integer-based work). If both are CPU-limited and doing the same sort of work, you don't get the scaling you expect. So when running Rosetta calculations, you typically want to scale things based on the number of physical cores, rather than the number of available hyperthreads. (Though you can run some tests to see how your particular CPU and protocol combination behaves.)
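
If you're unsure how many physical cores the machine reports, here is a minimal sketch for Linux that counts unique (physical id, core id) pairs in `/proc/cpuinfo`. It assumes the x86-style layout where each processor entry lists `physical id` before `core id`. On systems without those fields it returns 0, so treat it as a rough check only.

```python
def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        if key == "physical id":
            # Remember which socket the following core id belongs to.
            physical_id = value
        elif key == "core id":
            cores.add((physical_id, value))
    return len(cores)
```

On your machine, `count_physical_cores(open("/proc/cpuinfo").read())` should give 14 if the assumptions hold, while Python's `os.cpu_count()` reports logical CPUs and would give 28. The smaller number is the one to start from when choosing how many MPI processes to launch.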

Wed, 2024-01-24 08:46