
Can I assume the outputs from different runs are from the same batch?


Dear friends,
I am using "minirosetta.linuxgccrelease" for homology modelling. At least 1000 outputs are recommended. However, after running for 3 days, my PC had a problem and only ~500 outputs were generated. If I start a second run, can I just generate another 500 outputs, combine them with the outputs from the first run, and treat those 1000 outputs as if they came from the same batch?

To extend the question: in most cases, can I assume that the outputs from different runs come from the same batch? (Of course, all the input files and the command line remain identical.)

Thank you very much.

Yours sincerely
Cheng

Mon, 2014-11-03 06:34
lanselibai

Yes, for most Rosetta protocols (although not all), each output structure for a given input (starting files and options) is produced by exactly the same protocol as any of the others. The only difference is the random number draws used at various points in the protocol. So doing two runs with 500 outputs each should be equivalent to doing a single run with 1000 outputs (or 100 runs with 10 outputs each).

The only caveat is that Rosetta uses *pseudo*random numbers rather than true random numbers, so each individual run must use a different seed; using the same seed will produce the same structures. The default behavior should handle this: by default Rosetta seeds the PRNG with a number drawn from your system's entropy source, so different runs will most probably have different seeds. Some people don't trust that, though, and explicitly set the seed with "-constant_seed -jran XXXXXXXXX", where XXXXXXXXX is the integer seed value to use. If you do this, you simply have to make sure that different runs always have different seed values. As I understand it, this includes restarting: if you manually set a seed, you'll want to change it before you restart a run, otherwise you may regenerate the same structures. (The default behavior picks a new random seed on restarting, so it doesn't have this issue.)

Mon, 2014-11-03 07:27
rmoretti

Hi R Moretti,
Thank you for your help. For my first run, I was using:

-run:constant_seed

-run:jran 1111111

So for my second run, I should:
1) Restart Ubuntu
2) Use:
-run:constant_seed

-run:jran 1111110 #just another number

Is this correct?

May I ask, is the number in "jran 1111111" the seed? If I can change the seed manually, why is the option called "constant" in "-run:constant_seed"?

Based on your explanation so far, my understanding is that each seed value corresponds to a fixed number. Is this right? Thank you.

Yours sincerely
Cheng

Mon, 2014-11-03 08:48
lanselibai

Yes, the number passed to -run:jran is the seed value, which is just a number (specifically, a 32-bit signed integer). This number "primes" the PRNG. For a given seed, the sequence of results from the PRNG will be the same every time it's restarted with that seed, though each element in the sequence should look random. (See http://en.wikipedia.org/wiki/Pseudorandom_number_generator for more.)
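
As a rough illustration of that behaviour, here is a small sketch using Python's own "random" module as a stand-in. This is not Rosetta's generator, and the helper name "draws" is made up for the example. It only shows the general idea: re-seeding with the same value replays the same sequence, while a different seed gives a different one.

```python
import random


def draws(seed, n=5):
    """Return the first n draws from a generator primed with `seed`.

    Hypothetical helper for illustration; Python's PRNG is only a stand-in
    for the one Rosetta uses.
    """
    rng = random.Random(seed)  # private generator, leaves global state alone
    return [rng.random() for _ in range(n)]


first = draws(1111111)
restarted = draws(1111111)  # same seed again, like a restart with a fixed seed
other = draws(1111110)      # a different seed

print("same seed replays the sequence:", first == restarted)
print("different seed gives a different sequence:", first != other)
```

Each individual value still looks random; it is only the sequence as a whole that is fixed by the seed.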

The reason it's called "constant_seed" is that, instead of auto-determining a seed which changes every time Rosetta runs, Rosetta will always use the user-provided seed value when -constant_seed is set, regardless of how often it's restarted. So it's constant from Rosetta's perspective, rather than from the perspective of a user doing multiple runs.

Offsetting the seed by one should work for standard single-processor runs, but when Rosetta does MPI/multithreaded runs it has an internal scheme for offsetting the random seeds, so a consistent offset pattern like that can be risky. It is much better to pick an arbitrary seed that isn't related to previous ones (but that you know to be different from the others you use). You can use a command like " python -c 'import random; print(random.randint(-2**31, 2**31 - 1))' " to print an arbitrary number in the appropriate range for use as a seed.
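
If you need seeds for several runs at once, a slightly longer script can also guarantee they differ from each other and from seeds you have already used. This is only a sketch: the function name "pick_seeds" and the "already_used" argument are my own invention, and the only Rosetta-specific parts are the two flags already discussed in this thread.

```python
import random

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def pick_seeds(n_runs, already_used=()):
    """Pick n_runs distinct seeds in the 32-bit signed range.

    Hypothetical helper: skips any value listed in `already_used`.
    """
    used = set(already_used)
    rng = random.SystemRandom()  # draws from the operating system's entropy source
    seeds = []
    while len(seeds) < n_runs:
        candidate = rng.randint(INT32_MIN, INT32_MAX)  # randint includes both ends
        if candidate not in used:
            used.add(candidate)
            seeds.append(candidate)
    return seeds


# Print one set of seed flags per run, avoiding the seed from the first run.
for seed in pick_seeds(2, already_used=[1111111]):
    print(f"-run:constant_seed -run:jran {seed}")
```

Keeping a note of which seed went with which run makes it easy to pass the old values back in as "already_used" next time.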

Mon, 2014-11-03 12:16
rmoretti

Hi R Moretti,
Thank you so much for your help, both theoretical and practical.

Now I have two jobs running in parallel on my Ubuntu machine, which together use almost 100% of my CPU and RAM. :)

(Our cluster has been out of service recently.)

Yours sincerely
Cheng

Mon, 2014-11-03 13:15
lanselibai