
Coverage of sequence space?


Hi, I have some questions on protein design using Rosetta, specifically on the coverage of sequence space.

Say, a protein with 100 aa, allow the design of all positions with all the 20 aa residue species, to optimize the score to improve the stability of the protein.

I want to know
1) If Monte Carlo is not used for iterative design, then can I say the sequences generated in the output are ALL the sequences Rosetta has tested? 
2) If Monte Carlo is used, by any means could I retrieve what designs (.pdb and score) has Rosetta discarded during MC minimization?
3) Is there a way to prevent Rosetta from searching already-searched sequence space (i.e. stop Rosetta from generating redundant sequences)?
4) How does Rosetta generate the design sequence at the very beginning for minimization and scoring exactly? Does it start with random mutation of the sequence?

I know that people have been using the low energy sequences from Rosetta design to compare with the sequence space observed naturally from evolution.
However, from my experience of using backrub to generate designs, I have got a lot of redundant sequences (i.e. only 1000 sequences are non-redundant out of 50k of designs). I wondered how the comparison would be valid if the search of sequence space using Rosetta is not exhaustive (maybe I'm wrong). That's why I want to know how exactly the design is started, and how does Rosetta process the designs afterward.

Thanks!

 

Sun, 2019-04-21 23:25
johnnytam100

> 1) If Monte Carlo is not used for iterative design, then can I say the sequences generated in the output are ALL the sequences Rosetta has tested? 

No.  During a single trajectory, Rosetta tests hundreds of thousands or millions of sequences, and returns only the lowest-energy sequence encountered.

> 2) If Monte Carlo is used, by any means could I retrieve what designs (.pdb and score) has Rosetta discarded during MC minimization?

Not easily.  During the trajectory, Rosetta is not building full structures.  Our sequence design algorithm, called the packer, needs to be very fast to consider so many possibilities.  As such, all of the 3D geometry is only considered during a precalculation, in which onebody and twobody energies for each rotamer or pair of rotamers are calculated.  During the Monte Carlo trajectory itself, an abstract graph structure is used that only stores node and edge energies (not geometry).  Updates are very fast because the substitution of one rotamer for another requires only calculation of the difference in energy that results from that substitution, which is based only on the onebody and twobody energies of that position.
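To make the precalculation-plus-fast-update idea concrete, here is a minimal toy sketch in Python. It is not Rosetta code: the names (`random_tables`, `delta_energy`, `toy_pack`), the random energy tables, the fixed temperature and the random starting assignment are all assumptions made for illustration. It only shows how a substitution at one position can be scored from precomputed onebody and twobody tables without rebuilding any geometry.

```python
import math
import random


def random_tables(n_pos, n_rot, seed=1):
    """Stand-in for the precalculation: made-up onebody and twobody energy tables."""
    rng = random.Random(seed)
    onebody = [[rng.uniform(-1, 1) for _ in range(n_rot)] for _ in range(n_pos)]
    twobody = {
        (p, q): [[rng.uniform(-1, 1) for _ in range(n_rot)] for _ in range(n_rot)]
        for p in range(n_pos) for q in range(p + 1, n_pos)
    }
    return onebody, twobody


def pair_energy(twobody, p, r, q, s):
    """Look up the stored energy between rotamer r at position p and rotamer s at q."""
    return twobody[(p, q)][r][s] if p < q else twobody[(q, p)][s][r]


def total_energy(assignment, onebody, twobody):
    """Full energy of an assignment (needed only once, at the start)."""
    n = len(assignment)
    energy = sum(onebody[p][assignment[p]] for p in range(n))
    for p in range(n):
        for q in range(p + 1, n):
            energy += twobody[(p, q)][assignment[p]][assignment[q]]
    return energy


def delta_energy(assignment, pos, new_rot, onebody, twobody):
    """Energy change from one substitution, using only terms that involve `pos`."""
    old_rot = assignment[pos]
    delta = onebody[pos][new_rot] - onebody[pos][old_rot]
    for q, s in enumerate(assignment):
        if q != pos:
            delta += pair_energy(twobody, pos, new_rot, q, s)
            delta -= pair_energy(twobody, pos, old_rot, q, s)
    return delta


def toy_pack(onebody, twobody, n_steps=20000, kT=1.0, seed=0):
    """Metropolis search over rotamer indices; returns the best state encountered."""
    rng = random.Random(seed)
    n = len(onebody)
    assignment = [rng.randrange(len(onebody[p])) for p in range(n)]
    energy = total_energy(assignment, onebody, twobody)
    best, best_energy = list(assignment), energy
    for _ in range(n_steps):
        pos = rng.randrange(n)
        new_rot = rng.randrange(len(onebody[pos]))
        d = delta_energy(assignment, pos, new_rot, onebody, twobody)
        if d <= 0 or rng.random() < math.exp(-d / kT):
            assignment[pos] = new_rot
            energy += d
            if energy < best_energy:
                best, best_energy = list(assignment), energy
    return best, best_energy
```

Note that intermediate states are never stored in this sketch, only the best one, which is why the discarded sequences are not available afterwards.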

I do have a pull request open to add a special score term that is evaluated once per step in the packing trajectory, and which builds a full 3D model of the protein at that point and dumps it out to disk.  This slows packing down by many orders of magnitude, though -- a calculation that should take seconds now takes hours or days.  It is intended only for visualization.


> 3) Is there a way to prevent Rosetta from searching already-searched sequence space (i.e. stop Rosetta from generating redundant sequences)?

No.  Backtracking is an essential part of Monte Carlo.  Moreover, any scheme that kept track of regions of sequence space already explored would require comparison of the current sequence to every sequence in a list of hundreds of thousands or millions, which would be very slow.

> 4) How does Rosetta generate the design sequence at the very beginning for minimization and scoring exactly? Does it start with random mutation of the sequence?

At the start, all rotamers are initialized to a fictitious null state.  Moves _to_ the null state at any position are prohibited; moves _away_ from the null state at any position are automatically accepted, so once a move has been considered at every position, you are guaranteed to have a real rotamer at every position.  (So the net effect is similar to a random rotamer chosen at every position, initially.  It's not quite identical because the null state allows some local optimization at the first positions that get a non-null state before all positions have a non-null state.)
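The null-state start can be illustrated with another toy sketch. Again this is not Rosetta code: `NULL`, `local_energy`, `run_from_null`, the `pair` callback and the one-proposal-per-position sweep order are hypothetical choices made so that the "real rotamer at every position" guarantee is easy to see.

```python
import math
import random

NULL = None  # hypothetical marker for the fictitious null rotamer


def local_energy(assignment, pos, rot, onebody, pair):
    """Energy of `rot` at `pos`, counting only neighbours that already hold a real rotamer."""
    if rot is NULL:
        return 0.0
    energy = onebody[pos][rot]
    for q, s in enumerate(assignment):
        if q != pos and s is not NULL:
            energy += pair(pos, rot, q, s)
    return energy


def run_from_null(onebody, pair, n_sweeps=1, kT=1.0, seed=0):
    """Start every position in the null state and run Metropolis sweeps."""
    rng = random.Random(seed)
    n_pos = len(onebody)
    assignment = [NULL] * n_pos
    for _ in range(n_sweeps):
        order = list(range(n_pos))
        rng.shuffle(order)  # assumption: each sweep proposes one move per position
        for pos in order:
            new_rot = rng.randrange(len(onebody[pos]))  # NULL is never proposed
            old_rot = assignment[pos]
            delta = (local_energy(assignment, pos, new_rot, onebody, pair)
                     - local_energy(assignment, pos, old_rot, onebody, pair))
            # Moves away from NULL are always accepted; others use the Metropolis rule.
            if old_rot is NULL or delta <= 0 or rng.random() < math.exp(-delta / kT):
                assignment[pos] = new_rot
    return assignment
```

In this sketch, positions visited early are scored only against neighbours that already have a real rotamer, which mirrors the partial local optimization described above.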

> However, from my experience of using backrub to generate designs, I have got a lot of redundant sequences (i.e. only 1000 sequences are non-redundant out of 50k of designs). I wondered how the comparison would be valid if the search of sequence space using Rosetta is not exhaustive (maybe I'm wrong). That's why I want to know how exactly the design is started, and how does Rosetta process the designs afterward.

If you're getting a lot of redundant sequences given the same problem, that's a good thing -- despite the vastness of the search space, the Monte Carlo algorithm is converging to something that may well be the global optimum.  If you want sequence diversity, you need to rely on something else to inject random variations into the design problem, not rely on the optimizer (which is ideally convergent) to give you diverse solutions to a problem that should have a unique optimum.  (So, for example, jitter the backbone with the Small mover to vary the problem.)
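One simple way to quantify that convergence is to tally how often each output sequence recurs. A minimal sketch, assuming the designed sequences have already been extracted from the output structures into a list of strings (the function name `summarize_redundancy` is made up for this example):

```python
from collections import Counter


def summarize_redundancy(sequences, top=3):
    """Count total and unique sequences and report the most frequent ones."""
    counts = Counter(sequences)
    return {
        "total": len(sequences),
        "unique": len(counts),
        "most_common": counts.most_common(top),
    }
```

A small number of unique sequences that dominate the counts suggests the runs are landing in the same minimum for that backbone.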

Mon, 2019-04-22 08:26
vmulligan

> 1) If Monte Carlo is not used for iterative design, then can I say the sequences generated in the output are ALL the sequences Rosetta has tested? 

> No.  During a single trajectory, Rosetta tests many hundreds of thousands or millions of sequences, and returns the lowest-energy sequence encountered.

In that case, if I use a GenericMonteCarlo mover (ntrials = 100) to do iterative design, is the final output after the 100 MC cycles the best-minimized output chosen from 100 trajectories, where each trajectory has actually tested hundreds of thousands or millions of sequences? Does it mean I have done something MC of MC?

 

> 3) Is there a way to prevent Rosetta from searching already-searched sequence space (i.e. stop Rosetta from generating redundant sequences)?

> No.  Backtracking is an essential part of Monte Carlo.  Moreover, any scheme that kept track of regions of sequence space already explored would require comparison of the current sequence to every sequence in a list of hundreds of thousands or millions, which would be very slow.

May I know how would backtracking lead to the output of same sequences?


 

> 4) How does Rosetta generate the design sequence at the very beginning for minimization and scoring exactly? Does it start with random mutation of the sequence?

> At the start, all rotamers are initialized to a fictitious null state.  Moves _to_ the null state at any position are prohibited; moves _away_ from the null state at any position are automatically accepted, so once a move has been considered at every position, you are guaranteed to have a real rotamer at every position.  (So the net effect is similar to a random rotamer chosen at every position, initially.  It's not quite identical because the null state allows some local optimization at the first positions that get a non-null state before all positions have a non-null state.)

Before the rotamer randomization, how was the side chain (i.e. which aa species) determined? 


And I have three more questions:

5) Recently, out of curiosity, I have tried tandem design (packer1 -> packer2 -> packer3, each packer performs 100 trials using GenericMonteCarlo mover), to see if the output would be different from the usual single packer design. In such a case, I am expecting the minimized output from packer1 will be the starting point of packer2, and the minimized output from packer2 will be the starting point for packer3. One thousand structures were generated from both tandem design and single packer design. What I found, in the end, 1) the average energy of the whole population of decoys from tandem design is significantly lower than single packer design and 2) there were more decoys at the lowest energy range. If a single packer design + generating 1k structures has explored a sufficient sequence space and sufficiently done the minimization, why would a tandem design generate more low energy designs than single packer design? 

6) Although I still haven't tried it myself, do you know if Rosetta is capable of generating designs which have no detectable sequence similarity with the original structure? I think it is capable. But in case sometimes it is difficult, is there any barrier that hinders the jump to a distant sequence space?

7) Although there is no simple way to estimate the coverage of sequence space, if Rosetta is sampling hundreds of thousands or millions of sequences and we are able to observe convergence to the minimum, say if there is not a strong preference of aa species at the design positions (i.e. without some aa species that must be mutated to, or without some aa species must not be mutated to), is it a better choice to just use all 20 aa and allow Rosetta do the sampling? If the sampling is sufficient, Rosetta will anyway capture those sequences with design rationale (i.e. from previous knowledge e.g. what aa to mutate to and thus the specific sequence space where good designs are believed to be found)?   


Thanks a lot!!!

Tue, 2019-04-23 07:47
johnnytam100

> In that case, if I use a GenericMonteCarlo mover (ntrials = 100) to do iterative design, is the final output after the 100 MC cycles the best-minimized output chosen from 100 trajectories, where each trajectory has actually tested hundreds of thousands or millions of sequences? Does it mean I have done something MC of MC?
 

Yes.  If all you're doing inside a GenericMonteCarlo mover is packing, there's no reason to use the GenericMonteCarlo mover.  The packer alone is doing long Monte Carlo trajectories internally.  (If you're packing *and* perturbing in some way for each GenericMonteCarlo step -- e.g. jittering the backbone each time and then packing -- then this can make sense.)

> May I know how would backtracking lead to the output of same sequences?
 

I might have misunderstood your initial question.  If you're saying, "I ran the packer three times, and got three sequences, and now I want to tell it not to give me any of those particular sequences on the fourth run," we might be able to make that possible (though we don't currently have a way to do that).  I was thinking that you meant, "The packer has been doing its Monte Carlo search for a while, and it's on move 573,211 of 1,000,000.  I want to make sure that this move doesn't consider any of the 573,210 sequences that it has considered to this point."  That would be hard.  If you have N sequences that you want to penalize, you effectively have to have the packer compare the current sequence to all N at every step in its Monte Carlo trajectory.  Maybe there are ways to make that somewhat efficient, but it's nontrivial for large N.  But for small N, we could probably add a non-pairwise score term to penalize a user-defined sequence.  Indeed, you might be able to coax the MHCEpitopeEnergy (https://www.rosettacommons.org/docs/latest/rosetta_basics/scoring/MHCEpitopeEnergy) into giving you the behaviour you want (penalizing a sequence that you've already seen, so that you're unlikely to get it again).
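As a rough illustration of the small-N idea only, here is a toy penalty function. It is not an existing Rosetta score term and has nothing to do with how MHCEpitopeEnergy is implemented; `seen_sequence_penalty`, `weight` and `max_mismatches` are hypothetical names. It shows why the cost grows with N: every evaluation scans the whole list of previously seen sequences.

```python
def seen_sequence_penalty(current, seen, weight=10.0, max_mismatches=0):
    """Return `weight` if `current` is within `max_mismatches` of any seen sequence."""
    for previous in seen:  # one comparison per stored sequence: O(N * length)
        if len(previous) != len(current):
            continue
        mismatches = sum(a != b for a, b in zip(current, previous))
        if mismatches <= max_mismatches:
            return weight
    return 0.0
```

For example, `seen_sequence_penalty("ACDE", ["ACDE"])` returns the penalty, while a sequence differing at one position is only penalized if `max_mismatches` is raised to 1. Because the value depends on the whole sequence at once, a term like this cannot be folded into the precomputed onebody and twobody energies.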

> why would a tandem design generate more low energy designs than single packer design?

If you're allowing the input side-chains (i.e. including the input rotamers), then when the packer returns the lowest-energy configuration encountered, the set it chooses from includes the input configuration.  That means you're guaranteed to get the same or better energy on the second run, so on average it will be lower.  That's my guess for why you're seeing that.

> is there any barrier that hinders the jump to a distant sequence space

No barrier to distant sequence space, no.  There's a barrier to "narrow wells" in the sequence space landscape, though -- rotamer combinations that are themselves favourable, but for which all the "nearby" rotamer combinations are unfavourable.  Monte Carlo explores funnel-like landscapes well, but golf course-like landscapes poorly.

The more common reason, though, is that the input is very close to the global optimum, given the backbone conformation.  If you jitter the backbone conformation a bit, you're more likely to drive up the energy of the current sequence and drive down the energy of some alternative sequence, and thereby find more distant sequences.

> is it a better choice to just use all 20 aa and allow Rosetta do the sampling?

For small design problems (say, ~2000 to ~3000 rotamers), perhaps.  Keep in mind that, even for very small design problems, a trajectory that samples a million (10^6) possibilities is sampling a trivial fraction of the total.  Imagine that you have 20 allowed amino acids per position, and 10 designable positions.  You have 20^10, or about 10^13, possible sequences. That's ten million-fold more sequences than you have sampled (and that's not even considering all the rotamer possibilities for a given sequence, so that's a big underestimate).  So if you want to maximize your chances of finding the global minimum, you're definitely better off limiting amino acid possibilities at each position, with things like LayerDesign/Layer selector-based control of allowed amino acid identity in core, boundary, and surface layers.
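The arithmetic in that example (20 amino acids at 10 positions, versus a million sampled states) can be checked with a few lines of Python. The function name `sequence_space` is just for this illustration:

```python
def sequence_space(n_amino_acids, n_positions, n_sampled):
    """Return the number of possible sequences and how many-fold it exceeds the sample count."""
    total = n_amino_acids ** n_positions
    fold = total // n_sampled
    return total, fold


total, fold = sequence_space(20, 10, 10 ** 6)
print(total)  # 10240000000000, i.e. about 10^13
print(fold)   # 10240000, i.e. roughly ten million-fold
```

This counts sequences only; multiplying in the rotamer choices per amino acid makes the gap far larger.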

Tue, 2019-04-23 14:05
vmulligan

Thank you very much vmulligan!!! You are very helpful!

I really appreciate that Rosetta is very well-documented online. 

In any case would the Rosetta group consider publishing a bible (more on practical application) of Rosetta?

I'd love to have one!

Enjoy your weekend!

Thu, 2019-04-25 19:28
johnnytam100