Rosetta 3.3
Documentation for the sequence_tolerance application
Author:
Colin A. Smith

Metadata

This document was last updated August 10, 2010, by Colin A. Smith. The corresponding principal investigator is Tanja Kortemme <kortemme@cgl.ucsf.edu>.

Code and Demo

The code for the sequence_tolerance application is in src/apps/public/sequence_tolerance.cc. An integration test and demo is located in test/integration/tests/sequence_tolerance. The abstract genetic algorithm classes are located in the protocols::genetic_algorithm namespace. The concrete subclasses are located in the protocols::multistate_design namespace.

References

Purpose

The general uses of this application are specificity prediction and library design (i.e. for phage display, yeast display, or other selection techniques). To do so, this application determines the scores of a large number of sequences on a given input structure. The scores can be reweighted to emphasize inter or intramolecular interactions.

Algorithm

When this application starts, it precalculates the one-body and two-body rotamer energies for all specified amino acids at positions defined in the resfile. A genetic algorithm is used to take an initial population of sequences and generate successive generations of sequences that should be lower scoring overall than the previous generations.

Every time a given sequence is scored, it uses simulated annealing to optimize the rotamers for the given sequence. The rotamer configurations are optimized using a score function which equally weights all structural interactions. After optimization of the rotamers, the sequence score can be reweighted so that individual chains, as well as interactions between chains, are emphasized or de-emphasized.

The first generation is seeded with the sequence that results from full rotamer optimization using all allowed amino acids in the provided resfile. Additional sequences can optionally be generated by mutating this sequence. The rest of the initial population consists of random sequences.

In subsequent generations, the sequence with the best fitness (see -seq_tol:fitness_master_weights) is always propagated to the next generation. The remaining sequences are generated by mutation and recombination of sequences from the previous generation. Sequences for mutation and recombination are chosen using tournament selection, in which two sequences are randomly chosen from the parent population, and the sequence with the better fitness is selected.

Limitations

Though this application makes use of classes designed to incorporate multiple states, it does not currently support multi-state design.

Input Files

Options

General/Packing Options

Sequence Scoring/Rescoring Options

The best way to understand how the fitness function works is to examine sample output from a two-chain protein:

app.sequence_tolerance: Residue Decomposition: [Set 1 (A): [1, 2, ...], Set 2 (B): [104, 105, ...]]
app.sequence_tolerance: Fitness Function Master Weights: [1, 1, 1, 2]
app.sequence_tolerance: Fitness of Starting Structure:
---------------------------------------------------------------------
Score Type                  Other        A        B      A-B    Total
---------------------------------------------------------------------
fa_atr                      0.000 -266.124  -10.049  -43.778 -319.952
...
hbond_sr_bb               -17.074    0.000    0.000    0.000  -17.074
hbond_lr_bb               -24.484    0.000    0.000    0.000  -24.484
...
---------------------------------------------------------------------
Total                     -41.558  -57.662    4.319  -17.429 -112.330

Based on the chain identifiers given in the PDB file, the residues of the protein are decomposed into multiple sets, for which the absolute residue numbers are listed in the first line above. The fitness function is constructed such that energies within and between those sets can be independently weighted. It is defined as a vector of real numbers which set the weights for those energy components. Each element in the vector corresponds to a column in the table above.

The first weight in the fitness function is for those elements of the energy function that are not decomposed into residue or residue-residue energies. It is called "Other" in the output above. In the standard scoring function, the only terms for which this is the case are those for short-range and long-range backbone-backbone hydrogen bonds. Except for some proline residues, these terms will not change upon side chain mutation.

The next weights in the fitness function correspond to the energies within each set of residues. They are called "A" and "B" in the output above. In a three chain protein, there would be three such weights.

The final weights in the fitness function correspond to energies between each pair of residue sets. In the output above, there is just a single such weight called "A-B". For a three chain protein with chains A, B, and C, there would be three such weights, A-B, A-C, and B-C. The expansion continues following that pattern for proteins with more chains.

In the fitness function above, the interactions between chain A and B are upweighted by a factor of 2. The best way to know which element in the fitness function weight vector corresponds to which set or set interaction is to run the sequence_tolerance application first, and then use the output table as a reference.

Genetic Algorithm Options

Output Options

The checkpoint files serve to both preserve simulation state and provide a record afterwards about what sequences were evaluated. To enable checkpointing, both -ms:checkpoint:prefix and -ms:checkpoint:interval must be specified.

Tips

As noted above, previous publications have used a population size of 2,000 and 5 populations. However, optimization of the genetic algorithm parameters has not been attempted with this implementation.

Higher resolution rotamer libraries are recommended. The following flags have been used for publication: -ex1 -ex2 -extrachi_cutoff 0. While the -ex1aro -ex2aro flags were used in the command line for the publication, they are redundant and not necessary.

The runs using a population size of 2,000 and 5 generations for a recent PDZ specificity prediction paper (Smith & Kortemme 2010) took an average of 33.5 minutes for a single structure. The simulations were each run on a single core of a heterogeneous cluster of 8 core Xeon workstations with E5345, E5430, and E5520 processors.

Expected Outputs

The sequence_tolerance application produces comparatively little printed output. It does provide summary statistics about each evolved generation. For example:

app.sequence_tolerance: Generation 2:
app.sequence_tolerance: Distinct new entities: 1914
app.sequence_tolerance: Duplicate new entities: 4
app.sequence_tolerance: Entities from previous generation: 82
app.sequence_tolerance: Entities resurrected from earlier generations: 0
app.sequence_tolerance: Fitness Percentiles: 0%=-129.236 25%=-116.689 50%=-113.139 75%=-94.8569 100%=2234.36
app.sequence_tolerance: Best new entity:
MultiStateEntity with traits: AA:106:L AA:107:A AA:108:Y AA:109:W AA:110:V and fitness -127.504
 SingleState 1 with fitness: -127.504

The first two counts give the number of unique and duplicate sequences evaluated for the first time in the given generation. The next two counts give the number of sequences included in this generation that were present in the immediately preceding generation or in generations before. A summary of the distribution of fitnesses is also given, including the best, the first quartile, the median, the third quartile, and worst fitness. Finally, the best sequence evaluated for the first time in this generation is given.

Checkpoint Files

Most of the relevant output for the sequence_tolerance application is given in the checkpoint files. The first file, prefix.ga.generations, contains all sequences from every generation listed in output like the following:

generation 1
AA:106:F AA:107:A AA:108:T AA:109:F AA:110:V
AA:106:M AA:107:I AA:108:G AA:109:I AA:110:H
...
AA:106:Y AA:107:E AA:108:H AA:109:R AA:110:S
generation 5
AA:106:F AA:107:A AA:108:T AA:109:F AA:110:V
AA:106:R AA:107:G AA:108:W AA:109:T AA:110:G
...
AA:106:N AA:107:D AA:108:E AA:109:N AA:110:F

Each generation starts with a "generation N" line. On subsequent lines, each sequence in that generation is space delimited and has residues formatted AA:absolute_residue_number:one_letter_amino_acid_type.

The second file, prefix.ga.entities, contains all sequences from all generations listed in output like the following:

traits AA:106:N AA:107:D AA:108:Q AA:109:H AA:110:N fitness -83.2501 states 1
 fitness -83.2501 metrics 1 fitness_comp Real[ -34.1734 -36.8645 1.63354 -6.92286 ]
traits AA:106:D AA:107:H AA:108:M AA:109:H AA:110:W fitness 648.546 states 1
 fitness 648.546 metrics 1 fitness_comp Real[ -34.1734 -10.28 6.03952 343.48 ]
...

The data for two sequences, each using two lines, is shown above. The output format is somewhat redundant, as it is designed to hold information about multiple structural states for each sequence. (This redundancy is one of the reasons compression is recommended.) For this application, only a single state is used. The sequence is formatted the same as in the generations file. The weighted fitness ("fitness") is given twice. The unweighted components ("fitness_comp") of the fitness score are also given.

Sequence Enumeration and Structure Output

Beyond postprocessing, pregeneration of checkpoint files can also be used to accomplish several tasks, including sequence enumeration and structure output. To do either, you should make a prefix.ga.generations file, in the format shown above, containing a single generation with all sequences to be evaluated/output. Then run sequence_tolerance with -ms:generations set to 1 and -ms:pop_size set to the number of sequences in your generations file. An empty prefix.ga.entities file must exist prior to running sequence_tolerance, otherwise the input generations file will be overwritten with random sequences. After running sequence_tolerance with a compatible resfile, a prefix.ga.entities file will be created with scores for all the sequences you specified. To generate structures, simply add -ms:numresults with the number of structures you want to output, up to the number of sequences specified in the generations file.

For example, here is a generations file, myprefix.ga.generations, that could be used to both score and output structures for the three sequences given:

generation 1
AA:106:F AA:107:N AA:108:E AA:109:W AA:110:I
AA:106:F AA:107:E AA:108:T AA:109:W AA:110:V
AA:106:F AA:107:D AA:108:T AA:109:W AA:110:V

This is the command line that would be used to run sequence_tolerance:

sequence_tolerance -database minirosetta_database -s mystructure_0001_low.pdb 
-resfile myresfile.resfile -ex1 -ex2 -ex1aro -ex2aro -extrachi_cutoff 0 
-seq_tol:fitness_master_weights 1 1 1 2 -ms:generations 1 -ms:pop_size 3 
-ms:checkpoint:prefix myprefix -ms:checkpoint:interval 200 
-out:prefix myprefix -ms:numresults 3

If you are not interested in outputting structures, change -ms:numresults 3 to -ms:numresults 0.

Post Processing

The checkpoint files are designed to be parsed and processed computationally. Several R functions for doing so are provided in analysis/apps/sequence_tolerance.R. It includes functions for reading *.ga.entities and *.ga.generations checkpoint files. It also has a function for writing *.ga.generations checkpoint files for sequence enumeration and structural output (see immediately above). Finally, it contains a function for reweighting a set of sequences using different fitness function weights. See the comments above each function for information about their use.

To determine the profile of tolerated sequences for a given protein, the sequence_tolerance application is typically run multiple times, once for every member in an ensemble of backbone structures. The sequence_tolerance.R file has a convenience function for processing such output called process_specificity(). The simplest way to run it is to change to a directory containing all the entities files you wish to process, run R, and execute the following commands:

> source("path/to/analysis/apps/sequence_tolerance.R")
> process_specificity()

It produces several output files:

The process_specificity() function has several parameters, which default to those found to work well for wild-type PDZ-peptide specificity prediction. The fitness function weights can be changed with "fitness_coef". The Boltzmann temperature (default 0.228) or sequence cutoff can be changed with "temp_or_thresh". The type of sequence weighting can be changed by specifying the string "boltzmann" (default) or "cutoff" for the "type" parameter. Finally, the percentile cutoff for merging PWMs from different backbone structures (default 0.5) can be changed with "percentile". A more detailed explanation of the parameters and analysis performed is given in Smith and Kortemme (2010).

If you want to follow the guideline determined for specificity prediction on mutants, increase the default temperature by 0.021 for every fixed mutation you make to the starting structure. For instance, with one mutation you would run this instead:

> process_specificity(temp_or_thresh=0.249)

New things since last release

This is the first released version, there are no changes.

 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Defines