Rosetta 3.2 Release Manual
Every time a sequence is scored, the application uses simulated annealing to optimize the rotamers for that sequence. The rotamer configurations are optimized using a score function that weights all structural interactions equally. After the rotamers are optimized, the sequence score can be reweighted so that individual chains, as well as interactions between chains, are emphasized or de-emphasized.
The first generation is seeded with the sequence that results from full rotamer optimization using all allowed amino acids in the provided resfile. Additional sequences can optionally be generated by mutating this sequence. The rest of the initial population consists of random sequences.
In subsequent generations, the sequence with the best fitness (see -seq_tol:fitness_master_weights) is always propagated to the next generation. The remaining sequences are generated by mutation and recombination of sequences from the previous generation. Sequences for mutation and recombination are chosen using tournament selection, in which two sequences are randomly chosen from the parent population, and the sequence with the better fitness is selected.
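The selection scheme described above can be sketched in a few lines of Python. This is only an illustration, not the Rosetta implementation: the function names, the toy sequences and fitness values, and the choice to draw the two competitors without replacement are all assumptions. Lower fitness is treated as better, as in the Rosetta energy function. Recombination is left out to keep the sketch short.

```python
import random

def tournament_select(population, fitness, rng):
    """Draw two sequences at random and return the one with the lower (better) fitness."""
    first, second = rng.sample(population, 2)  # assumption: drawn without replacement
    return first if fitness[first] <= fitness[second] else second

def next_generation(population, fitness, mutate, rng):
    """Build a new population of the same size, always keeping the best sequence."""
    best = min(population, key=lambda seq: fitness[seq])
    children = [best]  # the fittest sequence is propagated unchanged
    while len(children) < len(population):
        parent = tournament_select(population, fitness, rng)
        children.append(mutate(parent, rng))  # recombination omitted for brevity
    return children

# Toy data: these sequences and fitness values are made up for illustration.
fitness = {"FATFV": -127.5, "MIGIH": -90.1, "YEHRS": 12.3}
population = list(fitness)
rng = random.Random(0)
winner = tournament_select(population, fitness, rng)
children = next_generation(population, fitness, lambda seq, r: seq, rng)
```

With two distinct competitors per tournament, the worst sequence in the population can never be selected as a parent, which gives a mild but steady selection pressure.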
app.sequence_tolerance: Residue Decomposition: [Set 1 (A): [1, 2, ...], Set 2 (B): [104, 105, ...]]
app.sequence_tolerance: Fitness Function Master Weights: [1, 1, 1, 2]
app.sequence_tolerance: Fitness of Starting Structure:
---------------------------------------------------------------------
Score Type        Other          A          B        A-B      Total
---------------------------------------------------------------------
fa_atr            0.000   -266.124    -10.049    -43.778   -319.952
...
hbond_sr_bb     -17.074      0.000      0.000      0.000    -17.074
hbond_lr_bb     -24.484      0.000      0.000      0.000    -24.484
...
---------------------------------------------------------------------
Total           -41.558    -57.662      4.319    -17.429   -112.330
Based on the chain identifiers given in the PDB file, the residues of the protein are decomposed into multiple sets, for which the absolute residue numbers are listed in the first line above. The fitness function is constructed such that energies within and between those sets can be independently weighted. It is defined as a vector of real numbers which set the weights for those energy components. Each element in the vector corresponds to a column in the table above.
The first weight in the fitness function is for those elements of the energy function that are not decomposed into residue or residue-residue energies. It is called "Other" in the output above. In the standard scoring function, the only terms for which this is the case are those for short-range and long-range backbone-backbone hydrogen bonds. Except for some proline residues, these terms will not change upon side chain mutation.
The next weights in the fitness function correspond to the energies within each set of residues. They are called "A" and "B" in the output above. In a three chain protein, there would be three such weights.
The final weights in the fitness function correspond to energies between each pair of residue sets. In the output above, there is just a single such weight called "A-B". For a three chain protein with chains A, B, and C, there would be three such weights, A-B, A-C, and B-C. The expansion continues following that pattern for proteins with more chains.
In the fitness function above, the interactions between chain A and B are upweighted by a factor of 2. The best way to know which element in the fitness function weight vector corresponds to which set or set interaction is to run the sequence_tolerance application first, and then use the output table as a reference.
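As a quick reference, the ordering of the weight vector can be reproduced with a short Python sketch. The helper names here are hypothetical, and the ordering for four or more chains is an assumption extrapolated from the pattern described above. The second helper treats the weighted fitness as the sum of each column total multiplied by its weight, which is consistent with the example output in this manual but should be checked against your own runs.

```python
from itertools import combinations

def weight_labels(chains):
    """Return the column labels in the order used by the fitness weight vector."""
    within = list(chains)                                        # one weight per residue set
    between = [f"{a}-{b}" for a, b in combinations(chains, 2)]   # one weight per pair of sets
    return ["Other"] + within + between

def weighted_fitness(weights, components):
    """Sum the per-column energies after multiplying each by its weight."""
    if len(weights) != len(components):
        raise ValueError("need exactly one weight per energy component")
    return sum(w * c for w, c in zip(weights, components))

labels = weight_labels(["A", "B", "C"])
# ['Other', 'A', 'B', 'C', 'A-B', 'A-C', 'B-C']

# Column totals (Other, A, B, A-B) from the starting-structure table above.
totals = [-41.558, -57.662, 4.319, -17.429]
fitness = weighted_fitness([1, 1, 1, 2], totals)  # about -129.759
```

Note that the "Total" column in the table is the unweighted sum (-112.330), whereas the value computed here counts the A-B column twice.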
Higher resolution rotamer libraries are recommended. The following flags have been used for publication: -ex1 -ex2 -extrachi_cutoff 0. While the -ex1aro -ex2aro flags were used in the command line for the publication, they are redundant and not necessary.
For a recent PDZ specificity prediction paper (Smith & Kortemme 2010), runs using a population size of 2,000 and 5 generations took an average of 33.5 minutes per structure. Each simulation ran on a single core of a heterogeneous cluster of 8-core Xeon workstations with E5345, E5430, and E5520 processors.
app.sequence_tolerance: Generation 2:
app.sequence_tolerance: Distinct new entities: 1914
app.sequence_tolerance: Duplicate new entities: 4
app.sequence_tolerance: Entities from previous generation: 82
app.sequence_tolerance: Entities resurrected from earlier generations: 0
app.sequence_tolerance: Fitness Percentiles: 0%=-129.236 25%=-116.689 50%=-113.139 75%=-94.8569 100%=2234.36
app.sequence_tolerance: Best new entity: MultiStateEntity with traits: AA:106:L AA:107:A AA:108:Y AA:109:W AA:110:V and fitness -127.504 SingleState 1 with fitness: -127.504
The first two counts give the number of unique and duplicate sequences evaluated for the first time in the given generation. The next two counts give the number of sequences included in this generation that were present in the immediately preceding generation or in generations before. A summary of the distribution of fitnesses is also given, including the best, the first quartile, the median, the third quartile, and worst fitness. Finally, the best sequence evaluated for the first time in this generation is given.
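When monitoring many runs, it can be handy to pull the percentile summary out of the log. The following Python sketch assumes the "Fitness Percentiles" line has the layout shown in the example output; the function name is hypothetical and the parser is not part of Rosetta.

```python
import re

def parse_percentiles(line):
    """Return {percentile: fitness} from a 'Fitness Percentiles' log line."""
    pairs = re.findall(r"(\d+)%=(\S+)", line)
    return {int(percent): float(value) for percent, value in pairs}

line = ("app.sequence_tolerance: Fitness Percentiles: 0%=-129.236 25%=-116.689 "
        "50%=-113.139 75%=-94.8569 100%=2234.36")
percentiles = parse_percentiles(line)
best, median, worst = percentiles[0], percentiles[50], percentiles[100]
```

Because lower fitness is better, the 0% entry is the best sequence in the generation and the 100% entry is the worst.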
generation 1
AA:106:F AA:107:A AA:108:T AA:109:F AA:110:V
AA:106:M AA:107:I AA:108:G AA:109:I AA:110:H
...
AA:106:Y AA:107:E AA:108:H AA:109:R AA:110:S
generation 5
AA:106:F AA:107:A AA:108:T AA:109:F AA:110:V
AA:106:R AA:107:G AA:108:W AA:109:T AA:110:G
...
AA:106:N AA:107:D AA:108:E AA:109:N AA:110:F
Each generation starts with a "generation N" line. Each subsequent line holds one sequence from that generation, written as space-delimited residues in the format AA:absolute_residue_number:one_letter_amino_acid_type.
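A generations file in this format is straightforward to read programmatically. The Python sketch below is a hypothetical helper, not part of the Rosetta distribution; it assumes one sequence per line and no elided ("...") lines in the real file.

```python
def parse_generations(text):
    """Map generation number -> list of sequences (each a dict of residue number -> amino acid)."""
    generations = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if tokens[0] == "generation":
            current = int(tokens[1])
            generations[current] = []
        elif current is not None:
            sequence = {}
            for token in tokens:
                kind, resnum, amino_acid = token.split(":")
                if kind == "AA":
                    sequence[int(resnum)] = amino_acid
            generations[current].append(sequence)
    return generations

example = """generation 1
AA:106:F AA:107:A AA:108:T AA:109:F AA:110:V
AA:106:M AA:107:I AA:108:G AA:109:I AA:110:H
generation 5
AA:106:F AA:107:A AA:108:T AA:109:F AA:110:V
"""
generations = parse_generations(example)
first = "".join(generations[1][0][pos] for pos in sorted(generations[1][0]))
```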
The second file, prefix.ga.entities, contains all sequences from all generations listed in output like the following:
traits AA:106:N AA:107:D AA:108:Q AA:109:H AA:110:N fitness -83.2501
states 1 fitness -83.2501 metrics 1 fitness_comp Real[ -34.1734 -36.8645 1.63354 -6.92286 ]
traits AA:106:D AA:107:H AA:108:M AA:109:H AA:110:W fitness 648.546
states 1 fitness 648.546 metrics 1 fitness_comp Real[ -34.1734 -10.28 6.03952 343.48 ]
...
The data for two sequences, each using two lines, is shown above. The output format is somewhat redundant, as it is designed to hold information about multiple structural states for each sequence. (This redundancy is one of the reasons compression is recommended.) For this application, only a single state is used. The sequence is formatted the same as in the generations file. The weighted fitness ("fitness") is given twice. The unweighted components ("fitness_comp") of the fitness score are also given.
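The relationship between "fitness" and "fitness_comp" can be checked with a small Python sketch. It assumes the record layout shown in the example and that the run used master weights of 1 1 1 2; the function name is hypothetical. The record is passed as one whitespace-joined string, so the parser does not depend on exactly where the line break falls.

```python
import re

def parse_entity(record):
    """Return (sequence, fitness, components) from one entity record."""
    tokens = record.split()
    fitness_index = tokens.index("fitness")  # first "fitness" follows the traits
    traits = tokens[1:fitness_index]         # tokens[0] is the word "traits"
    sequence = "".join(trait.split(":")[2] for trait in traits)
    fitness = float(tokens[fitness_index + 1])
    match = re.search(r"fitness_comp Real\[(.*?)\]", record)
    components = [float(value) for value in match.group(1).split()]
    return sequence, fitness, components

record = ("traits AA:106:N AA:107:D AA:108:Q AA:109:H AA:110:N fitness -83.2501 "
          "states 1 fitness -83.2501 metrics 1 fitness_comp "
          "Real[ -34.1734 -36.8645 1.63354 -6.92286 ]")
sequence, fitness, components = parse_entity(record)

weights = [1, 1, 1, 2]  # assumed -seq_tol:fitness_master_weights for this run
recomputed = sum(w * c for w, c in zip(weights, components))
```

For this record the recomputed value agrees with the reported fitness to within rounding, which suggests the unweighted components can be reweighted after the fact without rerunning the simulation.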
For example, here is a generations file, myprefix.ga.generations, that could be used to both score and output structures for the three sequences given:
generation 1
AA:106:F AA:107:N AA:108:E AA:109:W AA:110:I
AA:106:F AA:107:E AA:108:T AA:109:W AA:110:V
AA:106:F AA:107:D AA:108:T AA:109:W AA:110:V
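A file like the one above can be written by hand, or generated with a short script. The Python sketch below is a hypothetical helper that assumes the designed positions are contiguous and start at a known absolute residue number (106 in this example); adjust it if your resfile designs non-contiguous positions.

```python
def format_generations(sequences, first_residue, generation=1):
    """Format one-letter sequences as the text of a single-generation generations file."""
    lines = [f"generation {generation}"]
    for sequence in sequences:
        traits = [f"AA:{first_residue + offset}:{amino_acid}"
                  for offset, amino_acid in enumerate(sequence)]
        lines.append(" ".join(traits))
    return "\n".join(lines) + "\n"

content = format_generations(["FNEWI", "FETWV", "FDTWV"], first_residue=106)

# To save it under the checkpoint prefix used in this example:
# with open("myprefix.ga.generations", "w") as handle:
#     handle.write(content)
```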
This is the command line that would be used to run sequence_tolerance:
sequence_tolerance -database minirosetta_database -s mystructure_0001_low.pdb -resfile myresfile.resfile -ex1 -ex2 -ex1aro -ex2aro -extrachi_cutoff 0 -seq_tol:fitness_master_weights 1 1 1 2 -ms:generations 1 -ms:pop_size 3 -ms:checkpoint:prefix myprefix -ms:checkpoint:interval 200 -out:prefix myprefix -ms:numresults 3
If you are not interested in outputting structures, change -ms:numresults 3 to -ms:numresults 0.
To determine the profile of tolerated sequences for a given protein, the sequence_tolerance application is typically run multiple times, once for every member in an ensemble of backbone structures. The sequence_tolerance.R file has a convenience function for processing such output called process_specificity(). The simplest way to run it is to change to a directory containing all the entities files you wish to process, run R, and execute the following commands:
> source("path/to/analysis/apps/sequence_tolerance.R")
> process_specificity()
It produces several output files:
If you want to follow the guideline determined for specificity prediction on mutants, increase the default temperature by 0.021 for every fixed mutation you make to the starting structure. For instance, with one mutation you would run this instead: