Author: Colin A. Smith

This document was last updated March 26th, 2015, by Shane Ó Conchúir. The corresponding principal investigator is Tanja Kortemme (kortemme@cgl.ucsf.edu).

# Code and Demo

The code for the sequence_tolerance application is in src/apps/public/sequence_tolerance.cc. An integration test and demo is located in rosetta/main/tests/integration/tests/sequence_tolerance. The abstract genetic algorithm classes are located in the protocols::genetic_algorithm namespace. The concrete subclasses are located in the protocols::multistate_design namespace.

# Purpose

The general uses of this application are specificity prediction and library design (i.e. for phage display, yeast display, or other selection techniques). To do so, this application determines the scores of a large number of sequences on a given input structure. The scores can be reweighted to emphasize inter or intramolecular interactions.

# Algorithm

When this application starts, it precalculates the one-body and two-body rotamer energies for all specified amino acids at positions defined in the resfile. A genetic algorithm is used to take an initial population of sequences and generate successive generations of sequences that should be lower scoring overall than the previous generations.

Every time a given sequence is scored, it uses simulated annealing to optimize the rotamers for the given sequence. The rotamer configurations are optimized using a score function which equally weights all structural interactions. After optimization of the rotamers, the sequence score can be reweighted so that individual chains, as well as interactions between chains, are emphasized or de-emphasized.

The first generation is seeded with the sequence that results from full rotamer optimization using all allowed amino acids in the provided resfile. Additional sequences can optionally be generated by mutating this sequence. The rest of the initial population consists of random sequences.

In subsequent generations, the sequence with the best fitness (see -seq_tol:fitness_master_weights) is always propagated to the next generation. The remaining sequences are generated by mutation and recombination of sequences from the previous generation. Sequences for mutation and recombination are chosen using tournament selection, in which two sequences are randomly chosen from the parent population, and the sequence with the better fitness is selected.

# Limitations

Though this application makes use of classes designed to incorporate multiple states, it does not currently support multi-state design.

# Input Files

• The starting structure must be in PDB format and can be specified using the -s option. Only a single structure can be used.

• Side chain sampling is controlled using the -resfile command line option. Please see the resfile documentation for more information about how to create a resfile. At least two amino acids should be specified as designable. Please keep in mind the combinatorial size of the sequence space (i.e. 20^2, 20^3, etc.) when setting the population size and number of generations. For designs with a small combinatorial size, direct enumeration is the best way of evaluating all sequences. This is not done automatically, but can be achieved by supplying a custom generations file (see below).

# Options

## General/Packing Options

• -database [file path] - the Rosetta 3 database location

• -in:file:s [file path] - the structure used for packing

• -packing:resfile [file path] - define which residues can be mutated/repacked

• -ex1, -ex2, -extrachi_cutoff [num residues], etc. - increase the resolution of the rotamer library

## Sequence Scoring/Rescoring Options

• -score:weights - specify custom score function weights/options for use with packing

• -seq_tol:fitness_master_weights [real vector] - fitness function master weights used for genetic algorithm optimization (defaults to unweighted)

The best way to understand how the fitness function works is to examine sample output from a two-chain protein:

app.sequence_tolerance: Residue Decomposition: [Set 1 (A): [1, 2, ...], Set 2 (B): [104, 105, ...]]
app.sequence_tolerance: Fitness Function Master Weights: [1, 1, 1, 2]
app.sequence_tolerance: Fitness of Starting Structure:
---------------------------------------------------------------------
Score Type                  Other        A        B      A-B    Total
---------------------------------------------------------------------
fa_atr                      0.000 -266.124  -10.049  -43.778 -319.952
...
hbond_sr_bb               -17.074    0.000    0.000    0.000  -17.074
hbond_lr_bb               -24.484    0.000    0.000    0.000  -24.484
...
---------------------------------------------------------------------
Total                     -41.558  -57.662    4.319  -17.429 -112.330

Based on the chain identifiers given in the PDB file, the residues of the protein are decomposed into multiple sets, for which the absolute residue numbers are listed in the first line above. The fitness function is constructed such that energies within and between those sets can be independently weighted. It is defined as a vector of real numbers which set the weights for those energy components. Each element in the vector corresponds to a column in the table above.

The first weight in the fitness function is for those elements of the energy function that are not decomposed into residue or residue-residue energies. It is called "Other" in the output above. In the standard scoring function, the only terms for which this is the case are those for short-range and long-range backbone-backbone hydrogen bonds. Except for some proline residues, these terms will not change upon side chain mutation.

The next weights in the fitness function correspond to the energies within each set of residues. They are called "A" and "B" in the output above. In a three chain protein, there would be three such weights.

The final weights in the fitness function correspond to energies between each pair of residue sets. In the output above, there is just a single such weight called "A-B". For a three chain protein with chains A, B, and C, there would be three such weights, A-B, A-C, and B-C. The expansion continues following that pattern for proteins with more chains.

In the fitness function above, the interactions between chain A and B are upweighted by a factor of 2. The best way to know which element in the fitness function weight vector corresponds to which set or set interaction is to run the sequence_tolerance application first, and then use the output table as a reference.

## Genetic Algorithm Options

• -ms:pop_size [integer] - number of sequences per generation (default 100, 2000 used in Humphris 2008)

• -ms:generations [integer] - number of generations (default 20, 5 used in Humphris 2008)

• -ms:pop_from_ss [integer] - number of sequences to include in the initial population that are created by mutation of the sequence determined by full redesign (default 0)

• -ms:mutate_rate [real] - probability of randomizing each residue when a sequence is mutated (default 0.5)

• -ms:fraction_by_recombination [real] - fraction of sequences to generate by recombination of sequences from the previous generation (default 0.5)

• -ms:num_packs [integer] - number of times to call the simulated annealer to find the best rotamer configuration for a given sequence (default 1)

## Output Options

The checkpoint files serve to both preserve simulation state and provide a record afterwards about what sequences were evaluated. To enable checkpointing, both -ms:checkpoint:prefix and -ms:checkpoint:interval must be specified.

• -ms:checkpoint:prefix [string] - prefix to add to the beginning of checkpoint file names

• -ms:checkpoint:interval [integer] - frequency in number of sequences scored with which the checkpoint files are written (default 0, 200 suggested)

• -ms:checkpoint:gz - compress the checkpoint files with gzip (recommended)

• -ms:checkpoint:rename - rename checkpoint files after the genetic algorithm completes, so that subsequent runs generate new output (not recommended)

• -seq_tol:unsat_polars - for every sequence, calculate the number of buried polar atoms with unsatisfied hydrogen bonds using BuriedUnsatisfiedPolarsCalculator and add it to the checkpoint file (makes execution slower)

• -seq_tol:surface - for every sequence, calculate the surface score using SurfaceCalculator and add it to the checkpoint file (makes execution slower)

• -ms:numresults [integer] - number of sequences to output structures for, starting with the lowest scoring sequence (default 1)

# Tips

As noted above, previous publications have used a population size of 2,000 and 5 populations. However, optimization of the genetic algorithm parameters has not been attempted with this implementation.

Higher resolution rotamer libraries are recommended. The following flags have been used for publication: -ex1 -ex2 -extrachi_cutoff 0. While the -ex1aro -ex2aro flags were used in the command line for the publication, they are redundant and not necessary.

The runs using a population size of 2,000 and 5 generations for a recent PDZ specificity prediction paper (Smith & Kortemme 2010) took an average of 33.5 minutes for a single structure. The simulations were each run on a single core of a heterogeneous cluster of 8 core Xeon workstations with E5345, E5430, and E5520 processors.

# Expected Outputs

The sequence_tolerance application produces comparatively little printed output. It does provide summary statistics about each evolved generation. For example:

app.sequence_tolerance: Generation 2:
app.sequence_tolerance: Distinct new entities: 1914
app.sequence_tolerance: Duplicate new entities: 4
app.sequence_tolerance: Entities from previous generation: 82
app.sequence_tolerance: Entities resurrected from earlier generations: 0
app.sequence_tolerance: Fitness Percentiles: 0%=-129.236 25%=-116.689 50%=-113.139 75%=-94.8569 100%=2234.36
app.sequence_tolerance: Best new entity:
MultiStateEntity with traits: AA:106:L AA:107:A AA:108:Y AA:109:W AA:110:V and fitness -127.504
SingleState 1 with fitness: -127.504

The first two counts give the number of unique and duplicate sequences evaluated for the first time in the given generation. The next two counts give the number of sequences included in this generation that were present in the immediately preceding generation or in generations before. A summary of the distribution of fitnesses is also given, including the best, the first quartile, the median, the third quartile, and worst fitness. Finally, the best sequence evaluated for the first time in this generation is given.

## Checkpoint Files

Most of the relevant output for the sequence_tolerance application is given in the checkpoint files. The first file, prefix.ga.generations, contains all sequences from every generation listed in output like the following:

generation 1
AA:106:F AA:107:A AA:108:T AA:109:F AA:110:V
AA:106:M AA:107:I AA:108:G AA:109:I AA:110:H
...
AA:106:Y AA:107:E AA:108:H AA:109:R AA:110:S
generation 5
AA:106:F AA:107:A AA:108:T AA:109:F AA:110:V
AA:106:R AA:107:G AA:108:W AA:109:T AA:110:G
...
AA:106:N AA:107:D AA:108:E AA:109:N AA:110:F

Each generation starts with a "generation N" line. On subsequent lines, each sequence in that generation is space delimited and has residues formatted AA:absolute_residue_number:one_letter_amino_acid_type.

The second file, prefix.ga.entities, contains all sequences from all generations listed in output like the following:

traits AA:106:N AA:107:D AA:108:Q AA:109:H AA:110:N fitness -83.2501 states 1
fitness -83.2501 metrics 1 fitness_comp Real[ -34.1734 -36.8645 1.63354 -6.92286 ]
traits AA:106:D AA:107:H AA:108:M AA:109:H AA:110:W fitness 648.546 states 1
fitness 648.546 metrics 1 fitness_comp Real[ -34.1734 -10.28 6.03952 343.48 ]
...

The data for two sequences, each using two lines, is shown above. The output format is somewhat redundant, as it is designed to hold information about multiple structural states for each sequence. (This redundancy is one of the reasons compression is recommended.) For this application, only a single state is used. The sequence is formatted the same as in the generations file. The weighted fitness ("fitness") is given twice. The unweighted components ("fitness_comp") of the fitness score are also given.

## Sequence Enumeration and Structure Output

Beyond postprocessing, pregeneration of checkpoint files can also be used to accomplish several tasks, including sequence enumeration and structure output. To do either, you should make a prefix.ga.generations file, in the format shown above, containing a single generation with all sequences to be evaluated/output. Then run sequence_tolerance with -ms:generations set to 1 and -ms:pop_size set to the number of sequences in your generations file. An empty prefix.ga.entities file must exist prior to running sequence_tolerance, otherwise the input generations file will be overwritten with random sequences. After running sequence_tolerance with a compatible resfile, a prefix.ga.entities file will be created with scores for all the sequences you specified. To generate structures, simply add -ms:numresults with the number of structures you want to output, up to the number of sequences specified in the generations file.

For example, here is a generations file, myprefix.ga.generations, that could be used to both score and output structures for the three sequences given:

generation 1
AA:106:F AA:107:N AA:108:E AA:109:W AA:110:I
AA:106:F AA:107:E AA:108:T AA:109:W AA:110:V
AA:106:F AA:107:D AA:108:T AA:109:W AA:110:V

This is the command line that would be used to run sequence_tolerance:

sequence_tolerance -database /path/to/rosetta/main/database -s mystructure_0001_low.pdb
-resfile myresfile.resfile -ex1 -ex2 -ex1aro -ex2aro -extrachi_cutoff 0
-seq_tol:fitness_master_weights 1 1 1 2 -ms:generations 1 -ms:pop_size 3
-ms:checkpoint:prefix myprefix -ms:checkpoint:interval 200
-out:prefix myprefix -ms:numresults 3

If you are not interested in outputting structures, change -ms:numresults 3 to -ms:numresults 0.

# Post Processing

The checkpoint files are designed to be parsed and processed computationally. Several R functions for doing so are provided in analysis/apps/sequence_tolerance.R. It includes functions for reading *.ga.entities and *.ga.generations checkpoint files. It also has a function for writing *.ga.generations checkpoint files for sequence enumeration and structural output (see immediately above). Finally, it contains a function for reweighting a set of sequences using different fitness function weights. See the comments above each function for information about their use.

To determine the profile of tolerated sequences for a given protein, the sequence_tolerance application is typically run multiple times, once for every member in an ensemble of backbone structures. The sequence_tolerance.R file has a convenience function for processing such output called process_specificity(). The simplest way to run it is to change to a directory containing all the entities files you wish to process, run R, and execute the following commands:

> source("path/to/analysis/apps/sequence_tolerance.R")
> process_specificity()

It produces several output files:

• specificity_boxplot.pdf - one boxplot per page of frequencies for each residue
• specificity_boxplot.png - all boxplots on a single image for all residues
• specificity_boxplot_resnum.png - one boxplot per image for each residue
• specificity_pwm.txt - a position weight matrix (PWM) for all residues
• specificity_sequences.fasta - a FASTA file with 100 sequences that can be used to produce a sequence logo using WebLogo or another program

The process_specificity() function has several parameters, which default to those found to work well for wild-type PDZ-peptide specificity prediction. The fitness function weights can be changed with "fitness_coef". The Boltzmann temperature (default 0.228) or sequence cutoff can be changed with "temp_or_thresh". The type of sequence weighting can be changed by specifying the string "boltzmann" (default) or "cutoff" for the "type" parameter. Finally, the percentile cutoff for merging PWMs from different backbone structures (default 0.5) can be changed with "percentile". A more detailed explanation of the parameters and analysis performed is given in Smith and Kortemme (2010).

If you want to follow the guideline determined for specificity prediction on mutants, increase the default temperature by 0.021 for every fixed mutation you make to the starting structure. For instance, with one mutation you would run this instead:

> process_specificity(temp_or_thresh=0.249)

# Protocol capture

A revised and maintained version of the original protocol capture for sequence tolerance can be downloaded from the Macromolecular modeling and design benchmarks website. The sequence tolerance page also describes expected performance for the published benchmark sets.

# New things since last release

This is the first released version, there are no changes.