Rosetta 3.2.1 Release Manual

core::fragment::picking User's Guide.


This document was last edited 2010-X-18. The original author was Dominik Gront.

The Basics

This document gives a basic description of the fragment picker utility that was recently introduced into minirosetta.

Running the fragment picker

To pick fragments, run the picker application with the proper flags. A typical flag file for the quota protocol is given below:

# Input databases
-in::path::database             /work/dgront/CPP/database
-in::file::vall                 ../../DATA/vall.dat.2006-05-05

# Query-related input files
-in::file::checkpoint           input_files/2jsvX.chk
# PDB is necessary for crmsd score
-in::file::s                    input_files/2jsvX.pdb
-frags::ss_pred                 input_files/2jsvX.psipred.ss2 psipred input_files/2jsvX.sam.ss2 sam input_files/2jsvX.jufo.ss2 jufo

# the name root for the output fragment files
-out::file::frag_prefix         output_files/frags

# Show score components for each selected fragment
-frags::describe_fragments      output_files/frags.fsc

# Weights file
-frags::scoring::config         psi_sam_jufo_L1-Q.cfg

# we need nine-mers and three-mers. In general, any numbers should work here, e.g. 3 4 5 6 7 8 9
-frags::frag_sizes              9 3

# Select 200 fragments from 1000 candidates. We need more candidates than fragments to fill quota pools. 
-frags::n_candidates            1000
-frags::n_frags                 200

# Quota.def file defines the shares between different quota pools. The total should be 1.0
-frags::picking::quota_config_file      Quota.def

# Get rid of homologous fragments; the given file should provide excluded chains as 4-character strings
-frags::denied_pdb              input_files/2jsvX.homolog_vall
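The flag file above can be checked before a run. The following is a minimal Python sketch, assuming the layout seen in the example (one flag per line, values separated by whitespace, lines starting with `#` treated as comments). The helper `parse_flags` is hypothetical and is not part of Rosetta; it only illustrates the rule that more candidates than fragments are needed.

```python
def parse_flags(text):
    """Parse flag-file text into {flag_name: [values]}; '#' lines are comments."""
    flags = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, *values = line.split()
        flags[name.lstrip("-")] = values
    return flags


# A fragment of the flag file shown above
example = """
# Select 200 fragments from 1000 candidates
-frags::n_candidates            1000
-frags::n_frags                 200
-frags::frag_sizes              9 3
"""

flags = parse_flags(example)
n_candidates = int(flags["frags::n_candidates"][0])
n_frags = int(flags["frags::n_frags"][0])
frag_sizes = [int(size) for size in flags["frags::frag_sizes"]]

# The quota pools can only be filled if there are at least as many candidates as fragments
enough_candidates = n_candidates >= n_frags
```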


Currently there are two fragment picking protocols available: quota and select best.

Select best protocol

... is very simple: it goes through all possible fragments (referred to later as fragment candidates), scores them and keeps the best N, where N is an integer parameter provided by the user. Finally, the best-scoring fragments are written to an output file.
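The idea of the select best protocol can be sketched in a few lines of Python. This is an illustration only, not the picker's implementation: the function `select_best`, the candidate names and the score values are hypothetical, and the sketch assumes that a lower score means a better fragment.

```python
import heapq


def select_best(candidates, score, n):
    """Return the n candidates with the lowest score (assumption: lower is better)."""
    return heapq.nsmallest(n, candidates, key=score)


# Hypothetical candidates given as (name, total score)
candidates = [("cand_a", 2.5), ("cand_b", 0.7), ("cand_c", 1.9), ("cand_d", 3.2)]
best = select_best(candidates, score=lambda cand: cand[1], n=2)
```

Here `best` holds the two lowest-scoring candidates, ordered from best to worst.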

Quota protocol

It was designed to replace the well-known nnmake program and should give comparable results. Its key feature is the use of 9 quota pools constructed from 3 secondary structure predictions calculated for the query sequence. The protocol is intended to provide fragments for ab-initio protein structure prediction. It tries to select the best possible fragments while preserving the necessary diversity.


There are many possible input files, depending on the picking protocol and scoring function. The most commonly used are:

file type: vall
    description: protein structures database; your fragments come from there
    where does it come from: should be in SVN
    who uses it: mandatory file

file type: .cfg
    description: defines the scoring system for fragment selection
    where does it come from: edit one of the examples provided below
    who uses it: mandatory file

file type: .fasta
    description: amino acid sequence
    where does it come from: you must already have it...
    who uses it: mandatory file unless .chk is given

file type: .chk
    description: sequence profile created with PSI-Blast with further modifications (pseudocounts added)
    where does it come from: script
    who uses it: any sequence-profile-based score, e.g. ProfileScoreL1; mandatory file unless .fasta is given

file type: .ss2
    description: secondary structure prediction in PsiPred format
    where does it come from: The easiest way is to run script. You may also run secondary structure prediction software on your own and then convert the results to the proper format. A script can turn TALOS, JUFO, Porter and SAM output into ss2.
    who uses it: SecondarySimilarity or SecondaryIdentity scores

file type: .cst
    description: distance (or dihedral) constraints
    where does it come from: Convert your data (distances or torsion angle values) into the proper format.
    who uses it: AtomPairConstraintsScore or DihedralConstraintsScore scores

file type: .tab
    description: chemical shifts in TALOS format
    where does it come from: NMR experiment; examples can be downloaded from the BMRB database
    who uses it: CSScore (CS-Rosetta protocol)

file type: .pdb
    description: reference structure in PDB format
    who uses it: used for fragments' quality assessment


There are two kinds of output files:

fragment file

Output fragments are written in Rosetta++ format.

fragment score file

Fragment scores are stored in a flat tabulated format, one score file for each fragment size. All columns from a single line describe a single fragment and provide:

Fragment scoring scheme


Weight file for fragment picking

A weight file has at least four columns, which provide: the score name, its priority, its weight and the maximum allowed value. If a given score returns a value higher than the maximum allowed for a certain candidate, that fragment candidate is no longer considered and no further scores are evaluated for it. The scores are evaluated in order of decreasing priority, not in the order in which they are listed in the weight file. To be sure that all scores are evaluated for each fragment, put a '-' (dash) character as the max_allowed value.

A weight value of 0.0 has a special meaning: such scores are evaluated only for the selected fragments, at the end of the program when the output files are written. This reduces the time spent evaluating descriptive fragment statistics such as crmsd or Gunn cost.

Typical weight values are given below:

For ab-initio prediction (quota protocol):

# score name          priority  wght   max_allowed  extras 
SecondarySimilarity     350     1.0     -       psipred
SecondarySimilarity     300     1.0     -       sam
SecondarySimilarity     250     1.0     -       porter
RamaScore               150     2.0     -       psipred
RamaScore               150     2.0     -       porter
RamaScore               150     2.0     -       sam
ProfileScoreL1          200     1.0     -
PhiPsiSquareWell        100     0.0     -
FragmentCrmsd           30      0.0     -

CS-Rosetta style fragment picking:

# score name          priority  wght   max_allowed  extras 
CSScore                 375     3.0     -
RamaScore               400     2.0     -       talos
SecondarySimilarity     350     3.0     -       talos
ProfileScoreL1          200     1.0     -
PhiPsiSquareWell        100     0.0     -
FragmentCrmsd           30      0.0     -
GunnCostScore           20      0.0     -

Everything from the fifth column onward is passed to the score term maker as additional parameters. The most important use is to give the name of the secondary structure prediction for the quota protocol.
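The weight file rules described in this section can be sketched in Python as follows. This is a sketch under stated assumptions, not the picker's code: the functions `parse_weights`, `evaluation_plan` and `weighted_total` are hypothetical, the total score is assumed to be a weighted sum of the raw score values, the cutoff is assumed to apply to the raw (unweighted) value, and the raw values in the example are made up.

```python
WEIGHTS = """
# score name          priority  wght   max_allowed  extras
CSScore                 375     3.0     -
RamaScore               400     2.0     -       talos
SecondarySimilarity     350     3.0     -       talos
ProfileScoreL1          200     1.0     -
PhiPsiSquareWell        100     0.0     -
FragmentCrmsd           30      0.0     -
GunnCostScore           20      0.0     -
"""


def parse_weights(text):
    """Parse weight-file text into a list of score-term dictionaries."""
    terms = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, priority, weight, max_allowed, *extras = line.split()
        terms.append({
            "name": name,
            "priority": int(priority),
            "weight": float(weight),
            # '-' means no cutoff, so the score is always evaluated
            "max_allowed": None if max_allowed == "-" else float(max_allowed),
            # everything from the fifth column onward
            "extras": extras,
        })
    return terms


def evaluation_plan(terms):
    """Terms scored during picking (decreasing priority) and terms deferred to output time."""
    during = sorted((t for t in terms if t["weight"] != 0.0),
                    key=lambda t: -t["priority"])
    deferred = [t for t in terms if t["weight"] == 0.0]
    return during, deferred


def weighted_total(ordered_terms, raw):
    """Weighted sum in the given order; None as soon as a cutoff is exceeded."""
    total = 0.0
    for term in ordered_terms:
        value = raw[term["name"]]
        cutoff = term["max_allowed"]
        if cutoff is not None and value > cutoff:
            return None  # candidate rejected, remaining scores are skipped
        total += term["weight"] * value
    return total


terms = parse_weights(WEIGHTS)
during, deferred = evaluation_plan(terms)
order = [t["name"] for t in during]

# Made-up raw score values for one candidate
raw = {"RamaScore": 1.0, "CSScore": 2.0, "SecondarySimilarity": 0.5, "ProfileScoreL1": 4.0}
total = weighted_total(during, raw)
```

For the CS-Rosetta style file, RamaScore (priority 400) is evaluated first even though CSScore is listed first, and the three zero-weight terms are left for the output stage.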

Important scoring methods for fragment assessment

The fragment picker components

In brief, the picker processes the vall database one chunk after another. For each chunk it takes all possible fragment candidates, scores them and stores them in collectors. When all vall chunks have been processed, the collectors' content is passed to a selector, which selects the final fragments. These are saved to file(s). All parts of this machinery are briefly described below.
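The flow described above (chunks, candidates, collectors, selector) can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: chunks are plain strings, the score is a simple count of sequence mismatches (not one of the picker's scores), lower scores are taken to be better, and the names `pick_fragments` and `mismatches` are hypothetical.

```python
def mismatches(a, b):
    """Toy score: number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def pick_fragments(query, chunks, frag_size, n_cand, n_frags):
    """Process chunks one by one, collect scored candidates, then select the best."""
    n_positions = len(query) - frag_size + 1
    collectors = {pos: [] for pos in range(n_positions)}
    for chunk_id, chunk in chunks.items():
        # every window of frag_size residues in the chunk is a fragment candidate
        for start in range(len(chunk) - frag_size + 1):
            window = chunk[start:start + frag_size]
            for pos in range(n_positions):
                score = mismatches(query[pos:pos + frag_size], window)
                collectors[pos].append((score, (chunk_id, start)))
        # trim each collector after every chunk so that memory use stays bounded
        for pos in collectors:
            collectors[pos] = sorted(collectors[pos])[:n_cand]
    # selector: keep the n_frags best candidates at each query position
    return {pos: [cand for _, cand in items[:n_frags]]
            for pos, items in collectors.items()}


picked = pick_fragments("ACDE", {"c1": "ACDEF", "c2": "GACDA"},
                        frag_size=3, n_cand=3, n_frags=2)
```

Each entry of `picked` maps a query position to a list of (chunk id, start index) pairs, best first.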

Fragment candidate

... is a fragment-to-be, if it survives the collection and selection stages.

Fragment collector

The collector collects fragments along with their scores; all the collectors are built on utility::vector1<>. Unfortunately there are more than 2M possible fragment candidates. To keep them all one would need about ... per each residue in a query sequence. Therefore a collector can keep only a small fraction of all candidates. BoundedCollector keeps the Ncand best candidates for each position in the query sequence, where "best" is defined by a comparator object that is used to sort the container.
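The bounded-collection idea can be sketched in Python. Note that this sketch swaps in a different technique: it uses a heap per query position rather than a sorted vector with a comparator, and it assumes that a lower score is better. The class below only borrows the name BoundedCollector for readability; its interface (`add`, `best`) is hypothetical and does not mirror the C++ class.

```python
import heapq


class BoundedCollector:
    """Keeps at most n_cand best (lowest-scoring) candidates per query position."""

    def __init__(self, n_positions, n_cand):
        self.n_cand = n_cand
        # One heap per position. Scores are negated, so the heap root is the
        # worst candidate currently kept and can be evicted cheaply.
        self.heaps = [[] for _ in range(n_positions)]

    def add(self, position, score, candidate):
        heap = self.heaps[position]
        item = (-score, candidate)
        if len(heap) < self.n_cand:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # the new candidate beats the current worst one, so replace it
            heapq.heapreplace(heap, item)

    def best(self, position):
        """Candidates kept at this position, best first."""
        return [cand for _, cand in sorted(self.heaps[position], reverse=True)]


collector = BoundedCollector(n_positions=1, n_cand=2)
for score, name in [(3.0, "a"), (1.0, "b"), (2.0, "c"), (5.0, "d")]:
    collector.add(0, score, name)
kept = collector.best(0)
```

With room for two candidates, only "b" and "c" survive; "a" is evicted when "c" arrives and "d" is never stored.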

Fragment selector

A fragment selector takes all collected fragment candidates and selects the final Nfrags fragments.

Quota system

In general, the purpose of quota is to keep diversity among the fragments. If, for example, a given position in the query sequence has been predicted to be helical with 70% probability and loop with 30%, the "select best" protocol will pick only helical fragments for this position, because they are favored by the SecondarySimilarity scoring term. In contrast, the quota protocol will pick 30% loop fragments and 70% helical fragments, taking the best-scoring ones in each class. The situation is further complicated by the fact that 3 secondary structure predictors are used. This gives 9 different categories of fragments in total (referred to below as quota pools), which are collected and scored separately. Once the final fragments have been selected (separately for each quota pool), they are merged into a single set.
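The 70%/30% example can be written as a small calculation. This Python sketch considers a single predictor only, assumes 200 fragments per position (the value used in the flag file above) and uses a hypothetical helper `split_by_probability`; with several predictors the shares are divided further.

```python
def split_by_probability(n_frags, ss_prob):
    """Number of fragments to take from each secondary structure class."""
    return {ss: round(n_frags * p) for ss, p in ss_prob.items()}


# Position predicted helical (H) with 70% and loop (L) with 30% probability
quota_counts = split_by_probability(200, {"H": 0.7, "L": 0.3})
```

Under these assumptions the quota protocol would take 140 helical and 60 loop fragments, whereas select best would take all 200 from the helical candidates. In general the rounded counts need not add up exactly to the requested total.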

The quota protocol uses quota-specific collectors and selectors. The scoring scheme is also altered.

Quota pools

In the quota protocol there are several fragment categories (pools) that are kept separate from each other. They are collected, scored and selected separately. By default, 3 secondary structure predictions are used for fragment picking: PsiPred, SAM and Porter. The fragment candidates are also split by secondary structure class (H, E or L), which makes 9 quota pools in total. The size of each pool is controlled by the quota allowance and the secondary structure probability.

From the implementation's point of view, a quota pool is a BoundedCollector whose size is based on the quota allowance and which is sorted by the quota score (a slightly modified total score). Note that quota pools, like fragment collectors, are position-specific, so for a 100 aa query sequence there are about 900 quota pools.
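The pool count quoted above follows from enumerating every combination of query position, predictor and secondary structure class. The Python sketch below uses a hypothetical helper `quota_pool_keys` and simplifies by treating each of the 100 residues as a fragment position; the true number of positions depends on the fragment size, which is presumably why the text says "about".

```python
from itertools import product

PREDICTORS = ["psipred", "sam", "porter"]
SS_CLASSES = ["H", "E", "L"]


def quota_pool_keys(n_positions):
    """One (position, predictor, secondary structure class) key per quota pool."""
    return list(product(range(n_positions), PREDICTORS, SS_CLASSES))


pools = quota_pool_keys(100)
pools_per_position = len(PREDICTORS) * len(SS_CLASSES)
```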

Quota.def file

#pool_id        pool_name       fraction
1               psipred         0.6
2               porter          0.2
3               sam             0.2

Quota allowance

... is defined for each predictor by a Quota.def file. Default allocations are:

PsiPred - 0.6
SAM - 0.2
Porter - 0.2

The final allowance for a quota pool is the product of the predictor share and the secondary structure probability. For example, if PsiPred predicted that a certain position is helical
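The product rule for the quota allowance can be sketched in Python. The parser `parse_quota_def` is hypothetical and assumes the three-column layout of the Quota.def file shown earlier; the secondary structure probabilities are invented values for a single query position and do not come from any real prediction.

```python
def parse_quota_def(text):
    """Return {pool_name: fraction} from Quota.def-style text."""
    shares = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        _pool_id, name, fraction = line.split()
        shares[name] = float(fraction)
    return shares


QUOTA_DEF = """
#pool_id        pool_name       fraction
1               psipred         0.6
2               porter          0.2
3               sam             0.2
"""

shares = parse_quota_def(QUOTA_DEF)
total_share = sum(shares.values())  # the shares should add up to 1.0

# Invented H/E/L probabilities at one query position (illustration only)
ss_prob = {
    "psipred": {"H": 0.7, "E": 0.0, "L": 0.3},
    "porter":  {"H": 0.5, "E": 0.1, "L": 0.4},
    "sam":     {"H": 0.6, "E": 0.0, "L": 0.4},
}

# allowance of a pool = predictor share * secondary structure probability
allowance = {
    (predictor, ss): shares[predictor] * p
    for predictor, probs in ss_prob.items()
    for ss, p in probs.items()
}
```

With these invented numbers the PsiPred helical pool gets an allowance of 0.6 * 0.7 = 0.42, and the nine allowances at the position add up to 1.0 because each predictor's probabilities sum to 1.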

Quota score - pool identification

As mentioned in the Quota score section, some scores are switched on and off for different pools. For this to work properly, the two config files (the weight file for fragment picking and the Quota.def file) must contain matching string identifiers. Although the examples above use the predictors' names (psipred, porter and sam) for this purpose, any arbitrary strings can be used. The only limitation is that the three :

Quota score

The only difference between the fragment total score and the fragment quota score is the use of the proper secondary-structure variant of some scores. Currently this applies only to the RamaScore and SecondarySimilarity scores. For example, a quota pool created from a prediction named "psipred" uses only the SecondarySimilarity score named "psipred".
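The switching of score terms per pool can be illustrated in Python. This is a sketch of the matching rule as described here, not the picker's code: the function `terms_for_pool` is hypothetical, and each score term is reduced to a (score name, identifier) pair taken from the ab-initio weight file example.

```python
# Scores that exist in one variant per secondary structure prediction
SS_SPECIFIC = {"RamaScore", "SecondarySimilarity"}


def terms_for_pool(terms, pool_name):
    """Keep general terms and only those prediction-specific terms that match the pool."""
    return [(name, ident) for name, ident in terms
            if name not in SS_SPECIFIC or ident == pool_name]


terms = [
    ("SecondarySimilarity", "psipred"),
    ("SecondarySimilarity", "sam"),
    ("SecondarySimilarity", "porter"),
    ("RamaScore", "psipred"),
    ("RamaScore", "porter"),
    ("RamaScore", "sam"),
    ("ProfileScoreL1", None),
]

psipred_terms = terms_for_pool(terms, "psipred")
```

The match is by string identifier, which is why the identifiers in the weight file and in the Quota.def file have to agree.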

Generated on Sun Mar 6 22:03:06 2011 for Rosetta Projects by  doxygen 1.5.9

© Copyright Rosetta Commons Member Institutions. For more information, see