|Rosetta 3.2 Release Manual|
rosetta_source/src/apps/public/AbinitioRelax.cc. See the
test/integration/tests/abinitio directory for an example ab initio run and input files.
Previously, the standard structure prediction protocol was to (1) generate a large sample of "low-resolution" models using the first step (typically up to 10,000), (2) cluster the low-energy models using a score cutoff of around 10-20 percent, and then (3) select cluster centers for all-atom refinement using the Relax application. The advantage of this protocol is that it is relatively time-efficient, since "Abinitio" folding is fast and "Relax" is more time-consuming. The potential drawback is that if no near-native models are sampled during the "Abinitio" folding step, the "Relax" stage cannot make up for it.
As more computational power became available, the "abrelax" protocol (see Options section below) was created to streamline this process by doing "Abinitio" folding followed directly by "Relax". This protocol is much more time-demanding, and improvements are realized only with enough conformational sampling. This is partly because the full-atom energy function is very sensitive to imperfect atomic interactions, so the scores are noisier when sampling is insufficient; convergence towards the native structure may require a significant amount of sampling. Additionally, to increase your chance of sampling the correct topology, a diverse set of homologous sequences may also be run, preferably ones with sequence changes likely to have a greater impact on sampling, such as deletions and differences in conserved positions. A homolog may converge towards the native structure with significantly less sampling (see Bradley et al. reference).
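As a rough illustration of step (2) of the older protocol, the sketch below keeps the lowest-scoring fraction of models before clustering. It assumes a hypothetical two-column text file (model tag, total score) that you have prepared yourself; the function name select_low_energy and the file name scores.txt are illustrative and are not part of Rosetta.

```shell
#!/bin/sh
# Hypothetical helper (not part of Rosetta): read "tag score" lines on stdin
# and print the lowest-scoring PERCENT of them, best score first.
select_low_energy() {
    percent="$1"
    # Sort numerically on the score column (lower scores are better), then
    # keep the first floor(N * percent / 100) lines, but at least one.
    LC_ALL=C sort -k2,2n | awk -v p="$percent" '
        { line[NR] = $0 }
        END {
            n = int(NR * p / 100)
            if (n < 1 && NR > 0) n = 1
            for (i = 1; i <= n; i++) print line[i]
        }'
}

# Example: keep the best 20 percent of the models listed in scores.txt
# (an assumed file name); nothing happens if the file does not exist.
if [ -f scores.txt ]; then
    select_low_energy 20 < scores.txt
fi
```

The cutoff is expressed as a percentage of the models rather than as an absolute score, which matches the "10-20 percent" wording above.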
./bin/AbinitioRelax.linuxgccrelease
  -in:file:native 1l2y.pdb               Native structure (optional)
  (or -in:file:fasta 1l2y_.fasta)        Protein sequence in fasta format (required if native structure is not provided)
  -in:file:frag3 aa1l2yA03_05.200_v1_3   3-residue fragments (fragments file)
  -in:file:frag9 aa1l2yA09_05.200_v1_3   9-residue fragments (fragments file)
  -database ../minirosetta_database      Path to rosetta database
  -abinitio:relax                        Do a relax after abinitio ("abrelax" protocol), default=false.
  -nstruct 1                             Number of output structures
  -out:file:silent 1l2y_silent.out       Use silent file output, use filename after this flag, default=default.out
  (or -out:pdb)                          Use PDB file output, default=false
  -out:path /my/path                     Path where PDB output files will be written to, default '.'
There are several optional settings which have been benchmarked and tested thoroughly for optimal performance (we recommend using these options):
  -use_filters true                   Use radius of gyration (RG), contact-order, and sheet filters. This option
                                      conserves computing by not continuing with refinement if a filter fails.
                                      A caveat is that for some sequences, a large percentage of models may fail
                                      a filter. The filters are meant to identify models with non-protein like
                                      features.
  -psipred_ss2 1l2y_.psipred_ss2      psipred_ss2 secondary structure definition file (required for -use_filters)
  -abinitio::increase_cycles 10       Increase the number of cycles at each stage in ab initio by this factor.
  -abinitio::rg_reweight 0.5          Reweight contribution of radius of gyration to total score by this scale factor.
  -abinitio::rsd_wt_helix 0.5         Reweight env,pair,cb for helix residues by this factor.
  -abinitio::rsd_wt_loop 0.5          Reweight env,pair,cb for loop residues by this factor.
  -relax::fast                        Do a fastrelax which is significantly faster than the traditional relax
                                      protocol without a significant performance hit.
  -kill_hairpins 1l2y_.psipred_ss2    Setup hairpin killing in score (kill hairpin file or psipred file). This
                                      option is useful for all-beta or alpha-beta proteins with predicted strands
                                      adjacent in sequence since hairpins are often sampled too frequently.
For running multiple jobs on a cluster the following options are useful:
  -constant_seed      Use a constant seed (1111111 unless specified with -jran)
  -jran 1234567       Specify seed. Should be unique among jobs (requires -constant_seed)
  -seed_offset 10     This value will be added to the random number seed. Useful when using time as seed
                      and submitting many jobs to a cluster. If jobs are started in the same second they
                      will still have different initial seeds when using a unique offset.
If using Condor (http://www.cs.wisc.edu/condor), the Condor process id, $(Process), can be used for this. For example "-seed_offset $(Process)" can be used in the condor submit file.
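Outside Condor, one way to apply -seed_offset is a simple loop that gives each job its own offset and its own silent file. The sketch below only prints the command lines (a dry run) and uses only flags described in this section. The job count, the per-job output file naming, and the function name abinitio_job_cmd are assumptions made for illustration.

```shell
#!/bin/sh
# Print (dry run) one AbinitioRelax command line for job number JOB_ID.
# Each job gets a unique -seed_offset and writes to its own silent file
# (the per-job file naming is an assumption, not a Rosetta convention).
abinitio_job_cmd() {
    job_id="$1"
    echo "./bin/AbinitioRelax.linuxgccrelease" \
        "-database ../rosetta_database" \
        "-in:file:fasta 1l2y_.fasta" \
        "-in:file:frag3 aa1l2yA03_05.200_v1_3" \
        "-in:file:frag9 aa1l2yA09_05.200_v1_3" \
        "-abinitio:relax" \
        "-nstruct 10" \
        "-seed_offset ${job_id}" \
        "-out:file:silent 1l2y_silent_${job_id}.out"
}

# Print the command lines for four jobs numbered 0..3. To actually run
# them, hand each printed line to your queueing system or to "sh -c".
for job_id in 0 1 2 3; do
    abinitio_job_cmd "$job_id"
done
```

Writing each job to a separate silent file avoids several processes appending to the same file at once; the files can be combined afterwards.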
The standard command line for optimal performance is shown below (nstruct should be set depending on how many models you want to generate):
./bin/AbinitioRelax.linuxgccrelease \
    -database ../rosetta_database \
    -in:file:fasta 1l2y_.fasta \
    -in:file:native 1l2y.pdb \
    -in:file:frag3 aa1l2yA03_05.200_v1_3 \
    -in:file:frag9 aa1l2yA09_05.200_v1_3 \
    -abinitio:relax \
    -relax:fast \
    -abinitio::increase_cycles 10 \
    -abinitio::rg_reweight 0.5 \
    -abinitio::rsd_wt_helix 0.5 \
    -abinitio::rsd_wt_loop 0.5 \
    -use_filters true \
    -psipred_ss2 1l2y_.psipred_ss2 \
    -kill_hairpins 1l2y_.psipred_ss2 \
    -out:file:silent 1l2y_silent.out \
    -nstruct 10
./bin/score.linuxgccrelease \
    -database ../rosetta_database \
    -in:file:silent 1l2y_silent.out \
    -in:file:fullatom \
    -output \
    -rescore:output_only
./bin/cluster.linuxgccrelease \
    -database ../rosetta_database \
    -in:file:silent 1l2y_silent.out \
    -in:file:fullatom \
    -cluster:radius -1
PDB files of the cluster members are extracted from the silent output file by the cluster application.
Additional cluster options include (see cluster.linuxgccrelease for more information):
  -cluster:radius <float>                        Cluster radius in Å (for RMS clustering) or in inverse GDT_TS for
                                                 GDT clustering. Use "-1" to trigger automatic radius detection
  -cluster:gdtmm                                 Cluster by gdtmm instead of rms
  -cluster:input_score_filter <float>            Ignore structures above certain energy
  -cluster:exclude_res <int> [<int> <int> ..]    Exclude residue numbers from structural comparisons
  -cluster:limit_cluster_size <int>              Maximal cluster size
  -cluster:limit_clusters <int>                  Maximal number of clusters
  -cluster:limit_total_structures <int>          Maximal number of structures in total
  -cluster:sort_groups_by_energy                 Sort clusters by energy.
In an ideal case, your sequence will have many homologs identified by search tools like PSI-BLAST. Sequence alignments can be extremely helpful in model selection. For example, conserved hydrophobic positions most likely represent the core of the protein so models that have sidechains exposed in such positions may be discarded. The same logic applies to conserved polar positions which are most likely on the surface. Additionally, conserved cysteine pairs may represent disulphides. Tools like Jalview to view alignments and PyMOL to view models are extremely helpful for model selection in this respect.
Score versus RMSD plots may be helpful for identifying convergence towards the native structure for the target sequence and homologs. For example, the lowest scoring model can be used for the -in:file:native input option when rescoring models with the score application (score.linuxgccrelease). A score versus RMSD plot from the resulting score file may show convergence (an energy funnel) towards the lowest scoring model. If an energy funnel exists, the lowest scoring model has a greater chance of being near-native.
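The data for such a plot can be pulled out of a score file with standard text tools. The sketch below assumes that score lines start with "SCORE:" and that the header line names columns "score", "rms" and "description"; check these names against your own file, since they are assumptions here. The function name score_vs_rms and the file names default.sc and score_vs_rms.txt are also illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print "rms score description" for every model in a
# score file read from stdin, sorted by score (best first).
# Assumes lines start with "SCORE:" and that the header line names the
# columns "score", "rms" and "description"; adjust if your file differs.
score_vs_rms() {
    awk '
        $1 != "SCORE:" { next }
        # The header line is the one whose second field is the word "score".
        $2 == "score" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Data lines: emit the three columns located via the header.
        ("rms" in col) && ("description" in col) {
            print $(col["rms"]), $(col["score"]), $(col["description"])
        }' | LC_ALL=C sort -k2,2n
}

# Example: write plot-ready columns from an assumed score file name.
if [ -f default.sc ]; then
    score_vs_rms < default.sc > score_vs_rms.txt
fi
```

The first output line is the lowest scoring model; plotting column 2 against column 1 in any plotting tool gives the score versus RMSD view described above.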
Lowest scoring models that belong to a cluster and whose topology is represented in the PDB also have a greater chance of being correct. Structure-structure comparison tools like Dali or Mammoth can be used to search against the PDB.