Symmetry Docking and Clustering with Ligands

10 posts / 0 new
Symmetry Docking and Clustering with Ligands
#1

Greetings Rosetta Team,

Now that I've resolved my ligand_docking issues (big thanks to Rocco!) I would like to generate and screen symmetric protein+ligand homo-multimer structures.

First, in order to get SymDock to recognize my ligand and generate output, I had to rename one of the ligand's carbon atoms "CA".
Not a big deal, but it required changing the name for each occurrence of the ligand (input .pdb, parameter files, and rotamer libraries).

For the test case, SymDock.linuxgccrelease was run using PBS on four cores with the following flags:
-in:path pathtofile/rosetta_database/
-in:file:s pathtofile/$input.pdb -extra_res_fa pathtofile/BCA.fa.params -extra_res_cen pathtofile/BCA.cen.params -symmetry:symmetry_definition pathtofile/C2.symm -symmetry:initialize_rigid_body_dofs -symmetry:symmetric_rmsd -packing:ex1 -packing:ex1aro -packing:ex2 -packing:ex2aro -out:nstruct$NUM
-out:file:fullatom true
-out:user_tag $PBS_ARRAYID -out:prefix Proc$PBS_ARRAYID
-out:file:silent Proc$PBS_ARRAYID-silent.out -out:file:silent_struct_type binary -out:file:scorefile Proc$PBS_ARRAYID.score
-score::weights mm_std

The results were then concatenated into a mastersilent.out
When I look at these, I notice that there seems to be a random rotation applied--the same protein (Chain A) doesn't superimpose for all the output structures, instead, they are all centered around the origin.

The problem comes when I try to cluster the results...

When I run cluster.linuxgccrelease with the following flags:
-database pathtofolder/rosetta_database/
-extra_res_fa pathtofile/BCA.fa.params
-extra_res_cen pathtofile/BCA.cen.params
-in:file:silent pathtofile/mastersilent.out
-in:file:fullatom
-in:file:silent_struct_type binary
-cluster:radius 5
-cluster:sort_groups_by_energy
-cluster:input_score_filter -115.0
-symmetry:symmetric_rmsd
-score:weights mm_std

I get the following warnings:

protocols.cluster: Clustering an initial set of 20 structures
protocols.cluster: Calculating RMS matrix:
core.scoring.rms_util: WARNING: CA_rmsd out of range... range 1 120 requested but only 118 CA atoms found in fill_rmsd_coordinates
core.scoring.rms_util: WARNING: CA_rmsd calculation aborted. Setting rmsd to -1

The clustering "completes", but I have a single cluster with all 20 members
Two problems:
1) When I specified the -input_score_filter as -115, I only expected 6 poses to make the cut. Instead, I get 20 poses (and 20 output c.##.##.pdb files) , regardless if I include this flag or not.
2) All of the poses are grouped into one cluster, probably because cluster is confused by the ligand's CA atom (even though SymDock "worked" with a single CA...)

When I try to be clever and instruct cluster to ignore the ligand by adding the flag: -cluster:exclude_res 60, I get the following (slightly different) warnings:

protocols.cluster: Clustering an initial set of 20 structures
protocols.cluster: Calculating RMS matrix:
protocols.cluster: Warning!!! For symmetric clustering currently only all CA clustering is available. Calculating all CA rmsd instead...
core.scoring.rms_util: WARNING: CA_rmsd out of range... range 1 120 requested but only 118 CA atoms found in fill_rmsd_coordinates
core.scoring.rms_util: WARNING: CA_rmsd calculation aborted. Setting rmsd to -1

So here, the cluster app recognizes that I would like to ignore the ligand, but symmetric clustering won't let me, so it reverts to the all CA rmsd, and we get the same output (one nonsense cluster with all 20 poses) as the first case.

When I try cluster without the -symmetry:symmetric_rmsd flag, I get a different error:

protocols.cluster: Clustering an initial set of 20 structures
protocols.cluster: Calculating RMS matrix:
core.scoring.rms_util: WARNING: CA_rmsd out of range... range1 122 requested but only 118 CA atoms found in fill_rmsd_coordinates

In this case cluster first finds two extra residues (part of the symmetry info?), but ignores them, generating 20 clusters with 20 different centers.
If I repeat this but relax the -cluster:radius from 5 to 10, I obtain 15 different clusters, but I am concerned the RMSD calculation is not correctly pre-treating the system without the -symmetry flag present
Put another way, since the atoms of Chain A are randomly rotated (see above), the CA RMSD calculation won't make sense because the identical chains are not being superimposed first...

Finally, when I try to cluster by omitting the -symmetry:symmetric_rmsd flag and adding -cluster:exclude_res 60, I cluster.app crashes generating no output other than the following error:
ERROR: ct == final_atoms
ERROR:: Exit from: src/core/scoring/rms_util.cc line: 524

Taking a step back, what I would like to do is cluster the results from a larger run based on score, shape, and relative orientation (parallel/antiparallel/skew/etc.), so I am fine with using symmetric all CA rmsd as conventionally implemented in Rosetta 3.3
This suggests one of the two following choices:
1) Have cluster.app recognize the CA present in the ligand, so that it can be included in the symmetric all CA RMSD calculation. (preferred)
--or--
2) Completely ignore the ligand when it comes time to calculate the symmetric all CA RMSD for clustering (next best option)

Note that I don't expect the CA atom within the ligand to significantly contribute to the overall RMSD, but the ligand needs to be present when the multimers are constructed (taking into account sterics, electrostatics, etc.)
Future runs can investigate the effect of the ligand further, but I need reasonable clustered poses to use as starting points. :)

Note that when I use SymDock and clustering with symmetry without a ligand present, I am able to successfully obtain structures (only the -input_score_filter flag still doesn't seem to work...)

I have attached the input .pdb, the binary silent.out file, as well as the modified .params files (just delete the trailing .txt) for debugging/troubleshooting purposes.

I welcome any comments, advice, and assistance.

Thanks in advance!
Sándor

AttachmentSize
76.45 KB
442.04 KB
16.71 KB
Post Situation:
Wed, 2012-03-07 16:46
skovacs

Steven forwarded me your post, so let me see if I can help a bit. My recommendation would be to try the program calibur in your clustering, which hopefully will be a part of Rosetta in the not-to-distant future.
('Calibur: a tool for clustering large numbers of protein decoys http://sourceforge.net/projects/calibur/)

You would have to split the silent file and generate a path list to each decoy. Next, you would specify to calibur which chains you want to include in your clustering. It should recognize the CA of the ligand, but may have problems with the residue numbering. If it spits out errors due to the residues of the ligand, I can email them to see if they can include a way to read the ligand. They are actively developing it, and are pretty quick and willing to add new features, so let me know how it goes.

-Jared

Thu, 2012-03-08 08:27
jadolfbr

Hi Jared,

Thanks for your advice. I've successfully downloaded and test-run Calibur a few times (without errors), and I have a few questions:

The first is about trimming the silentfile I generate as a result of SymDock. When I used ligand_docking, there was a python script called prune_atdiff_top5pct.py (located in apps/public/ligand_docking/) which would retain the top 5% of the structures by score.
Is there an equivalent script available for trimming these binary silentfiles as output by SymDock by score? If not, what do you think is the easiest way to try and go about this?

Next, I am able to successfully extract .pdbs from a test silentfile using extract_pdbs into a single directory and easily generate a list by using ls on that directory and redirecting the output to a file called pdblist.txt for use in Calibur.
Calibur also seems to easily accept the four chains I'd like to use to calculate CA rmsd (two protein and two ligand in terms of the current dimer system), so it seems trivial to extend/expand this chain list when I consider additional Cn and Dn symmetries.

The problem I have is with the output Calibur generates--it seems to only list the three largest clusters by default and doesn't seem to have an option to list all the clusters it finds.
I obtain the following output when I use the command: pathtoexecutable/calibur -n -c ABCD -t r pdblist.txt
Filtering off
Signature mode on
Using chains 'A', 'B', 'C', 'D' in PDB files
Read 495 decoy names
Estimating threshold range... complete
Estimated threshold range = [0.22691, 43.38322]
Most frequent distance = 20.25648
10.00000 percent distances below 13.02289
Finding threshold using random decoys (max cluster size = 33)
Cluster size 18 => threshold=4.57220. [0.22691,21.80506]
Threshold = 4.57220. Found in 1.65000 s
Reading decoys with signature mode on and filter mode off
Read 495 decoys.
Decoys read in 1.17000 s
Initialized 495 decoys.
Use of SCUD's bound on
Realigning decoys... completed in 0.00000 s
Using MATRIX mode
Auxiliary clustering... 99.8%
AuxiliaryClustering... completed in 0.03000 s
Finding decoys' neighbors... completed 99.8%
Finding all neighbors completed in 0.09099 s
Finding and removing largest clusters... completed in 0.00000 s
Largest 2 clusters: Proc66BChlAandCsmA-CA_0003.pdb(20), Proc58BChlAandCsmA-CA_0004.pdb(11). Margin = 45.00000%
Proc66BChlAandCsmA-CA_0003.pdb 20: Proc17BChlAandCsmA-CA_0001.pdb Proc23BChlAandCsmA-CA_0005.pdb Proc24BChlAandCsmA-CA_0005.pdb Proc2BChlAandCsmA-CA_0001.pdb Proc30BChlAandCsmA-CA_0001.pdb Proc41BChlAandCsmA-CA_0004.pdb Proc44BChlAandCsmA-CA_0003.pdb Proc63BChlAandCsmA-CA_0005.pdb Proc66BChlAandCsmA-CA_0003.pdb Proc86BChlAandCsmA-CA_0001.pdb Proc93BChlAandCsmA-CA_0005.pdb Proc94BChlAandCsmA-CA_0003.pdb Proc27BChlAandCsmA-CA_0003.pdb Proc57BChlAandCsmA-CA_0001.pdb Proc66BChlAandCsmA-CA_0004.pdb Proc28BChlAandCsmA-CA_0005.pdb Proc31BChlAandCsmA-CA_0004.pdb Proc48BChlAandCsmA-CA_0003.pdb Proc6BChlAandCsmA-CA_0003.pdb Proc68BChlAandCsmA-CA_0004.pdb

Proc58BChlAandCsmA-CA_0004.pdb 11: Proc14BChlAandCsmA-CA_0002.pdb Proc20BChlAandCsmA-CA_0004.pdb Proc36BChlAandCsmA-CA_0004.pdb Proc52BChlAandCsmA-CA_0001.pdb Proc54BChlAandCsmA-CA_0002.pdb Proc55BChlAandCsmA-CA_0002.pdb Proc58BChlAandCsmA-CA_0004.pdb Proc87BChlAandCsmA-CA_0004.pdb Proc92BChlAandCsmA-CA_0005.pdb Proc73BChlAandCsmA-CA_0005.pdb Proc90BChlAandCsmA-CA_0002.pdb

Proc15BChlAandCsmA-CA_0001.pdb 10: Proc12BChlAandCsmA-CA_0002.pdb Proc25BChlAandCsmA-CA_0004.pdb Proc3BChlAandCsmA-CA_0002.pdb Proc42BChlAandCsmA-CA_0003.pdb Proc15BChlAandCsmA-CA_0001.pdb Proc57BChlAandCsmA-CA_0004.pdb Proc24BChlAandCsmA-CA_0004.pdb Proc32BChlAandCsmA-CA_0003.pdb Proc63BChlAandCsmA-CA_0004.pdb Proc81BChlAandCsmA-CA_0005.pdb

So, while it's clearly clustering the poses (and INCREDIBLY FAST at that!) the current output it generates isn't helpful just yet... I would need a way to obtain representative poses for EACH of the cluster centers so that I can (re)extract the relevant tags to (re)obtain their scores. Ideally, I would only end up actually looking at, say, the top five or ten best-scoring clusters for a given symmetry.

Also, I am a little confused by the -t flag as used in Calibur. If I would like to cluster based on relative shape (as likely determined by different values of CA RMSD), which of the following options would be best?
From calibur's "help" section:
-t (optional) specifies the threshold finding strategy.
s is one of p, f, a, R, r. (default strategy: p)
p: threshold results in only x% of "edges" between decoys.
default x=100/sqrt(sqrt(#decoys))
f: threshold = min dist + x * (most frequent dist - min dist)
default x=0.666667 (=2/3)
a: threshold = min dist + x * min(the avarage dist of decoys from a decoy)
default x=0.666667 (=2/3)
R: find threshold using ROSETTA's method (auto-detect parameters).
r: same as R, but with a sampled decoy set rather than the full set.

Your thoughts and feedback are appreciated!

Thu, 2012-03-08 16:30
skovacs

From the developers of calibur (Send me your email so I can send you the updated version as it may be a few days before it's up on sourceforge):

Dear Skovacs,

> Largest 2 clusters: Proc66BChlAandCsmA-CA_0003.pdb(20),
> Proc58BChlAandCsmA-CA_0004.pdb(11). Margin = 45.00000%
> Proc66BChlAandCsmA-CA_0003.pdb 20: ...
> Proc58BChlAandCsmA-CA_0004.pdb 11: ...
> Proc15BChlAandCsmA-CA_0001.pdb 10: ...
>
Calibur currently outputs the representative decoys of the three
largest clusters, together with the cluster sizes, and the decoys
within each of them.
That is, in the output above, the three largest clusters are of sizes 20,
11, 10 respectively, with the cluster centers (or representative decoys)
Proc66BChlAandCsmA-CA_0003.pdb,
Proc58BChlAandCsmA-CA_0004.pdb, and
Proc15BChlAandCsmA-CA_0001.pdb, respectively.

The line which begins with "Largest 2 clusters:" may be rather
confusing. It is to facilitate calibur's mechanism for re-clustering when
the top two clusters are significantly similar in size. The line has nothing
to do with the resultant clusters.

> Ideally, I would only end up actually looking at, say, the top five or
> ten best-scoring clusters for a given symmetry.
>

Currently calibur outputs the top three clusters. In the new version
attached with this email we added the ability to specify the maximum
number of clusters to output, through a "-o" commandline switch.
That is, "-o 5" will get calibur to output the 5 largest clusters.

> Also, I am a little confused by the -t flag as used in Calibur. If I
> would like to cluster based on relative shape (as likely determined
> by different values of CA RMSD), which of the following options
> would be best?
>
The clustering technique currently cannot distinguish between
cluster shapes. The "-t" flag simply selects the mechanism to use
for threshold determination (we recommend the default mechanism).

As far as we know, all current available clustering tools are centroid
based (i.e. they either use Rosetta's clustering method, or is based
on minimizing the sum of square distances to the centroids). These
methods do not explicit consider the cluster shape/distribution.
We are examining the relationship between cluster shape and the
quality of the clustering result.

Best Regards,
YK Ng and SC Li
kalngyk@gmail.com, shuaicli@gmail.com

Fri, 2012-03-09 09:20
jadolfbr

Hi Jared,

Thanks for facilitating communication between me and the developers. They sure do turn it around fast!
This -o option sounds like it effectively captures the functionality I'm looking for, but I will still need to sort the cluster center representative poses by Rosetta score.
I suppose I could always set -o to the total number of decoys I'm clustering to be sure that the ENTIRE LIST of cluster centers will be output.

A script which extracts scores from a silentfile based on a supplied list of tags (i.e. all these cluster centers) and then ranks them based on score sounds like what needs to happen next...unless you could point me to one which already exists!

Cheers,
Sándor

Fri, 2012-03-09 14:19
skovacs

I'm unaware of a existing script which does what you want, but it's easy enough to put together through bash (or you favorite unix shell script), which is probably why there isn't one. Just grep out the lines from the (binary or protein formatted) silent file which start with "SCORE:", filter based on tag (which is typically the last column), and then use the sort command to sort based on the total_score column (typically the second column - see "man sort" for details on how to specify a sorting column and use numeric sorting, offhand I think it's something like "-k 2,2 -n".)

Sun, 2012-03-11 12:17
rmoretti

This is the kind of thing that would be easy to do once it's in Rosetta. Unfortunately, I don't have a script I could point you towards...

Fri, 2012-03-09 14:37
jadolfbr

Thanks for the comments Jared and Rocco!

Just finished writing and debugging a post-processing .csh script that will extract the tags for the top 5% of symmetry docking poses by score, write out the relevant .pdbs, cluster them using Calibur, and then export the clustered results for visualization and further analysis.
Before closing this topic, can someone briefly explain the difference between score and I_sc (as well as how I_sc is calculated) and why someone would want to use one rather than another in terms of ranking plausible symmetry structures?

Thanks,
Sándor

Tue, 2012-03-20 15:49
skovacs

From what I can tell, I_sc is the interface score, and is calculated by the difference between the total score when bound, and the total score when the docked structures are separated by a large distance.

The total score is a good metric to tell if the protein is well folded and in an overall good state. The interface score would be used more to tell how well things are docked, as it's less sensitive to changes in score that have no impact on docking (e.g. if some loop far from the binding interface is mis-folded).

I'm not sure about symmetric docking, but for other docking situations, the interface energy is what you want to use when you're comparing substantially different structures (e.g. when ranking binding to a common partner), as different structures have different intrinsic total scores.

Which you use depends on what you're trying to rank by.

Wed, 2012-03-21 11:12
rmoretti

**-input_score_filter** does NOT work! The parameter is not read in by the cluster program, at least in Version 3.4. See the discussion in thread:
http://www.rosettacommons.org/content/inputscorefilter-not-taken-program...
You can fix it by yourself (read to the end of the thread).

The two missing residues complained by cluster program are the virtual residues representing the reference frames. If you use C2.symm generated by make_symmdef_file_denovo.py (as in Figure 8(B) in paper DiMaio et al. 2011 PLoS ONE), there are two virtual residues. That's why cluster complains there are two missing residues (120-118=2). You can also see them as the two X letters in the end of the SEQUENCE section in the output silent file (e.g. 1st line of the silent file). Using the **-cluster:exclude_res** to exclude these virtual residues will make cluster program stop complaining about the missing residues.

cluster program uses CA_rmsd(), which checks whether the number of residues equals to the number of CA atoms in the protein. If not it spits the WARNING. Virtual residues do not have CA atoms.

It seems CA_rmsd carries on with the calculation of RMSD using all CA atoms it gets. For different poses from the same protein, the WARNING can be safely ignored (unofficial judgement!).

Mon, 2012-07-02 08:19
attesor