In the literature, it seems that the core structures of the top 5 clusters are used to present predictions.
But how do I find these core structures?
Oana tells me the clustering app identifies them automatically by making them the first pdb listed for each cluster.
Thanks for replying. Do you mean the one with the new tag, c.0.0.pdb? I recently ran the following test: using this first structure as the reference, I calculated the RMSD for all PDBs in the cluster and got values of around 9-10 Å (the cluster was formed with a 5 Å cutoff). That seems to say the first one has the lowest energy but isn't the core. Is this right? Please help me.
Multiple people have told me that it's "the first member of the cluster" that's the center - that's all I know.
From what I understand, when it says "cluster center" (e.g. in Baker lab CASP papers), it really means the lowest energy structure for that cluster, rather than the geometric (by rmsd) cluster center. That's what's being output as the c.*.0.pdb structures. And again, for the top 5 clusters, it's a ranking by energy (i.e. using the -cluster:sort_groups_by_energy flag.) The reason is that sometimes there is such good convergence on the native structure that it will be the only structure in its cluster. So the five structures selected would be output as c.0.0.pdb c.1.0.pdb c.2.0.pdb c.3.0.pdb and c.4.0.pdb from the clustering application run with the -cluster:sort_groups_by_energy flag.
I don't use the clustering application myself, but from looking at the code, I don't think there's a way to get the geometric cluster centers without re-writing the C++. The Rosetta clustering code is a little old, and is showing its age. A number of Rosetta people are moving over to using Calibur for clustering. ( http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881085/ http://sourceforge.net/projects/calibur/ ). Also the "top structure from each of the top 5 clusters" isn't necessarily the optimal way of picking structures, and my understanding is that the Baker lab will be using a different method for CASP10.
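If someone wants the geometric center without touching the C++, here is a minimal sketch in Python. It assumes you have already computed a symmetric pairwise RMSD matrix for the members of one cluster with whatever tool you like; the function name `medoid_index` and the toy numbers are made up for illustration and are not part of Rosetta or Calibur. It picks the member with the smallest mean RMSD to all the others (the medoid), which is one reasonable definition of a geometric center.

```python
def medoid_index(rmsd):
    """Return the index of the member with the smallest mean RMSD to the others.

    `rmsd` is a square, symmetric list of lists (hypothetical input format).
    """
    n = len(rmsd)
    if n == 0:
        raise ValueError("empty cluster")
    best_i, best_mean = 0, float("inf")
    for i in range(n):
        # Mean distance from member i to every other member.
        mean_i = sum(rmsd[i][j] for j in range(n) if j != i) / max(n - 1, 1)
        if mean_i < best_mean:
            best_i, best_mean = i, mean_i
    return best_i


# Toy 4-member cluster (values in Å, invented for illustration).
rmsd = [
    [0.0, 4.0, 9.0, 8.0],
    [4.0, 0.0, 5.0, 4.5],
    [9.0, 5.0, 0.0, 3.0],
    [8.0, 4.5, 3.0, 0.0],
]
center = medoid_index(rmsd)
max_from_first = max(rmsd[0])        # largest RMSD measured from member 0
max_from_center = max(rmsd[center])  # largest RMSD measured from the medoid
```

In this toy matrix, member 0 is up to 9 Å away from the other members while the medoid (member 1) stays within 5 Å of everything. More generally, if a cutoff is measured from a center structure, two members can be up to twice the cutoff apart, so large RMSD values measured from a non-central member are not surprising.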
Should one really interpret a singleton low-energy decoy as "such good convergence", or is it merely an outlier that should not be taken seriously? If it really is good convergence, why does one do clustering at all?
Thanks for both replies, that really helps.
Now I can see where the high RMSD values come from. My main concern is how to find the best representative structures. I have been using the low-energy structures as results. When I saw that some of the Baker lab's papers mention the cluster core, I got a little confused. So the geometric core (my understanding of 'core') is not necessarily a better representation than the low-energy ones. Is that correct? Thanks in advance.
If you assume the scoring function reasonably reflects the free energy of the conformations, the biggest clusters are likely sampled from the vicinity of the native conformation (such conformations are sampled more frequently due to the nature of the Monte Carlo procedure).
After identifying these biggest clusters, the lowest-energy decoy in each should be regarded as its representative, because it is the deepest in the energy funnel, assuming again that the energy function reasonably reflects the free energy of the conformations.
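The selection described above can be sketched roughly as follows. This is only an illustration under stated assumptions: each decoy is given as a hypothetical `(name, cluster_id, energy)` tuple that you would have to assemble yourself from your clustering output and score file, and `representatives` is a made-up helper, not a Rosetta function. Clusters are ranked by size, and the lowest-energy member of each is returned.

```python
from collections import defaultdict


def representatives(decoys, top_n=5):
    """Return the lowest-energy decoy name from each of the top_n largest clusters.

    `decoys` is an iterable of (name, cluster_id, energy) tuples (hypothetical format).
    """
    clusters = defaultdict(list)
    for name, cluster_id, energy in decoys:
        clusters[cluster_id].append((energy, name))
    # Largest clusters first; ties keep the order in which clusters first appeared.
    ranked = sorted(clusters.values(), key=len, reverse=True)[:top_n]
    # min() on (energy, name) tuples selects the lowest energy in each cluster.
    return [min(members)[1] for members in ranked]


# Invented example data: three clusters of sizes 3, 2 and 1.
decoys = [
    ("d1", 0, -120.5),
    ("d2", 0, -131.0),
    ("d3", 1, -140.2),
    ("d4", 0, -118.9),
    ("d5", 1, -125.0),
    ("d6", 2, -150.0),
]
picked = representatives(decoys, top_n=2)
```

Note that the singleton cluster containing `d6` has the lowest energy overall but is not picked here, because this sketch ranks clusters by size rather than by energy. Ranking by energy instead, as the -cluster:sort_groups_by_energy flag mentioned earlier in the thread does, would put it first.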
The geometric center will correspond to the lowest-energy decoy only if the funnel has a symmetric geometry and is well sampled. Both conditions are unlikely to be satisfied, so I doubt the geometric center of the cluster will often correspond to the lowest-energy decoy.