I did ab initio modeling with Rosetta and then performed clustering. Initially I generated 500 decoys, and then I generated 1000 with a different sequence. When I performed clustering, I got 4 clusters in both cases, even though I have not specified the number of clusters in my flags file. My question is whether this is the correct output.
Secondly, I don't understand the numbering of the pdb files. When I read the log file I see that it uses c.i.j but I don't know which is the correct mapping. This is because of the different numbering in the log file and especially since I am sorting by energy levels.
Also, do I take the model with the lowest energy level in the largest cluster as the most correct model?
I will appreciate some answers. Thanks.
Hi. First, 500 or 1000 decoys is too few for ab initio. You will want 20,000+ models for adequate sampling, and more for longer sequences. If you don't have access to a computing cluster, I would go with the Robetta server instead.
For clustering, you may want to use the calibur program: http://sourceforge.net/projects/calibur/
As for the correct model, it depends on your system and if you have any experimental data. Generally, you are correct in your thoughts. You can also take the 'center' or representative cluster member. This is output by calibur. It usually is one of the lowest energy models.
The residue numbering in the output PDB files should go 1 -> n.
I'm not entirely familiar with the Rosetta++ clustering application, but I believe the number-of-clusters setting is a maximum number of clusters, rather than an absolute number. That is, if all the structures are within the given cluster radius of a small number of cluster centers, the clustering algorithm in use won't break them up just to reach a given number of clusters. The range of structures in your small sample may be limited enough that 4 clusters cover all of them.
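To illustrate why a radius-based scheme can stop short of the maximum, here is a rough sketch of a greedy clustering loop. This is not the Rosetta implementation; the function name, the parameters, and the use of plain numbers with an absolute-difference "distance" (standing in for RMSD between decoys) are all assumptions made for the example.

```python
def cluster_by_radius(items, distance, radius, max_clusters):
    """Greedy radius clustering (illustrative only, not Rosetta's code).

    Repeatedly picks the item with the most neighbours within `radius`
    as a cluster center, removes that cluster, and continues until
    nothing is left or `max_clusters` clusters have been made.
    Returns (clusters, leftover), where clusters is a list of
    (center, members) tuples.
    """
    remaining = list(items)
    clusters = []
    while remaining and len(clusters) < max_clusters:
        # Center = the item with the largest neighbourhood.
        center = max(
            remaining,
            key=lambda c: sum(1 for x in remaining if distance(c, x) <= radius),
        )
        members = [x for x in remaining if distance(center, x) <= radius]
        clusters.append((center, members))
        remaining = [x for x in remaining if distance(center, x) > radius]
    return clusters, remaining


# Toy data: numbers stand in for structures, abs difference for RMSD.
toy = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
clusters, leftover = cluster_by_radius(toy, lambda a, b: abs(a - b), 0.5, 10)
print(len(clusters), "clusters,", len(leftover), "unassigned")
```

Even with a maximum of 10, the toy data only yields as many clusters as the radius requires, because every item is already covered. Lowering the maximum below that number would instead leave some items unassigned.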
The c.i.j notation means that the structure is structure number "j" of cluster number "i". Under common settings these are typically sorted by energy. (So "1" is the lowest energy.) The same notation should be used in the log file and the output PDBs. What are you getting such that you're unsure of how to match things up? (Could you provide examples of the two different notations?)
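If it helps with matching things up, here is a small sketch that splits names of this form into cluster and member numbers. It assumes the output files are literally named `c.<i>.<j>.pdb`; if your files are named differently, the pattern needs adjusting. The function names are made up for the example.

```python
import re

# Assumed filename pattern: c.<cluster>.<member>.pdb
CLUSTER_NAME = re.compile(r"^c\.(\d+)\.(\d+)\.pdb$")


def parse_cluster_name(filename):
    """Return (cluster, member) as integers, or None if the name doesn't match."""
    match = CLUSTER_NAME.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def group_by_cluster(filenames):
    """Map cluster number -> filenames sorted by member number."""
    clusters = {}
    for name in filenames:
        parsed = parse_cluster_name(name)
        if parsed is None:
            continue  # skip files that are not cluster output
        cluster, member = parsed
        clusters.setdefault(cluster, []).append((member, name))
    # Sort numerically so that member 10 comes after member 2.
    return {c: [n for _, n in sorted(members)] for c, members in clusters.items()}


print(group_by_cluster(["c.0.10.pdb", "c.0.2.pdb", "c.1.0.pdb", "notes.txt"]))
```

Sorting on the parsed integers avoids the usual trap where a plain alphabetical listing puts member 10 ahead of member 2.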
The lowest energy structure in the largest cluster is typically a good place to start for the model that's most likely to be correct. However, if you have experimental evidence or chemical intuition (you look at the structure and it doesn't look right), you may want to pick another structure or another cluster.
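As a starting point, that selection rule can be written down as follows. This is only a sketch: the dictionary layout, the function name, and the energy values are invented for illustration, and you would fill the dictionary from your own cluster assignments and score file.

```python
def pick_candidate(clusters):
    """Pick the lowest-energy member of the largest cluster.

    `clusters` maps a cluster id to a list of (model_name, energy) pairs.
    Returns (cluster_id, model_name, energy).
    """
    if not clusters:
        raise ValueError("no clusters given")
    # Largest cluster = the one with the most members.
    largest = max(clusters, key=lambda cid: len(clusters[cid]))
    # Within it, the member with the lowest (most negative) energy.
    name, energy = min(clusters[largest], key=lambda member: member[1])
    return largest, name, energy


# Made-up names and energies, purely for illustration.
example = {
    0: [("model_a", -101.2), ("model_b", -110.5), ("model_c", -98.7)],
    1: [("model_d", -115.0), ("model_e", -90.3)],
}
print(pick_candidate(example))
```

Note that in the made-up data the overall lowest-energy model sits in the smaller cluster, so the rule deliberately passes over it. That is the kind of case where it is worth looking at both structures before deciding.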