![]() |
Rosetta Protocols
2014.35
|
Public interface for doing affinity propagation clustering. More...
#include <APCluster.hh>
Public Member Functions | |
APCluster (core::Size total_pts, core::Size max_sims_per_pt=0) | |
Create new clustering class for total_pts input data points. Optionally, set a limit on the number of similarities stored per point. More... | |
virtual | ~APCluster () |
virtual void | set_sim (core::Size i, core::Size k, core::Real sim) |
How appropriate is k as an exemplar for i? More... | |
virtual bool | cluster (core::Size maxits, core::Size convits, core::Real lambda) |
Run the actual clustering algorithm. More... | |
virtual core::Size | num_pts () const |
virtual core::Size | get_exemplar_for (core::Size i) const |
Return the index of the point that is the exemplar for point i. More... | |
virtual core::Size | get_num_exemplars () const |
The number of exemplars selected (number of clusters). Monotonically related to the self-preferences s(k,k). More... | |
virtual void | get_all_exemplars (utility::vector1< core::Size > &exemplars) const |
Return the indices of data points chosen as exemplars (cluster centers). More... | |
virtual void | get_cluster_for (core::Size k, utility::vector1< core::Size > &cluster) const |
Returns the indices of points with the specified exemplar k. Note that k is the index of an (input) data point that was chosen as an exemplar, not some "cluster index" between 1 and get_num_exemplars(). More... | |
virtual core::Real | get_net_sim () const |
The sum of similarities s(i,k) between every data point i and its exemplar k, plus the self preferences of the exemplars. The algorithm should minimize this value – if it dips and climbs again, increase lambda. More... | |
virtual bool | save_binary (std::string const &filename) const |
Saves the (sparse) similarity matrix and current cluster assignments (if any), but not the accumulated evidence from the last clustering [ r(i,k) and a(i,k) ]. File format is custom binary and is not portable (host endian-ness). More... | |
virtual bool | load_binary (std::string const &filename) |
Wipes all currently held data and reads in similarity values and cluster assignments. Afterwards, points may be re-clustered with different parameters if desired. File format is custom binary and is not portable (host endian-ness). More... | |
Protected Member Functions | |
virtual void | freeze () |
virtual void | reinitialize () |
virtual void | update_r_ik (core::Real lambda) |
virtual void | update_a_ik (core::Real lambda) |
virtual core::Size | assign_exemplars () |
virtual void | save_best_exemplars () |
virtual void | restore_best_exemplars () |
Private Attributes | |
utility::vector1< DataPoint > | pts_ |
core::Size | max_sims_per_pt_ |
bool | is_frozen_ |
Public interface for doing affinity propagation clustering.
Based on Frey and Dueck, "Clustering by Passing Messages Between Data Points", Science 315 (2007). Useful for choosing a set of representative data points (exemplars) out of a large set (e.g. all decoys from a large Rosetta run) given a measure of similarity (e.g. RMSD, maxsub, GDT, ...).
As I understand it, this procedures tries to maximize the sum of similarities between each data point and its exemplar, while balancing that with total number of clusters. Reasonable measures of similarity would be negative RMSD, log-likelihoods, or squared distance (i.e. squared error), depending on what the points represent. Note there is no requirement for symmetry: s(i,j) need not equal s(j,i). The self-similarity s(k,k) ("preference") for each point controls the likelihood it will be selected as an exemplar, and thus indirectly controls the total number of clusters. There is no way to directly specify a specific number of clusters. The authors suggest that using the median of all other similarities will give a moderate number of clusters, and using the minimum of the other similaries will give a small number of clusters.
This implementation is designed for clustering very large numbers of data points with sparse similarity [ s(i,k) = -Inf for most i,k ]. Similarities for each input point are kept in a heap so that you can limit to only the L highest for each. (This scheme is quite likely to break symmetry, as some points will have more close neighbors than others.) Alternately, you may choose to do your own pre-filtering and only enter the G globally highest similarities between any points in the data set. Run time (per cycle) is linear in the number of similarities, or O(N^2) in the limit of a dense similarity matrix.
I follow the conventions of the original paper, where "i" is the index of some generic data point, and "k" is the index of a data point being considered as an exemplar (cluster center).
protocols::cluster::APCluster::APCluster | ( | core::Size | total_pts, |
core::Size | max_sims_per_pt = 0 |
||
) |
Create new clustering class for total_pts input data points. Optionally, set a limit on the number of similarities stored per point.
References max_sims_per_pt_, and pts_.
|
virtual |
|
protectedvirtual |
References protocols::cluster::DataPoint::a_kk, protocols::cluster::DataPoint::candidates, protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::DataPoint::i, protocols::cluster::Exemplar::k, N, pts_, protocols::cluster::DataPoint::r_kk, and protocols::cluster::Exemplar::s_ik.
Referenced by cluster().
|
virtual |
Run the actual clustering algorithm.
maxits | maximum number of iterations, period; 100 - 4000 reasonable. |
convits | terminate after clusters don't change for this many iterations, 10 - 400 reasonable. |
lambda | damping factor, 0.50 - 0.95 is a reasonable range |
References assign_exemplars(), freeze(), get_net_sim(), get_num_exemplars(), is_frozen_, reinitialize(), restore_best_exemplars(), runtime_assert, save_best_exemplars(), protocols::TR, update_a_ik(), and update_r_ik().
|
protectedvirtual |
References protocols::cluster::DataPoint::candidate_for, protocols::cluster::DataPoint::candidates, is_frozen_, protocols::cluster::Exemplar::k, N, and pts_.
Referenced by cluster().
|
virtual |
Return the indices of data points chosen as exemplars (cluster centers).
References protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::DataPoint::i, N, and pts_.
|
virtual |
Returns the indices of points with the specified exemplar k. Note that k is the index of an (input) data point that was chosen as an exemplar, not some "cluster index" between 1 and get_num_exemplars().
References protocols::cluster::DataPoint::curr_exemplar, N, and pts_.
|
inlinevirtual |
Return the index of the point that is the exemplar for point i.
References pts_.
|
virtual |
The sum of similarities s(i,k) between every data point i and its exemplar k, plus the self preferences of the exemplars. The algorithm should minimize this value – if it dips and climbs again, increase lambda.
References protocols::cluster::DataPoint::candidates, protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::Exemplar::k, N, pts_, protocols::cluster::Exemplar::s_ik, protocols::cluster::DataPoint::s_kk, and sum().
Referenced by cluster().
|
virtual |
The number of exemplars selected (number of clusters). Monotonically related to the self-preferences s(k,k).
References protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::DataPoint::i, N, and pts_.
Referenced by cluster().
|
virtual |
Wipes all currently held data and reads in similarity values and cluster assignments. Afterwards, points may be re-clustered with different parameters if desired. File format is custom binary and is not portable (host endian-ness).
References protocols::cluster::DataPoint::candidates, protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::DataPoint::i, is_frozen_, protocols::cluster::Exemplar::k, max_sims_per_pt_, N, pts_, protocols::cluster::read1(), protocols::cluster::Exemplar::s_ik, and protocols::cluster::DataPoint::s_kk.
|
inlinevirtual |
References pts_.
|
protectedvirtual |
Prepares the data points for another clustering run. Does not erase similarity data, etc.
References protocols::cluster::Exemplar::a_ik, protocols::cluster::DataPoint::a_kk, protocols::cluster::DataPoint::best_exemplar, protocols::cluster::DataPoint::candidates, protocols::cluster::DataPoint::curr_exemplar, N, pts_, protocols::cluster::Exemplar::r_ik, and protocols::cluster::DataPoint::r_kk.
Referenced by cluster().
|
protectedvirtual |
References protocols::cluster::DataPoint::best_exemplar, protocols::cluster::DataPoint::curr_exemplar, N, and pts_.
Referenced by cluster().
|
protectedvirtual |
References protocols::cluster::DataPoint::best_exemplar, protocols::cluster::DataPoint::curr_exemplar, N, and pts_.
Referenced by cluster().
|
virtual |
Saves the (sparse) similarity matrix and current cluster assignments (if any), but not the accumulated evidence from the last clustering [ r(i,k) and a(i,k) ]. File format is custom binary and is not portable (host endian-ness).
References protocols::cluster::DataPoint::candidates, protocols::cluster::DataPoint::curr_exemplar, protocols::cluster::DataPoint::i, protocols::cluster::Exemplar::k, max_sims_per_pt_, N, pts_, protocols::cluster::Exemplar::s_ik, protocols::cluster::DataPoint::s_kk, and protocols::cluster::write1().
|
virtual |
How appropriate is k as an exemplar for i?
Adding s(i,j) is not the same as adding s(j,i) – you must do both if you want symmetry.
There is currently no protection against adding s(i,k) twice, which will not be caught and will screw up the computation royally.
References is_frozen_, max_sims_per_pt_, and pts_.
|
protectedvirtual |
References protocols::cluster::Exemplar::a_ik, protocols::cluster::DataPoint::a_kk, protocols::cluster::DataPoint::candidate_for, protocols::cluster::DataPoint::candidates, protocols::cluster::Exemplar::k, N, pts_, protocols::cluster::Exemplar::r_ik, protocols::cluster::DataPoint::r_kk, runtime_assert, sum(), and protocols::cluster::DataPoint::sum.
Referenced by cluster().
|
protectedvirtual |
References protocols::cluster::Exemplar::a_ik, protocols::cluster::DataPoint::a_kk, protocols::cluster::DataPoint::candidates, protocols::cluster::Exemplar::k, N, pts_, protocols::cluster::Exemplar::r_ik, protocols::cluster::DataPoint::r_kk, protocols::cluster::Exemplar::s_ik, and protocols::cluster::DataPoint::s_kk.
Referenced by cluster().
|
private |
Referenced by cluster(), freeze(), load_binary(), and set_sim().
|
private |
Referenced by APCluster(), load_binary(), save_binary(), and set_sim().
|
private |
Referenced by APCluster(), assign_exemplars(), freeze(), get_all_exemplars(), get_cluster_for(), get_exemplar_for(), get_net_sim(), get_num_exemplars(), load_binary(), num_pts(), reinitialize(), restore_best_exemplars(), save_best_exemplars(), save_binary(), set_sim(), update_a_ik(), and update_r_ik().