Which clustering method is suited for symmetrical distance matrices?

You can use hierarchical clustering or nearest-neighbor clustering. Both methods are implemented in InfoStat (http://www.infostat.com.ar/index.php?mod=page&id=46), Statgraphics, XLStat and R. The most powerful software is R, but my favorite choice is hierarchical clustering with an agglomerative algorithm in InfoStat, which is very intuitive to use.

The most common approach to cluster analysis is to compute Euclidean distances and use an agglomerative algorithm. You have to choose among the linkage options and take a look at your results. Please read this link: http://en.wikipedia.org/wiki/Hierarchical_clustering
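As an illustration of the workflow described above, here is a minimal sketch in Python with SciPy (not one of the packages named in this answer, so treat the tool choice as an assumption). The toy points, the "average" linkage and the cut into two clusters are all made up for the example; with real data you would start from your own symmetric distance matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Toy data (invented for illustration): two well-separated groups of 2-D points.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Full symmetric Euclidean distance matrix with a zero diagonal.
dist_matrix = squareform(pdist(points, metric="euclidean"))

# linkage() expects the condensed (upper-triangle) form of the matrix;
# checks=True verifies that the input is symmetric with a zero diagonal.
condensed = squareform(dist_matrix, checks=True)

# Agglomerative clustering. "average" is only one of the linkage options;
# "single", "complete" and others can be tried the same way.
tree = linkage(condensed, method="average")

# Cut the tree into at most two flat clusters and inspect the labels.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

If you already have a symmetric distance matrix, skip the `pdist` step and pass your matrix to `squareform` directly.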

If you decide to use the English version of InfoStat, the software comes with many example data sets and a practical guide on how to do cluster analysis. I hope the answer helps you. Best regards,

Luís Enrique.

Anupam Singh

I appreciate your valuable comments and suggestions.

So there are different types of clustering. Why should one prefer one over the other? In simple words, why hierarchical clustering and not spectral clustering or any other method?

Anupam Singh

Dear Johannes,

Is there any specific approach to such decisions that avoids trial and error?

Johannes Elferich

I guess it depends on your question. In my case, I was clustering results of a docking experiment. So in order to decide which clustering method to use I performed test dockings, where I knew the correct answer, and tested which algorithm would reliably put all correct answers into one cluster.
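The test described above can be sketched roughly as follows. This is not the actual docking analysis: the helper name `correct_in_one_cluster`, the toy coordinates, the three linkage methods and the number of clusters are all assumptions made for illustration, using Python with SciPy.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

def correct_in_one_cluster(dist_matrix, correct_idx, method, n_clusters):
    """Return True if every known-correct item receives the same cluster label.

    Hypothetical helper: dist_matrix is a symmetric distance matrix and
    correct_idx holds the row indices of the items known to be correct.
    """
    tree = linkage(squareform(dist_matrix, checks=False), method=method)
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return len(set(labels[correct_idx])) == 1

# Toy stand-in for a test set: four tightly grouped "correct" items
# followed by seven scattered "decoys" (coordinates are invented).
correct = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                    [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]])
decoys = np.array([[5.0, 0.0, 0.0], [0.0, 6.0, 0.0], [0.0, 0.0, 7.0],
                   [5.0, 6.0, 0.0], [5.0, 0.0, 7.0], [0.0, 6.0, 7.0],
                   [5.0, 6.0, 7.0]])
dist_matrix = squareform(pdist(np.vstack([correct, decoys])))
correct_idx = np.arange(len(correct))

# Check each candidate linkage method against the known answer.
results = {method: correct_in_one_cluster(dist_matrix, correct_idx,
                                          method, n_clusters=3)
           for method in ("single", "complete", "average")}
print(results)
```

With real data, the distance matrix would hold pairwise RMSD values and the check would be repeated over several test cases to see which method is reliable.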

Hierarchical clustering offers itself to biologists because it produces a tree, which has a straightforward analogy to the evolutionary process that generated these structures.

Luis E. Paternina

Hi Anupam,

There are many clustering methods available for this work. As far as I have read, there is no specific method that always gives the best results. That is why we need to run several cluster analyses with different methods until we get the best results according to our own criteria (the clusters must make sense to you). Because cluster analysis is an exploratory method and not a statistical test per se, there is no unique way to obtain the best clustering.
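Running several analyses side by side can look like the sketch below, written in Python with SciPy. The cophenetic correlation used to summarize each run is an assumption added here for illustration; it is only one possible numeric aid and does not replace judging whether the clusters make sense. The toy points are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Toy data (invented): three loose groups of 2-D points.
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [4.0, 4.0], [4.2, 4.1], [4.1, 4.3],
                   [9.0, 0.0], [9.2, 0.1]])
condensed = pdist(points)  # condensed Euclidean distances

# Build one tree per linkage method and record how well the tree's
# cophenetic distances correlate with the original distances.
scores = {}
for method in ("single", "complete", "average"):
    tree = linkage(condensed, method=method)
    corr, _ = cophenet(tree, condensed)
    scores[method] = corr

for method, corr in scores.items():
    print(f"{method:>8}: {corr:.3f}")
```

A higher value means the tree distorts the original distances less, but the final choice still depends on your own criteria.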

"Clustering algorithms can be categorized based on their cluster model, as listed above. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized. An overview of algorithms explained in Wikipedia can be found in the list of statistics algorithms. There is no objectively "correct" clustering algorithm, but as it was noted, "clustering is in the eye of the beholder."[4] The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. It should be noted that an algorithm that is designed for one kind of model has no chance on a data set that contains a radically different kind of model.[4] For example, k-means cannot find non-convex clusters" (http://en.wikipedia.org/wiki/Cluster_analysis)

In my case, I run cluster analysis on bioclimatic variables, and I got the best results with the hierarchical method and an agglomerative algorithm, because I'm looking for bioclimatic similarities across a wide geographical region. In your case, you need to explore your data using different methods. What kind of data do you want to analyze?

Luís Enrique.

Anupam Singh

Johannes

I am doing protein-protein docking to predict the interface. In my case I don't have the correct answer, so I think I should validate with some known interfaces. Should I follow this approach, or is there a better solution for this problem?

Anupam Singh

Hi Luis

The data are RMSD values for the same protein-protein complex in different orientations. The size of the distance matrix can vary from 2,000 x 2,000 to 10,000 x 10,000. Would the Akaike information criterion be of any help in such a case for finding the best statistical model?

Somak Ray

Loss of Solvent Accessible Surface Area (SASA) upon complexation is a criterion for residues on the surface. There are other distance-based criteria too. Do these work for your protein complex?

Anupam Singh

Hi Somak,

Can you be more specific? If you are asking about the distance matrix I have created, it is just the RMSD of one protein-protein complex relative to another. I am trying to find the best structural pose for the interaction model. I am using ZDOCK to predict these complexes.

http://zdock.umassmed.edu/references.html
