I would like to clarify one point that often seems unclear in questions like this: PCA is _not_ a clustering method, and it is not necessarily a method developed primarily for dimension reduction. For a nice read on what PCA actually is, see this post by Lior Pachter: http://liorpachter.wordpress.com/2014/05/26/what-is-principal-component-analysis/.

PCA is often used to reduce dimensionality, but it may not be the best method for that, as it is linear and parametric. Recent developments in _non_linear and _non_parametric methods should not be ignored, as they can yield strongly improved results. Here, in the scope of cluster analysis, I would like to highlight t-SNE (http://homepage.tudelft.nl/19j49/t-SNE.html). To make this explicit: t-SNE is _not_ a clustering technique either. It can be used to embed high-dimensional data into low dimensions, e.g., 2D, for human-intuitive visualization. Based on such a visualization, it is much easier to assess whether there is any data-inherent structure that a clustering algorithm can actually exploit. It can also be used to estimate the number of expected clusters, which is a challenge of its own if this information is not known in advance.
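For concreteness, here is a minimal sketch of such a visual inspection in Python with scikit-learn (my choice of implementation; the t-SNE page above links to others), using a built-in dataset as a stand-in for your data:

```python
# Minimal sketch: embed high-dimensional data into 2D with t-SNE and
# inspect the result visually. The digits dataset is just a placeholder.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional example data

# Embed into 2D; perplexity is the main knob and typically needs tuning.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Well-separated blobs suggest structure a clustering algorithm could
# exploit, and give a rough estimate of the number of clusters.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="tab10")
plt.show()
```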
Also, dimension reduction is often used only because the curse of dimensionality makes relative distances nearly meaningless in high dimensions (http://en.wikipedia.org/wiki/Curse_of_dimensionality#Distance_functions). Accordingly, any distance-based clustering technique may fail or produce largely suboptimal results if the dimensionality is "too high". However, some clustering techniques are supposedly invariant to the dimensionality of the data and thus do not require a prior dimension reduction (see http://en.wikipedia.org/wiki/Clustering_high-dimensional_data).
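You can see this distance concentration effect with a small numeric experiment (a sketch, assuming NumPy):

```python
# As dimensionality grows, the nearest and farthest neighbors of a point
# become nearly equidistant, which hurts distance-based clustering.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                      # 500 random points in [0, 1]^d
    dists = np.linalg.norm(X - X[0], axis=1)[1:]  # distances to one point
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
# The relative contrast shrinks toward 0 as d increases.
```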
Regarding SOMs, it should be noted that they are mainly used/developed for the discovery of low-dimensional manifolds in high-dimensional data; they are also _not_ a clustering method. However, your data may not actually lie on a low-dimensional manifold embedded in a high-dimensional space, in which case using a SOM for dimension reduction may not be the best choice either.
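If you do want to try a SOM for comparison, here is a sketch assuming the third-party `minisom` package (my choice; other SOM implementations exist) and placeholder data:

```python
# Fit a small SOM and use each sample's best-matching unit on the grid
# as a 2D representation, analogous to the t-SNE plot above.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
X = rng.random((500, 64))  # hypothetical 64-dimensional data

som = MiniSom(10, 10, 64, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)  # 1000 training iterations

# Grid coordinates of the best-matching unit for each sample.
grid_coords = np.array([som.winner(x) for x in X])
```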
Accordingly, I would suggest giving t-SNE a try and comparing it to the results obtained by using PCA or a SOM for the dimension reduction.
Furthermore, you have not specified the dimensionality of your data or how exactly you plan to do the clustering. As you may already know, the choice of clustering model heavily affects your results (e.g., k-means vs. Gaussian mixtures).
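To illustrate that last point, here is a sketch (scikit-learn assumed) contrasting k-means with a Gaussian mixture on the same data; the two can disagree substantially when clusters are elongated or have different variances:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic clusters with very different spreads, where the two models
# tend to produce different partitions.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=[0.5, 1.5, 3.0],
                  random_state=0)

labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_gm = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Agreement between the two partitions (1.0 = identical clusterings).
print(adjusted_rand_score(labels_km, labels_gm))
```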
Exploratory Factor Analysis (EFA) is a variable reduction technique that identifies the number of latent constructs and the underlying factor structure of a set of variables.
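If you want to try this, a minimal sketch using scikit-learn's `FactorAnalysis` (my choice of implementation; it fits an unrotated maximum-likelihood factor model, while dedicated EFA tools also offer factor rotations):

```python
# The loadings show how strongly each observed variable relates to each
# latent factor; the iris data is just a placeholder.
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(fa.components_)        # factor loadings (2 factors x 4 variables)
X_reduced = fa.transform(X)  # samples expressed in the latent factors
```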
PCA would be a much more reasonable choice than a SOM for most problems, due to the high computational cost of training the SOM. Random projections (RP) are another alternative related to PCA. Instead of computing the eigenvectors as in PCA, RP uses randomly generated vectors. Since the vectors are random, you save the time spent finding them, which may or may not matter depending on your data. While the vectors are not as "accurate" as the principal components, they are good enough for clustering and other tasks (the Johnson-Lindenstrauss lemma guarantees that pairwise distances are approximately preserved).
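A sketch of random projection before k-means, assuming scikit-learn, compared against PCA with the same number of components:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

X, _ = load_digits(return_X_y=True)  # placeholder 64-dimensional data

# Reduce to 15 dimensions with a random projection and with PCA.
X_rp = GaussianRandomProjection(n_components=15, random_state=0).fit_transform(X)
X_pca = PCA(n_components=15, random_state=0).fit_transform(X)

# Cluster both reduced representations and compare the partitions.
labels_rp = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_rp)
labels_pca = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_pca)
```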
A few papers which address RP for clustering if you want to have a look:
Fern, Xiaoli Zhang, and Carla E. Brodley. "Random projection for high dimensional data clustering: A cluster ensemble approach." ICML. Vol. 3. 2003.
Boutsidis, Christos, Anastasios Zouzias, and Petros Drineas. "Random Projections for k-means Clustering." Advances in Neural Information Processing Systems. 2010.
Cardoso, Ângelo, and Andreas Wichert. "Iterative random projections for high-dimensional data clustering." Pattern Recognition Letters 33.13 (2012): 1749-1755. (MATLAB implementation: http://web.ist.utl.pt/~angelo.cardoso/irpkmeans.zip)