I am doing k-means cluster analysis for a set of data using SPSS. There is an option to specify the number of clusters to be extracted. I believe there should be a scientific criterion to decide which number of clusters is right.
I worked on "Isotropic Dynamic Hierarchical Clustering." My assumption was that the clustering should determine the number of clusters and the levels of the hierarchy automatically (much like a B-tree).
To determine the best cluster number for k-means classification, cluster validity indices such as the Silhouette index, the Davies-Bouldin (DB) index, the Xie-Beni index, SSW (sum of squares within) and the partition coefficient can be used.
Each index has its own criterion (either the minimum or the maximum of the index indicates the best cluster number).
These indices measure the compactness and separation of the clusters via intra-cluster and inter-cluster distances between data points, and their optimum over a range of candidate k values indicates the best cluster number to use for classification.
It's always good to use two or three cluster validity measures and compare the cluster numbers they suggest before further analysis.
All the above-mentioned measures are available as R functions.
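To make the idea concrete, here is a minimal sketch of one of the indices above, the average silhouette width, written in Python/NumPy rather than R purely for illustration (the two-blob data and both labelings are made up): a labeling with the right number of clusters should score higher than one that over-splits.

```python
import numpy as np

def silhouette(X, labels):
    """Average silhouette width: for each point, a = mean distance to its
    own cluster, b = mean distance to the nearest other cluster, and the
    point's score is (b - a) / max(a, b). Higher is better."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    scores = []
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():            # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean()
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated blobs: a k=2 labeling should beat a k=3 labeling
# that artificially splits the first blob in half.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels_k2 = np.array([0] * 20 + [1] * 20)
labels_k3 = np.array([0] * 10 + [2] * 10 + [1] * 20)
print(silhouette(X, labels_k2) > silhouette(X, labels_k3))  # True
```

In practice one would run k-means for each candidate k and pick the k whose labeling maximizes the index; the comparison of two fixed labelings here just shows the mechanics.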
I appreciate the responses given to Ali, as I also use R! But I think they miss the point: Ali asks about a cut-off criterion to choose the number of clusters in a data set according to certain parameters/variables, and the coefficients you suggest are NOT available in SPSS. Maybe Ali should start using R... But if he wants to continue using SPSS, he should know that there is - I know SPSS well and have computed many cluster analyses; if I'm wrong please tell me - no measure/value computed by SPSS to decide on the number of clusters. There are only some rule-of-thumb procedures. Here is how I proceed:
1. Observe the data with descriptive statistics - a scatterplot matrix is of great help for forming a hypothesis about an approximate range of minimum and maximum cluster numbers - and, when there are not too many variables, multidimensional scaling may also be instructive.
2. Run a hierarchical cluster analysis using an agglomerative algorithm (mostly Ward's, but there are several methods and the one you use depends on your research question/purpose). I ask SPSS to provide the dendrogram, on which I apply the best-cut criterion (first longest fusion distance); in the agglomeration schedule I likewise look for the first largest jump in distance.
3a. Using another agglomerative method is sometimes useful, but you should know that a perfect match between solutions is difficult to reach when you use different algorithms. To examine how well two clusterings fit, I cross-tabulate the two solutions and use a Chi² test (a significant p indicates the solutions agree).
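The best-cut rule in step 2 can be sketched outside SPSS as well. With n cases the agglomeration schedule lists n-1 merge distances in increasing order; cutting the dendrogram just before the first largest jump between consecutive distances yields the suggested number of clusters. A minimal Python sketch, with a made-up schedule for illustration:

```python
def best_cut(merge_distances):
    """Return the cluster count obtained by cutting an agglomeration
    schedule just before the largest jump between consecutive merge
    distances. With n cases there are n-1 merges; after i merges,
    n - i clusters remain."""
    n = len(merge_distances) + 1                      # number of original cases
    gaps = [b - a for a, b in zip(merge_distances, merge_distances[1:])]
    i = max(range(len(gaps)), key=gaps.__getitem__)   # index of largest jump
    return n - (i + 1)                                # clusters left before that merge

# Hypothetical schedule for 7 cases: small within-cluster fusions,
# then a big jump from 1.1 to 6.0 -> cut there, leaving 3 clusters.
schedule = [0.5, 0.7, 0.9, 1.1, 6.0, 7.5]
print(best_cut(schedule))  # -> 3
```

The same reading-off can of course be done by eye from the SPSS agglomeration schedule; the code only formalizes "first longest distance."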
The most important "internal clustering criteria" - used to compare clustering results and to choose the best number of clusters - are available in SPSS too. Google "Kirill's spss macros page" and download the "Internal clustering criteria" collection. It is also possible to run the macros from menu dialogs (see the KO_macros.spe extension).
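As an example of the kind of internal criterion such collections typically compute (whether this particular one is in that macro set is an assumption), here is a sketch of the Calinski-Harabasz index in Python/NumPy: the ratio of between-cluster to within-cluster dispersion, scaled by degrees of freedom, where higher values indicate a better partition. The data and labelings below are made up for illustration.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index: (SSB / (k-1)) / (SSW / (n-k)),
    where SSB is between-cluster and SSW within-cluster dispersion.
    Higher is better."""
    n = len(X)
    clusters = sorted(set(labels))
    k = len(clusters)
    overall = X.mean(axis=0)
    ssb = sum((labels == c).sum()
              * np.sum((X[labels == c].mean(axis=0) - overall) ** 2)
              for c in clusters)
    ssw = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
              for c in clusters)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Two tight, well-separated blobs: the correct 2-cluster labeling should
# score far higher than an arbitrary labeling ignoring the structure.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(4, 0.2, (15, 2))])
good = np.array([0] * 15 + [1] * 15)
bad = np.array([0, 1] * 15)
print(calinski_harabasz(X, good) > calinski_harabasz(X, bad))  # True
```

Computed over a range of candidate cluster numbers, the k that maximizes this index is the one such criteria would recommend.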