Dear all,
My question is the following:
I have a large dataset: 100,000 observations with 20 numerical and 2 categorical variables (i.e. mixed-type data).
I need to cluster these observations based on the 22 variables, and I have no a priori idea of how many clusters/groups to expect.
Because the dataset is large, I use the clara() function in R (which is based on pam()).
Because of the large number of observations, there is no way to compute and compare full distance matrices (R does not allow such calculations, and this is not a RAM problem). Therefore the common way of selecting the number of clusters, using treeClust() and pamk() and comparing silhouette widths, does not work.
My main question is: can I use quantities such as the total, within-cluster and between-cluster sums of squares (SS) to get an idea of the best-performing tree (in terms of number of clusters)? Do you have any other ideas for how I can select the right number of clusters?
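
To make the question concrete, here is a minimal sketch of one sampling-based comparison that avoids the full distance matrix. It assumes the cluster package and uses the average silhouette width that clara() reports for its best subsample (the silinfo component of the fitted object). The data x, the candidate range ks, and the samples/sampsize values are hypothetical placeholders. The stand-in data is purely numeric because clara() expects numeric input, so the two categorical variables would first need some encoding, which is not shown here.

```r
library(cluster)

set.seed(1)
# Hypothetical stand-in data: two well-separated numeric groups, 4 variables.
# The real data would be the 100,000 x 22 table after encoding the factors.
x <- rbind(
  matrix(rnorm(2000, mean = 0), ncol = 4),
  matrix(rnorm(2000, mean = 4), ncol = 4)
)

# Candidate numbers of clusters to compare (assumed range).
ks <- 2:6

# For each k, fit clara() on subsamples and keep the average silhouette
# width of the best subsample; no n x n distance matrix is ever built.
avg_sil <- sapply(ks, function(k) {
  fit <- clara(x, k, samples = 20, sampsize = 200)
  fit$silinfo$avg.width
})

# The k with the largest average silhouette width is one candidate choice.
best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_sil = avg_sil))
```

The silhouette here is computed only on the best subsample of size sampsize, so it approximates the full-data silhouette and may vary with samples and sampsize.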
Best regards
Alessandro