I'm looking for a method for unsupervised classification of big data with an unknown number of clusters. Can you suggest a robust method? Is there any Matlab toolbox dedicated to this purpose?
Thanks, dear Majid, for your response ... mixture models necessitate specifying the number of components ... so, how can I determine the number of components that best represents the data distribution? Is there any MATLAB function or external toolbox for this purpose?
The approach of AutoClass, which automatically finds the natural classes, is pretty cool. You might want to look at http://ti.arc.nasa.gov/tech/rse/synthesis-projects-applications/autoclass/ and the research papers based upon AutoClass.
Good evening, dear colleagues ... thank you all for your interesting responses ... a friend suggested the PG-means and XPG-means methods to me. Do you have any idea about them? Is there any implementation of them available on the internet?
You can do clustering and use the Mean Split Silhouette (MSS) as a measure of cluster heterogeneity. You can also use it to estimate the number of significant clusters: choose the number of clusters that minimizes the MSS and therefore produces the most homogeneous clusters in the data.
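The MSS itself is not shipped with common toolboxes, but its close relative, the mean silhouette score, is. A minimal sketch in Python with scikit-learn (in MATLAB, `evalclusters` with the `'silhouette'` criterion plays the same role); note that the mean silhouette is *maximized*, whereas the MSS is minimized:

```python
# Pick the number of clusters k by maximizing the mean silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated clusters (illustration only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
```

On this toy data the score peaks at the true number of clusters; on real data the curve is flatter, so it pays to inspect the whole score-vs-k profile rather than just its maximum.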
Regardless of the generic learning method adopted for the given classification task, Big Data (where by Big Data I understand a scale comparable with data from social networks like Facebook, Twitter, and LinkedIn; data from web blogs, comments, and personal documents; data from public image repositories like Instagram, Flickr, and Picasa, and from movie repositories like YouTube; data from internet searches or from large prime-number searches; etc.) requires some specific adaptations, such as negotiating a good balance between online learning, partial learning, and parallel/distributed learning. The result of this negotiation should, of course, be compatible with the manner in which you choose to express and test the levels of intra-class similarity and inter-class dissimilarity, which, on the other hand, are very much data-specific. These are the critical aspects when designing classification algorithms for Big Data. As for ready-to-run algorithms for a specific problem, I'm not so optimistic. A nice inventory of Big Data techniques is here: http://www.mapr.com/blog/big-data-zz-%E2%80%93-glossary-my-favorite-data-science-things#.UzAwkaiSwsA
If you are looking for some well-founded probabilistic math, check out non-parametric Bayesian methods. Distributions such as the Dirichlet Process or Pitman-Yor process allocate some probability to unseen classes.
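As a concrete illustration of the nonparametric Bayesian idea: scikit-learn's `BayesianGaussianMixture` supports a (truncated, variational) Dirichlet-process prior, so you give it an upper bound on the number of components and the prior shrinks the weights of unneeded ones toward zero. A sketch, with synthetic data standing in for your own:

```python
# Dirichlet-process Gaussian mixture: infer the effective number of
# components instead of fixing it in advance.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# Toy data with 3 clusters; the model is only told "at most 10".
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.5, random_state=42)

bgm = BayesianGaussianMixture(
    n_components=10,  # truncation level, i.e. an upper bound
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=42,
).fit(X)

labels = bgm.predict(X)
n_effective = len(np.unique(labels))  # components actually used by the data
```

The `weight_concentration_prior` (the DP concentration parameter) controls how eagerly new components are created; smaller values favor fewer clusters.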
If you want to know the number of clusters in your data, you can run a clustering algorithm on it. But some clustering algorithms, like k-means, need the number of clusters (k) as a parameter, and you must find the best k by computing an internal validity measure, like the SSE.
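The usual way to use the SSE is the "elbow" heuristic: run k-means for a range of k, record the SSE (k-means inertia), and look for the k where the improvement levels off. A short sketch with scikit-learn (MATLAB's `kmeans` returns the same within-cluster sums in its `sumd` output):

```python
# Elbow heuristic: SSE (inertia) versus k for k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=1)

sse = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    sse.append(km.inertia_)  # sum of squared distances to closest centroid

# SSE always decreases as k grows; the "elbow" is where the
# per-step improvement drops off sharply.
drops = [sse[i] - sse[i + 1] for i in range(len(sse) - 1)]
```

Because the SSE decreases monotonically in k, you cannot simply minimize it; you inspect `drops` (or plot `sse`) and pick the k just after the largest fall-off.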
If you want to find the number of clusters by starting from a deliberately large number of clusters, you can read this paper: https://www.researchgate.net/publication/221908653_Learning_the_Number_of_Clusters_in_Self_Organizing_Map?ev=prf_pub