I want to learn about the various clustering methods and the criteria used to evaluate them. Basically, I want to study several algorithms and use existing evaluation methods to compare their performance on various datasets.
I am new to data mining. Basically, I want to study clustering algorithms and use existing evaluation methods to compare their performance on various datasets.
I would recommend you start with the following paper:
Jain, Anil K. "Data clustering: 50 years beyond K-means." Pattern Recognition Letters 31.8 (2010): 651-666.
You will find the PDF link on Google Scholar. Clustering is also a well-documented topic, so I would recommend the following book before you read the papers.
This is a fast density-based clustering technique using recursive density estimation. The easiest way to describe it is that it is similar in approach to subtractive clustering, but has adaptive cluster radii and is far faster. It is on a par with k-means for speed, but without the need to 'know the answer' first by providing the number of clusters.
A development of the technique above that requires no user input whatsoever!
It uses the density measure along each axis to estimate the initial radii for DDC. The adaptive radii in DDC allow for the estimate to be very approximate and still provide the same clusters.
Both of the above work best with data that forms hyper-ellipsoidal clusters and with larger datasets. I am currently completing an extension for both that finds arbitrarily shaped clusters.
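To make the point about not having to supply the number of clusters concrete, here is a minimal sketch. It does not implement the technique described above; it uses scikit-learn's DBSCAN as a stand-in density-based method, and the synthetic data, `eps`, and `min_samples` values are assumptions chosen only for illustration. Note that DBSCAN still needs a fixed radius (`eps`), so it says nothing about the adaptive radii mentioned above.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): three compact,
# well-separated blobs in two dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 10), (-10, 10)],
    cluster_std=0.5,
    random_state=0,
)

# k-means must be told how many clusters to find. If we guess 2,
# it returns exactly 2, whatever the data looks like.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
n_kmeans = len(set(kmeans_labels))

# DBSCAN (stand-in density-based method) is given a neighbourhood
# radius instead of a cluster count. Label -1 marks noise points.
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)
n_dbscan = len(set(dbscan_labels) - {-1})

print("clusters reported by k-means with k=2:", n_kmeans)
print("clusters found by DBSCAN:", n_dbscan)
```

On data like this the density-based method recovers the three groups without being told the count, while k-means returns whatever count it was given.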
I guess I will get on the "ask to be included" bandwagon.
I would love to see some extensions to my work on extending K-means if you find it an interesting algorithm. If you have any questions about it I would be happy to provide more information.
Article: "K-Models Clustering, a Generalization of K-Means Clustering"
You may find the clustering algorithms in the Python scikit-learn library useful for experimentation. The documentation provides quite a lot of useful information on the relative strengths of the various approaches, together with details of suitable performance metrics.
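As a starting point for the kind of experiment asked about in the question, here is a minimal sketch that runs three scikit-learn clustering algorithms on one synthetic dataset and scores each with two common metrics. The dataset, the choice of algorithms, and all parameter values are assumptions for illustration. The adjusted Rand index needs ground-truth labels, whereas the silhouette score does not, so only the latter is usable on unlabeled data.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic labeled data (an assumption): three well-separated blobs.
X, y_true = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 10), (-10, 10)],
    cluster_std=0.5,
    random_state=0,
)

# Algorithms to compare; parameter values are illustrative guesses.
models = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=1.5, min_samples=5),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    results[name] = {
        # External metric: agreement with the known labels.
        "ari": adjusted_rand_score(y_true, labels),
        # Internal metric: cohesion vs. separation, no labels needed.
        "silhouette": silhouette_score(X, labels),
    }
    print(f"{name}: ARI={results[name]['ari']:.3f}, "
          f"silhouette={results[name]['silhouette']:.3f}")
```

To compare on other datasets, swap `make_blobs` for another generator such as `make_moons`, or load real data, and rerun the same loop. Results on easy synthetic blobs like these will not separate the algorithms much, so harder shapes are more informative.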