I have a dataset for k-means clustering. I applied the same algorithm to the same dataset using two different tools (Weka and RapidMiner) and got different clusters. Which one should I use? Your suggestions are welcome.
It sounds like an initialization issue. K-means generally needs some initial cluster assignment or set of cluster centers to start with. The two differing results are therefore most likely two local minima of the objective that k-means optimizes (the sum of squared distances of points to their cluster means). If the source code of the two implementations is available, you might want to check this more thoroughly.
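For instance, with scikit-learn you can observe this directly: two single-start runs from different random seeds can converge to different local minima with different within-cluster sums of squares (a quick sketch on toy data; substitute your own array for X):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

for seed in (1, 2):
    km = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)  # inertia_ = sum of squared distances to centroids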
You should prefer the one that suits you and your data best. Unfortunately, there is usually no a priori preference to be made.
The result of a k-means clustering run can depend on the choice of the initial centroids (the initialization).
Therefore, it is good practice to run k-means with several different initializations, drawing the initial centroids either purely at random or from a subset of the training data.
In both cases, the same SEED for the random generator MUST yield the same result.
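In scikit-learn, for example, the seed is exposed as the random_state parameter; here is a minimal check (on stand-in data generated with make_blobs) that a fixed seed makes runs reproducible:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)  # stand-in for your data
a = KMeans(n_clusters=4, random_state=42).fit_predict(X)
b = KMeans(n_clusters=4, random_state=42).fit_predict(X)
assert np.array_equal(a, b)  # same seed, same data: identical labels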
When you start k-means with different initializations, how many different results do you get? And, if you have enough time, can you find out which of these solutions occurs more often?
E.g., if only two different clusterings occur, one of them in 80% of the runs and the other in only 20%, you have an idea of the relative size of the two local minima's basins of attraction.
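A small sketch of that experiment with scikit-learn (the data and the canonical() helper are my own illustrations; canonical() just renames cluster labels by order of first appearance so that permuted but otherwise identical partitions compare as equal):

from collections import Counter
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.0, random_state=0)

def canonical(labels):
    # relabel clusters by order of first appearance
    _, first = np.unique(labels, return_index=True)
    mapping = {labels[i]: rank for rank, i in enumerate(sorted(first))}
    return tuple(mapping[l] for l in labels)

counts = Counter(
    canonical(KMeans(n_clusters=5, n_init=1, random_state=s).fit_predict(X))
    for s in range(100))
print(sorted(counts.values(), reverse=True))  # e.g. [80, 20]: two local minima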
Since I am not familiar with the tools (Weka and RapidMiner), I do not know whether you can set the initialization explicitly.
If both tools get the same initialization and the same data, the results MUST be identical.
If not, at least one of the results is questionable, and you would have to compare it to an implementation where you have access to all parameter settings (including the initialization!).
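In scikit-learn, for example, the initialization can be set explicitly by passing an array of starting centroids, which removes the randomness entirely; any two implementations fed the same data and the same starting centroids should then agree (a sketch on stand-in data):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=0)
init_centers = X[:5]  # any fixed choice of initial centroids

# n_init=1 because the initialization is fully specified
km = KMeans(n_clusters=5, init=init_centers, n_init=1).fit(X)
print(km.cluster_centers_)  # reproducible across runs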
As other people have said, the initialization is the issue. K-means++ can help, but it depends on the data.
In general, you might want to apply the following approach:
1. run the clustering algorithm multiple times and produce a sequence of possible clustering assignments
2. evaluate each final clustering configuration based on some clustering quality index. The Silhouette index is commonly used, though there are more: the Rand index, Isolation index, Dunn index, Davies-Bouldin index, etc. (as well as the mean squared error, and possibly entropy if the data is labeled)
3. select the clustering configuration that has the best quality according to the index of your choice (a sketch follows below)
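A minimal sketch of steps 1-3 with scikit-learn, using the silhouette index on stand-in data (swap in whichever quality index you prefer):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

# 1. produce several candidate clusterings from different initializations
runs = [KMeans(n_clusters=5, n_init=1, random_state=s).fit(X) for s in range(20)]
# 2. score each final configuration (silhouette: higher is better)
scores = [silhouette_score(X, km.labels_) for km in runs]
# 3. keep the best-scoring configuration
best = runs[scores.index(max(scores))]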
As almost everyone has pointed out, it is an initialization issue, mostly due to the different random seeds used to select the initial centers. Since K-Means attempts to minimize the sum of squared distances of each point to its assigned centroid (the minimum sum-of-squares clustering criterion, MSSC), if that happens to be your clustering criterion, then you should choose the clustering that produced the minimal SSE score.
If your k (the number of clusters in your problem) is relatively large (>= 50) and your dataset is high-dimensional (>= 10 dimensions), then it is extremely likely that any version of K-Means you used (even after many repetitions) is very far from the true optimal SSE score. In such a case, you should combine the results of all your clusterings (from Weka, RapidMiner, or any other tool) using the method described in the following paper: Christou, I.T. (2011) "Coordination of Cluster Ensembles via Exact Methods", IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2), 279-293.
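If minimal SSE is indeed your criterion, note that scikit-learn's KMeans already performs this repetition internally through its n_init parameter, keeping the run with the lowest inertia (a sketch on stand-in data):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

# 50 random initializations; only the run with the lowest SSE is kept
km = KMeans(n_clusters=5, n_init=50).fit(X)
print(km.inertia_)  # the minimal SSE found across the 50 restarts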
If you don't really know the value of k (i.e., how many clusters your dataset really has), then the X-Means clustering method of Pelleg is an excellent tool (there is a free open-source implementation of X-Means as well; google it).
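X-Means chooses k by scoring candidate splits with the BIC. I am not aware of an X-Means implementation in scikit-learn itself, so purely as an illustrative stand-in for that idea (not Pelleg's algorithm), here is a BIC sweep over k with a Gaussian mixture, under the assumption of roughly Gaussian clusters:

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# fit a mixture for each candidate k and keep the one with the lowest BIC
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 11)}
best_k = min(bics, key=bics.get)
print(best_k, bics[best_k])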
Hello, with some datasets, when I run k-means I get different labels assigned and different numbers of clusters, and I can't understand why this is happening.
The algorithm I use to identify the best number of clusters is the following:
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import davies_bouldin_score
from sklearn.metrics import silhouette_score
import numpy as np

def kmeans(data, clusters=15):
    """Separates the data into clusters, computes the DB index for each
    number of clusters from 2 to `clusters`, and finally returns the
    clustering with the best DB score.
    input:
        data (dataframe)
        clusters (integer)
    output:
        Labels and cluster centers according to the best DB index and Silhouette score"""
    # scale all features to [0, 1] so no single feature dominates the distances
    scaled = MinMaxScaler().fit_transform(data)
    # initializing the best score and model for each metric
    best_db, best_model = np.inf, None
    for k in range(2, clusters + 1):
        # note: without a fixed random_state, each call may yield different
        # labels, which is exactly the behavior described above
        model = KMeans(n_clusters=k).fit(scaled)
        db = davies_bouldin_score(scaled, model.labels_)  # lower is better
        if db < best_db:
            best_db, best_model = db, model
    # silhouette of the winning configuration, for reference (higher is better)
    print(silhouette_score(scaled, best_model.labels_))
    return best_model.labels_, best_model.cluster_centers_
I used k-means to study image compression, and for the same input it produced several different 'compressed' images, some of them larger than the initial image. Yes, the results depend on the randomized initialization and on the input data. I found great suggestions here: https://stackoverflow.com/questions/25921762/changes-of-clustering-results-after-each-time-run-in-python-scikit-learn
"The algorithm is only a heuristic. It may yield suboptimal results. Running it multiple times gives you a better chance of finding a good result."
"Typically when running algorithms with many local minima it's common to take a stochastic approach and run the algorithm many times with different initial states. This will give you multiple results, and the one with the lowest error is usually chosen to be the best result.
When I use K-Means I always run it several times and use the best result."
Running k-means clustering several times is a good first approach, and taking the "best" result in the end is a fine strategy.
But consider widening the approach: instead of just comparing "best" results on the summed squared differences between data points and cluster centers, use a quality criterion such as the Silhouette coefficient, the gap statistic, the Calinski-Harabasz index, the Dunn index, the Davies-Bouldin index, or others mentioned above,
and then decide on one of those and explicitly optimize all meta-parameters (such as the number of patterns, the initial k, ...) w.r.t. this index.
For image compression, I had fruitful results using the Calinski-Harabasz index.
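For reference, this is roughly how I sweep the number of clusters against the Calinski-Harabasz index in scikit-learn (stand-in data; higher scores are better):

from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=6, random_state=0)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)  # higher is better
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])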