When you apply a clustering method to your dataset, it separates the data into groups so that samples in the same group are as similar as possible and samples in different groups are as dissimilar as possible. For methods such as k-means, the number of groups is an input parameter, that is, you choose it.
K-means groups the data and returns k centroids, i.e. k vectors that represent the center points of the groups, together with a matrix that assigns each sample in your dataset to a group.
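To make the two outputs concrete, here is a minimal sketch of the standard (Lloyd's) k-means iteration in plain Python. The function name `kmeans`, the toy points and the use of a flat list of labels in place of a full assignment matrix are my own assumptions for illustration; they do not come from any particular library.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100, seed=0):
    """Return (centroids, labels) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Start from k distinct samples picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each sample goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

The label list carries the same information as the assignment matrix: entry i is the index of the group that sample i belongs to.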
K-means is a hard method, i.e. it assigns exactly one group to each sample.
There is a soft variant of k-means, named Fuzzy C-means, that lets each sample belong to several groups, each with a membership value.
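As an illustration of what a membership value looks like, here is a small sketch of the usual Fuzzy C-means membership formula for a single sample, given fixed centroids. The function name `fuzzy_memberships` and the example centroids are assumptions of mine, and the fuzzifier `m = 2` is only a common default. A full Fuzzy C-means run would also alternate this step with a weighted centroid update, which is left out here.

```python
import math

def fuzzy_memberships(point, centroids, m=2.0):
    """Membership of `point` in each cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centroids]
    # A sample lying exactly on one or more centroids belongs fully to them.
    if 0.0 in dists:
        hits = [d == 0.0 for d in dists]
        return [1.0 / sum(hits) if h else 0.0 for h in hits]
    exponent = 2.0 / (m - 1.0)
    # u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1))
    return [
        1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
        for d_j in dists
    ]

centroids = [(0.0, 0.0), (4.0, 0.0)]
print(fuzzy_memberships((1.0, 0.0), centroids))  # mostly in the first group
print(fuzzy_memberships((2.0, 0.0), centroids))  # halfway: equal memberships
```

A hard method like k-means would simply pick the group with the largest membership.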
Clustering methods belong to unsupervised learning, because they let you find hidden connections in the data, discovering knowledge.
For this reason, these methods are very relevant in the data mining process, where we want to "mine" knowledge from data.
Unsupervised problems are tricky: since the information that we want to extract from the data is not known a priori, evaluating the results is not as simple as in classification tasks.
I'm not an expert in business data, but suppose that your dataset represents information about the customers of a shop. Then k-means can group the customers according to a similarity measure.
K-means is a simple method that is often used to partition (cluster) data in an unsupervised way. The centers move over many iterations until they converge to a local optimum. Different initializations can lead to different optima (depending on the basins of attraction), so the clustering is usually re-run several times and the best result is kept. Alternatively, more care can be taken when initializing the centroids; see, for example, K-means++.
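Both ideas can be sketched together in plain Python: K-means++ style seeding (each new center is drawn with probability proportional to its squared distance from the nearest center already chosen), followed by several restarts from which the run with the lowest within-cluster sum of squares is kept. The names `kmeanspp_init`, `lloyd` and `best_of_restarts`, the toy data and the default of 10 restarts are assumptions for illustration only.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, rng):
    """Pick k starting centers, favouring points far from those already chosen."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0:  # all points coincide with a chosen center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return [list(c) for c in centers]

def lloyd(points, centroids, n_iter=100):
    """Run k-means from the given centroids; return (centroids, labels, inertia)."""
    k, labels = len(centroids), None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Inertia: sum of squared distances of samples to their assigned centroid.
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia

def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Re-run k-means several times and keep the lowest-inertia result."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        result = lloyd(points, kmeanspp_init(points, k, rng))
        if best is None or result[2] < best[2]:
            best = result
    return best

points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (0, 10), (0, 11), (1, 10)]
centroids, labels, inertia = best_of_restarts(points, k=3)
print(labels, inertia)
```

Keeping the lowest-inertia run is only one way to define "best"; any other quality criterion could be plugged into the comparison.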
K-means can be easily kernelized, and kernel K-means can detect clusters that are not (hyper)spherical in shape.
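The reason kernelization is easy is that the squared distance between a sample and a cluster mean in feature space can be written using kernel evaluations only, so the mean never has to be computed explicitly. A minimal sketch of that distance follows; the function names and the RBF `gamma` value are assumptions of mine. A full kernel K-means would use this distance in the assignment step.

```python
import math

def linear_kernel(a, b):
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=1.0):
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def kernel_sq_distance(x, members, kernel):
    """Squared feature-space distance from x to the mean of `members`.

    ||phi(x) - mu||^2 = K(x, x)
                        - (2 / n)   * sum_j K(x, x_j)
                        + (1 / n^2) * sum_j sum_l K(x_j, x_l)
    """
    n = len(members)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, m) for m in members) / n
    within_term = sum(kernel(a, b) for a in members for b in members) / (n * n)
    return self_term - 2.0 * cross_term + within_term

cluster = [(0.0, 0.0), (2.0, 0.0)]  # ordinary centroid is (1, 0)
# With the linear kernel this reduces to the usual squared Euclidean
# distance to the centroid: (3 - 1)^2 + (4 - 0)^2.
print(kernel_sq_distance((3.0, 4.0), cluster, linear_kernel))
print(kernel_sq_distance((3.0, 4.0), cluster, rbf_kernel))
```

With a non-linear kernel such as the RBF, the same assignment rule can separate groups that no single centroid in the original space would represent well.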
Spatial indexing can be used to speed up the search for the closest centroid by pruning the obviously distant ones (which is not crucial for small K, but almost necessary for larger values; in my implementation it speeds up the clustering by a factor of 5-10, depending on the data).
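The implementation mentioned above is not shown in this thread, so the following is only a generic sketch of the idea, assuming a k-d tree built over the centroids. During the search, a whole branch is skipped when the splitting plane is already farther away than the best centroid found so far. All names here are hypothetical, and a pure-Python version like this shows the pruning logic rather than the quoted speed-up.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(items, depth=0):
    """Build a k-d tree from a list of (index, coordinates) pairs."""
    if not items:
        return None
    axis = depth % len(items[0][1])
    items = sorted(items, key=lambda it: it[1][axis])
    mid = len(items) // 2
    return {
        "item": items[mid],
        "axis": axis,
        "left": build_kdtree(items[:mid], depth + 1),
        "right": build_kdtree(items[mid + 1:], depth + 1),
    }

def nearest(node, point, best=None):
    """Return (squared distance, index) of the centroid closest to `point`."""
    if node is None:
        return best
    idx, coords = node["item"]
    d = sq_dist(point, coords)
    if best is None or d < best[0]:
        best = (d, idx)
    diff = point[node["axis"]] - coords[node["axis"]]
    near, far = (
        (node["left"], node["right"]) if diff < 0
        else (node["right"], node["left"])
    )
    best = nearest(near, point, best)
    # Pruning: visit the far side only if the splitting plane is closer
    # than the best centroid found so far.
    if diff * diff < best[0]:
        best = nearest(far, point, best)
    return best

rng = random.Random(0)
centroids = [(rng.random(), rng.random()) for _ in range(200)]
tree = build_kdtree(list(enumerate(centroids)))

query = (0.3, 0.7)
dist2, index = nearest(tree, query)
print(index, centroids[index], dist2)
```

Since the centroids move at every iteration, the tree would have to be rebuilt (or updated) after each update step, which is cheap as long as K is much smaller than the number of samples.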
Also, the K-means search can be made somewhat stochastic in order to avoid converging to local optima and to guide the final configuration toward a global optimum. We have recently proposed one such method, Global Hubness-Proportional K-Means (GHPKM), which performs rather well on high-dimensional data. The details can be found in the journal paper here:
Thank you, Gabriella Casalino and Nenad Tomašev, for your explanations. This will do for now. I am in the process of understanding predictive analysis in data mining, which helps in making intelligent decisions for any business.
Why don't you use a better clustering algorithm? K-means is one of the simplest, and also one of the less accurate. I would suggest this paper to get a more general idea of the diversity of clustering algorithms: http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf