I want to use a centroid-based clustering algorithm to perform some pattern analysis. I need to know what kind of measure I can use instead of Euclidean distance, and whether this will improve the results in some way.
I don't fully understand what you want. Data Mining depends on three steps: Data Representation, Dissimilarity Measure and Clustering/Classification algorithm. Which of these steps are you asking about? You mentioned classification and clustering, but then you asked about a dissimilarity measure (Euclidean).
Data Representation: What kind of data are you working with? It will restrict the following steps.
Distance Measure: Have you tried other Minkowski distances besides Euclidean? I've seen that you work with Meteorology, so your data may involve Time Series. In that case, try Dynamic Time Warping (DTW).
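To make the DTW suggestion concrete, here is a minimal sketch of the classic dynamic-programming formulation in plain Python. The absolute-difference local cost, the function names, and the two toy series are my own assumptions for illustration, not anything taken from your data.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def euclidean(a, b):
    """Point-by-point Euclidean distance for equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Two hypothetical series with the same shape, one shifted by one time step.
s1 = [0, 0, 1, 2, 1, 0, 0]
s2 = [0, 1, 2, 1, 0, 0, 0]

print(dtw_distance(s1, s2))  # 0.0: the warping absorbs the shift
print(euclidean(s1, s2))     # 2.0: the shift is penalised point by point
```

The point of the toy example is that DTW treats the two series as identical up to a time shift, while Euclidean distance does not. Whether that invariance is desirable depends on whether timing matters in your problem.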
Clustering/Classification: What methods have you already applied? Do you have a Training Set? Initially, I would try k-means (Clustering) and k-NN (Classification).
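As a starting point, both baselines can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: squared Euclidean distance, a deterministic initialisation that takes the first k points as centroids (real implementations usually randomise this), and made-up 2-D points.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iters=100):
    """Lloyd's algorithm; the first k points are the initial centroids (assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_assign == assign:
            break  # converged: assignments no longer change
        assign = new_assign
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assign


def knn_predict(x, train_points, train_labels, k=3):
    """Majority vote among the k nearest training points."""
    nearest = sorted(range(len(train_points)), key=lambda i: sq_dist(x, train_points[i]))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical data: two well-separated groups
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)                                   # [0, 1, 0, 0, 1, 1]
print(knn_predict((9, 9), points, labels, k=3))  # 1
```

Note that if you swap the distance in the assignment step for something other than Euclidean, the mean in the update step is no longer guaranteed to be the best centroid, so a medoid-based variant may be a safer choice.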
Quality Measures: How are you calculating the quality of your results? Accuracy may be misleading. Try Precision, Recall and F-measure if you have a Ground Truth. If you don't, try internal indices such as the Silhouette coefficient or the Dunn Index.
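To illustrate why accuracy alone can mislead, here is a small sketch with hypothetical, heavily imbalanced labels (95 negatives, 5 positives) and a trivial classifier that always predicts the majority class. The function names and data are assumptions made up for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F-measure for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical imbalanced ground truth and a classifier that always says "0"
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
p, r, f = precision_recall_f1(y_true, y_pred)
print(acc)      # 0.95, which looks good
print(p, r, f)  # 0.0 0.0 0.0, the positive class is never found
```

The classifier scores 95% accuracy while detecting none of the positive cases, which is exactly the situation the other measures expose.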
Trying to interpret your question, there are two questions here: (1) forming the clusters and then defining their centroids, and (2) subsequently classifying an observation based on some "distance" metric from these centroids. This is done either with training set data or with untrained data.
Consider the case where you have training set data, that is, you know the classification of the data. (The case without a training set is similar.) Let's go to (1) first: you have formed your clusters. Each cluster is a set of data points representing the same class, and each data point is represented by a vector in Euclidean space. If the dimension of the data is small relative to the number of data points in the cluster, then finding a closed form of the distribution, based on the empirical distribution and interpolation, would be the ideal situation. In general, though, this may not be feasible, and the closed form may be too complicated to work with.
Nonetheless, it is imperative to check whether the empirical density function has multiple local maxima. Roughly speaking, the number of local maxima indicates how many sub-clusters best represent the cluster (class). At this point you can use the locations of these local maxima to model the density as a Gaussian mixture, with each location serving as the mean of one mixture component. These mean vectors are your centroids for the cluster in question, so you may have multiple centroids, indicating that your classes are in fact further subdivided into sub-classes.
Each of these sub-classes is identified by three parameters: the mean vector, the covariance matrix, and the weight of the corresponding mixture component (sub-centroid). Now you can classify a data point by measuring its Mahalanobis distance to each component (it is a sort of normalized Euclidean distance, normalized by the covariance matrix).
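The classification step can be sketched as follows, assuming 2-D data so that the covariance inverse has a closed form. The component list, labels and numbers are hypothetical, and the rule shown (pick the component with the smallest Mahalanobis distance) ignores the mixture weights, which a full likelihood-based rule would use.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance for 2-D points using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # (dx, dy) * inv(cov) * (dx, dy)^T, with inv(cov) = [[d, -b], [-c, a]] / det
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Hypothetical mixture: class "A" has two sub-clusters, class "B" has one wide one.
# Weights are stored for completeness but not used by the simple rule below.
components = [
    {"label": "A", "mean": (0.0, 0.0), "cov": ((1.0, 0.0), (0.0, 1.0)), "weight": 0.5},
    {"label": "A", "mean": (5.0, 0.0), "cov": ((1.0, 0.0), (0.0, 1.0)), "weight": 0.5},
    {"label": "B", "mean": (0.0, 4.0), "cov": ((4.0, 0.0), (0.0, 4.0)), "weight": 1.0},
]


def classify(x, components):
    """Return the class label of the component nearest in Mahalanobis distance."""
    best = min(components, key=lambda comp: mahalanobis_2d(x, comp["mean"], comp["cov"]))
    return best["label"]


print(classify((0.0, 1.5), components))  # B
print(classify((4.5, 0.0), components))  # A
```

The first point is closer to the first "A" centroid in plain Euclidean terms (1.5 versus 2.5), but the wider spread of "B" gives it the smaller Mahalanobis distance (1.25 versus 1.5), which is the effect of normalizing by the covariance matrix.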
If your data points are N-dimensional vectors and the number of classes that you need to classify is M where M
What problem are you working on, and why is the Euclidean distance not an effective metric? What kind of data set are you assuming? Let us know a little bit about the structure of your data set (metric space). The choice of a clustering metric (distance function = utility function) depends on the underlying problem and the structure of the assumed data set, and these are not clear here.
Something like the dynamic time warping measure is exactly what I was looking for. I started with k-means using Euclidean distance, but the real problem I have involves time series, so thanks for your recommendation in this regard.
Well, there are several methods in clustering algorithms for calculating the distance between numerical attributes (between the centroid and the observations in the same cluster):
- Minkowski Distance;
- Manhattan Distance;
- Euclidean Distance;
- Chebyshev Distance;
- Cosine Distance; and
- Jaccard Distance.
Each of them has pros and cons. You may try one of them based on your situation.
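For reference, the measures in the list above can be sketched in plain Python as follows. Manhattan and Euclidean are the Minkowski distance with p = 1 and p = 2, and Chebyshev is its limit as p grows. The function names are my own, and Jaccard is shown on sets, which is an assumption about how your attributes would be encoded.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p (p >= 1) between equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    return minkowski(a, b, 1)


def euclidean(a, b):
    return minkowski(a, b, 2)


def chebyshev(a, b):
    """Largest coordinate-wise difference (Minkowski limit as p grows)."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


u, v = (0, 0), (3, 4)
print(manhattan(u, v))                         # 7.0
print(euclidean(u, v))                         # 5.0
print(chebyshev(u, v))                         # 4
print(cosine_distance((1, 0), (0, 1)))         # 1.0
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```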