Inspired by Aggarwal et al.'s paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space", the question arises which distance measure is best suited for high-dimensional data. My targeted analysis tasks are: NN calculation, clustering, and the projection of data.
I doubt that there is a "best" distance measure, nor is there a "best" method for clustering. Suppose you use partition-based clustering; then k-means may do a good job (because it is simple), or distances similar to the Mahalanobis distance may be preferred. If you expect clusters with irregular shapes, say like curved lines, then things change a lot... As I learnt, one has to be very careful in setting up the problem and making the choice depending on what kind of patterns we want to identify.
I personally very much like the cosine of the angle (uncentered Pearson correlation) for multidimensional data because it has several advantages:
- it is bounded to the interval [-1, 1]
- the distribution of the correlation between random vectors becomes narrowly focused around zero as the dimensionality grows, so the significance of small correlations increases with growing dimensionality
- it is good at capturing the similarity of patterns of feature changes while disregarding the absolute amplitude of the compared feature vectors. This is very useful for certain kinds of data (gene expression patterns, multivariate phenotype analysis, proteomics, mineral composition data in geology)
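For concreteness, a minimal sketch of the uncentered correlation (cosine of the angle) between two feature vectors, using NumPy; the example profiles are made up:

```python
import numpy as np

def cosine_similarity(x, y):
    """Uncentered Pearson correlation: the cosine of the angle between x and y."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Two expression-like profiles that differ only in amplitude:
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
print(cosine_similarity(a, b))  # 1.0 -> identical pattern, amplitude disregarded
```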
What about the Parsec Distance? :-) Seriously though, there is no best distance; it depends on your data. First try to reduce the dimensionality to avoid the "curse of dimensionality", and then, depending on the "behavior" (Gaussian, uniform, spiral...) of your remaining data, choose a kernel that is able to discriminate this behavior.
You might want to check out our paper on re-scaling of distances to avoid problems measuring distances in high dimensions:
Schnitzer D., Flexer A., Schedl M., Widmer G.: Local and Global Scaling Reduce Hubs in Space, Journal of Machine Learning Research, 13(Oct):2871-2902, 2012.
http://jmlr.org/papers/v13/schnitzer12a.html
It demonstrates that re-scaling can take care of problems like concentration of distances and hubness.
The so-called "best" measure of distance depends on the distribution of your data and also on the kind of analysis you want to carry out.
In addition to the distribution of the data, as Marc and Ranjan pointed out, further questions come up:
1) DISTANCE TO WHAT?
2) What is your definition of DISTANCE? Hamming distance? Geometric distance? Algorithmic distance? ...
*** If you are asking about the distance to a global center, like the origin or the average of all of the points in the data, this is GEOMETRIC DISTANCE. Then take the Euclidean distance and be done with it: Distance = SQRT( Sum_i (point_i - GLOBPOINT_i)^2 ), where the sum runs over all dimensions i.
*** If you are asking about Hamming distance, this is something like: HOW MANY BIT FLIPS DOES IT TAKE TO GET FROM ONE POINT TO THE OTHER. I can see this in MEMORY ERROR CORRECTION ... and many other error correction applications ... Each bit location could be modeled as a dimension; then the distance is the sum of the per-bit Hamming distances over all dimensions.
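A hedged sketch of these two notions; the data and bit patterns are toy examples:

```python
import numpy as np

# GEOMETRIC DISTANCE: Euclidean distance of each point to a global reference,
# here the mean of all points (toy data: 100 points in 50 dimensions).
X = np.random.rand(100, 50)
center = X.mean(axis=0)
geom_dist = np.sqrt(((X - center) ** 2).sum(axis=1))

# HAMMING DISTANCE: how many bit flips it takes to get from one pattern to the other.
def hamming(a, b):
    return bin(a ^ b).count("1")

print(geom_dist[:3])
print(hamming(0b1011, 0b0010))  # 2 bit flips
```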
*** If you are asking about Algorithmic distance, this gets a little trickier ... This is more like asking: HOW MANY TRANSFORMATIONS DOES IT TAKE TO GET FROM ONE DATA POINT TO THE OTHER ... Of course, these transformations must be ones allowed for your dataset ... Examples are: rotations, transpositions, additions. I am wondering if this sounds like what Nikolay pointed out ...
*** Also, you could define a distance to the CLOSEST CLUSTER CENTER, which assumes that you have clustered the data in the n-dimensional space initially and the cluster centers are clear ... Now it is just a matter of calculating the distance of each point to the closest cluster center. For example, you could be answering the question: DOES THIS PATIENT DATA POINT TO A HEALTHY, UNDER-RISK, OR COMPLETELY-SICK INDIVIDUAL? And your n dimensions could be the patient vitals, such as blood pressure, heart rate, cholesterol, etc ...
You will define your diagnosis in terms of the distance to the closest cluster center, such as (assuming normalized values):
if Distance(Patient, Sick Cluster center) >= 0.9, diagnose SICK (high confidence).
if Distance(Patient, Sick Cluster center) >= 0.6, diagnose SICK (low confidence).
if Distance(Patient, Sick Cluster center)
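A minimal sketch of the closest-cluster-center idea, assuming the data has already been clustered with k-means; the patient vitals and the cluster count are made up:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy patient vitals (e.g. blood pressure, heart rate, cholesterol), normalized to [0, 1].
X = np.random.rand(200, 3)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance of a new patient to each cluster center; diagnose by the closest one.
patient = np.array([0.7, 0.9, 0.8])
distances = np.linalg.norm(kmeans.cluster_centers_ - patient, axis=1)
closest = distances.argmin()
print(f"closest cluster: {closest}, distance: {distances[closest]:.3f}")
```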
It depends on your data, but I think k-means is really simple and acceptable because it is based on unsupervised learning and iterative procedures.
Perhaps there is a misunderstanding. It is true that unsupervised classification is said not to depend on examples, which would give a target classification established by an expert. But then the expert has to say how the distinction should be made, instead of which distinction we want to see. Supervised learning is strong when the supposed discrimination problem is complicated. Indeed, with supervised learning approaches one largely avoids making a-priori assumptions about the discrimination function. In unsupervised approaches we have to do this, and the functions are typically rather simple. It is like yin and yang: either one makes a-priori assumptions by defining examples with a target classification (and limited assumptions on the discrimination function), or one makes a-priori assumptions on the discrimination function (distances).
Interesting question and interesting answers. One point I would like to bring up is the curse of dimensionality. As we all know, this means high-dimensional data are very sparse, and if the problem at hand (for which we need to compute the distance) is classification/clustering, maybe one shouldn't think about distance-based approaches at all. The reason is that these approaches are based on the nearest neighbor concept, which suffers from the curse of dimensionality. Instead, exploit other aspects of the data, such as structuredness, and look for example into SVMs.
No technique is uniquely better suited to higher-dimensional data per se. The usefulness of a distance measure is intimately related to the geometry of each class, the relationship between the classes, and both the correlation between the measures and their physical, functional relationships.
You may have better luck and better performance by looking at sets of measures with high correlation or strong physical relationships, then deriving stable, meaningful measures estimated from those sets. Proceeding from that basis will generally provide better success with the standard data reduction and classifier design. The results you get will have a clearer interpretation, and performance will generalize more cleanly.
In the process, it should become clear that the appropriate distance measure is intimately tied to the design of the measures applied to the classifier.
I came across this kind of question while working on a project to elucidate the nature of Time. It led me to the domain of genotype information and a multidimensional signal distance measure in scale-space.
http://milanjovovic.wordpress.com/2013/07/19/genotype-information-and-the-space-time-generation/
Dear followers, thank you very much for your valuable and inspiring answers! FYI: in the meantime I also labored on geodesic distances (and weighted shortest paths), which tend to be beneficial when the structure of dense regions is unknown. To conclude the discussion so far: reduce complexity to improve the quality of your measure, either by (a) reducing the dimensionality (which is difficult since the intrinsic dimensionality is unknown) or (b) estimating the number of clusters and their shape (which is also difficult since the number of clusters is unknown). I think the problem at heart is to solve such challenges first and then measure similarity... Again: thank you very much for your answers!
Mahalanobis distance for quantitative variables, Jaccard for qualitative variables with many zeros, correlation metrics if we focus on profiles....
First, a philosophical question: if it is high dimensional data, do you really believe that a comparison can be reduced to a one-dimensional metric?
Having accepted that a metric is necessarily an approximation, one approach is essentially an information-theoretic measure. Given a data set consisting of examples drawn from some number of distinct classes, what is the probability, P, that two examples were drawn from the same class? The "metric" is -log(P). I put it in quotes for two reasons: it may not always satisfy the triangle inequality, and, more importantly, the measure depends on the rest of the data set, not just the two examples being compared. Finally, it may not be easily computable, although this is obviously not a formal requirement for a measure being a metric. Nevertheless, as a framework it will let you make comparisons across heterogeneous dimensions and missing data.
Weighted Euclidean distance can bring effective help. With PCA plus a contribution coefficient you can determine the importance of each variable and then apply the weights; a sketch follows below.
Regards
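One possible reading of this suggestion, as a hedged sketch: derive per-variable weights from the PCA loadings (here squared loadings weighted by explained variance, which is only one choice of "contribution coefficient") and plug them into a weighted Euclidean distance:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(300, 20)  # toy data with 20 variables
pca = PCA().fit(X)

# One possible contribution coefficient: squared loadings weighted by explained variance.
weights = (pca.components_ ** 2 * pca.explained_variance_ratio_[:, None]).sum(axis=0)
weights /= weights.sum()

def weighted_euclidean(x, y, w):
    """Euclidean distance with a per-variable weight vector w."""
    return np.sqrt((w * (x - y) ** 2).sum())

print(weighted_euclidean(X[0], X[1], weights))
```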
Bernard, the measure depends, in my view, on the research target. If, for example, you check quality failure data in order to avoid such failures, the measure would probably be a function of cause-of-failure DAMAGE; and if you check the same data for planning a workshop layout, the measure would be the PHYSICAL DISTANCE.
It is so because in the end one has to bring the various attributes to a common denominator with the target.
In short, the theory of DATA RELATIVITY, if any more headache is needed :)
Note: many assume that multidimensionality is a problem, but for me it is most useful information that enables richer, in-depth insight into what makes the data tick, and it actually helps in the process of diagnostics / analytics.
Edith, Home of GT data mining
Jürgen Bernard, please take note, if you are interested in my reports: if one assumes scale-space wave information propagation with the associated uncertainty in the data, then it jointly solves for the cluster distribution and its dimensionality, since it computes their eigenvectors.
http://milanjovovic.wordpress.com/2012/10/12/multidimensional-scaling-dynamical-cascades/
I’d summarize that the measure is less important than the transformation we apply to the data before we calculate the distance, and the set of transformations we apply is domain-specific. For example, cosine distance is equivalent to Euclidean distance applied to vectors normalized by their length, which used to be a typical approach for text mining. Extending that approach, people from text mining end up with TF-IDF normalization plus Euclidean distance (or an equivalent Gaussian kernel) running over it. Similarly, the classical normalization or Z-transform is equivalent to a special kind of weighted Euclidean distance. If we know how to weight the attributes, we can do it as a preprocessing step, or search for appropriate weights before applying Euclidean distance with kNN (or another distance-based classifier). Using the Mahalanobis distance is likewise equivalent to an initial transformation that rotates and weights the original attributes (one comment: I would not do PCA on really high-dimensional data, since that may kill your computer if we are talking about thousands or hundreds of thousands of dimensions, and the quality depends on the number of vectors). In data mining, where the data is heterogeneous (different sources, different types of attributes), it is necessary to transform subsets of attributes independently, and on top of that a classical Euclidean distance may be used. Of course, in some applications it is required to switch from Euclidean to more sophisticated distance metrics, but these are domain-specific.
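A quick numerical check of the first equivalence mentioned above (cosine distance vs. Euclidean distance on length-normalized vectors), on made-up vectors:

```python
import numpy as np

x, y = np.random.rand(1000), np.random.rand(1000)

# Length-normalize, then take the squared Euclidean distance.
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
euclid_sq = np.linalg.norm(xn - yn) ** 2

# Cosine distance: 1 - cos(angle). Relation: ||xn - yn||^2 = 2 * (1 - cos).
cos_dist = 1.0 - np.dot(xn, yn)
print(euclid_sq, 2 * cos_dist)  # the two values coincide
```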
Finally, in my opinion, moving to probabilistic distance metrics is like switching from one problem to another. As someone mentioned, we have to estimate probabilities, and how can we do that? What is the quality of the estimation in high-dimensional (sparse) data, and do we really benefit from that switch?
To sum up, I would say that in most cases the problem is in the transformations, and one should consider how to prepare (transform) the data to meet the domain-specific properties (text mining, image processing, genomics, etc.).
One aspect to look at is hubness: in higher dimensional spaces data becomes sparser and some data points tend to become hubs (they appear more often in the neighbourhood of other points). This affects clustering by k-NN (see e.g. http://eprints.kobson.nb.rs/30/1/215006-2010-2487-Radovanovic.pdf , http://wwwiti.cs.uni-magdeburg.de/~stober/publ/hubness2013.pdf ). Some distance measures have been designed to avoid this effect ( http://jmlr.org/papers/volume13/schnitzer12a/schnitzer12a.pdf , http://ailab.ijs.si/nenad_tomasev/files/2011/08/cikm0065-tomasev.pdf ). Reducing dimensionality (not by clustering) can also avoid this effect.
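For illustration, hubness can be quantified via the k-occurrence of each point (how often it appears among the k nearest neighbours of other points); values far above k indicate hubs. A hedged sketch on synthetic data with scikit-learn:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 100)  # 500 points in 100 dimensions
k = 10
_, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
neighbours = idx[:, 1:]  # drop each point itself

# k-occurrence: how often each point shows up in other points' k-NN lists.
k_occurrence = np.bincount(neighbours.ravel(), minlength=len(X))
print("max k-occurrence:", k_occurrence.max(), "vs.", k, "expected under uniformity")
```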
The answer to your question cannot be given for all "high dimensional data". Why?
The main problem is that various applications use various means for 'producing' multidimensional data, e.g. from molecules, sales data, a face in an image, webpages, etc.
What that means is that, although the *structures* of the original objects are fundamentally different, such structures cannot be uniformly and adequately captured by the corresponding "high dimensional data". In other words, the methods involved in producing "high dimensional data" are incredibly varied, but all of them amount to the numerical 'dismemberment' of the corresponding structural information: e.g. the structure of a face is fundamentally different from the structure of a molecule.
So when you ask about the best distance measure, you obviously need it for classification purposes, which are, in turn, related to the structure of the original objects, which, as mentioned above, are radically different.
So there is no useful general distance measure!
As a number of people may have already pointed out, there is no such thing as "the best distance". However, the degree to which a given distance is appropriate or useful in a given situation does _not_ depend on the data, nor on how the data is distributed, but rather on _your_ goals. When you define or pick a distance, you are defining what similarity means for you, in that particular application.

Let us consider a set of 5 experiments and a variable g1 that has a value of 1, 2, 3, 4 and 5 in the 5 experiments, respectively. This variable can be represented as g1 = (1, 2, 3, 4, 5). Let us also consider the variables g2 = (100, 200, 300, 400, 500) and g3 = (5, 4, 3, 2, 1). The correlation distance will place g1 in the same cluster with g2 and in a different cluster from g3, because (1, 2, 3, 4, 5) and (100, 200, 300, 400, 500) have a high correlation whereas (1, 2, 3, 4, 5) and (5, 4, 3, 2, 1) are anti-correlated. However, the Euclidean distance will place g1 in the same cluster with g3 and in a different cluster from g2. Hence, if in your application you are looking for things that are similar in the sense of being in the same neighborhood of the space according to our intuition, then Euclidean would be the distance that achieves your goal. If, on the other hand, you are looking for things that go up and down together, regardless of their range, correlation would be better.
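The example above can be checked directly; a minimal sketch with SciPy:

```python
import numpy as np
from scipy.spatial.distance import correlation, euclidean

g1 = np.array([1, 2, 3, 4, 5], dtype=float)
g2 = np.array([100, 200, 300, 400, 500], dtype=float)
g3 = np.array([5, 4, 3, 2, 1], dtype=float)

# Correlation distance (1 - Pearson r): g1 groups with g2, not with g3.
print(correlation(g1, g2), correlation(g1, g3))  # ~0.0 vs ~2.0

# Euclidean distance: g1 groups with g3, not with g2.
print(euclidean(g1, g2), euclidean(g1, g3))      # large vs ~6.3
```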
For a detailed discussion and lots of examples, see Chapter 18 in
"Statistics and Data Analysis for Microarrays using R and Bioconductor"
http://www.amazon.com/Statistics-Microarrays-Bioconductor-Mathematical-Computational/dp/1439809755/ref=sr_1_1?ie=UTF8&qid=1384802595&sr=8-1&keywords=sorin+draghici
Take a look at some "manifold learning" examples - methods for projecting high dimensional data to lower dimensions while maintaining local properties like distance to neighbors, etc.
http://www.cs.cmu.edu/~efros/courses/AP06/presentations/melchior_isomap_demo.pdf and http://www.cs.nyu.edu/~roweis/lle/swissroll.html .
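As a hedged sketch, the "swiss roll" and a manifold-learning projection of it can be reproduced with scikit-learn (the parameters here are illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Synthetic swiss roll: 3-D points lying on an intrinsically 2-D manifold.
X, t = make_swiss_roll(n_samples=1000, random_state=0)

# Isomap preserves geodesic (along-the-manifold) distances rather than straight-line ones.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (1000, 2)
```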
After viewing the pictures of the "swiss roll", a synthetic high-dimensional data set, you'll agree that "distance" depends upon your intended application.
As others have mentioned, life is complicated in high dimensions. Enjoy!
What seems to be a problem here (distance measurement in high-dimensional spaces) is also the driving force behind the channel capacity theorem (by C. Shannon - http://www3.alcatel-lucent.com/bstj/vol27-1948/articles/bstj27-4-623.pdf). That is to say that in high-dimensional spaces, a pattern disturbed by (stationary) noise tends to be found on the surface of a hypersphere around this pattern, instead of in a typical cloud-like cluster. I believe that distances that take this phenomenon into account are better suited to high-dimensional data.
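A quick numerical illustration of that shell effect, assuming i.i.d. Gaussian noise: the norms of the noise vectors concentrate around a fixed radius as the dimension grows, so the relative spread shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10000):
    noise = rng.normal(0.0, 1.0, size=(5000, d))
    norms = np.linalg.norm(noise, axis=1)
    # Relative spread of the norms shrinks with the dimension (shell, not cloud).
    print(d, round(norms.mean(), 2), round(norms.std() / norms.mean(), 4))
```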
Dear Jürgen, I don't think that there is a "best" distance in absolute terms, but rather a better distance according to some criteria. I worked with the Mahalanobis distance and I found it very useful, but I can't say whether it was, and actually is, the best!
I think the distance measure is not chosen based on high dimensionality as such; normally the distance measure is applied based on the type of data.
For numerical data we normally apply the Euclidean or Manhattan distance.
For text data, cosine similarity is applied.
The choice of measure depends on the nature of the distribution and on your goal. If the distribution is close to uniform and/or the goal is to pay most attention to the medium-to-large shares, use the Euclidean or Minkowski distances. If the distribution is sharply non-uniform (differences of several orders of magnitude) and the goal is to account for differences even among the smallest components, then use the "anentropy" divergence given by the formula Da = Σ|ln(p_i/q_i)|, where p and q are the frequencies, probabilities, or shares of the components. If the situation is intermediate, use Kolmogorov's "entropy" divergence De = Σ(p_i - q_i) ln(p_i/q_i). A proof that one distance has the highest quality for all cases does not exist. (T.G. Petrov, O.I. Farafonova. Information-Component Analysis. The RHA Method. St. Petersburg, 2005.)
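A minimal sketch of the two divergences named above; the component shares p and q are made up and assumed strictly positive:

```python
import numpy as np

def anentropy_divergence(p, q):
    """Da = sum |ln(p_i / q_i)| -- sensitive even to the smallest components."""
    return np.abs(np.log(p / q)).sum()

def entropy_divergence(p, q):
    """De = sum (p_i - q_i) * ln(p_i / q_i) (Kolmogorov's 'entropy' divergence)."""
    return ((p - q) * np.log(p / q)).sum()

p = np.array([0.70, 0.20, 0.09, 0.01])
q = np.array([0.60, 0.30, 0.05, 0.05])
print(anentropy_divergence(p, q), entropy_divergence(p, q))
```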
To elaborate more, can we rephrase the question, or add to the current question: "what do we expect from a 'good' or 'best' distance measure?"
I have to differ here. We just published a paper on choosing norms *other* than the l^2 norm (Euclidean distance) which has been shown to be quite bad in high dimensional spaces. The paper also contains pointers to the literature on what can go wrong when measuring distances in high-dimensional spaces.
Schnitzer D., Flexer A.: Choosing the Metric in High-Dimensional Spaces Based on Hub Analysis, in Proceedings of the 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 2014.
Abstract: "To avoid the undesired effects of distance concentration in high-dimensional spaces, previous work has already advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm. Closely related to concentration is the emergence of hub and anti-hub objects. Hub objects have a small distance to an exceptionally large number of data points while anti-hubs lie far from all other data points. The contribution of this work is an empirical examination of concentration and hubness, resulting in an unsupervised approach for choosing an ℓp norm by minimizing hubs while simultaneously maximizing nearest neighbor classification."
http://www.ofai.at/cgi-bin/tr-online?number+2014-03
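A minimal sketch of a fractional ℓp distance, computed directly with NumPy (for p < 1 it is only a quasi-metric); how to choose p, e.g. by the hub analysis described in the paper, is not reproduced here:

```python
import numpy as np

def lp_distance(x, y, p):
    """Minkowski-style l^p distance; p < 1 gives a fractional quasi-metric."""
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

x, y = np.random.rand(1000), np.random.rand(1000)
for p in (2.0, 1.0, 0.5):
    print(p, lp_distance(x, y, p))
```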
Feature selection is a key issue when developing a distance measure for high-dimensional data and performing data mining.
Momiao Xiong
I agree with Arthur Flexer that L2 is usually not the best in high dimensions; it actually often leads to the problem that all neighbours are equi-distant.
I highly recommend the ICDT paper by Beyer et al. to get the intuition:
http://dl.acm.org/citation.cfm?id=656271
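The effect is easy to reproduce; a hedged sketch showing how the contrast between the nearest and the farthest neighbour of a query shrinks as the dimension grows (uniform toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((2000, d))   # data set
    q = rng.random(d)           # query point
    dists = np.linalg.norm(X - q, axis=1)
    # Ratio of farthest to nearest distance approaches 1 as d grows.
    print(d, round(dists.max() / dists.min(), 2))
```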
I believe that the question has no good answer. There simply isn't an answer as to which distance measure is best suited for high-dimensional data, because it is an ill-defined question. It always depends on the choice of representation. Others have already commented that it first comes down to feature selection or feature engineering. Features are rarely a homogeneous set and are often mixed-mode (symbolic, discrete, continuous, etc.). How to represent objects cannot be separated from how to measure distances/similarity/dissimilarity; the two questions are not separable and must be answered simultaneously.

But let us suppose for a moment that L1 (or some other metric) is found to be "best" in 80% of all cases ever studied in "NN calculation, clustering, and the projection of data", the target applications that the question is motivated by. This L1 observation is a totally useless observation when you come to solve a specific problem where you really want the best possible solution that you can find. Will you really take the risk and not check L2, cosine similarity, or some other metric just because L1 is *usually* better? Surely you will explore quite a few options, within reason.
I have experience dealing with high-dimensional data. PCA was my first attempt to solve the problem. One advantage of PCA is that it is good for data with multicollinearity among the included variables. So if this is your case, you may have chosen the right approach; if it is not, forget about PCA. If PCA does work for your problem, my next question would be: are you interested in knowing which variables are influential? Knowing that some variables are redundant and unimportant, you may want to include only the influential variables in your next analysis, and then PCA will no longer be helpful. In this case, you would need to apply feature selection. For such a case, i.e. high-dimensional data, I would suggest the bounded Mahalanobis distance, which works well in my research. Or, if you like, you may use the common Mahalanobis distance, which I label the unbounded Mahalanobis distance. I have published two related papers on this issue:
Conference Paper Understanding Mahalanobis Distance Criterion for Feature Selection
Conference Paper Sensors closeness test based on an improved [0, 1] bounded M...
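For reference, a minimal sketch of the common (unbounded) Mahalanobis distance with SciPy; the bounded variant discussed in the papers above is not reproduced here:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

X = np.random.rand(500, 5)                       # toy data with 5 variables
VI = np.linalg.inv(np.cov(X, rowvar=False))      # inverse covariance matrix

# Distance of the first observation from the data centre.
print(mahalanobis(X[0], X.mean(axis=0), VI))
```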
I agree with Shlomo... the choice depends on the problem you have. For instance, one may base the metric on the Euclidean distance, or on the Mahalanobis distance, which is derived by normalizing with respect to the inverse of the covariance matrix. Others use a similar metric but normalize with respect to the inverse of the dispersion matrix, which allows the separation of the "within" and "between" variance (better: "dispersion"), but then a probabilistic interpretation of distance is not so straightforward. The angle (between the normalized feature vectors) has a serious shortcoming: we lose the information about the absolute values. The angle is a good measure, for instance, when we compare the "shape" of two objects and can neglect their "size"...