Inspired by Aggarwal et al.'s paper "On the Surprising Behavior of Distance Metrics in High Dimensional Space", the question arises which distance measure is best suited for high-dimensional data. My targeted analysis tasks are: NN calculation, clustering, and the projection of data.
I doubt that there is a "best" distance measure, nor is there a "best" method for clustering. Suppose you use partition-based clustering; then k-means may do a good job (because it is simple), or distances similar to the Mahalanobis distance may be preferred. If you expect clusters with irregular shapes, say like curved lines, then things change a lot... As I learnt, one has to be very careful in setting up the problem and making the choice depending on what kind of patterns we want to identify.
I personally very much like the cosine of the angle (uncentered Pearson correlation) for multidimensional data because it has several advantages:
- it is bounded to the interval [-1, 1]
- the distribution of the correlation between random vectors becomes narrowly focused around zero as the dimensionality grows, so the significance of small correlations increases with growing dimensionality
- it is good at capturing the similarity of patterns of feature changes while disregarding the absolute amplitude of the compared feature vectors. This is very useful for certain kinds of data (gene expression patterns, multivariate phenotype analysis, proteomics, mineral composition data in geology)
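For concreteness, a minimal sketch of the uncentered correlation (cosine of the angle) between two feature vectors, using NumPy; the example profiles are made up:

```python
import numpy as np

def cosine_similarity(x, y):
    """Uncentered Pearson correlation: the cosine of the angle between x and y."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Two expression-like profiles that differ only in amplitude:
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
print(cosine_similarity(a, b))  # 1.0 -> identical pattern, amplitude disregarded
```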
What about the Parsec Distance? :-) Seriously though, there is no best distance; it depends on your data. First try to reduce the dimensionality to avoid the "curse of dimensionality", and then, depending on the "behavior" (Gaussian, uniform, spiral...) of your remaining data, choose a kernel that is able to discriminate this behavior.
You might want to check out our paper on re-scaling of distances to avoid problems measuring distances in high dimensions:
Schnitzer D., Flexer A., Schedl M., Widmer G.: Local and Global Scaling Reduce Hubs in Space, Journal of Machine Learning Research, 13(Oct):2871-2902, 2012.
http://jmlr.org/papers/v13/schnitzer12a.html
It demonstrates that re-scaling can take care of problems like concentration of distances and hubness.
The so-called "best" measure of distance depends on the distribution of your data and also on the kind of analysis you want to carry out.
In addition to the distribution of the data, as Marc and Ranjan pointed out, further questions come up:
1) DISTANCE TO WHAT?
2) What is your definition of DISTANCE? Hamming distance? Geometric distance? Algorithmic distance? ...
*** If you are asking about the distance to a global center, like the origin or the average of all of the points in the data, this is GEOMETRIC DISTANCE. Then take the Euclidean distance and be done with it: Distance = SQRT( Sum_i (point_i - GLOBPOINT_i)^2 ), where the sum runs over all dimensions i.
*** If you are asking about Hamming distance, this is something like: HOW MANY BIT FLIPS DOES IT TAKE TO GET FROM ONE POINT TO THE OTHER. I can see this in MEMORY ERROR CORRECTION ... and many other error correction applications ... Each bit location could be modeled as a dimension; then the distance is the sum of the per-bit Hamming distances over all dimensions.
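A hedged sketch of these two notions; the data and bit patterns are toy examples:

```python
import numpy as np

# GEOMETRIC DISTANCE: Euclidean distance of each point to a global reference,
# here the mean of all points (toy data: 100 points in 50 dimensions).
X = np.random.rand(100, 50)
center = X.mean(axis=0)
geom_dist = np.sqrt(((X - center) ** 2).sum(axis=1))

# HAMMING DISTANCE: how many bit flips it takes to get from one pattern to the other.
def hamming(a, b):
    return bin(a ^ b).count("1")

print(geom_dist[:3])
print(hamming(0b1011, 0b0010))  # 2 bit flips
```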
*** If you are asking about Algorithmic distance, this gets a little trickier ... This is more like asking: HOW MANY TRANSFORMATIONS DOES IT TAKE TO GET FROM ONE DATA POINT TO THE OTHER ... Of course, these transformations must be ones allowed for your dataset ... Examples are: rotations, transpositions, additions. I am wondering if this sounds like what Nikolay pointed out ...
*** Also, you could define a distance to the CLOSEST CLUSTER CENTER, which assumes that you have clustered the data in the n-dimensional space initially and the cluster centers are clear ... Now it is just a matter of calculating the distance of each point to the closest cluster center. For example, you could be answering the question: DOES THIS PATIENT DATA POINT TO A HEALTHY, UNDER-RISK, OR COMPLETELY-SICK INDIVIDUAL? And your n dimensions could be the patient vitals, such as blood pressure, heart rate, cholesterol, etc ...
You will define your diagnosis in terms of the distance to the closest cluster center, such as (assuming normalized values):
if Distance(Patient, Sick Cluster center) >= 0.9, diagnose SICK (high confidence).
if Distance(Patient, Sick Cluster center) >= 0.6, diagnose SICK (low confidence).
if Distance(Patient, Sick Cluster center)
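A minimal sketch of the closest-cluster-center idea, assuming the data has already been clustered with k-means; the patient vitals and the cluster count are made up:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy patient vitals (e.g. blood pressure, heart rate, cholesterol), normalized to [0, 1].
X = np.random.rand(200, 3)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance of a new patient to each cluster center; diagnose by the closest one.
patient = np.array([0.7, 0.9, 0.8])
distances = np.linalg.norm(kmeans.cluster_centers_ - patient, axis=1)
closest = distances.argmin()
print(f"closest cluster: {closest}, distance: {distances[closest]:.3f}")
```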
It depends on your data, but I think k-means is really simple and acceptable because it is based on unsupervised learning and iterative procedures.
Perhaps there is a misunderstanding. It is true that unsupervised classification is said not to depend on examples, which would give a target classification established by an expert. But then the expert has to say how the distinction should be made, instead of which distinction we want to see. Supervised learning is strong when the supposed discrimination problem is complicated. Indeed, with supervised learning approaches one largely avoids making a-priori assumptions about the discrimination function. In unsupervised approaches we have to do this, and the functions are typically rather simple. It is like yin and yang: either one makes a-priori assumptions by defining examples with a target classification (and limited assumptions on the discrimination function), or one makes a-priori assumptions on the discrimination function (distances).
Interesting question and interesting answers. One point I would like to bring up is the curse of dimensionality. As we all know, this means high-dimensional data are very sparse, and if the problem at hand (for which we need to compute the distance) is classification/clustering, maybe one shouldn't think about distance-based approaches at all. The reason is that these approaches are based on the nearest neighbor concept, which suffers from the curse of dimensionality. Instead, exploit other aspects of the data, such as structuredness, and look for example into SVMs.
No technique is uniquely better suited to higher-dimensional data per se. The usefulness of a distance measure is intimately related to the geometry of each class, the relationship between the classes, and both the correlation between the measures and their physical, functional relationships.
You may have better luck and better performance by looking at sets of measures with high correlation or strong physical relationships, then deriving stable, meaningful measures estimated from those sets. Proceeding from that basis will generally provide better success with the standard data reduction and classifier design. The results you get will have a clearer interpretation, and performance will generalize more cleanly.
In the process, it should become clear that the appropriate distance measure is intimately tied to the design of the measures applied to the classifier.
I came across this kind of question while working on a project to elucidate the nature of Time. It led me to the domain of genotype information and a multidimensional signal distance measure in scale-space.
http://milanjovovic.wordpress.com/2013/07/19/genotype-information-and-the-space-time-generation/
Dear followers, thank you very much for your valuable and inspiring answers! FYI: in the meantime I also labored on geodesic distances (and weighted shortest paths), which tend to be beneficial when the structure of dense regions is unknown. To conclude the discussion so far: reduce complexity to improve the quality of your measure, either by (a) reducing the dimensionality (which is difficult since the intrinsic dimensionality is unknown) or (b) estimating the number of clusters and their shape (which is also difficult since the number of clusters is unknown). I think the problem at heart is to solve such challenges first and then measure similarity... Again: thank you very much for your answers!
Mahalanobis distance for quantitative variables, Jaccard for qualitative variables with many zeros, correlation metrics if we focus on profiles....
First, a philosophical question: if it is high dimensional data, do you really believe that a comparison can be reduced to a one-dimensional metric?
Having accepted that a metric is necessarily an approximation, one approach is essentially an information-theoretic measure. Given a data set consisting of examples drawn from some number of distinct classes, what is the probability, P, that two examples were drawn from the same class? The "metric" is -log(P). I put it in quotes for two reasons: it may not always satisfy the triangle inequality, and, more importantly, the measure depends on the rest of the data set, not just the two examples being compared. Finally, it may not be easily computable, although this is obviously not a formal requirement for a measure being a metric. Nevertheless, as a framework it will let you make comparisons across heterogeneous dimensions and missing data.
Weighted Euclidean distance can bring effective help. With PCA plus a contribution coefficient you can determine the importance of each variable and then apply the weights; a sketch follows below.
Regards
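One possible reading of this suggestion, as a hedged sketch: derive per-variable weights from the PCA loadings (here squared loadings weighted by explained variance, which is only one choice of "contribution coefficient") and plug them into a weighted Euclidean distance:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(300, 20)  # toy data with 20 variables
pca = PCA().fit(X)

# One possible contribution coefficient: squared loadings weighted by explained variance.
weights = (pca.components_ ** 2 * pca.explained_variance_ratio_[:, None]).sum(axis=0)
weights /= weights.sum()

def weighted_euclidean(x, y, w):
    """Euclidean distance with a per-variable weight vector w."""
    return np.sqrt((w * (x - y) ** 2).sum())

print(weighted_euclidean(X[0], X[1], weights))
```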
Bernard, the measure depends, in my view, on the research target. If, for example, you check quality failure data in order to avoid such failures, the measure would probably be a function of cause-of-failure DAMAGE; and if you check the same data for planning a workshop layout, the measure would be the PHYSICAL DISTANCE.
It is so because in the end one has to bring the various attributes to a common denominator with the target.
In short, the theory of DATA RELATIVITY, if any more headache is needed :)
Note: many assume that multidimensionality is a problem, but for me it is most useful information that enables richer, in-depth insight into what makes the data tick, and it actually helps in the process of diagnostics / analytics.
Edith, Home of GT data mining
Jürgen Bernard, please take note, if you are interested in my reports: if one assumes scale-space wave information propagation with the associated uncertainty in the data, then it jointly solves for the cluster distribution and its dimensionality, since it computes their eigenvectors.
http://milanjovovic.wordpress.com/2012/10/12/multidimensional-scaling-dynamical-cascades/
I’d summarize that the measure is less important than the transformation we apply to the data before we calculate the distance, and the set of transformations we apply is domain-specific. For example, cosine distance is equivalent to Euclidean distance applied to vectors normalized by their length, which used to be a typical approach for text mining. Extending that approach, people from text mining end up with TF-IDF normalization plus Euclidean distance (or an equivalent Gaussian kernel) running over it. Similarly, the classical normalization or Z-transform is equivalent to a special kind of weighted Euclidean distance. If we know how to weight the attributes, we can do it as a preprocessing step, or search for appropriate weights before applying Euclidean distance with kNN (or another distance-based classifier). Using the Mahalanobis distance is likewise equivalent to an initial transformation that rotates and weights the original attributes (one comment: I would not do PCA on really high-dimensional data, since that may kill your computer if we are talking about thousands or hundreds of thousands of dimensions, and the quality depends on the number of vectors). In data mining, where the data is heterogeneous (different sources, different types of attributes), it is necessary to transform subsets of attributes independently, and on top of that a classical Euclidean distance may be used. Of course, in some applications it is required to switch from Euclidean to more sophisticated distance metrics, but these are domain-specific.
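A quick numerical check of the first equivalence mentioned above (cosine distance vs. Euclidean distance on length-normalized vectors), on made-up vectors:

```python
import numpy as np

x, y = np.random.rand(1000), np.random.rand(1000)

# Length-normalize, then take the squared Euclidean distance.
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
euclid_sq = np.linalg.norm(xn - yn) ** 2

# Cosine distance: 1 - cos(angle). Relation: ||xn - yn||^2 = 2 * (1 - cos).
cos_dist = 1.0 - np.dot(xn, yn)
print(euclid_sq, 2 * cos_dist)  # the two values coincide
```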
Finally, in my opinion, moving to probabilistic distance metrics is like switching from one problem to another. As someone mentioned, we have to estimate probabilities, and how can we do that? What is the quality of the estimation in high-dimensional (sparse) data, and do we really benefit from that switch?
To sum up, I would say that in most cases the problem is in the transformations, and one should consider how to prepare (transform) the data to meet the domain-specific properties (text mining, image processing, genomics, etc.).
One aspect to look at is hubness: in higher dimensional spaces data becomes sparser and some data points tend to become hubs (they appear more often in the neighbourhood of other points). This affects clustering by k-NN (see e.g. http://eprints.kobson.nb.rs/30/1/215006-2010-2487-Radovanovic.pdf , http://wwwiti.cs.uni-magdeburg.de/~stober/publ/hubness2013.pdf ). Some distance measures have been designed to avoid this effect ( http://jmlr.org/papers/volume13/schnitzer12a/schnitzer12a.pdf , http://ailab.ijs.si/nenad_tomasev/files/2011/08/cikm0065-tomasev.pdf ). Reducing dimensionality (not by clustering) can also avoid this effect.
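For illustration, hubness can be quantified via the k-occurrence of each point (how often it appears among the k nearest neighbours of other points); values far above k indicate hubs. A hedged sketch on synthetic data with scikit-learn:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 100)  # 500 points in 100 dimensions
k = 10
_, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
neighbours = idx[:, 1:]  # drop each point itself

# k-occurrence: how often each point shows up in other points' k-NN lists.
k_occurrence = np.bincount(neighbours.ravel(), minlength=len(X))
print("max k-occurrence:", k_occurrence.max(), "vs.", k, "expected under uniformity")
```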
The answer to your question cannot be given for all "high dimensional data". Why?
The main problem is that various applications use various means for 'producing' multidimensional data, e.g. from molecules, sales data, a face in an image, webpages, etc.
What that means is that, although the *structures* of the original objects are fundamentally different, such structures cannot be uniformly and adequately captured by the corresponding "high dimensional data". In other words, the methods involved in producing "high dimensional data" are incredibly varied, but all of them amount to the numerical 'dismemberment' of the corresponding structural information: e.g. the structure of a face is fundamentally different from the structure of a molecule.
So when you ask about the best distance measure, you obviously need it for classification purposes, which are, in turn, related to the structure of the original objects, which, as mentioned above, are radically different.
So there is no useful general distance measure!
As a number of people may have already pointed out, there is no such thing as "the best distance". However, the degree to which a given distance is appropriate or useful in a given situation does _not_ depend on the data, nor on how the data is distributed, but rather on _your_ goals. When you define or pick a distance, you are defining what similarity means for you, in that particular application.

Let us consider a set of 5 experiments and a variable g1 that has a value of 1, 2, 3, 4 and 5 in the 5 experiments, respectively. This variable can be represented as g1 = (1, 2, 3, 4, 5). Let us also consider the variables g2 = (100, 200, 300, 400, 500) and g3 = (5, 4, 3, 2, 1). The correlation distance will place g1 in the same cluster with g2 and in a different cluster from g3, because (1, 2, 3, 4, 5) and (100, 200, 300, 400, 500) have a high correlation whereas (1, 2, 3, 4, 5) and (5, 4, 3, 2, 1) are anti-correlated. However, the Euclidean distance will place g1 in the same cluster with g3 and in a different cluster from g2. Hence, if in your application you are looking for things that are similar in the sense of being in the same neighborhood of the space according to our intuition, then Euclidean would be the distance that achieves your goal. If, on the other hand, you are looking for things that go up and down together, regardless of their range, correlation would be better.
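The example above can be checked directly; a minimal sketch with SciPy:

```python
import numpy as np
from scipy.spatial.distance import correlation, euclidean

g1 = np.array([1, 2, 3, 4, 5], dtype=float)
g2 = np.array([100, 200, 300, 400, 500], dtype=float)
g3 = np.array([5, 4, 3, 2, 1], dtype=float)

# Correlation distance (1 - Pearson r): g1 groups with g2, not with g3.
print(correlation(g1, g2), correlation(g1, g3))  # ~0.0 vs ~2.0

# Euclidean distance: g1 groups with g3, not with g2.
print(euclidean(g1, g2), euclidean(g1, g3))      # large vs ~6.3
```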
For a detailed discussion and lots of examples, see Chapter 18 in
"Statistics and Data Analysis for Microarrays using R and Bioconductor"
http://www.amazon.com/Statistics-Microarrays-Bioconductor-Mathematical-Computational/dp/1439809755/ref=sr_1_1?ie=UTF8&qid=1384802595&sr=8-1&keywords=sorin+draghici
Take a look at some "manifold learning" examples - methods for projecting high dimensional data to lower dimensions while maintaining local properties like distance to neighbors, etc.
http://www.cs.cmu.edu/~efros/courses/AP06/presentations/melchior_isomap_demo.pdf and http://www.cs.nyu.edu/~roweis/lle/swissroll.html .
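As a hedged sketch, the "swiss roll" and a manifold-learning projection of it can be reproduced with scikit-learn (the parameters here are illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Synthetic swiss roll: 3-D points lying on an intrinsically 2-D manifold.
X, t = make_swiss_roll(n_samples=1000, random_state=0)

# Isomap preserves geodesic (along-the-manifold) distances rather than straight-line ones.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (1000, 2)
```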
After viewing the pictures of the "swiss roll", a synthetic high-dimensional data set, you'll agree that "distance" depends upon your intended application.
As others have mentioned, life is complicated in high dimensions. Enjoy!
What seems to be a problem here (distance measurement in high-dimensional spaces) is also the driving force behind the channel capacity theorem (by C. Shannon - http://www3.alcatel-lucent.com/bstj/vol27-1948/articles/bstj27-4-623.pdf). That is to say that in high-dimensional spaces, a pattern disturbed by (stationary) noise tends to be found on the surface of a hypersphere around this pattern, instead of in a typical cloud-like cluster. I believe that distances that take this phenomenon into account are better suited to high-dimensional data.
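A quick numerical illustration of that shell effect, assuming i.i.d. Gaussian noise: the norms of the noise vectors concentrate around a fixed radius as the dimension grows, so the relative spread shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10000):
    noise = rng.normal(0.0, 1.0, size=(5000, d))
    norms = np.linalg.norm(noise, axis=1)
    # Relative spread of the norms shrinks with the dimension (shell, not cloud).
    print(d, round(norms.mean(), 2), round(norms.std() / norms.mean(), 4))
```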
Dear Jürgen, I don't think that there is a "best" distance in absolute terms, but rather a better distance according to some criteria. I worked with the Mahalanobis distance and I found it very useful, but I can't say whether it was, and actually is, the best!
I think the distance measure is not chosen based on high dimensionality as such; normally the distance measure is applied based on the type of data.
For numerical data we normally apply the Euclidean or Manhattan distance.
For text data, cosine similarity is applied.
The choice of measure depends on the nature of the distribution and on your goal. If the distribution is close to uniform and/or the goal is to pay most attention to the medium-to-large shares, use the Euclidean or Minkowski distances. If the distribution is sharply non-uniform (differences of several orders of magnitude) and the goal is to account for differences even among the smallest components, then use the "anentropy" divergence given by the formula Da = Σ|ln(p_i/q_i)|, where p and q are the frequencies, probabilities, or shares of the components. If the situation is intermediate, use Kolmogorov's "entropy" divergence De = Σ(p_i - q_i) ln(p_i/q_i). A proof that one distance has the highest quality for all cases does not exist. (T.G. Petrov, O.I. Farafonova. Information-Component Analysis. The RHA Method. St. Petersburg, 2005.)
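A minimal sketch of the two divergences named above; the component shares p and q are made up and assumed strictly positive:

```python
import numpy as np

def anentropy_divergence(p, q):
    """Da = sum |ln(p_i / q_i)| -- sensitive even to the smallest components."""
    return np.abs(np.log(p / q)).sum()

def entropy_divergence(p, q):
    """De = sum (p_i - q_i) * ln(p_i / q_i) (Kolmogorov's 'entropy' divergence)."""
    return ((p - q) * np.log(p / q)).sum()

p = np.array([0.70, 0.20, 0.09, 0.01])
q = np.array([0.60, 0.30, 0.05, 0.05])
print(anentropy_divergence(p, q), entropy_divergence(p, q))
```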
To elaborate more, can we rephrase the question, or add to the current question: "what do we expect from a 'good' or 'best' distance measure?"
I have to differ here. We just published a paper on choosing norms *other* than the l^2 norm (Euclidean distance) which has been shown to be quite bad in high dimensional spaces. The paper also contains pointers to the literature on what can go wrong when measuring distances in high-dimensional spaces.
Schnitzer D., Flexer A.: Choosing the Metric in High-Dimensional Spaces Based on Hub Analysis, in Proceedings of the 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 2014.
Abstract: "To avoid the undesired effects of distance concentration in high-dimensional spaces, previous work has already advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm. Closely related to concentration is the emergence of hub and anti-hub objects. Hub objects have a small distance to an exceptionally large number of data points while anti-hubs lie far from all other data points. The contribution of this work is an empirical examination of concentration and hubness, resulting in an unsupervised approach for choosing an ℓp norm by minimizing hubs while simultaneously maximizing nearest neighbor classification."
http://www.ofai.at/cgi-bin/tr-online?number+2014-03
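A minimal sketch of a fractional ℓp distance, computed directly with NumPy (for p < 1 it is only a quasi-metric); how to choose p, e.g. by the hub analysis described in the paper, is not reproduced here:

```python
import numpy as np

def lp_distance(x, y, p):
    """Minkowski-style l^p distance; p < 1 gives a fractional quasi-metric."""
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

x, y = np.random.rand(1000), np.random.rand(1000)
for p in (2.0, 1.0, 0.5):
    print(p, lp_distance(x, y, p))
```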
Feature selection is a key issue when developing a distance measure for high-dimensional data and performing data mining.
Momiao Xiong
I agree with Arthur Flexer that L2 is usually not the best in high dimensions; it actually often leads to the problem that all neighbours are equi-distant.
I highly recommend the ICDT paper by Beyer et al. to get the intuition:
http://dl.acm.org/citation.cfm?id=656271
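The effect is easy to reproduce; a hedged sketch showing how the contrast between the nearest and the farthest neighbour of a query shrinks as the dimension grows (uniform toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((2000, d))   # data set
    q = rng.random(d)           # query point
    dists = np.linalg.norm(X - q, axis=1)
    # Ratio of farthest to nearest distance approaches 1 as d grows.
    print(d, round(dists.max() / dists.min(), 2))
```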
I believe that the question has no good answer. There simply isn't an answer as to which distance measure is best suited for high-dimensional data, because it is an ill-defined question. It always depends on the choice of representation. Others have already commented that it first comes down to feature selection or feature engineering. Features are rarely a homogeneous set and are often mixed-mode (symbolic, discrete, continuous, etc.). How to represent objects cannot be separated from how to measure distances/similarity/dissimilarity; the two questions are not separable and must be answered simultaneously.

But let us suppose for a moment that L1 (or some other metric) is found to be "best" in 80% of all cases ever studied in "NN calculation, clustering, and the projection of data", the target applications that the question is motivated by. This L1 observation is a totally useless observation when you come to solve a specific problem where you really want the best possible solution that you can find. Will you really take the risk and not check L2, cosine similarity, or some other metric just because L1 is *usually* better? Surely you will explore quite a few options, within reason.
I have experience dealing with high-dimensional data. PCA was my first attempt to solve the problem. One advantage of PCA is that it is good for data with multicollinearity among the included variables. So if this is your case, you may have chosen the right approach; if it is not, forget about PCA. If PCA does work for your problem, my next question would be: are you interested in knowing which variables are influential? Knowing that some variables are redundant and unimportant, you may want to include only the influential variables in your next analysis, and then PCA will no longer be helpful. In this case, you would need to apply feature selection. For such a case, i.e. high-dimensional data, I would suggest the bounded Mahalanobis distance, which works well in my research. Or, if you like, you may use the common Mahalanobis distance, which I label the unbounded Mahalanobis distance. I have published two related papers on this issue:
Conference Paper Understanding Mahalanobis Distance Criterion for Feature Selection
Conference Paper Sensors closeness test based on an improved [0, 1] bounded M...
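For reference, a minimal sketch of the common (unbounded) Mahalanobis distance with SciPy; the bounded variant discussed in the papers above is not reproduced here:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

X = np.random.rand(500, 5)                       # toy data with 5 variables
VI = np.linalg.inv(np.cov(X, rowvar=False))      # inverse covariance matrix

# Distance of the first observation from the data centre.
print(mahalanobis(X[0], X.mean(axis=0), VI))
```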
I agree with Shlomo... the choice depends on the problem you have. For instance, one may base the metric on the Euclidean distance, or on the Mahalanobis distance, which is derived by normalizing with respect to the inverse of the covariance matrix. Others use a similar metric but normalize with respect to the inverse of the dispersion matrix, which allows the separation of the "within" and "between" variance (better: "dispersion"), but then a probabilistic interpretation of distance is not so straightforward. The angle (between the normalized feature vectors) has a serious shortcoming: we lose the information about the absolute values. The angle is a good measure, for instance, when we compare the "shape" of two objects and can neglect their "size"...