How to get a Euclidean distance within the range 0-1?

08 August 2018

I have three sets: a={x1,x2,x3}, b={y1,y2,y3} and c={z1,z2,z3}. The x values are financial variables from my dataset; the y and z values are financial variables from another dataset. Each value is in thousands of dollars. I want to find which set (b or c) is closer to set a, so I used the Euclidean distance. However, the resulting distance is very large because the values differ by thousands of dollars. To make it smaller, ideally within the range 0-1, I divided each distance by the mean of set a:

Distance (b,a) = euclidean(b,a)/mean(a)

Distance (c,a) = euclidean(c,a)/mean(a)

I'm not sure if this is mathematically correct or not. Is there any better way?
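One point worth checking is whether dividing by mean(a) actually keeps the result in 0-1. The following is a minimal Python sketch of the formula above; the numbers and the helper name `mean_scaled_distance` are illustrative assumptions, not data from the question.

```python
import math
from statistics import mean

def mean_scaled_distance(u, a):
    """Euclidean distance from u to a, divided by mean(a), as in the question."""
    return math.dist(u, a) / mean(a)

# Hypothetical values in thousands of dollars
a = [100.0, 200.0, 300.0]
b = [110.0, 190.0, 310.0]   # close to a
c = [900.0, 50.0, 10.0]     # far from a

d_ba = mean_scaled_distance(b, a)
d_ca = mean_scaled_distance(c, a)
print(d_ba, d_ca)
```

Because both distances are divided by the same number, the ranking of b and c is unchanged, but nothing bounds the ratio: in this sketch the value for c is greater than 1.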

Artur Jordão

Dear Izham Jaya

Try z-score normalization on each set (subtract the mean and divide by the standard deviation). This process normalizes the features to the same space; in your context, the features are the elements of a, b and c. In other words, a = zscore(a), b = zscore(b), c = zscore(c). Finally, I think you should use the raw distance rather than normalizing it, since a distance ranges from 0 to +inf.

Izham Jaya

Thanks for the answer. I tried it and noticed that if b={0,0,0} and a={389.2, 62.1, 9722}, the distance from b to a comes out as infinity, because the z-score cannot normalize set b (its standard deviation is zero). However, the exact distance using the unnormalized data is 9729.98. It seems that normalizing set a and set b affects the distance. Is this expected?
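The behaviour described here can be reproduced with a short Python sketch. The `zscore` helper below is an assumption for illustration (it uses the population standard deviation and raises an error instead of dividing by zero); the `small` and `large` sets are made-up values.

```python
import math
from statistics import mean, pstdev

def zscore(values):
    """Subtract the mean and divide by the population standard deviation."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("standard deviation is zero; z-score is undefined")
    return [(v - m) / s for v in values]

a = [389.2, 62.1, 9722.0]
b = [0.0, 0.0, 0.0]

raw = math.dist(a, b)        # distance on the unnormalized data
za = zscore(a)               # works: a has non-zero spread

try:
    zb = zscore(b)           # fails: every element of b is the same
except ValueError as err:
    zb = None
    print("cannot z-score b:", err)

# Hypothetical sets with the same pattern but very different magnitudes
small = [1.0, 2.0, 3.0]
large = [1000.0, 2000.0, 3000.0]
gap = math.dist(zscore(small), zscore(large))
print(raw, gap)
```

Normalizing each set separately removes that set's own level and scale, so a change in the distance is to be expected: in the sketch, `small` and `large` end up essentially identical after z-scoring even though they are far apart in dollar terms.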

Filippo Maria Denaro

I would consider a normalization based on the max distance. In other words, you have an N-dimensional space containing vectors with various moduli. Compute the maximum modulus of the vectors and use it to normalize.
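A rough Python sketch of this idea, assuming "normalize" means dividing every vector by the largest modulus in the data. The helper names and the vector `c` are hypothetical; `a` and `b` are the values quoted earlier in the thread.

```python
import math

def norm(v):
    """Euclidean modulus of a vector."""
    return math.sqrt(sum(x * x for x in v))

def scale_by_max_norm(vectors):
    """Divide every vector by the largest modulus found among them."""
    max_norm = max(norm(v) for v in vectors)
    if max_norm == 0:
        return [list(v) for v in vectors]   # all vectors are zero
    return [[x / max_norm for x in v] for v in vectors]

a = [389.2, 62.1, 9722.0]   # from this thread
b = [0.0, 0.0, 0.0]         # from this thread
c = [400.0, 60.0, 9000.0]   # hypothetical

sa, sb, sc = scale_by_max_norm([a, b, c])
d_ba = math.dist(sb, sa)
d_ca = math.dist(sc, sa)
print(d_ba, d_ca)
```

After scaling, every vector has modulus at most 1, so by the triangle inequality any pairwise distance is at most 2. The result is bounded, but not necessarily by 1.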

Charles Fox

I would:

1) Divide each dimension by its standard deviation

2) Calculate the average squared distance of the scaled data points

3) For each squared distance, evaluate the CDF of a Chi-Squared distribution with degrees of freedom equal to the number of dimensions and mean equal to the average squared distance

An easier alternative would be to use F=1 − exp(−x/λ) where λ is the average distance and x is the distance of the point you are evaluating. As x -> inf, this function goes to 1. For x=0, this function equals 0. It will preserve the sort-order of the points, but not the relative sizing (twice as far will not result in twice the function value).
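As a minimal sketch of this easier alternative in Python, with made-up distances and a hypothetical helper name `squash`:

```python
import math
from statistics import mean

def squash(distance, lam):
    """Map a distance in [0, inf) to [0, 1) using F = 1 - exp(-x / lambda)."""
    return 1.0 - math.exp(-distance / lam)

# Hypothetical raw distances
distances = [0.0, 50.0, 100.0, 200.0]
lam = mean(distances)                     # lambda = average distance

scores = [squash(d, lam) for d in distances]
print(scores)
```

The scores keep the same order as the raw distances and stay below 1, but a distance that is twice as large does not produce a score that is twice as large.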

Dividing by max distance preserves both sort-order and relative sizing. However, if a new point is larger than your initial set, you'll either get a value greater than 1 or recalculate all of the points by multiplying by the prior max and dividing by the new max.
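The trade-off with dividing by the max distance can be seen in a small Python sketch. The distances and the helper name `normalize_by_max` are illustrative assumptions.

```python
def normalize_by_max(distances):
    """Divide every distance by the largest one, giving values in [0, 1]."""
    d_max = max(distances)
    if d_max == 0:
        return [0.0 for _ in distances]
    return [d / d_max for d in distances]

# Hypothetical raw distances
initial = [50.0, 100.0, 200.0]
scaled = normalize_by_max(initial)

# A new point arrives that is farther away than anything seen so far
new_distance = 400.0
against_old_max = new_distance / max(initial)              # exceeds 1
rescaled = normalize_by_max(initial + [new_distance])      # everything shifts
print(scaled, against_old_max, rescaled)
```

Ratios between distances are preserved (100 is still twice 50 after scaling), but the scaled values of the existing points change whenever the maximum changes.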

The best choice depends on your goal. A value between 0 and 1 isn't necessary for sorting so it is unclear why values between 0 and 1 are required.

Wajih Ullah Baig

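What you can do is the following (MATLAB):

v = norm(vectorA) + norm(vectorB);   % upper bound on norm(vectorA - vectorB)
if v ~= 0
    distance = norm(vectorA - vectorB) / v;   % in [0, 1]; 1 - distance is a similarity
else
    distance = 0;   % both vectors are zero, hence identical
end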

Alternatively, it would be best to normalize your features to the range zero to one.

Saad Mouti

I just came across this thread and am interested in which answer was most satisfying.

I found that some use 1/(1+dist) as a measure of similarity between 0 and 1.

Another thing I thought of is to divide by the largest distance you have within your data.
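A minimal Python sketch of the 1/(1+dist) idea mentioned above; the helper name `similarity` is hypothetical, and the vectors are the ones quoted earlier in the thread.

```python
import math

def similarity(u, v):
    """Turn a Euclidean distance into a similarity in (0, 1]: 1 / (1 + dist)."""
    return 1.0 / (1.0 + math.dist(u, v))

a = [389.2, 62.1, 9722.0]
b = [0.0, 0.0, 0.0]

same = similarity(a, a)   # identical vectors
far = similarity(a, b)    # vectors far apart
print(same, far)
```

Note that this is a similarity rather than a distance: identical vectors score 1, and the score shrinks towards 0 as the vectors move apart. With values in the thousands, most scores will be very close to 0 unless the data is scaled first.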

Djalma Menezes de Oliveira

For two sets of points (two vectors): normalize each set of points, then calculate (a-b)^2 for each pair of elements, sum these values, and finally take the square root of the sum. Summarized: sqrt(sum((a-b)^2)) = Euclidean distance.
