Every object in my dataset is described as a vector of n=20 features. All the features are integers, but they have different scales. I need to choose a measure to evaluate the similarity of two objects in the dataset. It has to satisfy the following condition: any pair of identical feature vectors (i.e., vectors with exactly the same numbers) must get the same similarity value, whatever those numbers are. I have already tried different similarity measures, like dot product and cosine similarity:

  • Dot product does not work in my case because the similarity value depends on the specific numbers in the feature vector. For example, given these two objects a=[2, 2, 30, 4, 5], b=[2, 2, 30, 4, 5], then similarity(a, b)=949. Given these two vectors c=[2, 2, 300, 4, 5], d=[2, 2, 300, 4, 5], then similarity(c, d)=90049. I want the similarity to be the same number in both cases, i.e., similarity(a, b) = similarity(c, d);
  • Cosine similarity does not work in my case because it only takes into account the angle between the vectors. I also need to take magnitude into account. For example, given these two objects a=[2, 2, 30, 4, 5], b=[4, 4, 60, 8, 10], then similarity(a, b) = 1 (the maximum similarity). Since the numbers in the feature vectors are different, in my case their similarity should not be the maximum.
It seems to me that standardizing the features and then using a Euclidean distance, a Manhattan distance, or more generally a Minkowski distance is the most suitable solution. Can you suggest other distance measures that would be more suitable for my scenario?
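To make the comparison concrete, here is a minimal sketch in plain Python using the 5-feature example vectors from above. The function names (`dot`, `cosine`, `standardize`, `minkowski`, `similarity`) are only illustrative, and the mapping `1 / (1 + distance)` is just one assumed way to turn a distance into a similarity; other mappings would work as well.

```python
import math
from statistics import mean, pstdev

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def standardize(rows):
    """Z-score every feature (column) with the dataset mean and population std."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # A constant feature has std 0; fall back to 1.0 to avoid division by zero.
    sigmas = [pstdev(c) or 1.0 for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]

def minkowski(u, v, p=2):
    """p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(u, v)) ** (1 / p)

def similarity(u, v, p=2):
    # Assumed mapping: distance 0 gives similarity 1, larger distances give less.
    return 1.0 / (1.0 + minkowski(u, v, p))

data = [
    [2, 2, 30, 4, 5],    # a
    [2, 2, 30, 4, 5],    # b, identical to a
    [2, 2, 300, 4, 5],   # c
    [2, 2, 300, 4, 5],   # d, identical to c
    [4, 4, 60, 8, 10],   # e, equal to 2 * a
]
a, b, c, d, e = data
za, zb, zc, zd, ze = standardize(data)

print(dot(a, b), dot(c, d))                    # differs for identical pairs
print(cosine(a, e))                            # maximal although a != e
print(similarity(za, zb), similarity(zc, zd))  # equal for identical pairs
print(similarity(za, ze))                      # below the maximum
```

With this sketch, identical vectors always have distance 0 and therefore the same similarity, while a and 2a are no longer maximally similar. Note that the standardization statistics are computed on the whole dataset, so the similarity of a non-identical pair depends on which other objects are present.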
