Most measures that I have seen, such as cosine similarity, measure similarity between two attributes. I'm not familiar with any measures that can be used (or extended) to compare the values of three or more attributes. Please comment.
Ron, what do you mean by attributes? Cosine similarity is used for vectors, and in a vector every coordinate can be regarded as an attribute. But when we speak about distance measurement, we usually mean the distance between two objects. What would a distance between three or more objects be?
Thanks for asking. I use attributes in a very general sense here, with full understanding that cosine similarity refers to vectors. Also, it doesn't have to be distance-based; I just don't know what other measures are available for use. I'll use an example to explain my predicament. I have different people 1 to n. Each person has a group of attributes I'm examining, e.g. age, height, sex, weight, political affiliation and favorite food. I collect these attributes for each person and would now like to group the people into unbiased categories based on these attributes/characteristics. It goes without saying that as the characteristics of the population change, the groupings may very well change. I'm hoping that (1) there is some similarity measure that can be used for comparing 3 or more persons, each with information collected for m attributes, and (2) there is some data mining approach that can be used for grouping data that consists of a mix of numerical and categorical attributes. I know this is not in any way a unique problem, but I haven't been successful in finding relevant literature. I hope this better explains my issue. Any ideas would be greatly appreciated.
If you want to form different groups of persons, then you probably want to find clusters in your data. If so, you could use your distance measure together with a clustering algorithm on top of it, for example k-means. The idea is to minimize the sum of the squared distances from the persons to the centers (centroids) of their clusters; see Wikipedia for more: http://en.wikipedia.org/wiki/K-means_algorithm. This is just one example; there are many more clustering techniques.
Is this what you are looking for?
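To make the k-means suggestion concrete, here is a minimal sketch of the usual iterative procedure (assign each person to the nearest centroid, then recompute the centroids). The data, the function names and the choice of squared Euclidean distance are assumptions made for illustration; it handles numerical attributes only, so categorical attributes such as sex or favorite food would need a different treatment.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, iterations=100, seed=0):
    """Return (labels, centroids) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
    return labels, centroids

# Hypothetical (age, height in cm) measurements for six people.
people = [(25, 170), (27, 172), (26, 168), (60, 180), (62, 178), (61, 182)]
labels, centroids = k_means(people, k=2)
print(labels)
```

In practice the attributes would usually be rescaled first, since an attribute with a large numeric range otherwise dominates the distance.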
Clustering can be computationally demanding because you may need to compare all pairs of objects. In some cases it is possible to avoid this issue, for example with locality-sensitive hashing. If you run into this problem, take a closer look here: http://en.wikipedia.org/wiki/Locality-sensitive_hashing
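As a rough illustration of how locality-sensitive hashing avoids all-pairs comparison, the sketch below uses random hyperplanes, one common LSH scheme for cosine similarity. The scheme, the function names and the data are assumptions for illustration and not something prescribed in this thread. Vectors pointing in similar directions tend to receive the same bit signature, so only vectors that land in the same bucket need to be compared in detail.

```python
import random
from collections import defaultdict

def make_hyperplanes(n_planes, dim, seed=0):
    """Draw n_planes random normal vectors of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def bucketize(vectors, hyperplanes):
    """Group vector names by signature; only same-bucket pairs get compared."""
    buckets = defaultdict(list)
    for name, vec in vectors.items():
        buckets[signature(vec, hyperplanes)].append(name)
    return dict(buckets)

# Hypothetical vectors: "b" points the same way as "a", "c" the opposite way.
vectors = {
    "a": (1.0, 2.0, 3.0),
    "b": (2.0, 4.0, 6.0),
    "c": (-1.0, -2.0, -3.0),
}
planes = make_hyperplanes(n_planes=8, dim=3)
buckets = bucketize(vectors, planes)
print(buckets)
```

More hyperplanes give finer buckets (fewer false candidates) at the cost of occasionally separating vectors that are genuinely similar.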
Thanks Peter and Artem. I've thought about using k-means clustering as a grouping option, as I'm familiar with this method. I was just curious about other methods, and I'll definitely look more into LSH. The angle between 3 vectors seems interesting. I'm very curious about methods such as these. The data that I have collected are on human characteristics, which are both numerical and categorical in nature. The 2-step k-means should work since it treats categorical data as a multinomial distribution. Are you guys familiar with other similar methods that take similarly mixed types of data? I know other clustering algorithms exist, but the ones that I've looked at, other than 2-step k-means, use distance measures between two categorical variables, which doesn't always make sense, for example with binary data.
Rather than reply at length here, may I refer you, and Ron, to the excellent introduction by Jeffrey Johnson at http://hypernetworks.eu/the-book.html.
My paper on Landscape is an extension of Atkin's Q-analysis intended for application to strategic facilitation in enterprises as practised by Boxer (see www.asymmetricdesign.com).
A simplicial complex is a binary relation presented as a boolean matrix. Each element of the relation’s domain labels a row of the matrix, called a simplex, and each element of the range, called a vertex, labels a column, so that each simplex is a boolean vector whose elements are True if that simplex-vertex pair is in the relation and False otherwise.
A simplex may be considered to be a q-dimensional surface in an n-dimensional space, where n is the number of vertices and q is one less than the number of vertices in the simplex.
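A small sketch of this representation, using made-up person/attribute pairs in the spirit of Ron's example (the names and data are hypothetical): the relation is a set of (simplex, vertex) pairs, each row of the matrix is a boolean vector, and the dimension q of a simplex is one less than the number of True entries in its row.

```python
def to_boolean_matrix(relation):
    """Turn a set of (simplex, vertex) pairs into labelled boolean rows."""
    simplexes = sorted({s for s, _ in relation})
    vertices = sorted({v for _, v in relation})
    matrix = {s: [(s, v) in relation for v in vertices] for s in simplexes}
    return vertices, matrix

# Hypothetical relation: which person (simplex) has which attribute (vertex).
relation = {
    ("alice", "tall"),
    ("alice", "vegetarian"),
    ("bob", "tall"),
}
vertices, matrix = to_boolean_matrix(relation)

# Dimension q of each simplex: number of its vertices minus one.
dims = {s: sum(row) - 1 for s, row in matrix.items()}
print(vertices, matrix, dims)
```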
Q-analysis computes values for certain topological properties of this space. For example:
• Two simplexes share a common face if at least one vertex is in both.
• The dimension of this common face is one less than the number of shared vertices.
• A simplex is q-connected to all those simplexes, other than itself, with which it shares a common face of dimension greater than or equal to q.
• A q-component of a simplicial complex is a maximal set of its simplexes that are q-connected to each other (transitively).
• The structure vector of a simplicial complex gives the number of q-components at each dimension, q.
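The properties listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the Landscape paper: simplexes are stored as sets of vertices, the example complex is invented, a simplex of dimension at least q that is q-connected to nothing is counted as a q-component on its own, and the structure vector is returned as a list indexed by q starting at 0.

```python
from itertools import combinations

def dimension(vertices):
    """Dimension of a simplex: one less than its number of vertices."""
    return len(vertices) - 1

def shared_face_dimension(a, b):
    """Dimension of the common face; -1 means no shared vertex."""
    return len(a & b) - 1

def q_components(complex_, q):
    """Maximal sets of simplexes that are transitively q-connected."""
    eligible = [name for name, verts in complex_.items() if dimension(verts) >= q]
    parent = {name: name for name in eligible}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(eligible, 2):
        if shared_face_dimension(complex_[a], complex_[b]) >= q:
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for name in eligible:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())

def structure_vector(complex_):
    """Number of q-components for q = 0 .. highest simplex dimension."""
    top = max(dimension(v) for v in complex_.values())
    return [len(q_components(complex_, q)) for q in range(top + 1)]

# Hypothetical complex: simplex name -> set of vertices.
example = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
    "D": {7},
}
print(structure_vector(example))  # [2, 2, 2, 1]
```

Here A and B share the face {3, 4} of dimension 1 and B and C share the single vertex 5, so at q = 0 the components are {A, B, C} and {D}, while at q = 1 they are {A, B} and {C}.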
The simplicial complex shown on page 2 of the Landscape paper (fig 1) represents a commercial situation in which several aerospace companies manufacture various components and systems and the questions of interest concern their capabilities and competition.
Q-analysis, together with an ordering algorithm, generates a histogram (fig 2) that addresses the competition question, and the landscape in fig 4 provides a more detailed view of this space.
Applying the same tools to negated and transposed versions of the original simplicial complex yields the landscapes in figs 5, 6 and 7 which provide insights into other questions of interest.
This technology has been used to facilitate strategic analysis in a wide range of enterprises, ranging from healthcare to warfare.