@Afarin Adami: That's a good idea. It's basically the Hamming distance between the categorical variables. You could compute that, and also a Euclidean distance between the numerical variables, and give a weight to each, depending on the number of categorical and non-categorical variables, the range of the numerical features, etc.
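To make that concrete, here is a minimal sketch of such a weighted mixed dissimilarity, assuming the numerical features are already scaled to comparable ranges; the weights w_num and w_cat are illustrative placeholders, not values anyone in the thread suggested:

```python
import numpy as np

def mixed_distance(x_num, y_num, x_cat, y_cat, w_num=1.0, w_cat=1.0):
    """Weighted sum of a Euclidean distance on the numerical features
    and a Hamming distance on the categorical features."""
    d_num = np.linalg.norm(np.asarray(x_num, float) - np.asarray(y_num, float))
    # Hamming distance: fraction of categorical positions that differ
    d_cat = np.mean([a != b for a, b in zip(x_cat, y_cat)])
    return w_num * d_num + w_cat * d_cat

# Two records, each with 2 numerical and 2 categorical features
d = mixed_distance([1.2, 3.0], [0.8, 2.5], ["China", "red"], ["US", "red"])
```

How to set w_num and w_cat is exactly the judgment call described above: it depends on how many features of each kind you have and on the numerical ranges.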
In my opinion, the actual clustering algorithm that creates a partition of your dataset is a secondary choice; the success of clustering depends mainly on the choice of dissimilarity measure. Only if you formalize your prior knowledge / your expectations about what makes things different into a measure of dissimilarity can your clustering result help you understand your data. Clustering is a little tricky because the objective function to optimize is often unclear, and may even differ between two researchers working on the same dataset. If your dataset comes with labels or some pairwise dissimilarities already defined, you can try to infer a proper dissimilarity function from them. Otherwise, you have to make assumptions about your data (e.g., distributional assumptions) to obtain an interpretable result.
F.Y.I. The difficult part of the task is that we can't directly compute distances between categorical values. For example, if the categorical data are regions such as China, the United States, etc., there is no obvious way to define the distance between them.
Hao, would it be possible for you to estimate the likelihood of a given data point having a certain label for the categorical component? Say, following your example, that you could estimate the likelihood of the region being equal to China, or to the U.S. That way you could replace your categorical feature with a set of numerical features (one per category). It would increase the dimensionality of your data, but all your features would be numerical, which may be easier to handle.
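A small sketch of this idea, with hypothetical region names and likelihood values chosen purely for illustration; hard labels reduce to a one-hot indicator vector:

```python
import numpy as np

# One numerical column per category (hypothetical category list)
regions = ["China", "US", "Germany"]

def encode(region_likelihoods):
    """region_likelihoods: dict mapping category -> estimated likelihood.
    Missing categories get likelihood 0."""
    return np.array([region_likelihoods.get(r, 0.0) for r in regions])

encode({"China": 0.7, "US": 0.2, "Germany": 0.1})  # soft labels
encode({"US": 1.0})                                # hard label -> one-hot
```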
Hao, you can compute distances between categorical data if you make additional assumptions. For example, you can map (as similarly suggested by Alfonso) your categorical variable with k levels to a regular simplex in (k-1)-dimensional space. For example, variable X with values {A,B,C} can be mapped to A => (0,0), B => (0,1), and C => (sqrt(3/4), 0.5). In the embedded space, each level of the variable is equally far apart from every other level (distance=1). Of course, this equality is an assumption that is additionally imposed.
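A generic sketch of such an embedding: the 2-D coordinates above do it explicitly for k=3; the version below achieves the same equal-distance property for any k by scaling the standard basis vectors (it uses k coordinates rather than k-1, for simplicity):

```python
import numpy as np

def simplex_embedding(levels):
    """Map k categorical levels to points with unit pairwise distances.
    Scaled standard basis vectors in R^k: ||e_i - e_j|| / sqrt(2) == 1.
    The points span a (k-1)-dimensional affine subspace."""
    k = len(levels)
    vertices = np.eye(k) / np.sqrt(2)
    return {level: vertices[i] for i, level in enumerate(levels)}

emb = simplex_embedding(["A", "B", "C"])
np.linalg.norm(emb["A"] - emb["B"])  # -> 1.0, same for every pair
```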
I guess that, in a second step, it is important to think about the relative weighting of the embedded dimensions against the dimensions of your numerical variables.
Symbolic data analysis: http://www.ceremade.dauphine.fr/~touati/introdiday/Introdiday.htm.
There are many articles related to SDA; I can share the links. Let me know whether the above link helps you.
You can check this: K. Chidananda Gowda, Edwin Diday: Symbolic clustering using a new similarity measure. IEEE Transactions on Systems, Man, and Cybernetics 22(2): 368-378 (1992).
I have developed a fuzzy clustering technique called CRUDAW that works on data sets having both categorical and numerical attributes. For categorical attributes, the distance is measured as a graded similarity instead of either zero or one.
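CRUDAW itself is not reproduced here; purely as an illustration of the general idea of a graded (non-0/1) categorical distance, one simple frequency-based sketch could look like this, where a mismatch between two common values counts as a smaller distance than a mismatch involving rare values (an arbitrary illustrative choice, not CRUDAW's measure):

```python
from collections import Counter

def graded_cat_distance(a, b, column_values):
    """Identical values have distance 0; differing values get a
    distance below 1, shrinking as both values become more frequent
    in the column."""
    if a == b:
        return 0.0
    freq = Counter(column_values)
    n = len(column_values)
    return 1.0 - (freq[a] / n) * (freq[b] / n)

col = ["China", "China", "US", "US", "Germany"]
graded_cat_distance("China", "US", col)  # < 1.0, both values are common
```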
As Andreas said above, you can convert categorical data into numerical vectors in at least two different ways:
* If there's a meaningful distance between the categories (say, they represent size), you can assign numbers to them directly. For instance, "small, big and huge" could be converted to 1, 10, and 100 (see the sketch after this list).
* If there's not, and the number of categories is small, you can convert n categories to vectors of dimension n-1, converting A, B and C to (0,0), (0,1), and (sqrt(0.75), 0.5), for instance; the key is that the distance between any two of them must be the same, so as not to introduce artifacts into the (usually metric) training algorithms.
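A sketch of the first option, with a hypothetical mapping; note that the spacing (1, 10, 100) itself encodes an assumption about how far apart the sizes are:

```python
# Ordinal encoding for an ordered categorical variable (illustrative values)
size_to_number = {"small": 1, "big": 10, "huge": 100}

sizes = ["small", "huge", "big"]
encoded = [size_to_number[s] for s in sizes]  # -> [1, 100, 10]
```

The second option is the same regular-simplex embedding shown earlier in the thread.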
When you conduct clustering on symbolic/textual/qualitative data, there are issues at multiple levels of abstraction to take into account. We wrote a journal paper on these issues in which the use of unsupervised learning methods is considered in general (using the self-organizing map algorithm as an example).
For categorical data clustering you could use a fuzzy genetic algorithm. The paper at http://www.sciencedirect.com/science/article/pii/S1568494615000502 proposes a non-dominated sorting genetic algorithm for categorical data clustering. This approach outperforms state-of-the-art techniques in many situations; however, that depends on your dataset and the problem you are trying to solve.