@Afarin Adami: That's a good idea. It's basically the Hamming distance between the categorical variables. You could compute that, and also a Euclidean distance between the numerical variables, and give a weight to each, depending on the number of categorical and non-categorical variables, the range of the numerical features, etc.
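To make that concrete, here is a minimal sketch of such a weighted mixed dissimilarity, assuming the numerical features are already scaled to comparable ranges; the weights w_num and w_cat are illustrative placeholders, not values anyone in the thread suggested:

```python
import numpy as np

def mixed_distance(x_num, y_num, x_cat, y_cat, w_num=1.0, w_cat=1.0):
    """Weighted sum of a Euclidean distance on the numerical features
    and a Hamming distance on the categorical features."""
    d_num = np.linalg.norm(np.asarray(x_num, float) - np.asarray(y_num, float))
    # Hamming distance: fraction of categorical positions that differ
    d_cat = np.mean([a != b for a, b in zip(x_cat, y_cat)])
    return w_num * d_num + w_cat * d_cat

# Two records, each with 2 numerical and 2 categorical features
d = mixed_distance([1.2, 3.0], [0.8, 2.5], ["China", "red"], ["US", "red"])
```

How to set w_num and w_cat is exactly the judgment call described above: it depends on how many features of each kind you have and on the numerical ranges.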
In my opinion, the actual clustering algorithm that creates a partition of your dataset is a secondary choice; the success of clustering depends mainly on the choice of dissimilarity measure. Only if you formalize your prior knowledge / your expectations about what makes things different into a measure of dissimilarity can your clustering result help you understand your data. Clustering is a little tricky because the objective function to optimize is often unclear, and may even differ between two researchers working on the same dataset. If your dataset comes with labels or some pairwise dissimilarities already defined, you can try to infer a proper dissimilarity function from them. Otherwise, you have to make assumptions about your data (e.g., distributional assumptions) to obtain an interpretable result.
F.Y.I. The difficult part of the task is that we can't directly compute distances between categorical values. For example, if the categorical data are regions such as China, the United States, etc., there is no obvious way to define the distance between them.
Hao, would it be possible for you to estimate the likelihood of a given data point having a certain label for the categorical component? Say, following your example, that you could estimate the likelihood of the region being equal to China, or to the U.S. That way you could replace your categorical feature with a set of numerical features (one per category). It would increase the dimensionality of your data, but all your features would be numerical, which may be easier to handle.
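A small sketch of this idea, with hypothetical region names and likelihood values chosen purely for illustration; hard labels reduce to a one-hot indicator vector:

```python
import numpy as np

# One numerical column per category (hypothetical category list)
regions = ["China", "US", "Germany"]

def encode(region_likelihoods):
    """region_likelihoods: dict mapping category -> estimated likelihood.
    Missing categories get likelihood 0."""
    return np.array([region_likelihoods.get(r, 0.0) for r in regions])

encode({"China": 0.7, "US": 0.2, "Germany": 0.1})  # soft labels
encode({"US": 1.0})                                # hard label -> one-hot
```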
Hao, you can compute distances between categorical data if you make additional assumptions. For example, you can map (as similarly suggested by Alfonso) your categorical variable with k levels to a regular simplex in (k-1)-dimensional space. For example, variable X with values {A,B,C} can be mapped to A => (0,0), B => (0,1), and C => (sqrt(3/4), 0.5). In the embedded space, each level of the variable is equally far apart from every other level (distance=1). Of course, this equality is an assumption that is additionally imposed.
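A generic sketch of such an embedding: the 2-D coordinates above do it explicitly for k=3; the version below achieves the same equal-distance property for any k by scaling the standard basis vectors (it uses k coordinates rather than k-1, for simplicity):

```python
import numpy as np

def simplex_embedding(levels):
    """Map k categorical levels to points with unit pairwise distances.
    Scaled standard basis vectors in R^k: ||e_i - e_j|| / sqrt(2) == 1.
    The points span a (k-1)-dimensional affine subspace."""
    k = len(levels)
    vertices = np.eye(k) / np.sqrt(2)
    return {level: vertices[i] for i, level in enumerate(levels)}

emb = simplex_embedding(["A", "B", "C"])
np.linalg.norm(emb["A"] - emb["B"])  # -> 1.0, same for every pair
```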
I guess that, in a second step, it is important to think about the relative weighting of the embedded dimensions against the dimensions of your numerical variables.
Symbolic data analysis: http://www.ceremade.dauphine.fr/~touati/introdiday/Introdiday.htm.
There are many articles related to SDA; I can share the links. Let me know whether the above link helps you.
You can check this: K. Chidananda Gowda, Edwin Diday: Symbolic clustering using a new similarity measure. IEEE Transactions on Systems, Man, and Cybernetics 22(2): 368-378 (1992).
I have developed a fuzzy clustering technique called CRUDAW that works on data sets having both categorical and numerical attributes. For categorical attributes, the distance is measured as a graded similarity instead of either zero or one.
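CRUDAW itself is not reproduced here; purely as an illustration of the general idea of a graded (non-0/1) categorical distance, one simple frequency-based sketch could look like this, where a mismatch between two common values counts as a smaller distance than a mismatch involving rare values (an arbitrary illustrative choice, not CRUDAW's measure):

```python
from collections import Counter

def graded_cat_distance(a, b, column_values):
    """Identical values have distance 0; differing values get a
    distance below 1, shrinking as both values become more frequent
    in the column."""
    if a == b:
        return 0.0
    freq = Counter(column_values)
    n = len(column_values)
    return 1.0 - (freq[a] / n) * (freq[b] / n)

col = ["China", "China", "US", "US", "Germany"]
graded_cat_distance("China", "US", col)  # < 1.0, both values are common
```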
As Andreas said above, you can convert categorical data into numerical vectors in at least two different ways:
* If there's a meaningful distance between the categories (say, they represent size), you can assign numbers to them directly. For instance, "small, big and huge" could be converted to 1, 10, and 100 (see the sketch after this list).
* If there's not, and the number of categories is small, you can convert n categories to vectors of dimension n-1, converting A, B and C to (0,0), (0,1), and (sqrt(0.75), 0.5), for instance; the key is that the distance between any two of them must be the same, so as not to introduce artifacts into the (usually metric) training algorithms.
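A sketch of the first option, with a hypothetical mapping; note that the spacing (1, 10, 100) itself encodes an assumption about how far apart the sizes are:

```python
# Ordinal encoding for an ordered categorical variable (illustrative values)
size_to_number = {"small": 1, "big": 10, "huge": 100}

sizes = ["small", "huge", "big"]
encoded = [size_to_number[s] for s in sizes]  # -> [1, 100, 10]
```

The second option is the same regular-simplex embedding shown earlier in the thread.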
When you conduct clustering on symbolic/textual/qualitative data, there are issues at multiple levels of abstraction to take into account. We wrote a journal paper on these issues in which the use of unsupervised learning methods is considered in general (using the self-organizing map algorithm as an example).
For categorical data clustering you could use a fuzzy genetic algorithm. The paper at http://www.sciencedirect.com/science/article/pii/S1568494615000502 proposes a non-dominated sorting genetic algorithm for categorical data clustering. This approach outperforms state-of-the-art techniques in many situations; however, that depends on your dataset and the problem you are trying to solve.