It depends on the classifier. For a support vector machine, the following is usual. For example, if the nominal data is "blue", "red", "green", you can convert the nominal data into a binary vector: blue is encoded as 1 0 0, red as 0 1 0, green as 0 0 1.
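A minimal sketch of this binary (one-hot) encoding in plain Python; the category list and helper name are made up for illustration:

```python
# One-hot encode a nominal feature with three categories.
categories = ["blue", "red", "green"]
index = {c: i for i, c in enumerate(categories)}  # blue -> 0, red -> 1, green -> 2

def one_hot(value):
    # Build a binary vector with a single 1 at the category's position.
    vec = [0] * len(categories)
    vec[index[value]] = 1
    return vec

print(one_hot("blue"))   # [1, 0, 0]
print(one_hot("red"))    # [0, 1, 0]
print(one_hot("green"))  # [0, 0, 1]
```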
I think the best way to convert a nominal feature vector into a numeric one is to compute a histogram of the nominal values over the feature vector. The fundamental idea is: if two nominal feature vectors are similar, then the Euclidean distance between their histograms is close to zero. The only problem here is that each nominal value in the feature vector is given equal importance, but that can be taken care of by normalizing the histogram with a different normalizing factor for each nominal value. Here are a few steps to compute the histogram of a nominal feature vector:
- divide the nominal feature vector into non-overlapping bins, compute a normalized histogram over each bin, and concatenate the histograms to form a new numerical feature vector
- you can even consider the entire nominal feature vector as a single bin and compute only one normalized histogram, or divide it into more than one bin to get more detail. It is also possible to compute a hierarchical histogram to capture details at different levels of generalization.
many classifiers will accept both categorical and numerical variables ...
anyway, two extreme treatments:
- disjunctive coding : convert each modality of each categorical variable into a new variable and code 1/0 according to whether the individual's description includes that modality or not ; this may result in a large sparse numerical space if your nominal variables have a lot of modalities, and you may need either a well-regularized classifier or an additional dimensionality-reduction step before feeding the classifier
- apply MCA ( https://en.wikipedia.org/wiki/Multiple_correspondence_analysis ) to your categorical variables, select an appropriate number of top factors, and recode the individuals by their projections in this factorial space
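Both treatments can be sketched with pandas and scikit-learn. Note that this uses TruncatedSVD on the indicator matrix as a rough stand-in for a proper MCA implementation, and the data frame columns are made up:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Hypothetical data frame with two categorical variables.
df = pd.DataFrame({
    "color": ["blue", "red", "green", "red", "blue", "green"],
    "size":  ["S", "M", "M", "L", "S", "L"],
})

# Disjunctive coding: one 0/1 column per modality of each variable.
indicator = pd.get_dummies(df)  # columns like color_blue, size_M, ...

# Rough stand-in for MCA: project the indicator matrix onto top factors.
svd = TruncatedSVD(n_components=2, random_state=0)
projected = svd.fit_transform(indicator.astype(float))
print(indicator.shape, projected.shape)  # (6, 6) -> (6, 2)
```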
.
now ... i'm just guessing what "something else" you might be looking for !
The first option I already tried, but it increases the size of the dataset, as my nominal attributes have lots of categories. About the second I am not sure; I will go through it and get back to you ASAP.
It would be nice if you could give some more information about the kind of problem you are addressing (classification/clustering), the nature of the dataset, and the algorithms you are planning to use. With this information, you may get more relevant answers to your question.
I suggest you take a look at the book "Multidimensional Scaling" by Brian Everitt. There is a chapter devoted to distances for different data types, and it also describes the Gower distance for combining the different distances into a single number. Once you have the distance matrix d_ij, you use multidimensional scaling to produce numerical coordinates. In this way, you get numerical coordinates that properly reflect the distances between individuals, integrating all the different variables.
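A sketch of that pipeline, assuming one numeric and one nominal variable. The Gower distance here is a simplified version (range-scaled absolute difference for the numeric variable, simple matching for the nominal one), and the MDS step uses scikit-learn; the toy values are made up:

```python
import numpy as np
from sklearn.manifold import MDS

# Toy mixed data: a numeric age and a nominal color per individual.
ages = np.array([20.0, 35.0, 50.0])
colors = ["blue", "red", "blue"]
age_range = ages.max() - ages.min()

n = len(ages)
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Simplified Gower distance: average of per-variable distances.
        d_num = abs(ages[i] - ages[j]) / age_range       # scaled to [0, 1]
        d_nom = 0.0 if colors[i] == colors[j] else 1.0   # simple matching
        D[i, j] = (d_num + d_nom) / 2.0

# Metric MDS on the precomputed dissimilarity matrix -> numeric coordinates.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords.shape)  # one 2-D coordinate per individual
```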
(1) your aim is to run some kind of benchmark of various classification algorithms on one (maybe many) datasets
(2) you want to find some kind of transformation of your dataset which would make it "processable" by most, if not all, of your candidate algorithms
.
admittedly, this has been done gazillions of times, but it still seems a little odd to me : among the properties of the algorithms, there is more than just "classification performance" ; there are many more properties, such as their capacity to run in very high-dimensional sparse spaces without running amok, their capacity to treat numerical and categorical variables natively on the same footing, scalability, and so on : all these nice properties may come at a small expense in terms of "classification performance" but bring a large benefit in terms of the cost of the pre-processing step
.
moreover, a pre-processing step that makes the dataset "processable" by all algorithms might have to throw away some pieces of information that some algorithms would have been happy to use, which is unfair as far as the benchmarking is concerned
.
it might be fairer to benchmark the cost in development time of the pair (pre-processing, algorithm) needed to obtain correct performance on the dataset
Import your data into a pandas DataFrame, then pass this DataFrame along with the name of the target column (which you want to convert from nominal to numeric) to the function below. It will return a new DataFrame with an additional column containing the required numeric data. Hope this helps.
def encode_target(df, target_column):
    # Work on a copy so the original DataFrame is left untouched.
    df_mod = df.copy()
    targets = df_mod[target_column].unique()
    # Map each distinct nominal value to an integer code.
    map_to_int = {name: n for n, name in enumerate(targets)}
    # Add the numeric column and return it along with the value order.
    df_mod[target_column + "_num"] = df_mod[target_column].map(map_to_int)
    return df_mod, targets
@Fabrice Clerot.... Thanks for your valuable comments. Well, you are right, I am actually trying to apply some or all of the machine learning algorithms to a benchmark dataset. Yes, I am trying to do a comparative analysis of machine learning techniques on my dataset; I believe this would give me some insight into which one to choose and stick to for my problem.
@Alberto Muñoz.....Thanks, surely I will give it a read. I also have a problem in mind for which I want to transform n-dimensional data into some distance measure and use it for classification.
@Ritesh Kasat......Thanks for your suggestions, surely I will give it a try.
We should not convert nominal data into numeric data, even when the nominal data represents rankings, such as non-parametric ranks. Arithmetic operations such as addition and multiplication are only meaningful for numeric data. However, it is possible to convert numeric data into nominal data.
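The reverse direction mentioned above, numeric to nominal, can be sketched with pandas binning; the bin edges and labels are illustrative assumptions:

```python
import pandas as pd

# Discretize a numeric variable into ordered nominal categories.
ages = pd.Series([5, 17, 25, 42, 67])
groups = pd.cut(ages, bins=[0, 18, 40, 100], labels=["young", "adult", "senior"])
print(list(groups))  # ['young', 'young', 'adult', 'senior', 'senior']
```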
Decision tree classifiers are well suited to nominal data. You should not convert nominal data into numeric data, because nominal data has no order and no meaningful arithmetic operators.