It depends on the classifier. For a support vector machine, the following is usual. For example, if the nominal data is "blue", "red", "green", you can convert the nominal data into a binary vector: blue is encoded as 1 0 0, red as 0 1 0, green as 0 0 1.
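A minimal sketch of this binary (one-hot) encoding in plain Python; the category list and helper name are made up for illustration:

```python
# One-hot encode a nominal feature with three categories.
categories = ["blue", "red", "green"]
index = {c: i for i, c in enumerate(categories)}  # blue -> 0, red -> 1, green -> 2

def one_hot(value):
    # Build a binary vector with a single 1 at the category's position.
    vec = [0] * len(categories)
    vec[index[value]] = 1
    return vec

print(one_hot("blue"))   # [1, 0, 0]
print(one_hot("red"))    # [0, 1, 0]
print(one_hot("green"))  # [0, 0, 1]
```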
I think the best way to convert a nominal feature vector into a numeric one is to compute a histogram of the nominal values over the feature vector. The fundamental idea is: if two nominal feature vectors are similar, then the Euclidean distance between their histograms is close to zero. The only problem here is that each nominal value in the feature vector is given equal importance, but that can be taken care of by normalizing the histogram with a different normalizing factor for each nominal value. Here are a few steps to compute the histogram of a nominal feature vector:
- divide the nominal feature vector into non-overlapping bins, compute a normalized histogram over each bin, and concatenate the histograms to form a new numerical feature vector
- you can even consider the entire nominal feature vector as a single bin and compute only one normalized histogram, or divide it into more than one bin to get more detail. It is also possible to compute a hierarchical histogram to capture details at different levels of generalization.
many classifiers will accept both categorical and numerical variables ...
anyway, two extreme treatments:
- disjunctive coding : convert each modality of each categorical variable into a new variable and code 1/0 according to whether the individual's description includes that modality or not ; this may result in a large sparse numerical space if your nominal variables have a lot of modalities, and you may need either a well-regularized classifier or an additional dimensionality-reduction step before feeding the classifier
- apply MCA ( https://en.wikipedia.org/wiki/Multiple_correspondence_analysis ) to your categorical variables, select an appropriate number of top factors, and recode the individuals by their projections in this factorial space
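Both treatments can be sketched with pandas and scikit-learn. Note that this uses TruncatedSVD on the indicator matrix as a rough stand-in for a proper MCA implementation, and the data frame columns are made up:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Hypothetical data frame with two categorical variables.
df = pd.DataFrame({
    "color": ["blue", "red", "green", "red", "blue", "green"],
    "size":  ["S", "M", "M", "L", "S", "L"],
})

# Disjunctive coding: one 0/1 column per modality of each variable.
indicator = pd.get_dummies(df)  # columns like color_blue, size_M, ...

# Rough stand-in for MCA: project the indicator matrix onto top factors.
svd = TruncatedSVD(n_components=2, random_state=0)
projected = svd.fit_transform(indicator.astype(float))
print(indicator.shape, projected.shape)  # (6, 6) -> (6, 2)
```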
.
now ... i'm just guessing what "something else" you might be looking for !
The first option I already tried, but it increases the size of the dataset, as my nominal attributes have lots of categories. About the second I am not sure; I will go through it and get back to you ASAP.
It would be nice if you could give some more information about the kind of problem you are addressing (classification/clustering), the nature of the dataset, and the algorithms you are planning to use. With this information, you may get more relevant answers to your question.
I suggest you take a look at the book "Multidimensional Scaling" by Brian Everitt. There is a chapter devoted to distances for different data types, and it also describes the Gower distance for combining the different distances into a single number. Once you have the distance matrix d_ij, you use multidimensional scaling to produce numerical coordinates. In this way, you get numerical coordinates that properly reflect the distances between individuals, integrating all the different variables.
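A sketch of that pipeline, assuming one numeric and one nominal variable. The Gower distance here is a simplified version (range-scaled absolute difference for the numeric variable, simple matching for the nominal one), and the MDS step uses scikit-learn; the toy values are made up:

```python
import numpy as np
from sklearn.manifold import MDS

# Toy mixed data: a numeric age and a nominal color per individual.
ages = np.array([20.0, 35.0, 50.0])
colors = ["blue", "red", "blue"]
age_range = ages.max() - ages.min()

n = len(ages)
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Simplified Gower distance: average of per-variable distances.
        d_num = abs(ages[i] - ages[j]) / age_range       # scaled to [0, 1]
        d_nom = 0.0 if colors[i] == colors[j] else 1.0   # simple matching
        D[i, j] = (d_num + d_nom) / 2.0

# Metric MDS on the precomputed dissimilarity matrix -> numeric coordinates.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords.shape)  # one 2-D coordinate per individual
```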
(1) your aim is to run some kind of benchmark of various classification algorithms on one (maybe many) datasets
(2) you want to find some kind of transformation of your dataset which would make it "processable" by most, if not all, of your candidate algorithms
.
admittedly, this has been done gazillions of times, but it still seems a little odd to me : among the properties of the algorithms, there is more than just "classification performance" ; there are many more properties, such as their capacity to run in very high-dimensional sparse spaces without running amok, their capacity to treat numerical and categorical variables natively on the same footing, scalability, and so on : all these nice properties may come at a small expense in terms of "classification performance" but bring a large benefit in terms of the cost of the pre-processing step
.
moreover, a pre-processing step that makes the dataset "processable" by all algorithms might have to throw away some pieces of information that some algorithms would have been happy to use, which is unfair as far as the benchmarking is concerned
.
it might be fairer to benchmark the cost in development time of the pair (pre-processing, algorithm) needed to obtain correct performance on the dataset
Import your data into a pandas DataFrame, then pass this DataFrame along with the name of the target column (which you want to convert from nominal to numeric) to the function below. It will return a new DataFrame with an additional column containing the required numeric data. Hope this helps.
def encode_target(df, target_column):
    # Work on a copy so the original DataFrame is left untouched.
    df_mod = df.copy()
    targets = df_mod[target_column].unique()
    # Map each distinct nominal value to an integer code.
    map_to_int = {name: n for n, name in enumerate(targets)}
    # Add the numeric column and return it along with the value order.
    df_mod[target_column + "_num"] = df_mod[target_column].map(map_to_int)
    return df_mod, targets
@Fabrice Clerot.... Thanks for your valuable comments. Well, you are right, I am actually trying to apply some or all of the machine learning algorithms to a benchmark dataset. Yes, I am trying to do a comparative analysis of machine learning techniques on my dataset; I believe this would give me some insight into which one to choose and stick to for my problem.
@Alberto Muñoz.....Thanks, surely I will give it a read. I also have a problem in mind for which I want to transform n-dimensional data into some distance measure and use it for classification.
@Ritesh Kasat......Thanks for your suggestions, surely I will give it a try.
We should not convert nominal data into numeric data, even when the nominal data represents rankings, such as non-parametric ranks. Arithmetic operations such as addition and multiplication are only meaningful for numeric data. However, it is possible to convert numeric data into nominal data.
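The reverse direction mentioned above, numeric to nominal, can be sketched with pandas binning; the bin edges and labels are illustrative assumptions:

```python
import pandas as pd

# Discretize a numeric variable into ordered nominal categories.
ages = pd.Series([5, 17, 25, 42, 67])
groups = pd.cut(ages, bins=[0, 18, 40, 100], labels=["young", "adult", "senior"])
print(list(groups))  # ['young', 'young', 'adult', 'senior', 'senior']
```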
Decision tree classifiers are well suited to nominal data. You should not convert nominal data into numeric data, because nominal data has no order and no meaningful arithmetic operators.