In machine learning, categorical variables are commonly preprocessed with one-hot encoding to create binary independent variables. For example, if a categorical variable has 6 unique values (discrete states), one-hot encoding the feature results in 6 binary variables. If a sample belongs to the 1st of the 6 categories, the first binary variable will be '1' and the rest will be '0' (100000). If another sample belongs to the 2nd category, the second binary variable will be '1' and the rest will be '0' (010000), and so on.
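A minimal sketch of the encoding described above, assuming a hypothetical colour feature with 6 unique values (the category names are made up for illustration):

```python
# Hypothetical categorical feature with 6 unique values.
categories = ["red", "blue", "green", "yellow", "purple", "orange"]
index = {cat: i for i, cat in enumerate(categories)}

def one_hot(value):
    """Return a list of 6 binary indicators with a 1 at the value's position."""
    vec = [0] * len(categories)
    vec[index[value]] = 1
    return vec

print(one_hot("red"))   # 1st category -> [1, 0, 0, 0, 0, 0]
print(one_hot("blue"))  # 2nd category -> [0, 1, 0, 0, 0, 0]
```

In practice one would typically use a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` rather than hand-rolling this.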
I don't agree with the above answer. It may be useful only when the variable has a small number of possible values (two, maybe three). I don't understand why it is impossible to use direct coding (1, 2, 3, ...) for each value of the variable: for example, white = 1, black = 2, gray = 3, green = 4, etc.
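A small sketch, using the colour values suggested above, of how the two codings differ in the distances a distance-based model (e.g. kNN) actually sees; the category set is hypothetical:

```python
# Direct coding as proposed above: white=1, black=2, gray=3, green=4.
direct = {"white": 1, "black": 2, "gray": 3, "green": 4}

# Under direct coding, |white - green| = 3 but |white - black| = 1, so a
# Euclidean-distance model treats white as "closer" to black than to green,
# even though the colours have no numeric ordering.
print(abs(direct["white"] - direct["green"]))  # 3
print(abs(direct["white"] - direct["black"]))  # 1

# Under one-hot encoding, every pair of distinct categories is equidistant.
def one_hot(value, cats=("white", "black", "gray", "green")):
    return [1 if c == value else 0 for c in cats]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(sq_dist(one_hot("white"), one_hot("green")))  # 2
print(sq_dist(one_hot("white"), one_hot("black")))  # 2
```

Direct (ordinal) coding is reasonable when the categories genuinely have an order (e.g. small < medium < large); for unordered categories it injects an arbitrary ordering into the feature.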
Hello Nahian, yes, I had a binary classification problem, but I found the following:
Non-numerical data such as categorical data are common in practice. Some classification methods can handle categorical predictor variables natively, but others can only be applied to continuous numerical data. Among the three classification methods, only Kernel Density Classification can handle categorical variables in theory, while kNN and SVM cannot be applied directly since they are based on Euclidean distances. In order to define distance metrics for categorical variables, the first preprocessing step is to represent the categorical variables with dummy variables. Secondly, due to the distinct natures of categorical and numerical data, we usually need to standardize the numerical variables, so that the contributions to the Euclidean distances from a numerical variable and a categorical variable are on roughly the same level. Finally, the introduction of dummy variables usually increases the dimensionality significantly. In various experiments, we found that dimension reduction techniques such as PCA usually improve the performance of these three classifiers significantly. Following is the link
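The three-step preprocessing described above (dummy variables, standardization, then PCA) can be sketched with NumPy on a toy dataset; the column names and values here are made up for illustration, and PCA is computed directly via SVD rather than through a library class:

```python
import numpy as np

# Toy dataset: one numerical column and one categorical column (hypothetical).
ages = np.array([23.0, 35.0, 51.0, 29.0])   # numerical variable
colors = ["red", "blue", "red", "green"]    # categorical variable

# Step 1: dummy (one-hot) variables for the categorical column.
cats = sorted(set(colors))
dummies = np.array([[1.0 if c == cat else 0.0 for cat in cats] for c in colors])

# Step 2: standardize the numerical column so its contribution to Euclidean
# distances is on the same scale as the 0/1 dummy variables.
ages_std = (ages - ages.mean()) / ages.std()

X = np.column_stack([ages_std, dummies])    # 4 samples x 4 features

# Step 3: reduce the (now larger) dimension with PCA via SVD on the
# mean-centered matrix, keeping the top 2 principal components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:2].T

print(X_pca.shape)  # (4, 2)
```

With scikit-learn, the same pipeline would typically be built from `OneHotEncoder`, `StandardScaler`, and `PCA` inside a `ColumnTransformer`/`Pipeline`.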