Say we have a set of features that are highly correlated with each other. We know that in this situation the features are redundant. Our interest is to retain the features that are important for the subsequent classification process.
Feature selection (FS) methods can generally be divided into filter and wrapper methods. Filter methods do not require a learning model to select relevant features. They select features as a preprocessing step, independently of the choice of predictor. They use information contained in the dataset, e.g. the correlation between variables and the discriminatory abilities of the individual features, to create the most promising feature subset before learning starts. The main disadvantage of the filter approach is that it completely ignores the effect of the selected feature subset on the performance of the learning model.
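For illustration, here is a minimal filter-style sketch in Python (scikit-learn). The synthetic data, the univariate ANOVA F-test score and k=5 are only illustrative choices, not a prescription:

```python
# Minimal filter-style selection sketch (illustrative choices throughout).
# Features are scored independently of any classifier, here with a univariate
# ANOVA F-test, and the top-k are kept as a preprocessing step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_filtered = selector.fit_transform(X, y)

print("scores per feature:", np.round(selector.scores_, 2))
print("kept feature indices:", selector.get_support(indices=True))
```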
The wrapper approach operates in the context of the learning model: it uses the feature selection algorithm as a wrapper around the learning algorithm and usually achieves better predictive accuracy than the filter approach. Because the wrapper approach uses the learning model as a black box, it is remarkably universal. However, it can be very slow, because the learning algorithm is called repeatedly.
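As an example of the wrapper idea (not the specific method from any article mentioned here), a sketch using scikit-learn's SequentialFeatureSelector with an arbitrary k-NN classifier as the black-box learner could look like this:

```python
# Minimal wrapper-style selection sketch (illustrative choices throughout).
# The learning model (here a k-NN classifier) is treated as a black box and is
# retrained repeatedly on candidate subsets, scored by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)

sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                n_features_to_select=5,
                                direction="forward",
                                scoring="accuracy", cv=5)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```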
Some learning models have built-in FS mechanisms. For example, decision trees (CART, ID3, C4.5) incorporate FS as a subroutine and heuristically search the space of feature subsets along the tree structure during the learning process. Approaches in which FS is an essential part of the training process are the so-called embedded methods.
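A rough sketch of the embedded route, using a single decision tree and its importance scores; the tree settings and data are arbitrary:

```python
# Embedded-style sketch: the tree's own training process performs the search
# over features; importances fall out of the fitted model (illustrative setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
ranked = np.argsort(tree.feature_importances_)[::-1]
print("features ranked by importance:", ranked)
print("importances:", np.round(tree.feature_importances_[ranked], 3))
```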
You can find a good wrapper method for FS in my article: Dudek, G.: Tournament Searching Method to Feature Selection Problem. In: Rutkowski, L. et al. (eds.): Artificial Intelligence and Soft Computing, LNCS 6114, pp. 437-444
See also:
Kohavi, R., John, G.H.: Wrappers for Feature Subset Selection. Artificial Intelligence 1-2, pp. 273-324 (1997)
Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A. (eds.): Feature Extraction: Foundations and Applications. Springer-Verlag, Berlin Heidelberg (2006)
Thank you very much, Dr. Dudek. Very good explanation. Can we apply a filter and then adapt that technique into a wrapper approach? Say I am using a distance criterion to filter my highly correlated features, and the selected features will be formed into different feature subsets. The performance of the feature subsets will then be evaluated based on classification performance, i.e. the error rate. In such a situation, am I dealing only with a filter, or with both filter and wrapper? Thank you.
I guess you are combining filter and wrapper. When features are highly correlated you should try principal component analysis (PCA). It creates new, uncorrelated features as linear combinations of the original features. Using PCA you can reduce the number of features considerably.
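A minimal PCA sketch, assuming standardized numeric features; the synthetic data and the 95% variance threshold are just examples:

```python
# PCA sketch for correlated features (illustrative data and settings).
# Components are uncorrelated linear combinations of the original features,
# ordered by explained variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)

X_std = StandardScaler().fit_transform(X)      # PCA is scale-sensitive
pca = PCA(n_components=0.95)                   # keep 95% of the variance
X_reduced = pca.fit_transform(X_std)

print("components kept:", pca.n_components_)
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
```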
We know that PCA is the best dimension reduction technique for highly correlated features. But the disadvantage of PCA is that we cannot concentrate on specific features. It gives you the reduced dimensions as linear combinations of all our features, such that the first linear combination (i.e. principal component 1, PC1) captures the most explained variance, followed by the remaining PCs in decreasing order of explained variance. I purposely focused on feature selection (i.e. using filter and wrapper) instead of feature extraction (i.e. PCA). I think by doing so we can be more specific in explaining the discriminant features that contribute to good classification performance.
In addition to the great answers above, linear discriminant analysis (LDA) is another widely used solution for this purpose. Unlike filter-based feature selection and PCA, LDA takes the separability of classes into account. It is learning-based, and finds feature combinations that are more useful for classification. Compared with wrapper-based feature selection, LDA does not require repeatedly calling the learning model.
The idea of LDA has been extended and improved by many previous studies. Please check them if you are interested in LDA.
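For reference, a minimal LDA sketch with scikit-learn; the three-class synthetic data and two components are illustrative:

```python
# LDA sketch (illustrative): a supervised projection onto at most
# (n_classes - 1) discriminant directions that maximize class separability.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=4, n_classes=3, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)   # <= n_classes - 1
X_lda = lda.fit_transform(X, y)

print("projected shape:", X_lda.shape)
print("explained variance ratio:", lda.explained_variance_ratio_)
```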
Thank you, Dr Wu. Yes, I agree with you that dimension reduction using LDA is another alternative. And the good thing about LDA is that it (i.e. the discriminant functions) tells you the separability among the classes. But again, it comes in the form of linear combinations of all the features. The first discriminant function (DF1) provides the maximum possible separation (i.e. it maximizes the F-ratio of between-group to within-group variation), followed by DF2 under the same criterion, subject to the condition that there is no correlation between DF1 and DF2. This continues until DFk, where k = (number of groups - 1) in the case where the number of groups is less than the number of features. In the case of extended LDA, I am not sure whether I can specifically concentrate on discriminative features. Any suggestions?
I am not sure what you mean by "concentrate on specific features" or "specifically concentrate on discriminative features". If your concern is the limitation of linear combinations, kernel Fisher discriminant analysis (KFD) can be an alternative that allows a non-linear mapping.
If you need to reduce the feature dimensions by keeping original features rather than linear or non-linear combinations, I guess you can simply select the dimensions with the highest weights (loadings), as PCA-based dimension reduction methods do.
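To make this concrete, here is a heuristic sketch of that idea: rank the original features by the magnitude of their loadings on the leading principal components and keep the top ones. This is only an illustrative assumption of the procedure, not a standard recipe:

```python
# Heuristic sketch: rank original features by the magnitude of their loadings
# on the leading principal components, then keep the top-k (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(X))

# Aggregate absolute loadings across the retained components,
# weighted by each component's explained variance.
weights = np.abs(pca.components_).T @ pca.explained_variance_ratio_
top5 = np.argsort(weights)[::-1][:5]
print("features with the highest loadings:", top5)
```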
Thank you, Dr Wu. "Concentrate on specific features" or "specifically concentrate on discriminative features" means that I am interested in finding the best features, not in linear combination form. Say I have 10 features that are highly correlated (bearing in mind that PCA or LDA is no longer an option for dimension reduction), and I need to find the best 5 features. From the best 5 features, I would then proceed with 5 feature subsets as inputs for evaluating classification performance (my interest would be the best feature subset with a low error rate). Anyway, you pointed out other good options for doing it. Thank you very much.
This may be outdated, but it should be helpful for someone.
There are many possible "systematic approaches" to choosing your features. The three main categories are: wrappers, filters and embedded methods.
The simplest algorithm is to test every possible subset of features and find the one which minimises the error rate.
This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets.
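To make the combinatorics concrete: with d features there are 2^d - 1 non-empty subsets, so brute force is realistic only for very small d. A sketch with illustrative data, classifier and scoring:

```python
# Exhaustive wrapper search sketch (only feasible for a handful of features):
# evaluate every non-empty subset by cross-validated accuracy and keep the best.
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, n_features=6, n_informative=3,
                           n_redundant=2, random_state=0)

best_score, best_subset = -np.inf, None
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        score = cross_val_score(KNeighborsClassifier(3), X[:, list(subset)], y,
                                cv=5, scoring="accuracy").mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("best subset:", best_subset, "accuracy:", round(best_score, 3))
```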
Filter methods rank each feature according to some univariate metric and select the highest-ranking features. In this case, the score should reflect the discriminative power of each feature. Some examples (a short scoring sketch follows this list):
Fisher score
t-test
Information gain
AUC of the ROC curve
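A sketch of how a few of these univariate scores could be computed (a two-sample t-test, mutual information as a stand-in for information gain, and a per-feature ROC AUC); the data and details are illustrative:

```python
# Univariate filter scores sketch (illustrative): each feature is scored on its
# own, independently of any classifier, and features are ranked by the score.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=3, random_state=0)

t_scores = [abs(stats.ttest_ind(X[y == 0, j], X[y == 1, j]).statistic)
            for j in range(X.shape[1])]                      # two-sample t-test
mi = mutual_info_classif(X, y, random_state=0)               # information gain
auc = [max(roc_auc_score(y, X[:, j]), 1 - roc_auc_score(y, X[:, j]))
       for j in range(X.shape[1])]                           # per-feature AUC

print("ranking by |t| :", np.argsort(t_scores)[::-1])
print("ranking by MI  :", np.argsort(mi)[::-1])
print("ranking by AUC :", np.argsort(auc)[::-1])
```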
Wrapper methods search for the "best" subset of features. There are many tools in this regard, e.g. (a forward-selection sketch follows this list):
Forward selection
Backward elimination
Beam search
Simulated annealing
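A from-scratch sketch of the first item, greedy forward selection, to show the wrapper mechanics explicitly; backward elimination is the mirror image. The classifier, the scoring and the stopping rule of 5 features are arbitrary choices:

```python
# Greedy forward selection sketch (illustrative): at each step, add the single
# feature that most improves the cross-validated accuracy of the wrapped model.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=4, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
while len(selected) < 5:                       # stop at 5 features (arbitrary)
    scores = {j: cross_val_score(KNeighborsClassifier(3),
                                 X[:, selected + [j]], y,
                                 cv=5, scoring="accuracy").mean()
              for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print(f"added feature {best}, subset {selected}, acc {scores[best]:.3f}")
```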
If you are interested in dimensionality reduction, PCA is the most common tool.
You can also use methods that are normally employed in unsupervised approaches, for example SOM (Self-Organizing Maps), K-means and FA (Factor Analysis), as well as the well-known PCA (Principal Component Analysis) and SVD (Singular Value Decomposition) methods, for feature extraction and feature selection respectively.
Feature selection is basically the selection of a subset of features from a larger pool of available features. The goal is to select those that are rich in discriminatory information with respect to the classification problem at hand. This is a crucial step in the design of any classification system, as a poor choice of features drives the classifier to perform badly. Selecting highly informative features is an attempt
(a) to place classes in the feature space far apart from each other (large between-class distance) and
(b) to position the data points within each class close to each other (small within-class variance).
For more details, see Chapter 4 of the following book:
An Introduction to Pattern Recognition: A MATLAB Approach