Hi, I want to use both a feature selection and a feature extraction method before classification. Do I apply feature selection first and then feature extraction, or the reverse? Please help me.
First you should extract some candidate features. Afterwards you can apply feature selection methods to determine whether all the features you extracted are really valuable, and to find out which of them are.
I propose the following steps for your classification process (a minimal code sketch follows the list):
1- Extract Features
2- Select relevant features (e.g. using the MRMR method)
3- Apply PCA, which lets you analyse all the components and determine the ones that contribute the most, i.e. those that explain the maximum variability. The eigenvalues tell you how much of the total variability a given set of factors explains, e.g. whether F1, F2, ..., F8 together explain 87% of it.
4- Apply a classifier (perform the classification) to find out how many groups/clusters you obtain and which samples belong to which group/cluster.
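To make the steps concrete, here is a minimal Python/scikit-learn sketch of this pipeline, assuming the features have already been extracted into a matrix X. Note that scikit-learn has no built-in MRMR, so SelectKBest with mutual information stands in for step 2; the sample sizes, k=50, and the 87% variance threshold are illustrative choices, not prescriptions.

```python
# Sketch of the extract -> select -> PCA -> classify pipeline.
# Assumes features are already extracted into X (n_samples x n_features).
# SelectKBest with mutual information is only a stand-in for real MRMR.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder for your own extracted features
X, y = make_classification(n_samples=300, n_features=200, n_informative=20,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=50)),  # step 2: relevance-based selection
    ("pca", PCA(n_components=0.87)),                     # step 3: keep components explaining ~87% of variance
    ("clf", SVC(kernel="rbf")),                          # step 4: classifier
])

print(cross_val_score(pipe, X, y, cv=5).mean())
```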
Hi Mamadou Bamba, You are right. I have applied MRMR first and then PCA, and it works better than applying PCA first and then MRMR. My question: Why does PCA perform well after MRMR?
It works better because the features that PCA combines are class-relevant: they have been selected with respect to a class-dependent criterion. Conversely, the principal components you try to select may not be class-relevant, since the PCA criterion (total covariance) is not class-dependent.
An alternative you should try is a simple LDA (Linear Discriminant Analysis) or any derivative (Quadratic DA, Regularized DA, etc.), which also performs dimension reduction by combining features according to within-class and between-class criteria (this is where the class relevancy comes in).
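For illustration, a minimal scikit-learn sketch of LDA used as a supervised dimension reducer; the iris data is just a stand-in, and note that LDA yields at most n_classes - 1 components:

```python
# Sketch: LDA as a supervised dimension reducer (at most n_classes - 1 components).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 3 classes -> at most 2 LDA components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)     # features combined by between/within-class criteria
print(X_lda.shape)                  # (150, 2)
```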
Hi C. Frelicot, Thanks for your nice explanation. I am dealing with a two-class problem with 4096 features, so LDA provides only one dimension, which is not good enough for my task: going from 4096 dimensions to one makes the average precision zero. How do I apply LDA to my two-class problem to get a better result?
First, feature extraction. But feature extraction involves some transformation. PCA extracts linear combinations of the original variables, searching for maximum variability in the data, but it is not class-dependent. You could try PLS, ICA and kernel PCA to extract features using different optimization criteria and compare.
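As a sketch of that comparison, assuming scikit-learn and synthetic data in place of your own features (the component counts are arbitrary):

```python
# Sketch: three extraction criteria side by side on the same data.
from sklearn.datasets import make_classification
from sklearn.cross_decomposition import PLSRegression  # class-dependent criterion
from sklearn.decomposition import FastICA, KernelPCA   # class-independent criteria

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

X_pls, _ = PLSRegression(n_components=5).fit_transform(X, y)        # maximizes covariance with y
X_ica = FastICA(n_components=5, random_state=0).fit_transform(X)    # statistical independence
X_kpca = KernelPCA(n_components=5, kernel="rbf").fit_transform(X)   # variance in kernel space
print(X_pls.shape, X_ica.shape, X_kpca.shape)
```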
I would like to add a general comment on feature selection which I have stated in several blogs.
There are many feature selection methods available, like LDA, Fisher's Discriminant with the Rayleigh coefficient, intra-class minimizers, etc. What usually does not work is: PCA! Why? PCA tries, based on a Gaussian process (assumption), to measure the variance and sort the eigenvalues, which are proportional to the variances, in descending order. The assumption is that the main eigenvalues (EWs) contain most of the information, and therefore we use the main components for data reduction. So far so good. But applying this approach for feature reduction is risky, because it assumes that the feature itself is stable (invariant) AND that the feature's variance contains all the information needed for classification.
This assumption is wrong!
Real-world features contain artefacts (noise, etc.), and therefore the main components PCA generates, the EWs with the highest variance, can be dominated by that noise. Hence, you generate "new" features which are not stable.
Only if you can prove that your features are completely artefact-free and follow a Gaussian process might PCA work; in all other cases PCA is, as I said, very risky.
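A toy illustration of this risk: below, a high-variance noise dimension dominates the first principal component even though it carries no class information (the data is synthetic and the scales are chosen only to make the effect obvious):

```python
# Toy illustration: a high-variance noise dimension dominates the first PC,
# even though it carries no class information.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 500
signal = np.where(rng.random(n) < 0.5, -1.0, 1.0)   # class-relevant, low variance
noise = rng.normal(scale=10.0, size=n)              # class-irrelevant, high variance
X = np.column_stack([signal + 0.1 * rng.normal(size=n), noise])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # first component is almost entirely the noise axis
print(pca.components_[0])             # loadings: dominated by the noise feature
```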
First of all, you have to find out the kind of objective you have and its characteristics (linear or non-linear). PCA is a good way for linear systems; for non-linear systems I use artificial neural networks and do a sensitivity analysis afterwards. This has fit perfectly well for all of my applications over the last 20 years, and it has the benefit of a more generic approach if you don't know much about your system.
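The exact sensitivity analysis is not specified above; as one common variant, here is a sketch that fits a small neural network and then ranks inputs by permutation importance (scikit-learn, with illustrative synthetic data):

```python
# Sketch: fit a small neural network, then rank input features by a
# permutation-based sensitivity analysis (one of several possible variants).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                    random_state=0).fit(X_tr, y_tr)
result = permutation_importance(net, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean.argsort()[::-1][:5])  # most sensitive inputs first
```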