How can I determine which attributes to use by applying SVM-RFE?

PCA will not select features in the original space. It projects them onto new axes that are combinations of the original features.

You can use a statistical test (Student's t-test), sequential selection, the Fisher discriminant ratio, or scatter matrices for feature selection. There is a good review on the wiki page: http://en.wikipedia.org/wiki/Feature_selection.

Regarding normalisation, it is really important to have as little variance between patients as possible, so that what you learn is what you will see at test time. So normalise your data, and spend some time understanding what your data "look like" so that you can pick a suitable normalisation approach.

Unbalanced data is tricky. I would prefer to get it balanced if possible. You could, for instance, create several balanced sets and learn an ensemble from them.
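One way to read the "several balanced sets" idea is sketched below, assuming plain random undersampling of the majority class. The function name `balanced_subsets` and the toy data are made up for illustration; each subset would then be used to train one member of the ensemble.

```python
import random

def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Build n_subsets balanced sets: all minority samples plus an
    equally sized random draw from the majority class."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        # Undersample the majority class without replacement.
        drawn = rng.sample(majority, len(minority))
        subsets.append(drawn + list(minority))
    return subsets

# Toy data with a 1:10 imbalance (illustrative only).
majority = [("neg", i) for i in range(100)]
minority = [("pos", i) for i in range(10)]
subsets = balanced_subsets(majority, minority, n_subsets=5)
```

Each subset keeps every minority sample, so the ensemble members differ only in which majority samples they see.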

George Emil Sakr

For feature selection, get the variance of all features and select the top ones. Features with low variance tend to carry less information.

As for normalization, look at your features. If you see a feature that has very large values compared to the others, then you need to normalize. Otherwise you don't.
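A minimal sketch of this variance ranking, using only the Python standard library. The helper name `rank_by_variance` and the small data matrix are assumptions for illustration. Keep in mind that raw variance depends on the units of each feature, so the ranking is only meaningful when the features are on comparable scales.

```python
from statistics import pvariance

def rank_by_variance(rows):
    """Return feature indices sorted from highest to lowest variance,
    together with the variance of each feature."""
    columns = list(zip(*rows))  # one tuple of values per feature
    variances = [pvariance(col) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    return order, variances

# Rows are samples, columns are features (illustrative values).
rows = [
    [1.0, 10.0, 5.0],
    [1.1, 20.0, 5.0],
    [0.9, 30.0, 5.0],
]
order, variances = rank_by_variance(rows)
```

Here the third feature is constant, so it ends up last in the ranking.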

Bassel Sabbagh

Thank you all for your answers:

@ Ali Akbar Jamali: I am already using RFE, which is a feature selection method that ranks the features, but I need another method to decide where the threshold is.

@ Guillaume Lemaître: sometimes it is not easy to get samples from patients, either because the procedure is invasive or because you do not have access to them, so unfortunately I have to live with this ;)

@ George Emil Sakr: it is interesting to choose the features based on variance. I am going to apply this to the ranked features and see if there is a gap or an obvious threshold. And yes, some of the features have very large values compared to others, so I think I need to normalize. Any suggestions for normalization methods?

George Emil Sakr

Yes, you can do a [-1,1] normalization: get the max and min of every feature, then calculate two coefficients for every feature:

a = 2/(max - min)

b = 1 - a*max

then apply a*xi + b to every data point.

Or you can just do a divide by max normalization.

Or you can do a Gaussian normalization, where you find the mean and standard deviation of every feature and do the following:

(xi - mean)/std
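The three options above can be sketched in plain Python as follows. The function names are made up for this example. Two assumptions to flag: the divide-by-max version uses the largest absolute value so that negative values are handled, and the Gaussian version divides by the standard deviation, which is the usual z-score definition.

```python
from statistics import mean, pstdev

def scale_minus1_1(values):
    """Linear map of one feature onto [-1, 1] using a*x + b."""
    lo, hi = min(values), max(values)
    a = 2 / (hi - lo)  # assumes the feature is not constant (hi > lo)
    b = 1 - a * hi
    return [a * x + b for x in values]

def divide_by_max(values):
    """Divide by the largest absolute value (assumption: abs handles negatives)."""
    m = max(abs(x) for x in values)
    return [x / m for x in values]

def zscore(values):
    """Gaussian normalization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

feature = [2.0, 4.0, 6.0, 8.0]  # one feature across four samples (illustrative)
scaled = scale_minus1_1(feature)
divided = divide_by_max(feature)
standardized = zscore(feature)
```

Whichever variant you pick, the coefficients (min, max, mean, standard deviation) should be computed on the training data and then reused unchanged on the validation and test data.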

Hope this helps

Bill Plakandaras

Scaling all your instances to a range of your choice (usually [-1,1] in SVM/R) is really important, so that the same weight is attributed to all instances during minimization of the dual form of the SVM equation.

Regarding RFE, there is no clear answer as to where to place the cutoff. A practical solution that I apply in analogous situations is to look for large changes in the performance of the system. In RFE you should search for large changes in the margin of the classifier and place your threshold there. Hope this helps.
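A rough sketch of the "look for a large change" heuristic, assuming you have already recorded one score per RFE step (validation accuracy, margin, or any other measure where higher is better). The function name and the numbers below are invented for illustration, not results from any real dataset.

```python
def largest_drop_cutoff(n_features, scores):
    """n_features[i] kept features gave scores[i], listed from the most
    features to the fewest. Return the feature count just before the
    largest single drop, and the size of that drop."""
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    worst = max(range(len(drops)), key=lambda i: drops[i])
    return n_features[worst], drops[worst]

# Hypothetical elimination path: score after each RFE step.
n_features = [10, 8, 6, 4, 2]
scores = [0.90, 0.91, 0.89, 0.74, 0.60]
keep, drop = largest_drop_cutoff(n_features, scores)
```

With these made-up numbers the score falls sharply when going from 6 to 4 features, so the cutoff would be placed at 6.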

Bassel Sabbagh

Thank you, Bill, for your answer.

As I mentioned before, I am dealing with imbalanced data (imbalance ratio 1:10). Should I apply resampling methods like SMOTE to get a more balanced set, or is this an acceptable ratio?

Sebastián Maldonado

SVM-RFE is a backward elimination approach that relies on a successful initial SVM solution with all features. In this context, you can use SMOTE or undersampling (depending on the sample size) and/or scale the data to improve your initial SVM solution. Both approaches are related to the classification task rather than to feature selection. Class imbalance is not a problem by itself, and your initial SVM solution may be very good (no overlap, close to perfect classification). When the two classes overlap, SMOTE can be very useful.

Regarding the stopping criterion of SVM-RFE (the "optimal" subset of features, parameter r in the original Guyon publication), you may want to monitor the performance on a validation subset across the various subsets of features and select the one with the best performance (although sometimes performance is not the only valid criterion).

Hope it helps. More details in the publication below.

Article Feature selection for high-dimensional class-imbalanced data...

Bassel Sabbagh

Thanks, Sebastián. I am going to apply SMOTE, undersampling, or cost-sensitive classification and see which one is more suitable. I am going to read your paper to get more info. Regarding the feature selection, I am going to measure the performance by sensitivity, specificity, and G-mean.
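For reference, the three measures mentioned here can be computed from true and predicted labels as in the sketch below. The function name and the label vectors are illustrative assumptions. The G-mean is taken as the geometric mean of sensitivity and specificity.

```python
from math import sqrt

def sens_spec_gmean(y_true, y_pred, positive=1):
    """Sensitivity, specificity and their geometric mean for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity, sqrt(sensitivity * specificity)

# Illustrative labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sensitivity, specificity, gmean = sens_spec_gmean(y_true, y_pred)
```

Unlike plain accuracy, the G-mean drops to zero if either class is completely misclassified, which is why it is a common choice for imbalanced data.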

Sebastián Maldonado

Sounds good, Bassel. Remember to always perform feature selection and data resampling on the training set only. Validation/test subsets always remain unseen for such tasks. That means the validation/test subsets remain imbalanced, and you may find a different subset of features in each training partition. There is a high risk of overfitting when combining data resampling, wrapper/embedded feature selection, and SVM (especially when using kernel methods).
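The ordering matters more than the specific resampler, so here is a minimal sketch of "split first, resample the training part only". It uses plain random oversampling as a stand-in for SMOTE, and the function name, split fraction, and toy labels are assumptions for illustration.

```python
import random

def split_then_oversample(labels, test_fraction=0.3, seed=0):
    """Split sample indices into train/test, then oversample the smaller
    classes inside the training indices only. Test indices stay untouched."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # Group the training indices by class label.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # Duplicate random members of each smaller class up to the largest class size.
    target = max(len(members) for members in by_class.values())
    balanced_train = []
    for members in by_class.values():
        balanced_train.extend(members)
        balanced_train.extend(rng.choices(members, k=target - len(members)))
    return balanced_train, test_idx

labels = [1] * 10 + [0] * 100  # toy 1:10 imbalance
balanced_train, test_idx = split_then_oversample(labels)
```

The test indices keep the original imbalance, and no duplicated sample can leak into them because the duplication happens after the split. Feature selection would be run on `balanced_train` in the same way, never on the test part.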

best

Marcelo Bassani de Freitas

With feature selection algorithms that rank the attributes, the sure way to find the threshold is to test all the possibilities in the ranking that you have. For example, first try a dataset with only the 1st attribute in the ranking, then a dataset with the 1st and 2nd attributes, and so on.

You can easily build a script that does that using weka command line. My datasets have more than 800 attributes and it is doable.
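The same sweep can be sketched in Python instead of the Weka command line. In the example below, `best_prefix` and `toy_evaluate` are hypothetical names: `toy_evaluate` is only a stand-in for "train the SVM on the training set with these attributes and return the validation score", and the attribute names are invented.

```python
def best_prefix(ranked_features, evaluate):
    """Score every top-k prefix of the ranking and return the best k,
    its score, and the full list of (k, score) pairs."""
    results = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        results.append((k, evaluate(subset)))
    best_k, best_score = max(results, key=lambda r: r[1])
    return best_k, best_score, results

def toy_evaluate(subset):
    """Stand-in scorer (assumption): rewards three 'useful' attributes
    and slightly penalises every extra attribute."""
    useful = {"f3", "f1", "f7"}
    hits = len(useful & set(subset))
    return hits / 3 - 0.01 * len(subset)

ranked = ["f3", "f1", "f7", "f2", "f5"]  # best-ranked attribute first
best_k, best_score, results = best_prefix(ranked, toy_evaluate)
```

Keeping the full `results` list is useful because it lets you plot score against k and check whether the best k is a clear peak or just noise.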

If you want it faster, you can use other types of feature selection methods that directly give you the best subset of attributes. When choosing a classifier, go to meta > Attribute Selected Classifier, choose any evaluator that ends with "SubsetEval", and choose your SVM as the classifier. Make sure you try out several search methods, as they can change your result a lot. And since you know you will be using SVM, you should try ClassifierSubsetEval with SVM.

For the imbalance problem you can use two strategies: (1) resample your data, or (2) select your instances with an Instance Selection Method (ISM). The first one can be done in Weka, so I think it will be easier for you. I advise you to try SMOTE, Resample, and SpreadSubsample. SMOTE is more widely known, but the other two gave better results on several datasets I tested.

One of the things I do in my research is to test all the possibilities for resampling and feature selection. The best thing I did was to start working with scripts using Weka via the command line. I have done more than 10 thousand classifications with it, which would not have been possible using Weka's GUI. So the best advice I can give you is to spend some time automating your tests with scripts, then let them run overnight and find out which combination was best.
