My advice is to look at the paper: Guyon, Isabelle, and André Elisseeff. "An introduction to variable and feature selection." The Journal of Machine Learning Research 3 (2003): 1157-1182.
This gives you a very detailed overview of feature selection methods, because the best choice always depends on your dataset. The authors also provide a checklist of questions to help you quickly decide which approach to use.
The two main families of methods (as far as I have learned so far) are wrapper methods and filter methods. Please look at how these methods work. The link posted by Mr. Omid seems to cover most of the algorithms used.
One can use rough sets with formal concept analysis to identify the best features. Refer to the paper:
B. K. Tripathy, D. P. Acharjya and V. Cynthya: A Framework for Intelligent Medical Diagnosis using Rough Set with Formal Concept Analysis; International Journal of Artificial Intelligence & Applications; Vol. 2 (2), pp. 45 – 66, (2011)
One can also predict missing associations. Refer to the paper:
D. P. Acharjya, Debasrita Roy and Md. A. Rahaman: Prediction of Missing Associations using Rough Computing and Bayesian Classification, International Journal of Intelligent Systems and Applications, Vol. 4 (11), pp. 1-13 (2012)
A closely related alternative to feature selection is feature extraction. If you want to reduce the feature dimensionality, you can use feature extraction instead of, or in combination with, feature selection. Principal Component Analysis (PCA) is one of the most widely used feature extraction approaches.
PCA works very well if your data are 'linear'. If they are not, you should have a look at nonlinear methods such as Isomap, locally linear embedding (LLE), Laplacian eigenmaps (LE) and kernel PCA. The following paper provides a review, and a small sketch follows it:
Automatic Configuration of Spectral Dimensionality Reduction Methods, M. Lewandowski, D. Makris and J.-C. Nebel, Pattern Recognition Letters, 31(12), 2010
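As a minimal sketch of the difference between linear and kernel PCA (assuming scikit-learn; the toy dataset and the gamma value are illustrative only, not taken from the paper above):

```python
# Minimal sketch: linear PCA vs. kernel PCA for feature extraction.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Toy, clearly non-linear data: two concentric circles.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA: only rotates the axes, cannot "unfold" the circles.
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel: can separate the two rings.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

print(X_pca.shape, X_kpca.shape)  # both (400, 2), but kernel PCA captures the non-linear structure
```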
Here is a different take on feature extraction and selection. This may not be the answer you are looking for, but it is worth pondering.
Feature extraction is a heuristic derived from underlying domain knowledge (Plötz et al. 2011), so it may work in some domains and not in others. The question of which feature selection method is better is therefore quite coarse. In fact, there is no clear winner in the feature selection literature, and the methods are very domain specific. Because of these problems, researchers are now trying to learn features directly from the data in order to find a generic and effective representation, and deep learning is emerging as a good methodology for reaching that goal.
Reference
Plötz, Thomas, Nils Y. Hammerla, and Patrick Olivier. "Feature learning for activity recognition in ubiquitous computing." Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two. AAAI Press, 2011.
One of my doctoral students, Bing Xue, has just finished her thesis on applying Particle Swarm Optimisation (PSO), in both continuous and discrete forms, to wrapper, filter and hybrid approaches to feature selection. The results in the domains studied are statistically significant improvements over both conventional and evolutionary approaches. PSO is useful for feature selection because it handles large dimensionality efficiently, its velocity component lets it keep exploring useful areas of the search space once they are found, and it is readily adaptable, e.g. to multi-objective optimisation such as number of features versus classification performance. I will note that the thesis did not have time/space to test all possible algorithms appropriate for feature selection, so many other approaches will perform as well as or better than PSO on given domains (no free lunch), but it is an approach worth considering (a rough sketch follows the references below).
Xue B., Cervante L., Shang L., Browne W. N., Zhang M. (2013). Multi-objective Evolutionary Algorithms for Filter-Based Feature Selection in Classification. International Journal on Artificial Intelligence Tools 22(4).
Xue B., Zhang M., Browne W. N. (2013). Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective Approach. IEEE Transactions on Cybernetics 43(6): 1656-1671.
Xue B., Cervante L., Shang L., Browne W. N., Zhang M. (2012). A multi-objective particle swarm optimisation for filter-based feature selection in classification problems. Connection Science 24(2-3): 91-116.
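As a rough illustration of the wrapper idea, here is a minimal binary PSO sketch. This is my own simplified version, not Xue et al.'s algorithms: it assumes scikit-learn, uses cross-validated KNN accuracy as the fitness of a feature mask, and a sigmoid transfer function to keep particle positions binary; the swarm size, iteration count and PSO coefficients are arbitrary.

```python
# Binary PSO wrapper for feature selection (simplified sketch).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

def fitness(mask):
    """Cross-validated KNN accuracy on the selected feature subset."""
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

n_particles, n_iters, w, c1, c2 = 20, 30, 0.7, 1.5, 1.5
pos = rng.integers(0, 2, size=(n_particles, n_features))      # bit masks over features
vel = rng.uniform(-1, 1, size=(n_particles, n_features))
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, n_features))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    prob = 1.0 / (1.0 + np.exp(-vel))                          # sigmoid transfer function
    pos = (rng.random((n_particles, n_features)) < prob).astype(int)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.flatnonzero(gbest), "CV accuracy:", pbest_fit.max())
```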
There are many feature selection methods. Rough-set-based attribute reduction is one of the best. Rough sets can be combined with a metaheuristic algorithm to find the best minimal attribute reduct.
If you are interested in an approach that is grounded in sensory input, does not use any stochastics/probabilities, and uses deep learning of coincidental patterns, you might want to read my approach; a six-page paper, published last year, can be found at www.adaptroninc.com
One of the feature selection methods is based on distance criteria (separability criteria). However, not many papers discuss this approach in depth. I am using the Mahalanobis distance for my research, which is still in progress (a small sketch follows the references below). A few references:
1. De Maesschalck, R., Jouan-Rimbaud, D., & Massart, D.L. (2000). The Mahalanobis Distance (Tutorial). Chemometrics and Intelligent Laboratory Systems 50, pp. 1-8.
2. Han, J., Lee, S. W., & Bien, Z. (2012). Feature Subset Selection using Separability Index Matrix. Information Sciences (the paper was in press when I downloaded it; I think it is now available).
3. Xiang, A., Nie, F., & Zhang, C. (2008). Learning Mahalanobis distance metric for data clustering and classification. Pattern Recognition 41, pp. 3600-3612.
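As a rough illustration of the separability idea, here is my own simplified sketch (not the methods from the references above): it assumes scikit-learn for a toy two-class dataset and scores every pair of features by a pooled-covariance, Mahalanobis-style distance between the class means.

```python
# Score candidate feature subsets by the (squared) Mahalanobis distance between class means.
from itertools import combinations

import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

def separability(cols):
    """Squared Mahalanobis distance between the two class means over the given columns."""
    A, B = X[y == 0][:, cols], X[y == 1][:, cols]
    diff = A.mean(axis=0) - B.mean(axis=0)
    pooled = (np.cov(A, rowvar=False) + np.cov(B, rowvar=False)) / 2.0
    return float(diff @ np.linalg.pinv(pooled) @ diff)

# Evaluate every pair of features and keep the most separable one.
best_pair = max(combinations(range(X.shape[1]), 2), key=lambda c: separability(list(c)))
print("most separable feature pair:", best_pair, "score:", separability(list(best_pair)))
```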
You can try the random forest algorithm; if you use R, there is the randomForest package. But the simplest approach is to compute correlation coefficients: if some features are strongly correlated, they usually carry the same variability information, so you can retain just one of them.
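A minimal sketch of both ideas, written in Python with scikit-learn and pandas rather than R (the 0.9 correlation threshold and the dataset are illustrative assumptions):

```python
# (1) Drop one feature from each highly correlated pair, (2) rank the rest with a
# random forest's impurity-based feature importances.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 1) Correlation filter: for each pair with |r| > 0.9, keep only the first feature.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# 2) Random forest importance ranking on the remaining features.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_reduced, y)
ranking = pd.Series(rf.feature_importances_, index=X_reduced.columns).sort_values(ascending=False)
print(ranking.head(10))
```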
In our own work on feature selection over the last few years, we have worked on filter, wrapper and mixed (hybrid) feature selection algorithms. Before you start with any algorithm, go through Guyon's papers and book (http://clopinet.com/isabelle/Projects/NIPS2003/call-for-papers.html), especially chapter one, as it gives a good start.
On the filter side, I can list a few algorithms with Matlab code:
1- Battiti's algorithm, "Using Mutual Information for Selecting Features in Supervised Neural Net Learning", available here: http://sci2s.ugr.es/keel/pdf/algorithm/articulo/battiti.pdf (see point 2 below for Matlab code).
2- Peng's mRMR: Battiti's original work was continued by H. C. Peng, who developed the Maximum Relevance Minimum Redundancy (mRMR) criterion and provides Matlab code for it.
3- Al-Ani's MIEF algorithm, which extends the above algorithms by considering the mutual information between more than two variables; for example, it employs I_Cxx, the mutual information between two variables taken together and the class label, or I_Cxxx, etc. (consult Dr. Al-Ani directly for the code: http://services.eng.uts.edu.au/~ahmed). If you can't get it, let me know and I will implement it for you.
4- There are many others on the filter side, like the Laplacian score by Deng Cai; try it, it's really simple and fast.
There are many more, but the above should give you a good start (a rough relevance/redundancy sketch follows below).
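To illustrate the relevance-minus-redundancy idea behind Battiti's MIFS and Peng's mRMR, here is a greatly simplified sketch (not the authors' Matlab code): it assumes scikit-learn, discretizes features for the pairwise mutual information, and the number of selected features is arbitrary.

```python
# Greedy relevance-minus-redundancy filter in the spirit of MIFS / mRMR (sketch only).
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_wine(return_X_y=True)
Xd = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform").fit_transform(X)

relevance = mutual_info_classif(X, y, random_state=0)   # I(x_i; C) for every feature
n_select, selected = 5, []
candidates = list(range(X.shape[1]))

while len(selected) < n_select:
    def score(i):
        # Relevance minus average redundancy with the already selected features.
        if not selected:
            return relevance[i]
        redundancy = np.mean([mutual_info_score(Xd[:, i], Xd[:, j]) for j in selected])
        return relevance[i] - redundancy
    best = max(candidates, key=score)
    selected.append(best)
    candidates.remove(best)

print("selected feature indices:", selected)
```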
Note: Filter methods are not really good at capturing the interaction between features, i.e., they can rank features by individual relevance and minimum redundancy, but they can't really tell you how well a subset of features acts together. Put simply, 6 features could look bad in terms of their individual relevance to the problem compared with other features, yet together they can form a subset that tells you a lot about the problem at hand. Imagine estimating I_Cxxxxxx using mutual information, or any other method; this would consume a lot of memory. Population-based techniques, on the other hand, seem more capable of finding the subsets that interact best together.
Population-based algorithms: we have tried almost all of them: GA, DE, PSO, ACO, Tabu search, forward and backward selection, ACO+DE, modified GA in many forms, and more. Here are two algorithms we previously developed based on Differential Evolution (DE).
1- The first is available from the link below. All you need to do is provide a training set, a test set, and a couple of other parameters, including the desired number of features to be selected.
The choice of attributes depends on the decision rules that will use them. Therefore, attribute selection and the construction of the decision rule should be done simultaneously. How this is done can be learned from the attached article.
Difficult question: I think the best tool is experience! Anyway, you can read papers in the literature describing experiments similar to the ones you are going to run, and try to pick up some of the secrets of experienced people.
It is very difficult to say that any one method is the best for feature selection (FS); it is a matter of experience. Generally, FS methods can be categorized into three classes: filter, wrapper and hybrid. In filter-based FS, the correlation between features and the target class is computed based on a certain criterion, for example mutual information. The details can be explored further in our publications, linked as follows:
1. An empirical comparative analysis of feature reduction methods for intrusion detection
We are working on bootstrap resampling; it seems to combine well with SVM and other methods, though it is not immediate and some details have to be learned from experience and from your database first. But it gives you a probabilistic backbone for machine learning, which is often lacking in feature selection methods.
The problem, as stated, is an ill-defined one. Others have already provided general categories of approaches, such as filter, wrapper and hybrid, but in the end it really depends on the problem. Similar to the no-free-lunch theorem, which argues that no classifier is universally better than all others in the absence of prior knowledge, analogous arguments can be made for feature selection algorithms. Filter-based approaches are faster and independent of the subsequent classifier, and hence use a criterion function (figure of merit) that is unrelated to classification accuracy. Wrapper approaches simply look for features that work well with a particular classifier, and hence require many training/testing cycles. Also, features identified as "good" with a particular classifier when using a wrapper-based approach may not be "informative" when used with another classifier. Also worth noting is Gavin Brown's unifying framework, which shows that many of the filter-based approaches are in fact special cases of a more general approach (see http://jmlr.org/papers/v13/brown12a.html).
It is impossible to prove that a given attribute selection method is the best on all tasks, but comparing the method with others on a large number of different tasks serves as a good reference point. The work at
http://www.biomedcentral.com/1471-2[9]5/7/359
describes a large experiment with the 10 best-known algorithms, and the attached work describes an algorithm that turned out to be better than those 10 algorithms.
I would like to add to the above that pre-processing the features may increase their effectiveness and therefore interact with your selection criteria, for example by discretising continuous data, merging attribute values, etc.
Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken down into wrappers, filters and embedded methods.
Wrappers use a search algorithm to explore the space of possible feature subsets and evaluate each subset by running a model on it. Wrappers can be computationally expensive and risk overfitting to the model.
Filters are similar to Wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated.
Embedded techniques are embedded in and specific to a model.
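A small illustration of the three families, using scikit-learn as an assumed toolkit (SequentialFeatureSelector requires scikit-learn >= 0.24; the dataset and the choice of five features are arbitrary):

```python
# Filter vs. wrapper vs. embedded selection on a toy dataset (illustrative sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filter: rank features with a univariate statistic, independently of any model.
filter_idx = SelectKBest(f_classif, k=5).fit(X, y).get_support(indices=True)

# Wrapper: greedily grow a subset, scoring candidates by cross-validated model accuracy.
wrapper = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=5, cv=3)
wrapper_idx = wrapper.fit(X, y).get_support(indices=True)

# Embedded: the selection signal comes from the model's own training (tree importances).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
embedded_idx = rf.feature_importances_.argsort()[::-1][:5]

print("filter:", filter_idx, "wrapper:", wrapper_idx, "embedded:", sorted(embedded_idx))
```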
The following articles are about feature selection:
Cai, Z., Xu, D., Zhang, Q., Zhang, J., Ngai, S. M., & Shao, J. (2015). Classification of lung cancer using ensemble-based feature selection and machine learning methods. Molecular BioSystems.
Hall, M. A. (1999). Correlation-based feature selection for machine learning (Doctoral dissertation, The University of Waikato).
Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial intelligence, 97(1), 245-271.
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1), 273-324.
Hall, M. A., & Smith, L. A. (1999, May). Feature Selection for Machine Learning: Comparing a Correlation-Based Filter Approach to the Wrapper. In FLAIRS conference (pp. 235-239).
I would like to add a slightly different viewpoint. Most of the answers assume you already have a set of numerical attributes and are selecting from this set the ones that capture all (or at least sufficient) information about the domain. Suppose you are starting on a completely new domain from scratch? When we were creating the Silent Talker lie detector, we spent a lot of time on knowledge engineering, talking to expert psychologists and reviewing the psychological literature to build a sort of corpus of non-verbal behaviour associated with deception. Then we broke the behaviours down into very fine-grained components which the ANN classifier itself could re-combine into complex features for classification; more information in https://semanticsimilarity.files.wordpress.com/2011/09/silenttalker-applied-cognitive-psychology10-1002acp-1204.pdf. So sometimes you will need to start with a bit of the "craft" of knowledge engineering.
The papers are good for understanding ML, but the first step must always be to wrestle with the models themselves. For me, the best protocol is to start with practical exercises; once you know how to run the models, the next step is to understand the mathematics behind them.
It is not necessarily the best way to learn, but it is quicker. Tempus fugit...
"Luke: Vader... Is the dark side stronger?
Yoda: No, no, no. Quicker, easier, more seductive. "
To keep it simple, you can use PCA (or kernel PCA, KPCA) to extract features and then use correlation (contributions/loadings, or a Random Forest) to link the extracted features back to the original ones.
See http://www.sciencedirect.com/science/article/pii/S0098300414001782
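A rough sketch of that idea (assuming scikit-learn; the dataset and the choice of two components are illustrative): extract components with PCA, then correlate each component's scores with the original features to see which originals drive it.

```python
# Extract two principal components, then see which original features each one
# correlates with most strongly (a simple way to "link back" to the originals).
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = StandardScaler().fit_transform(data.data)

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

for k in range(2):
    corr = np.array([np.corrcoef(X[:, j], scores[:, k])[0, 1] for j in range(X.shape[1])])
    top = np.argsort(np.abs(corr))[::-1][:3]
    print(f"PC{k + 1} is driven mainly by:", [data.feature_names[j] for j in top])
```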
Ghamisi, P., & Benediktsson, J. A. (2015). Feature selection based on hybridization of genetic algorithm and particle swarm optimization. Geoscience and Remote Sensing Letters, IEEE, 12(2), 309-313.
Zheng, B., Yoon, S. W., & Lam, S. S. (2014). Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms. Expert Systems with Applications, 41(4), 1476-1482.
Rodrigues, D., Pereira, L. A., Nakamura, R. Y., Costa, K. A., Yang, X. S., Souza, A. N., & Papa, J. P. (2014). A wrapper approach for feature selection based on Bat Algorithm and Optimum-Path Forest. Expert Systems with Applications, 41(5), 2250-2258.
Krisshna, N. A., Deepak, V. K., Manikantan, K., & Ramachandran, S. (2014). Face recognition using transform domain feature extraction and PSO-based feature selection. Applied Soft Computing, 22, 141-161.
Bhattacharyya, S., Sengupta, A., Chakraborti, T., Konar, A., & Tibarewala, D. N. (2014). Automatic feature selection of motor imagery EEG signals using differential evolution and learning automata. Medical & biological engineering & computing, 52(2), 131-139.
You might also want to read this paper, in which we compared SVM combined with PCA against Random Forests for feature selection. The first is an example of feature extraction, the second of feature selection. The difference is that in the first case you cannot interpret the results as easily, so everything depends on what you need the method for. SVM is a good choice, but the interpretability of the results is questionable. If you need to know which features are important and which are not, try random forest or another feature selection method.
You can also find some interesting material in the references.
In my case, classical techniques like PCA and LDA gave low recognition rates, but using swarm optimization for feature selection (Bat, Chicken Swarm, PSO, ...) gave me superior results and certainly reduced the time. (Note: the classifiers used were SVM, KNN and Random Forest; the last one obtained the best results.)
Use deep convolutional networks (for images) and keep the activations of the last layer as features (a small sketch follows the reference below).
Li, Yifeng, Chih-Yu Chen, and Wyeth W. Wasserman. "Deep Feature Selection: Theory and Application to Identify Enhancers and Promoters." Research in Computational Molecular Biology. Springer International Publishing, 2015.
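As a hedged sketch of using a pretrained CNN as a feature extractor (this assumes PyTorch and torchvision >= 0.13, a pretrained ResNet-18 rather than any network from the cited paper, and a hypothetical image file "example.jpg"):

```python
# Take a pretrained CNN and use the activations just before the final
# classification layer as image features for any downstream classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.eval()
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the fc layer

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

img = Image.open("example.jpg").convert("RGB")   # hypothetical image path
with torch.no_grad():
    feats = feature_extractor(preprocess(img).unsqueeze(0)).flatten(1)

print(feats.shape)  # (1, 512) feature vector usable with SVM, KNN, random forest, etc.
```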
As is rightly said, "feature selection is an art". You should choose the features according to what you want to do. You can use algorithms like PCA, or increase the dimensionality of the data, depending on your requirements. For image classification you can directly use a CNN (which does not require hand-crafted feature extraction).
Weka implements interesting filters and wrappers for feature selection.
https://www.youtube.com/watch?v=x5wa1w-BpRE
https://www.youtube.com/watch?v=UOadhDKRbPM
Personally, I used to apply a filter that computes the information gain of each attribute with respect to the class attribute (InfoGainAttributeEval). It gave me pretty good results each time. I work on text classification tasks with several representations (bag of words, n-grams…), so I can quickly end up with hundreds of thousands of features without feature selection.
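Outside Weka, a rough Python equivalent of that workflow might look like this (a sketch assuming scikit-learn >= 1.0, with mutual_info_classif standing in for information gain; the corpus, n-gram range and k are arbitrary):

```python
# Rank bag-of-words / n-gram features by mutual information with the class
# and keep the top k (illustrative stand-in for Weka's InfoGainAttributeEval).
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

vec = CountVectorizer(ngram_range=(1, 2), min_df=5)
X = vec.fit_transform(data.data)                      # sparse document-term matrix
vocab = np.array(vec.get_feature_names_out())

selector = SelectKBest(mutual_info_classif, k=1000).fit(X, data.target)
kept = selector.get_support()
print(f"kept {kept.sum()} of {X.shape[1]} features")
print("examples of kept terms:", vocab[kept][:10])
```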