Conceptually, feature selection comes in two alternatives: feature ranking and subset selection. Feature ranking in turn exists in two forms: wrappers and filters. Wrappers are cost-prohibitive and classifier-specific. Filters are efficient and classifier-independent: they estimate the discriminative power of each feature independently. The classifier then uses the highest-scoring features for training, testing, and prediction.
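For concreteness, here is a minimal sketch of filter-style ranking, using scikit-learn's mutual_info_classif as the per-feature score (any univariate statistic would do; the dataset and k are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Filter step: score each feature independently against the class labels.
scores = mutual_info_classif(X, y, random_state=0)
print("per-feature scores:", np.round(scores, 3))

# Keep the k highest-scoring features; the classifier sees only these.
selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
X_selected = selector.transform(X)
print("selected feature indices:", selector.get_support(indices=True))
```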

Now, the fact that a filter works on only one feature at a time seems troublesome. The classifier uses the selected features as a group; the filter, on the other hand, never considers more than one feature at a time and ignores all the rest.

(For two-class problems a single feature might be sufficient, so the problem may only arise in a multi-class scenario.)

Thus, with filters the concept (one feature at a time) and the implementation (a classifier trained on several features) are at odds. Worse, a filter offers no protection against selecting highly correlated, redundant features.
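To make the redundancy pitfall concrete, here is a small demonstration on a synthetic dataset of my own construction: a feature and its near-duplicate receive nearly identical scores, so a top-2 filter picks the redundant pair while a weaker but complementary feature is left out:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

informative = y + rng.normal(scale=0.5, size=n)           # strongly class-related
duplicate = informative + rng.normal(scale=0.05, size=n)  # near-copy: redundant
weaker = y + rng.normal(scale=1.5, size=n)                # weaker but complementary

X = np.column_stack([informative, duplicate, weaker])
scores = mutual_info_classif(X, y, random_state=0)
print(np.round(scores, 3))
# The duplicate scores about as high as the original, so selecting the
# top two features by filter score yields a redundant pair.
```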

In contrast, many subset selection techniques do take interactions between features into account. First, in information theory there are approaches that work directly with mutual information and thereby penalize redundancy. Another group of methods projects the data either into an orthogonal space (PCA, LDA, etc.) or onto some manifold beneficial for class discrimination (locally linear embedding and other techniques); this way they address feature redundancy as well.
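As an illustration of the first group, here is a greedy mRMR-style sketch (minimum-redundancy maximum-relevance; the exact scoring is my simplification): each candidate's mutual information with the class is discounted by its average mutual information with the features already chosen:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k, random_state=0):
    """Greedy mRMR: pick features with high relevance to y and low
    redundancy with the features selected so far."""
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected, remaining = [], list(range(X.shape[1]))
    cache = {}  # pairwise feature-feature MI, computed lazily

    def pair_mi(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            cache[key] = mutual_info_regression(
                X[:, [i]], X[:, j], random_state=random_state)[0]
        return cache[key]

    for _ in range(k):
        best, best_score = None, -np.inf
        for f in remaining:
            redundancy = (np.mean([pair_mi(f, s) for s in selected])
                          if selected else 0.0)
            score = relevance[f] - redundancy  # mRMR (difference) criterion
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

On data like the three-feature example above, mrmr_select(X, y, k=2) would be expected to pick the informative feature and then the weaker complementary one, skipping the near-duplicate because its redundancy penalty outweighs its relevance.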

So, with the pitfall of redundancy in mind: shouldn't we restrict feature selection to subset selection exclusively and abandon filters altogether?
