I have two sets of data; let's say each one is the same feature measured over two samples, and a two-sample t-test revealed that they are significantly different. Can I use it as a good feature in any classification technique?
it depends on what your classification problem is ...
say I have sample A and sample B (everything representative, independent, and so on and so forth ...) and my classification problem is to classify A vs B
an individual (from A or B) is defined by features
if a t-test run on a given feature F rejects the null hypothesis that F|A and F|B have been drawn from the same distribution, then there must be "something" (possibly a very tiny "something") making a statistically significant difference (at the chosen confidence level) between F|A and F|B, and this "something" might well be leveraged by my classifier
this is the classical "filter" approach to feature selection (a minimal sketch is given below)
https://en.wikipedia.org/wiki/Feature_selection
easy to use, but it takes neither the specifics of the classifier nor the correlations between features into account
now, of course, feature F is then a good candidate for the A vs B classification problem ... not a "universally" good feature for any classification problem!
one more word of caution: review the conditions under which the t-test is applicable ... do not run a t-test on just any kind of distribution: if the normality assumption is strongly violated, anything can happen!
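To make the filter idea concrete, here is a minimal sketch (in Python, assuming numpy and scipy are available; the feature matrix X, the labels y and the 0.8 shift are made-up placeholders, not anyone's real data): each feature is scored by its own two-sample t-test, and the ones that reject the null are kept as candidates.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up placeholder data: 200 individuals, 10 features, labels 0 = "A", 1 = "B"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:100, 3] += 0.8                      # make feature 3 genuinely different for group A
y = np.array([0] * 100 + [1] * 100)

# Filter approach: run a two-sample (Welch) t-test on each feature in isolation
scores = []
for j in range(X.shape[1]):
    t, p = ttest_ind(X[y == 0, j], X[y == 1, j], equal_var=False)
    scores.append((j, t, p))

# Keep the features for which the null ("same mean in A and B") is rejected
alpha = 0.05
candidates = [j for j, t, p in scores if p < alpha]
print("candidate features:", candidates)
```

Note that this scores each feature on its own, which is exactly why the specifics of the classifier and the correlations between features are ignored.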
Another confusing point for me: once we have decided on the features (let's say the best or at least good features), which classifier should be chosen if we are concerned with supervised learning with more than two classes? For example, Matlab has built-in Bayes, Classification Tree, KNN and Discriminant Analysis. Which one is the best?
I think there is no unique answer to the question "which one is best".
It will depend on the data, the intrinsic properties of the selected features, the way you use the algorithm, etc.
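In practice one usually just compares a few candidates by cross-validation on one's own data. A rough sketch (in Python with scikit-learn rather than the Matlab toolboxes mentioned above; the synthetic three-class data stand in for whatever features you selected):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up multi-class data standing in for "your selected features"
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

candidates = {
    "naive Bayes": GaussianNB(),
    "classification tree": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "discriminant analysis": LinearDiscriminantAnalysis(),
}

# 5-fold cross-validated accuracy: the ranking will change with the data,
# which is the whole point -- there is no universally best classifier.
for name, clf in candidates.items():
    acc = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {acc.mean():.3f} +/- {acc.std():.3f}")
```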
The answer to your question is no: the existence of a continuous feature that is significantly different between two sets does not guarantee that you can classify these two sets using this feature. As noted by Fabrice, especially for large data sets, tiny differences become statistically significant, which is a well-known pitfall of frequentist significance testing.
Alas, the answer is: not necessarily (in fact, more likely not)! Well, it all depends on what you expect from a “good” feature.
The problem is the aggregate nature of the t-test: it simply detects differences in the estimated means of the two populations. For large enough sample sizes, even with very heavy overlap between the two distributions, the t-test will be significant, yet the separation of the individual data points is poor!
Since I cannot post graphs or fancier formatting here, allow me to just link to this post:
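The effect is easy to reproduce in a tiny simulation (a sketch in Python, assuming numpy, scipy and scikit-learn are available; the 0.1 mean shift and n = 100,000 per group are made-up numbers): the t-test is overwhelmingly significant, yet a classifier built on that single feature barely beats coin flipping.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 100_000                                    # very large samples
a = rng.normal(loc=0.0, scale=1.0, size=n)     # feature F in population A
b = rng.normal(loc=0.1, scale=1.0, size=n)     # same spread, tiny mean shift in B

t, p = ttest_ind(a, b)
print(f"t = {t:.1f}, p = {p:.2e}")             # astronomically small p-value

# ... yet the two distributions overlap almost completely
X = np.concatenate([a, b]).reshape(-1, 1)
y = np.concatenate([np.zeros(n), np.ones(n)])
acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.3f}")  # barely above the 0.5 chance level
```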
Regarding the same feature selection problem, what if I apply a two-sample Kolmogorov-Smirnov test to see whether the two samples are from the same distribution or not? If they are different, does that mean the feature is a good one?
in fact, the type of test is less important than the null hypothesis which is being tested !
this null is usually that the two samples come from the same distribution, and the aim of the test is to decide whether this hypothesis can be rejected or not; the tests differ in the statistic chosen to be able to reject the null and in the assumptions made about the distribution from which the data are drawn under the null hypothesis
now, if successful, what is shown at the end of the day is that the null can be rejected: this does not necessarily imply that you have a good feature; it just says that the discrepancy between your two samples is large enough, given the sample sizes, to reject the null hypothesis, and this could happen on the basis of an infinitesimal difference if you are working with very large samples (as Markus pointed out above)
now, rejecting the null is obviously better than not rejecting it when you are looking for potential features !
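To see this with the KS test specifically, here is a small sketch (Python, scipy assumed; the 0.05 shift is an arbitrary made-up number): as the samples grow, the p-value collapses while the KS statistic D, which measures the actual size of the discrepancy between the empirical CDFs, remains small.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Same tiny shift, increasing sample sizes: eventually the null gets rejected,
# but the KS statistic D (maximum distance between the empirical CDFs) stays small.
for n in (100, 1_000, 10_000, 100_000):
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.05, scale=1.0, size=n)
    res = ks_2samp(a, b)
    print(f"n = {n:>6}: D = {res.statistic:.4f}, p = {res.pvalue:.3g}")
```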
I guess I misused the term "best feature". I should have said "potential feature" instead, but I got the general point of the discussion. Very good discussion and very useful for me.
Good discussion indeed as I believe that it touches upon some widespread misunderstandings of the implications of "statistically significant".
Maybe you can think of a "good feature" as a "strong signal", whereas t/KS/MW tests (or any others) excel at finding "weak signals". (Of course they will also detect strong differences.)
As an example from medicine: 4-7 drinks a week apparently increase the risk of a certain type of cancer by 20%. If you encode this as a binary drinking-flag variable, then with a sample size of e.g. N = 10^4, any t or KS test will yield highly significant differences between the two populations.
However, that feature alone will still not give you a "good" classifier at all. The R-squared for e.g. a linear model would be a meagre 0.0007, and a classification tree or any other classifier would not have much discriminative power at all.
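A quick simulation of that scenario (a sketch in Python with numpy, scipy and scikit-learn assumed; the 10% baseline risk and 12% risk among drinkers, i.e. a 20% relative increase, are made-up illustrative numbers) shows the same pattern: a typically significant test, a negligible R-squared and a classifier that just predicts the majority class.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 10_000
drinks = rng.integers(0, 2, size=n)          # binary "4-7 drinks a week" flag
risk = np.where(drinks == 1, 0.12, 0.10)     # 20% relative risk increase (made up)
cancer = rng.binomial(1, risk)               # binary outcome

# The drinking flag usually differs significantly between cases and non-cases ...
t, p = ttest_ind(drinks[cancer == 1], drinks[cancer == 0])
print(f"t = {t:.2f}, p = {p:.3g}")

# ... but it explains almost nothing and barely helps a classifier
X = drinks.reshape(-1, 1)
r2 = LinearRegression().fit(X, cancer).score(X, cancer)
acc = cross_val_score(DecisionTreeClassifier(), X, cancer, cv=5).mean()
print(f"R^2 = {r2:.5f}, tree accuracy = {acc:.3f}")  # R^2 tiny, accuracy ~ base rate
```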