I have two sets of data; let's say each one is the same feature measured over two samples, and a two-sample t-test revealed that they are significantly different. Can I use it as a good feature in any classification technique?
it depends on what your classification problem is ...
say I have sample A and sample B (everything representative, independent, and so on and so forth ...) and my classification problem is to classify A vs B
an individual (from A or B) is defined by features
if a t-test run on a given feature F rejects the null hypothesis that F|A and F|B have been drawn from the same distribution, then there must be "something" (possibly a very tiny "something") making a statistically significant difference (at the chosen confidence level) between F|A and F|B, and this "something" might well be leveraged by my classifier
this is the classical "filter" approach to feature selection (a minimal sketch is given below)
https://en.wikipedia.org/wiki/Feature_selection
easy to use, but it takes neither the specifics of the classifier nor the correlations between features into account
now, of course, feature F is then a good candidate for the A vs B classification problem ... not a "universally" good feature for any classification problem!
one more word of caution: review the conditions under which the t-test is applicable ... do not run a t-test on just any kind of distribution: if the normality assumption is strongly violated, anything can happen!
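To make the filter idea concrete, here is a minimal sketch (in Python, assuming numpy and scipy are available; the feature matrix X, the labels y and the 0.8 shift are made-up placeholders, not anyone's real data): each feature is scored by its own two-sample t-test, and the ones that reject the null are kept as candidates.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up placeholder data: 200 individuals, 10 features, labels 0 = "A", 1 = "B"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:100, 3] += 0.8                      # make feature 3 genuinely different for group A
y = np.array([0] * 100 + [1] * 100)

# Filter approach: run a two-sample (Welch) t-test on each feature in isolation
scores = []
for j in range(X.shape[1]):
    t, p = ttest_ind(X[y == 0, j], X[y == 1, j], equal_var=False)
    scores.append((j, t, p))

# Keep the features for which the null ("same mean in A and B") is rejected
alpha = 0.05
candidates = [j for j, t, p in scores if p < alpha]
print("candidate features:", candidates)
```

Note that this scores each feature on its own, which is exactly why the specifics of the classifier and the correlations between features are ignored.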
Another confusing point for me: once we have decided on the features (let's say the best or at least good features), which classifier should be chosen if we are concerned with supervised learning with more than two classes? For example, Matlab has built-in Bayes, Classification Tree, KNN and Discriminant Analysis. Which one is the best?
I think there is no unique answer to the question "which one is best".
It will depend on the data, the intrinsic properties of the selected features, the way you use the algorithm, etc.
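In practice one usually just compares a few candidates by cross-validation on one's own data. A rough sketch (in Python with scikit-learn rather than the Matlab toolboxes mentioned above; the synthetic three-class data stand in for whatever features you selected):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up multi-class data standing in for "your selected features"
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

candidates = {
    "naive Bayes": GaussianNB(),
    "classification tree": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "discriminant analysis": LinearDiscriminantAnalysis(),
}

# 5-fold cross-validated accuracy: the ranking will change with the data,
# which is the whole point -- there is no universally best classifier.
for name, clf in candidates.items():
    acc = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {acc.mean():.3f} +/- {acc.std():.3f}")
```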
The answer to your question is no: the existence of a continuous feature that is significantly different between two sets does not guarantee that you can classify these two sets using this feature. As noted by Fabrice, especially for large data sets, tiny differences become statistically significant, which is a well-known pitfall of frequentist significance testing.
Alas, the answer is: not necessarily (in fact, more likely not)! Well, it all depends on what you expect from a “good” feature.
The problem is the aggregate nature of the t-test: it simply detects differences in the estimated means of the two populations. For large enough sample sizes, even with very heavy overlap between the two distributions, the t-test will be significant, yet the separation of the individual data points is poor!
Since I cannot post graphs or fancier formatting here, allow me to just link to this post:
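The effect is easy to reproduce in a tiny simulation (a sketch in Python, assuming numpy, scipy and scikit-learn are available; the 0.1 mean shift and n = 100,000 per group are made-up numbers): the t-test is overwhelmingly significant, yet a classifier built on that single feature barely beats coin flipping.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 100_000                                    # very large samples
a = rng.normal(loc=0.0, scale=1.0, size=n)     # feature F in population A
b = rng.normal(loc=0.1, scale=1.0, size=n)     # same spread, tiny mean shift in B

t, p = ttest_ind(a, b)
print(f"t = {t:.1f}, p = {p:.2e}")             # astronomically small p-value

# ... yet the two distributions overlap almost completely
X = np.concatenate([a, b]).reshape(-1, 1)
y = np.concatenate([np.zeros(n), np.ones(n)])
acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.3f}")  # barely above the 0.5 chance level
```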
Regarding the same feature selection problem, what if I apply a two-sample Kolmogorov-Smirnov test to see whether the two samples are from the same distribution or not? If they are different, does that mean the feature is a good one?
in fact, the type of test is less important than the null hypothesis which is being tested !
this null is usually that the two samples come from the same distribution, and the aim of the test is to decide whether this hypothesis can be rejected or not; the tests differ in the statistic chosen to be able to reject the null and in the assumptions made about the distribution from which the data are drawn under the null hypothesis
now, if successful, what is shown at the end of the day is that the null can be rejected: this does not necessarily imply that you have a good feature; it just says that the discrepancy between your two samples is large enough, given the sample sizes, to reject the null hypothesis, and this could happen on the basis of an infinitesimal difference if you are working with very large samples (as Markus pointed out above)
now, rejecting the null is obviously better than not rejecting it when you are looking for potential features !
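To see this with the KS test specifically, here is a small sketch (Python, scipy assumed; the 0.05 shift is an arbitrary made-up number): as the samples grow, the p-value collapses while the KS statistic D, which measures the actual size of the discrepancy between the empirical CDFs, remains small.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Same tiny shift, increasing sample sizes: eventually the null gets rejected,
# but the KS statistic D (maximum distance between the empirical CDFs) stays small.
for n in (100, 1_000, 10_000, 100_000):
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.05, scale=1.0, size=n)
    res = ks_2samp(a, b)
    print(f"n = {n:>6}: D = {res.statistic:.4f}, p = {res.pvalue:.3g}")
```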
I guess I misused the term "best feature". I should have said "potential feature" instead, but I got the general point of the discussion. Very good discussion and very useful for me.
Good discussion indeed as I believe that it touches upon some widespread misunderstandings of the implications of "statistically significant".
Maybe you can think of a "good feature" as a "strong signal", whereas t/KS/MW tests (or any others) excel at finding "weak signals". (Of course they will also detect strong differences.)
As an example from medicine: 4-7 drinks a week apparently increase the risk of a certain type of cancer by 20%. If you encode this as a binary drinking-flag variable, then with a sample size of e.g. N = 10^4, any t or KS test will yield highly significant differences between the two populations.
However, that feature alone will still not give you a "good" classifier at all. The R-squared for e.g. a linear model would be a meagre 0.0007, and a classification tree or any other classifier would not have much discriminative power at all.
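A quick simulation of that scenario (a sketch in Python with numpy, scipy and scikit-learn assumed; the 10% baseline risk and 12% risk among drinkers, i.e. a 20% relative increase, are made-up illustrative numbers) shows the same pattern: a typically significant test, a negligible R-squared and a classifier that just predicts the majority class.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 10_000
drinks = rng.integers(0, 2, size=n)          # binary "4-7 drinks a week" flag
risk = np.where(drinks == 1, 0.12, 0.10)     # 20% relative risk increase (made up)
cancer = rng.binomial(1, risk)               # binary outcome

# The drinking flag usually differs significantly between cases and non-cases ...
t, p = ttest_ind(drinks[cancer == 1], drinks[cancer == 0])
print(f"t = {t:.2f}, p = {p:.3g}")

# ... but it explains almost nothing and barely helps a classifier
X = drinks.reshape(-1, 1)
r2 = LinearRegression().fit(X, cancer).score(X, cancer)
acc = cross_val_score(DecisionTreeClassifier(), X, cancer, cv=5).mean()
print(f"R^2 = {r2:.5f}, tree accuracy = {acc:.3f}")  # R^2 tiny, accuracy ~ base rate
```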