I need to compare two samples: one has 500 data points and the other has 100. What would be the best way to compare them? I read online that the t-test is not the best method for this. Any suggestions?
With these sample sizes, the t-test is very robust, unless your data are exceptionally badly behaved.
Apart from that, the t-test handles different sample sizes well. A problem occurs only if the variances in the two populations are very different (testing a mean difference in that case is known as the Behrens-Fisher problem). If the larger sample comes from the population with the larger variance, the t-test gives p-values that are too large; if the smaller sample does, the p-values are too small. Welch's test controls for this effectively. But this has to do with unequal variances, not unequal sample sizes.
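To make the point about unequal variances concrete, here is a small simulation sketch. It assumes normally distributed data with equal means, and the standard deviations (1 for the sample of 500, 3 for the sample of 100) are made up purely for illustration; the function name `type1_error_rates` is hypothetical. It estimates how often each test falsely rejects at the 5% level.

```python
import numpy as np
from scipy import stats

def type1_error_rates(n_large=500, n_small=100, sd_large=1.0, sd_small=3.0,
                      n_sim=2000, alpha=0.05, seed=0):
    """Estimate false-rejection rates of Student's and Welch's t-tests
    when both populations have the same mean (illustrative settings)."""
    rng = np.random.default_rng(seed)
    student_rejections = 0
    welch_rejections = 0
    for _ in range(n_sim):
        # Same mean (0) in both groups, so every rejection is a Type I error.
        a = rng.normal(0.0, sd_large, n_large)
        b = rng.normal(0.0, sd_small, n_small)
        if stats.ttest_ind(a, b, equal_var=True).pvalue < alpha:
            student_rejections += 1
        if stats.ttest_ind(a, b, equal_var=False).pvalue < alpha:
            welch_rejections += 1
    return student_rejections / n_sim, welch_rejections / n_sim

student_rate, welch_rate = type1_error_rates()
print(f"Student: {student_rate:.3f}, Welch: {welch_rate:.3f}")
```

Under these assumed settings, where the smaller sample has the larger variance, the pooled test should reject far more often than the nominal 5%, while Welch's test should stay close to it. Swapping the two standard deviations should push the pooled test in the conservative direction instead.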
Here is a discussion on that: https://stats.stackexchange.com/questions/232084/welchs-t-test-when-the-smaller-sample-has-a-larger-variance
In theory, the reference distribution is a t-distribution with 500 + 100 - 2 = 598 degrees of freedom. In practice, with that many degrees of freedom it is essentially indistinguishable from its asymptotic limit, the standard normal distribution.
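As a quick check of how close the two distributions are at 598 degrees of freedom, the two-sided 5% critical values can be compared. This is only a sketch using `scipy.stats`; the 5% level is an assumption chosen for illustration.

```python
from scipy import stats

df = 500 + 100 - 2                   # pooled degrees of freedom
t_crit = stats.t.ppf(0.975, df)      # two-sided 5% critical value, t-distribution
z_crit = stats.norm.ppf(0.975)       # same quantile of the standard normal

print(f"df = {df}, t critical = {t_crit:.4f}, z critical = {z_crit:.4f}")
```

The t critical value is slightly larger than the normal one, but the gap is far too small to matter in practice.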
Kunal Kanoi, I firmly advise you to go with Prof. @David Eugene Booth.
Your sample sizes rule out Student's t-test, unless you want a meaningless Type I error rate!
Welch's t-test (or unequal-variances t-test) is a two-sample location test used to test the hypothesis that two populations have equal means.
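For reference, here is a minimal sketch of what Welch's test computes: the t statistic with unpooled variances and the Welch-Satterthwaite approximation for the degrees of freedom. The helper name `welch_t` and the simulated data are hypothetical, for illustration only; in practice `scipy.stats.ttest_ind(a, b, equal_var=False)` does the same job.

```python
import numpy as np
from scipy import stats

def welch_t(a, b):
    """Welch's two-sample t-test, two-sided. Returns (t, df, p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean (variances are NOT pooled).
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Made-up data with the sample sizes from the question.
rng = np.random.default_rng(1)
sample_a = rng.normal(10.0, 2.0, 500)
sample_b = rng.normal(10.5, 4.0, 100)

t_stat, dof, p_value = welch_t(sample_a, sample_b)
reference = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(f"t = {t_stat:.3f}, df = {dof:.1f}, p = {p_value:.4f}")
```

The hand-computed statistic and p-value should agree with SciPy's result stored in `reference`.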
Welch's t-test can be applied in the following scenarios:
- The sample sizes are unequal (as yours are).
- The samples have unknown or unequal variances; calculate the sample variances and compare them.
- The distribution is assumed to be normal and the sample size is adequate, typically n > 30 (a normality test can be run, or the data plotted to see how close it is to a bell shape). Good luck.
[*On robust statistics: if significant outliers are present, robust statistics become imperative in your calculations: the median in place of the mean, and the normalized median absolute deviation, MADN(x), in place of the standard deviation.]
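A minimal sketch of those two robust summaries, assuming the usual definition MADN(x) = MAD(x) / 0.6745, which makes it comparable to the standard deviation for normal data. The data values and the helper name `median_and_madn` are made up for illustration.

```python
import numpy as np

def median_and_madn(x):
    """Return the median and the normalized median absolute deviation."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))   # median absolute deviation
    return med, mad / 0.6745           # rescale to match the SD under normality

# Hypothetical data with one gross outlier.
data = [2.1, 2.4, 2.2, 2.3, 2.5, 50.0]
med, madn = median_and_madn(data)

print(f"mean = {np.mean(data):.2f}, SD = {np.std(data, ddof=1):.2f}")
print(f"median = {med:.2f}, MADN = {madn:.3f}")
```

The single outlier drags the mean and standard deviation far from the bulk of the data, while the median and MADN barely notice it.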