I'm working on a binary classification task using the Unscrambler software, but my data overlap greatly and I can't get any distinct grouping of my samples from the PCA scores plot. How do I proceed with my analysis?
You could try discriminant analysis (also known as canonical discriminant analysis), but Unscrambler unfortunately doesn't offer this algorithm; many other software packages do. Depending on the shape of your matrix, you can either use it directly or use the PCs from your PCA analysis (as mentioned by Michel).
Best regards
Damien
PS: maybe you can tell us a bit more about your data matrix.
If a scores plot does not give you a clear distinction between groups, it may not mean that there is no distinction; it could simply mean that the largest source(s) of variation is/are similar in both groups. Simple things to try in that case are (i) look at higher PCs, where your clusters may sometimes pop up, or (ii) use some form of preprocessing such as scaling (autoscaling, etc.) or a transformation (log10). This may help reduce the overall variation present in both groups.
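The two suggestions above can be sketched outside Unscrambler as well. Below is a minimal Python illustration (using scikit-learn, not Unscrambler; the data are synthetic and purely for demonstration) of autoscaling followed by extracting more than two components so that higher score pairs such as PC3 vs PC4 can be inspected:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data matrix: 100 samples x 20 features with very different scales
X = rng.normal(size=(100, 20)) * rng.uniform(0.1, 10, size=20)

# Autoscaling: zero mean and unit variance per feature
X_scaled = StandardScaler().fit_transform(X)

# Fit enough components to look beyond PC1/PC2
pca = PCA(n_components=6).fit(X_scaled)
scores = pca.transform(X_scaled)

# Score pairs beyond the first two, e.g. PC3 vs PC4, for plotting
pc3_pc4 = scores[:, 2:4]
variance_explained = pca.explained_variance_ratio_
```

In Unscrambler the equivalent is choosing autoscaling (1/SDev weighting) in the model setup and then switching the scores plot to higher component pairs.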
If it is just about knowing which features differ between groups, you can also find significant ones using a simple t-test. If you want to do that multivariately, try a classification method such as PLS-DA or PC-DA. In these cases you will need good validation, since these methods will always come up with significant differences.
In addition to all the above comments, I would like to emphasize some important issues:
1. In order to make your features comparable, first make sure you have normalized them before applying PCA or any classification algorithm to your data, i.e. zero mean and unit variance for all features.
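The normalization in point 1 (often called autoscaling in chemometrics) is just a per-column operation; a minimal numpy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic matrix: three features on very different scales
X = rng.normal(loc=[5.0, -3.0, 100.0], scale=[0.1, 1.0, 50.0], size=(200, 3))

mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=0)
X_auto = (X - mu) / sigma   # each column now has mean 0 and variance 1
```

When classifying new samples later (e.g. in SIMCA), the same `mu` and `sigma` from the training set must be applied to the new data, not recomputed.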
2. PCA stands for Principal Component "Analysis". As its name indicates, this algorithm is aimed at better understanding and analyzing a data set and its features. PCA is used for dimension reduction, but it is not itself a classifier. Nevertheless, one can use the resulting lower-dimensional data set for classification with no problem.
3. There are also extensions of PCA that make it applicable to non-linear feature sets. You may want to study those extended versions of PCA.
4. While using PCA, be aware of the effect of high-variance features on the results. Such a feature may carry no information at all and can mislead you and your classifier. Imagine noise in the data (signal noise) that comes up as the most important feature in PCA but is actually worthless!
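Point 4 is easy to demonstrate on synthetic data (my own toy construction, not from the thread): one noisy column with huge variance but no information will dominate PC1 of unscaled data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
informative = rng.normal(0, 1, size=(300, 5))    # modest-variance features
noise = rng.normal(0, 100, size=(300, 1))        # huge variance, pure noise
X = np.hstack([informative, noise])

pca = PCA(n_components=2).fit(X)                 # no scaling on purpose
loading_pc1 = pca.components_[0]

# PC1's loading is almost entirely on the noise column (index 5)
dominant = int(np.argmax(np.abs(loading_pc1)))
```

Autoscaling the columns first (as in point 1) removes this artifact, which is exactly why scaling and variance inspection go together.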
Thank you all for your submissions. My data matrix is 335 x 167. I'm starting with the NIPALS algorithm and was planning to use SIMCA for final classification/testing of new samples in the Unscrambler software. I will continue based on your advice. However, class A of my dependent variable has 312 of the 335 samples, with class B having only 23 samples. Do you think this imbalance might be responsible for the inseparability? Dr Peyman, can you please suggest an example of a PCA extension to try out? Thank you.
Search for the keywords "Non-Linear PCA" and "Weighted PCA".
If I find any articles, I will suggest them to you.
The number of samples is no concern for PCA (although it is very important for classification); instead, the number of features is the main concern here (the number of fields in each record), i.e. 167 in your case.
You may also have to work on increasing the number of your samples and on making your sampling more uniformly distributed across your sample space.
One important point to keep in mind is that the principal components with the largest variances, say PC1 or PC2, are NOT necessarily the ones with discriminating ability. PCA only provides variance information about the data, not about sample separation. So it can happen that, for example, the 3rd or the 5th PC discriminates samples from different classes.
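This point can be illustrated with a small synthetic example (my own construction): the between-class difference is placed along a low-variance feature, so it shows up in PC3 rather than PC1 or PC2.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
n = 200
# Two high-variance features shared by both classes (no class information)
shared = rng.normal(0, 10, size=(2 * n, 2))
# One low-variance feature that actually separates the classes
disc = np.concatenate([rng.normal(-1, 0.3, n), rng.normal(1, 0.3, n)])
X = np.column_stack([shared, disc])
y = np.array([0] * n + [1] * n)

scores = PCA(n_components=3).fit_transform(X)

def separation(s):
    # Standardized mean difference between the two classes along one score vector
    return abs(s[y == 0].mean() - s[y == 1].mean()) / s.std()
```

Here `separation(scores[:, 0])` is near zero while `separation(scores[:, 2])` is large, so a PC1 vs PC2 scores plot would look hopeless even though the classes are cleanly separable on PC3.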