Hi everyone.
I'm conducting a metabolomics study comparing two sample groups. I am using SIMCA MVDA ver. 17 for the multivariate analyses. I am stuck between three different methods to assess the reliability of an OPLS-DA model:
1. A PCA as a preliminary check before fitting an OPLS-DA model. From my understanding, the PCA must show separation between the sample groups along the X-axis of the score plot, and the R2 and Q2 from two principal components must be 'closer to 1(?)', before an OPLS-DA can be considered. If the samples are not separated, or the R2/Q2 values are not similar or close to 1 in the PCA, then conducting an OPLS-DA will create an unreliable model.
2. Straight to OPLS-DA, relying on the visual separation + R2 (cum) + Q2 (cum). I've read comments online on RG about going straight to an OPLS-DA, inspecting the separation visually, and reading the cumulative R2 and Q2 values. However, I do not fully understand how to confirm that the model is actually a good predictor, so that the 'differences in chemical compounds' seen in the scores are reliably reflected in the loadings.
3. Straight to OPLS-DA, but using a CV-ANOVA for model reliability. Since this method only requires running a CV-ANOVA and reading the p-value from the generated ANOVA table, I think I would choose it. However, what would the R2 and Q2 values mean in this case? Could they still be used as indicators of model quality in combination with a CV-ANOVA? And should a PCA still be run before conducting this OPLS-DA method?
So my main question is: which method would be most useful in metabolomics studies?
Additional note: I am comparing a very large number of variables (more than 10,000) between the two sample groups, so I see a lot of noise in all my PCAs.
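To make concrete what I mean by R2 and Q2, here is a minimal sketch of how I understand them, written outside SIMCA in Python with NumPy. Please treat it as an illustration under assumptions, not as what SIMCA does internally: it uses a plain PLS-DA (single-response NIPALS PLS on a 0/1 class vector) as a stand-in for OPLS-DA, it only mean-centers the data, and the function names `pls1_fit` and `r2_q2`, the 7-fold split, and the simulated noise data are all my own choices. As far as I understand, R2Y is computed from the fit on all samples, while Q2 = 1 - PRESS/SS uses predictions for samples that were left out during cross-validation.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (NIPALS); return a predict function."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean          # mean-centering only (assumption)
    W, P, c = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)               # weight vector
        t = Xr @ w                           # scores
        tt = t @ t
        p = Xr.T @ t / tt                    # X loadings
        q = yr @ t / tt                      # y loading
        Xr = Xr - np.outer(t, p)             # deflate X
        yr = yr - q * t                      # deflate y
        W.append(w)
        P.append(p)
        c.append(q)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    coef = W @ np.linalg.solve(P.T @ W, c)   # regression coefficients
    return lambda X_new: (X_new - x_mean) @ coef + y_mean


def r2_q2(X, y, n_components=2, n_folds=7, seed=0):
    """Return (R2Y, Q2) for a PLS-DA model with a 0/1 class vector y."""
    y = np.asarray(y, dtype=float)
    ss_tot = np.sum((y - y.mean()) ** 2)

    # R2Y: variation in y explained when the model is fitted on all samples
    y_fit = pls1_fit(X, y, n_components)(X)
    r2 = 1.0 - np.sum((y - y_fit) ** 2) / ss_tot

    # Q2: same formula, but every sample is predicted while it is held out
    order = np.random.default_rng(seed).permutation(len(y))
    y_cv = np.empty_like(y)
    for test in np.array_split(order, n_folds):
        train = np.setdiff1d(np.arange(len(y)), test)
        y_cv[test] = pls1_fit(X[train], y[train], n_components)(X[test])
    q2 = 1.0 - np.sum((y - y_cv) ** 2) / ss_tot
    return r2, q2


# Simulated example: 20 samples, 500 pure-noise variables, two groups of 10
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 500))
y = np.repeat([0.0, 1.0], 10)
r2, q2 = r2_q2(X, y)
print(f"R2Y = {r2:.2f}, Q2 = {q2:.2f}")
```

If I understand correctly, a large gap between R2Y and Q2 on data like this (many variables, few samples, no real group difference) is exactly the kind of overfitting I am worried about with my own data set.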