Hi everyone!

I analyzed 20 tissue samples of oral leukoplakia (OL, an oral potentially malignant disorder) by untargeted metabolomics to compare the metabolic profiles of lesions that underwent malignant transformation (n = 5) with those that did not (n = 15). I know the small sample size is an important limitation of the study, but OL is a rare disease and I have to work with what I have.

When I run a multivariate analysis such as PLS-DA on my complete dataset (around 4,000 compounds), the model is overfitted and has a negative Q2. However, when I use only the 72 compounds that were statistically significant in the univariate analysis (hypothesis tests) as input, Q2 rises to 0.6. The same improvement appears when I build a heatmap from this reduced dataset: it clearly separates the malignant-transformed OL from the non-transformed OL. Interestingly, most of the compounds at the top of the PLS-DA VIP list are the same whether I use the whole dataset or the 72 discriminant features as input.

I recently presented my thesis to a metabolomics specialist, and she told me that my analysis is curious and that she could not tell me whether it is right or wrong.

Could anyone here help me with this question?

Thanks!
