How to deal with pseudoreplicates in linear discriminant analysis?

14 February 2020

Within a project about geographical traceability of horticultural products, we would like to apply classification models to our data set (e.g. LDA) to predict if it is possible to correctly classify samples according to their origin and based on the results of 20-25 different chemical variables.

We identified 5 cultivation areas and selected 41 orchards (experimental units) in total. In each orchard, 10 samples were collected (each sample from a different tree). The samples were analyzed separately. So, at the end, we have the results for 410 samples.

The question is: should the 10 samples per orchard be considered pseudoreplicates, since they belong to the same experimental unit (even though they were collected from independent trees)? Should the LDA be performed on 41 replicates (the 41 orchards, taking the average of the 10 samples), or should we run it on the whole dataset?

Thank you for your help.

Nick VL Serão

Dear Agnese Aguzzoni ,

This is a very relevant and usually overlooked issue! Thanks for asking!

The trees are sampled independently within an orchard, but samples from the same orchard share that orchard's effect and are therefore correlated with each other. Therefore, using the 410 samples directly in your analysis, without any adjustment, is not appropriate. It is NOT wrong to use all 410 samples, but the data must be adjusted (i.e. the "orchard" effect must be statistically removed from your observations) prior to your discriminant analysis.

You mentioned that you have 20-25 chemicals. Let's say that you are analyzing chemical 1 (CH1). Based on the information provided, your statistical model for the analysis of CH1 would be:

CH1 = µ + Cultivation_Area + Orchard + e

where CH1 is the observed level of CH1, µ is the general mean, "Cultivation_Area" is the fixed effect of cultivation area, "Orchard" is the random effect of orchard, and "e" is the random error, resulting from the trees within each orchard (BTW, I am assuming that Orchard is somewhat nested within Cultivation_Area). In this example, you are using 410 samples for the analysis.

But, the problem is that, in your LDA, you do not want the effect of "Orchard" to have any impact on your results. Thus, you must adjust your data for it. From the statistical model above, you are only really interested in "Cultivation_Area" and the random variation, given by "e" (not that we want variation, but, variation is part of reality, and you need to acknowledge it).

Thus, from the model above, you can obtain the estimates of "Cultivation_Area" and "e" for each sample, namely "CA_hat" and "e_hat" respectively, and sum them to calculate an adjusted CH1 value, namely "CH1*". In other words, CH1* = CA_hat + e_hat.

With this pre-adjusted value, you can now run your LDA using data that does not include the effect of "Orchard", which is a systematic effect that you are not interested in (please note, I am making the assumption that "Orchard" is independent from "Cultivation_Area").
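The adjustment described above can be sketched in a few lines of Python for a single chemical variable. This is only a minimal illustration under stated assumptions, not a replacement for a proper mixed-model fit: it assumes the same number of samples in every orchard, orchard labels that are unique across areas, and it estimates the variance components by the method of moments from the nested ANOVA table rather than by REML. The function name `adjust_for_orchard` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def adjust_for_orchard(values, areas, orchards):
    """Return values with the predicted orchard effect removed (CA_hat + e_hat).

    Assumption: every orchard has the same number of samples (>= 2) and
    orchards are nested within areas, with unique orchard labels.
    """
    # Group observations by orchard and remember each orchard's area.
    by_orchard = defaultdict(list)
    area_of = {}
    for y, a, o in zip(values, areas, orchards):
        by_orchard[o].append(y)
        area_of[o] = a

    n = len(next(iter(by_orchard.values())))  # samples per orchard
    if n < 2 or any(len(ys) != n for ys in by_orchard.values()):
        raise ValueError("this sketch needs >= 2 samples and equal counts per orchard")

    orchard_mean = {o: fmean(ys) for o, ys in by_orchard.items()}

    # CA_hat (mu + area effect): the mean of the orchard means in each area.
    means_in_area = defaultdict(list)
    for o, m in orchard_mean.items():
        means_in_area[area_of[o]].append(m)
    ca_hat = {a: fmean(ms) for a, ms in means_in_area.items()}

    n_orch, n_area = len(orchard_mean), len(ca_hat)
    if n_orch <= n_area:
        raise ValueError("need more orchards than areas to estimate orchard variance")

    # Method-of-moments variance components from the nested ANOVA mean squares.
    ss_orch = n * sum((m - ca_hat[area_of[o]]) ** 2 for o, m in orchard_mean.items())
    ss_err = sum((y - orchard_mean[o]) ** 2 for o, ys in by_orchard.items() for y in ys)
    ms_orch = ss_orch / (n_orch - n_area)
    ms_err = ss_err / (n_orch * (n - 1))
    var_orch = max(0.0, (ms_orch - ms_err) / n)

    # Shrinkage applied to each orchard's deviation from its area mean.
    denom = var_orch + ms_err / n
    shrink = var_orch / denom if denom > 0 else 0.0

    # Adjusted value = observed value minus the predicted orchard effect.
    return [
        y - shrink * (orchard_mean[o] - ca_hat[area_of[o]])
        for y, o in zip(values, orchards)
    ]


# Toy data: 2 areas, 2 orchards per area, 3 trees per orchard.
values = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8, 10.0, 10.2, 9.8, 12.0, 12.2, 11.8]
areas = ["A"] * 6 + ["B"] * 6
orchards = ["A1"] * 3 + ["A2"] * 3 + ["B1"] * 3 + ["B2"] * 3
adjusted = adjust_for_orchard(values, areas, orchards)
```

The same function would be applied to each of the 20-25 chemical variables in turn. In the adjusted values the area means are unchanged and the tree-to-tree variation is kept, while the differences between orchards of the same area are shrunk towards zero.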

Your other approach, averaging the values within each "Orchard", follows a similar idea (you are removing the effect of "Orchard"!!!). However, with this approach you are also removing potentially true variation, due to differences between trees within orchards.
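For comparison, the averaging approach amounts to collapsing the 410 sample rows to 41 rows of orchard means. A minimal sketch, assuming each row is stored as a hypothetical `(area, orchard, [chemical values])` tuple:

```python
from collections import defaultdict
from statistics import fmean


def orchard_means(rows):
    """Collapse sample rows to one row of variable means per orchard.

    Each input row is (area, orchard, [chemical values]); this layout is
    an assumption made for the example.
    """
    grouped = defaultdict(list)
    area_of = {}
    for area, orchard, chem in rows:
        grouped[orchard].append(chem)
        area_of[orchard] = area

    # zip(*samples) transposes the samples so each variable is averaged separately.
    return [
        (area_of[o], o, [fmean(col) for col in zip(*samples)])
        for o, samples in grouped.items()
    ]


# Toy data: two samples from orchard A1, one from B1, two variables each.
rows = [
    ("A", "A1", [1.0, 10.0]),
    ("A", "A1", [3.0, 20.0]),
    ("B", "B1", [5.0, 30.0]),
]
means = orchard_means(rows)
```

The resulting rows, one per orchard, would then be the input to the LDA.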

Thus, both can be used, but if you use all 410 samples, you should pre-adjust the data first.

Although you did not mention anything about this, I suggest using some sort of cross-validation, such as leave-one-out cross-validation, to properly evaluate the predictive ability of your model.
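A cross-validation loop of this kind can be sketched as follows. Two things here are assumptions rather than part of the advice above: the classifier is a simple nearest-centroid rule used as a stand-in (it is NOT LDA, and would be swapped for the real model), and the loop holds out one "group" at a time so that it can be run either as plain leave-one-out (each sample is its own group) or with whole orchards held out together. With the orchard means as input, the two coincide.

```python
from statistics import fmean


def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier (NOT LDA): assign each test row to the closest class mean."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [fmean(col) for col in zip(*members)]

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [min(centroids, key=lambda lab: sq_dist(x, centroids[lab])) for x in test_x]


def leave_one_group_out_accuracy(x, y, groups, predict=nearest_centroid_predict):
    """Hold out one group at a time and return the fraction of correct predictions."""
    correct = 0
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        predicted = predict(
            [x[i] for i in train], [y[i] for i in train], [x[i] for i in test]
        )
        correct += sum(p == y[i] for p, i in zip(predicted, test))
    return correct / len(y)


# Toy data: one variable, two areas, two orchards per area, two trees per orchard.
x = [[1.0], [1.2], [0.9], [1.1], [5.0], [5.2], [4.9], [5.1]]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]
orchard = ["A1", "A1", "A2", "A2", "B1", "B1", "B2", "B2"]

acc_by_orchard = leave_one_group_out_accuracy(x, y, orchard)
acc_by_sample = leave_one_group_out_accuracy(x, y, list(range(len(y))))
```

Holding out whole orchards keeps trees from the same orchard out of the training set when that orchard is being predicted, which is one way to keep the pseudoreplication issue from inflating the estimated accuracy.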

Please let me know if you have other questions. Thanks, Nick

Oyekola Oluyimika Oloyede

Look at these papers

Article Application of multivariable optimal discriminant analysis i...

https://academic.oup.com/bioinformatics/article/28/4/531/211887

Agnese Aguzzoni

Dear Oyekola Oluyimika Oloyede , thanks for the reading suggestion.

Dear Nick VL Serão ,

Thank you very much for the informative answer.

Just a few additional questions:

If I want to run ANOVA before the LDA, can I also apply it to the new variables CH* (from which I have deleted the random effect) considering the whole dataset?

Do you have any suggestion about an R package with functions through which I can transform my variables?

Many thanks,

agnese
