There are different ways of using BLUPs for and from GWAS. However, there are also limitations!
Generally speaking, we want to use phenotypic records in all statistical analyses. Sometimes, however, this is not possible (for several reasons) and we may use BLUPs instead. Here are three common reasons for using BLUPs instead of phenotypes:
Example 1: Complex models
- Some software packages cannot perform certain types of analysis, such as, but not limited to: including random effects other than the residual, repeated records (related to the previous one), multivariate analysis (a.k.a. multiple-trait analysis), etc.
- In this case, one could run the GWAS on BLUPs that are already adjusted for these effects.
Example 2: Limited phenotypes
- Given a genetic relationship matrix (e.g., the A matrix or a Genomic Relationship Matrix) that measures the genetic similarities (i.e., covariances) among individuals, the Mixed Model Equations (MME; Henderson, 1963) allow every single individual in the relationship matrix to have a BLUP, regardless of whether the individual has phenotypic records.
- Hence, when the genotypic data include individuals with and WITHOUT phenotypic records, a larger dataset used for analysis could be obtained by using BLUPs instead of phenotypic records.
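As a rough illustration of Example 2, here is a minimal sketch of solving the MME with NumPy. Everything in it is an assumption made for illustration: the function name `solve_mme`, the three-animal relationship matrix, the two records, and the variance ratio `lam` (treated as known). It is not meant to represent how any particular GWAS or BLUP software does this.

```python
import numpy as np

def solve_mme(y, X, Z, A, lam):
    """Solve Henderson's mixed model equations for y = Xb + Zu + e.

    lam is sigma2_e / sigma2_a and is assumed to be known.
    Returns (fixed-effect solutions, BLUPs for every individual in A).
    """
    A_inv = np.linalg.inv(A)
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + lam * A_inv]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    n_fixed = X.shape[1]
    return sol[:n_fixed], sol[n_fixed:]

# Hypothetical toy data: animal 1 is the sire of animal 3; animal 2 is unrelated.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.0],
              [0.5, 0.0, 1.0]])
y = np.array([12.0, 8.0])           # records exist for animals 1 and 2 only
X = np.ones((2, 1))                 # overall mean as the only fixed effect
Z = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])     # no row links a record to animal 3
lam = 1.0                           # assumed ratio, equivalent to h2 = 0.5

mean_hat, blup = solve_mme(y, X, Z, A, lam)
print(mean_hat, blup)
```

In this toy case the estimated mean is 10 and the BLUPs are 1, -1 and 0.5: animal 3 has no record of its own, yet it receives half of its sire's BLUP through the relationship matrix.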
Example 3: Individuals with many progeny records
- In general terms, when an individual in the relationship matrix has lots (i.e., hundreds to thousands) of progeny records, the BLUP of this individual should be highly accurate.
- Hence, if a large number of genotyped individuals have a large number of (non-genotyped or sparsely genotyped) progeny with phenotypic records, using BLUPs in place of their own phenotypic records could provide better, more accurate GWAS results.
However, as I mentioned, there are also limitations to this approach:
Example 4: BLUPs have different accuracies
- In real data, we see large variation in the number and degree of relationships among individuals, the number of phenotypic records, and more.
- Therefore, some BLUPs will be more accurate than others. HENCE, a statistical analysis using BLUPs should be properly weighted by the level of uncertainty of these BLUPs.
- Such a weighting procedure could be complex or impossible to implement (depending on the software and dataset).
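To make the weighting idea in Example 4 concrete, here is a minimal sketch of a single-SNP weighted least-squares regression in NumPy. The function name, the toy data, and above all the choice of using each BLUP's reliability directly as its weight are illustrative assumptions; published weighting schemes are more involved, and the right one depends on how the BLUPs were obtained.

```python
import numpy as np

def weighted_snp_regression(blup, genotype, weights):
    """Regress BLUPs on one SNP (coded 0/1/2) by weighted least squares.

    Returns [intercept, slope]. Each row is scaled by sqrt(weight),
    which is equivalent to minimising sum(w_i * residual_i**2).
    """
    X = np.column_stack([np.ones(len(genotype)), genotype])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], blup * sw, rcond=None)
    return coef

# Hypothetical data: the last BLUP is extreme but has very low reliability.
genotype = np.array([0.0, 1.0, 2.0, 2.0])
blup = np.array([0.0, 0.5, 1.0, 3.0])
reliability = np.array([0.9, 0.9, 0.9, 0.1])   # assumed to be available

unweighted = weighted_snp_regression(blup, genotype, np.ones(4))
weighted = weighted_snp_regression(blup, genotype, reliability)
print(unweighted[1], weighted[1])   # SNP effect without and with weights
```

With these made-up numbers, the weighted slope is pulled much less by the poorly supported fourth BLUP than the unweighted slope is.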
Example 5: BLUPs are only part of the story
- BLUPs are estimates of the additive values of the individuals. Thus, if the goal of your GWAS is to identify non-additive effects, such as dominance and epistasis, an analysis based on BLUPs is not expected to identify associations for SNPs with non-additive effects.
- Therefore, BLUPs, unless specifically calculated to include those effects*, should not provide you with these associations.
*By the way, if non-additive effects are included, we shouldn't call them BLUPs anymore.
There are additional thoughts about this, but I think this could give you some clarity.
Please let me know if you have any other questions. Thanks, Nick
BLUPs are used when the analysis relies on a single-trait animal model: the correlation coefficient between the studied traits is estimated from their BLUPs, so that one trait can be improved genetically by selecting for another trait that has a high heritability (h2).
The reason is probably more historical than theoretical. Since the introduction of BLUPs by Henderson in the 1970s, they have been used to predict breeding values (additive genetic values) in animal breeding, especially when the trait cannot be measured directly on an individual (for instance, the breeding value of a bull for milk production). BLUPs can be predictions of purely random effects or of a mixture of random and fixed effects (which, in the end, is also random). The randomness of the additive genetic component is reflected in the assumption of a (usually known) distribution with an expected value of 0 and a variance-covariance structure given by a relationship matrix multiplied by a scaling factor, a.k.a. the additive genetic variance. The relationship matrix can range from an identity matrix (where you assume that all individuals are unrelated) to a matrix built from kinship coefficients (from pedigree records or markers); it is the inverse of this matrix that enters the mixed model equations.
BLUEs and BLUPs have similar properties: i) both are "best" (they minimize the variance, or expected error, of their sampling distributions); ii) both are linear, because they come from linear models (the relationship between the left and right sides of the model equations); iii) both are unbiased, because their mathematical expectation is exactly the parameter they are intended to estimate. The difference lies in the assumption made about the effects: fixed (BLUEs) or random (BLUPs).
One disadvantage of BLUPs is that their values are compressed (shrunk) towards the mean compared with BLUEs. The degree of compression is a function of the amount of information on which a particular prediction is based, which means it is not a single linear rescaling: usually, the less information you have, the more compressed the BLUPs are. Plant breeders, in contrast, tend to use BLUEs, because traits can in general be measured directly on hermaphrodite plants and the wider (less compressed) distributions of BLUEs give them better power to discriminate between extremes.
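A minimal sketch of how the compression depends on the amount of information, under strong simplifying assumptions: unrelated genotypes, known variance components, and the same number of records per genotype. In that setting the BLUP of a genotype is its deviation from the overall mean multiplied by n*sigma2_g / (n*sigma2_g + sigma2_e). The function name and the variance values below are hypothetical.

```python
def shrinkage_factor(n_records, sigma2_g, sigma2_e):
    """Factor applied to (genotype mean - overall mean) to obtain the BLUP,
    assuming unrelated genotypes and known variance components."""
    return n_records * sigma2_g / (n_records * sigma2_g + sigma2_e)

# Assumed variances giving a single-record heritability of 0.25.
sigma2_g, sigma2_e = 1.0, 3.0
factors = {n: shrinkage_factor(n, sigma2_g, sigma2_e) for n in (1, 3, 9)}
print(factors)   # {1: 0.25, 3: 0.5, 9: 0.75}
```

The factor moves towards 1 as records accumulate, so under these assumptions a genotype with few records is pulled much closer to the mean than one with many records.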
In GWAS, calculating R-squared values with BLUPs will, in theory, give you a direct measure of the additive genetic variance explained by a marker, whereas with BLUEs it would be the amount of phenotypic variance explained by the marker. It is thus advisable to divide R-squared values computed with BLUEs by the narrow-sense heritability of the trait. This gives a measure of the additive genetic variance explained by the marker when BLUEs are used.
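The rescaling described above is simple arithmetic; a small sketch, with made-up BLUEs, marker genotypes and heritability, might look like this. The function name and all numbers are hypothetical, and in small samples the rescaled value can exceed 1, so it should be read as a rough guide only.

```python
import numpy as np

def marker_r2(values, genotype):
    """Squared Pearson correlation between trait values and a SNP coded 0/1/2."""
    r = np.corrcoef(values, genotype)[0, 1]
    return r ** 2

# Hypothetical BLUEs for six lines and their genotypes at one marker.
blue = np.array([1.0, 2.0, 2.0, 0.0, 1.0, 3.0])
genotype = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
h2 = 0.8                                 # assumed narrow-sense heritability

r2_phenotypic = marker_r2(blue, genotype)    # share of phenotypic variance
r2_additive = r2_phenotypic / h2             # rescaled to additive variance
print(r2_phenotypic, r2_additive)
```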
You estimate BLUPs simply because you are interested in isolating the genetic component of the phenotypic variation; thus, you do not want confounding signals from environmental or genotype-by-environment effects. Note also that, for this purpose, you are not limited to BLUPs: you can use any sort of model that yields main genetic effects. For instance, you can use Bayesian models, neural networks, or any other model that lets you estimate genetic variation free of noise, or even use a one-step analysis that estimates marker associations while jointly modelling environmental and genotype-by-environment factors.
My short answer would be as follows: it is done to save computational time when analysing very large datasets with lots of genotypes (>100k SNP panels) and individuals (>1000). It lets models (MLMs, for example) run faster, because you can first estimate the effects of multiple covariates (year, location, field, collection, replicate, etc.) and produce a single BLUP value per individual that already accounts for all of these effects.
We use BLUPs to get closer to the actual population mean, that is, to remove any confounding effects in the records that would otherwise introduce error or bias.