Why are RNA-seq data always modeled with a negative binomial distribution? Which parameters or assumptions make RNA-seq data fit the negative binomial distribution?
Thanks Robert and Diego! The paper and discussion were very helpful.
The bottom lines:
1) In differential expression (DE) experiments, distribution models are fitted to each gene's expression across replicates, not to the genes within a single sample.
2) Discrete RNA-seq counts can be modeled with a Poisson distribution when the variance-to-mean ratio is 1 (e.g. RNA-seq from technical replicates). TSPM and GLM-based algorithms start from the Poisson distribution and assess the variance-to-mean ratio in different ways: they use a quasi-Poisson model if the ratio is > 1 and a Poisson model if the ratio is 1.
3) Discrete RNA-seq counts from biological replicates are fit better by the negative binomial (NB) distribution, which allows the data to be overdispersed (variance > mean). The packages “edgeR”, “DESeq”, and “baySeq” use the NB distribution to model RNA-seq data. They differ mainly in how they estimate the overdispersion.
4) A recent paper by Wagner et al. used a combination of two distributions: an exponential distribution for non-expressed genes (inactive genes, with expression ≤ 2 TPM) and a negative binomial distribution for expressed genes (active genes, with expression > 2 TPM).
“Wagner GP, Kin K, Lynch VJ. (2013) A model based criterion for gene expression calls using RNA-seq data. Theory Biosci”
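To make points 2 and 3 concrete, here is a small simulation sketch in Python with NumPy. It assumes the common NB parameterization var = mu + alpha * mu^2, where alpha is the dispersion. The function names, the values of `mu` and `alpha`, and the large number of simulated replicates are my own choices for illustration. The simple moment estimate below is not how edgeR, DESeq, or baySeq estimate dispersion; those packages share information across genes.

```python
import numpy as np


def dispersion_index(counts):
    """Variance-to-mean ratio of one gene's counts across replicates."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()


def moment_dispersion(counts):
    """Method-of-moments estimate of alpha in var = mu + alpha * mu**2."""
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean()
    var = counts.var(ddof=1)
    # Clip at zero: a Poisson-like gene can give a slightly negative estimate.
    return max((var - mu) / mu**2, 0.0)


rng = np.random.default_rng(0)
mu, alpha = 100.0, 0.2   # assumed mean count and dispersion for one gene
n = 5000                 # unrealistically many replicates, to stabilize the moments

# "Technical replicates": pure Poisson sampling noise, variance ~ mean.
poisson_counts = rng.poisson(mu, size=n)

# "Biological replicates": the true expression level varies between samples.
# A gamma-distributed rate (mean mu, variance alpha * mu**2) followed by
# Poisson sampling yields negative binomial counts.
rates = rng.gamma(shape=1.0 / alpha, scale=alpha * mu, size=n)
nb_counts = rng.poisson(rates)

poisson_ratio = dispersion_index(poisson_counts)
nb_ratio = dispersion_index(nb_counts)
nb_alpha_hat = moment_dispersion(nb_counts)

print(f"Poisson variance/mean: {poisson_ratio:.2f}")
print(f"NB variance/mean:      {nb_ratio:.2f}")
print(f"NB estimated alpha:    {nb_alpha_hat:.3f}")
```

With these settings the Poisson ratio should come out close to 1 and the NB ratio close to 1 + alpha * mu = 21, which is the overdispersion that biological variability adds on top of the sampling noise.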