When computing per-gene statistics such as the mean or variance of expression across cells, it is unfair to compare the raw counts of, say, gene A between cells that were sequenced to very different depths. Cells differ in their total transcript count (library size), so the effective resolution is lower in some cells and higher in others. I am interested to know how one deals with this problem. For example, in the attached plots I calculated the mean and standard deviation of every gene across all cells in a dataset, even though, as the second plot shows, the library size (the sum of transcripts per cell) varies across cells and follows a distribution of its own. I suspect that because of this, some genuinely highly variable genes may not show a high variance or Fano factor, whereas some low-variance genes may be picked up as highly variable. One fix I tried is to compute the statistics separately within each library-size bin. Are there better methods out there?
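
The binning idea above can be sketched roughly as follows. This is a minimal illustration, not the exact analysis behind the plots: the function name `fano_by_library_size_bin`, the cells-by-genes layout of `counts`, the quantile-based bin edges, and the synthetic Poisson data are all assumptions made for the example.

```python
import numpy as np

def fano_by_library_size_bin(counts, n_bins=5):
    """Per-gene Fano factor (variance / mean) within library-size bins.

    counts: array of raw counts, shape (cells, genes) -- assumed layout.
    Returns (fano, bin_idx): fano has shape (n_bins, genes), with NaN
    where a bin has fewer than 2 cells or a gene has zero mean in that bin.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1)  # total transcripts per cell

    # Quantile edges so each bin holds roughly the same number of cells.
    edges = np.quantile(lib_size, np.linspace(0, 1, n_bins + 1))
    # Only interior edges are needed; bin indices run from 0 to n_bins - 1.
    bin_idx = np.digitize(lib_size, edges[1:-1], right=True)

    fano = np.full((n_bins, counts.shape[1]), np.nan)
    for b in range(n_bins):
        in_bin = counts[bin_idx == b]
        if in_bin.shape[0] < 2:
            continue  # not enough cells to estimate a variance
        mean = in_bin.mean(axis=0)
        var = in_bin.var(axis=0, ddof=1)
        # Leave NaN where the gene is not detected in this bin.
        np.divide(var, mean, out=fano[b], where=mean > 0)
    return fano, bin_idx

# Synthetic example: 200 cells, 3 genes, per-cell depth varying 4-fold.
rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 2.0, size=(200, 1))
counts = rng.poisson(depth * np.array([1.0, 5.0, 20.0]))
fano, bin_idx = fano_by_library_size_bin(counts, n_bins=4)
```

Each row of `fano` can then be compared across bins for the same gene: a gene whose Fano factor stays high in every bin is less likely to be an artifact of sequencing depth than one that only looks variable when all cells are pooled.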
#scRNAseq #dropletsequencing #NGS #bigdatagenomics