In several of the differential expression papers I've been reading and a few bioinformatic tools (e.g. Corset and Cufflinks), there appears to be routine filtering of sequences with low read counts. What is the rationale for this, and how should one choose a threshold?
Typically, low read counts are much noisier than high read counts, i.e. in replicates their variance is larger relative to their mean (see, for example, http://genomebiology.com/2010/11/10/r106 for a paper that describes this behaviour of reads).
Thus, the fold-changes of genes with few reads have a high variance and can easily be very high just by chance (e.g. 1 read vs. 10 reads) without necessarily meaning that the gene expression has changed.
Such fold changes, however, can also come from genes that have a low abundance in one condition but are completely silenced in another, which one would like to detect as differentially expressed.
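To make the variance argument concrete, here is a minimal simulation (my own sketch, not from any of the cited papers) of two conditions with identical true expression, assuming Poisson-distributed read counts; at low means, large fold changes arise purely by chance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 100_000

# Two conditions with IDENTICAL true expression: every observed
# fold change here is pure sampling noise.
for mean_count in (5, 500):
    a = rng.poisson(mean_count, n_genes)
    b = rng.poisson(mean_count, n_genes)
    log2fc = np.log2((a + 1) / (b + 1))   # pseudocount of 1 avoids log(0)
    frac_2fold = (np.abs(log2fc) > 1).mean()
    print(f"mean={mean_count:3d}: sd(log2FC)={log2fc.std():.2f}, "
          f"genes past 2-fold by chance: {frac_2fold:.1%}")
```

At a mean of 5 reads, a substantial fraction of genes exceed a 2-fold change with no real signal; at a mean of 500, essentially none do.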
To properly say whether a change in read counts between two conditions is observed just by chance or reflects a true change in gene expression, a reliable approach is to first model the read count distributions. The modelled distributions can then be used to identify statistically significant differential gene expression. This approach is taken by tools such as DESeq or edgeR.
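DESeq and edgeR both model counts with a negative binomial distribution. The toy sketch below illustrates only the general idea (fit a mean and overdispersion from replicates, then ask how surprising a count is under that model); it is not either tool's actual algorithm, and the counts are made up:

```python
import numpy as np
from scipy import stats

def nb_tail_prob(replicate_counts, observed):
    """P(count >= observed) under a negative binomial fitted to the
    replicates by moment matching. A toy illustration, not DESeq/edgeR."""
    m = np.mean(replicate_counts)
    v = np.var(replicate_counts, ddof=1)
    if v <= m:                        # no overdispersion: fall back to Poisson
        return stats.poisson.sf(observed - 1, m)
    p = m / v                         # scipy's NB: mean = n*(1-p)/p
    n = m * m / (v - m)
    return stats.nbinom.sf(observed - 1, n, p)

# Condition A replicates for one gene; how surprising is 30 reads in condition B?
print(nb_tail_prob([5, 15, 8, 12], observed=30))
```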
Tools that do not model the read distributions try to avoid false calls of differential expression by not looking at genes with low expression (e.g. by requiring a minimum number of reads per gene). This means, however, that these tools cannot detect differential gene expression for lowly expressed genes.
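A minimum-count filter of that kind amounts to something like the following sketch (the threshold and count matrix are made up for illustration):

```python
import numpy as np

counts = np.array([[ 0,  1,  2,  0],    # rows: genes, columns: samples
                   [50, 60, 45, 55],
                   [ 9, 11,  8, 12]])
MIN_READS = 10                           # arbitrary per-sample threshold

# Keep a gene only if every sample has at least MIN_READS reads.
keep = (counts >= MIN_READS).all(axis=1)
print(keep)  # [False  True False]: the lowly expressed genes are dropped entirely
```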
Such few reads usually represent transcriptional noise. With so few reads it is difficult to ascertain whether they are true instances of expression or are present randomly due to sequencing errors. This can be resolved by having multiple replicates of the same sample: if these reads are present in one replicate but not in the others, they are most probably transcriptional noise.

We used this approach to ask the same question of a small RNA expression library. To answer it, we applied Kolmogorov-Smirnov (K-S) statistics to the frequency distribution plots of replicates. The rationale was that the cut-off should be the lowest number of reads at which the distance between the two frequency distributions of the replicates is minimized. This is a valid assumption, since if the replicates were exactly the same, the frequency distribution curves would overlap and the distance would be zero. Thus we defined the noise threshold as the minimum number of reads below which the frequency distribution curves of replicates stop being close to each other. I am attaching the paper that describes this in more detail.
Article: Analysis of deep sequencing microRNA expression profile from...
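The paper describes the exact procedure; as a rough sketch of the idea (function name and simulated data are hypothetical), one can scan candidate cut-offs and take the lowest one at which the K-S distance between the replicates' count distributions is minimized:

```python
import numpy as np
from scipy import stats

def noise_cutoff(rep1, rep2, max_cutoff=50):
    """Lowest read-count cut-off at which the K-S distance between the
    replicates' count distributions is minimized. Sketch only."""
    best_cutoff, best_d = 0, np.inf
    for cutoff in range(max_cutoff + 1):
        a = rep1[rep1 >= cutoff]
        b = rep2[rep2 >= cutoff]
        if a.size == 0 or b.size == 0:
            break
        d = stats.ks_2samp(a, b).statistic
        if d < best_d:
            best_cutoff, best_d = cutoff, d
    return best_cutoff

# Simulated replicates: a shared high-count signal plus replicate-specific
# low-count noise, so the distributions only agree above some cut-off.
rng = np.random.default_rng(1)
signal = rng.poisson(100, 500)           # same true signal in both replicates
rep1 = np.concatenate([signal, rng.poisson(1, 2000)])
rep2 = np.concatenate([signal, rng.poisson(3, 2000)])
print(noise_cutoff(rep1, rep2))
```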
Another approach is to use tools that essentially assume everything is expressed, just not sequenced deeply enough, and thus assign a prior. An example is BitSeq (Bayesian Inference of Transcripts from Sequencing Data), doi:10.1093/bioinformatics/bts260.
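The effect of such a prior is to shrink low-count expression estimates towards it instead of discarding them. A toy conjugate example of the idea (not BitSeq's actual model, which works at the transcript level): a Gamma prior on a Poisson expression rate gives a nonzero estimate with wide uncertainty even at zero observed reads:

```python
from scipy import stats

# Gamma(a, b) prior on the expression rate; with a Poisson likelihood,
# the posterior after observing x reads is Gamma(a + x, b + 1).
a, b = 1.0, 1.0          # hypothetical weak prior: prior mean = a/b = 1 read

for x in (0, 1, 10):
    post_mean = (a + x) / (b + 1)
    # 95% credible interval for the rate
    lo, hi = stats.gamma.ppf([0.025, 0.975], a + x, scale=1 / (b + 1))
    print(f"observed={x:2d}  posterior mean={post_mean:.2f}  "
          f"95% CI=({lo:.2f}, {hi:.2f})")
```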
A general discussion of the influence of lowly expressed transcripts/genes on DE analysis can also be found in some of the papers in the Web Collection of the MAQC consortium SEQC project: http://www.nature.com/nbt/collections/seqc/index.html
When setting arbitrary thresholds we have to be very careful that we're not introducing bias into our data. I strongly dispute the comments that low counts are simply 'noise'.
Read counts for a gene are a function of expression and length. A 10kb transcript with 10 reads is very different from a 200bp transcript with 10 reads: it's a 50-fold difference in expression. In using a simple cut-off they are being treated the same, which may be a problem in your experiment.
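In reads-per-kilobase terms (shown here as a sketch with the numbers from above, ignoring the library-size normalization that full RPKM also includes), the two transcripts differ 50-fold:

```python
def rpk(reads, length_bp):
    """Reads per kilobase of transcript: a simple length normalization."""
    return reads / (length_bp / 1000)

print(rpk(10, 10_000))  # 10 reads on a 10 kb transcript  ->  1.0 reads/kb
print(rpk(10, 200))     # 10 reads on a 200 bp transcript -> 50.0 reads/kb
```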
I agree with Thomas that cut-offs, if any, should be applied across *all* samples, not per sample. So, if a gene's average expression is ~3 reads across all samples, then it's unlikely to be informative.
I would set the threshold much lower, or even 'off', and then analyse the Cufflinks results at the end, keeping the average expression in mind. Any 'significant' genes with low expression should be treated with caution.
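A sketch of that workflow (the count matrix, significance flags, and caution threshold are all hypothetical): compute average expression across *all* samples rather than filtering per sample, and flag, rather than discard, low-mean genes among the significant calls:

```python
import numpy as np

counts = np.array([[ 2,  4,  3,  3],   # rows: genes, columns: all samples
                   [80, 95, 70, 88],
                   [ 0, 12,  1, 15]])
significant = np.array([True, True, True])  # e.g. calls from a DE tool (made up)

mean_expr = counts.mean(axis=1)
LOW_MEAN = 5                                # arbitrary "treat with caution" level

for gene, (sig, m) in enumerate(zip(significant, mean_expr)):
    if sig and m < LOW_MEAN:
        print(f"gene {gene}: significant but mean={m:.1f} reads "
              f"-- interpret with caution")
```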