How much data is needed for analysis of differential expression using RNA-seq data without genome/transcriptome reference?

Depending of the species and the number of transcripts that you have represented in your population. Genes more expressed (with more transcripts) will have a higher depth and will produce longer consensus sequences. Also it will depend of the sequencing technology and the software used for the assembly, some assemblers produce better results than other. For example in the Trinity article (Grabherr MG. et al. 2011)(http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3571712/) they reconstructed ~ 86% of S. pombe transcripts with 50M reads (~3.65 Gb). For mouse, with more genes and more complex patterns, the percentage is similar (~86%) for the expressed genes (54% in this case), with 52.6M of reads (~4 Gb). A more recent paper (http://www.biomedcentral.com/content/pdf/1471-2164-14-167.pdf) support values of the same order (~2-3 Gb) (Francis WR. et al 2013).

Once you have your reference, probably you will need around 20-25M mapped read per sample to have a good expression measure (http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf), so with the combination of two samples (of the same accession/strain/variety) should be enough to reach ~4 Gb (although more sequences could be better if you have a server big enough to assemble them).

Also I think that the length of the reads should be consider. Longer is better (my experience is that 100 bp can give good results, but I imagine that 150 or 250 should be better).

I guess that these numbers are applicable to diploid and not highly heterozygous species, in more complex scenarios like some plants probably you'll need two or three times these sequence amounts.

Aureliano Bombarely

Also I think that the length of the reads should be consider. Longer is better (my experience is that 100 bp can give good results, but I imagine that 150 or 250 should be better).

I guess that these numbers are applicable to diploid and not highly heterozygous species, in more complex scenarios like some plants probably you'll need two or three times these sequence amounts.

Georgios Georgakilas

there is increasing evidence (i will re-post with a link to a recent study) that the number of replicates is far more important than the depth of the experiment. In the study i will link, they state that at least 3 replicates are needed in order to extract meaningful results (they suggest at least 5).

Guilherme Toledo

Hi, thanks for help.

Georgios, I agree that replicates are the really important issue due to statistics involved in further analysis. But my concern is about how much to sequence in order to achieve a minimum amount of transcripts that can be compared between replicates and treatments.

Here in our lab we work with several non-model plant species, without any information about genome/transcriptome size in some cases. And some species are polyploids.

Aureliano, what you say is that I have to try to assemble the transcriptomes near to completion to achieve a useful comparison between samples? If we assemble a partial transcriptome (but same ratio for all samples) is enough for detection of some differentialy expressed transcripts?

We're considering use of 4 gb for each replicate, but your last statement about plants and ploidy got us re-thinking about all the experiments.

Aureliano Bombarely

Hi Guilherme,

You really don't need to assembly the transcriptomes near to completion, but more complete that you have your transcriptome easier is going to be the analysis. For example if you don't have enough sequence some of the transcripts will be assembly in two or more contigs. In your analysis you will treat them as separate genes, but they will be the same. I think that 4 Gb per replicate is good.

Assemble a polyploid transcriptome can be hard. I have experience trying to assembly N. tabacum using 454 and we estimate that we collapsed more than 2/3 of the transcriptome (http://www.biomedcentral.com/1471-2164/13/406/). Illumina reads are shorter so the assembler can have more problems to split the homoeologs regions. Collapse homoeologs in a polyploid is not a problem if you collapse a high percentage of the homoeologs, you always can try to separate them later (the problem is when you collapse the most conserved ones)

Here my suggestions for polyploid assemblies:

+ Use pair end information (you will be able to phase polymorphism from the same gene).

+ Longer reads, better (min. 100, 150 is great and 250 even better).

+ If you know the diploid progenitors, sequence them. You can use them to separate the homoeolog reads before the assembly.

+ If you are using OLP assembler, use a high identity percentage, but keeping in mind the error rate of the sequencing. If you are using a Bruijn graph assembler use high kmer values (so don't use Trinity as first option because it uses a fixed kmer of 25, try SOAPdenovo-trans and ABySS-trans first).

I teach a bioinformatics course and I did a presentation for RNAseq analysis, explaining some of the problem that you can find with polyploid and recent whole genome duplications (http://www.slideshare.net/aubombarely/rnaseq-analysis-19910448)

Good luck !!!

Guilherme Toledo

Aureliano, thank you so much for all the time you spend answering my basic questions. All this information is of great value to a bioinformatics beginner like me. Best regards.

Leandro Costa do Nascimento

Hi Guilherme

I think that 5 Gb per replicate is a good quantity. But, if you sequence more you will have a better and less fragmented assembly.

As stated, the species of your work is a good question. I worked with denovo transcriptome assembly of a lot of plants, and the better result that I reached was about two contigs per gene.

Speaking about how much to sequence I recommend this article:

http://www.biomedcentral.com/1471-2164/13/734

Moreover, if you want to talk personally (we are both at Unicamp) about this, please let me know.

Best regards,

Leandro

Use of pooled samples in RNA-seq experiments. Can we have research gate community experience and opinion about this issue?

What kind of evolutive questions can be resolved with RNA-seq experiments?

Is there a problem with my RNA pellet?

Can I base on reverse DNA sequences to perform alignment, convert to amino acids and GenBank submission?

Which Scopus Journal provides the most affordable fees?

Seeking Advice on Viability and Execution of Undergraduate Thesis Topic?

Strugglling with m6A dot blot any suugesstion ?

Who will be moral responsible for the death of thousands of people in the event of an earthquake?

How to confirm the site-directed mutagenesis result without performing NGS?

RNA Extraction Using Hot Borate Method No Longer Working?

Does Anyone have expertise in in vitro transcription and RNA pull down assay?

Are there any instruments for studying time similar to the way it is in space?