I read on SEQanswers something about 4-5 Gb of paired-end data per replicate to ensure adequate depth. Any help from people with experience in this kind of analysis is appreciated.
It depends on the species and on the number of transcripts represented in your population. More highly expressed genes (with more transcripts) will have higher depth and will produce longer consensus sequences. It also depends on the sequencing technology and the assembly software; some assemblers produce better results than others. For example, in the Trinity article (Grabherr MG. et al. 2011) (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3571712/) they reconstructed ~86% of S. pombe transcripts with 50M reads (~3.65 Gb). For mouse, with more genes and more complex patterns, the percentage is similar (~86%) for the expressed genes (54% of genes in this case), with 52.6M reads (~4 Gb). A more recent paper (Francis WR. et al. 2013) (http://www.biomedcentral.com/content/pdf/1471-2164-14-167.pdf) supports values of the same order (~2-3 Gb).
Once you have your reference, you will probably need around 20-25M mapped reads per sample to get a good expression measure (http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf), so combining two samples (of the same accession/strain/variety) should be enough to reach ~4 Gb (although more sequence could be better if you have a server big enough to assemble it).
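As a back-of-the-envelope check of those numbers, here is a minimal sketch of the arithmetic (the 20M reads and 100 bp below are just example values consistent with the ~20-25M reads per sample and ~4 Gb targets mentioned above, not figures from any of the cited papers):

```python
# Back-of-the-envelope sequencing-depth arithmetic:
# total bases = number of reads x read length.

def total_gigabases(n_reads, read_length):
    """Total sequenced bases, expressed in gigabases (Gb)."""
    return n_reads * read_length / 1e9

per_sample = total_gigabases(20_000_000, 100)  # one sample: 20M reads of 100 bp
pooled = 2 * per_sample                        # two samples pooled for assembly
print(per_sample, pooled)
```

With 2x100 bp paired-end data, 20M reads is 10M pairs, so two pooled samples reach the ~4 Gb mark.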
Also, I think the read length should be considered. Longer is better (my experience is that 100 bp can give good results, but I imagine 150 or 250 bp should be better).
I guess these numbers apply to diploid, not highly heterozygous species; in more complex scenarios, like some plants, you'll probably need two or three times these amounts of sequence.
There is increasing evidence (I will re-post with a link to a recent study) that the number of replicates is far more important than the depth of the experiment. In the study I will link, they state that at least 3 replicates are needed in order to extract meaningful results (they suggest at least 5).
Georgios, I agree that replicates are the really important issue, given the statistics involved in the downstream analysis. But my concern is how much to sequence in order to recover a minimum set of transcripts that can be compared between replicates and treatments.
Here in our lab we work with several non-model plant species, in some cases without any information about genome/transcriptome size. And some species are polyploid.
Aureliano, are you saying that I have to try to assemble the transcriptomes to near completion to achieve a useful comparison between samples? If we assemble a partial transcriptome (but the same fraction for all samples), is that enough to detect some differentially expressed transcripts?
We're considering using 4 Gb for each replicate, but your last statement about plants and ploidy got us re-thinking all the experiments.
You really don't need to assemble the transcriptomes to near completion, but the more complete your transcriptome is, the easier the analysis is going to be. For example, if you don't have enough sequence, some transcripts will be assembled into two or more contigs. In your analysis you will treat them as separate genes, even though they are the same one. I think 4 Gb per replicate is good.
Assembling a polyploid transcriptome can be hard. I have experience trying to assemble N. tabacum using 454, and we estimated that we collapsed more than 2/3 of the transcriptome (http://www.biomedcentral.com/1471-2164/13/406/). Illumina reads are shorter, so the assembler can have more trouble splitting the homoeologous regions. Collapsing homoeologs in a polyploid is not a problem if you collapse a high percentage of them; you can always try to separate them later (the problem is when you collapse only the most conserved ones).
Here are my suggestions for polyploid assemblies:
+ Use paired-end information (you will be able to phase polymorphisms from the same gene).
+ The longer the reads, the better (min. 100 bp; 150 is great and 250 even better).
+ If you know the diploid progenitors, sequence them. You can use them to separate the homoeologous reads before the assembly.
+ If you are using an OLC (overlap-layout-consensus) assembler, use a high identity percentage, but keep in mind the error rate of the sequencing. If you are using a de Bruijn graph assembler, use high k-mer values (so don't use Trinity as a first option, because it uses a fixed k-mer of 25; try SOAPdenovo-Trans and Trans-ABySS first).
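To make the k-mer trade-off in the last suggestion concrete, here is a small sketch: a larger k gives more specificity for keeping homoeologs apart, but each k-mer is more likely to contain a sequencing error, and each read contributes fewer k-mers. The 1% per-base error rate is an assumed Illumina-like value for illustration, not a figure from this thread.

```python
# Trade-offs of the de Bruijn graph k-mer size:
# - specificity: more positions must match exactly within each k-mer
# - error tolerance: P(k-mer is error-free) = (1 - e)^k shrinks as k grows
# - yield: a read of length L contributes L - k + 1 k-mers

def error_free_fraction(k, per_base_error=0.01):
    """Probability that a single k-mer contains no sequencing errors."""
    return (1 - per_base_error) ** k

def kmers_per_read(read_length, k):
    """Number of k-mers one read of read_length contributes."""
    return max(read_length - k + 1, 0)

for k in (25, 31, 63):
    print(k, error_free_fraction(k), kmers_per_read(100, k))
```

With 100 bp reads, going from k=25 to k=63 roughly halves both the fraction of error-free k-mers and the k-mers obtained per read, which is part of why longer reads make high-k assemblies more practical.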
I teach a bioinformatics course and I did a presentation on RNA-seq analysis, explaining some of the problems you can find with polyploids and recent whole-genome duplications (http://www.slideshare.net/aubombarely/rnaseq-analysis-19910448).
Aureliano, thank you so much for all the time you spent answering my basic questions. All this information is of great value to a bioinformatics beginner like me. Best regards.
I think 5 Gb per replicate is a good amount. But if you sequence more, you will have a better, less fragmented assembly.
As stated, which species you work with is an important question. I have worked on de novo transcriptome assembly of a lot of plants, and the best result I reached was about two contigs per gene.
Regarding how much to sequence, I recommend this article:
http://www.biomedcentral.com/1471-2164/13/734
Moreover, if you want to talk about this in person (we are both at Unicamp), please let me know.
Is 6 Gb of data (Illumina HiSeq, paired-end, 100 bp read length) sufficient for de novo assembly of a plant with a genome size of 2500 Mb? How should this be determined?
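For what it's worth, one rough way to frame that question, as a sketch rather than an answer: 6 Gb over a 2500 Mb genome is very low nominal genome coverage, but for a de novo transcriptome what matters is coverage of the expressed transcripts, which is a much smaller target. The ~100 Mb transcriptome size below is an assumed figure for illustration only, not an estimate for your species.

```python
# Nominal fold coverage = total sequenced bases / size of the assembly target.
# For transcriptome assembly the relevant target is the (expressed)
# transcriptome, not the whole genome; actual per-transcript depth also
# varies enormously with expression level, as discussed earlier in the thread.

def nominal_coverage(total_bases, target_size):
    """Average fold coverage over a target of target_size bases."""
    return total_bases / target_size

genome_cov = nominal_coverage(6e9, 2.5e9)          # 6 Gb vs 2500 Mb genome
transcriptome_cov = nominal_coverage(6e9, 100e6)   # 6 Gb vs assumed 100 Mb transcriptome
print(genome_cov, transcriptome_cov)
```

So 6 Gb is only ~2.4x over the genome, but tens of fold over a transcriptome-sized target, which is why transcriptome assemblies get away with far less sequence than genome assemblies.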