Actually, that depends on genome size, the degree of annotation, and the genome build.
For example, for human, 10M mapped reads are good enough for differential gene expression analysis. Since the human genome is well annotated and assembled, most of your reads will map to it.
In your case, if the genome is not well assembled, you should expect fewer reads to map to the genome or transcriptome, so you would need deeper sequencing to get >10X coverage of your genome of interest.
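To make the mapping-rate point concrete, here is a back-of-the-envelope sketch in Python. The 10M mapped-read target is the figure from the human example; the two mapping rates (90% and 50%) are made-up illustrative values, not measurements for any species, and `reads_to_sequence` is just a helper name chosen for this example.

```python
import math

def reads_to_sequence(target_mapped_reads, mapping_rate):
    """Raw reads needed so that roughly `target_mapped_reads` end up mapped."""
    if not 0 < mapping_rate <= 1:
        raise ValueError("mapping_rate must be in (0, 1]")
    return math.ceil(target_mapped_reads / mapping_rate)

TARGET = 10_000_000  # 10M mapped reads, as in the human example

# Hypothetical mapping rates: a well-assembled vs. a poorly assembled reference
well_assembled = reads_to_sequence(TARGET, 0.90)
poorly_assembled = reads_to_sequence(TARGET, 0.50)

print(f"90% mapping: {well_assembled / 1e6:.1f}M raw reads")
print(f"50% mapping: {poorly_assembled / 1e6:.1f}M raw reads")
```

If only half the reads map, you need to sequence about twice as many to end up with the same number of usable reads.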
A reference genome is available for the weed species I will work on. I am planning to go with 25M reads per sample. Will that be OK, or should I decrease the depth to something like 20M, 15M or 10M? Waiting for your answers.
I cannot add anything to read depth since I agree with the answers above.
But even more important than read depth (down to a certain threshold) is the number of replicates, since they are the basis on which you are going to do your statistics. Here I would always recommend 3-4, since that is where most tools approach the saturation level.
Did you read the paper Michael suggested? From reading it, it should be clear that depth doesn't matter as much as replication.
For a given lane of HiSeq data you get about 150M paired-end reads. You could get 15 samples with 10M reads each or 1 sample with 150M reads. Getting one sample with 150M reads will *not* give you better data than 3 samples with 5 replicates each at 10M reads per replicate. Always go for replicates over depth.
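The lane arithmetic can be sketched in a few lines of Python. This is only an illustration: it takes the ~150M reads per lane quoted here at face value and assumes libraries are pooled perfectly evenly, which real runs never quite achieve. The depths tried (10M, 25M, 30M) are the ones mentioned in this thread.

```python
LANE_READS_M = 150  # ~150M reads per HiSeq lane, the figure quoted above

def samples_per_lane(depth_m, lane_reads_m=LANE_READS_M):
    """Number of samples that fit in one lane at `depth_m` million reads each."""
    return lane_reads_m // depth_m

# Assumes perfectly even pooling across libraries (an idealisation).
for depth in (10, 25, 30, 150):
    n = samples_per_lane(depth)
    print(f"{depth}M reads/sample -> {n} sample(s) per lane")
```

At 10M reads each, one lane holds 15 libraries (for example 3 conditions with 5 replicates); at 25M it holds only 6, and at 30M only 5.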
For our Arabidopsis studies we always do a minimum of 30M reads per replicate. Stranded RNAseq is very important.
Hi, I want to ask how to modify the R script so that it runs the data at a sequencing depth of 30M instead of 2.5M. The current data is at 2.5M reads. Thanks.
Hello Christine and Manvendra: Thank you. I am very new to R and just learning it. I have data which is at 2.5M reads and would like to convert it so that it is at 30M. I don't know how to do that. Here is the script:
# Read the data in "count_matrix_sub_2.5M.txt"
# and run a DESeq2 experiment comparing the transcriptome
# of untreated MCF-7 tumor cells to those treated with estrogen.
# There are 7 replicates for each of the conditions (untreated, treated)
Hello Christian and Manvendra: I just tried changing 2.5M to 30M wherever it appears in the script, but I am not sure if that's the correct way of doing it. Also, I wanted to know why the plots generated (histogram, MA plot) differ between the two scripts. What is the difference in the plots when I use 30M instead of 2.5M? I know it gives better results, but I want to know exactly how. Sorry, I may be asking basic/simple questions, but I really want to understand the difference it makes when I change the read depth. I also want to know if I need to change the "condition" from character to numeric by doing this: #f
You're overcomplicating things; your makeDataSet() function is unnecessary. Most importantly, specifying the number of reads the way you're doing is wrong unless you have a very good reason to do so, especially as you're setting each sample to exactly the same number of reads, which is never the case in RNA-seq. DESeq2 expects the read depth to differ between samples and normalises for it. Just let it do its calculations for you.
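To show what "normalising for depth" means in practice, here is a toy Python re-implementation of the median-of-ratios idea that DESeq2's size-factor estimation is based on. It is a minimal sketch for intuition only, not DESeq2's actual code, and the count values are invented.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(counts[0])
    # Genes with a zero in any sample have no usable geometric mean; skip them.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the per-gene geometric mean across samples (a pseudo-reference).
    log_ref = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Invented counts: sample 2 was sequenced about twice as deep as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],  # dropped from the estimate because of the zero
]

factors = size_factors(counts)
normalised = [[c / f for c, f in zip(row, factors)] for row in counts]
print([round(f, 3) for f in factors])
```

The deeper sample gets a size factor twice as large, so after dividing by the size factors the first three genes have identical normalised counts in both samples. This is why there is no need to force every sample to a fixed read number by hand.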
I recommend you follow the DESeq2 manual, especially section 1.3.3, Count Matrix Input. You should only need to do something like this:
Dear Manvendra, as you said, sequencing read depth for RNA-seq depends on genome size. Please let me know what the rule of thumb is. If I have a 2500 Mb genome and want to perform de novo assembly (sequenced on Illumina HiSeq, paired-end, 100 bp read length), what is the minimum amount of sequencing data needed for differential gene expression analysis?
@shweta I presume you mean genome assembly? Is the genome haploid, diploid or polyploid? The higher the ploidy of a genome, the more complex the assembly will be.
Genome assembly is highly dependent on the length of repetitive elements in the genome. The more there are and the longer they are, the harder the assembly will be. For that reason, raw sequence read length is among the most important requirements for a good assembly. 100 bp PE Illumina sequencing will guarantee you get a highly fragmented and incomplete genome, especially for one the size of 2.5 Gb.
For any assembly, I would recommend going with PacBio or Nanopore sequencing, which give reads in the 10,000-100,000 bp range. Ignore the "problems" of read errors; they are far simpler to fix than an incomplete/fragmented genome.
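As a rough sanity check on raw sequencing volume for the 2.5 Gb genome and 100 bp paired-end reads mentioned in the question, fold coverage can be estimated as total sequenced bases divided by genome size. The Python sketch below makes two assumptions: the ~150M reads per lane quoted earlier in the thread is treated as 150M read pairs, and the 10X target is simply the figure mentioned in the first answer, not a recommendation for assembly. Coverage only measures volume; it does nothing about the fragmentation caused by short reads.

```python
import math

def fold_coverage(n_fragments, read_length_bp, genome_size_bp, paired=True):
    """Average fold coverage: total sequenced bases / genome size."""
    bases = n_fragments * read_length_bp * (2 if paired else 1)
    return bases / genome_size_bp

def fragments_for_coverage(target_cov, read_length_bp, genome_size_bp, paired=True):
    """Fragments (read pairs if paired) needed to reach `target_cov` on average."""
    bases_per_fragment = read_length_bp * (2 if paired else 1)
    return math.ceil(target_cov * genome_size_bp / bases_per_fragment)

GENOME_BP = 2_500_000_000  # 2.5 Gb genome from the question
READ_BP = 100              # 100 bp paired-end reads from the question

# Assumption: one lane yields about 150M read pairs.
one_lane = fold_coverage(150_000_000, READ_BP, GENOME_BP)
pairs_for_10x = fragments_for_coverage(10, READ_BP, GENOME_BP)

print(f"One lane: about {one_lane:.0f}X")
print(f"Read pairs for 10X: {pairs_for_10x / 1e6:.0f}M")
```

Under those assumptions a single lane gives about 12X, and 10X needs about 125M read pairs.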