I am trying to multiplex as many bacterial genomes as possible, and I would like to know what the current target is for depth of coverage: 100x, 50x, or 25x?
According to my study (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0104579) 50x (Illumina) is necessary for SNP detection using reference-guided assembly. Using de-novo we have calculated that 60x coverage is necessary for accurate SNP detection. Generally, I would say aim for 75x.
It depends on what you are trying to do. If you have a reference genome and you are resequencing, even 20x coverage will be enough to get an idea of the variation in your sample. However, if you are assembling the genome, then between 50x and 100x will be more than enough. All of this assumes you are talking about Illumina reads.
Again, you have to tell us what your goal is. If you are using other technologies, for example 454 or PacBio, 20x and 70x will be enough, respectively.
So, unless you tell us more, it will be difficult to answer your question.
I would add that whatever the coverage is, try assembling different amounts of the data, because I have seen 454/Illumina assemblies get worse (more and shorter contigs) as coverage increases. More is not always better.
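One way to try different amounts of data is to randomly subsample the reads down to several target depths and assemble each subset. The sketch below only illustrates the arithmetic and the random selection; the function names `keep_fraction` and `subsample` are made up for this example, and it assumes reads (or read pairs) are identified by simple IDs and that all sequenced bases count towards coverage.

```python
import random


def keep_fraction(total_bases, genome_size, target_coverage):
    """Fraction of reads to keep so the expected depth is about target_coverage."""
    current_coverage = total_bases / genome_size
    # Never ask for more data than is available.
    return min(1.0, target_coverage / current_coverage)


def subsample(read_ids, fraction, seed=0):
    """Keep each read (or read pair) with probability `fraction`, reproducibly."""
    rng = random.Random(seed)
    return [read_id for read_id in read_ids if rng.random() < fraction]


if __name__ == "__main__":
    # Hypothetical data set: 1 Gb of sequence for a 5 Mb genome (200x).
    for target in (25, 50, 100):
        frac = keep_fraction(1_000_000_000, 5_000_000, target)
        print(f"{target}x -> keep {frac:.3f} of the reads")
```

For paired-end data the same selection has to be applied to both mates, which is why the sketch works on pair IDs rather than on individual reads.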
What they said ^ but also keep in mind that evenness or consistency of coverage is also important. Data of X mean coverage with a small standard deviation produces different results to data of X mean coverage with a larger standard deviation. Especially with de novo work. On some platforms PCR-free prep methods improve results by removing amplification bias and evening out coverage. More even coverage can mean that lower depth of coverage becomes suitable for a given analysis. So for an amplified library 100X may be needed to minimize 0 coverage regions while 50X would be fine on a PCR-free library.
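To make the point about evenness concrete, here is a small sketch that compares two hypothetical per-base depth profiles with the same mean. The depth values are invented for illustration (in practice they could come from a per-position depth table, for example the output of a tool such as samtools depth), and `coverage_summary` is a name chosen for this example.

```python
from statistics import mean, pstdev


def coverage_summary(depths):
    """Summarise per-base depth: mean, standard deviation, CV and zero-coverage fraction."""
    m = mean(depths)
    sd = pstdev(depths)
    return {
        "mean": m,
        "sd": sd,
        # Coefficient of variation: lower means more even coverage.
        "cv": sd / m if m else float("nan"),
        "zero_fraction": sum(1 for d in depths if d == 0) / len(depths),
    }


if __name__ == "__main__":
    even = [50] * 10                                   # hypothetical, perfectly even
    uneven = [0, 0, 100, 100, 50, 50, 50, 50, 50, 50]  # hypothetical, same mean
    print(coverage_summary(even))
    print(coverage_summary(uneven))
```

Both profiles have a mean depth of 50, but only the second one has positions with no coverage at all, which is the situation where extra depth is needed to compensate.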
The chosen sequencing depth (or coverage) should be high enough to minimize the size of unsequenced regions, but within limits. The reason is that most programs use a de Bruijn graph representation of the data to assemble the genome sequence. Considering the high error rates of current NGS technologies, using too many reads will make the graph unnecessarily complex (e.g. with more bubbles, more tips, more erroneous connections), requiring more computational resources (e.g. memory, time) and making the assembly significantly more difficult.
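As a rough guide to how quickly unsequenced regions shrink with depth, the classic Lander-Waterman approximation treats read starts as uniformly random, so the chance that a given base is not covered at mean depth c is about e^(-c). The sketch below applies that idealised model; real libraries are less even (GC bias, amplification bias), so treat the numbers as optimistic lower bounds. The function names are chosen for this example.

```python
import math


def uncovered_fraction(mean_coverage):
    """Poisson (Lander-Waterman) estimate of the fraction of bases with zero coverage."""
    return math.exp(-mean_coverage)


def expected_uncovered_bases(genome_size, mean_coverage):
    """Expected number of uncovered bases under the same idealised model."""
    return genome_size * uncovered_fraction(mean_coverage)


if __name__ == "__main__":
    genome_size = 5_000_000  # hypothetical 5 Mb bacterial genome
    for c in (5, 10, 20, 50):
        bases = expected_uncovered_bases(genome_size, c)
        print(f"{c}x: about {bases:.3g} uncovered bases expected")
```

Under this model the expected gaps become negligible well before 50x, which is consistent with the point that adding more reads beyond some depth mainly adds errors and graph complexity rather than new sequence.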
To sequence bacterial genomes de novo (6 to 10 Mb genomes) with Illumina HiSeq (2 x 100 bp, paired end), we have tried multiplexing 3 to 6 genomes, and from the assemblies we estimated that we can multiplex up to 9 still getting our full genomes in
You must consider the G+C content of your organism before selecting sequencing depth and/or assemblers and assembly parameters. For example, on a de Bruijn graph based assembler, 1000X Illumina data from a 45% G+C bacterium will perform completely differently from the same depth of data from a 75% G+C organism (like Actinomycetes).
As noted, it depends on your goal: genome mapping for SNPs, novel gene clusters, or de novo sequencing, assembly and closure. We sequenced 42 bacterial genomes (5-6 Mbp) with Nextera libraries in a single Illumina HiSeq 2 x 150 run. We got ~160 million reads (80 million pairs), which resulted in genome assemblies with 100-200X coverage and 50-100 scaffolds. The two biggest factors we have noticed affecting our genome assemblies are 1) the average insert size of the Nextera library (smaller, 150 nt, is bad; 250-300 is better) and 2) the size and number of repeats (in our case IS elements). If you want to close genomes like ours, a complementary approach (PacBio) or TruSeq libraries with significantly larger insert sizes are necessary.
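The numbers above can be checked with a back-of-the-envelope calculation: mean depth is total sequenced bases divided by (number of genomes x genome size). The sketch below assumes equal pooling of all libraries and that every read is usable, which is optimistic; the 5.5 Mbp genome size is simply the midpoint of the 5-6 Mbp range quoted, and the function names are chosen for this example.

```python
def mean_coverage(n_reads, read_length, n_genomes, genome_size):
    """Expected mean depth per genome, assuming reads are split evenly across genomes."""
    return n_reads * read_length / (n_genomes * genome_size)


def max_genomes(n_reads, read_length, genome_size, target_coverage):
    """Largest number of equally pooled genomes that still reaches target_coverage."""
    return int(n_reads * read_length // (genome_size * target_coverage))


if __name__ == "__main__":
    # Figures from the post: ~160 million reads of 150 nt, 42 genomes of ~5.5 Mbp.
    depth = mean_coverage(160_000_000, 150, 42, 5_500_000)
    print(f"expected depth per genome: about {depth:.0f}x")

    # How many such genomes could share the same run at a 75x target?
    print(max_genomes(160_000_000, 150, 5_500_000, 75))
```

With these inputs the expected depth comes out at roughly 104x per genome, at the low end of the 100-200X range reported, as one would expect if pooling was not perfectly even.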
Patrick, I would be happy to have an idea of the distribution of pair sizes in your genome assemblies. Was it very variable from one genome to another, and what were the average pair size and the range of the most frequent ones? Cheers.
Thank you to everyone who has answered. I guess part of my problem was the simplicity of the question. I am not able to answer many of the additional questions asked, but these answers (and questions) will be helpful in guiding me towards the right coverage depth for my current project.
Lionel, using the Nextera XT kit, the library distributions going into the Illumina sequencer were pretty broad (150 bp-1 kb). This distribution was based on the fragment profile from the BioAnalyzer chip. However, the output (sequenced) fragment sizes seemed pretty tight. I don't have the stats handy at the moment, but I'd estimate that >90% of inserts were within +/-50 nt of the mean (150-250 nt). I'm guessing this is an artifact of amplification bias during colony formation? If you need firmer numbers I can talk to my student.
Thank you Patrick. I'm very worried about the insert sizes you gave me! With the TruSeq kit it was much better, as we could get inserts with a mean of 700 bp by size-selecting the fragments on a gel, and the assemblies were very good! But now Illumina has discontinued this kit and we have to use the Nextera!
Hi, to summarize and add a few points: The quality of the genome assembly, e.g. measured by contig number or N50 value, depends on
- coverage
- simple reads versus paired-ends/mate-pairs
- read length (HiSeq vs. MiSeq vs. 454 vs. PacBio vs. ...)
- the genome structure itself.
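For readers unfamiliar with N50: it is the contig length such that contigs of that length or longer contain at least half of the total assembled bases. A minimal sketch of the usual calculation follows; the contig lengths in the example are invented, and assembly tools may differ in details such as minimum contig length filters.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


if __name__ == "__main__":
    # Hypothetical contig lengths (arbitrary units).
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 together with a lower contig count generally indicates a more contiguous assembly, although neither number says anything about correctness.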
We have used 96x multiplexing of HiSeq, 2 x 100 nt reads. The first experiment used insert sizes of 250 - 300 bp, the second experiment used the Nextera XT with insert sizes from 250 to 1500 bp. We did not see any significant change due to the larger insert sizes (although I did not confirm them - this statement is just based on the company's information).
We targeted bacterial genomes of about 5 Mbp. The coverage was typically between 50x and 100x, resulting in ca. 200 contigs for enterobacteria, between 250 and 1000 contigs for Xanthomonas (depending on the species) (e.g. BioProjects PRJNA266384, PRJNA266386, PRJNA266603, PRJNA266604, PRJNA267193, PRJNA266578), and more than 1000 contigs for one strain of Burkholderia (genome size 6.6 Mbp, BioProject PRJNA267193). We tried Edena and Velvet for assembly. Depending on the species, one or the other assembly program performed better, and Edena seemed to be better for repetitive regions. Watch out for our Genome Announcements in issue 3 (1), in press.
Is there anybody here who has used PacBio for bacterial genome sequencing with no more than 10x coverage? What was the result?
I mean, given that the new PacBio kits can produce reads with a mean length of more than 10 kbp, is it essential to have more coverage for de novo genome assembly? I think that with reads of 10 kbp mean length, a lower coverage (10x) is enough.