I have been working to close the gaps in a bacterial genome and have tried several programs, but I keep ending up at dead ends. So far I've used Contiguator, Mauve, and currently the CLC Genome Finishing Module.
How did you sequence your genome: Illumina, PacBio, or something else? What are your read length and your coverage?
You cannot expect to get a (reliably) closed genome with short reads. No assembly algorithm can resolve repeat regions and/or insert regions that are longer than the reads themselves. If you are using Illumina, you can try to also sequence with PacBio, which produces longer reads and gives you a better chance of spanning low-complexity regions in a single read. You can also try to create a whole genome map (OpGen) to map your contigs against. If you are close to closing the genome with your current strategy (only a few contigs), you can try to design primers against the edges of the contigs and use good old Sanger sequencing to extend the edges. However, if you still have a lot of contigs, this is not really feasible.
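If you do go the Sanger route, the first step is pulling out the sequence at each contig edge so you can feed it to a primer design tool. A minimal Python sketch, assuming your contigs are in a plain FASTA file; the function names and the window size are my own illustrative choices, not part of any tool mentioned in this thread:

```python
import io


def read_fasta(handle):
    """Parse a FASTA stream into a dict of {contig_name: sequence}."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].split()[0]  # keep only the ID, drop the description
            chunks = []
        else:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def contig_ends(records, window=500):
    """Return {contig_name: (left_end, right_end)} using `window` bases per end.

    Contigs shorter than `window` simply return the whole sequence for both ends.
    `window` is assumed to be a positive integer.
    """
    return {name: (seq[:window], seq[-window:]) for name, seq in records.items()}


# Toy input standing in for a real contig file (hypothetical contig names).
toy_fasta = io.StringIO(">contig1 example\nAAAACCCC\nGGGGTTTT\n>contig2\nACGT\n")
ends = contig_ends(read_fasta(toy_fasta), window=4)
for contig, (left, right) in ends.items():
    print(contig, left, right)
```

Which ends you pair up for PCR depends on the contig order you infer from the reference alignment, so treat the output as raw material for primer design, not as a list of confirmed gaps.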
Before I came on the project, the genome was sequenced using a combination of Illumina and Roche 454. The output was combined using SeqMan Pro, which produced 29 scaffolds. From my understanding so far, the contig lengths range from ~800 bp to ~350 kb and the average coverage is about 3x. I have aligned the contigs to a reference genome, and it does show a lot of gaps. My confusion is that a lot of the large contigs only partially align to the reference, and two contigs do not align at all. I really appreciate your response btw, maybe just telling somebody about this problem will help me figure it out.
Three-fold coverage sounds far too low to expect closure. Are you sure that is right? I think it is more typical to have 100-fold or better coverage before attempting to get a complete bacterial genome. Depending on the species of bacteria, it may or may not be reasonable to expect your isolate to have a large amount of synteny (same gene order) over large regions with any given isolate from the same species.
Brian is right. 3x average genome coverage is too low to get full closure. All the same, how closely related is this reference organism to yours? If it is the same strain, these gaps and unaligned contigs are troubling. If only the same genus...meh, might be real, or might be misassembly or user error...to me, it's always better to start from the beginning. Redo the assembly. That way you know how all the data was processed.
Also, how sure are you that you sequenced a pure strain? Maybe you have some contigs from another organism and that's why a few don't align? Or maybe your organism has a plasmid the reference doesn't have? What are the genes on those contigs--plasmid genes? Or, can you classify those contigs to see if they're from the same organism?
For gap filling you could use Abyss-sealer to close gaps and FinishM to try and join contigs together (https://github.com/wwood/finishm). However, FinishM will change your contig names. Both programs try to take the reads and walk from one end of a gap to the other.
Steven and Brian, both of you mentioned some good ideas, and things I had not thought of. I did suspect that 3x is low in comparison to other data sets that I have seen. As for the purity of the organism, I do not think contamination is likely, since more than 90% of the contigs aligned to the reference. The contigs are from a pathogenic strain of S. dysgalactiae, and are being aligned to a couple of references found in NCBI. Which brings up another question: there are distinct differences in the alignment depending on which reference I use. I have been focusing on the reference with the closest alignment, but I have also aligned the contigs against the two references at the same time, and I have been tempted to design primers based on that but am not sure how that would work out.
Steve, I have thought that I would understand the data much better if I started over, but this is an undergraduate project I am choosing to do on the side and I don't have access to the equipment yet. However, I will learn more about Abyss-sealer and the other program you mentioned.
Do you know of a website or program to see if the large contig that doesn't align may contain genes important for pathogenicity?
Ah, I see. An undergrad project is a different story. As for the purity of the dataset, who knows. Microbial contamination in reagents, and especially in cultures, is more common than people think. Multiple strains are often seen in metagenomic datasets...it's an interesting topic for assembly, as similar DNA sequences with small differences will confuse the crap out of assemblers. This is often where abyss-sealer is useful. The assembler may have died because there were too many possibilities for extending the contig, so it chooses to die instead of creating a chimera. What abyss-sealer does, depending on the parameters you set, is create a consensus of the multiple possible sequences it finds so it can walk further than the assembler wanted to. So closed gaps may not represent one organism...this is in the case of multiple strains being present, of course.
Anyway, I'm not a clinical microbiologist, so I'm not aware of databases or programs to identify pathogens. A quick but adequate solution would be to BLAST those contigs against NCBI and see where their best hits are. Hopefully they're to your bug! You could also use the program CheckM, which looks for single-copy marker genes present in a given genome and gives you a report of the "contamination" of your genome. Multiple copies of those marker genes indicate contamination of your genome. If you have redundant marker genes on those contigs that didn't align, you have some evidence that they don't belong in that genome.
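To make the marker-gene idea concrete: the logic boils down to counting how often each supposedly single-copy marker turns up and on which contigs. This is only a toy Python sketch of that idea, not CheckM's actual code or API; the function name, the `hits` list, and the gene and contig names in it are hypothetical placeholders:

```python
from collections import defaultdict


def duplicated_markers(hits):
    """Given (marker_name, contig_name) pairs, return markers seen more than once.

    The result maps each duplicated marker to the list of contigs carrying it.
    """
    contigs_by_marker = defaultdict(list)
    for marker, contig in hits:
        contigs_by_marker[marker].append(contig)
    return {
        marker: contigs
        for marker, contigs in contigs_by_marker.items()
        if len(contigs) > 1
    }


# Hypothetical hits: one marker shows up on two different contigs.
hits = [
    ("markerA", "contig1"),
    ("markerB", "contig2"),
    ("markerA", "contig29"),
]
print(duplicated_markers(hits))
```

If the duplicated markers keep landing on the contigs that did not align to the reference, that would be a hint (not proof) that those contigs come from something else.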
For an undergrad project, you just have to decide what's worth the effort :-). Do the BLAST thing first.
Is your project specifically to produce a closed assembly? Or is there another scientific question you want answered? In a lot of cases, you do not need a closed assembly to answer your scientific question, so it would be ridiculous to go above and beyond to produce one. For instance, if you want to know how closely related your isolate is to others, a closed assembly is nice, but not necessary. There are more than enough techniques to do this without a closed assembly.
As mentioned by Brian and Steven, three-fold coverage is extremely low. Is this a coverage you derived from the assembly, from the alignment to the reference, or from the raw read file?
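For the raw-read option, the estimate is just total sequenced bases divided by genome size. A rough Python sketch, assuming an uncompressed FASTQ with exactly four lines per record; the function name is made up for illustration, and the genome size is something you have to supply yourself (for example, the length of your closest reference):

```python
import io


def estimate_coverage(fastq_handle, genome_size):
    """Estimate average coverage as (sum of read lengths) / genome_size.

    Assumes a strict 4-line-per-record FASTQ, where the second line of
    each record is the sequence.
    """
    total_bases = 0
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 == 1:  # sequence line of each record
            total_bases += len(line.strip())
    return total_bases / genome_size


# Toy data: three 10 bp reads against a pretend 10 bp "genome".
toy_fastq = io.StringIO(
    "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read2\nTTTTGGGGCC\n+\nIIIIIIIIII\n"
    "@read3\nACGTACGTAC\n+\nIIIIIIIIII\n"
)
toy_coverage = estimate_coverage(toy_fastq, genome_size=10)
print(toy_coverage)  # 30 bases over 10 bp
```

With real data you would open each read file in turn and add up the results across the Illumina and 454 runs. If that number comes out much higher than 3x, the 3x figure probably came from somewhere else in the pipeline.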