- I mapped my whole genome against the reference strain and found 100 bp gaps between all the contigs. Is this something to be worried about? How can I overcome this shortcoming in my sequence?
Your genome will be incomplete with those gaps. Design primers in the flanking regions, sequence the PCR product and fill the gaps before submission.
Thanks for your kind reply. The sequencing company reported no gaps in the draft genome. The gaps appear only after I align the draft genome with the reference strain. Does that make sense?
Yes, these are poly-N, and the software (CONTIGUator) has an option "Do not use N to separate the contigs". When I check this option I see no Ns or gaps.
I think those 100 bp of N are used to separate the contigs, which means you can see where each contig starts and ends. You can also check whether the read coverage is high or low. If you see some reads overlapping the 100 bp of N but with low coverage, I don't think you should be worried about them.
Thanks for your reply. I am very new to bioinformatics. I could not find any gaps in my contigs other than those 100 bp gaps between all contigs. Is it possible to have no gaps at all in a WGS? Can I submit my whole genome sequence with the above gaps, or do I need to fill them?
Depending on how you got your base reads sequenced (e.g. Illumina HiSeq, PacBio, Oxford Nanopore, etc.), the size and complexity of the target genome, the depth of coverage, and the amount of resources you spend on it, it may be possible to remove the gaps.
You should be able to submit it without filling everything. Pretty much every genome I've seen has Ns in it, but I've only dealt with plant genomes so I may be wrong there.
As I understand it, with bacterial genomes the poly-N areas depict one of two things: either an actual gap in the sequencing or a separate piece of DNA (e.g. a plasmid). I don't know how the assembler would be able to differentiate between the two, so adding a large segment of Ns denotes the uncertainty of what's being assembled. I'm not familiar with CONTIGUator, so I can't comment on the specifics of that software though.
It is normal for genomes to be published with gaps. Gaps of 100 Ns are short anyway and won't affect most of the downstream analyses you may want to do, as long as you have sufficiently large contigs.
The way to close these gaps is to do Sanger sequencing, designing primers to produce sequences that, ideally, will span the gaps. Another way is to use long reads (e.g. PacBio), usually complemented by short reads (e.g. Illumina).
My impression is that these gaps of 100 bp (100 "N") were introduced artificially and do not say anything about the actual size of the gap. Some people insert 50 "N", others 100 "N". If it is always the same number, it certainly comes from the assembly. The consequence is that the sequence before these "N" and the one after them correspond to individual contigs and are not necessarily next to each other in the genome. That said, PCR won't help, unless you want to try all combinations. Moreover, since you do not know the size and complexity of the gaps, you may not even be able to PCR-amplify the gap (e.g. if the gap is 10000 bp in reality).
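To check the "always the same number" point yourself, you can list the length of every N run in the pseudo-genome. This is only a rough sketch in plain Python, assuming the sequence has already been read into a string; the function name `n_run_lengths` and the toy sequence are made up for illustration.

```python
import re
from collections import Counter

def n_run_lengths(seq):
    """Return the length of every run of N (or n) in seq, in order of appearance."""
    return [len(m.group()) for m in re.finditer(r"[Nn]+", seq)]

# Hypothetical toy data: three contigs joined by two 100-N spacers.
toy = "ACGT" * 10 + "N" * 100 + "GGCC" * 5 + "N" * 100 + "TTAA" * 8

# Tally how often each run length occurs.
print(Counter(n_run_lengths(toy)))  # Counter({100: 2})
```

If every run has exactly the same length (all 100, say), that strongly suggests spacers inserted by the software rather than measured gap sizes.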
Well, if I were you, I would submit the genome sequence as it is, but I would first remove all these stretches of "N" and generate separate contigs.
BTW, could you please tell us how many of the N100 pieces you have?
There are 50 ordered contigs and roughly the same number of N sequences. Can you tell me whether we can use unmapped contigs to fill these gaps, and how?
I do not know what you mean by unmapped contigs. I guess you work with Escherichia coli? You could check for the most closely related strain that has been completely sequenced, and then map all your contigs against this reference genome sequence.
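Removing the N stretches and getting separate contigs back could look roughly like this. It is a sketch, not a tested pipeline: the names `split_at_n_runs` and `to_fasta` are hypothetical, and the `min_run=10` threshold is my own assumption so that isolated ambiguous bases inside a contig are left alone.

```python
import re

def split_at_n_runs(seq, min_run=10):
    """Split seq at every run of at least min_run N characters.

    min_run is an assumed threshold: shorter N runs are treated as
    ambiguous bases inside a contig and are kept.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_run)
    # Drop empty pieces produced by N runs at the very start or end.
    return [piece for piece in pattern.split(seq) if piece]

def to_fasta(contigs, prefix="contig"):
    """Format the contigs as FASTA text with numbered headers."""
    records = [">%s_%d\n%s" % (prefix, i, c) for i, c in enumerate(contigs, start=1)]
    return "\n".join(records) + "\n"

# Hypothetical toy scaffold: two contigs separated by a 100-N spacer.
scaffold = "ACGTACGT" + "N" * 100 + "GGCCGGCC"
print(to_fasta(split_at_n_runs(scaffold)))
```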
I assume you want to fill the gaps between contigs in the draft genome. In bioinformatics you can 'scaffold' the contigs using a tool like SSPACE, or use GapFiller to fill the gaps, i.e. to get a better assembly of your draft genome. Not sure if this is helpful.
Well, during sequence submission at NCBI, one has to introduce 100 Ns artificially if the distance between two contigs is not known but they are supposed to be part of the same scaffold. I think you should align the contigs to the reference strain both before and after the gaps are introduced.
Merging the available draft genomes and reassembling them is one option. Secondly, removing repeat elements helps significantly with gap removal and coverage improvement. We found assembly tools such as SPAdes really useful for reassembly after merging the draft genomes.
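To make the spacer convention concrete, here is a small illustration of ordered contigs being joined with fixed 100-N spacers. It is an assumption-laden sketch (the function `join_with_spacers` is hypothetical and is not how any particular tool implements it), but it shows why the spacer length carries no information about the real gap.

```python
def join_with_spacers(contigs, spacer_len=100):
    """Concatenate ordered contigs, putting spacer_len Ns between neighbours."""
    return ("N" * spacer_len).join(contigs)

# Hypothetical toy contigs, already ordered against a reference.
contigs = ["ACGTACGT", "GGCCGGCC", "TTAATTAA"]
scaffold = join_with_spacers(contigs)

# n contigs give n - 1 spacers, whatever the true gap sizes are.
print(scaffold.count("N") // 100)  # 2
```

With 50 ordered contigs this gives 49 spacers, which fits the "roughly the same number" reported above.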