What criteria are used to determine the quality of SNPs in bacterial whole genome sequences?

Tyler Chafin Popular answer

There are different measures of SNP quality. You could consider quality in terms of the "confidence" of the sequencer, as expressed by the Phred score. For example, in a cluster of fragments on the Illumina flowcell, fragments might become desynchronized such that the cluster yields conflicting fluorescence signals. The ratio of those signals would then be a determinant of the quality of the base call, which is communicated by the Phred score. A SNP variant that was called based on low-quality nucleotides would then be considered low quality.
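To make the Phred score concrete: it is defined as Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. Here is a minimal sketch of the conversion. The function names are my own, and the ASCII offset of 33 assumes the Phred+33 FASTQ encoding used by current Illumina output.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)

def decode_fastq_quality(qual_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Phred+33 encoding); older Illumina
    pipelines used an offset of 64.
    """
    return [ord(c) - offset for c in qual_string]

if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
    print(decode_fastq_quality("I5!"))
```

So a Q20 base has a 1 in 100 chance of being wrong, and a Q30 base a 1 in 1000 chance.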

SNPs could also be considered in terms of their coverage. If a particular position was sequenced 50 times (i.e., it had 50X coverage) and "G" was recovered only once, then it is unlikely that "G" represents a true variant. Essentially, the purpose is to separate true variation from artifacts of sequencer error, PCR error, and other sources of error.
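A rough way to put numbers on the 50X example is to ask how likely it is that errors alone would produce the observed number of non-reference reads. This is only a sketch under simplifying assumptions that are mine: reads are independent, every base has the same error rate (1%, roughly Q20), and every error produces the same alternate base. Real variant callers use more elaborate models.

```python
from math import comb

def alt_allele_fraction(alt_reads, depth):
    """Fraction of reads at a position that support the alternate allele."""
    return alt_reads / depth

def prob_at_least_k_errors(depth, k, error_rate):
    """P(X >= k) for X ~ Binomial(depth, error_rate).

    Probability of seeing k or more erroneous reads at a position
    by sequencing error alone (independence is an assumption).
    """
    return sum(
        comb(depth, i) * error_rate ** i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )

if __name__ == "__main__":
    depth, error_rate = 50, 0.01  # error_rate is an illustrative assumption
    for alt_reads in (1, 5, 25):
        frac = alt_allele_fraction(alt_reads, depth)
        p = prob_at_least_k_errors(depth, alt_reads, error_rate)
        print(f"{alt_reads}/{depth} reads (fraction {frac:.2f}): "
              f"P(error alone) = {p:.3g}")
```

Under these assumptions a single "G" in 50 reads is easily explained by error alone (probability around 0.4), whereas 25 of 50 reads essentially cannot be.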

SNPs might also be filtered based on linkage disequilibrium, minor allele frequency, complexity of the surrounding region, other factors regarding the context of the flanking region, number of alleles, degree of mismatch of the region containing the SNP to the reference genome (mapping quality), etc. There is some good information here: https://bioinf.comav.upv.es/courses/sequence_analysis/snp_calling.html
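In practice several of these criteria are often combined into simple "hard filters". The sketch below is hypothetical: the `SnpCall` record, the field names, and every threshold are assumptions of mine chosen for illustration, not recommended values or the defaults of any particular tool. The high alternate-allele fraction reflects the fact that a bacterial genome is haploid, so a real SNP should be supported by nearly all reads.

```python
from dataclasses import dataclass

@dataclass
class SnpCall:
    """Hypothetical summary of one candidate SNP (not a real library type)."""
    pos: int          # position on the reference
    depth: int        # total reads covering the position
    alt_reads: int    # reads supporting the alternate allele
    qual: float       # Phred-scaled variant quality
    mapq: float       # mean mapping quality of covering reads

def passes_filters(call, min_depth=10, min_alt_fraction=0.9,
                   min_qual=30.0, min_mapq=30.0):
    """Return True if the call clears all thresholds (all illustrative)."""
    if call.depth <= 0 or call.depth < min_depth:
        return False
    alt_fraction = call.alt_reads / call.depth
    return (alt_fraction >= min_alt_fraction
            and call.qual >= min_qual
            and call.mapq >= min_mapq)

if __name__ == "__main__":
    calls = [
        SnpCall(pos=1200, depth=50, alt_reads=49, qual=200.0, mapq=60.0),
        SnpCall(pos=3400, depth=50, alt_reads=1, qual=12.0, mapq=60.0),
        SnpCall(pos=5600, depth=4, alt_reads=4, qual=45.0, mapq=20.0),
    ]
    kept = [c.pos for c in calls if passes_filters(c)]
    print("Positions passing filters:", kept)
```

Appropriate cutoffs depend on your coverage, organism, and variant caller, so treat these numbers as placeholders to tune.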

That being said, the SNP score probably varies depending on which variant detection method you are using. One metric used by GATK, called the "VQSLOD", attempts to combine information from all of these annotations (coverage, quality score, sequence context, error rate, etc.) and uses a machine-learning algorithm to "figure out" which characteristics determine the true quality of a SNP while minimizing false positives (artifacts slipping through the filters). http://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr

Other variant callers (like freebayes: http://arxiv.org/pdf/1207.3907v2.pdf) do things differently, but the concept is the same.

Hopefully that helps in some way

