Could somebody offer a simple explanation of the different parameters used to determine good-quality SNPs in bacterial whole-genome sequences during reference-based assembly?
There are different measures of SNP quality. You could consider quality in terms of the "confidence" of the sequencer, expressed as a Phred score. For example, in a cluster of fragments on the Illumina flowcell, fragments might become desynchronized such that the cluster yields conflicting fluorescence. The ratio of the fluorescence signals would then be a determinant of the quality of the base call, which is communicated by the Phred score. A SNP variant that was called from low-quality nucleotides would then be considered low quality.
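To make the Phred score concrete: it is defined as Q = -10 * log10(p), where p is the estimated probability that the base call is wrong, so Q20 means a 1 in 100 chance of error and Q30 means 1 in 1000. Here is a minimal sketch of the conversion. The function names are my own, and the offset of 33 assumes the common Phred+33 FASTQ encoding (older Illumina files used a different offset).

```python
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score Q into the estimated error probability p."""
    return 10 ** (-q / 10)


def decode_fastq_quality(quality_string: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into per-base Phred scores.

    Assumes Phred+33 encoding by default (an assumption, check your data).
    """
    return [ord(ch) - offset for ch in quality_string]


def count_low_quality_bases(quality_string: str, min_q: int = 20) -> int:
    """Count bases whose Phred score falls below a chosen threshold.

    min_q=20 is an illustrative cutoff, not a recommendation.
    """
    return sum(1 for q in decode_fastq_quality(quality_string) if q < min_q)
```

For example, `decode_fastq_quality("I5!")` gives the scores 40, 20 and 0 for the three bases, and only the last one would fall below a Q20 cutoff.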
SNPs could also be considered in terms of their coverage. If a particular position was sequenced 50 times (i.e. it had 50X coverage) and "G" was only recovered once, then it is unlikely that "G" represents a true variant. So essentially, the purpose is to separate true variation from artifacts of sequencer error, PCR error, and other sources of error.
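The coverage argument can be sketched as a simple depth and allele-fraction check. This is only an illustration: the function name and the thresholds (`min_depth=10`, `min_alt_fraction=0.9`) are assumptions I picked for a haploid genome, where a real SNP should be supported by nearly all reads, and real variant callers use more elaborate statistical models.

```python
from typing import Dict


def passes_coverage_filter(
    base_counts: Dict[str, int],
    alt_base: str,
    min_depth: int = 10,            # hypothetical minimum coverage
    min_alt_fraction: float = 0.9,  # hypothetical cutoff for a haploid genome
) -> bool:
    """Return True if the alternate base is well supported at this position.

    base_counts maps each observed base to the number of reads showing it.
    """
    depth = sum(base_counts.values())
    if depth < min_depth:
        # Too few reads to trust any call at this position.
        return False
    alt_fraction = base_counts.get(alt_base, 0) / depth
    return alt_fraction >= min_alt_fraction


# The 50X example from above: "G" seen once among 50 reads.
one_in_fifty = passes_coverage_filter({"A": 49, "G": 1}, alt_base="G")
# The opposite case: "G" seen in 49 of 50 reads.
almost_all = passes_coverage_filter({"A": 1, "G": 49}, alt_base="G")
```

With these thresholds the single "G" out of 50 reads is rejected (allele fraction 0.02), while 49 "G" reads out of 50 pass.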
SNPs might also be filtered based on linkage disequilibrium, minor allele frequency, complexity of the surrounding region, other factors regarding the context of the flanking region, number of alleles, degree of mismatch of the region containing the SNP to the reference genome (mapping quality), etc. There is some good information here: https://bioinf.comav.upv.es/courses/sequence_analysis/snp_calling.html
That being said, the SNP score probably varies depending on which variant detection method you are using. One metric used by GATK, called the "VQSLOD", attempts to combine information from all of these annotations (coverage, quality score, sequence context, error rate, etc.) and uses a machine-learning algorithm to "figure out" which characteristics indicate a true SNP, while minimizing false positives (artifacts slipping through the filters). http://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr
Other variant callers (like freebayes: http://arxiv.org/pdf/1207.3907v2.pdf) do things differently, but the concept is the same.
In my experience there are two really important parameters: Phred-scaled SNP quality and strand bias, which produces false heterozygous calls in a haploid bacterial genome.
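Strand bias means the reads supporting the variant come mostly from one strand while the reference reads come from both. One common way to quantify it is Fisher's exact test on the 2x2 table of reference/alternate reads by forward/reverse strand. Below is a self-contained sketch of that test. The function names and the example counts are made up for illustration, and the exact strand-bias statistic differs between variant callers.

```python
from math import comb, log10


def strand_bias_p_value(ref_fwd: int, ref_rev: int, alt_fwd: int, alt_rev: int) -> float:
    """Two-sided Fisher's exact test on the strand-by-allele 2x2 table.

    A small p-value means the alternate allele is distributed across
    strands differently from the reference allele (possible strand bias).
    """
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    total = row_ref + row_alt

    def table_prob(x: int) -> float:
        # Hypergeometric probability of x reference reads on the forward strand.
        return comb(row_ref, x) * comb(row_alt, col_fwd - x) / comb(total, col_fwd)

    p_observed = table_prob(ref_fwd)
    lowest = max(0, col_fwd - row_alt)
    highest = min(row_ref, col_fwd)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        p for p in (table_prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


def phred_scale(p_value: float) -> float:
    """Express a p-value on the Phred scale (higher means more bias)."""
    return -10 * log10(max(p_value, 1e-300))


# Balanced example: both alleles seen equally on both strands.
balanced = strand_bias_p_value(10, 10, 10, 10)
# Biased example: alternate allele seen on the forward strand only.
biased = strand_bias_p_value(20, 20, 15, 0)
```

In the balanced example the p-value is 1 (no evidence of bias), while in the second example, where all 15 alternate reads are on the forward strand, it is well below 0.01, so that site would be a candidate for filtering.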