As Ricardo points out, these settings are not unheard of and generally yield good results; greater depth and validation would naturally add value and reliability. That is not always feasible or mandatory when performing a more general/meta analysis, as is done here on the mutation spectra and the types of genes affected.
To answer your question in more depth with regard to cancer: these or similar settings have been used by others to maximise discovery rates. The difficulty with tumor samples is that they are rarely pure, i.e. they usually contain a fair percentage of healthy tissue. Combined with the possibility of multiple cell lineages within the tumor, this can have a strong impact on the variant allele percentage and, as such, on the number of variant reads.
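To make the purity effect concrete, here is a minimal sketch (my own illustration, not anything from the paper) of the expected variant allele fraction for a somatic variant in an impure sample; the function and parameter names are hypothetical:

```python
# Sketch: expected variant allele fraction (VAF) for a somatic variant in an
# impure tumor sample, assuming the contaminating normal cells are diploid.
def expected_vaf(purity, clonal_fraction=1.0, copies_with_variant=1, total_copies=2):
    """Expected fraction of reads carrying the variant.

    purity          -- fraction of cells in the sample that are tumor cells
    clonal_fraction -- fraction of tumor cells carrying the variant
    copies_with_variant / total_copies -- tumor copy state (diploid het by default)
    """
    variant_alleles = purity * clonal_fraction * copies_with_variant
    all_alleles = purity * total_copies + (1 - purity) * 2
    return variant_alleles / all_alleles

# A 60%-pure tumor with a fully clonal heterozygous variant:
vaf = expected_vaf(0.6)
print(round(vaf, 2))       # ~0.30
print(round(vaf * 20))     # ~6 variant reads expected at 20x coverage
```

So even a clonal heterozygous variant in a reasonably pure sample can easily end up with only a handful of supporting reads at modest coverage, which is why discovery-oriented pipelines keep these cut-offs permissive.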
In your original question you state that "many of them look like this". I took the liberty of taking a quick look at the data and would encourage you to do the same. The statistics for the number of tumor variant reads indicate that more than 75% of the data has >= 7 variant reads in the tumor. So there may be many variants with 6 reads, but they are a minority. Additionally, most variants (91%) have a coverage of 20x or more.
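For what it's worth, a quick way to check such numbers yourself, assuming the calls are exported to a table; the file and column names below are hypothetical placeholders:

```python
# Minimal sketch for reproducing these summary statistics; "called_variants.tsv",
# "tumor_var_reads" and "tumor_total_reads" stand in for however the calls are exported.
import pandas as pd

calls = pd.read_csv("called_variants.tsv", sep="\t")

print(calls["tumor_var_reads"].describe())        # quartiles of variant-supporting reads
print((calls["tumor_var_reads"] >= 7).mean())     # fraction of calls with >= 7 variant reads
print((calls["tumor_total_reads"] >= 20).mean())  # fraction of calls with >= 20x total coverage
```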
If you are interested in how sample mixture, sequencing depth and different algorithms influence the detection power for these types of variants, have a look at the paper in the provided link.
Can you provide a citation, or better yet, a link to the actual paper you are referencing?
What was the purpose of the study? Was it using or reporting on a novel analytical algorithm or approach, or using established analyses? What did the authors actually say to justify their thresholds and approach?
Although the research I am involved in is not complex, given the simple Mendelian inheritance pattern observed in many corneal dystrophies, I think I can speak to the practical sense of at least one threshold.
I should start off by saying that the first threshold seems reasonable; I can't speak to the rationale behind the third one. I'll provide my rationale below for the first two thresholds you've noted.
1. The 6 supporting reads is reasonable given the requirement that 10% of total reads should support the variant: at that 10% floor, 6 supporting reads corresponds to a total coverage of 60 reads for the nucleotide in question (see the small sketch after this list for how the two cut-offs interact). As a somewhat related side note, we set our lower limit of total reads to 5, meaning that as few as 2 reads had to support the variant. We found a high degree of concordance (>95%) with Sanger results, and we also observed few false positives. This is all to say that NGS is pretty robust at calling variants, even for nucleotides with low coverage.
2. A Phred score > 20 is, as far as I can tell, pretty standard, and it is what we use. I am certain others more knowledgeable than I am can support its use.
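Here is the sketch referenced in point 1: how an absolute supporting-read count and a relative 10% cut-off interact. The thresholds are the ones discussed in this thread, but the function itself is just an illustration, not the authors' pipeline:

```python
# Minimal sketch of combining an absolute and a relative variant-read threshold;
# the default values are illustrative, not the authors' actual settings.
def passes_filters(variant_reads, total_reads,
                   min_variant_reads=6, min_variant_fraction=0.10):
    """Return True if a call meets both the absolute and the relative read thresholds."""
    if total_reads == 0:
        return False
    fraction = variant_reads / total_reads
    return variant_reads >= min_variant_reads and fraction >= min_variant_fraction

# With both filters, 6 supporting reads at <= 60x total coverage always clears the
# 10% cut-off (6/60 = 0.10), so the absolute count is the binding constraint at low
# coverage and the 10% fraction takes over above 60x.
print(passes_filters(6, 60))    # True  (exactly 10%)
print(passes_filters(6, 100))   # False (6% < 10%)
print(passes_filters(2, 5))     # False with these defaults (our own lab accepted 2/5)
```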
It would have been nice if the authors had validated, say, 10-100 of their variants by Sanger... that would have supported their use of these thresholds.
Again, we work with simple Mendelian disorders, so some of the thresholds will obviously be different. It seems to me that the 10% of all reads supporting the variant is really the threshold in question. Hopefully someone can provide an evidence-supported rationale for the use of the 10% threshold.
I am curious, what would be your thresholds, and the rationale behind choosing them?
Ricardo, you see, the problem is that you understood my sentence as 6 supporting reads. However, I'm going to disappoint you: the authors used a threshold of 6 reads covering an exon (no matter whether they support the variant or match the reference genome). This means that for an exon with only 6x coverage, 1 read would be enough to call a variant (1/6 = 17%).
To illustrate that, please look at the example variant call I pasted into my question: 6 reads in the normal, 10 reads in the tumor (6 of which contained the "mutation"). That's what's bothering me.
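In numbers (my reading of the filter as described here, not the paper's actual code), a cut-off on total exon coverage rather than on variant-supporting reads lets both of these cases through:

```python
# Hypothetical re-statement of the filter as I understand it: a minimum on *total*
# coverage plus the 10% variant fraction, with no minimum on variant-supporting reads.
MIN_TOTAL_COVERAGE = 6
MIN_VARIANT_FRACTION = 0.10

def called(total_reads, variant_reads):
    return (total_reads >= MIN_TOTAL_COVERAGE
            and variant_reads / total_reads >= MIN_VARIANT_FRACTION)

print(called(6, 1))    # True: 1/6 ~ 17% at the bare-minimum coverage
print(called(10, 6))   # True: the pasted example call (tumor 10x, 6 variant reads)
```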
I have concerns as well about the thresholds used in general. One has to realize that NGS data analysis is done by computer scientists who are not always aware of the artefacts introduced by the PCR and sequencing reactions. Illumina always advertises its high-quality reads, but that does not guarantee that no mistakes are made by the sequencing chemistry. For instance, FFPE DNA samples differ in quality and give rise to PCR artefacts. The alignment algorithms do not take into account that there are some 12,000 pseudogenes in the human genome with sequence homology to their parent genes, and many homologous genes are present as well, resulting in misalignment of reads and false mutation calls.