Is there a database or online source that lists all SNPs known to date for the novel coronavirus, ideally as a table, bedGraph, or similar well-arranged output?
Dear Václav Brázda, thanks for this article! It is a useful starting point for me, but it is more than a month old. I suppose there must be a daily updated database of SARS-CoV-2 SNPs. According to the Nextstrain project, over 4,000 SARS-CoV-2 genomes have been sequenced to date, but I was unable to find a summary table of all SNPs. Please help!
Dear Omar Ali Al-Khashman, thank you very much for the useful suggestion. It is great to know the most frequent mutations of SARS-CoV-2. This is further progress, but the sampling of sequences ended on 23 March 2020, so there is still an urgent need for an up-to-date SNP database. Does anybody know of a link? Thanks in advance.
Martin Bartas, perhaps it is not so difficult to generate your own data each time a new sequence appears. I haven't fully verified this, but here's what should work:
1) Get sequences. I get FASTA files from https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/ using the following URL (credit for download goes to Daniel Plakinger):
3) Use SNP-sites to get a VCF file from this multiple alignment.
snp-sites -v sequences_aln.fasta
Note: This pipeline could perhaps be modified so that all sequences are aligned to a reference sequence. Alternatively, it may suffice to convert the coordinates afterwards from alignment positions to reference genome positions.
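The coordinate conversion mentioned in the note can be sketched in a few lines of Python. This is only a minimal sketch, assuming the reference genome is one of the rows of the multiple alignment and that gaps are written as `-`. The function names `alignment_to_reference_map` and `snps_in_reference_coords` are hypothetical helpers for illustration, not part of SNP-sites.

```python
def alignment_to_reference_map(aligned_ref):
    """Map each alignment column (0-based) to a 1-based reference position.

    Columns where the reference has a gap (an insertion relative to the
    reference) map to None.
    """
    mapping = []
    ref_pos = 0
    for base in aligned_ref:
        if base == "-":
            mapping.append(None)
        else:
            ref_pos += 1
            mapping.append(ref_pos)
    return mapping


def snps_in_reference_coords(aligned_ref, aligned_seq):
    """List (ref_pos, ref_base, alt_base) substitutions of one sequence vs. the reference."""
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned sequences must have equal length")
    mapping = alignment_to_reference_map(aligned_ref)
    snps = []
    for col, (r, a) in enumerate(zip(aligned_ref.upper(), aligned_seq.upper())):
        # Skip gaps and ambiguous bases; only unambiguous substitutions are reported.
        if r in "ACGT" and a in "ACGT" and r != a:
            snps.append((mapping[col], r, a))
    return snps


# Toy example (made-up sequences): the reference has a gap at alignment column 3 (0-based).
ref = "ACG-TACGT"
seq = "ACGATACTT"
print(snps_in_reference_coords(ref, seq))  # [(7, 'G', 'T')]
```

The substitution sits in alignment column 8 (1-based) but is reported at reference position 7, because the gap column in the reference is not counted.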
Dear Matej Lexa, thank you very much for your suggestion. I tried a similar approach a couple of days ago; unfortunately, my standalone version of MUSCLE was not able to align such a large set of long sequences, and the web-based tools (Phylogeny.fr, UseGalaxy tools, etc.) are all limited to 200 FASTA sequences.
Martin Bartas, I can confirm the size problems. Here are a few alternatives:
1) I started a CLUSTALW alignment, which is unfortunately still running but will probably produce results tomorrow.
2) Because there is a reference genome and all the sequences are very similar, it may be sufficient to align all the sequences to the reference. A quick way to do that is to use BLAST in this way, which takes just a few minutes
- or to convert it to FASTA (or some other MSA format), use something like
mview -in blast sequences_blast0.out -out fasta > sequences_blast0.fasta
3) As another alternative, I managed to run MUSCLE on about 250 sequences, so one could produce four alignments and then attempt to merge them with something like the poa aligner (http://manpages.org/poa). I don't know whether this will actually work well.
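For the third alternative, splitting the input into batches of about 250 sequences could be done with a short script. This is a minimal sketch, assuming a plain multi-FASTA input; the function names `read_fasta` and `split_fasta` and the `batch_N.fasta` output naming are hypothetical. It only prepares the batches and does not address merging the resulting alignments.

```python
def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(path, batch_size=250, prefix="batch"):
    """Write consecutive batches of records to prefix_1.fasta, prefix_2.fasta, ..."""
    out_paths = []
    batch = []
    for record in read_fasta(path):
        batch.append(record)
        if len(batch) == batch_size:
            out_paths.append(_write_batch(batch, prefix, len(out_paths) + 1))
            batch = []
    if batch:  # remaining records that did not fill a whole batch
        out_paths.append(_write_batch(batch, prefix, len(out_paths) + 1))
    return out_paths


def _write_batch(batch, prefix, index):
    """Write one batch of (header, sequence) records and return its path."""
    out_path = f"{prefix}_{index}.fasta"
    with open(out_path, "w") as out:
        for header, seq in batch:
            out.write(f">{header}\n{seq}\n")
    return out_path
```

Calling `split_fasta("sequences.fasta")` (the input file name here is a placeholder) would return the list of batch files, each of which can then be aligned separately.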
Matej Lexa, wow! Many thanks. I will continue with your .vcf file using the vcfR library. Our goal is to overlay PQS and cruciform predictions with the known variants. May I inform you directly about the results? Also, should this lead to a publication, we would like to offer you co-authorship.
At the HIV Databases at LANL, we are working on this problem and will very soon have a web site up and running to track not only all of the SNPs but also the prevalence of each SNP in geographical and chronological space. This will allow researchers to see which mutations are being positively selected, which are associated with a particular geographic location, etc. Our paper should be available in the preprint archives by the end of this week.
The COVID-19 analysis at Datamonkey is very worthwhile. It not only keeps track of all mutations in the virus genomes submitted to GISAID, but also analyses the selection pressure on the mutants: whether they are increasing or decreasing in the population and how many times each mutation has occurred.
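The overlay of variants with predicted regions described above amounts to an interval lookup. Here is a minimal sketch, assuming variant positions and region boundaries are 1-based and regions are inclusive `(start, end)` pairs; the function name `variants_in_regions` and the example numbers are hypothetical and not taken from any real prediction.

```python
from bisect import bisect_left, bisect_right


def variants_in_regions(positions, regions):
    """Return {(start, end): [variant positions inside the region]}.

    positions: iterable of 1-based variant positions.
    regions:   iterable of 1-based inclusive (start, end) tuples.
    """
    sorted_pos = sorted(positions)
    hits = {}
    for start, end in regions:
        lo = bisect_left(sorted_pos, start)   # first position >= start
        hi = bisect_right(sorted_pos, end)    # one past the last position <= end
        hits[(start, end)] = sorted_pos[lo:hi]
    return hits


# Made-up positions and regions, purely for illustration.
print(variants_in_regions([5, 10, 15, 20], [(8, 15), (21, 30)]))
# {(8, 15): [10, 15], (21, 30): []}
```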
http://covid19.datamonkey.org/
There are several other databases that list all polymorphic sites that have accumulated, but most of them do not make it easy to tell which sites are more important because they are found in a significant percentage of isolates or less important because they have only been observed one time.
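Telling widespread sites apart from one-off observations comes down to counting in how many isolates each variant appears. This is a minimal sketch, assuming the variants have already been called per isolate and are stored as `(position, ref_base, alt_base)` tuples; the function name `snp_prevalence` and the toy data are hypothetical.

```python
from collections import Counter


def snp_prevalence(variants_by_isolate):
    """Return [(variant, count, fraction)] sorted by descending count.

    variants_by_isolate maps an isolate name to the collection of variants
    seen in it, e.g. (position, ref_base, alt_base) tuples.
    """
    n_isolates = len(variants_by_isolate)
    counts = Counter()
    for variants in variants_by_isolate.values():
        counts.update(set(variants))  # count each variant once per isolate
    return [
        (variant, count, count / n_isolates)
        for variant, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ]


# Made-up toy data, purely for illustration (not real isolates).
toy = {
    "isolate_1": [(100, "A", "G")],
    "isolate_2": [(100, "A", "G"), (250, "C", "T")],
    "isolate_3": [],
    "isolate_4": [(100, "A", "G")],
}
for variant, count, fraction in snp_prevalence(toy):
    label = "singleton" if count == 1 else "recurrent"
    print(variant, count, f"{fraction:.2f}", label)
```

In the toy data, the variant at position 100 is found in three of four isolates, while the one at position 250 is a singleton.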
I find it highly useful to look for sites of interest on the Datamonkey site and then map those sites onto the tree at the Nextstrain site. I am attaching figures from Datamonkey and Nextstrain to illustrate. The D614G mutation in Spike is increasing over time but seems to have mutated from D to G only a very few times. The L5F mutation in Spike seems to have arisen more times (but still just a few), yet none of the F lineages seem to be successful.
However, I recommend downloading publicly available genomes and running variant calling to obtain all variants. A very simple and straightforward pipeline for variant calling and variant annotation can be found, for example, in this publication: