Dear Community,
I would like to reconstruct the phylogeny of 16,000 E. coli strains based on their genome sequences. What do you think about the following concept?
1. Predict ORFs in all genomes. (Prodigal)
2. Identify the core genes by reciprocal BLAST of the reference E. coli strain's proteins against all other proteomes. (Diamond)
3. Align the core genes. (MAFFT)
4. Keep only the SNV sites and concatenate them into a "super" alignment.
5. Calculate a maximum likelihood tree based on the "super" alignment. (PhyML)
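To make step 4 concrete, here is a minimal sketch of what I mean by keeping only SNV sites and concatenating them. It assumes each core gene alignment is already loaded as a dictionary mapping strain name to aligned sequence. The function names `snv_columns` and `concatenate_snvs` are my own and do not come from any of the tools above, and padding a strain that lacks a gene with gaps is an assumption on my part.

```python
from typing import Dict, List

NUCLEOTIDES = set("ACGT")


def snv_columns(aln: Dict[str, str]) -> List[int]:
    """Return indices of columns with at least two distinct unambiguous bases."""
    seqs = list(aln.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences in an alignment must have equal length")
    keep = []
    for i in range(length):
        # Gaps and ambiguity codes (N, R, Y, ...) are ignored when judging variability.
        bases = {s[i].upper() for s in seqs} & NUCLEOTIDES
        if len(bases) >= 2:
            keep.append(i)
    return keep


def concatenate_snvs(gene_alns: List[Dict[str, str]], strains: List[str]) -> Dict[str, str]:
    """Build the "super" alignment; a strain missing a gene gets '-' at its sites."""
    parts: Dict[str, List[str]] = {s: [] for s in strains}
    for aln in gene_alns:
        cols = snv_columns(aln)
        for s in strains:
            if s in aln:
                parts[s].append("".join(aln[s][i] for i in cols))
            else:
                parts[s].append("-" * len(cols))
    return {s: "".join(p) for s, p in parts.items()}


# Toy data: two genes, three strains; strain C lacks the second gene.
gene1 = {"A": "ATGC", "B": "ATGA", "C": "ATGA"}
gene2 = {"A": "GG-T", "B": "GCAT"}
super_aln = concatenate_snvs([gene1, gene2], ["A", "B", "C"])
print(super_aln)
```

In this toy example only one column per gene is variable, so each strain ends up with two sites in the "super" alignment.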
My questions are the following:
A) Should we use only core genes that are found in every genome? With our data this is impossible: the intersection is empty (we did not find a single gene shared by all 16,000 genomes). Should I exclude some strains to recover a common core, or can I use genes that are missing from some strains but present in at least 15,000 of them?
B) Is it a good idea to concatenate SNV sites and infer a tree based on them?
C) How should I choose a nucleotide substitution model for such a long alignment with so many taxa?
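For question A, the relaxed core definition I have in mind could be sketched as follows. It assumes a presence table mapping each reference gene to the set of strains in which a reciprocal hit was found. The name `soft_core` and the `min_fraction` parameter are hypothetical, and the toy data are made up for illustration only.

```python
from typing import Dict, List, Set


def soft_core(presence: Dict[str, Set[str]], n_strains: int, min_fraction: float) -> List[str]:
    """Return genes found in at least min_fraction of the strains, sorted by name."""
    min_count = min_fraction * n_strains
    return sorted(g for g, strains in presence.items() if len(strains) >= min_count)


# Toy data: four strains, no gene is present in all of them.
presence = {
    "geneA": {"s1", "s2", "s3"},
    "geneB": {"s2", "s3", "s4"},
    "geneC": {"s1"},
}

strict = soft_core(presence, n_strains=4, min_fraction=1.0)
relaxed = soft_core(presence, n_strains=4, min_fraction=0.75)
print(strict, relaxed)
```

With 16,000 genomes, the threshold from question A would correspond to `min_fraction = 15000 / 16000`.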
Cheers,
Eszter