I want to analyze mutations in the core genome of Xanthomonas oryzae, a bacterial species that comprises two pathovars: pv. oryzae and pv. oryzicola. I have already identified the core genome (the genes conserved in at least 95% of the strains in each pathovar), and now I want to identify the genes that are less conserved. To do this, I am thinking of calculating a conservation score for each gene from its multiple sequence alignment: estimate the probability of each amino acid occurring at each alignment position in both pathovars, then take the mean Shannon entropy across positions. The Shannon entropy would be the main measure of a gene's conservation; it would be calculated for each pathovar-gene pair and then averaged. With a score for each gene, I would then calculate z-scores, select the genes with outlier z-scores, and perform a GO enrichment analysis on them. The idea is to identify genes that are differentially variable or differentially conserved within the pathovars and the species.
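To make the proposed scoring concrete, here is a minimal sketch of the per-gene calculation, using only the Python standard library. It assumes each alignment is a list of equal-length protein strings, that gap characters are skipped when estimating amino acid frequencies, and that entropy is measured in bits. The function names and the toy gene names and sequences are hypothetical, for illustration only, and not real X. oryzae data.

```python
import math
from collections import Counter
from statistics import mean, pstdev

GAP_CHARS = set("-.")  # assumption: gaps are ignored when counting residues


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    residues = [aa for aa in column if aa not in GAP_CHARS]
    if not residues:
        return 0.0  # all-gap column: treated as uninformative
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def mean_entropy(aligned_seqs):
    """Mean column entropy over an alignment of equal-length sequences."""
    columns = zip(*aligned_seqs)  # transpose rows (strains) into columns
    return mean(column_entropy(col) for col in columns)


def z_scores(scores):
    """Z-score of each gene's score relative to all genes in `scores`."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        return {gene: 0.0 for gene in scores}
    return {gene: (s - mu) / sigma for gene, s in scores.items()}


# Hypothetical toy alignments for one pathovar: gene name -> aligned sequences.
toy_alignments = {
    "geneA": ["MKTA", "MKTA", "MKTA"],  # fully conserved
    "geneB": ["MKTA", "MRTA", "MKSA"],  # some variable positions
    "geneC": ["MKTA", "LRSG", "IHCV"],  # highly variable
}

gene_scores = {gene: mean_entropy(seqs) for gene, seqs in toy_alignments.items()}
gene_z = z_scores(gene_scores)
for gene in gene_scores:
    print(gene, round(gene_scores[gene], 3), round(gene_z[gene], 3))
```

In this sketch the same functions would be run once per pathovar (and once on the pooled alignment for the species), so that each gene gets one score per group before the scores are averaged or compared.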

Can anyone give me some advice or say if this method is appropriate?
