For phylogenetic studies, it is better to choose orthologous genes rather than paralogous genes. My question is: how can one recognize whether a gene is orthologous or paralogous? Could you please give a reference?
To be certain, one should have mapping populations and know the segregation of the markers in question. That is seldom practical. In practice, the distinction is often made by setting a more or less arbitrary threshold of similarity. That is, alleles passing a given similarity are considered orthologs, those not passing are considered paralogs. This is of course silly. It is necessary to consider the known genetic history of the organisms, and especially their ploidy. Painstaking phylogenetic analysis is a start, followed by formulating a hypothesis that fits the data. There is no test for this. Consider Smissen et al., 2011,
To add to the answer above, whether to use orthologs or paralogs in phylogenetic studies depends on your question. If you are, say, interested in inferring the evolutionary relationships of a group of species, you would likely use orthologous genes. If you were interested in the evolution of a gene family within a taxon, you might want to use paralogs.
I mostly agree with the answers above, but the sentence "That is, alleles passing a given similarity are considered orthologs, those not passing are considered paralogs." makes no sense to me. Recently duplicated genes often have very high levels of similarity (we are talking 99.9%), although there could also be other reasons for high similarity, and they are still paralogs.
Furthermore, I am not sure that I fully understand the question.
The silly answer is that you recognize orthologous genes by the fact that they belong to two different species, and paralogous genes by the fact that they belong to a single species.
So the way I understand the question is:
If you have a gene X from two species (1X and 2X), how can you be sure that this gene has not undergone duplication in both species, resulting in the existence of paralogues Xa and Xb?
So now you are wondering whether you may be using 1Xa and 2Xb in your dataset. Is that the question?
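To make the 1Xa / 2Xb scenario concrete, here is a minimal sketch under the usual definition that two genes are orthologs if they diverged at a speciation event and paralogs if they diverged at a duplication event. The nested-tuple tree format, the function names, and the event labels are all assumptions made for illustration. In a real analysis the labels would have to be inferred, for example by comparing the gene tree with the species tree. The toy tree also assumes that the duplication predates the split between species 1 and 2.

```python
# Toy gene tree as nested tuples: (event, left_subtree, right_subtree).
# Leaves are gene names. The event labels are ASSUMED to be known here.

def leaves(node):
    """Return the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def classify_pairs(node, out=None):
    """Label every gene pair by the event at the node where the two genes split."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "ortholog" if event == "speciation" else "paralog"
    # Genes on opposite sides of this node diverged at this node.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out

# Hypothetical case: X duplicated into Xa and Xb BEFORE species 1 and 2 split.
toy_tree = (
    "duplication",
    ("speciation", "1Xa", "2Xa"),
    ("speciation", "1Xb", "2Xb"),
)

pairs = classify_pairs(toy_tree)
for pair, label in sorted(pairs.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(pair), label)
```

In this toy tree, 1Xa and 2Xa come out as orthologs, while 1Xa and 2Xb come out as paralogs even though they come from two different species. That mixed pair is exactly the one you would not want in a species-level dataset.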
The answer to that question is that we first have to assess whether it matters at all.
Gene duplication can lead to: 1. loss of function, 2. evolution of novel functions, or even 3. co-existence of two gene copies with almost identical functions.
The distinction between genes with 'same' and 'different' functions can be very arbitrary.
If you have case 3, or simply a very recent duplication event, reflected in a high level of similarity, it may not matter at all.
So you may conduct pairwise similarity analyses on your dataset, and then identify the outliers with respect to the expected phylogenetic relationships. As mentioned in the reply above, this means that you cannot set a single threshold for similarity, as phylogenetically distant taxa are expected to have low(er) similarity. Therefore, you should infer a similarity curve with respect to presumed phylogenetic distance, and then look for outliers from that curve. Somewhat complicated.
Outliers may indicate function loss, novel functions (in these two cases, these genes cannot be considered orthologues any more), but also a (mostly) orthologous gene undergoing an exceptionally high mutation rate, which can be driven both by adaptive and nonadaptive pressures.
In any of these cases, these outliers are not particularly suitable for inferring phylogenies, and probably should be removed from the dataset, as they produce compositional heterogeneity and can cause long-branch artefacts.
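The outlier screen described above could be sketched roughly as follows. This is only an illustration: it assumes a straight-line relationship between similarity and presumed phylogenetic distance (real similarity decay is usually not linear), it flags outliers with a robust z-score based on the median absolute deviation, and the cutoff of 3.5, the function names, and the toy numbers are all made up for the example.

```python
from statistics import median

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def flag_outliers(observations, cutoff=3.5):
    """observations: list of (gene, presumed_distance, pairwise_similarity).

    Returns the observations whose residual from the fitted line has a
    robust z-score (MAD-based) larger than `cutoff` in absolute value.
    """
    xs = [d for _, d, _ in observations]
    ys = [s for _, _, s in observations]
    a, b = fit_line(xs, ys)
    residuals = [s - (a + b * d) for _, d, s in observations]
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    if mad == 0:
        return []  # no spread in the residuals, nothing to flag
    flagged = []
    for obs, r in zip(observations, residuals):
        z = 0.6745 * (r - med) / mad
        if abs(z) > cutoff:
            flagged.append(obs)
    return flagged

# Made-up data: four species pairs at presumed distances 0.1 to 0.4.
# Gene g4 is far too dissimilar for the closest pair.
toy_observations = [
    ("g1", 0.1, 0.90), ("g1", 0.2, 0.85), ("g1", 0.3, 0.80), ("g1", 0.4, 0.75),
    ("g2", 0.1, 0.91), ("g2", 0.2, 0.84), ("g2", 0.3, 0.81), ("g2", 0.4, 0.74),
    ("g3", 0.1, 0.89), ("g3", 0.2, 0.86), ("g3", 0.3, 0.79), ("g3", 0.4, 0.76),
    ("g4", 0.1, 0.60), ("g4", 0.2, 0.85), ("g4", 0.3, 0.80), ("g4", 0.4, 0.75),
]

flagged = flag_outliers(toy_observations)
print(flagged)
```

With these toy numbers only the g4 comparison at distance 0.1 is flagged. A flag like this does not say why the gene deviates (paralogy, function change, or an unusually high rate), only that it deserves a closer look before it goes into the phylogenetic dataset.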
I would probably try running analyses using a model designed for compositional heterogeneity (e.g. the PhyloBayes CAT-GTR model) on datasets both including and excluding the 'problematic' genes.
What about this: https://www.cell.com/trends/plant-science/fulltext/S1360-1385(16)00059-5 or this: Characterization of angiosperm nrDNA polymorphism, paralogy,...