I am analyzing fungal populations sampled from 12 geographic locations using RADseq data. The dataset contains 7,439 polymorphic loci with ~10 SNPs per RAD locus (75,482 SNPs in total). I ran STRUCTURE and DAPC on a dataset with 7,439 loci and one SNP per locus (to avoid linkage). As a result, 5 genetic clusters were identified in the whole dataset. While reading the literature on RADseq population analysis, I noticed that people also use BayeScan to detect loci under selection.
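For readers unfamiliar with the thinning step, here is a minimal sketch of reducing a SNP table to one SNP per RAD locus. The input layout (a list of `(locus_id, position)` pairs), the function name `one_snp_per_locus`, and the choice of a random SNP per locus are assumptions made for illustration; picking the first SNP per locus is an equally common choice.

```python
import random
from collections import defaultdict


def one_snp_per_locus(snps, seed=0):
    """Return {locus_id: position}, keeping one randomly chosen SNP per locus.

    `snps` is a hypothetical list of (locus_id, position) pairs; real data
    would normally come from a VCF or a genotype table.
    """
    rng = random.Random(seed)  # fixed seed so the thinning is reproducible
    by_locus = defaultdict(list)
    for locus_id, position in snps:
        by_locus[locus_id].append(position)
    # Sort loci so the result does not depend on input order.
    return {locus: rng.choice(positions)
            for locus, positions in sorted(by_locus.items())}


# Toy example: three loci with two, one, and three SNPs.
toy_snps = [
    ("locus_1", 12), ("locus_1", 40),
    ("locus_2", 7),
    ("locus_3", 3), ("locus_3", 55), ("locus_3", 80),
]
thinned = one_snp_per_locus(toy_snps)
print(len(thinned), "loci retained, one SNP each")
```

Whichever rule is used, it is worth recording it (and the seed, if random) so the same SNP set can be reused for STRUCTURE, DAPC, and the outlier scan.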

Before proceeding with the population analysis, I also want to detect potential loci under selection in my dataset, for two reasons: 1. To ensure I use only putatively neutral loci for the population structure analysis; 2. To possibly identify genes that participate in local adaptation (to the environment, etc.) or encode important phenotypic traits (e.g., traits linked to pathogenicity).

Since I have never done such an analysis before, I am not sure whether I am approaching this task correctly, and I am also concerned about false positives. I was hoping to get some perspective from the scientific community and possibly some suggestions on a few things.

First, if I want to detect whether any of the markers I use for population analysis (i.e., a single SNP per locus) are under selection, I should run BayeScan on a dataset with 5 population strata in which only a single SNP per locus is recorded. Is this correct?

Second, if I want to detect loci under selection in order to check for links to local adaptation and/or phenotypic characteristics, I should use 12 population strata (the geographic locations) and a dataset of haplotypes built from all the SNPs detected in each locus. Is this correct?

Based on the logic described in "First" and "Second", I did test runs with BayeScan and plotted the results. Two plots are attached. My questions regarding the plots:

1. The distribution of data points diverges on both plots. My interpretation is that BayeScan detected loci under selection with very high Fst and with very low Fst (am I right?). Would it be correct to conclude that loci with low Fst (the lower string of points on the plot) are probably under balancing selection, while those with high Fst are under diversifying selection?

2. The difference in the number of loci under selection reported by the two separate analyses is large. In the first run, with 5 populations and a single SNP per locus (two alleles), I got 272 outliers. In the second run, with 12 populations and haplotypes derived from all SNPs in each locus (multiple alleles), I got 700 outliers (roughly 10% of the total dataset). I don't know how to evaluate this result. Is it real, or do I most likely have false positives here (since there are so many of them)?
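To make the two questions above concrete, here is a rough sketch of how loci in a BayeScan run could be sorted by q-value and by the sign of alpha. It assumes the per-locus values have already been read from the `_fst.txt` output (which reports `prob`, `log10(PO)`, `qval`, `alpha`, and `fst` for each locus) into a list of dicts. The function names, the toy numbers, and the 0.05 q-value threshold are illustrative assumptions, not values from my runs. In BayeScan, a positive alpha is taken to suggest diversifying selection and a negative alpha to suggest balancing or purifying selection.

```python
def classify_loci(rows, q_threshold=0.05):
    """Group 1-based locus indices by q-value and the sign of alpha.

    `rows` is a hypothetical list of dicts with keys 'qval', 'alpha', 'fst'.
    """
    groups = {"diversifying": [], "balancing_or_purifying": [], "not_outlier": []}
    for index, row in enumerate(rows, start=1):
        if row["qval"] > q_threshold:
            groups["not_outlier"].append(index)
        elif row["alpha"] > 0:
            groups["diversifying"].append(index)  # high-Fst outliers
        else:
            groups["balancing_or_purifying"].append(index)  # low-Fst outliers
    return groups


def expected_false_positives(n_outliers, q_threshold=0.05):
    """Upper bound on expected false discoveries if the model assumptions hold."""
    return n_outliers * q_threshold


# Toy values only; these are not real BayeScan results.
toy_rows = [
    {"qval": 0.001, "alpha": 1.8, "fst": 0.42},
    {"qval": 0.600, "alpha": 0.0, "fst": 0.11},
    {"qval": 0.020, "alpha": -1.2, "fst": 0.02},
]
groups = classify_loci(toy_rows)
print(groups)
print(expected_false_positives(700))
```

Under these assumptions, a q-value cutoff bounds the expected share of false positives among the outliers only if the BayeScan model fits the data, so a count as high as 700 may also reflect how the populations were defined or how the haplotypes were coded rather than selection alone.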

If you can offer any input, perspective, or warnings, or point out my mistakes, on at least some of the questions above, I will be very grateful. I just want to make sure I am not making fundamental mistakes at the outset and setting up my analyses incorrectly.

Many thanks for reading and taking the time to think this through with me.

Olga
