I have been computing pairwise Fst for a number of populations with a large SNP dataset, and it got me thinking about the methods we use to assess the significance of these values.

The R packages I have found that do this efficiently for many SNPs (StAMPP & diveRsity) seem to use bootstrapping over loci to generate a confidence interval around the observed Fst. You can then check whether the CI includes 0 and use that to decide whether your marker set is reliably estimating Fst for a given population pair.
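For concreteness, here is a minimal sketch of what bootstrapping over loci looks like. This is not the StAMPP or diveRsity code itself; it assumes a hypothetical helper `pairwise_fst(geno, pop)` that returns a single Fst estimate for two populations from a loci-by-individual genotype matrix `geno` and a population factor `pop`.

```r
## Sketch only: bootstrap over loci to get a CI around the observed pairwise Fst.
## pairwise_fst() is a hypothetical Fst estimator, not a function from either package.
bootstrap_fst_ci <- function(geno, pop, n_boot = 1000, level = 0.95) {
  n_loci <- nrow(geno)
  boot_fst <- replicate(n_boot, {
    loci <- sample(n_loci, replace = TRUE)          # resample loci with replacement
    pairwise_fst(geno[loci, , drop = FALSE], pop)   # recompute Fst on the resampled loci
  })
  alpha <- 1 - level
  quantile(boot_fst, probs = c(alpha / 2, 1 - alpha / 2), na.rm = TRUE)
}
```

If the lower bound of the returned interval is above 0, you would conclude that the marker set supports non-zero differentiation for that pair.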

I am curious why bootstrapping over loci is preferred over permuting individuals between the target populations. Permutation would seem to be a way to generate a null distribution against which you could compare your observed Fst and compute a p-value. I haven't found any references or software that do this, so I wonder whether I am missing something.

So far the only suggestion of this approach I have found is in a reply from the author of the adegenet package (http://lists.r-forge.r-project.org/pipermail/adegenet-forum/2011-February/000214.html). That method permutes individuals across all populations in the dataset to generate a null distribution of Fst under panmixia. An alternative I was considering is to permute individuals two populations at a time, giving a null distribution for each population pair that can be compared to the corresponding pairwise Fst.
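To make the pairwise variant concrete, here is a sketch of how such a permutation test could look. Again, this is only an assumption of how it might be implemented, not code from adegenet, StAMPP, or diveRsity, and it reuses the same hypothetical `pairwise_fst(geno, pop)` helper as above.

```r
## Sketch only: permutation test for one population pair.
## Shuffles individuals between the two populations to build a null distribution of Fst.
permute_fst_pvalue <- function(geno, pop, n_perm = 999) {
  obs <- pairwise_fst(geno, pop)            # observed Fst for the pair
  null_fst <- replicate(n_perm, {
    perm_pop <- sample(pop)                 # randomly reassign individuals to the two pops
    pairwise_fst(geno, perm_pop)            # Fst under the null of no differentiation
  })
  ## one-sided p-value: how often a random relabelling gives Fst >= observed
  p <- (sum(null_fst >= obs) + 1) / (n_perm + 1)
  list(observed = obs, p_value = p, null = null_fst)
}
```

The "panmixia" variant suggested on the adegenet list would instead shuffle population labels across all populations at once and compare each observed pairwise Fst to that single global null distribution.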

Does anyone out there have any thoughts on these methods and the benefits/pitfalls of each?

References:

StAMPP: http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12129/abstract

diveRsity: http://dx.doi.org/10.1111/2041-210X.12067
