When analyzing Structure results (microsatellite loci) with the method of Evanno et al. (2005), has anyone ever obtained two peaks in the deltaK estimation? How can that be interpreted?
I have also obtained such results with the Evanno method. In my case I had a high peak for K = 2 and a smaller one for a larger K. I assumed that the population I was studying was divided into two very different groups, and this agrees with other data (different geographical origin, different morphological characteristics...). I then thought that these two big groups contained some substructure, and this too was confirmed by other data. In my case, I was studying sheep populations of the Alpine area belonging to different breeds. So I think the results you obtain with Structure should always be interpreted together with other information on the population. The Evanno method shows the most important division as the biggest peak, while the smaller peak could be additional substructure inside the first one.
Thank you for taking the time to respond, Chiara. Good to know that "this happens". I actually saw a publication from China with the same interpretation you are giving: a smaller peak representing a lower level of substructure. That could make sense for my data, since some of the populations are geographically closer than others.
My question arises, however, because the peaks (at K=3 and K=5) are equally high, so I'm not sure whether the substructure explanation still fits. I'll keep reading. Thank you!!
As Chiara said, the Evanno method is used to find the uppermost hierarchical level of structure, so it is possible that there is substructure present. When looking at your deltaK values, you should also pay attention to the estimated Ln Pr (X|K) and its variability across runs for each K. Sometimes STRUCTURE will have one run with an Ln Pr (X|K) that is extremely different from the other runs at the same K. For example, you might get values ranging from -2840 to -2970 and then one run with a value of -4500. That single run can greatly influence your deltaK values if you are using only a few independent runs for each K. If that is the case, you may need to run the program longer or use a different burn-in.
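To make the effect of a single odd run concrete, here is a minimal Python sketch of the deltaK calculation. It assumes the common formulation in which deltaK at a given K is the absolute second difference of the mean Ln Pr (X|K) divided by the standard deviation of Ln Pr (X|K) across runs at that K. The function name and all the numbers are hypothetical and only for illustration; they are not anyone's real data.

```python
from statistics import mean, stdev

def delta_k(lnp_by_k):
    """Compute deltaK for every K that has both neighbours (K-1 and K+1).

    lnp_by_k maps K -> list of Ln Pr(X|K) values, one per replicate run.
    """
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            # |L(K+1) - 2 L(K) + L(K-1)| using the per-K means
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            sd = stdev(lnp_by_k[k])
            result[k] = second_diff / sd if sd > 0 else float("inf")
    return result

# Hypothetical Ln Pr(X|K) values, three replicate runs per K.
clean = {
    1: [-3500, -3502, -3498],
    2: [-3000, -3004, -2996],
    3: [-2900, -2903, -2897],
    4: [-2850, -2852, -2848],
}

# Same data, but one run at K=4 did not converge and is far from the others.
with_outlier = dict(clean)
with_outlier[4] = [-2850, -2852, -4500]

dk_clean = delta_k(clean)
dk_outlier = delta_k(with_outlier)

for k in sorted(dk_clean):
    print(f"K={k}: deltaK clean={dk_clean[k]:.1f}, with outlier={dk_outlier[k]:.1f}")
```

In this made-up example the single bad run at K=4 drags down the mean at K=4, which inflates the second difference at K=3 and produces a second, spurious peak there. With only a few runs per K, one such run is enough.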
I agree with Robert. I usually do several runs (10) for the same K and compare the Ln Pr (X|K) values to check that they are similar. If the results are very different, I think you should use a longer burn-in period.
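One way to do this comparison systematically is sketched below in Python. It flags runs whose Ln Pr (X|K) lies far from the median of the runs at the same K, measured in median absolute deviations. The function name and the cutoff of 5 are assumptions chosen for illustration, not something taken from the Structure manual, and the example values are invented around the range mentioned above.

```python
from statistics import median

def flag_outlier_runs(lnp_values, n_mads=5.0):
    """Return indices of runs whose Ln Pr(X|K) is far from the median.

    n_mads is a hypothetical cutoff: a run is flagged when its distance
    from the median exceeds n_mads times the median absolute deviation.
    """
    med = median(lnp_values)
    mad = median(abs(v - med) for v in lnp_values)
    if mad == 0:
        # All "typical" runs are identical; flag anything that differs.
        return [i for i, v in enumerate(lnp_values) if v != med]
    return [i for i, v in enumerate(lnp_values) if abs(v - med) > n_mads * mad]

# Ten hypothetical replicate runs at a single K; the last one is suspicious.
runs = [-2840, -2875, -2901, -2930, -2952, -2970, -2860, -2915, -2890, -4500]
flagged = flag_outlier_runs(runs)
print("Runs to inspect:", flagged)
```

A flagged run is a reason to look at convergence and to rerun with a longer burn-in, not a reason to silently drop the run.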
In my experience, the Evanno approach is indeed very sensitive to variation in a single run. Personally, I prefer to interpret directly the quantities used in the formula, i.e. (i) the trend of Ln Pr (X|K) and its stabilization after a given K, and (ii) the increase in the variation of Ln Pr (X|K) across runs.
It would help if you could provide some more information about the model you used for the Structure simulations, plus some information on what your runs look like (i.e., parameter convergence, variability of individual qi values across repetitions of the same K, etc.).
Anyhow, it could be that your model gets trapped in a local likelihood peak. One solution given in the manual for weak Structure signals is to use LOCPRIOR. If you work with the user interface (the fancy way), just add one column to your dataset with a number giving the sampling location. Then, when you select your model, use Admixture, tick the "use sampling locations as prior" box, and continue normally. Please be aware that instead of sampling locations the prior could reflect the species, subspecies, etc.
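For anyone who wants to look at those two quantities side by side, here is a small Python sketch that tabulates, for each K, the mean Ln Pr (X|K), its standard deviation across runs, and the gain in the mean relative to K-1. The function name and the numbers are hypothetical.

```python
from statistics import mean, stdev

def summarize_lnp(lnp_by_k):
    """Return rows of (K, mean Ln Pr, SD across runs, gain over K-1)."""
    rows = []
    prev_mean = None
    for k in sorted(lnp_by_k):
        m = mean(lnp_by_k[k])
        sd = stdev(lnp_by_k[k])
        gain = None if prev_mean is None else m - prev_mean
        rows.append((k, m, sd, gain))
        prev_mean = m
    return rows

# Hypothetical Ln Pr(X|K) values, three replicate runs per K.
example = {
    1: [-3500, -3502, -3498],
    2: [-3000, -3004, -2996],
    3: [-2900, -2903, -2897],
    4: [-2895, -2940, -2850],
}

rows = summarize_lnp(example)
for k, m, sd, gain in rows:
    gain_text = "n/a" if gain is None else f"{gain:.1f}"
    print(f"K={k}: mean={m:.1f}, sd={sd:.1f}, gain={gain_text}")
```

In this invented example the gain drops from 500 to 100 to 5 while the standard deviation jumps at K=4, which is the kind of plateau plus rising variance described above.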
Thank you all for the very helpful comments. Pablo, LOCPRIOR was an option, but I wanted to get the genetic clustering independent of the a priori population information.
I used the admixture model with correlated allele frequencies (following the suggestions in the software authors' 2003 paper).
Indeed, as many of you suggested (and as I found in a discussion forum where Pritchard commented), I increased the number of iterations (from 10 to 40) and the burn-in/MCMC (from 30k to 50k). One of the peaks (the lower structure level, K=3) disappeared. The curve has a little hump but not a true peak, so I ended up with the number of clusters that agrees with the Ln Pr (X|K), which always looked good at K=5.
I'm glad I've asked and did not try to come up with a biological explanation of the weird results.... it was more than likely an outlier run that generated that 2nd peak.
thanks so much to everybody that took the time to respond.
Great! Thanks!! I'll check it out. I do like to plot the obtained K, a few values below it, and one above it to examine the clustering. I find it interesting to see sublevels of substructure and to track the behavior of individual MLgenotypes in relation to the rest of the sample. Thank you!
I agree with running more iterations and increasing your burn-in, but I think many people miss a critical step in running Structure. I see it all the time: people run Structure once, put up the q-plots, and say "I can't decide between these two clusters." Multiple peaks in Structure are better interpreted by, first, doing iterative runs and, second, comparing with a second, spatially explicit Bayesian program. Many processes can interfere with Bayesian clustering programs, including isolation by distance, weak barriers, family groups, inbreeding, and sampling scheme, to name a few.
Structure is always going to find the highest level of differentiation in a given run. When you have a second peak in deltaK, it is almost always due to additional substructure within your dataset. To fully understand the true K, you should conduct what are called iterative runs in Structure. Iterative runs involve running each cluster you get on its own until there is no structure remaining.
After your first run, assign the individuals to each putative cluster (the clusters should be those at the K with your highest peak in deltaK). Then run the individuals of each of those new clusters separately in Structure again. I have had many datasets where I had to do 3-5 sets of iterative runs before I stopped finding any more structure.
As a check, I recommend running a second Bayesian program that uses spatial coordinates, because such programs often do not detect hierarchical structure but instead report the final set of clusters. I often use spatial BAPS, as it can give me an idea of how many clusters I am dealing with.
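The assignment step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes a table of membership coefficients (q values) per individual, assigns each individual to the cluster with the highest q, and leaves out individuals whose highest q is below a cutoff. The function name, the individual IDs, and the 0.8 cutoff are all hypothetical; choose a cutoff that suits your own data, and write out each subset in whatever input format you use for Structure.

```python
def split_by_cluster(q_matrix, min_q=0.0):
    """Group individual IDs by their best-supported cluster.

    q_matrix maps individual ID -> list of membership coefficients,
    one per cluster. Individuals whose highest q is below min_q are
    left unassigned (for example, strongly admixed individuals).
    Clusters are numbered from 1.
    """
    groups = {}
    for ind, q in q_matrix.items():
        best = max(range(len(q)), key=lambda c: q[c])
        if q[best] >= min_q:
            groups.setdefault(best + 1, []).append(ind)
    return groups

# Hypothetical q values from a K=2 run.
q_values = {
    "ind01": [0.95, 0.05],
    "ind02": [0.10, 0.90],
    "ind03": [0.55, 0.45],  # admixed, falls below the cutoff used here
    "ind04": [0.02, 0.98],
}

subsets = split_by_cluster(q_values, min_q=0.8)
for cluster, members in sorted(subsets.items()):
    print(f"Cluster {cluster}: rerun Structure on {members}")
```

Each subset is then run again on its own across a range of K, and the split is repeated on the results until no further structure is found.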
For an example of how to run iterative runs and analyze them, I recommend reading Balkenhol et al. 2014. The authors do a great job of explaining hierarchical structure in datasets.
Liz! thank you!!! I was not aware of those possibilities! you've significantly extended my horizon! I'll look into those publications... and I might have to ask you directly if I come up with questions to apply the suggested iterative methodology.