Dear Colleagues,

We want to use microsatellite data to analyze the historical dynamics of mammalian populations and, in particular, to test coalescent Bayesian skyline methods. We used BEAST2 v2.7.6. For the test analyses described below, we used a sample of 50 individuals genotyped at 12 microsatellite loci. In all cases, a strict clock with the default rate of 1 was used, and default values were kept for all parameters not mentioned below.

The manual for the BEASTvntr module (https://github.com/arjun-1/BEASTvntr) uses bacteria as its example. In prokaryotes, all loci reside on a single DNA molecule and are therefore analyzed as a single partition. In eukaryotes, however, microsatellite loci can lie on different chromosomes and are not linked, so it seems logical that each locus should be analyzed as a separate partition.

I am aware of only a few publications in which the BEASTvntr module was used for eukaryotes, and the corresponding methodology was not described in detail. Two publications said nothing about their data design: Kjartanson et al., 2023 (DOI: 10.3390/d15030385; the subject was a fish, Acipenser) and Prewer et al., 2020 (DOI: 10.1093/biolinnean/blz175; the subject was a mammal, Ovibos). Escudero et al., 2023 (DOI: 10.1093/aob/mcad087; the subject was a plant, Carex) noted that "clock, tree and site models were linked for the 33 SSRs following recommendations in the BEASTvntr manual", and Rugna et al., 2018 (DOI: 10.1371/journal.pntd.0006595; the subject was a protist, Leishmania) reported that "the diploid data were entered as two distinct partitions". From this it can be inferred that in both cases all loci were combined into a single partition, albeit taking diploidy into account in the second case.

We tried this approach to reconstruct a Bayesian skyline plot: we combined all 12 loci into a single matrix (two-column format) and imported it into BEAUti as a single partition. We used the Sainudiin or Sainudiin Computed Frequencies (SCF) model with a Gamma Category Count (GCC) of 6 or 4. In this case we had no problems with computation time and obtained ESS values >> 200 after 35 million iterations. The resulting plot looked plausible (see the top of the attached figure) and was consistent with what we actually expected for this population based on paleoclimatic data. But since the very approach of jointly analyzing alleles of different loci seems incorrect, I cannot be sure that this plot reflects the real situation.

Dr. Santiago Sanchez-Ramirez suggested combining multilocus data into a single matrix in single-column format (one column per locus, two rows per individual) and importing such a matrix into BEAUti with the multiple-partitions option (https://groups.google.com/g/beast-users/c/8qwtobmIWqo). We tried this option to reconstruct a Coalescent Extended Bayesian Skyline Plot (EBSP) and obtained a very strange plot that hardly reflects the real situation (middle part of the figure). This plot (except for the numbers on the axes) remained unchanged regardless of whether we used linked or unlinked site and/or clock models for the different partitions, and regardless of whether we used the Sainudiin or SCF model and which GCC value we chose.
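For readers trying to reproduce this restructuring, the conversion from our original layout (one row per individual, two allele columns per locus) to the suggested layout (one column per locus, two rows per individual) can be sketched as follows. This is a minimal illustration, not part of BEASTvntr; the function name and the simple dict-based input are hypothetical choices for the example.

```python
def two_column_to_single_column(genotypes, loci):
    """Convert diploid microsatellite repeat counts from a
    two-column-per-locus layout (one row per individual, allele
    pairs side by side) to a single-column-per-locus layout
    (two rows per individual, one column per locus).

    genotypes: dict mapping individual ID -> list of
               (allele1, allele2) tuples, one tuple per locus.
    loci:      list of locus names, same order as the tuples.
    Returns a list of (row_label, [allele per locus]) rows.
    """
    rows = []
    for ind, pairs in genotypes.items():
        if len(pairs) != len(loci):
            raise ValueError(f"{ind}: expected {len(loci)} loci")
        # First haplotype row: first allele of every locus.
        rows.append((f"{ind}_a", [a for a, _ in pairs]))
        # Second haplotype row: second allele of every locus.
        rows.append((f"{ind}_b", [b for _, b in pairs]))
    return rows


# Two individuals, two loci, repeat counts as integers.
rows = two_column_to_single_column(
    {"ind1": [(10, 12), (7, 7)], "ind2": [(11, 11), (8, 9)]},
    loci=["Locus1", "Locus2"],
)
for label, alleles in rows:
    print(label, *alleles)
```

Each pair of `_a`/`_b` rows represents the two haplotypes of one individual, so a diploid sample of 50 individuals becomes a matrix of 100 rows by 12 columns.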

Finally, we organized a separate two-column matrix for each locus and imported them, one by one, into BEAUti, each as a single partition. In the end, our data were again represented as 12 partitions. In this case, with linked or unlinked site models (Sainudiin, GCC 6) and/or clocks, we got the plot shown at the bottom of the figure. I cannot say anything about its plausibility.

With both variants of the 12-partition data representation we had serious problems with computation time and ESS values (about 5-9 after 10 million iterations or fewer, and about 15-25 after 20-40 million). In the case of 12 separate matrices, the runs were regularly interrupted with the error message "java.lang.RuntimeException: Could not find zero eigenvalue" (although they could be resumed using the "append log to existing files" option).
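To make the ESS problem concrete: ESS shrinks as autocorrelation in the chain grows, so a sticky chain can need orders of magnitude more iterations than an independent one. The sketch below approximates ESS from a trace with the usual initial-positive-sequence autocorrelation estimator; this is a simplified stand-in for what Tracer/BEAST report (they use a related but not identical estimator), included only to illustrate the effect.

```python
import numpy as np


def ess(trace):
    """Approximate effective sample size of an MCMC trace:
    n / (1 + 2 * sum of autocorrelations), summing lags until the
    autocorrelation first drops to zero or below."""
    x = np.asarray(trace, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Autocovariance of the zero-padded series via FFT.
    f = np.fft.rfft(x, 2 * n)
    acov = np.fft.irfft(f * np.conjugate(f))[:n] / n
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0:
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)


rng = np.random.default_rng(0)
n = 20000
# A highly autocorrelated AR(1) chain vs the same values shuffled
# (shuffling destroys the autocorrelation, mimicking an ideal sampler).
noise = rng.normal(size=n)
ar = np.empty(n)
ar[0] = 0.0
for i in range(1, n):
    ar[i] = 0.99 * ar[i - 1] + noise[i]
shuffled = rng.permutation(ar)
print("sticky chain ESS:", round(ess(ar)))
print("shuffled ESS:    ", round(ess(shuffled)))
```

With 20,000 samples the sticky chain yields an ESS of only a few hundred at most, which matches the pattern we saw: tens of millions of iterations but single- or double-digit ESS values.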

Can anyone comment on the above and advise on the best way to organize and analyze such data? Perhaps one or another of these approaches requires changing some additional parameters that we left at their default values?

Many thanks in advance!
