I was asked to analyze a "community" response, and the suggested response is a dissimilarity matrix. Ecologists take a sample at each of several locations (i). What they want to know, however, is how community composition changes as a function of the pairwise distances in the predictors between those locations.
The response: The samples contain the species present and their abundance at each location. To calculate a "community" response, the Bray-Curtis dissimilarity is calculated between each pair of locations from their species abundances. The response therefore lies in [0, 1] (the endpoints 0 and 1 may, but need not, occur in the data).
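To make the response concrete, here is a minimal Python sketch of the Bray-Curtis dissimilarity between two sites, using the usual definition (sum of absolute abundance differences divided by the sum of all abundances). The abundance vectors `site_a`, `site_b` and `site_c` are made-up illustrations, not data from my study.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two species-abundance vectors."""
    if len(u) != len(v):
        raise ValueError("both sites need the same species columns")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        raise ValueError("both sites are empty; the dissimilarity is undefined")
    return numerator / denominator


# Hypothetical abundances of three species at three sites.
site_a = [10, 0, 5]
site_b = [0, 8, 5]
site_c = [10, 0, 5]

print(bray_curtis(site_a, site_b))  # 18/28: partly overlapping communities
print(bray_curtis(site_a, site_c))  # 0.0: identical communities
print(bray_curtis([1, 0], [0, 1]))  # 1.0: no shared species
```

The last two calls show that both endpoints of [0, 1] can occur, which matters for a beta regression because the standard beta likelihood is only defined on the open interval.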
The predictors: positive real numbers on (0, Inf). The pairwise distance between two locations is calculated as the absolute difference in the predictor (which is the Euclidean distance in one dimension).
The response now consists of a matrix with i rows and j columns, where each row and each column represents a location and each ij cell holds a Bray-Curtis dissimilarity. The diagonal and the entries above it are removed, and the matrix is converted to long format, which then contains only the unique ij comparisons. For a square matrix with n=40, location i=1 is compared with every j={2, ..., n}; each later location contributes fewer new comparisons, because its comparisons with earlier locations are already counted. For the predictor, the absolute distance was calculated for each combination of locations i and j and merged with the corresponding response.
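The construction described above can be sketched as follows. This is only an illustration in Python (my appendix uses R); the function name `pairwise_long`, the four sites and their predictor values are all hypothetical.

```python
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two species-abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


def pairwise_long(abundance, predictor):
    """One row per unique site pair: dissimilarity plus predictor distance."""
    rows = []
    # combinations() yields each unordered pair once, which is the same set of
    # comparisons as one triangle of the matrix without the diagonal.
    for i, j in combinations(range(len(abundance)), 2):
        rows.append({
            "site_i": i,
            "site_j": j,
            "dissimilarity": bray_curtis(abundance[i], abundance[j]),
            "distance": abs(predictor[i] - predictor[j]),
        })
    return rows


# Hypothetical data: 4 sites, 3 species, one positive predictor per site.
abundance = [
    [10, 0, 5],
    [0, 8, 5],
    [3, 3, 3],
    [10, 1, 4],
]
predictor = [1.5, 4.0, 2.5, 1.0]

rows = pairwise_long(abundance, predictor)
for row in rows:
    print(row)
print(len(rows))  # 4 * 3 / 2 = 6 unique comparisons
```

Note that every site index appears in several rows, which is where the dependence between rows comes from.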
This matrix comparison is problematic because for n=40 we end up with 40*(40-1)/2=780 comparisons, although only 40 locations were sampled. This is a huge inflation of the degrees of freedom (df). One could model the variance of each i to j={1, ..., n} comparison separately, but this does not resolve the non-independence, because all comparisons that involve the same location i or j remain more closely related to each other than to the rest. More worrying is that the df is still inflated.
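The arithmetic behind this concern can be checked with a few lines of Python. This is only a counting sketch with my own variable names; it says nothing about how strong the dependence is in any particular data set.

```python
from collections import Counter
from itertools import combinations

n = 40
pairs = list(combinations(range(n), 2))
n_pairs = len(pairs)  # 40 * 39 / 2 = 780 rows in the long format

# How often does each site appear among the rows?
appearances = Counter(site for pair in pairs for site in pair)
per_site = set(appearances.values())  # every site appears in n - 1 = 39 rows

# Two distinct rows can share at most one site, so counting per site gives the
# number of row pairs that have a site in common.
rows_sharing_a_site = sum(c * (c - 1) // 2 for c in appearances.values())
all_row_pairs = n_pairs * (n_pairs - 1) // 2

print(n_pairs)              # 780
print(per_site)             # {39}
print(rows_sharing_a_site)  # 29640
print(rows_sharing_a_site / all_row_pairs)
```

So 780 rows are built from 40 sampled units, each unit is reused in 39 rows, and roughly one in ten pairs of rows shares a site.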
I have provided an example in R code in the appendix. I feel uncomfortable with this method, but I was asked to use it. I have left out all inferential statistics, with the exception of the point estimate, and I have laid out in detail the argument for doing so. However, the reviewer suggested this article: https://doi.org/10.1111/geb.13459. Its authors mention that “subsampling site-pairs would limit the degree to which this assumption of the underlying GLM methodology is being violated”. Yet this seems similar to modeling the variance of each pairwise combination: it limits the violation but does not remove it.
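To see what subsampling site-pairs does to the overlap between rows, here is a rough Python sketch. It is my own illustration and not the procedure from the cited article: the subsample size of 100, the seed, and the helper names `count_rows_sharing_a_site` and `disjoint_pairs` are assumptions made for the example.

```python
import random
from collections import Counter
from itertools import combinations


def count_rows_sharing_a_site(pairs):
    """Number of pairs of rows that have at least one site in common."""
    appearances = Counter(site for pair in pairs for site in pair)
    return sum(c * (c - 1) // 2 for c in appearances.values())


def disjoint_pairs(n_sites, rng):
    """Pair sites up at random so that no site is used twice (assumed scheme)."""
    sites = list(range(n_sites))
    rng.shuffle(sites)
    return [(sites[k], sites[k + 1]) for k in range(0, n_sites - 1, 2)]


n = 40
all_pairs = list(combinations(range(n), 2))
rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Random subsample of site pairs: fewer rows, but sites are still reused.
subsample = rng.sample(all_pairs, 100)

# Extreme case: every site used once, leaving only n / 2 = 20 rows.
disjoint = disjoint_pairs(n, rng)

print(count_rows_sharing_a_site(all_pairs))  # 29640 with all 780 rows
print(count_rows_sharing_a_site(subsample))  # smaller, but generally not zero
print(len(disjoint), count_rows_sharing_a_site(disjoint))  # 20 0
```

Only the fully disjoint scheme removes shared sites, and it does so by throwing away all but 20 comparisons, which illustrates why a random subsample limits rather than solves the problem.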
Thus, fitting a beta regression and modeling the variance of each ij comparison would give similar results. Every option therefore seems to be a bad one in this case. Yet the method is surprisingly common in ecology.
Any kind words of advice?
Best,