Hello,

I am currently analyzing a complex dataset for my master's thesis and I can't work out which model is appropriate. First note: I am using R to analyse my data.

I collected the data on pairs (subject-partner = 1 pair) of vervet monkeys in the field. The two variables are Groomings, collected ad libitum (the behaviour is recorded every time it is seen), and Proximities, collected through scans (every 30 minutes, each monkey's neighbours and their respective distances are recorded).

Because the monkey groups are fairly large (30 to 60 monkeys per group) and I am the sole collector, there is heavy zero inflation in the groomings (I cannot see, and therefore miss, many groomings), and some zero inflation in the scans as well (I cannot scan every monkey at every scan), but to a much lesser extent.

These initial data are counts. For the pair A and B, we have the count of groomings of A to B (GRa>b), the count of groomings of B to A (GRb>a), and the number of scans in which A and B were within 2 m of each other.

The groomings of the pairs are transformed into a Grooming Reciprocity Index (GRRI), measured as follows: (GRa>b - GRb>a)/(GRa>b + GRb>a). A ratio is necessary to standardize for environmental effects on the overall amount of grooming given by the monkeys, and a reciprocity index is necessary (instead of a classic in-degree/out-degree ratio GRa>b / GRb>a) to handle the cases where one monkey groomed the other 0 times (to avoid the division by 0 problem). The GRRI is created from counts but is no longer a count: it is now a continuous variable bounded between -1 and +1, where -1 means B does all the groomings of the pair, 0 means A and B contribute equally, and +1 means A does all the groomings of the pair.
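A minimal sketch of the index in R, following the formula as written (so a pair in which only A grooms scores +1). The function name `grri` and the example counts are made up for illustration, and returning NA for pairs with no recorded grooming in either direction (0/0) is an assumption, since the post does not say how such pairs are handled.

```r
# Grooming Reciprocity Index from directed grooming counts.
# Assumption: pairs with no grooming in either direction (0/0) get NA.
grri <- function(gr_ab, gr_ba) {
  total <- gr_ab + gr_ba
  ifelse(total == 0, NA_real_, (gr_ab - gr_ba) / total)
}

# Made-up counts for four pairs
gr_ab <- c(5, 0, 3, 0)  # groomings of A to B
gr_ba <- c(0, 4, 3, 0)  # groomings of B to A

grri(gr_ab, gr_ba)
# 1 -1 0 NA
```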

The proximities of the pairs are transformed from counts to the Percentage of Scans in which A and B were within 2 m of each other (because I could not always perform the same number of scans). So again, this new variable is created from counts but is no longer a count: it is now a continuous variable bounded between 0 and 100.

My goal is to investigate whether an experiment affects the relationship of the pairs that take part in the experiment. I have measures of the reciprocity index before and after the experiment, for pairs of monkeys that participated in the experiment (test) and for pairs of monkeys that did not (controls). Similarly for the proximity proportions.

So I want to see whether being part of the experiment induces a change in these variables. I have therefore converted each variable (GRRI and Proximity percentage) into a before/after difference: the Difference in GRRI (DiffGRRI), which is a continuous variable bounded between -2 and +2, and the Difference in Proximity (DiffProx), which is a continuous variable bounded between -100 and +100.
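For concreteness, the two response variables could be derived as in the sketch below. The data frame, its column names, and the values are hypothetical, and taking the difference as after minus before is an assumption (the post does not state the direction).

```r
# Hypothetical per-pair summaries before and after the experiment.
pairs <- data.frame(
  pair         = c("A-B", "C-D", "E-F"),
  grri_before  = c(-1, 0.2, 0),
  grri_after   = c(1, 0.2, -0.5),
  near_before  = c(10, 4, 0),    # scans with the pair within 2 m
  scans_before = c(40, 20, 25),  # scans performed
  near_after   = c(12, 10, 5),
  scans_after  = c(30, 20, 25)
)

# Percentage of scans in proximity, per period
pairs$prox_before <- 100 * pairs$near_before / pairs$scans_before
pairs$prox_after  <- 100 * pairs$near_after / pairs$scans_after

# Differences (assumed direction: after minus before)
pairs$DiffGRRI <- pairs$grri_after - pairs$grri_before  # within [-2, 2]
pairs$DiffProx <- pairs$prox_after - pairs$prox_before  # within [-100, 100]

pairs[, c("pair", "DiffGRRI", "DiffProx")]
```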

Those are my response variables, for which I want to build two separate models.

My fixed predictors, for both models, are:

Experimental Status: Test vs Control (binary categorical)

Kinship: yes (the two monkeys are related) or no (they are not) (binary categorical)

Sexes of the pair: male-male, male-female, female-male or female-female (categorical)

Ages of the pair: adult-adult, adult-juvenile, juvenile-adult or juvenile-juvenile (categorical)

Random factors, for both, are:

Identity of the Subject in the pair

Identity of the Partner in the pair

For the grooming variable, the maximal model (without interactions) would be:

DiffGRRI ~ Experimental_Status + Kinship + Sexe + Age + (1 | Subject_ID) + (1 | Partner_ID)

I then plan to simplify it towards a minimal model by model reduction (using the drop1 function).
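The fitting and reduction step could look like the sketch below. It assumes the lme4 package is installed and runs on simulated data: the monkey IDs, factor levels, and effect sizes are all made up, and the simulated response is an unbounded normal variable, so it only illustrates the syntax, not the behaviour of the real DiffGRRI. The model is fitted with `REML = FALSE` because likelihood-ratio comparisons of fixed effects should be based on maximum-likelihood fits.

```r
library(lme4)  # assumed to be installed

set.seed(1)
ids <- paste0("m", 1:12)                      # hypothetical monkey IDs
sex <- setNames(rep(c("M", "F"), 6), ids)     # made-up sex per monkey
age <- setNames(rep(c("A", "A", "J"), 4), ids)  # made-up age class per monkey

# All directed pairs (subject -> partner), excluding self-pairs
dat <- expand.grid(Subject_ID = ids, Partner_ID = ids,
                   stringsAsFactors = FALSE)
dat <- dat[dat$Subject_ID != dat$Partner_ID, ]
n <- nrow(dat)

dat$Experimental_Status <- factor(sample(c("Control", "Test"), n, replace = TRUE))
dat$Kinship <- factor(sample(c("no", "yes"), n, replace = TRUE))
dat$Sexe <- factor(paste0(sex[dat$Subject_ID], sex[dat$Partner_ID]))
dat$Age  <- factor(paste0(age[dat$Subject_ID], age[dat$Partner_ID]))

# Simulated response: subject and partner random intercepts plus noise
subj_eff <- setNames(rnorm(length(ids), sd = 0.3), ids)
part_eff <- setNames(rnorm(length(ids), sd = 0.3), ids)
dat$DiffGRRI <- subj_eff[dat$Subject_ID] + part_eff[dat$Partner_ID] +
  rnorm(n, sd = 0.5)

# Maximal model from the post, fitted by maximum likelihood
fit <- lmer(DiffGRRI ~ Experimental_Status + Kinship + Sexe + Age +
              (1 | Subject_ID) + (1 | Partner_ID),
            data = dat, REML = FALSE)

# Likelihood-ratio test for dropping each fixed effect in turn
dropped <- drop1(fit, test = "Chisq")
print(dropped)
```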

PROBLEM: when I fit a simple linear mixed-effects model (lmer in R) and then plot residual diagnostics, the assumptions of linearity, homoscedasticity and normality of the residuals are not fulfilled. The residuals vs fitted plot and the scale-location plot have non-horizontal trend lines (sometimes they even look like a sine function...), with very clear patterns in the scatter of points (in DiffGRRI, typically at 0, 1 and -1), and the QQ plot with qqline most of the time shows extreme departures from normality in the tails, or looks like a staircase.
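One possible way to see where the staircase and the clusters at 0, 1 and -1 might come from is to simulate sparse counts and push them through the same transformation. The sketch below uses made-up Poisson counts with a small mean (an assumption, not the real data) and drops pairs without any grooming in a period; the helper `sim_grri` is hypothetical.

```r
set.seed(42)
n <- 200  # hypothetical number of pairs

# GRRI from made-up sparse counts; pairs with no grooming at all give NA
sim_grri <- function(n, mean_count) {
  ab <- rpois(n, mean_count)
  ba <- rpois(n, mean_count)
  ifelse(ab + ba == 0, NA_real_, (ab - ba) / (ab + ba))
}

diff_grri <- sim_grri(n, 0.7) - sim_grri(n, 0.7)  # "after" minus "before"
diff_grri <- diff_grri[!is.na(diff_grri)]

# Only a handful of distinct values occur, many of them exactly on
# -2, -1, 0, 1 or 2, which is what produces steps in a QQ plot.
length(unique(diff_grri))
round(prop.table(table(diff_grri)), 2)
mean(diff_grri %in% c(-2, -1, 0, 1, 2))

# qqnorm(diff_grri); qqline(diff_grri)  # uncomment to draw the QQ plot
```

With counts this low, the difference of two indices behaves more like a discrete variable on a small set of points than like a continuous one, which is consistent with the diagnostics described above.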

So I wonder whether somebody here has an idea of what to do with this type of data, which comes from counts, is turned into a form of proportion and then into a difference between those proportions, and ends up being neither discrete nor fully continuous. It also has strong zero inflation.

I have looked into :

- Transforming the data: with so many zeros, plain log transformations are out. I did try log(DiffGRRI + 3) (shifting the DiffGRRI variable from -2;0;2 to 1;3;5), but the transformation doesn't change the diagnostics. I also tried using boxcoxfit to find other transformations, to no avail.

- Dealing with the zero inflation: this might be the cause of the non-linearity, so I looked into zero-inflated models, but they are built around the Poisson and Negative Binomial distributions, which are designed for count data and only work for non-negative values.

- Finding an appropriate Generalized Linear Mixed Model with a suitable distribution, but I didn't find one. The Skellam distribution is close (it is defined as the difference between the counts of two independent variables) but not exactly what I have (my difference is taken between two non-independent variables, as it is the same pair before/after). Also, there is no Skellam mixed-model regression available anyway, only a simple Skellam regression with no random factors.

- Alternatively, I could run some Wilcoxon tests if I only look at the difference due to experimental status (test or control), but I would then lose the other fixed factors. Also, in that regard, should I do one unpaired two-sample Wilcoxon test using DiffGRRI ~ Experimental status (testing whether the before/after difference differs between test and control), or should I do two separate paired Wilcoxon tests, using GRRItest ~ before/after (testing for a significant before/after difference in the test group) and GRRIcontrol ~ before/after (testing for a significant before/after difference in the control group)? Or maybe an ANOVA with both factors in?
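The two Wilcoxon options described in the last point could be written as in the sketch below, using base R's `wilcox.test`. The data are simulated and the column names are hypothetical; `exact = FALSE` requests the normal approximation, since exact p-values cannot be computed when there are ties, which this kind of variable has in abundance.

```r
set.seed(7)
n <- 30  # hypothetical number of pairs per group
vals <- c(-1, -0.5, 0, 0.5, 1)  # made-up GRRI values

dat <- data.frame(
  status = factor(rep(c("Control", "Test"), each = n)),
  before = sample(vals, 2 * n, replace = TRUE),
  after  = sample(vals, 2 * n, replace = TRUE)
)
dat$DiffGRRI <- dat$after - dat$before

# Option 1: one unpaired two-sample test on the before/after differences
opt1 <- wilcox.test(DiffGRRI ~ status, data = dat, exact = FALSE)

# Option 2: one paired before/after test within each group
test_grp <- dat[dat$status == "Test", ]
ctrl_grp <- dat[dat$status == "Control", ]
opt2_test <- wilcox.test(test_grp$after, test_grp$before,
                         paired = TRUE, exact = FALSE)
opt2_ctrl <- wilcox.test(ctrl_grp$after, ctrl_grp$before,
                         paired = TRUE, exact = FALSE)

c(unpaired = opt1$p.value,
  paired_test = opt2_test$p.value,
  paired_control = opt2_ctrl$p.value)
```

Note that the two options answer different questions: option 1 compares the change between groups in a single test, whereas option 2 yields two separate within-group results, and one of them being significant while the other is not does not by itself show that the groups differ.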

If anybody can shed any light on this topic, I would be extremely grateful. In fact, if anybody takes the time to read this, I am already thankful.

Best regards!
