What regression model should I use for a bounded, zero-inflated response variable that takes both negative and positive values, if a linear mixed model doesn't apply?

13 November 2021

Hello,

I am currently analyzing a complex dataset for my master's thesis and I can't find an appropriate model. First, a note: I am using R to analyse my data.

I collected the data on pairs (subject-partner = 1 pair) of vervet monkeys in the field. The two variables are Groomings, collected ad libitum (every time the behaviour is seen, it is recorded), and Proximities, collected through scans (every 30 minutes, the neighbours of each monkey and their respective distances are recorded).

Because the monkey groups are pretty big (30 to 60 monkeys per group) and I was the sole collector, there is a huge zero inflation in groomings (I can't see everything and thus miss a lot of groomings), and some zero inflation in the scans as well (I can't scan all monkeys at every scan), but to a much lesser extent.

These initial data are counts. For the pair A and B, we have the count of groomings of A to B (GRa>b), the count of groomings of B to A (GRb>a), and the number of scans in which A and B were within 2 m of each other.

The groomings of the pairs are transformed into a Grooming Reciprocity Index, measured as follows: (GRa>b - GRb>a)/(GRa>b + GRb>a). A ratio is necessary to standardize the environmental impact on the overall amount of grooming given by the monkeys, and a reciprocity index is necessary (instead of a classic in-degree/out-degree ratio GRa>b / GRb>a) to deal with the cases where one monkey groomed the other 0 times (to avoid the division by 0 problem). The Grooming Reciprocity Index (GRRI) is created from counts but is no longer a count; it is now a continuous variable bounded between -1 and +1. With the formula above, +1 means A does all the groomings of the pair, 0 means A and B contribute equally, and -1 means B does all the groomings of the pair.
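As a minimal sketch of the index described above (the function name `grri` and the `both_zero` argument are hypothetical, not from the original post): note that the formula still gives 0/0 when neither monkey was seen grooming the other, so the code makes the coding of that case an explicit choice rather than hiding it.

```r
# Hypothetical helper: grooming reciprocity index for pairs A-B.
# a_to_b, b_to_a: non-negative grooming counts (vectors are allowed).
# both_zero: value assigned when no grooming was observed in either
#            direction (assumption: NA by default; pass 0 to code it as 0).
grri <- function(a_to_b, b_to_a, both_zero = NA_real_) {
  total <- a_to_b + b_to_a
  index <- (a_to_b - b_to_a) / total        # 0/0 evaluates to NaN in R
  index[which(total == 0)] <- both_zero     # make the 0/0 case explicit
  index
}

# Four made-up pairs: A grooms more, only B grooms, equal, nothing observed
grri(a_to_b = c(3, 0, 5, 0), b_to_a = c(1, 4, 5, 0))
```

Whether a pair with no observed grooming should be dropped (`NA`) or scored as 0 is a modelling decision; scoring it as 0 mixes "perfectly reciprocal" with "no data".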

The proximities of the pairs are transformed from counts to Percentages of Scans in which A and B were within 2 m of each other (because I couldn't always perform the same number of scans). So again, this new variable is created from counts but is no longer a count; it is now a bounded continuous variable between 0 and 100.

My goal is to investigate whether an experiment affects the relationship of the pairs that take part in the experiment. I have measures of the reciprocity index before and after the experiment, for pairs of monkeys that participated in the experiment (test) and for pairs of monkeys that did not (controls). Similarly for the proximity proportions.

So I want to see if being part of the experiment induces a change in the variable. I have therefore transformed both variables (GRRI and Proximity percentages) into two variables, the Difference before/after in GRRI (DiffGRRI), which is thus a continuous variable bounded between -2 and +2, and the Difference before/after in Proximity (DiffProx), which is thus a continuous variable bounded between -100 and +100.
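The two response variables could be built along these lines. This is only a sketch on made-up numbers: the data frame `dyads`, its column names, and the sign convention (after minus before) are assumptions, since the post does not say which direction the difference is taken in.

```r
# Made-up data for four pairs; all column names are hypothetical.
dyads <- data.frame(
  grri_before   = c(-1, 0, 0.5,  1),
  grri_after    = c( 1, 0, 0.0, -1),
  prox_n_before = c( 2, 0, 5, 10),   # scans with A and B within 2 m
  scans_before  = c(10, 8, 10, 10),  # scans performed on the pair
  prox_n_after  = c( 4, 0, 5,  0),
  scans_after   = c( 8, 8, 10, 10)
)

# Proximity as a percentage of scans, bounded in [0, 100]
dyads$prox_before <- 100 * dyads$prox_n_before / dyads$scans_before
dyads$prox_after  <- 100 * dyads$prox_n_after  / dyads$scans_after

# Assumed convention: difference = after - before
dyads$DiffGRRI <- dyads$grri_after - dyads$grri_before   # in [-2, 2]
dyads$DiffProx <- dyads$prox_after - dyads$prox_before   # in [-100, 100]

dyads[, c("DiffGRRI", "DiffProx")]
```

The first and last rows show how the bounds -2, +2 and -100 are reached only when a pair flips from one extreme to the other.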

Those are my response variables, for which I want to build two separate models.

My fixed predictors, for both, are :

Experimental Status: Test vs Control (categorical, two levels)

Kinship: yes (the two monkeys are related) or no (they are not) (categorical, two levels)

Sexes of the pair: male-male, male-female, female-male or female-female (categorical)

Ages of the pair: adult-adult, adult-juvenile, juvenile-adult or juvenile-juvenile (categorical)

Random factors, for both, are:

Identity of the Subject in the pair

Identity of the Partner in the pair

The maximal model (without interactions) would be for the groomings variable:

DiffGRRI ~ Experimental_Status + Kinship + Sexe + Age + (1 | Subject_ID) + (1 | Partner_ID)

I then plan to simplify this towards a minimal model by model reduction (using the drop1 function).
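For readers who want to reproduce the setup, the maximal model and the drop1 step could look roughly like this. It is a sketch on simulated data, not the original dataset: the 12 monkey IDs, the random factor levels and the effect sizes are all invented, and the model is fitted by maximum likelihood (`REML = FALSE`) as an assumption so that the likelihood-ratio tests on fixed effects are comparable.

```r
library(lme4)

set.seed(1)
ids   <- paste0("m", 1:12)                      # invented monkey IDs
dyads <- expand.grid(Subject_ID = ids, Partner_ID = ids,
                     stringsAsFactors = FALSE)
dyads <- dyads[dyads$Subject_ID != dyads$Partner_ID, ]
n     <- nrow(dyads)

# Invented predictors, drawn at random for illustration only
dyads$Experimental_Status <- factor(sample(c("Control", "Test"), n, replace = TRUE))
dyads$Kinship <- factor(sample(c("no", "yes"), n, replace = TRUE))
dyads$Sexe    <- factor(sample(c("FF", "FM", "MF", "MM"), n, replace = TRUE))
dyads$Age     <- factor(sample(c("AA", "AJ", "JA", "JJ"), n, replace = TRUE))

# Invented response: subject and partner effects plus noise, clamped to [-2, 2]
subj_eff <- setNames(rnorm(length(ids), sd = 0.4), ids)
part_eff <- setNames(rnorm(length(ids), sd = 0.4), ids)
raw      <- subj_eff[dyads$Subject_ID] + part_eff[dyads$Partner_ID] +
            rnorm(n, sd = 0.5)
dyads$DiffGRRI <- unname(pmax(-2, pmin(2, raw)))

# Maximal model from the post, with crossed random intercepts
fit <- lmer(DiffGRRI ~ Experimental_Status + Kinship + Sexe + Age +
              (1 | Subject_ID) + (1 | Partner_ID),
            data = dyads, REML = FALSE)

# Single-term deletions with likelihood-ratio tests
reduced <- drop1(fit, test = "Chisq")
reduced
```

On data as lumpy as the real DiffGRRI, this runs but does not cure the residual problems the post goes on to describe; it only shows the mechanics.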

PROBLEM: using a simple linear mixed-effects model (lmer in R), when I run diagnostic plots on the residuals, the necessary assumptions of linearity, homoscedasticity and normality of the residuals are not fulfilled. The residuals vs fitted plot and the scale-location plot have non-horizontal trend lines (sometimes they even look like a sine function...), with very clear patterns in the scatterplot points (in DiffGRRI, typically at 0, 1, -1), and the Q-Q plot with its qqline most of the time shows extreme departures from normality at the tails, or looks like a staircase.

So I wonder if somebody here has an idea of what to do with this type of data, which comes from counts turned into a form of proportion and then into a difference between those proportions, and which ends up being neither discrete nor fully continuous. Also, with a strong zero inflation.

I have looked into :

- Transforming the data: with so many zeros, log transformations are out. I did use log(x + 3) (to shift the DiffGRRI variable from -2; 0; 2 to 1; 3; 5 before taking the log), but the transformation doesn't change the diagnostics. I also tried using boxcoxfit to find other transformations, to no avail.

- Dealing with the zero inflation: this might be the cause of the non-linearity, so I looked into zero-inflated models, but they are built around the Poisson and Negative Binomial distributions, which are designed for count data and only work for non-negative values.

- Finding a Generalized Linear Mixed Model with an appropriate distribution, but I didn't find one. The Skellam distribution is close (defined as the difference between two independent Poisson counts) but not exactly what I have (my difference is measured between two non-independent variables, as it is the same pair before/after). Also, there is no Skellam Mixed Model regression available anyway, only a simple Skellam regression with no random factors.

- Alternatively, I could run some Wilcoxon tests if I simply look at the difference coming from the experimental status (test or control), but I'll then lose the other fixed factors. Also, in that regard, should I do one unpaired two-sample Wilcoxon test using DiffGRRI ~ Experimental status (testing whether the before/after difference differs between test and control), or should I do two separate paired Wilcoxon tests, one using GRRItest ~ before/after (testing for a significant before/after difference in the test group) and one using GRRIcontrol ~ before/after (testing for a significant before/after difference in the control group)? Or maybe an ANOVA with both factors in?
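To make the two Wilcoxon options in the last point concrete, here is a sketch on simulated data (the data frame `toy`, its columns and the simulated effect are invented for illustration). The unpaired test on the differences asks whether the change differs between test and control pairs; the two paired tests each ask whether one group changed at all, and do not compare the groups with each other. `exact = FALSE` is used because tied values, which this kind of index produces in abundance, rule out exact p-values.

```r
set.seed(42)
n_per_group <- 20

# Invented before/after index values for control and test pairs
toy <- data.frame(
  status = factor(rep(c("Control", "Test"), each = n_per_group)),
  before = round(runif(2 * n_per_group, -1, 1), 1)
)
shift     <- ifelse(toy$status == "Test", 0.3, 0)   # invented effect
toy$after <- pmax(-1, pmin(1, toy$before + shift +
                             round(rnorm(2 * n_per_group, sd = 0.3), 1)))
toy$change <- toy$after - toy$before

# Option 1: one unpaired two-sample test on the differences
between_groups <- wilcox.test(change ~ status, data = toy, exact = FALSE)

# Option 2: one paired (signed-rank) test per group
test_rows <- toy$status == "Test"
within_test <- wilcox.test(toy$after[test_rows], toy$before[test_rows],
                           paired = TRUE, exact = FALSE)
within_ctrl <- wilcox.test(toy$after[!test_rows], toy$before[!test_rows],
                           paired = TRUE, exact = FALSE)

c(between = between_groups$p.value,
  test    = within_test$p.value,
  control = within_ctrl$p.value)
```

A significant result in the test group together with a non-significant one in the control group is not by itself evidence that the groups differ; only the first option tests that directly.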

If anybody has any light to bring on this topic, I would be extremely grateful. Actually, if anybody takes time to read this, I would already be thankful.

Best regards !

D. Eastern Kang Sim

I did not read your post entirely; my response is based on your title and on skimming through a few details. I assume you are familiar with linear mixed-effects modeling. There are many ways to tweak your LME model, and several extensions of it exist. One LME extension I would suggest you explore is two-part mixed-effects modeling, also known as a zero-inflated Gaussian mixed-effects model. Google 'hurdle mixed model' or 'zero inflated gaussian' and you will find a few packages that apply these techniques.

Lucas Zermatten

Hi D. Eastern Kang Sim,

Thanks for your answer. I read it a month ago and have tried to apply it to my dataset. It seems that it would theoretically have been a good option, although diagnostics were terrible. I assume it comes from me not actually having a zero-inflated dataset, but a -1, 0 and +1 inflated dataset. Indeed, due to the index of reciprocity drawing groomings from two monkeys to establish a single value, the same process generating 0 inflation (no data for both monkeys) also generates -1 and +1 inflation (no data for one of the two monkeys).
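The mechanism described here is easy to see in a small simulation. This is only an illustration under invented assumptions: both grooming counts are drawn as independent Poisson variables with a mean of 0.5 observed groomings per direction, and pairs with no observed grooming are scored as 0, as described above.

```r
set.seed(123)
n_pairs <- 5000

# Sparse, independently drawn grooming counts (invented rate of 0.5)
a_to_b <- rpois(n_pairs, lambda = 0.5)
b_to_a <- rpois(n_pairs, lambda = 0.5)
total  <- a_to_b + b_to_a

# Reciprocity index, with "no grooming observed" scored as 0 (assumption)
index <- ifelse(total == 0, 0, (a_to_b - b_to_a) / total)

# Distribution of the index: most of the mass sits on -1, 0 and +1
round(prop.table(table(index)), 3)

# Share of pairs sitting exactly on one of the three inflated values
share_at_spikes <- mean(index %in% c(-1, 0, 1))
share_at_spikes
```

With counts this sparse, the large majority of pairs land on one of the three spikes, which is why a model that only accounts for an excess of zeros still leaves the -1 and +1 pile-ups unexplained.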

For posterity, if somebody finds this thread due to having the same issue as I had, I ended up simplifying the question and simply performing Wilcoxon paired tests.

Thanks for your answer nonetheless.
