Hi,
What I want to know is whether a process is predictable from some numerical indices, even when these are obtained under very different conditions (treatments).
The data are empirical.
The design consists of four replicated environments. In each replicate, a process of interest and four predictor variables are measured.
Process = success/fail variable
Predictor variables = 4 numerical, normally distributed variables
Environment = Categorical variable with 4 states
Replicates for each environment = 5, 5, 4, 4
I am interested in determining which variables are relevant to explaining the process. If these variables are truly accurate surrogates for the process, they should be robust regardless of the environment. Thus I was thinking of a model where environment does not enter as a predictor:
model_1 <- glm(Process ~ Var1 + Var2 + Var3 + Var4, family = binomial)
On this model I ran an Anova (car package in R, type II) to ascertain the relative contribution of the variables.
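For concreteness, a minimal sketch of that call, on invented data (the data frame name `dat` and all values are hypothetical, just to make it runnable):

```r
# Hypothetical example data: 18 replicates, binary outcome, four predictors
set.seed(1)
dat <- data.frame(
  Process = rbinom(18, 1, 0.5),
  Var1 = rnorm(18), Var2 = rnorm(18),
  Var3 = rnorm(18), Var4 = rnorm(18)
)

model_1 <- glm(Process ~ Var1 + Var2 + Var3 + Var4,
               family = binomial, data = dat)

# Type II tests (likelihood-ratio by default for a glm) via the car package
if (requireNamespace("car", quietly = TRUE)) {
  print(car::Anova(model_1, type = "II"))
}
```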
I also tried model selection by Akaike's Information Criterion (the stepAIC routine from MASS) to determine which variables most clearly explain the process while limiting overfitting.
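The stepAIC run looks roughly like this (again on invented data, so the selected formula is meaningless here; it only shows the mechanics):

```r
library(MASS)  # provides stepAIC

set.seed(2)
dat <- data.frame(
  Process = rbinom(18, 1, 0.5),
  Var1 = rnorm(18), Var2 = rnorm(18),
  Var3 = rnorm(18), Var4 = rnorm(18)
)

full <- glm(Process ~ Var1 + Var2 + Var3 + Var4,
            family = binomial, data = dat)

# Stepwise search in both directions, scored by AIC
sel <- stepAIC(full, direction = "both", trace = FALSE)
formula(sel)  # the retained predictors
```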
However, because my design is not orthogonal (it is unbalanced), two of the environments carry greater weight in the outputs.
So I turned to a generalized linear mixed model (glmer, from the lme4 package in R), so that I can control for variation within environments:
model_2 <- glmer(Process ~ Var1 + Var2 + Var3 + Var4 + (1|Environment), family = binomial)
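Spelled out with data (variable names and values invented for illustration, and group sizes matching the 5, 5, 4, 4 design), that fit would be:

```r
# Random-intercept logistic model; needs the lme4 package for glmer
if (requireNamespace("lme4", quietly = TRUE)) {
  set.seed(3)
  dat <- data.frame(
    Process = rbinom(18, 1, 0.5),
    Var1 = rnorm(18), Var2 = rnorm(18),
    Var3 = rnorm(18), Var4 = rnorm(18),
    # Four environments with 5, 5, 4, 4 replicates, as in the design
    Environment = factor(rep(c("A", "B", "C", "D"), times = c(5, 5, 4, 4)))
  )
  model_2 <- lme4::glmer(
    Process ~ Var1 + Var2 + Var3 + Var4 + (1 | Environment),
    family = binomial, data = dat
  )
  print(summary(model_2))
}
```

(With samples this small, expect convergence or singular-fit warnings; the sketch is only meant to show the call.)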
As far as I understand, this is not incorrect, and may even be recommendable, since it takes into account that these observations do not represent an entire population. But model selection becomes a nightmare: subtle changes in the outputs from one choice to another are always tricky to interpret. So I wanted some feedback on whether this procedure is correct, or whether I should do something totally different.
Moreover, the model selection routine (stepAIC) cannot be run on glmer models, and I have been unsuccessful in finding satisfactory alternatives.
Any recommendations or thoughts?
Thanks in advance
Alex