25 September 2019

Hi everyone! I'm fairly new to regression modelling, and was hoping someone could help with what I hope is a very basic question.

I have a set of repeated observations on a regular grid (i.e. they vary by latitude, longitude, and time). Among the observed variables is a dependent variable, y, which I would like to model as a function of an independent variable, x. It is known from the literature that x has a significant negative effect on y, but the magnitude of this effect is often mitigated by other variables, which also need to be accounted for. Many of the variables I have to choose from are also quite correlated with each other, as well as with y.

So far, I've tried using a backward stepwise regression with ordinary least squares: I start off fitting all the variables I have against y, and then remove variables until the p-value for all remaining variables is < 0.05. If needed, I then remove variables until the variance inflation factor (VIF) for all remaining variables is < 10.
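To make the procedure concrete, here is a minimal sketch of the two-stage elimination described above, run on synthetic data rather than my real grid. The function names (`ols_pvalues`, `vif`, `backward_select`) are hypothetical, I assume one variable is dropped per step (the one with the largest p-value, then the one with the largest VIF), and the p-values use a normal approximation to the t distribution, which is only reasonable with many more observations than variables.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """Fit OLS with an intercept; return slopes and two-sided p-values.

    Assumption: p-values come from a normal approximation to the t
    distribution, so they are only rough for small samples.
    """
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.pinv(A.T @ A)))
    z = np.abs(beta / se)
    p = np.array([math.erfc(v / math.sqrt(2)) for v in z])
    return beta[1:], p[1:]  # drop the intercept

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2_j)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        ss_res = np.sum((X[:, j] - fitted) ** 2)
        ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = ss_tot / ss_res if ss_res > 0 else np.inf
    return out

def backward_select(X, y, names, alpha=0.05, vif_max=10.0):
    keep = list(range(X.shape[1]))
    # Stage 1: drop the least significant variable until all p < alpha.
    while keep:
        _, p = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] < alpha:
            break
        keep.pop(worst)
    # Stage 2: drop the most collinear variable until all VIF < vif_max.
    while len(keep) > 1:
        v = vif(X[:, keep])
        worst = int(np.argmax(v))
        if v[worst] < vif_max:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

# Synthetic stand-in for the real data (hypothetical variables x, w, u).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
w = rng.normal(size=n)
u = rng.normal(size=n)  # pure noise, unrelated to y
y = -2.0 * x + 1.5 * w + rng.normal(size=n)
names = ["x", "w", "u"]
selected = backward_select(np.column_stack([x, w, u]), y, names)
print(selected)
```

Nothing in either stage protects x: it is treated like any other candidate, so it can be dropped whenever a correlated variable happens to absorb its effect.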

While this produces models that have a fairly high R^2, I often find that x either has the wrong sign or has been removed by the stepwise process entirely. I am sure that my approach is too simplistic, as in some areas my variables do not appear to be linearly related to y. I've tried getting around this by fitting separate models for different regions, but this does not help.

I understand that longitudinal data require at least a linear mixed effects model, but I am unsure how to set one up for my dataset from the examples I've seen online. I've also read papers where tools like LASSO are used to select the best parameters, but these also require some pre-conditioning of the data, with which I am also unfamiliar.
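As far as I understand, the pre-conditioning usually meant for LASSO is centring and scaling each predictor, because the L1 penalty depends on the units of the variables. Below is a minimal sketch of that step followed by a plain coordinate-descent LASSO, again on synthetic data. The names `standardise` and `lasso_cd` and the penalty value `lam=0.1` are my own assumptions for illustration, not a recommendation; in practice the penalty would be chosen by cross-validation.

```python
import numpy as np

def standardise(X):
    """Centre each column and scale it to unit variance."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    return (X - mu) / sd, mu, sd

def soft_threshold(a, lam):
    """Shrink a towards zero by lam, setting it to zero if |a| <= lam."""
    return np.sign(a) * max(abs(a) - lam, 0.0)

def lasso_cd(Z, y, lam, n_iter=200):
    """Minimise (1/2n)||yc - Z b||^2 + lam * ||b||_1 by coordinate descent.

    Z must already be standardised; y is centred here, so no intercept
    is fitted.
    """
    n, k = Z.shape
    yc = y - y.mean()
    b = np.zeros(k)
    resid = yc.copy()  # residual for b = 0
    for _ in range(n_iter):
        for j in range(k):
            resid += Z[:, j] * b[j]          # remove variable j's contribution
            rho = Z[:, j] @ resid / n
            b[j] = soft_threshold(rho, lam)  # valid because (1/n) z_j'z_j = 1
            resid -= Z[:, j] * b[j]          # add the updated contribution back
    return b

# Synthetic stand-in: w is strongly correlated with x, u is pure noise.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
w = 0.9 * x + 0.3 * rng.normal(size=n)
u = rng.normal(size=n)
y = -2.0 * x + 1.0 * w + rng.normal(size=n)

Z, mu, sd = standardise(np.column_stack([x, w, u]))
coef = lasso_cd(Z, y, lam=0.1)
print(dict(zip(["x", "w", "u"], coef)))
```

The coefficients are on the standardised scale, so dividing each by the matching entry of `sd` would give the effect per original unit.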

I would be most grateful for any advice anyone here could give me.
