My explanatory variables are count data, and some have many zeros. In linear regression, the variance increases with increasing values of the explanatory and response variables. Which type of regression should I use? Is quasi-Poisson suitable for this type of data?
You could use a zero-inflated regression model, in either its Poisson or negative binomial form. These models have two components: i) one explaining the excess zeros and ii) one explaining the count data.
In Stata, the commands are "zip" and "zinb".
You should run diagnostic tests in order to select the appropriate regression model.
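For R users, a rough equivalent might look like the sketch below, using the pscl package; the names dat, y, x1 and x2 are placeholders, not taken from the question.

library(pscl)

# Zero-inflated Poisson: count component before "|", zero-inflation component after
zip_fit <- zeroinfl(y ~ x1 + x2 | x1 + x2, data = dat, dist = "poisson")

# Zero-inflated negative binomial with the same structure
zinb_fit <- zeroinfl(y ~ x1 + x2 | x1 + x2, data = dat, dist = "negbin")

# A simple diagnostic comparison: lower AIC suggests the better-supported model
AIC(zip_fit, zinb_fit)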
LRM: never try to include a variable in a subset of N observations if it has fewer than 3 cases, or more than N-3. Starting from this, it is up to you to decide how you will identify under-/over-fitting.
If you are using count data, you can indeed choose between a Poisson, a negative binomial, or a zero-inflated model, depending on the skewness of your variable.
Here is a really nice tutorial on these models in R: http://datavoreconsulting.com/programming-tips/count-data-glms-choosing-poisson-negative-binomial-zero-inflated-poisson/
I agree, zero-inflated models seem the way to go with the data you describe. You can combine a binomial model for the probability of having the effect or not with a Poisson or negative binomial model for the process you are measuring. In both cases you can include covariates using the generalised linear model (GLM) framework.
It is difficult to say without comparing various options. Sometimes an overdispersed Poisson or NB model will be sufficient. However, if there really are excess zeros, then ZINB or ZIP models are an option, but you should also consider hurdle models.
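As a hedged illustration, a hurdle model can be fitted in R with the pscl package; the variable names below are placeholders.

library(pscl)

# Binomial hurdle for zero vs. non-zero; truncated negative binomial for positive counts
hurdle_fit <- hurdle(y ~ x1 + x2, data = dat, dist = "negbin", zero.dist = "binomial")
summary(hurdle_fit)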
Yes, negative binomial (NB) and zero-inflated Poisson or NB models are generally the way to go. If you have the time to spend learning the approaches, you could go a step further and use hierarchical models to model the state and observation processes, which accounts for the excess zeros. The ecology literature has dealt with this a lot. There is a good text:
Royle, J.A. and R.M. Dorazio. 2008. Hierarchical Modeling and Inference in Ecology: The Analysis of Data from Populations, Metapopulations, and Communities. Academic Press, San Diego, CA. xviii, 444 pp.
Many of the appropriate models are implemented in the R package "unmarked"; the text above also covers fitting models using Bayesian approaches in WinBUGS.
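For illustration only (the data objects counts, elev and effort below are made up), an N-mixture model in unmarked might look like this:

library(unmarked)

# counts: sites x repeat-visits matrix of raw counts (placeholder)
# elev: one value per site; effort: sites x visits matrix (placeholders)
umf <- unmarkedFramePCount(y = counts,
                           siteCovs = data.frame(elev = elev),
                           obsCovs  = list(effort = effort))

# First formula part models detection, second models abundance (the state process);
# mixture = "ZIP" adds zero inflation to the abundance distribution
fit <- pcount(~ effort ~ elev, data = umf, K = 100, mixture = "ZIP")
summary(fit)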
Seeing all this discussion about Poisson and Negative Binomial models makes me wonder if people misread the original post (or alternatively if I'm missing something!). The original post says that the EXPLANATORY variables are zero-inflated count data, but it says nothing at all about the nature of the outcome variable. All it says is that "In linear regression variance is increasing with increasing explanatory and response variable."
So to Matija: What is the outcome variable? How do the residuals look apart from the heteroscedasticity?
My response variable is continuous (originally count data, namely deer pellet group counts, but it became continuous because of several corrections/conversions) with positive values and some zeros (12 out of 120 data points). I must note that some zeros are actually false zeros (animals were present, we just did not detect them) and some are true zeros (animals were absent). My explanatory variables are also continuous (originally counts) because of averaging.
When I apply linear regression, my residuals are clumped at small values and increasingly scattered towards higher predicted values. There is obviously overdispersion in my data.
I was advised to use a GLM with a gamma distribution. Since the gamma distribution does not allow zero values, I could apply it in two ways: 1. adding a small value to all zeros (which is statistically incorrect, but ecologically meaningful, since most of the zeros are false anyway), or 2. fitting a two-part model (modelling the probability of zeros and the continuous part with a gamma GLM separately). However, the second approach does not seem logical to me, since most of my zeros are false.
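To make the two options concrete, here is a hedged R sketch; eps and all variable names are placeholders, and neither option is an endorsement.

# Option 1: shift only the zeros by a small constant so the gamma GLM is defined
eps <- 0.001  # arbitrary; results can be sensitive to this choice
dat$y_shift <- ifelse(dat$response == 0, eps, dat$response)
gamma_fit <- glm(y_shift ~ x1 + x2, family = Gamma(link = "log"), data = dat)

# Option 2 (two-part): binomial GLM for P(zero), gamma GLM for the positive part
zero_fit <- glm(I(response == 0) ~ x1 + x2, family = binomial, data = dat)
pos_fit  <- glm(response ~ x1 + x2, family = Gamma(link = "log"),
                data = subset(dat, response > 0))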
Anyway, I still do not know how to deal with the excess of zeros in my explanatory variables! All of these different methods/regressions seem to be tailored to different types of response variable.
Okay, based on your quote above ("my residuals are clumped at small values and increasingly scattered towards higher values of predicted values"), you have a heteroscedasticity problem. That can result from many issues and has several potential solutions. However, I think it may result from trying to contort count data into a continuous variable. In typical Poisson regression, the variance is assumed equal to the mean, which produces the sort of pattern you see and could be the source of your problem.
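As a rough check (the model and variable names are placeholders), you can fit a Poisson GLM and compute the Pearson dispersion statistic:

# Quick overdispersion check for a Poisson fit on (rounded) counts
pois_fit <- glm(round(response) ~ x1 + x2, family = poisson, data = dat)
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
dispersion  # values well above 1 indicate overdispersion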
I presume that the corrections you applied were for detection probability? If so, I would imagine your response variable is not really as continuous as you imagine. Try creating a histogram of it; in my experience, applying detection probability adjustments to raw count data tends to create a series of separate Gaussian distributions around the individual counts.
If so, try reverting to your raw counts (drop your corrections) for your response variable, and try using a hierarchical model as I suggested above (see the Royle and Dorazio 2008 reference). In a hierarchical model, you can model the variable of interest (your state variable, i.e. the count data) with a Poisson or NB regression model, while simultaneously incorporating your detection probability model to account for the observation process.
Alternatively, you can keep your corrections, round the data to the nearest whole number, and use a count-based regression.
If none of the above fit your situation, you could try weighted least squares (WLS) regression. The kind of heteroscedasticity you describe does not tend to bias parameter estimates (slopes), but it can cause underestimates of variance (the SE of the parameter estimates), and WLS will account for this.
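As a hedged sketch of one common WLS recipe (assuming, purely for illustration, that the error SD is proportional to the fitted mean; all names are placeholders):

# Step 1: ordinary least squares to get fitted values
ols_fit <- lm(response ~ x1 + x2 + x3 + x4 + x5, data = dat)

# Step 2: if SD is proportional to the mean, variance is proportional to mean^2,
# so weight each observation by 1/fitted^2 (guarding against nonpositive fits)
w <- 1 / pmax(fitted(ols_fit), 1e-6)^2
wls_fit <- lm(response ~ x1 + x2 + x3 + x4 + x5, data = dat, weights = w)
summary(wls_fit)

Note that the weights here are a function of the fitted values from all predictors, not of any single predictor.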
Finally, if I am off base, perhaps try pasting your residual plots / OLS regression diagnostic graphs into this discussion forum. I am not certain whether this forum allows graphics, but a picture is worth a thousand words.
Steven, thank you for your very comprehensive answer.
There were several (five) corrections made, so there is no way I can go back to the raw counts and model all the corrections. I think it is inevitable that I include my response variable as it is, with all of its corrections.
Rounding the data to the nearest integer means losing part of the information, so I would prefer to avoid that.
To give even better insight into my problem, I have also added my data. The "Response" column obviously contains the response variable.
The OLS model above includes no intercept, since I want the prediction to be zero when all predictors are zero. I would also like my predicted values to be nonnegative.
It isn't really sensible to have no intercept just because y is zero when x is zero - it biases the estimates of the slopes except in very unusual cases.
So in WLS, the variance is a function of one predictor variable (if I understand correctly). But I have 5 continuous predictors. How should I select the right one? Or have I misunderstood the method?