I would like to analyze data where the dependent variable is a count and the independent variables are categorical. The count is the number of positive events out of the total. Which method should I use when the total number of events is the same in all groups to be compared? And which method when it differs between groups?
The Poisson model is what you are looking for. In case you have too many zero counts, the Zero-Inflated Poisson model is an alternative.
This sounds like a regression model with either a Poisson or binomial family. The choice would depend on the total number of trials or the population at risk. In SAS, the variation in totals is handled by PROC GENMOD with an offset (usually the log of the population at risk).
A log-linear model is appropriate in both circumstances mentioned. The dependent variable should be expressed as the number of events per some standard count (e.g., number of positive events per 100 total events). When the total number of events is the same for each record, this is already the case. A Poisson regression approach is appropriate if the mean number of events is close to the variance of the measure; otherwise (in cases of over-dispersion) a negative binomial regression approach is more appropriate.
Actually, if you have a "total", a logistic regression might be fine.
The choice between logistic and Poisson will depend on whether you want the output as a log odds ratio or a log relative risk. The logistic usually has better convergence properties, but if the proportions are small the answers will be similar.
Thanks Lung-Chang and Eric. However, I have a question. For example, when the groups to be compared are composed of the same number of individuals (e.g., 30 individuals), and one group has three individuals diseased (therefore 3/30) and another group has 15 individuals diseased (therefore 15/30), which method should I use, Poisson or negative binomial? And which method when the groups have different sizes (e.g., group 1: 3/30, group 2: 17/45)?
Lots of analyses will lead to similar results. You can, as others say, use the Binomial model. You can also consider your count data and use what's called an offset by setting it up as a Poisson. For estimation on the boundary, like 3/30, the Poisson or negative binomial will probably work better than the binomial...some suggest not using the binomial if the proportion is 90%.
Thanks Dena, Nathaniel, Ken and Michael. Thanks to all of you.
I will check the mean/variance relationship to choose between Poisson and negative binomial models. I suppose the only way to make two-by-two comparisons of groups is contrasts/t-tests; is that right?
It seems you don't have many points or events, and I don't know exactly what you want to do. Anyway, it could also be interesting to consider a Cox process.
All the options mentioned are viable, but the binomial (successes per trial) seems the most directly relevant to your problem. That being said, each of these distributions has built within it some degree of assumption regarding the independence of events; that is, there are no "clusters" of events either over time (e.g., counts per person) or, I suppose, regarding uncontrolled known or unknown covariates. So you may want to use classical sandwich estimation to tweak the base model.
I think you have your expected response: as a whole, regression for count data. Good luck. If you use more than one method, do some cross-validation.
Good point Jason makes. You can do this in most good stats packages and also include an overdispersion index that would help with some of the clustering issues.
Thanks Jason. From other topics on RG, I know you are really an expert in statistics. Unfortunately, I do not have deep knowledge of statistics, so some of your suggestions go over my head :( . However, I think the most important things you said are straightforward for me.
By the way, just another question: what about the case where there is no total number of events (or individuals), but just a count for each group?
I agree with Michael Cambell. Poisson or Logistic regression is the way to go. It depends on what you need: RR or OR. In some cases OR can be a good approximation for the RR and vice versa. For more details look here: http://www.ncbi.nlm.nih.gov/pubmed/19127890
Thanks Jacek. I don't know the Marascuillo procedure, but it seems interesting. Could you point me to some references, and to software that performs it?
Thanks Lidia for the reference
I was thinking about Poisson regression. It provides you with ratios against a reference category for each categorical independent variable, and it is right for count data.
Regression for count data (Poisson regression) may be the suitable method for your analysis.
See the resource: http://cameron.econ.ucdavis.edu/racd/count.html
See the book Regression Analysis of Count Data by Cameron and Trivedi.
Given the type of variables you indicated, if you wish to test the hypothesis of equality of proportions of occurrence of the positive events across the different (say k) groups (samples) of comparison, you can use the chi-square test for equality of proportions, whether or not the number of events is the same in all groups.
You can use generalized linear models to analyze this type of data. If you want to use the proportion of positive events as the dependent variable, use a binomial distribution. In this case the Poisson wouldn't be appropriate.
Dear all,
using a dataset obtained from an experiment in which DNA methylation had been determined upon infection by a virus, I have done a comparison among logistic, Poisson, and negative binomial regressions. The response variable was the number of methylated cytosines (C_met) out of the total number of cytosines in the analyzed sequence (C_tot; the same for all groups). The groups were two (A and B) and the replicates were three (plants; in the model I specified plants nested within groups). I found that the mean was higher than the variance in all groups.
The results of the comparison are: using only the raw count (C_met), the between-group significance in Poisson regression (Chi2 = 335.59) was higher than in negative binomial regression (Chi2 = 4.36); using the proportion C_met/C_tot, the significance in Poisson regression (Chi2 = 536.71) was lower than in negative binomial regression (Chi2 = 5419.04), while in logistic regression it was intermediate (Chi2 = 766.61).
In the end, I used the logistic regression with C_met/C_tot, because it seems to me the most correct and, in this case, biologically realistic.
Therefore, I guess that the use of proportions, if possible, makes the analysis more accurate even when the total counts are the same for each group, while it is obligatory when the total counts differ between groups. Is that right?
If your dependent variable is a proportion it cannot have a Poisson distribution, for several reasons: the Poisson is unbounded, while a proportion is always bounded by 1; also, the Poisson is a discrete distribution, which a proportion is not.
You can use the Poisson for the raw counts (in the case where the totals are the same) but not for the proportion.
Logistic regression would be correct, provided your data fit all the assumptions. Some of those assumptions are relaxed (or dealt with) using GLMs. Since your dependent variable is the number of positive cases out of a total (no matter whether the totals are the same in all groups), this fits precisely with a GLM with a binomial distribution and a logit link, which would probably give you the same answer as the logistic regression anyway.
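For what it's worth, a minimal R sketch of such a binomial GLM; the group labels and counts below are invented for illustration:

pos   <- c(3, 15, 17)                       # positive events per group (hypothetical)
tot   <- c(30, 30, 45)                      # totals per group (hypothetical)
group <- factor(c("A", "B", "C"))
fit <- glm(cbind(pos, tot - pos) ~ group, family = binomial)  # logit link by default
summary(fit)
exp(coef(fit))                              # odds ratios against the reference group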
Thanks Francisco. The fact that Poisson regression cannot be used with proportions is new to me. Can you briefly summarize the assumptions of logistic, Poisson, negative binomial, and log-linear models, particularly with respect to the kind of response variable?
Hi Giovanni. Sorry, no time for that. I can recommend a book though: Bolker, B.M.: Ecological Models and Data in R.
In general, the distribution of your dependent variable should match the assumptions of your model.
Hi Giovanni,
If the dependent variable is the number of events per individual (a discrete numerical variable) and the possible number of events is bounded, say on a 0-5 scale, you can use some of the simplest methods available: t-tests, t-distribution confidence intervals, and linear regression. For this problem, parametric methods are robust, powerful, and generally superior to non-parametric methods. Poisson regression assumes an unbounded scale and may thus be unsuitable. Negative binomial regression is based on a sequence of event/non-event trials with constant probabilities. That may not be an appropriate assumption in your case (and in many other practical situations).
Please, find more information in Fagerland et al. (2011), http://www.biomedcentral.com/1471-2288/11/44
Cheers,
Morten
Thanks Morten. In my experiment, for example, the dependent variable was the number of C_met (ranging from 0 to 186) out of the number of C_tot (186; the same for all groups). Is this a case for logistic regression or for parametric tests?
If your data are proportions, use the beta regression of Ferrari and Cribari-Neto.
Hi Giovanni,
If your dependent variable is the number of events out of a total, then neither the Poisson nor the negative binomial models are appropriate. They are suitable for modeling counts, not the number of events out of a total. Beta regression is suitable only for proportions that are continuous random variables, so it also isn't appropriate for your purpose. Logistic regression (i.e., a binomial GLM) is suitable, and I would recommend estimating the Pearson scaling statistic to take over- or under-dispersion into account. If there is over-dispersion then you may want to consider moving to a random-intercepts logistic regression.
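One way to estimate that Pearson scaling statistic in R is sketched below; the data are invented, and the quasibinomial refit is just one way to apply the scaling:

pos   <- c(3, 5, 15, 12, 17, 20)            # positive events (hypothetical)
tot   <- c(30, 30, 30, 30, 45, 45)          # totals (hypothetical)
group <- factor(rep(c("A", "B", "C"), each = 2))
fit <- glm(cbind(pos, tot - pos) ~ group, family = binomial)
sum(residuals(fit, type = "pearson")^2) / fit$df.residual  # > 1: over-, < 1: under-dispersion
summary(glm(cbind(pos, tot - pos) ~ group, family = quasibinomial))  # SEs scaled by that statistic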
Logistic regression is suitable only if your response is binary.
A binomial GLM (as I mentioned in my post) is the standard model for "grouped" binary data in the form of r counts out of a total of n for each case, for the simple reason that r can be considered to follow a conditional binomial distribution.
There are many particularities in analyzing count data, especially as these tend to be skewed. There is a superb review on the topic by Neal Alexander: http://www.ncbi.nlm.nih.gov/pubmed/22943299
As I understand from his phrase
"The count is the number of positive events out of the total",
it looks as if he is talking about proportions.
Sounds like your problem is amenable to a contingency table analysis which simply involves counts and leads to a homogeneity chi-square that measures the degree of relationship between the rows and columns of the table. The results are not dependent on any particular distribution.
The general approach I've used when analyzing count data is to first perform a Poisson regression analysis. If the assumptions of the model are not met (i.e., that the mean and SD of the dependent variable are equal) I would then use a negative binomial analysis, which is recommended in the Alexander (2012) cited by Andre Siqueira above.
I would start to explore the data by looking at contingency tables; in other words, explore the counts of positive events out of the total by the categories of the explanatory variables. You can also estimate the relative risk if you are interested in comparing the outcomes for different categories. Proc freq in SAS will do the job.
If you want to model your data, Poisson regression, provided the assumptions are met, will work for you. Look here for more details: http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_genmod_sect006.htm
I think you are getting a lot of confusing answers because the question is unclear as are the actual structure of the data. If you clearly state your hypothesis and the data structure, I am sure you will get more refined answers.
Well, for the first case, "Which method when the total number of events is the same in all groups to be compared?", I don't think you need a method. Statistical methods usually try to capture how things differ and then propose an inference approach, so things need to look different under whichever apparently meaningful statistic you're using, and the proportion of outcomes in each group is certainly the simplest to think of, unless you have a pretty good idea of the sampling distribution to test whether this equality means something, which would give you the H1 distribution (allowing you to test whether things are unequal even when they appear equal). For the other case, when the numbers of events are not equal, supposing you want to know whether this is true (statistically), the negative binomial approach seems to be the right one. Is there a negative multinomial distribution?
If your responses are proportions, you might consider regression of the logit transformed proportions. Here's a link to an application using this approach: http://www.ats.ucla.edu/stat/stata/faq/proportion.htm
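A toy sketch of that logit-transform approach in R, assuming no observed proportion is exactly 0 or 1 (where qlogis is undefined); the numbers are invented:

p <- c(0.10, 0.17, 0.50, 0.44, 0.38, 0.42)  # observed proportions (hypothetical)
group <- factor(rep(c("A", "B", "C"), each = 2))
fit <- lm(qlogis(p) ~ group)                # linear model on the log-odds scale
summary(fit)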
Some of this discussion seems relevant to what is referred to as zero-inflated data. This occurs when using a binomial response where many of the respondents have a zero (e.g., a specific disease did not occur, a specific behavior such as heroin use or suicide ideation was not experienced). From what is being suggested here, it sounds like a Poisson or more probably a negative binomial may be best for dichotomous data when there are excessive zeros. Does anyone have experience with procedures for analyzing zero-inflated data?
If you have access to Stata software, just type help binomial, help poisson, or help zinb. You will find answers to many of your questions.
Depending on your situation the analysis will vary. If the interest is in finding associations between categorical variables and there is no response/explanatory distinction among the variables, the analysis is best done using log-linear models. However, when there is a clear response and the objective is to find the effect of the explanatory variables on this response, then Poisson regression is usually used, unless there is an excess of zeros, where a negative binomial regression is more appropriate. A crude yet simple way of analysing this type of data is to transform the data into a normal form and use multiple regression. The transformation usually used is the log transformation.
To compare counts between levels of a categorical variable, start with Poisson regression (a generalized linear model in which the family is specified to be "Poisson"), taking care to enter the categorical variable as a factor, so it is not treated as an ordinal (or continuous) variable. Select a "sandwich" (aka White-Huber) estimate for the standard error. If you don't choose this standard error approach you could obtain misleading results. Note: if there is only one subject per level then this standard error approach is not possible, and one has no choice but to assume that the counts are truly Poisson distributed (e.g., mean = variance), or otherwise no inference is possible.
It turns out that the results you get this way will be very similar to the results you would get using the old standby, ANOVA, especially if the sample is "large" (say, 30 subjects or more per level) and if the counts are large (e.g., not a high frequency of zeroes). The following simulation in R illustrates this.
library(lmtest)
library(sandwich)
N <- 30  # subjects per level; the rest of this simulation was truncated in the original post and is reconstructed here
g <- factor(rep(1:3, each = N)); y <- rpois(3 * N, lambda = 20)
f <- glm(y ~ g, family = poisson)
coeftest(f, vcov = vcovHC(f, type = "HC0"))  # sandwich (White-Huber) SEs; compare with summary(lm(y ~ g))
Giovanni,
Since I don't know the software you are using I won't go too far with this, but assuming you meet the general distributional requirements for the Poisson model, there are two variations of it: one for constant exposure (same number of trials, questions, whatever for everyone) and one for variable exposure. The software you use for this sort of model should have those options one way or another. The other comments, such as zero-inflated models, should be accommodated as well. Of course, if your count data are distributed in some atypical way, let's say normal, then you would address it differently, but those of us suggesting Poisson models have some experience and are guessing that a Poisson, possibly zero-inflated, will be the best fit.
Bob
Hi Bob-- My practice lately has been to assume that count and count-like data are negative binomial rather than Poisson. I see it as an easy win-- a lot more flexibility at low cost in DF or interpretation. I'm open to testing whether Poisson might fit adequately, but it's hard for me to see why I should bother. Do you know of good arguments to assume Poisson by default? Or even to assess Poisson ever? --Ken
Ken, I don't have an argument for that. Given the toolkit of generalized mixed models, it seemed easy enough to go with the Poisson model for the typical-looking count distribution, and I have been pretty unquestioning about that for the many years since those classic models were first documented (late 80s, was it?), and, frankly, I don't personally execute them very often, although I do advise. I think your viewpoint is intriguing, and if you have adequate model fit that seems sensible enough. I certainly will consider that myself the next time I am looking at typical count or count-like data. Thanks, Bob
I generally assume that most count models will be overdispersed, but not all are. The reason to run a preliminary standard Poisson model on the data is to determine the prima facie dispersion status of the model. If the Pearson dispersion statistic (Pearson Chi2 / residual dof) is greater than 1, the model is likely Poisson-overdispersed. If under 1, it may be underdispersed. In the latter case, an NB model will not converge; you need to use a generalized Poisson, a hurdle model (e.g., Poisson-logit, or even NB-logit), a generalized NB model such as Waring NB regression, or Conway-Maxwell-Poisson. If there are far more zeros in the response term, given its mean value, then it's appropriate to try a zero-inflated mixture model. If no zeros are allowed, as in length of hospital stay (LOS), then use a zero-truncated model. There are a number of options to take, each depending on the structure of the data. But I always start out with a Poisson to get a first impression of which way I will go in looking for the best-fitted model. There is software in R, Stata, and Limdep for the above models, and others.
Joseph, I start by looking at the dispersion parameter estimated by R function glm() using family=quasipoisson. The estimate is output with summary(glm(...)). Assuming you are an R user, if I have a fitted Poisson glm() model object named "f" in R, how do I calculate the Pearson dispersion statistic?
The quasipoisson "family" is not a family at all, but is rather a Poisson model where the standard errors have been scaled by the Pearson dispersion statistic. What that does is provide SEs adjusted for the overdispersion in the data; it is an adjustment to the SEs only. The Pearson dispersion can be obtained in R by using the following lines after using glm or glm.nb. It's meaningful for use with Poisson, negative binomial (NB), and grouped binomial models. Suppose that you have used R's glm function and have called the resultant model "mymodel". For example:
mymodel <- glm(y ~ x, family = poisson)  # your fitted model; the formula here is a placeholder
sum(residuals(mymodel, type = "pearson")^2) / mymodel$df.residual  # Pearson dispersion statistic (reconstructed; the original lines were truncated)
I used to fit the Poisson and examine the Pearson chisq/df as you suggest, but I so rarely encountered underdispersed data that I switched. I usually notice at a preliminary (pre-modeling) stage if there is an overabundance of 0s and switch then to a mixture or more specifically a ZI model. I do tend to use your book as a guide in most matters, though!
I'm in the process of writing a 200-page paperback on modeling count data in which I am giving as clear guidelines as possible on model selection, fit, and evaluation, plus a number of new models. It will be geared to researchers without a background in modeling counts, unlike my Negative Binomial Regression book, which has a lot of theory as well and is 570 pages. This will be purely applied. Cambridge University Press is publishing it. Other obligations and travel have caused more delays than I would have preferred, but I'm working on it. It appears that you are working on the correct approach now.
Thanks, gentlemen. Joseph, your forthcoming book sounds extremely useful. You certainly have nothing to apologize for in terms of the delays, but I am sure it will be well used once it is available. Best, Bob
I'll be looking for the book on count data. I think it will fulfill a real need!
Quick note: in a Poisson distribution the variance, not the SD, is equal to the mean.
I can highly recommend Joseph Hilbe's book on Negative Binomial Regression. Single-parameter distributions (such as the Poisson), unlike two-parameter distributions (such as the normal), can be dangerous ways to approach regression. In a clinical trial of epilepsy, for example, if you use Poisson regression you effectively assume that the total number of counts from any two patients with the same covariate pattern is all that matters. So 10 seizures is the same information whether composed of 5 + 5 or 10 + 0. Furthermore, if you consider the case of a model with time as an offset, you are effectively assuming that studying the same patient for twice as long gives you the same information as studying another patient for the same length of time. This is a strong assumption. The negative binomial for me is a far better default.
You may be interested in learning that Andrew Robinson and I have recently posted the "msme" package to CRAN. One of the functions is called nbinomial. After you install and load msme, type ?nbinomial for help. I have made the default the direct parameterization of the dispersion parameter to the mean, which is how Stata, SAS, SPSS, Limdep, Genstat, etc. parameterize the dispersion parameter, and which makes much more sense: the greater the variability in the data, the greater the value of the dispersion parameter, alpha. An alpha of 0 is Poisson, indicating no overdispersion. I have provided an option so that you can also model the negative binomial using the inverse relationship as given in glm.nb. In the default output I provide a summary of the Pearson residuals, the null and residual Pearson Chi2 statistics (which means that R users don't have to calculate them each time), and the dispersion statistic (in addition to the dispersion parameter). The dispersion statistic is Pearson Chi2/(residual DOF), which I have previously demonstrated using simulations to be the best way to assess overdispersion. Moreover, I have an option that allows you to parameterize the dispersion parameter itself, so that you get coefficients for the predictors you want to use for modeling the dispersion. This allows you to determine which predictors have a significant influence on overdispersion, a great tool for learning about your data. This is called heterogeneous negative binomial regression.
ALSO, my new book with Robinson, titled Methods of Statistical Model Estimation (Chapman & Hall/CRC), is due to be published in two weeks. It is for R programmers. Full code is provided showing how to construct estimation methods including least squares, IRLS, ML, EM, and Bayesian, with an example using the Metropolis-Hastings algorithm.
ALSO, another book I recently finished with Alain Zuur is "A Beginner's Guide to GLM and GLMM with R", which shows how to use R for analyzing GLMs and GLMMs using both standard and Bayesian methods. For Bayesian modeling we use JAGS. Complete code is provided showing Bernoulli, binomial, Poisson, negative binomial, gamma, and beta-binomial models using both frequency-based and Bayesian modeling techniques, and also GLMMs. It will be published about June 6. We decided to self-publish the book under Zuur's consulting business in Scotland, called Highland Statistics; I will be managing US and Canadian sales. It's a book I wish I had when first learning R, to be sure.
I am now working on Modeling Count Data (Cambridge Univ Press), planning to finish by June 15th. It should be out in September. It will be a 200-page paperback using R and Stata for all examples. SAS code, where available, will be at the end of each chapter, or on the book's web site.
Someone above asked about using the NB model as the first count model on data, rather than Poisson. I think there are good reasons to just use NB, except that your data may in fact be underdispersed, which cannot be modeled using NB, and it assumes that the NB is the only alternative. In Modeling Count Data I will be showing why the Poisson inverse Gaussian (PIG), or zero-inflated PIG, might be a viable alternative for many count data situations. James Hardin and I very recently wrote Stata commands for PIG, ZIPIG, four types of generalized NBs, and several additional count models that may well be better models than simply the traditional NB for overdispersed data. R's gamlss has PIG and ZIPIG, and several of the generalized NB models and the generalized Poisson have R functions available. The book will show how to determine the best model for a specific type of data; it's not just a choice between Poisson and NB regression. So I check Poisson first and the nature of the data. If the data are overdispersed, I try to find out why.
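To make the PIG option concrete, here is a hedged sketch with the gamlss package mentioned above, on simulated overdispersed counts (everything below is illustrative, not taken from the book):

library(gamlss)  # the PIG family comes from gamlss.dist
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- rnbinom(200, mu = exp(1 + d$x), size = 1)   # overdispersed counts
fit <- gamlss(y ~ x, family = PIG, data = d)       # Poisson inverse Gaussian
summary(fit)
AIC(fit)                                           # compare against Poisson or NB fits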
Joseph Hilbe ([email protected])
My paper on counting experiments (also in the presence of background):
"On frequency and efficiency measurements in counting experiments", Nuclear Instruments and Methods in Physics Research Section A, Vol. 614, pages 105-118 (2010).
Specific for binomial counting in the presence of background:
"On the measurement of binomial data with background", Nuclear Instruments and Methods in Physics Research Section A, Vol. 669, pages 85-96 (2012).
Perhaps I merely haven't had enough coffee this morning, but keying in on "out of the total" and some of the other bits and pieces gives me pause with regard to dispersion issues: it may not be zero-inflation or a simple non-identity relationship between mean and variance. It may be shape.
Someone suggested analyzing proportions, which is related to my suggestion, but not quite the same. It seems to me the language implies events are constrained to something called "the total". It would seem to me that a binomial (events/total, i.e., successes/trials) would be the most appropriate approach. Poisson or negative binomial would not appropriately model the changing shape of the distribution you'd expect when constrained between 0 and 1, which shifts from positively skewed at the low end, to symmetrical at 0.5, to negatively skewed at the high end. This approach also appropriately scales to the size of the denominator ("total"), which means it covers both scenarios (groups with the same total vs. groups with different totals). E.g., "had at least one drink on 80% of days not incarcerated"; one will never drink on 110% of days.
That being said, if event rates are low, Poisson or negative binomial may very well approximate the distribution. In this case, handling the varying total would be accomplished by using the natural log of "total" as an offset in the model, which, with the log link function of the Poisson and NegBin, also models the count as a proportion of the total [log(a) - log(b) = log(a/b)] when the link function is inverted. The distinction is that these models are perfectly happy with rates greater than 1, e.g., "2.5 drinks per day on days not incarcerated", which may be at odds with this scenario.
So, for me, the primary distinction between these two approaches (binomial vs. Poisson/NegBin) depends primarily on the interpretation of the total.
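A small R sketch contrasting the two formulations on the same made-up grouped data (the numbers are hypothetical; the offset trick is the one described above):

events <- c(3, 15, 17); total <- c(30, 30, 45)    # hypothetical counts and totals
group  <- factor(c("A", "B", "C"))
fb <- glm(cbind(events, total - events) ~ group, family = binomial)  # probability per trial
fp <- glm(events ~ group + offset(log(total)), family = poisson)     # rate per trial
exp(coef(fp))                                     # rate ratios against the reference group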
A quick reply. Pure proportion data, i.e., data in terms of probabilities between 0 and 1, are best modeled using beta regression. With respect to grouped binomial and Poisson/NB models, the denominators for binomial models are covariate patterns. The mean response being modeled is the number of successes for a given pattern of covariates. For example, a single observation may appear in the form death/cases ~ gender + age (with female: gender = 1 and male: gender = 0) as 3/10 ~ 1 + 35. This specifies that 3 died out of the 10 cases of 35-year-old females. The mean response for count models is the count of y given a specified area, time, or space, which is entered into the model as an offset.
I agree with Dr. Senn's suggestion to prefer negative binomial (i.e., Poisson-gamma) models or their analog, the Poisson-lognormal, rather than pure Poisson models.
There are plenty of examples in the literature supporting this approach. In addition, Poisson-lognormal models are especially flexible for including explanatory variables (discrete, as in your case, or continuous) and random effects.
Joseph Hilbe,
Could you post your draft of your 200-page manuscript?
I would recommend the negative binomial (i.e., Poisson-gamma) models for analyzing the data.
I think one method that you can use, in addition to what others have already mentioned, is the finite mixture model, also known as a latent class model. An advantage of this model over other models is that it allows you to divide the population into latent classes or groups, and therefore to capture unobserved heterogeneity. For example, there might be different types of utilization of a particular healthcare service; the subjects do not necessarily have to come from the same population. I would suggest you try this as well as others, for example the Poisson-lognormal, and select the model based on criteria like AIC, BIC, etc.
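A sketch of such a finite (latent class) mixture for counts, using the flexmix package as one possible implementation; the data and the number of classes are invented:

library(flexmix)
set.seed(2)
d <- data.frame(y = c(rpois(150, 2), rpois(150, 10)))  # two latent classes with different means
fit <- flexmix(y ~ 1, data = d, k = 2, model = FLXMRglm(family = "poisson"))
summary(fit)    # class sizes and fit
BIC(fit)        # compare k = 1, 2, 3, ... by BIC/AIC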
Luisiana:
The manuscript you are referring to is called "Modeling Count Data" (Cambridge University Press). It has come out to some 250+ pages. The proofs are being prepared by the production staff now, and it's due out the month after next. It's coming out in paperback, selling for about $30. I requested that it be published at as inexpensive a cost to the reader as possible. R, Stata, and SAS code is used for the examples. I introduce several new count models, in addition to providing a full explanation of the standard models. It's a guidelines book for practicing analysts.
I cannot just post the manuscript without violating the copyright. A lot of time has also gone into its preparation. Finite mixture models are also discussed, as are Bayesian count models.
Dr Hilbe -- I hope you will alert us when the book is finally published. It sounds like it will be a significant contribution. As this thread suggests, how best to deal with count data is a matter of discussion. It's ironic: count data are actually the "simplest" form of data in existence. It's when we think about distributions that things start to get complicated.
Please, could anyone provide materials on how to conduct Poisson regression and negative binomial regression in SPSS? Thanks
Jos,
Perfectly fine if you are only interested in the association between two variables. When looking at multiple-variable relationships, for example the association between cancer incidence and a treatment variable where you also want to adjust for the effects of age and gender, more fire-power is needed. Hence the generalized linear model approaches.
To Rolu's question: Poisson and negative binomial models are accessed in SPSS via the Generalized Linear Models option under Regression. If I recall correctly, these routines will do the standard PRM and NBRM models, but not zero-inflated ones. Stata does them, however.
As was said before, the assumption of a Poisson distribution may not be correct. Instead I would use a negative-binomial distributional regression.
There are several methods and these vary in the level of sophistication. The simplest method is usually to log transform the data and use methods available for normal responses if the relevant assumptions are satisfied. A more complex approach is to assume that the counts have a Poisson distribution and do Poisson regression or fit loglinear models. If there is an excess of zero counts the Poisson model may not fit. In this case a negative binomial response model is more appropriate.
Roshini:
What you say above is not really what you want to do. If you have count data to model, NEVER log-transform it and use a normal regression to estimate parameters. There is a host of distributional violations committed when using this method. It is pretty rare to be able to use Poisson regression on real count data, although it happens; the assumption of the equality of the mean and variance of the counts is not often met. If the Pearson dispersion statistic of a Poisson model is about 1, then you can likely use a Poisson model. If it's under 1, then use a generalized Poisson, or perhaps a hurdle model. If over 1, then the first model to look at is the negative binomial, especially if you don't know the cause of the overdispersion. In the case of excessive zero counts, you can use a zero-inflated Poisson (ZIP), but usually a zero-inflated negative binomial model (ZINB) is preferred. I have found that a Poisson inverse Gaussian (PIG) model is many times preferable to an NB for overdispersed data. R and Stata have it. Of course the generalized Poisson and a variety of 3-parameter NB models can be used as well, together with their zero-truncated and zero-inflated versions. Other count models are available too, including a variety of 3-parameter count models. Stata and R have all of these models. Lastly, remember that it is recommended to use a robust or sandwich variance adjustment as a default for standard errors when modeling count data.
William, you asked when my Modeling Count Data (Cambridge University Press) will be published. I was told last week that it is going to press today, but it takes a few weeks to print and bind it. Since it's coming out in hardback, paperback, and ebook formats at the same time, this caused delays. It is supposed to be ready for shipping to Amazon between July 8-15, and to the general public by the 15th. The paperback version is some 299 pages and now costs $34 with Amazon. From previous experience, though, when it's released the Amazon price will very likely go down, and then start to move up after a while. It lists for $37. Stata and R code are used for the examples in the text, with SAS in the Appendix.
Dear Joseph,
Of course, as I mentioned, these different methods have their own level of sophistication. For a non-statistician, transforming the counts and using normal methods is recommended (Statistical Methods in Agriculture and Experimental Biology by Mead, Curnow and Hasted, 2002). For the more advanced statistician, one of the most respected statisticians in the area of categorical data, namely Prof. Alan Agresti, recommends exactly the methods I recommended in his book Categorical Data Analysis, 2nd Edition, pages 125-132.
Thanks. I'm surprised Alan would say this. I looked up the pages you gave in my copy of Agresti's book. You must have another edition; there is nothing about it in my copy. I looked at other places in the book where he discusses count models and didn't find it either, but this is definitely not to say that he did not say something about it in your edition. Count data are positive only, discrete, and right-skewed. Analysts used to log the response and run it through an OLS regression when they did not have the software for Poisson and negative binomial regression. I agree. But the underlying PDF of a normal model is the Gaussian or normal distribution, with a variance of 1 and an identity link; i.e., the mean and linear predictor are the same. The variance of count data varies with the value of the mean, but with a log(y) OLS model the variance is constrained to 1, and it cannot vary. I will admit that if an analyst uses a GLM program and models count data with a Gaussian family and log link, the coefficients and SEs will generally be close to those of a Poisson model, but that is not log-transforming the response and modeling with an identity-linked normal model. Again, if one does not have access to GLM, or to count software, then sure, do what you suggest. But it is not as well-fitted a model as if you use a Poisson, negative binomial, or some other count model. With GLM software available in most commercial packages, an analyst should be encouraged to model count data correctly.
In two of my earlier books, Negative Binomial Regression (2007, 2011) and Generalized Linear Models & Extensions (2001, 2007, 2012, with James Hardin), I state that if no GLM or count-model software is available, log-transforming y in an OLS regression is preferable when modeling count data to not transforming it and still modeling with OLS. But the analyst should know why that is not the optimal method, and that care must be taken when interpreting coefficients and predicted values.
In addition, most of the time when one is modeling count data the coefficients are exponentiated, providing incidence rate ratios. There is little statistical justification for doing this with log-transformed-response OLS models. The log-y models are assumed to be continuous, whereas Poisson and other count models are assumed to be discrete counts. Predicted values of a log-y model are not predicted counts.
Dear Joseph,
After reading your previous answer, I realized that you have misunderstood my explanation. As I explained, log transforming is only recommended for the non-statistician, as it is a simple method, and it has been recommended by Professors Roger Mead and Robert Curnow et al. in their book Statistical Methods in Agriculture and Experimental Biology.
Professor Alan Agresti recommends methods for the more advanced statistician, and in his book, which I attached in my previous comment, he recommends what I recommended: to assume that the counts have a Poisson distribution and do Poisson regression or fit log-linear models. If there is an excess of zero counts the Poisson model may not fit; in this case a negative binomial response model is more appropriate.
OK. My only recommendation is that if you find that you have excessive zero counts in your data, which depends on the mean of the count variable, start by modeling a zero-inflated Poisson, check the AIC and BIC statistics, then try a zero-inflated negative binomial. Check the boundary likelihood ratio test, which is a test of ZINB vs ZIP, as well as a Vuong test of the ZINB vs the NB model. I'm not sure that these tests were available when Agresti's book was published (the one you sent me). Remember, statistics, like science, has new developments over time. Thank you for the ebook; it's a newer edition than mine. But it's still old. It has a copyright of 2002, which means that it was mostly written in 2000 and part of 2001. That's 13-14 years ago. When I wrote the first edition of Negative Binomial Regression (2007, Cambridge Univ Press), most of it was written in 2006. It was out of date by 2010, at which time I felt it necessary to write a second edition, which came out in 2011; it's 570 pages just on count models. There is new material in Modeling Count Data, which is just three years later. I am now writing the second edition of Logistic Regression Models (2009, Chapman & Hall/CRC) because there have been many new advances in the area. It's due to be out next July. The first edition is 656 pages. The next edition will have some 200 more pages, if the book is formatted the same.
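For readers who want to try that workflow in R, here is a hedged sketch using the pscl package (one implementation among several; the data are simulated):

library(pscl)                                   # zeroinfl() and vuong()
set.seed(3)
d <- data.frame(x = runif(300))
d$y <- ifelse(runif(300) < 0.3, 0, rnbinom(300, mu = exp(1 + d$x), size = 1))  # excess zeros
zip  <- zeroinfl(y ~ x, data = d, dist = "poisson")
zinb <- zeroinfl(y ~ x, data = d, dist = "negbin")
AIC(zip, zinb)                                  # information criteria, as suggested above
vuong(zinb, MASS::glm.nb(y ~ x, data = d))      # Vuong test of ZINB vs plain NB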
You may want to check newer resources on some of these points. OK, I need to get back to work. I understand your point. Thanks for the feedback.
I just read the question. If all independent factors are categorical, the log-linear model will work well for count data. In SAS, you can use the CATMOD or GENMOD procedures. There are a lot of examples which you can follow. In some cases, you can use logistic regression. However, if two or more factors are responses, such as whether GPA > 3 or whether a symptom is present, you should use log-linear rather than logistic regression. If you have many zero counts, you may reduce the number of levels of the categorical variable. You should also distinguish structural zeros from random zeros. You can design your own model and include the parameters of interest with the CATMOD procedure, ...
Yuanzhang: It makes absolutely no difference to a count model or to a binomial model such as logistic regression whether the predictors (independent variables, a term that is generally not used by statisticians anymore, but that's another story) are continuous, categorical, or binary. I am confused, though, by your use of the term "response". Most statisticians use the term "response" to indicate the variable being modeled. The response term for a count model is a count variable. The response for a binary logistic regression is a 0/1 binary variable. For a grouped logistic regression the response has two components: the numerator (number of successes) and a denominator (number of observations having the same pattern of covariates). However, I think you may be referring to a response variable with more than two levels, i.e., a categorical response. If the response is ordered, then most analysts use an ordered logit (also called a proportional odds model) or ordered probit model. If it is unordered, it is referred to in general as a multinomial model. If each level requires the previous existence of lower levels, then use a continuation ratio model; e.g., if the categorical response is year of high school, you must have been a freshman and sophomore to be a junior. If there are 6 or 7 levels then one might consider a count model, but it depends on several things, including what you are modeling the data to determine. Do you want predicted counts, or predicted probabilities? If you model the data as counts, are they over- or under-dispersed? If so, then you must adjust for it using either a robust variance adjuster or an alternative count model that is appropriate for the data.
Joe,
Thanks for your comments. I might not have stated it clearly.
Log-linear models treat categorical response variables symmetrically, focusing on associations and interactions in their joint distribution. Logistic models describe how a single categorical response depends on explanatory variables. Hence, for a log-linear model, forming logits on one response helps to interpret the model. In such a case, logit models with categorical explanatory variables have an equivalent log-linear model.
Suppose we have three factors (X, Y, Z), the final log-linear model is (XY, XZ), and Y is the response; then the corresponding logistic model is (X + Z). However, if the final log-linear model is (XY, XZ, YZ) or (XYZ), there is no corresponding logistic model.
As for the case of two response variables, suppose we have data (X, Y, Z, W):
X: teaching method; Y: whether you pass the exam; Z: gender; W: your evaluation of the class.
Both Y and W are responses, which cannot be decided at the beginning of the class.
Both X and Z are explanatory variables.
In such a case, we should use a log-linear model rather than logistic.
Regards!
Yuanzhang Li
I agree perfectly with the use of the zero-inflated models: ZIP, ZINB, zero-inflated hurdle, etc. However, care needs to be taken when interpreting the model parameters. The parameters have a latent-class interpretation, and hence inference is not targeted at the entire population as it is with the base model, the Poisson model. Also, the incidence rate ratio (or rate ratio) varies across the levels of the other variables in the model, as opposed to the constant rate ratio obtained when using the Poisson model.
Can anyone suggest something comparable to the t-test for paired observations with a count dependent variable?
There are different models that exist to analyse count data, starting with the Poisson distribution. However, if there are many zero counts, one may use a zero-inflated Poisson regression model. A negative binomial (NB) model could be used, as could a ZINB model. It depends on the nature of the data.
If your count data have a maximum limit then a censored regression model might be appropriate: a censored regression model estimates the relationship between variables when there is either left- or right-censoring in the response variable (i.e., censoring from below or from above, respectively). "Censoring from above" takes place when cases with a value at or above some (a priori known / prespecified) threshold all take on the value of that threshold, so that the true value might be equal to or higher than the threshold.
There are also several R packages available for these kind of models.
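As one illustration, a hedged sketch with the tobit() function from the AER package (one of several options; the data and the censoring limit of 10 are invented):

library(AER)                               # tobit() wraps survival::survreg
set.seed(4)
d <- data.frame(x = rnorm(100))
d$y <- pmin(8 + d$x + rnorm(100), 10)      # responses right-censored at the known maximum of 10
fit <- tobit(y ~ x, right = 10, data = d)  # censoring from above at 10
summary(fit)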
If the mean and variance of the counts are roughly equal, use Poisson regression analysis; otherwise, use negative binomial regression analysis. You can also read a number of published articles to help you out.