Is it correct to use logistic regression when the chi-square test is not significant (p > 0.05)?
Logistic regression is a powerful statistical method for modeling a binomial outcome (one that takes the value 0 or 1, such as having or not having a disease) with one or more explanatory variables.
ADVANTAGES
I can see two main advantages of logistic regression over the chi-squared or Fisher's exact test. The first is that you can include more than one explanatory (independent) variable, and those can be dichotomous, ordinal, or continuous. The second is that logistic regression provides a quantified value for the strength of the association while adjusting for other variables (removing confounding effects). The exponentials of the coefficients correspond to odds ratios for the given factors.
DISADVANTAGES
1) You need enough participants for each possible combination of explanatory variables. Using interactions or adding factors that are rare therefore reduces the power of the analysis considerably. This has to be considered carefully at the planning phase to make sure the sample size is large enough.
2) If you are using an explanatory variable that is not binomial, you need to test the assumption of linearity before including it in the model. This is possible by first creating dummy variables for each value of an ordinal variable, or by cutting a continuous variable into categories and then using those as dummy variables. A likelihood ratio test can then be used to check whether the model assuming linearity fits as well as the one not assuming it. Assuming linearity, when justified, has the major advantage of increasing the power of your analysis, though it can require some transformation.
3) Logistic regression combines both the binomial and the normal distribution. This can sometimes cause problems. A quadrature check can be used to verify that these problems did not occur: relative differences must be below 0.01 (1%) for all parameters.
4) Defining the variables to enter into the model, and adding or removing explanatory variables, can be complicated and must be carefully planned. Avoid substantial collinearity between variables, as this will cause over-adjustment. Identify potential candidates using univariate analysis with a p-value threshold above the one you wish to use at the end, as negative confounding can occur. When necessary, consider introducing interaction terms if you believe some factors might increase the effects of others on your outcome.
Have fun with your analysis!
Hi Brijesh! In the context of the generalized linear model, logistic regression analysis is often used to investigate the relationship between a binary response variable and a set of explanatory, or independent, variables. A binary response consists, for example, of success and failure. In disease studies, your outcome is denoted as Y = 1 if the disease is present and Y = 0 otherwise.
So, this kind of variable does not follow a normal distribution, and you have to analyze it using a link function called the LOGIT.
Best regards.
It is good to use logistic regression when we want adjusted odds ratios and there is more than one known risk factor.
As Juan and Thiago noted, the dependent variable in logistic regression can be dichotomous (zero or one, presence or absence, death or survival). Logistic regression also applies to clusters of dichotomous responses, for example when the dependent variable is the number of successes in a fixed number of trials. In a way this is a repeated-measures situation.
There is no relationship between the results of a chi-squared test and the decision to use logistic regression. When you have only one dichotomous independent variable, the answers from the chi-squared test and from logistic regression to the question "is there a significant relationship between these two variables?" will be the same in the majority of cases.
However, if you have two or more independent variables, the chi-squared test will not look at the relationships between the independent variables, and you should use logistic regression to avoid confounding effects. As an example, if you look for effects of gender and level of education on monthly income, you could find a relationship between gender and income and between education and income. But if the majority of females come from a low education level, the gender effect observed in the bivariate analysis is probably a confounding effect. If you analyse the same data with logistic regression, you will observe only the education effect in this situation.
In binary logistic regression the fundamental condition is that the outcome variable is dichotomous and that the predictors have a linear relationship with the log-odds of the outcome.
The predictor variables can be categorical or continuous. Sample size is important in the sense that you need at least 10 sample elements per variable.
http://www.scielosp.org/pdf/resp/v76n2/a02v76n2.pdf
Some clarifications -- or so I intend.
The rule of thumb is at least 10 events per variable. So if you have an event with a 1% chance, and 10 variables, you want a sample size of around 10,000.
Logistic regression lets you look at multiple predictors and at continuous predictors. You can also include interaction terms. So if you have reason to suspect your chi-squared result was affected by confounders, then you can account for the confounders' influence using logistic regression.
You can also account for a suspected categorical confounder using the Mantel-Haenszel statistic.
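The Mantel-Haenszel approach can be sketched with statsmodels' StratifiedTable: you give it one 2x2 table per stratum of the suspected confounder and it pools the stratum-specific odds ratios. The counts below are hypothetical, chosen so both strata share an odds ratio of 6:

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# One 2x2 table (exposure x outcome) per stratum of the categorical confounder.
stratum_a = np.array([[30, 10],
                      [20, 40]])   # (30*40)/(10*20) = 6.0
stratum_b = np.array([[15, 5],
                      [10, 20]])   # (15*20)/(5*10) = 6.0

st = StratifiedTable([stratum_a, stratum_b])
print(st.oddsratio_pooled)   # Mantel-Haenszel pooled odds ratio: 6.0
```

This gives a confounder-adjusted odds ratio without fitting a regression model, which is handy when you have exactly one categorical confounder.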
Yes, it is correct to use logistic regression when the chi-square test is not significant. When you go from a univariate to a multivariate model, set the entry p-value at 0.2 or so.
In the answers above, sample size is sometimes mentioned. For small samples you may use the Firth-corrected method implemented in R and SAS. The correction also accounts for separation, a very common occurrence in logistic regression with small samples.
Brijesh,
1. I suspect that you have a particular application in mind. If so, I think respondents would be more helpful to you if you described your application.
2. Note that there is more than one chi-square test that could be used with logistic regression. One of the modern versions is the Hosmer-Lemeshow, named for the authors of a respected text/self-study book. Different measures of quality measure different things and it is useful to understand what is being tested. I haven't done a stringent evaluation of the following link but it does quickly illustrate my point about the existence of multiple chi-square tests: http://www3.nd.edu/~rwilliam/stats2/l83.pdf
3. If you're looking for motivation and knowledge to run the methodology then I recommend the third edition of Applied Logistic Regression by Hosmer, Lemeshow, and Sturdivant, which was published on April 1, 2013. The new author in this series (Rodney Sturdivant) is a professor of statistics at West Point. Rod told me this book was coming out more than a year ago, but that's life in the world of publishing....
Chuck
Logistic regression is an extension of regression that allows us to predict categorical outcomes based on predictor variables.
In a nutshell, logistic regression is multiple regression but with an outcome variable that is a categorical variable and predictor variables that are continuous or categorical. In its simplest form, this means that we can predict which of two categories a person is likely to belong to given certain other information. A trivial example is to look at which variables predict whether a person is male or female.
In logistic regression, like ordinary regression, we assume linearity, no multicollinearity and independence of errors. The linearity assumption is that each predictor has a linear relationship with the log of the outcome variable.
Paul,
I have just one comment about your first "disadvantage". Although I agree with you that the inclusion of many variables will reduce statistical power, if you are not able to include them, as when you perform a chi2 test, you will probably obtain a biased result...
So, I believe that an unbiased result with reduced statistical power is much better than a biased result with great statistical power. Indeed, we must be very selective when choosing variables to the model, but a multivariate analysis is always better than a series of chi2 tests.
Generally, frankly speaking, logistic regression is applicable when the dependent variable is binary while the independent variables are either binary or continuous. In such a case (when the dependent variable is dichotomous/binary), logistic regression is the statistical tool to investigate the effect of an independent variable (or a set of independent variables) on the dependent variable.
The chi-square test is used to test the hypothesis that two attributes (categorical variables rather than quantitative variables) are independent, i.e. that there is no significant association between the two attributes. One thing which is important to mention is that in this case we are not bound to say that one is the dependent variable and the other is the independent variable (no such concept is used or required for its implementation), whereas in logistic regression one must clearly identify these two types of variables.
Similarly, the chi-square test is applicable when the motive is to establish association between two qualitative variables. In applying the chi-square test both variables must be qualitative (each variable/attribute must have at least two categories), which is not the case for logistic regression. In addition, logistic regression studies the dependence of a binary response on binary (or continuous) independent variable(s).
In my opinion, one must be very clear about whether:
1. he/she is interested in investigating the dependence of one variable upon other variable(s), OR
2. he/she is investigating the association (relationship) between two attributes/variables.
In case (1) you are required to use logistic regression (transforming the values according to the needs of logistic regression), whereas in case (2) it is suggested to use the chi-square test.
The left-hand-side variable in the regression is valued at either zero or one.
Logistic regression analysis examines the influence of various factors on a dichotomous outcome by estimating the probability of the event's occurrence. It does this by examining the relationship between one or more independent variables and the log odds of the dichotomous outcome, modeling changes in the log odds of the dependent variable rather than the dependent variable itself. The odds ratio is the ratio of two odds, and it is a summary measure of the relationship between two variables. The use of log odds in logistic regression provides a simpler description of the probabilistic relationship between the variables and the outcome in comparison to a linear regression, from which linear relationships and richer information can be drawn.
There are two models of logistic regression: binomial/binary logistic regression and multinomial logistic regression. Binary logistic regression is typically used when the dependent variable is dichotomous and the independent variables are either continuous or categorical; logistic regression is best used in this condition. When the dependent variable is not dichotomous and comprises more than two categories, a multinomial logistic regression can be employed. Also referred to as logit regression, multinomial logistic regression gives results very similar to binary logistic regression.
The chi-square test gives you the association between two variables, whereas in logistic regression you can have multiple variables. There is a chance that a particular variable shows no statistical association with your dependent variable in a chi-square test but shows a significant association in the presence of other variables. You first need to conceptualize which variables are associated, then check for association using chi-square, and later use a regression model.
Lots of good summaries of logistic regression (LR) have been given here, so I will add a simple point regarding sample size and alternate tests, specifically discriminant analysis (DA) which also can be used to determine how well categorical groups can be assigned from one or more continuous explanatory variables.
LR is more sensitive to small sample sizes than DA. However, DA assumes that your data are normally distributed while LR does not. So if you have a normal distribution, or can transform your data to get one, DA is a better choice when sample sizes are small.
For both tests, you can retain multiple, collinear variables if you think that including them all is important by first reducing them via ordination. A common approach is to use PCoA or PCA to reduce your explanatory variables to axis scores and then run LR or DA. If you take this approach you should make sure that your first axis describes a reasonably large amount of the variation in your explanatory data, but the consequence of using axis scores from an axis that captures little of the variation would simply be a non-significant LR or DA result.
Dear Brijesh,
I think you need to revise your concepts of basic statistics to see that the purposes of the chi-square test and logistic regression are entirely different (a common problem with medical researchers who are not that confident in statistical methods). The former is a test of significance used, among many other things, to test the significance of an association between two qualitative variables, whereas the latter is a regression procedure used to find the effect of one or more (usually more) independent variables on a dependent variable. Regarding conditions and types of variables, I advise you to refer to any STANDARD literature on logistic regression. By the way, I was very much impressed looking at your number of publications (>200), but more surprised to see that there is no impact factor. I am sure there must be some technical fault due to which RESEARCHGATE is not showing the impact factor of your publications. I request you to get things rectified so that the RG family becomes acquainted with the impact factor of your valuable publications.
Dear Murali Dhar,
I believe you have made a mistake here. The chi-square test and logistic regression are not entirely different. Both techniques test the association between two variables, although logistic regression carries fewer restrictions. For instance, as you stated, logistic regression allows testing associations with two or more explanatory variables simultaneously, and while the variables in a chi-square test must be categorical, this is not so in logistic regression. But the chi-square test is designed, among other uses, to test association between variables, just like logistic regression. And if you calculate an OR from a 2x2 table and from a simple logistic regression, the result will be the same.
Certainly, as discussed above, a non-significant result on a chi-square test does not prevent the use of the variable in a multiple logistic regression model. But they are not such unrelated techniques as you stated.
Not really a general con, but rather a technical weakness of the USUAL logistic regression method: it arises when there is perfect (binary) separation in the outcomes ($Y$).
For more details please follow this:
https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/discussion/forum/i4x-HumanitiesScience-Stats216-course-Winter2014-course-material-feedback/threads/52f99ed5ec0f93d02c000063
Logistic regression or logit regression is a type of probabilistic statistical classification model.[1] It is also used to predict a binary response from a binary predictor, and for predicting the outcome of a categorical dependent variable (i.e., a class label) based on one or more predictor variables (features). That is, it is used in estimating the empirical values of the parameters in a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. Frequently, "logistic regression" is used to refer specifically to the problem in which the dependent variable is binary, that is, the number of available categories is two, while problems with more than two categories are referred to as multinomial logistic regression or, if the multiple categories are ordered, as ordered logistic regression.
Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually (but not necessarily) continuous, by using probability scores as the predicted values of the dependent variable.[2] As such it treats the same set of problems as does probit regression using similar techniques.
Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types (for example, "dead" vs. "alive"). Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C"). In binary logistic regression, the outcome is usually coded as "0" or "1", as this leads to the most straightforward interpretation.[8] If a particular observed outcome for the dependent variable is the noteworthy possible outcome (referred to as a "success" or a "case") it is usually coded as "1" and the contrary outcome (referred to as a "failure" or a "noncase") as "0". Logistic regression is used to predict the odds of being a case based on the values of the independent variables (predictors). The odds are defined as the probability that a particular outcome is a case divided by the probability that it is a noncase.
Like other forms of regression analysis, logistic regression makes use of one or more predictor variables that may be either continuous or categorical data. Unlike ordinary linear regression, however, logistic regression is used for predicting binary outcomes of the dependent variable (treating the dependent variable as the outcome of a Bernoulli trial) rather than continuous outcomes. Given this difference, it is necessary that logistic regression take the natural logarithm of the odds of the dependent variable being a case (referred to as the logit or log-odds) to create a continuous criterion as a transformed version of the dependent variable. Thus the logit transformation is referred to as the link function in logistic regression—although the dependent variable in logistic regression is binomial, the logit is the continuous criterion upon which linear regression is conducted.[8]
The logit of success is then fit to the predictors using linear regression analysis. The predicted value of the logit is converted back into predicted odds via the inverse of the natural logarithm, namely the exponential function. Therefore, although the observed dependent variable in logistic regression is a zero-or-one variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a success (a case). In some applications the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a case; this categorical prediction can be based on the computed odds of a success, with predicted odds above some chosen cut-off value being translated into a prediction of a success.
Logistic function, odds ratio, and logit
Figure 1. The logistic function, with $\beta_0 + \beta_1 x$ on the horizontal axis and $F(x)$ on the vertical axis.
An explanation of logistic regression begins with an explanation of the logistic function, which always takes on values between zero and one:[8]
$$F(t) = \frac{e^t}{e^t+1} = \frac{1}{1+e^{-t}},$$
and viewing $t$ as a linear function of an explanatory variable $x$ (or of a linear combination of explanatory variables), the logistic function can be written as:
$$F(x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x)}}.$$
This will be interpreted as the probability of the dependent variable equalling a "success" or "case" rather than a failure or non-case. We also define the inverse of the logistic function, the logit:
$$g(x) = \ln \frac{F(x)}{1 - F(x)} = \beta_0 + \beta_1 x,$$
and equivalently:
$$\frac{F(x)}{1 - F(x)} = e^{\beta_0 + \beta_1 x}.$$
A graph of the logistic function $F(x)$ is shown in Figure 1. The input is the value of $\beta_0 + \beta_1 x$ and the output is $F(x)$. The logistic function is useful because it can take an input with any value from negative infinity to positive infinity, whereas the output $F(x)$ is confined to values between 0 and 1 and hence is interpretable as a probability. In the above equations, $g(x)$ refers to the logit function of some given linear combination $x$ of the predictors, $\ln$ denotes the natural logarithm, $F(x)$ is the probability that the dependent variable equals a case, $\beta_0$ is the intercept from the linear regression equation (the value of the criterion when the predictor is equal to zero), $\beta_1 x$ is the regression coefficient multiplied by some value of the predictor, and base $e$ denotes the exponential function.
The formula for $F(x)$ illustrates that the probability of the dependent variable equaling a case is equal to the value of the logistic function of the linear regression expression. This is important in that it shows that the value of the linear regression expression can vary from negative to positive infinity and yet, after transformation, the resulting expression for the probability $F(x)$ ranges between 0 and 1. The equation for $g(x)$ illustrates that the logit (i.e., log-odds or natural logarithm of the odds) is equivalent to the linear regression expression. Likewise, the next equation illustrates that the odds of the dependent variable equaling a case are equivalent to the exponential function of the linear regression expression. This illustrates how the logit serves as a link function between the probability and the linear regression expression. Given that the logit ranges between minus infinity and infinity, it provides an adequate criterion upon which to conduct linear regression, and the logit is easily converted back into the odds.[8]
Multiple explanatory variables
If there are multiple explanatory variables, then the above expression $\beta_0+\beta_1x$ can be revised to $\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m$. Then, when this is used in the equation relating the logged odds of a success to the values of the predictors, the linear regression will be a multiple regression with $m$ explanators; the parameters $\beta_j$ for all $j = 0, 1, 2, \ldots, m$ are all estimated.
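The logistic function and its inverse above translate directly into a few lines of code. This is a minimal sketch with illustrative values of $\beta_0$ and $\beta_1$ chosen so the linear predictor is zero:

```python
import math

def logistic(t):
    # F(t) = 1 / (1 + e^{-t}): maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    # g(p) = ln(p / (1 - p)): the inverse of the logistic function.
    return math.log(p / (1.0 - p))

# With beta0 = -2 and beta1 = 0.5, the linear predictor at x = 4 is 0,
# so the predicted probability is F(0) = 0.5.
beta0, beta1, x = -2.0, 0.5, 4.0
p = logistic(beta0 + beta1 * x)
print(p)          # 0.5
print(logit(p))   # recovers beta0 + beta1 * x = 0.0
```

Applying `logit` to a fitted probability recovers the linear predictor, which is exactly the link-function role described above.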
Agreed with Fariborz Hamidi, he has given a detailed description of the use of logistic regression.
Nice question!
Logistic regression is usually used with binary response variables (0 or 1); the predictors can be continuous or discrete. When should we use logistic regression? When we suspect a violation of the assumptions of ordinary regression analysis, such as normality of the errors, which happens for example if p is very close to 0 or very close to 1. If, for example, I were studying first-service fertility in cattle, whose value is very close to 0.5, I would use weighted least squares regression for the analysis of that variable, even if the proportion is in the range 0.40 to 0.60, depending of course also on the sample size. As I said earlier, I would use logistic regression if the proportion I am estimating is very small or very large.
What I would worry about is the case where a predictor is continuous (observational research); it may well be better to categorize the predictor, say based on quartiles or percentiles, to ensure that these proportions are estimated as appropriately as possible. Obviously this is not an issue of concern if, instead of an observational study, we can select certain values of a continuous predictor, but you should have several data points for each level in order to avoid errors in the estimation.
I also use logistic regression to identify the relative contribution of each of several variables on the response variable which either is present (Y=1) or it is absent (Y=0). The possible applications are many.
Here are three examples of where I have used LR recently: (1) identifying ecological factors that may affect the presence of an animal in an environment (concordance was 80%); (2) identifying the most important factors associated with student retention and drop-out (concordance was 82%); (3) identifying factors associated with a rare disease about which not much is known (concordance was about 75%).
The Concordance rate tells us how good the model is for separating the 0's and the 1's with the chosen model. This can be quite useful in some applications.
I find LR to be more useful than doing a chi-square test, when applicable.
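For readers unfamiliar with the concordance rate mentioned above: it is the c-statistic, the fraction of (case, non-case) pairs in which the case gets the higher predicted probability. A minimal sketch with made-up probabilities:

```python
import numpy as np
from itertools import product

def concordance(y, p):
    # C-statistic: fraction of (case, non-case) pairs in which the case
    # receives the higher predicted probability (ties count as one half).
    cases, noncases = p[y == 1], p[y == 0]
    wins = ties = 0
    for pc, pn in product(cases, noncases):
        if pc > pn:
            wins += 1
        elif pc == pn:
            ties += 1
    return (wins + 0.5 * ties) / (len(cases) * len(noncases))

y = np.array([1, 1, 1, 0, 0, 0])
p = np.array([0.9, 0.8, 0.4, 0.5, 0.3, 0.2])
print(concordance(y, p))   # 8 of the 9 pairs are concordant, so ~0.889
```

A value of 0.5 means the model separates 0's and 1's no better than chance; values around 0.8, as in the examples above, indicate good separation.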
On a related note... there have been some intelligent discussions on the internet about the minimum number of cases (per independent variable) to use in a model. However, I was wondering whether folks had some insights regarding the need for a minimum number of cases per category of a dichotomous dependent variable.
For example, if one is examining college degree status (no, yes) as an outcome variable, would there need to be a minimum number of cases for both categories of college degree status? Let's assume this involves a sample (n = 200) being used to examine 7 independent variables to predict the likelihood of college degree status.
Looking forward to your comments.
Categorical data need logistic regression analysis.
Logistic regression is used when the output or dependent variable is categorical or dichotomous, and the independent variables may be either categorical or continuous.
Brijesh! The use of logistic regression depends on your type of variables and related data. If your DV (dependent variable), as an outcome, is dichotomous in nature then you can use what is known as binary logistic regression to determine the related predictor(s). If the DV has three or more categories then you have to use what we call multinomial regression. Further, if you are dealing with an ordinal type of data, then your choice will be ordinal regression.
What if the DV is in Likert type scales? For example, a 5 point Likert type scale as DV?
Can I ask: can anyone here explain how to interpret goodness of fit? I am doing a preliminary ordinal logit regression. The DV is ordinal (10 categories), while the IV is scale (percentage per year). I read somewhere not to go by the goodness-of-fit p-value since the IV is scale. Is this true?
First, thanks all for the complete answers. I am currently working with a dataset where my response variable is the number of successes divided by the total number of tests per patient. So I summarized the 0/1 responses for each test per patient into one variable, the proportion of 1 outcomes, so that my response variable is between 0 and 1 (in many cases it is 0). I have a big dataset with many independent variables and I am looking for a way to check for multicollinearity between them. Any suggestions how I can do it with SAS? Thanks :)
The dependent variable is a dummy and the independent variables can be dummy or continuous variables. Before taking the variables into the model:
1. a multicollinearity test should be done for the continuous independent variables;
2. check the contingency coefficient of the dummy independent variables.
Hi, I am working on logistic regression using a dependent variable with a prevalence of approximately 20%. My question is about the validity of using another dependent variable with a much higher prevalence (80%).
The predictors change significantly, so I wonder if this is valid.
Again, regarding controlling for other covariates and the inclusion/consideration of factors while conducting a multivariate analysis on data with a small sample size: is it okay if I use variables that are significant at p < 0.20 in the univariate analysis in the logistic regression? Would this give a better model than considering only the factors significant at the final threshold?
Logistic regression is used when you have data consisting of a categorical dependent variable and categorical or continuous predictors.
Please find hereafter an example of my own, using logistic regression in the financial analysis of agricultural holdings:
Introduction to Scoring Methods: Financial Problems of Farm Holding, CS-BIGS 2(1): 56-7.
It includes a preliminary PCA study and a comparison of the concepts involved.
Best regards,
http://www.bentley.edu/sites/www.bentley.edu.centers/files/csbigs/Desbois.pdf
It is widely used when there is a dichotomous dependent variable and independent and / or predictive variables that can be metric, ordinal or nominal.
Logistic regression measures the strength and statistical significance of each independent variable with respect to the probability of moving from one situation to another, keeping constant the effect of the other variables.
It is used when in my theoretical model I have a dichotomous dependent variable.
As far as I am concerned, there is no link between the use of the chi-square test and logistic regression. Which statistical tool should be applied depends upon the nature of the research problem.
Logistic regression is used when the dependent variable is categorical in nature and the independent variables are categorical, continuous, or a combination of both. The DV may be dichotomous or polytomous; the logistic regression used would thus be binary logistic regression or multinomial logistic regression, respectively.
Chi-square analysis involves only two variables, while logistic regression involves at least two (one dependent and at least one independent). Sometimes an association becomes significant in the logistic regression because the confounding effect of other variables is removed. So, sure you can!
Looking through these answers, a lot of people are claiming that logistic regression is used only for binary (Bernoulli) outcome variables. "Logistic" refers to the link function being the logit. It is often used for binomial outcome variables as well, and so often used for proportions (the response with the most recommendations says binomial means 0 or 1, rather than the sum of some number of 0/1 events!), so that predicted values do not fall outside [0, 1]. As for its relationship with chi-square, you need to provide more details for this to make sense to readers.
In the logistic regression technique, input features can be:
a. qualitative
b. quantitative
c. both a and b
d. none of the above
Hi everyone, can I ask: if all the answers in our data are 1 (Yes) for the outcome variable (DV), can we still perform logistic regression to test the relationship with another set of predictors (IVs)? Thank you