I am confused about multivariate statistical analysis.
Can you please tell me whether it is logical or correct to include variables in a multivariate statistical analysis if they showed no statistical significance in univariate analysis?
If we are talking about logistic regression, I presume that you have analysed categorical variables via OR or RR and calculated a p value using a continuity-corrected chi-square test or something similar.
Normally, if the p value is not significant, you would not proceed to include that variable in a logistic regression model with multiple variables, unless you suspect a specific interaction with other variables in your multivariate model, based on known biological plausibility, that makes it sensible to include it. You would have to justify that in your methods, though.
(If it were a multivariate linear regression, I would look at the regression coefficient rather than the p value. But for a logistic regression, if the initial OR or RR for that single variable is unremarkable and there is no significance, I would leave it out.)
In any case, it is very unlikely that a variable that does not give you a significant p value at the univariate stage then magically becomes significant in a multivariate analysis.
Thank you very much for your quick response and contribution.
So I was right about this question. I have just reviewed an article in which the authors report a variable that did not give a significant p value in Student's t-test but was significant in a multivariate analysis.
Do you really think this can happen? Should we regard such a variable as a confounder, a contributor, or something else?
Yes, but sometimes the p value threshold used to select a variable at the univariate / Student's t-test stage is more lenient, e.g. 10% rather than 5%. This seems to be generally 'acceptable'.
1. P-value is affected by sample size. If you work with a relatively small sample, some variables can have a substantive importance, although they are not significant.
2. The decision about which variables to include in or exclude from the model should rest on some theoretical basis. If the non-significant variable "has" to be in the model according to the theory, you should enter it despite its non-significance.
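To make point 1 concrete, here is a minimal sketch of how the same effect can be non-significant in a small sample and clearly significant in a larger one. The mean difference (0.5), the standard deviation (1.0) and the group sizes are hypothetical numbers chosen for illustration, and the p value uses a normal approximation, which is only rough for small groups (a proper t-test would give somewhat larger p values there).

```python
from statistics import NormalDist

def approx_two_sided_p(mean_diff, sd, n_per_group):
    """Approximate two-sided p value for a two-group mean comparison.

    Assumes equal group sizes and a common SD, and uses a normal
    approximation instead of the exact t distribution.
    """
    standard_error = sd * (2 / n_per_group) ** 0.5
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same hypothetical effect (difference 0.5, SD 1.0), different sample sizes.
for n in (10, 30, 100):
    print(f"n per group = {n:3d}, approx p = {approx_two_sided_p(0.5, 1.0, n):.4f}")
```

The effect size never changes in this sketch; only the standard error shrinks as n grows, which is why a p value alone says little about substantive importance.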
1. True. But my point was about a hypothetical chi-square test, which is calculated on proportions within groups. The sample size, unless very small, is not paramount. I did not know at that stage that it was a t-test.
2. Point 2 is just a repetition of what I said about suspecting "a specific interaction with other variables based on a known biological plausibility", which is the equivalent of your "has" to be in the model.
I think it depends on the research question and the discipline. In some disciplines, demographic variables such as age are always adjusted for in multivariate analyses. With regard to the research question, if a variable is of theoretical importance, one might choose to keep it in the model despite its non-significance.
I have a similar case in my research. I examined the correlation between two variables and the result was not significant. However, when I entered that variable into the PLS-SEM, the result was significant. I consulted a professor of statistics, and his advice was to enter all variables regardless of whether their correlations were significant.
Indeed, a variable that is non-significant in bivariate analysis can turn out to be statistically significant in a multiple regression. This is called a suppressor effect. The effect can even change sign, from negative to positive or vice versa, a reversal closely related to Simpson's paradox.
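A small sketch of the suppressor idea, using the standard closed-form expression for standardized coefficients in a two-predictor linear regression. The correlation values are hypothetical and were picked only to make the effect visible; nothing here comes from the article under discussion.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized OLS coefficients for y ~ x1 + x2, from pairwise correlations."""
    denom = 1 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2

# Hypothetical case 1: x1 has zero correlation with y on its own,
# but is strongly correlated with x2, which does predict y.
b1, b2 = standardized_betas(r_y1=0.0, r_y2=0.5, r_12=0.7)
print(f"suppression:   beta1 = {b1:.3f}, beta2 = {b2:.3f}")

# Hypothetical case 2: x1 is positively correlated with y on its own,
# yet its coefficient is negative once x2 is in the model.
c1, c2 = standardized_betas(r_y1=0.2, r_y2=0.6, r_12=0.8)
print(f"sign reversal: beta1 = {c1:.3f}, beta2 = {c2:.3f}")
```

In the first case x1 looks useless in a bivariate test but gets a clearly nonzero coefficient in the joint model; in the second, the sign of its effect flips once x2 is adjusted for.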
In my experience it is not unusual for one or more variables to have no significant effect individually but to be significant when included in a multivariate regression. This often happens when one or more variables have a very large effect. Variables with smaller effects are then often not significant on their own, because the variable with the large effect is not in the model, so the variation it causes ends up in the residual variation used to construct the tests of significance.
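The mechanism described above can be sketched with a small simulation. Everything in it is an assumption made for illustration: a balanced design with two predictors coded -1/+1, a large effect (10) for x1, a small effect (0.5) for x2, and unit-variance noise. The OLS fit is written out by hand so that the sketch needs only the standard library; |t| > 1.96 is used as the approximate large-sample cut-off for a two-sided 5% test.

```python
import random

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def slope_t_stats(predictors, y):
    """Fit y ~ intercept + predictors by OLS; return the t-statistics of the slopes."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    residuals = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in residuals) / (n - p)  # residual variance
    return [beta[a] / (sigma2 * inv[a][a]) ** 0.5 for a in range(1, p)]

rng = random.Random(0)
# Hypothetical balanced design: 50 observations in each (x1, x2) cell,
# so x1 and x2 are exactly uncorrelated.
x1, x2 = [], []
for a in (-1.0, 1.0):
    for b in (-1.0, 1.0):
        x1 += [a] * 50
        x2 += [b] * 50
y = [10.0 * a + 0.5 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

(t_alone,) = slope_t_stats([x2], y)      # x2 tested on its own
t_x1, t_x2 = slope_t_stats([x1, x2], y)  # x2 tested with x1 in the model
print(f"x2 alone:      t = {t_alone:.2f}")
print(f"x2 with x1 in: t = {t_x2:.2f} (x1: t = {t_x1:.2f})")
```

Because x1 and x2 are uncorrelated here, the estimated slope of x2 is the same in both fits; only the residual variance, and hence the standard error, changes. That is a different mechanism from suppression, which relies on correlation between the predictors.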
Recommendation. Construct scatter plots of Y vs X for each of the Xs to get a general view of the effects of the variables. Use multivariate regression to test the significance of the effects of the variables (Xs). Be aware of the effects of any multicollinearity on the tests of significance.
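On the multicollinearity point, a rough sketch of how correlation between predictors inflates standard errors. It covers only the two-predictor case, where the variance inflation factor reduces to 1 / (1 - r^2) with r the correlation between the two Xs; the r values are hypothetical.

```python
def variance_inflation_factor(r):
    """VIF for either predictor in a two-predictor model with correlation r."""
    return 1.0 / (1.0 - r ** 2)

for r in (0.0, 0.5, 0.9, 0.99):
    vif = variance_inflation_factor(r)
    # The standard error of the coefficient grows by sqrt(VIF),
    # so the t-statistic shrinks by the same factor.
    print(f"r = {r:4.2f}: VIF = {vif:6.2f}, SE inflated x{vif ** 0.5:.2f}")
```

With more than two predictors the same idea applies, but r^2 is replaced by the R^2 from regressing each X on all the other Xs.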
If you are going to enter all variables into the multivariate analysis regardless of their significance, then what is the purpose of the univariate analysis?
"Annals strongly disagrees with the technique of using univariate analyses (ie, p-values) to select the variables to include in a multivariate model. The problem with this method is that there are variables that may not be important in a univariate association but are important in the full model. For example, a variable that may not be statistically associated overall may display important predictive value in one group (say males) but not in another (say females)."
Just because a variable is substantively important does not mean that we are allowed to include it in the statistical model. We cannot ignore the issue of collinearity. In addition, we cannot treat a small sample size as unimportant.
It is clear that based on mathematical proofs, this model reports variables that may have a significant effect as non-significant. This is the wrong result and the wrong report.
The book "Applied Linear Statistical Models", written by Michael H. Kutner et al., is a major reference in statistics that explicitly addresses these issues.
I think you meant bivariate analysis rather than univariate. Besides, if the variables were found to be significant in other studies, you can include them as long as you cite those studies. In general, a variable that is not significant on its own may have only a limited chance of being significant in a multivariate analysis.