I didn't notice this bug until I had a dataset where the number of factors was greater than the number of rows, which led to an overfitted, perfect model when doing partial correlation (i.e. comparing two models' residuals).

So the correlation would be perfect, for every comparison of any given factor.
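To make the effect concrete, here is a minimal sketch of partial correlation via residuals, assuming NumPy and purely random data (the names `residuals`, `partial_corr` and `noise` are made up for illustration). With an intercept and k control variables, the residuals live in a space of dimension n - k - 1, so once k reaches n - 2 both residual vectors are forced onto the same line and the correlation should come out at +1 or -1 (up to floating-point error), no matter what the data are.

```python
import numpy as np

def residuals(target, controls):
    """Residuals of an OLS fit of target on an intercept plus the control columns."""
    n = len(target)
    A = np.column_stack([np.ones(n), controls])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return target - A @ beta

def partial_corr(x, y, controls):
    """Correlation between the residuals of x ~ controls and y ~ controls."""
    rx = residuals(x, controls)
    ry = residuals(y, controls)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical data: everything is independent noise, so any "signal" is an artifact.
rng = np.random.default_rng(0)
n = 10
x, y = rng.normal(size=n), rng.normal(size=n)
noise = rng.normal(size=(n, n - 2))

for k in (1, 4, 7, 8):
    # As k approaches n - 2 the residual space shrinks and |r| is pushed toward 1.
    print(k, round(partial_corr(x, y, noise[:, :k]), 3))
```

With k = n - 1 or more controls the residuals themselves collapse to (numerically) zero, so the correlation is no longer defined at all.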

I tried to reduce the # of factors to less than the number of rows, but then I realized that the closer the # of factors is to the number of rows, the higher the correlation is, which also means the greater the chance that the correlation will be significant.

So I have a problem: partial correlation analysis depends on how the # of factors compares to the number of rows.
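One standard way this dependence is handled is in the significance test itself: the usual t-test for a partial correlation with k controlled variables uses n - 2 - k degrees of freedom rather than n - 2. The sketch below (the function name `partial_corr_t` is an assumption, not from any library) shows how the same r is worth less evidence as k grows. Testing the residual correlation as if it were an ordinary correlation with n - 2 degrees of freedom would overstate significance.

```python
import math

def partial_corr_t(r, n, k):
    """t statistic and degrees of freedom for a partial correlation r,
    computed from n rows while controlling for k other variables."""
    df = n - 2 - k
    if df <= 0:
        raise ValueError("no residual degrees of freedom: need n > k + 2")
    if abs(r) >= 1:
        return math.copysign(math.inf, r), df
    return r * math.sqrt(df / (1.0 - r * r)), df

# Same r and n, different numbers of controls.
print(partial_corr_t(0.5, 30, 3))   # df = 25
print(partial_corr_t(0.5, 30, 20))  # df = 8, noticeably smaller t
```

The t value would then be compared against a t distribution with the returned df to get a p-value.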

I'm not sure how to account for this. I have a rule of thumb that I should have at least 5*(p+2) records for p predictors. So I wrote a function that splits the factors into sets of (rows/5)-2. But this is just a rule of thumb.
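For reference, a minimal sketch of what such a splitting function might look like (this is not the original function; `max_predictors` and `split_factors` are hypothetical names). It inverts the rule n >= 5*(p+2) to get p <= n/5 - 2 and chunks the factor list accordingly.

```python
def max_predictors(n_rows, records_per_param=5, extra_params=2):
    """Largest p that satisfies n_rows >= records_per_param * (p + extra_params)."""
    return n_rows // records_per_param - extra_params

def split_factors(factors, n_rows):
    """Split a list of factor names into consecutive sets of at most max_predictors(n_rows)."""
    size = max_predictors(n_rows)
    if size < 1:
        raise ValueError("too few rows for even one predictor under this rule of thumb")
    return [factors[i:i + size] for i in range(0, len(factors), size)]

# 30 rows allow 30 // 5 - 2 = 4 predictors per set.
groups = split_factors(list("abcdefghij"), 30)
print(groups)
```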

I'm hoping for a simple solution, because right now I simply check whether p < .8*(nrows). If it is, I don't do anything; if it's not, I split the factors up and then aggregate them back together until p < .8*(nrows).

What I'm doing is running partial correlation on all the factors and recursively removing the least significant one until all are significant.
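A rough sketch of that elimination loop, under some assumptions that are not in the post: there is a response `y`, the partial correlation of interest is between `y` and each factor given all the other factors, and a fixed |t| cutoff (`t_crit`, a made-up parameter) stands in for a proper p-value. It relies on the fact that this partial correlation is r = t / sqrt(t^2 + df), where t is the factor's coefficient t statistic in the full regression, so ranking by |t| is the same as ranking by |r|.

```python
import numpy as np

def coef_t_stats(X, y):
    """t statistics of the slope coefficients in an OLS fit of y on X (with intercept)."""
    n, p = X.shape
    df = n - p - 1
    if df <= 0:
        raise ValueError("need more rows than predictors + 1")
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / df
    # inv() will fail or be unstable if the columns are perfectly collinear.
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return (beta / se)[1:], df

def backward_eliminate(X, y, t_crit=2.0):
    """Drop the factor with the smallest |t| until every remaining |t| >= t_crit."""
    kept = list(range(X.shape[1]))
    while kept:
        t, df = coef_t_stats(X[:, kept], y)
        weakest = int(np.argmin(np.abs(t)))
        if abs(t[weakest]) >= t_crit:
            break
        kept.pop(weakest)
    return kept

# Hypothetical data: only column 0 actually drives y.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = 3.0 * X[:, 0] + rng.normal(size=60)
print(backward_eliminate(X, y))
```

Note that the appropriate cutoff depends on df, which changes every time a factor is dropped, so a fixed `t_crit` is only a simplification.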

And I just read this http://www.psychwiki.com/wiki/How_is_a_correlation_different_than_a_partial_correlation%3F

"You typically only conduct partial correlation when the third variable has shown a relationship to one or both of the primary variables. In other words, you typically first conduct correlational analysis on all variables so that you can see whether there are significant relationships amongst the variables, including any "third variables" that may have a significant relationship to the variables under investigation."

So that may be where my problem is. But I'm not sure how I would implement that. Would I do all pairwise as well as all 3-way comparisons and link them somehow? I'm trying to wrap my head around how I would code that.
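One possible way to link the two steps, sketched under assumptions: start from an already computed pairwise correlation matrix, and for each pair only keep as "third variables" those that are correlated with at least one member of the pair. The function name `candidate_controls` and the |r| `cutoff` are hypothetical; a significance test could be used in place of the cutoff.

```python
from itertools import combinations

def candidate_controls(corr, names, cutoff=0.3):
    """For each pair of variables, list the third variables whose absolute
    correlation with either member of the pair is at least `cutoff`."""
    out = {}
    for a, b in combinations(range(len(names)), 2):
        out[(names[a], names[b])] = [
            names[c]
            for c in range(len(names))
            if c not in (a, b)
            and (abs(corr[a][c]) >= cutoff or abs(corr[b][c]) >= cutoff)
        ]
    return out

# Made-up correlation matrix for three variables.
names = ["A", "B", "C"]
corr = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
controls = candidate_controls(corr, names)
print(controls)
```

Partial correlations would then only be computed for a pair with the controls listed for it, which keeps the number of controlled variables well below the number of rows.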

Applied Regression Modelling gives an example of how to root out collinearity using regular correlation analysis. So maybe I can compare factor A to factor B (if found correlated) controlling for Y. I.e. regress A~Y and B~Y and compare residuals.
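With a single control variable, regressing A~Y and B~Y (with intercepts) and correlating the residuals gives the same number as the closed-form first-order partial correlation, so no regression is strictly needed for this case. A small sketch (the function name is an assumption):

```python
import math

def partial_corr_one_control(r_ab, r_ay, r_by):
    """Partial correlation of A and B controlling for Y, from the three
    ordinary pairwise correlations."""
    denom = math.sqrt((1.0 - r_ay ** 2) * (1.0 - r_by ** 2))
    if denom == 0.0:
        raise ValueError("A or B is perfectly correlated with Y")
    return (r_ab - r_ay * r_by) / denom

# If A and B each correlate 0.5 with Y, part of their 0.6 correlation is explained by Y.
print(partial_corr_one_control(0.6, 0.5, 0.5))
```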

"Calculate the absolute values of the correlations between Y and each of the quantitative predictors and between each pair of quantitiative predictors; there is a potential multicollinearity problem if any of the (absolute) correlations between each pair of predictors is greater than the highest (Absolute) correlation between Y and each of the predictors."

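The quoted rule can be sketched roughly like this, assuming NumPy, a numeric predictor matrix `X` with one column per predictor, and a response `y` (the function name and the toy data are made up for illustration):

```python
import numpy as np

def multicollinearity_pairs(X, y):
    """Return the highest |corr(Y, predictor)| and every predictor pair whose
    absolute correlation exceeds it."""
    p = X.shape[1]
    corr = np.abs(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    max_with_y = float(corr[:p, p].max())
    pairs = [
        (i, j, float(corr[i, j]))
        for i in range(p)
        for j in range(i + 1, p)
        if corr[i, j] > max_with_y
    ]
    return max_with_y, pairs

# Toy data: columns 0 and 1 are nearly identical, so that pair should be flagged.
X = np.array([
    [1.0, 1.1, 3.0],
    [2.0, 2.0, 1.0],
    [3.0, 2.9, 4.0],
    [4.0, 4.2, 1.0],
    [5.0, 5.0, 5.0],
    [6.0, 6.1, 9.0],
])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
max_with_y, flagged = multicollinearity_pairs(X, y)
print(max_with_y, flagged)
```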