I didn't notice this bug until I had a dataset where the number of factors was greater than the number of rows, which led to an overfitted, perfect model when doing partial correlation (i.e. comparing two models' residuals).
So the correlation came out perfect for every comparison of any given factor.
I tried reducing the number of factors to less than the number of rows, but then I realized that the closer the number of factors gets to the number of rows, the higher the correlation is, which also means the greater the chance that the correlation will be significant.
So I have a problem: partial correlation analysis depends on the number of factors being compared relative to the number of rows.
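To make that concrete, here's a toy sketch (Python/numpy, with made-up random data, so just an illustration of the failure mode, not my actual code): once the number of factors reaches the number of rows, least squares fits the noise exactly, the residuals are essentially zero, and every "partial correlation" computed on those residuals looks perfect.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 30                      # more factors than rows
X = rng.normal(size=(n, p))        # predictors
y = rng.normal(size=n)             # response: pure noise

# OLS via least squares; with p >= n the fit is exact (up to rounding),
# so the residuals that partial correlation compares are all ~0
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
print(np.max(np.abs(residuals)))   # ~1e-14: a "perfect" model on noise
```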
I'm not sure how to account for this. My rule of thumb is that I should have 5*(p+2) records for p predictors, so I wrote a function that splits the factors into sets of (rows/5)-2. But that's just a rule of thumb.
I'm hoping for a simple solution, because right now I simply check whether p < .8*(nrows); if it is, I don't do anything, and if it's not, I split the factors up and then aggregate them back together until p < .8*(nrows).
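For reference, the splitting function is roughly this (a minimal Python sketch, not my production code; split_factors and its parameter names are hypothetical):

```python
import math

def split_factors(factor_names, n_rows, ratio=5, overhead=2):
    """Split factors into chunks small enough that
    n_rows >= ratio * (chunk_size + overhead),
    i.e. chunk_size <= n_rows/ratio - overhead (the 5*(p+2) rule)."""
    chunk_size = max(1, math.floor(n_rows / ratio) - overhead)
    return [factor_names[i:i + chunk_size]
            for i in range(0, len(factor_names), chunk_size)]

# e.g. 100 rows -> chunks of at most (100/5) - 2 = 18 factors
print(split_factors([f"x{i}" for i in range(50)], n_rows=100))
```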
What I'm doing is running partial correlation on all the factors and recursively removing the least significant one until all of them are significant.
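In case it helps, that loop looks roughly like this (again a Python sketch assuming numpy/scipy; partial_corr and backward_eliminate are my hypothetical names, and note that pearsonr's p-value doesn't adjust the degrees of freedom for the controlled variables, so it's only approximate):

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, Z):
    """Partial correlation of x and y controlling for the columns of Z:
    regress each on Z, then correlate the residuals."""
    Z1 = np.column_stack([np.ones(len(x)), Z])   # add an intercept
    rx = x - Z1 @ np.linalg.lstsq(Z1, x, rcond=None)[0]
    ry = y - Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
    return stats.pearsonr(rx, ry)                # (r, approx. p-value)

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the factor with the largest p-value until every remaining
    factor's partial correlation with y is significant."""
    names = list(names)
    while len(names) > 1:
        pvals = {}
        for i, name in enumerate(names):
            others = [j for j in range(len(names)) if j != i]
            _, p = partial_corr(X[:, i], y, X[:, others])
            pvals[name] = p
        worst = max(pvals, key=pvals.get)
        if pvals[worst] <= alpha:
            break                                # all significant
        X = np.delete(X, names.index(worst), axis=1)
        names.remove(worst)
    return names
```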
And I just read this http://www.psychwiki.com/wiki/How_is_a_correlation_different_than_a_partial_correlation%3F
"You typically only conduct partial correlation when the third variable has shown a relationship to one or both of the primary variables. In other words, you typically first conduct correlational analysis on all variables so that you can see whether there are significant relationships amongst the variables, including any "third variables" that may have a significant relationship to the variables under investigation."
So that may be where my problem is. But I'm not sure how I would implement that. Would I do all the pairwise comparisons as well as all the three-way comparisons and link them somehow? I'm trying to wrap my head around how I would code that.
Applied Regression Modelling gives an example of how to root out collinearity using regular correlation analysis. So maybe I can compare factor A to factor B (if they're found to be correlated) controlling for Y, i.e. regress A~Y and B~Y and compare the residuals.
"Calculate the absolute values of the correlations between Y and each of the quantitative predictors and between each pair of quantitiative predictors; there is a potential multicollinearity problem if any of the (absolute) correlations between each pair of predictors is greater than the highest (Absolute) correlation between Y and each of the predictors."