11 November 2015

The task involves predicting a binary outcome in a small data set (sample sizes of 20-70) using many (>100) variables as potential predictors. The main problem is that the number of predictors is much larger than the sample size, and there is little or no knowledge of which predictors may be more important than others. It is therefore very easy to "overfit" the data, i.e. to produce models which seemingly describe the data at hand very well, but in fact include spurious predictor variables.

I tried using an ensemble classification method called randomGLM (see http://labs.genetics.ucla.edu/horvath/htdocs/RGLM/#tutorials), which seeks to improve on AICc-based GLM selection using the "bagging" approach taken from random forests. I checked the results by K-fold cross-validation and ROC curves. The results look good at first sight: e.g. a GLM which contains only those variables which were used in >=30 out of 100 "bags" produced a ROC curve AUC of 87%.

However, I challenged these results with the following test: several "noise" variables (formulas using random numbers from the Gaussian and other distributions) were added to the data, and the randomGLM procedure was run again. This was repeated several times with different random values for the noise variables. The noise variables actually attained non-negligible importance, i.e. they "competed" fairly strongly with the real experimental variables and were sometimes selected in as many as 30-50% of the random "bags".

To "filter out" these nonsense variables, I tried discarding all variables whose correlation coefficient with the outcome was not statistically significantly different from zero (with Bonferroni correction for the number of variables tested) and ran randomGLM on the retained variables only. This works (I checked it with simulated data), but is of course very conservative on real data: almost all variables are discarded, and the resulting classification is poor.

What would be a better way to eliminate noise variables when using ensemble prediction methods like randomGLM in R? Thank you in advance for your interest and comments!
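
For concreteness, here is a minimal sketch of the screening step described above: Gaussian noise variables are added to simulated data, and only variables whose correlation with the outcome passes a Bonferroni-corrected test are kept. It is written in plain Python rather than R so that it has no package dependencies; the function names (`pearson_r`, `fisher_p_value`, `bonferroni_screen`, `simulate`), the effect size, and the use of the Fisher z approximation in place of an exact correlation test are all assumptions made for illustration, not the code actually used for the analysis.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def fisher_p_value(r, n):
    """Two-sided p-value for H0: correlation = 0, via the Fisher z approximation.

    Assumption: a normal approximation stands in for the exact t-test; needs n > 3.
    """
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def bonferroni_screen(columns, y, alpha=0.05):
    """Keep only variables whose p-value is below alpha / (number of variables)."""
    threshold = alpha / len(columns)
    n = len(y)
    return [name for name, values in columns.items()
            if fisher_p_value(pearson_r(values, y), n) < threshold]


def simulate(n=50, n_signal=3, n_noise=100, effect=3.0, seed=1):
    """Hypothetical data: a few shifted 'signal' variables plus pure Gaussian noise."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]  # balanced binary outcome
    columns = {}
    for j in range(n_signal):
        columns[f"signal_{j}"] = [effect * yi + rng.gauss(0, 1) for yi in y]
    for j in range(n_noise):
        columns[f"noise_{j}"] = [rng.gauss(0, 1) for _ in y]
    return columns, y


columns, y = simulate()
kept = bonferroni_screen(columns, y)
print(len(columns), "variables screened;", len(kept), "kept:", kept)
```

With a strong simulated effect the signal variables survive and almost all noise variables are dropped, which matches what I saw on simulated data. Lowering `effect` toward realistic values shows the conservativeness problem: the threshold `alpha / len(columns)` becomes hard to pass at n = 20-70. In R the same screen would presumably use `cor.test` and `p.adjust(method = "bonferroni")`.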
