Poor Validation results in stepwise regression

James R Knaub

Hello Javed -

Perhaps stepwise regression is still not telling you enough because of complicated variable interactions, so that you are really not using the 'best' subset of regressors. If subject matter theory encourages you to alter the set of regressors to use, perhaps you should give those alternative models a try. I'm just guessing here, but maybe you just need to experiment with other models (but noting that if you try enough of them, you may substantially increase your risk of spurious results.)

Also, it sounds to me as if you may be using too many regressor variables. Perhaps results would settle down with fewer. Perhaps the larger set did well because it covered various contingencies, perhaps where the data should have been categorized, but if you reduce the number of variables, you may need to reduce substantially to reach a subset that works fairly generally. Perhaps the full set of regressors is itself an overfit, and not tested enough to know it.

Still, if you have plenty of test and/or validation data, you might try various models, plotting y on the y-axis and predicted-y on the x-axis, with various models plotted on the same scatterplot graph, and see how they compare.
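As a numerical companion to the scatterplot comparison, here is a minimal sketch that scores several candidate models on the same validation data. The model names and all numbers are made up for illustration; `compare_models` is a hypothetical helper, not part of any package.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def compare_models(observed, predictions_by_model):
    """Return {model name: (RMSE, correlation)} on the validation data."""
    return {
        name: (rmse(observed, pred), pearson_r(observed, pred))
        for name, pred in predictions_by_model.items()
    }

# Illustrative (made-up) validation values and predictions from two models.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predictions = {
    "small_model": [1.1, 1.9, 3.2, 3.8, 5.1],
    "full_model": [2.0, 1.0, 4.5, 2.5, 6.0],
}
scores = compare_models(observed, predictions)
for name, (err, r) in scores.items():
    print(f"{name}: RMSE={err:.3f}, r={r:.3f}")
```

A model that only looks good on the training period will tend to show a clearly larger validation RMSE here, which is the same thing the scatterplot shows visually.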

I'm not sure that this applies to your work, of which I am only vaguely aware, but perhaps it may help you develop an idea.

Cheers - Jim

Jos Feys

Maybe you should split your data in half over the whole period 1951 to 2005, and not by time period.

Jos has a point, but note that if you consider that this could mean that the best regressors change over time, then you have a more complicated problem.

Adam A. Scaife

Hi Javed,

The simplest answer to this is that the predictors you are using are not robust, and so the model is 'overfitted' to the earlier data. This is a very common problem with any empirical model. Alternatively, the data may contain non-stationary features such as trends.

Assuming there is little autocorrelation you might deal with some non-stationarity by taking alternate years as your training data (I assume this is a version of what Jos is suggesting above) and the remaining years as validation.
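A minimal sketch of the alternate-years split, assuming one record per year over the 1951 to 2005 period mentioned above. `alternate_year_split` is a hypothetical helper name.

```python
def alternate_year_split(years):
    """Split years into training (every other year, starting with the first)
    and validation (the remaining years)."""
    ordered = sorted(years)
    training = ordered[0::2]    # 1st, 3rd, 5th, ... year
    validation = ordered[1::2]  # 2nd, 4th, 6th, ... year
    return training, validation

years = list(range(1951, 2006))  # 1951 to 2005 inclusive
train_years, valid_years = alternate_year_split(years)
print(len(train_years), "training years;", len(valid_years), "validation years")
```

Because both subsets span the whole record, a trend affects training and validation alike. With noticeable year-to-year autocorrelation, neighbouring years are not independent and this split would flatter the model.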

Can you give a little more information on what you are trying to predict and what predictors you are using? It's very important to use a few key drivers of variability that have well-founded physical relationships, to avoid overfitting.

Some simple but relatively successful attempts at this include these papers:

http://onlinelibrary.wiley.com/doi/10.1002/joc.2314/full

http://onlinelibrary.wiley.com/doi/10.1002/grl.50169/abstract

Best

Adam

Michael J. Lynch

In addition to the above: I don't know if these methods work for the climate/atmospheric models and problems you are encountering, but in aggregate social and economic data, the same problems are often encountered.

One issue is the number of variables you may be using and the fact that they are correlated with one another. You can get unstable estimates in this case due to factors such as multicollinearity, and in time series data, autocorrelation, and non-normally distributed variables.

One way to limit these effects, if they are due to multicollinearity, is to set the tolerance limit for the variables, if the statistical package you are using has that option. Tolerance is the reciprocal of the variance inflation factor (VIF). If this option is available, check what the default tolerance limit is. For example, in some programs the default is tolerance = 0.1, which corresponds to a VIF of 10. That cutoff may be too permissive, allowing highly collinear variables to be included and creating unstable estimates.
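To make the tolerance/VIF relationship concrete, here is a minimal sketch for the simplest case of exactly two predictors, where the R-squared of one predictor regressed on the other equals their squared correlation. With more predictors, each one has to be regressed on all the others. The data and the helper name `tolerance_and_vif` are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def tolerance_and_vif(x1, x2):
    """Tolerance = 1 - R^2 and VIF = 1 / tolerance for a two-predictor model."""
    r = pearson_r(x1, x2)
    tolerance = 1.0 - r ** 2
    vif = math.inf if tolerance == 0.0 else 1.0 / tolerance
    return tolerance, vif

# Two made-up predictors that move almost in lockstep.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1]
tol, vif = tolerance_and_vif(x1, x2)
print(f"tolerance={tol:.4f}, VIF={vif:.1f}")
```

A pair like this falls far below a tolerance limit of 0.1, so a package using that limit would refuse to enter the second variable once the first is in the model.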

In the case of multicollinearity, you don't have unique identifiers (the independent variables are related). Another way to solve that problem -- if it makes sense theoretically -- is to create factors representing the pooled effect of related variables.

Another option is to try standardizing the predictors. One way to do this is mean subtraction (centering), which is sometimes preferred because the interpretation of the output is the same as if the variables were not standardized. You can assess whether that works by estimating the equation with the nonstandardized variables, then running the same equation with the standardized variables and comparing the outputs. What you should see, if this works, is a reduction in the standard errors of the coefficients.

Using empirical orthogonal functions (EOFs) of large-scale fields for downscaling needs to be done with just a few predictor EOFs.

First of all, decide how many EOFs it is reasonable to use to represent your large-scale variables such as geopotential height. You can do this using North's rule of thumb (you can easily find this on the web) to determine where you should truncate your EOFs. This is normally about 3 for most large-scale atmospheric fields, so if you are using many more than that you will likely be over-fitting the early data and therefore getting worse results in the later period.

Once you have a limited number of EOFs for geopotential height, be careful to ensure that you are not over-specifying by also including too many EOFs of other variables. Rainfall and geopotential height, for example, are correlated due to the effects of latent heating (associated with rainfall), which will increase the geopotential height. You can do this second step by simply checking that the principal component time series of one variable and another are not significantly correlated.

Reducing the number of predictors should hopefully lead to a more robust downscaling relationship.
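A minimal sketch of how North's rule of thumb can be applied to pick a truncation point. It assumes the EOF eigenvalues are already computed and sorted in descending order, and that the number of independent samples is known. The rule estimates the sampling error of eigenvalue k as roughly lambda_k * sqrt(2/N); the stopping criterion below (keep leading EOFs while the gap to the next eigenvalue exceeds that error) is one common reading of the rule, and the eigenvalues are made up.

```python
import math

def north_truncation(eigenvalues, n_independent):
    """Count the leading EOFs whose eigenvalue is separated from the next
    one by more than its approximate sampling error, lambda * sqrt(2 / N)."""
    factor = math.sqrt(2.0 / n_independent)
    kept = 0
    for k in range(len(eigenvalues) - 1):
        sampling_error = eigenvalues[k] * factor
        gap = eigenvalues[k] - eigenvalues[k + 1]
        if gap <= sampling_error:
            break  # EOF k is not distinct from its neighbour: stop here
        kept += 1
    return kept

# Made-up eigenvalue spectrum and 55 yearly samples (treated as independent).
eigenvalues = [10.0, 5.0, 2.5, 2.3, 2.2, 1.0]
n_keep = north_truncation(eigenvalues, n_independent=55)
print("EOFs to keep:", n_keep)
```

If the yearly samples are autocorrelated, the effective number of independent samples is smaller than the record length, the sampling errors grow, and fewer EOFs pass the test.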

Best regards

Abdulrazzak Charbaji

Before using stepwise regression for validation and prediction, you should know that you are using TIME SERIES DATA, as opposed to cross-sectional data. I believe you have to check the Durbin-Watson statistic, unit roots, cointegration, VEC and VAR models, etc.; otherwise, your findings will be misleading.
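As a minimal sketch of the first of these checks, the Durbin-Watson statistic can be computed directly from the regression residuals taken in time order. Values near 2 suggest little first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The residuals below are made up, and the other checks (unit root, cointegration) are not covered here.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences of the
    residuals divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residuals: a sign-flipping series and a perfectly persistent one.
alternating = [1.0, -1.0, 1.0, -1.0]
persistent = [1.0, 1.0, 1.0, 1.0]
print(durbin_watson(alternating))  # well above 2: negative autocorrelation
print(durbin_watson(persistent))   # 0: strong positive autocorrelation
```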

Chuck A Arize

https://towardsdatascience.com/stopping-stepwise-why-stepwise-selection-is-bad-and-what-you-should-use-instead-90818b3f52df