I'm selecting predictors for downscaling from atmospheric predictors using stepwise multiple regression over the period 1951-2005. I have split the data into two non-overlapping periods: one for calibration (1951-1980) and the other for validation (1981-2005). The problem is that I'm getting seemingly random results: sometimes a model that shows a high R^2 during calibration shows a very low R^2 during validation, and vice versa. For example, one model has an R^2 of 0.8 during calibration but only 0.1 during validation, while another has an R^2 of 0.3 during calibration and 0.4 during validation.
So it is becoming difficult to select a model because of these seemingly random results. I have also tried different calibration and validation periods, but found no consistency.
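To make the setup concrete, here is a minimal sketch of the kind of split-sample check described above. It is not my actual code or data: the predictors and predictand are synthetic, one value per year is assumed, plain least squares stands in for the stepwise procedure, and the names `r_squared` and `split_sample_r2` are made up for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def split_sample_r2(X, y, years, last_calibration_year=1980):
    """Fit on the calibration years only, then score both periods."""
    cal = years <= last_calibration_year
    val = ~cal

    # Design matrices with an intercept column.
    A_cal = np.column_stack([np.ones(cal.sum()), X[cal]])
    A_val = np.column_stack([np.ones(val.sum()), X[val]])

    # Coefficients are estimated from the calibration period alone.
    coef, *_ = np.linalg.lstsq(A_cal, y[cal], rcond=None)

    r2_cal = r_squared(y[cal], A_cal @ coef)
    r2_val = r_squared(y[val], A_val @ coef)
    return r2_cal, r2_val


# Synthetic stand-in data (assumption): 3 candidate predictors,
# one value per year, only the first predictor carries signal.
rng = np.random.default_rng(0)
years = np.arange(1951, 2006)
X = rng.normal(size=(years.size, 3))
y = 0.5 * X[:, 0] + rng.normal(size=years.size)

r2_cal, r2_val = split_sample_r2(X, y, years)
print(f"calibration R^2 = {r2_cal:.2f}, validation R^2 = {r2_val:.2f}")
```

With one value per year this leaves 30 calibration and 25 validation samples. The validation R^2 computed this way can be negative when the fitted model predicts worse than the validation-period mean.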
Therefore, I would like to know what the possible reasons for this might be and how to overcome it.
Thanks.