I have two separate datasets. I am tasked with training my regression model on one dataset, while the assessment of its performance has to be done on the other. What do I do?
There are several ways to accomplish this (some software packages automate the step, others do not). Here's one approach that works regardless of the software (a code sketch follows the steps below):
1. Develop your regression model on the training set.
2. Now, upload the test set into your software. Use the final regression model from your training set to compute estimated/predicted values for the DV for the test set. For example, if the final equation from step 1 was Y-est = 2.44 + 0.35*IV1 + 1.27*IV2 - 0.69*IV3, then use those specific coefficients to compute estimated values for each test case.
3. To determine the performance of the model on the test set, compute:
a. r^2(Y, Y-est) across the test cases: the ordinary squared Pearson correlation between the actual and estimated values of the DV. This can be compared directly to the multiple R-squared from the training set. If the values are comparable, the model works about equally well when applied to new data (the test set).
b. To quantify the magnitude of the discrepancies, compute the residual for each test case: e = Y - Y-est. The SD or variance of these residuals can be compared to the standard error of estimate (or to the mean squared error, respectively) from the training set.
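The same workflow can be expressed in a few lines of code. Below is a minimal sketch in Python, assuming the two data sets sit in files named train.csv and test.csv with columns y, iv1, iv2, and iv3; those file and column names, like the use of scikit-learn, are illustrative assumptions rather than anything stated in the question.

```python
# Minimal sketch of the train-then-test workflow described above.
# Assumed inputs: train.csv and test.csv with columns y, iv1, iv2, iv3 (placeholders).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
predictors = ["iv1", "iv2", "iv3"]

# Step 1: develop the regression model on the training set only.
model = LinearRegression().fit(train[predictors], train["y"])

# Step 2: apply the fitted coefficients, unchanged, to the test cases.
y_est = model.predict(test[predictors])

# Step 3a: squared Pearson correlation of actual vs. estimated DV values,
# to be compared with the multiple R-squared from the training run.
r = np.corrcoef(test["y"], y_est)[0, 1]
print("r^2(Y, Y-est) on test set:", r**2)
print("R^2 on training set:", model.score(train[predictors], train["y"]))

# Step 3b: residuals e = Y - Y-est; their SD/variance can be set against the
# training set's standard error of estimate / mean squared error.
e = test["y"] - y_est
print("Test residual SD:", e.std(ddof=1), " test MSE:", np.mean(e**2))
```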
I agree with the comments of David Morse, but I am confused by the comments of Glenn Wayne Jones. The question clearly states that one data set will be used for training and another for testing, so why would folding be needed? In my view, a k-fold procedure is not required here; it applies only when both training and testing have to be carried out on a single data set.
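For what it's worth, here is a small hypothetical contrast of the two situations, again in Python with scikit-learn; the simulated data, the coefficient values (borrowed from the example equation above), and the 5-fold choice are purely illustrative.

```python
# Contrast: k-fold cross-validation on a single data set vs. a separate held-out test set.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
coefs = np.array([0.35, 1.27, -0.69])          # illustrative true coefficients
X = rng.normal(size=(100, 3))
y = 2.44 + X @ coefs + rng.normal(scale=0.5, size=100)

# Single data set: k-fold cross-validation repeatedly splits it into
# training and validation folds to estimate out-of-sample performance.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold R^2 estimates:", scores)

# Separate training and test sets (as in the question): no folding needed;
# fit once on the training data, then score once on the held-out test data.
X_test = rng.normal(size=(40, 3))
y_test = 2.44 + X_test @ coefs + rng.normal(scale=0.5, size=40)
print("Held-out test R^2:", LinearRegression().fit(X, y).score(X_test, y_test))
```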