I am a bit confused about how permutation tests are used to detect (or rule out) overfitting in machine learning, as in the case below.
Say I have a dataset that is divided into a training set and a test set (10% and 90%, respectively). There are only two labels (0 and 1) and a single value for each data point.
I illustrate the training dataset below:
Label  Value
0      11
0      35
0      48
0      56
1      67
1      88
1      44
I would normally run a machine learning (ML) method on the training set to produce a solution (a fitted model) and an outcome (a performance score).
This solution is then applied to the test set, and its outcome observed.
If the model is not overfitting, I should expect similar performance on the test set as on the training set.
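For concreteness, here is a minimal sketch of what I mean. The classifier (scikit-learn's LogisticRegression), the accuracy metric, and the test values/labels are stand-ins I made up for illustration, not my actual setup:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Toy training set from the table above.
    X_train = np.array([[11], [35], [48], [56], [67], [88], [44]])
    y_train = np.array([0, 0, 0, 0, 1, 1, 1])

    # Hypothetical test set (values and labels invented for the sketch).
    X_test = np.array([[20], [50], [70], [90]])
    y_test = np.array([0, 0, 1, 1])

    # Fit the "solution" on the training set.
    model = LogisticRegression().fit(X_train, y_train)

    # Record the "outcome" (here: accuracy) on the training and test sets.
    train_outcome = accuracy_score(y_train, model.predict(X_train))
    test_outcome = accuracy_score(y_test, model.predict(X_test))
    print(train_outcome, test_outcome)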
To measure overfitting, my null hypothesis is that there is no inherent relationship between label and value, so a model trained on randomly assigned labels should perform about as well on the test set as the original model. What I am trying to show is that if I shuffle the labels, the resulting solution will not achieve similar performance when applied to the test set.
I was told to use Y-scrambling for this. My understanding so far is as follows:
1. Scramble the Y-vector (the labels) of the training set N times to generate N permuted training sets.
2. Run the ML method on each permuted training set to obtain its own permuted solution.
3. Apply each permuted solution to the original test set and observe its outcome, giving one outcome per permuted training set (a rough code sketch of these steps follows below).
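Continuing the sketch above (same made-up model, metric, and test set), my reading of steps 1-3 in code would be roughly:

    N = 1000
    rng = np.random.default_rng(0)
    permuted_outcomes = []

    for _ in range(N):
        # 1. Scramble the Y-vector (labels) of the training set.
        y_perm = rng.permutation(y_train)
        # 2. Run the ML method on the permuted training set to get a permuted solution.
        perm_model = LogisticRegression().fit(X_train, y_perm)
        # 3. Apply that solution to the original test set and record its outcome.
        permuted_outcomes.append(accuracy_score(y_test, perm_model.predict(X_test)))

    # I am now left with test_outcome (from the original labels) and
    # permuted_outcomes (N values from scrambled labels).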
I now have the following information:
a. The outcome on the original test set when using the ML solution fitted on the original training set.
b. The N outcomes on the test set when using the ML solutions fitted on the N permuted training sets.
My confusion is from this point on: what exactly am I measuring? If there is a Y-scrambling plot with an X and a Y axis, what would X and Y be? I have been reading about correlation coefficients, statistical tests, and measuring differences of means, but it would help greatly if someone could explain in a bit more detail how to turn these outcomes into a measure of overfitting. Thanks so much.