First, a comment on the train/test split: the 70%/30% or 67%/33% rule is not a golden rule; in fact it is an old rule of thumb that does not work well in many applications, especially in this era of big data. The recommended practice is to use train/dev/test sets, and the ratio depends on the nature of the application and on data availability. The train set is what you run the algorithm on, the dev set is for improving learning by tuning parameters, selecting features, etc., and the test set is for the final evaluation of the model.
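As a rough illustration of that setup (a minimal sketch in Python with scikit-learn; the synthetic data and the hyperparameter grid are just examples, not from any answer here), you carve out the test set first, tune only on the dev set, and touch the test set once at the end:

# Minimal sketch of a train/dev/test workflow (synthetic data for illustration).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, random_state=0)

# First carve out the test set, then split the remainder into train and dev.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.15, random_state=0)

# Tune on the dev set, not on the test set.
best_model, best_acc = None, 0.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_dev, model.predict(X_dev))
    if acc > best_acc:
        best_model, best_acc = model, acc

# The test set is used exactly once, for the final evaluation.
print("final test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))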
Now for the question asked, first some thoughts (you might disagree):
If you are using only train/test splits, it means you are actually tuning your model on the test split: you train the model, evaluate it on the test split, and if it is not accurate you tune the parameters and evaluate on the test data again. So indirectly the test split serves as both the dev set and the test set. This is not recommended, especially when you have a large dataset and you expect the data the deployed model will face to differ from yours. Anyhow, let's move to the point:
Should train and test be from the same distribution or from different ones?
Consider an example (sorry if it is not perfect):
"You want to detect chickens in images (i.e., does an image contain a chicken or not?). Your dataset contains images with and without chickens, but the chickens in these images are of two different breeds, say A and B. The two breeds differ in color and shape."
If you choose the train data to be from a different distribution than the test data, it means, in this example, that you train your model using only breed-A chickens. When you apply this model to the test data, it will most likely fail to detect breed-B chickens. So you should train the model to detect both breeds. In other words, your train and test sets should be from the same distribution.
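One common way to keep both splits on the same distribution (my own sketch, not part of the original answer; images and labels are hypothetical arrays where the label encodes no-chicken / breed-A / breed-B) is a stratified split:

from sklearn.model_selection import train_test_split

# stratify=labels keeps the breed-A / breed-B / no-chicken proportions
# the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    images, labels,
    test_size=0.3,
    stratify=labels,
    random_state=42,
)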
If your question is about the train/test sample ratio, then my answer is: NO, it depends on the application. My suggestion is to use less for training and more for testing, e.g. 20% (train) - 80% (test) or 30% - 70%.
And if your question is about selecting train/test samples from different regions of the data, then the answer is YES.
No need. For example, for the same subject or class, the train set may have 5 samples and the test set may have 1 to n samples. But the number of classes in the train set and the test set should be the same.
For example, if you are using 10 classes in the train set for feature extraction, the same 10 classes should be used in the test set.
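A quick sanity check along these lines (my own sketch; y_train and y_test are hypothetical label arrays) is to verify that both splits cover exactly the same set of classes:

train_classes = set(y_train)
test_classes = set(y_test)
assert train_classes == test_classes, (
    f"classes missing from test: {train_classes - test_classes}; "
    f"classes missing from train: {test_classes - train_classes}"
)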
The assumption that the test data is similar to the training data is an old-school assumption, also called the closed-world paradigm. It may work for an offline setup but not for real data. Such models consider their knowledge to be complete and assume they already know everything they may face in the future; they cannot grow with experience and are not intelligent in the true sense. Instead, the open-world paradigm assumes that whatever the model is trained upon is not complete. It is therefore open to the possibility of encountering objects that are very different (out of distribution) from what the model has experienced.
With lifelong machine learning, the idea is to build models that share knowledge across a sequence of learning tasks, so that what was learnt previously may be replaced, updated, or strengthened by new learning.
Considering the example discussed, if the training data for breed-B chickens is not available initially, the model should be flexible enough to accommodate it at a later stage and generalize its understanding of what a chicken is.
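As a very rough stand-in for that idea (my own sketch, not Taimoor Khan's method; X_breed_a, y_breed_a, X_breed_b, y_breed_b are hypothetical arrays), an incrementally trainable model can be updated when the breed-B data arrives instead of being frozen at deployment:

import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])  # 0 = no chicken, 1 = chicken
model = SGDClassifier(random_state=0)

# Initial training when only breed-A images are available.
model.partial_fit(X_breed_a, y_breed_a, classes=classes)

# Later, when breed-B images arrive, update the same model rather than
# assuming the original training data was complete.
model.partial_fit(X_breed_b, y_breed_b)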
Since the model learns automatically, it will learn wrong associations as well. But that is still better than going from being very accurate to very wrong or inconclusive. The learning of wrong associations can be controlled with a follow-up mechanism that validates the viability of a learnt rule in the following learning tasks.
Taimoor Khan has explained it well and I agree. However, let me clarify my point so that it won't mislead the reader. I didn't say that the model in the above example won't learn at all or will never detect breed-B chickens just because it didn't see that data in the training stage. My point was about how, out of the available data, the train and test sets should be derived. In this case, it is not wise to train the model with breed A and test it with breed B.
Now, once the model is built and deployed to operation, don't expect data with exactly the same patterns. This is the stage that Taimoor Khan identified and explained beautifully: your model should have learned in the training stage. If it has, it will work well on unseen data; otherwise, if it has merely memorized the training data (known as overfitting), it will mostly fail on unseen data. There are tools to keep a model from simply memorizing the data, e.g. learning curves and cross validation. Learning curves are usually not preferred for large datasets due to their extensive demand on computing resources, but they are still useful on small datasets. On large datasets, cross validation is used.
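For the cross-validation part, a minimal sketch (my own illustration; X and y are hypothetical feature and label arrays) looks like this: averaging scores over k folds gives a better picture of generalization than a single train/test split.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross validation: each fold serves once as held-out data.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())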
Lastly, the wrong choice of train/test distribution can affect the performance of your model, but the impact can be reduced 1) if you have good features
Dear Mr. Asifullah Khan, it can or it cannot; it depends on the author's perception and the purpose of the analysis. This tends to be an induction model: learn the variables' behavior specifically, then apply it generally. Regards.