Generally, there are multiple ways to partition the data into training and testing sets. You can use random, stratified, or systematic sampling (e.g., Kennard-Stone, K-means, and so on).
You can randomly sample the data set into a training set (80%) and a testing set (20%). But you should be careful not to rely on a single random division too readily, because there is a risk of biasing the results.
In stratified sampling, the values of the dependent variable (pIC50 for regression or Class for classification) are split into groups and random sampling is applied within these groups. In that way, the overall distribution of the dependent variable will be comparable across the training and testing sets.
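As a rough illustration of the difference between the two schemes, here is a minimal sketch using only the Python standard library. The function names `random_split` and `stratified_split` and the toy `labels` list are made up for this example. For a continuous response such as pIC50, you would first have to bin the values into groups, which this sketch does not do.

```python
import random
from collections import defaultdict

def random_split(n_samples, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) from a plain random shuffle."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split so that each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    train_idx, test_idx = [], []
    # Apply random sampling separately within each group.
    for label in sorted(groups):
        members = groups[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced data set: 20 actives, 80 inactives.
labels = ["active"] * 20 + ["inactive"] * 80
train_idx, test_idx = stratified_split(labels)
rand_train_idx, rand_test_idx = random_split(len(labels))
```

With the stratified version the test set always holds 4 actives and 16 inactives, while the plain random split can drift away from the 20/80 ratio by chance.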
In systematic sampling, with the Kennard-Stone algorithm for example, the training set is not only representative of the test set but also completely independent. The algorithm selects training samples from the complete data set so as to cover the space of the independent variables (descriptors) as well as possible. Because the selected training samples completely surround the test samples, the prediction accuracy will be inflated.
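To make the selection rule concrete, here is a small sketch of Kennard-Stone as it is commonly described: start with the two samples that are farthest apart, then repeatedly add the sample whose distance to its nearest already-selected sample is largest. Euclidean distance, the function name `kennard_stone`, and the toy descriptor matrix are assumptions made for this example.

```python
import math

def kennard_stone(X, n_train):
    """Return (train_idx, test_idx) chosen by the Kennard-Stone rule."""
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and len(X)")
    # Start with the two samples that are farthest apart.
    best_pair, best_dist = (0, 1), -1.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])
            if d > best_dist:
                best_pair, best_dist = (i, j), d
    selected = list(best_pair)
    remaining = [i for i in range(n) if i not in best_pair]
    # min_dist[i] = distance from sample i to its nearest selected sample.
    min_dist = {i: min(math.dist(X[i], X[s]) for s in selected)
                for i in remaining}
    while len(selected) < n_train:
        # Add the sample that is farthest from everything selected so far.
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        remaining.remove(nxt)
        del min_dist[nxt]
        for i in remaining:
            min_dist[i] = min(min_dist[i], math.dist(X[i], X[nxt]))
    return selected, remaining

# Toy descriptor matrix (hypothetical): five samples on a line.
X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
train_idx, test_idx = kennard_stone(X, n_train=3)
```

In this toy case the training set takes the extremes and the test samples end up in the interior, so the model is only ever asked to interpolate. That is the reason the test accuracy tends to look optimistic.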
Depending on your field of research, the way a model is calibrated and verified can vary. For example, in hydrological, meteorological and ... time series, a continuous (contiguous) separation of the time series is necessary. However, for parametric models where the order of the samples is not important, the methods mentioned in the comments above are very good choices.
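For the time-series case, a contiguous split can be sketched as follows: the first part of the record is used for calibration and the final part for verification, with no shuffling. The function name `chronological_split` and the 80/20 ratio are assumptions for illustration only.

```python
def chronological_split(series, test_fraction=0.2):
    """Keep time order: the earliest part calibrates, the latest part verifies."""
    n_test = round(len(series) * test_fraction)
    split_at = len(series) - n_test
    return series[:split_at], series[split_at:]

# Hypothetical record of ten time steps.
series = list(range(10))
calibration, verification = chronological_split(series)
```

Every verification value comes strictly after every calibration value, so no future information leaks into the fitted model.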
Today my book "New Theory of Discriminant Analysis After R. Fisher" (Springer) was published. In it I address your problem through many examples, using many discriminant functions including SVM. If you read my book, you can understand many facts.
Prof. Linus Schrage read my book and told me that Matryoshka was misspelled as Matroska. Thank you, Linus.
If anyone reads my book, please read Matroska as Matryoshka.
I have now analyzed all SMs of six microarray data sets and obtained wonderful results that will be published this year. I have been completely successful in cancer gene analysis in statistics.
I hope a medical specialist will become my co-author to validate my statistical results medically.
I have now finished and established my research, and I find that my former answer was not a proper answer to your question. I obtained all possible models in the training samples and applied those models to the validation samples. For each discriminant function, I take as the best model the one with the minimum mean error rate in the validation samples (M2). We compare the eight best models: three optimal LDFs, three SVMs, logistic regression, and Fisher's LDF. See my Springer book "New Theory of Discriminant Analysis After R. Fisher".
There are many examinations using six different common data sets and six microarray data sets.