I want to perform a QSAR study of some organic derivatives, but I don't know the criteria for classifying the given compounds into a training and a test set. Could anyone please help me understand this?
Dear Renjith, there are numerous approaches to building training and test datasets (https://en.wikipedia.org/wiki/Sampling_%28statistics%29). I suggest reading about sampling techniques (bootstrapping, stratified random sampling, self-organizing maps, or simple random selection).
For QSAR studies I suggest reading the paper "Does rational selection of training and test sets improve the outcome of QSAR modeling?" (http://www.ncbi.nlm.nih.gov/pubmed/23030316)
Alternatively, you can follow a simple procedure in which the test-set compounds are selected manually, considering structural diversity and a wide range of activity in the dataset.
Consider a set of 30 compounds.
First, sort the given compounds by their activity values (EC50, IC50, etc.) in ascending order. Generally we use a training set : test set ratio of 4:1.
Second, after sorting, select every 5th compound starting from the 1st compound of the dataset, so that you have a total of 6 compounds in the test set (1, 6, 11, 16, 21, 26); the remaining compounds are considered the training set.
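The activity-sorting procedure above can be sketched in a few lines of Python (the function name and example data are illustrative, not from any published tool):

```python
def activity_sorted_split(compounds, activities, step=5):
    """Sort compounds by activity and take every `step`-th one as the test set."""
    order = sorted(range(len(compounds)), key=lambda i: activities[i])
    test_pos = set(order[::step])  # 1st, 6th, 11th, ... of the sorted list
    train = [compounds[i] for i in order if i not in test_pos]
    test = [compounds[i] for i in order if i in test_pos]
    return train, test

# Example with 30 hypothetical compounds and made-up activity values:
compounds = [f"cmpd_{i}" for i in range(30)]
activities = [float(i) for i in range(30)]  # e.g. pIC50 values
train, test = activity_sorted_split(compounds, activities)
print(len(train), len(test))  # 24 6
```

With 30 compounds and step 5 this yields the 4:1 split described above (24 training, 6 test), and the test set spans the full activity range by construction.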
You can refer to the following article: http://www.sciencedirect.com/science/article/pii/S1876107013001375
You should try many different techniques, as Reisel suggests, and run many iterations of each to produce many training sets. Each model produced will then give you information on the robustness of your dataset and its capability to fill missing data in the chemical region of your test set.
The guiding principle here is that the test set should be representative of the chemical space covered by the training set. It would not make sense to build a model on, e.g., pyridine derivatives and test it with aliphatic hydrocarbons.
We employ the following methods to ensure similar coverage:
- Self-organizing Kohonen mapping
- K-means clustering
- Random selection
- Every n-th selection (as described above by Mangesh)
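As one example of the clustering-based options listed above, here is a minimal K-means sketch (assuming scikit-learn and NumPy are available; the descriptor matrix is a synthetic placeholder, not real data). From each cluster the compound nearest the centroid goes into the test set, so the test set covers the same descriptor space as the training set:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))          # 30 compounds x 5 descriptors (made up)

n_test = 6                            # 4:1 train:test ratio
km = KMeans(n_clusters=n_test, n_init=10, random_state=0).fit(X)

# From each cluster, send the member closest to the centroid to the test set.
test_idx = []
for c in range(n_test):
    members = np.where(km.labels_ == c)[0]
    d = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    test_idx.append(members[np.argmin(d)])

train_idx = [i for i in range(len(X)) if i not in test_idx]
print(len(train_idx), len(test_idx))  # 24 6
```

In a real study, X would be your computed descriptor or fingerprint matrix, typically standardized before clustering.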
But we also trust our customers to come up with a suitable manual selection, thus such an option is also offered.
See http://www.simulations-plus.com/Products.aspx?pID=13&mlID=14&lmID=28
Compounds of the test set should represent both the distribution of features and the distribution of activities. Features can be chemical fingerprints or the usual descriptors. For small datasets, hierarchical clustering is useful for selecting the cluster centroids as the test set: these are representative, and the remaining training set accounts for the diversity. Recommended ratio: 2/3 training set, 1/3 test set. Basically, any deliberate approach is better than a completely random selection. N-fold cross-validation can only be recommended for large datasets (>100 compounds or so).
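A hedged sketch of this centroid-based selection, assuming SciPy and NumPy (the descriptor matrix is synthetic; in practice each cluster's "centroid compound" is the member closest to the cluster mean, since the mean itself is not a real compound):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))                # 30 compounds x 4 descriptors

n_test = 10                                 # ~1/3 of the data as test set
labels = fcluster(linkage(X, method="ward"), t=n_test, criterion="maxclust")

# For each cluster, pick the member closest to the cluster mean as the
# test-set representative of that region of descriptor space.
test_idx = []
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    center = X[members].mean(axis=0)
    test_idx.append(members[np.argmin(np.linalg.norm(X[members] - center, axis=1))])

train_idx = [i for i in range(len(X)) if i not in test_idx]
print(len(train_idx), len(test_idx))
```

Ward linkage on standardized descriptors is a common default; other linkage methods (complete, average) are equally valid choices here.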
There are several ways to perform dataset division, as already suggested in the comments above. I would like to recommend a freely available tool, "Dataset Division", available at the attached link. The tool provides three methods for dataset division:
1. Kennard Stone Algorithm - For more info: http://flo.nigsch.com/?p=6
2. Euclidean distance based
3. Activity sorting method - also suggested by Mangesh Damre (commented above).
Along with this, you may also use a clustering tool to divide the data; a modified k-medoids clustering tool is also available on the same website. If you still have any questions, feel free to ask.
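For reference, the Kennard-Stone algorithm mentioned above can be implemented in plain NumPy. This is a self-contained sketch, not the code of the linked tool: it greedily picks points that are maximally spread out (starting from the two most distant compounds), which is why it is usually applied to select the training set, with the leftovers forming the test set. The descriptor matrix here is synthetic.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select maximally spread rows of X (Kennard-Stone)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Start with the two most distant compounds.
    selected = list(np.unravel_index(np.argmax(dist), dist.shape))
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_select:
        # Pick the compound whose nearest already-selected neighbour
        # is farthest away (the max-min criterion).
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(d_min))))
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))              # 30 compounds x 3 descriptors
train_idx = kennard_stone(X, 24)          # 24 training compounds (4:1 ratio)
test_idx = [i for i in range(30) if i not in train_idx]
print(len(train_idx), len(test_idx))  # 24 6
```

Note that Kennard-Stone is deterministic for a given descriptor matrix, which makes the split reproducible but also sensitive to descriptor scaling.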