To compare the biological activities of a set of compounds spanning a wide range of chemical structures (i.e. different descriptor values), the dataset was divided into representative training and test sets using a dissimilarity-based compound selection method, the sphere exclusion algorithm (Snarey et al., 1997). In this algorithm, each compound is represented by one point in the multidimensional descriptor space (of dimension K), and the total volume V occupied by these points is defined as described by Golbraikh (2000). Exclusion starts by constructing a sphere centered on the representative point nearest to the center of the dataset, with radius $R = c\,(V/N)^{1/K}$, where c is the dissimilarity value and N is the number of compounds in the dataset. The test set comprises all compounds (representative points) that fall within this sphere, apart from its center. These compounds are then excluded from the dataset, and the process is repeated with a new sphere until all points are exhausted (Snarey et al., 1997; Golbraikh and Tropsha, 2003). Different dissimilarity values were tried, and to determine which one gave the most representative test set, statistical parameters (average, maximum, minimum and standard deviation) were calculated for both sets.
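The procedure described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation from the cited papers: the function name `sphere_exclusion_split` is hypothetical, descriptors are min-max scaled to [0, 1] so that V can be taken as 1, the center of each sphere is assigned to the training set, and each new sphere is centered on the remaining point closest to the center of the dataset (the text above does not say how later centers are chosen).

```python
import math

def sphere_exclusion_split(X, c):
    """Return (train, test) index lists for the rows of X.

    Assumptions (not taken from the cited papers): descriptors are
    min-max scaled so that V = 1, sphere centres go to the training set,
    and each new centre is the remaining point nearest the data centre.
    """
    n, k = len(X), len(X[0])
    lo = [min(row[j] for row in X) for j in range(k)]
    hi = [max(row[j] for row in X) for j in range(k)]
    # Scale every descriptor to [0, 1]; constant descriptors become 0.
    Xs = [[(row[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
           for j in range(k)] for row in X]
    radius = c * (1.0 / n) ** (1.0 / k)      # R = c (V/N)^(1/K) with V = 1
    centre = [sum(row[j] for row in Xs) / n for j in range(k)]
    to_centre = [math.dist(row, centre) for row in Xs]

    remaining = set(range(n))
    train, test = [], []
    while remaining:
        # Pick the next sphere centre (ties broken by index).
        seed = min(remaining, key=lambda i: (to_centre[i], i))
        train.append(seed)
        remaining.remove(seed)
        # Every remaining point inside the sphere goes to the test set.
        inside = sorted(i for i in remaining
                        if math.dist(Xs[i], Xs[seed]) <= radius)
        test.extend(inside)
        remaining.difference_update(inside)
    return train, test
```

A larger c gives larger spheres, so fewer compounds end up as sphere centers and the test set grows relative to the training set. This is why several dissimilarity values are usually tried before settling on a split.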
The chosen split satisfied three criteria. First, the inhibitory activities of the test set molecules lay within the activity range of the training set. Second, the average activity and standard deviation of the two sets were close to each other, indicating that activity is similarly distributed in the training and test sets. Third, the sum of the inhibitory activities of the training set was larger than that of the test set, suggesting that the representative points of the training set are well distributed across the entire dataset (Scior et al., 2009).
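These checks are easy to automate when comparing candidate splits. The sketch below is an illustration only: the function names and the tolerance `tol` are hypothetical, since the text gives no numeric threshold for how "close" the means and standard deviations must be.

```python
import statistics

def summarize(values):
    """Average, maximum, minimum and sample standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),   # needs at least two values
    }

def check_split(train_act, test_act, tol=0.5):
    """Evaluate the three criteria for one training/test split.

    tol is an assumed threshold for 'close'; choose it to suit the
    activity scale of your own data.
    """
    tr, te = summarize(train_act), summarize(test_act)
    return {
        # 1) test activities lie inside the training activity range
        "test_within_training_range":
            tr["min"] <= te["min"] and te["max"] <= tr["max"],
        # 2) means and standard deviations are close to each other
        "means_close": abs(tr["mean"] - te["mean"]) <= tol,
        "sds_close": abs(tr["sd"] - te["sd"]) <= tol,
        # 3) summed training activity exceeds summed test activity
        "training_sum_larger": sum(train_act) > sum(test_act),
    }
```

Running `check_split` on the activities of each candidate split (one per dissimilarity value) gives a quick way to see which split satisfies all criteria.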
An alternative is to do a cluster analysis of the descriptor values together with the activity: use the molecules that are the centroids of the clusters as the test set and all others as the training set. Your test set then consists of "typical" molecules that cover the whole range of the activity.
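A rough sketch of this clustering idea is given below. It rests on several assumptions: a plain k-means with a deterministic farthest-point start is used as the clustering method (any other method could be substituted), the molecule closest to each cluster center stands in for "the centroid", and the descriptor and activity columns are assumed to be on comparable scales already. The function names are hypothetical.

```python
import math

def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means with farthest-point initialisation
    (an assumed choice, made so that the result is reproducible)."""
    centres = [list(points[0])]
    while len(centres) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        centres.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centres[j]))
                      for p in points]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:                   # keep old centre if cluster is empty
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres

def centroid_split(descriptors, activity, k):
    """Test set = the molecule nearest each cluster centre; rest = training."""
    # Cluster on descriptors plus activity, as suggested in the text.
    points = [list(d) + [a] for d, a in zip(descriptors, activity)]
    labels, centres = kmeans(points, k)
    test = []
    for j in range(k):
        members = [i for i, l in enumerate(labels) if l == j]
        if members:
            test.append(min(members,
                            key=lambda i: math.dist(points[i], centres[j])))
    train = [i for i in range(len(points)) if i not in test]
    return train, test
```

The number of clusters k directly sets the size of the test set, so it can be chosen to give the desired training/test ratio.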