Updated question: Suppose I want to build a model that identifies all the metalloproteins in a given random set of proteins by performing binary classification. To train the classifier, I need a dataset consisting of metalloproteins (positives) and non-metalloproteins (negatives). My non-metalloprotein sample space consists of every protein that is not a metalloprotein, which is indeed quite large and, more importantly, diverse. Considering this, how should I go about generating a negative dataset that is representative of that diversity?
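To make "representative of the diversity" concrete, here is a minimal sketch of one possible reading: draw negatives evenly across groups instead of uniformly at random, so that large groups do not dominate. This is only an illustration, not an established protocol. The function name `sample_negatives`, the group labels (for example a family or fold annotation), and the protein IDs are all hypothetical, and the sketch assumes every candidate has already been confirmed as a non-metalloprotein.

```python
import random
from collections import defaultdict


def sample_negatives(candidates, n_total, seed=0):
    """Pick up to n_total negatives, spreading the picks evenly over groups.

    candidates: list of (protein_id, group_label) pairs. The group label is
    a hypothetical annotation (e.g. family or fold); all candidates are
    assumed to be non-metalloproteins already.
    """
    rng = random.Random(seed)

    # Bucket the candidate proteins by their group label.
    by_group = defaultdict(list)
    for protein_id, group in candidates:
        by_group[group].append(protein_id)

    # Shuffle inside each group so the pick within a group is random.
    pools = []
    for group in sorted(by_group):
        members = list(by_group[group])
        rng.shuffle(members)
        pools.append(members)

    # Round-robin over the groups: take one protein per group per pass
    # until enough are picked or every group is exhausted.
    picked = []
    while len(picked) < n_total and any(pools):
        for members in pools:
            if members and len(picked) < n_total:
                picked.append(members.pop())
    return picked


# Toy example with made-up IDs and group labels.
candidates = [
    ("P1", "kinase"), ("P2", "kinase"), ("P3", "kinase"),
    ("P4", "transporter"), ("P5", "structural"),
]
negatives = sample_negatives(candidates, n_total=3)
```

With three groups and `n_total=3`, the toy call picks one protein from each group, whereas uniform random sampling would tend to over-represent the largest group. Whether family, fold, or some other grouping is the right notion of diversity here is exactly the open part of the question.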