Generating true negative dataset for a binary classification - can anyone help?

This question needs to be worded a bit better; it's unclear what you're trying to ask or achieve. Given a set of proteins S that you are using a binary classifier to split into P and N (positive and negative), you need to know the /real/ (or "gold standard") sets of P and N in order to perform any sort of ROC analysis.

If (for example) you're using a binary classifier to predict from primary sequence whether a protein is a metalloprotein or not, your classifier would yield a set of predicted metalloproteins and a set of predicted non-metalloproteins.

The reference positive set would be the /actual/ known metalloproteins, and the negative set is the actual known non-metalloproteins.

You can then intersect the predicted sets with the reference sets at a range of prediction stringencies (score thresholds) to perform the ROC analysis.
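As a rough illustration of those intersections, here is a minimal sketch in Python. It assumes the classifier gives each protein a numeric score (higher means "more likely positive") and that every scored protein outside the reference positive set is a known negative; the function name, IDs and scores are made up for the example.

```python
def roc_points(scores, reference_positives):
    """Return (threshold, FPR, TPR) for each distinct score threshold.

    scores: dict mapping protein ID -> classifier score (higher = more positive).
    reference_positives: set of IDs in the gold-standard positive set.
    Assumption: every other scored ID is a gold-standard negative.
    """
    ref_pos = set(scores) & set(reference_positives)
    ref_neg = set(scores) - ref_pos
    if not ref_pos or not ref_neg:
        raise ValueError("need at least one reference positive and one negative")
    points = []
    # Walk from the most stringent threshold to the least stringent one.
    for threshold in sorted(set(scores.values()), reverse=True):
        predicted_pos = {pid for pid, s in scores.items() if s >= threshold}
        tp = len(predicted_pos & ref_pos)  # predicted AND really positive
        fp = len(predicted_pos & ref_neg)  # predicted but really negative
        points.append((threshold, fp / len(ref_neg), tp / len(ref_pos)))
    return points


# Hypothetical toy data: four proteins, two of them known metalloproteins.
scores = {"p1": 0.9, "p2": 0.8, "p3": 0.4, "p4": 0.3}
known_metalloproteins = {"p1", "p3"}
points = roc_points(scores, known_metalloproteins)
for threshold, fpr, tpr in points:
    print(threshold, fpr, tpr)
```

Plotting TPR against FPR for these points gives the ROC curve.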

Fabrice Clerot

Indeed a hard question, if I understand it correctly. It is similar to the problem of "face detection" in image processing: if you want to rely on a classification technique, you have to select a "representative" sample of all the images with no face for the negative examples. In your case, that would be all the proteins outside the specific class you are interested in, which I assume is a very, very large set (sorry, I don't know much about proteins). And, of course, it seems impossible to gather such a "representative" sample because of the size of the negative class.

Is this your question?

In that case, there are at least two possible approaches (with classification techniques):

- one-class classification (train on the positive examples only)

- sequential refinement of your classifier: if you have an efficient way to detect false positives and have access to as many test instances as you can process, you can refine your classifier by adding the false positives you discover at each iteration to your training set and retraining
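The second approach can be sketched roughly as follows. This is only an illustration under strong assumptions: the classifier is a toy nearest-centroid model on numeric feature vectors, and `is_true_positive` is a hypothetical stand-in for whatever "efficient way to detect false positives" you have. All names and numbers are invented for the example.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(positives, negatives):
    """Toy classifier (assumption): one centroid per class."""
    return centroid(positives), centroid(negatives)


def predict_positive(model, x):
    pos_c, neg_c = model
    return sq_dist(x, pos_c) < sq_dist(x, neg_c)


def refine(positives, negatives, pool, is_true_positive, max_rounds=10):
    """Retrain repeatedly, adding newly found false positives as negatives."""
    negatives = list(negatives)
    pool = list(pool)
    for _ in range(max_rounds):
        model = train(positives, negatives)
        false_pos = [x for x in pool
                     if predict_positive(model, x) and not is_true_positive(x)]
        if not false_pos:
            break  # nothing left to learn from this pool
        negatives.extend(false_pos)
        pool = [x for x in pool if x not in false_pos]
    return train(positives, negatives), negatives


# Hypothetical 1-D toy data; the oracle says "positive" when the feature >= 9.
positives = [[9.0], [10.0], [11.0]]
initial_negatives = [[4.0], [5.0]]
pool = [[7.5], [8.0], [2.0], [9.5]]
model, final_negatives = refine(positives, initial_negatives, pool,
                                lambda x: x[0] >= 9.0)
print(len(final_negatives), predict_positive(model, [8.0]))
```

In this toy run the two hard negatives (7.5 and 8.0) are picked up in the first round, and after retraining the model no longer calls them positive.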

Vinay Nair

@Fabrice Clerot - Thanks for the answer. You have provided me a neat workaround to the problem that I was facing.

Christian Cole

You need to be very careful regarding overtraining if you do too much iteration and refinement of your classifier.

In developing a new protein classifier you must have three datasets: a set of true positives, a set of true negatives (not always possible) and a 'blind' set that is representative of the background (see our papers on Jpred as an example: https://www.researchgate.net/publication/5388763_The_Jpred_3_secondary_structure_prediction_server?ev=prf_pub).

In developing and refining the classifier, use the TP and TN sets only. Once you're happy with the performance, then - and only then - test against the blind set. It is very easy to create a very high performance classifier only to find it is not generally applicable. The strategy I describe avoids this situation.

For proteins, to ensure that your TP and TN sets are representative, make sure they are non-redundant - both within each set and between the sets. At the simplest level use a sequence identity cut-off, but this is often not sufficient (not sure about metalloproteins). Check the literature from the 1990s on neural networks for protein structure prediction.
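A greedy redundancy filter along those lines might look like the sketch below. Big caveat: `difflib.SequenceMatcher.ratio()` is used here only as a crude stand-in for a proper alignment-based sequence identity, and the 0.8 cut-off and the sequences are arbitrary illustrations. For real data you would normally use a dedicated clustering or alignment tool (CD-HIT, for example).

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Assumption: a crude proxy for sequence identity, NOT a real alignment.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def non_redundant(seqs, cutoff, against=()):
    """Greedily keep sequences that are below `cutoff` similarity to every
    sequence already kept and to every sequence in `against`."""
    kept = []
    for s in seqs:
        if all(similarity(s, other) < cutoff for other in list(against) + kept):
            kept.append(s)
    return kept


# Hypothetical toy sequences.
raw_tp = ["MKHHCCGHE", "MKHHCCGHD", "AAAAPPPPLL"]
raw_tn = ["MKHHCCGHEA", "GSGSGSTTTT"]

tp = non_redundant(raw_tp, cutoff=0.8)              # within the TP set
tn = non_redundant(raw_tn, cutoff=0.8, against=tp)  # within TN and versus TP
print(tp, tn)
```

Here the second TP sequence is dropped as a near-duplicate of the first, and the first TN candidate is dropped because it is too close to a TP sequence.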

Finally, depending on which classifier type you're using (SVM, ANN, GA, decision trees, etc.), you may need to ensure you have 'balanced' datasets, i.e. TP and TN sets of equal size.
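One simple way to get equal-sized sets is to randomly downsample the larger one. This is just a sketch of that one option (others, such as class weighting, exist); the function name and the toy IDs are made up.

```python
import random


def balance(tp, tn, seed=0):
    """Downsample both sets to the size of the smaller one."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n = min(len(tp), len(tn))
    return rng.sample(list(tp), n), rng.sample(list(tn), n)


# Hypothetical toy sets: 3 positives, 10 negatives.
tp_ids = ["tp1", "tp2", "tp3"]
tn_ids = ["tn%d" % i for i in range(10)]
balanced_tp, balanced_tn = balance(tp_ids, tn_ids)
print(len(balanced_tp), len(balanced_tn))
```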

Article The JPred 3 secondary structure prediction server

Rajeev Gangal

Your question is perfectly understandable. One method to obtain a negative set from the diverse set of non-metalloproteins would be the following:

1. Initially, sample non-metalloproteins such that their length and amino acid (AA) composition resemble the distributions seen in metalloproteins.

2. Find the best descriptors by running feature selection with the above negative set.

3. If you want to go further, sample non-metalloproteins whose distributions of the features selected in step 2 are similar to those in metalloproteins.
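Step 1 could be sketched like this for the length part only (AA composition would need an extra matching criterion). The idea is to bin the positives by length and draw the same number of negatives from each bin. The bin width of 50 residues, the function name and the toy sequences are all assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def length_matched_sample(positives, candidates, bin_width=50, seed=0):
    """Draw negatives whose length histogram mirrors that of the positives."""
    rng = random.Random(seed)
    # How many positives fall into each length bin.
    wanted = Counter(len(s) // bin_width for s in positives)
    # Group the candidate negatives by the same bins.
    by_bin = defaultdict(list)
    for s in candidates:
        by_bin[len(s) // bin_width].append(s)
    sample = []
    for b, count in sorted(wanted.items()):
        available = by_bin.get(b, [])
        # Take as many as requested, or all of them if the bin is short.
        sample.extend(rng.sample(available, min(count, len(available))))
    return sample


# Hypothetical toy sequences: positives of length 10, 12 and 60.
metallo = ["A" * 10, "C" * 12, "H" * 60]
non_metallo = ["G" * 5, "S" * 20, "T" * 30, "L" * 70, "V" * 200]
negatives = length_matched_sample(metallo, non_metallo)
print(sorted(len(s) for s in negatives))
```

With these inputs you get two short negatives and the one of length 70; the 200-residue candidate is never chosen because no positive falls in its length bin.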
