Suppose there are two groups of chemical compounds with different roles. Can a machine learning method achieve 99% accuracy in classifying compounds into these two groups?
Theoretically, sure. Practically, maybe, and most likely never at 99% accuracy (though 90%, or some other reasonably high level of accuracy, may be attainable). It will depend on the classifiers you are able to develop to distinguish the two groups of chemicals. If you can develop a set of descriptors that is discriminating enough, and reliably so, then it may be possible. It also depends on just how clearly distinct the chemicals in the two groups are. If there is any overlap in the function or effect of any of the chemicals, that will confound the task of distinguishing them reliably.
But no one can say for any particular case until that case has been tried.
How were you thinking of discriminating these two groups?
I have some experience with developing models based on gene expression data to distinguish different categories of chemicals. One of our models (naive Bayes) seems to work well with about 90% accuracy, while another (a partial least squares model) has shown 100% accuracy but may be very easily perturbed by even slightly different data. We have not yet widely validated either model with novel test data sets.
It can be a lot of work (and expense) developing the set of descriptors and testing and cross-validating multiple models to derive a final model or models that work as you wish. Even then, you need to independently validate the final model or models with novel test data.
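As a rough illustration of the workflow described above, here is a minimal sketch: a naive Bayes classifier trained on binary descriptors and evaluated on a held-out split. The data are entirely synthetic (one weakly informative bit plus noise); this is not our actual gene-expression model, just the shape of the train-then-validate step.

```python
import math
import random

random.seed(0)
N_BITS = 8

# Synthetic stand-in for real chemical descriptors: bit 0 is weakly
# informative about the class, the remaining bits are pure noise.
def make_compound(label):
    bits = [random.randint(0, 1) for _ in range(N_BITS)]
    bits[0] = 1 if random.random() < (0.8 if label == 1 else 0.2) else 0
    return bits, label

data = [make_compound(lbl) for lbl in [0, 1] * 100]
random.shuffle(data)
train, test = data[:150], data[150:]

def fit(rows):
    # Bernoulli naive Bayes: per-class bit frequencies with Laplace smoothing.
    counts = {0: [1] * N_BITS, 1: [1] * N_BITS}
    totals = {0: 2, 1: 2}
    for bits, lbl in rows:
        totals[lbl] += 1
        for i, b in enumerate(bits):
            counts[lbl][i] += b
    return {c: [counts[c][i] / totals[c] for i in range(N_BITS)] for c in (0, 1)}

def predict(model, bits):
    # Pick the class with the higher log-likelihood.
    def score(c):
        return sum(math.log(p if b else 1.0 - p)
                   for p, b in zip(model[c], bits))
    return max((0, 1), key=score)

model = fit(train)
acc = sum(predict(model, bits) == lbl for bits, lbl in test) / len(test)
print("held-out accuracy: %.2f" % acc)
```

The point is that the accuracy you report should come from the held-out `test` split, never from the rows the model was fitted on.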
Results will depend on the specific data sets, descriptors, and machine learning methods. Have a look at DOI: 10.1021/ci049702o, where we've tried something similar.
Perhaps you should make clear what your question is. "Differentiate" is a concept from mathematics. I suppose you want to discriminate or distinguish? Have you divided chemical compounds into classes and now want to let a machine decide to which class a given molecule belongs? Then the answer depends on the kind of input you offer to the machine. If you want to classify chemicals into ketones and other molecules, giving the CAS registry number as input would enable a machine to do the classification; giving merely the taste ("tastes sweet") would not.
Or are you looking for structure–function relations and want to let a machine identify common chemical features (e.g., find out if all sweet-tasting chemicals have something in common)?
Let's take an example: suppose there are two groups, one group of molecules that interact with carbohydrates and another group that do not. Now, I have PubChem fingerprints (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt) of both groups. The question is: using PubChem fingerprints, is it possible to classify these two groups with an accuracy of 99% using any machine learning algorithm?
Thank you for the clarification! But what does "interacts with carbohydrates" really mean? Any molecule can somehow interact with any other; even two helium atoms cannot occupy the same space and therefore repel each other when they come too close. You have to define the kind of interaction you want (e.g., hydrogen bonding), and then you have to find out whether the PubChem fingerprint lets you recognize -OH or -NH2 groups. If it does, you can answer your question without a machine; if not, no machine can help you.
Thanks for the clarification. The interaction example was not well suited. I mean that the group 1 carbohydrates are known to cause reaction X in our body and the group 2 carbohydrates are known to cause reaction Y. I checked some fingerprints and found that one fingerprint bit, O-O, is dominant in group 1 but occurs much less often in group 2. A set of 12 such descriptors can classify these two groups with an accuracy of 99%. I was wondering whether what I did was right, or whether I did something wrong in making the datasets, since 99% seems somewhat implausible.
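The descriptor screen described here, ranking fingerprint bits by how differently they occur in the two groups, can be sketched as follows. The fingerprints below are made up for illustration; real PubChem fingerprints are 881-bit vectors decoded as described in the specification linked above.

```python
# Made-up 4-bit "fingerprints" standing in for PubChem bit vectors.
group1 = [  # e.g. compounds causing reaction X (hypothetical)
    [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [1, 0, 0, 0],
]
group2 = [  # e.g. compounds causing reaction Y (hypothetical)
    [0, 1, 0, 1], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1],
]

def bit_freq(group):
    # Fraction of compounds in the group that set each bit.
    n = len(group)
    return [sum(fp[i] for fp in group) / n for i in range(len(group[0]))]

f1, f2 = bit_freq(group1), bit_freq(group2)
# Rank bits by absolute frequency difference between the groups;
# the top-ranked bits are the candidate discriminating descriptors.
ranked = sorted(range(len(f1)), key=lambda i: abs(f1[i] - f2[i]), reverse=True)
print("most discriminating bit:", ranked[0])
```

One caution: if you pick the 12 most discriminating bits on the same compounds you then measure accuracy on, the 99% figure is partly a product of that selection and must be re-checked on compounds held out from the whole procedure.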
OK, this boils down to the question of whether a "fingerprint" or other molecular descriptor captures enough of chemical reality to make a search for correlations with biological activity possible. Does the fingerprint system you are using distinguish stereoisomers? If not, forget it. If yes, there may be a chance that a meaningful correlation shows up, and then, only then, you may consider computerized analysis.
Thanks for your suggestion; I'll look into it. But my question is: if my fingerprint system can distinguish stereoisomers, is it possible to achieve an accuracy of 99% using machine learning?
Again, a general answer to such a question cannot be given. It all depends on your descriptors, training methods, validation protocol, and the size and diversity of the data sets. 99% accuracy for predictions within the training set is always possible: we generally call this over-fitting! Accuracy of predictions outside the training set depends crucially on the similarity between the training-set and test molecules (in addition to whether the descriptors are capable of scaffold hopping, whether they can distinguish stereoisomers, for which you will need 3D descriptors, and whether the modeling method has capacity control, as SVMs do).
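The over-fitting point can be made concrete with a toy experiment (synthetic data, and a deliberately memorizing 1-nearest-neighbour "model"): within-training-set accuracy is essentially perfect, while held-out accuracy is not.

```python
import random

random.seed(1)

# Hypothetical data: 20 noisy binary descriptors; only bit 0 is weakly
# (70%) related to the label, so near-perfect training accuracy can only
# come from memorization.
def sample(label):
    fp = [random.randint(0, 1) for _ in range(20)]
    fp[0] = label if random.random() < 0.7 else 1 - label
    return fp, label

data = [sample(l) for l in ([0] * 40 + [1] * 40)]
random.shuffle(data)
train, test = data[:60], data[60:]

def nn_predict(memory, fp):
    # 1-nearest neighbour by Hamming distance: a pure memorizer.
    return min(memory, key=lambda row: sum(a != b for a, b in zip(row[0], fp)))[1]

train_acc = sum(nn_predict(train, fp) == lbl for fp, lbl in train) / len(train)
test_acc = sum(nn_predict(train, fp) == lbl for fp, lbl in test) / len(test)
print("train accuracy:", train_acc, " test accuracy:", test_acc)
```

Each training compound is its own nearest neighbour at distance zero, so training accuracy is trivially (near) 100%; only the test figure says anything about generalization.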
The discussion is interesting, but some points are missing. For example, Harinder didn't clarify whether he wants supervised or unsupervised classification using machine learning. Also, nobody mentioned the difference between correlation and prediction.
If we consider supervised learning and correlation, then it is possible to get 99% or even 100% accuracy using machine learning techniques such as support vector machines (SVMs). However, most of the time these correlative models are over-fitted and will not produce accurate predictions for new compounds that weren't used in the training set.
Supervised learning for prediction is as discussed above by others. Theoretically (and we wish for it!) it is possible to build a model that predicts with 99% or 100% accuracy; however, you would need a large training dataset, comprehensive x variables, a theoretical background for the model, and a lot of time for model optimization and training. In practice, this cannot be done! So researchers hope for prediction accuracy above roughly 75%, and if it is above 90% the model is considered accurate (for prediction). In these cases, the training accuracy is around 90-95%.
Internal and external validation methods can be used to get an idea of the prediction capability of the model and the occurrence of over-fitting.
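Internal validation by k-fold cross-validation can be sketched like this. The dataset is hypothetical random data, and the "model" is a deliberately trivial majority-class predictor, just to show the fold mechanics: each fold is held out once while the rest trains the model.

```python
import random

random.seed(2)

# Hypothetical dataset: (descriptor_vector, label) pairs.
data = [([random.random() for _ in range(3)], random.randint(0, 1))
        for _ in range(50)]

def kfold_indices(n, k):
    # Shuffle indices 0..n-1, then deal them into k folds.
    idx = list(range(n))
    random.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_label(rows):
    # Trivial stand-in model: predict the most common training label.
    ones = sum(lbl for _, lbl in rows)
    return 1 if ones * 2 >= len(rows) else 0

accs = []
for fold in kfold_indices(len(data), 5):
    fold_set = set(fold)
    test = [data[i] for i in fold]
    train = [data[i] for i in range(len(data)) if i not in fold_set]
    pred = majority_label(train)
    accs.append(sum(lbl == pred for _, lbl in test) / len(test))

print("5-fold internal accuracy: %.2f" % (sum(accs) / len(accs)))
```

External validation then repeats the accuracy measurement on a dataset that never entered this loop at all.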
I think Harinder used supervised learning (both x and y values are used for modeling) and the 99% accuracy is for the training set, not the test set (prediction). I have no idea about the number of compounds or x variables (fingerprint data). The number 12 appears in the discussion, but I'm not sure whether it means 12 compounds per group or 12 parameters per compound! If the number of compounds (cases) is not larger than the number of x variables, over-fitting will happen. For a good correlation, you should have at least 3 cases (compounds) for each x variable you use (5 cases is better!). And this minimum requirement is for linear modeling; for non-linear modeling, there must be enough compounds to cover the whole space of the studied variables.
About unsupervised learning: this kind of modeling might produce your desired results, or it might not! It depends largely on the independent parameters (x) you use as input (e.g. your fingerprints from PubChem) and on the model optimization parameters. In practice, it is not possible to check every combination of x and model parameters and their effect on the final classification. So usually certain x variables (with known effects) are used, and the model parameters are optimized based on the final results and the dependent values (y) known to us, but not to the model. (The y values are never introduced to the model.) Results of this kind of modeling (both for correlation and prediction) can be acceptable (accuracy > ~75%), but it is really hard to reach ~99%. However, sometimes this kind of modeling results in new insights into the available dataset (e.g. a new classification, new classes, ...).
Dear Harinder, if you can go for density functional theory (DFT) calculations on the same compound containing two different functional groups, you can easily distinguish them on the basis of the energies of the HOMO and LUMO. Moreover, these HOMO and LUMO energies determine the reactivity of the two different functional groups in the same molecule, making it easy to define the contribution of each functional group towards biological activity.
Let me throw in another question that you need to ask yourself before going the machine learning way: what is your tolerated error rate? If you are designing a very difficult experiment, then 90% precision with 90% sensitivity may be unacceptable. But if you have an easy assay for your biological question, you could live with 10% false positives (or even 50%), given that you can create a small, highly enriched compound library (e.g. purchase it). In other words, you need to define your expectations more precisely before you can really pose the question of how realistic it is.
AND YET ANOTHER WARNING. Many studies show how powerful machine learning is in biomedical research. Almost always, these are over-optimistic. Typically, the same data set is split into training and testing sets; these sets are more similar to each other than will be the case for a classifier once it gets out into the real world. In other words, using random train-test divisions, you create datasets that do not incorporate all the variation you encounter in real-world biomedical problems. There was a paper by Birney several years ago showing that practically all expression-based classifier publications over-estimate their accuracy. So even if we (data miners) tell you "we can probably get to 90% accuracy", you need to assume somewhat lower accuracy, because that figure would be the accuracy in a train-test or cross-validation scenario.
BTW, knowing this, I recently published a paper in which we use a completely separate set to test our accuracy. But still, the other set was generated by the same group (NHANES), and thus we are probably still over-optimistic.