We recently published a paper on machine learning of molecular atomization energies calculated with quantum chemistry. We found hardly any literature in this field. Does anybody know why? Or, if we're wrong, can you point us to good papers on this?
Part of the answer, I think, is the goal of the researchers. If I wanted to map electronic structure to atomization energy, then I'd solve the Schrödinger equation. If I wanted a predictive model mapping molecular structure to atomization energy, QSPR is appropriate. DFT is often used as a tradeoff between effort (deriving an ML model) and computational expense. I don't need to think as much to drop structures into an electronic structure program and hit Go.
Another question is reusability of the model, for which I see two factors. The first is just how many atomization energies you'd like to predict. Ten times your training set, definitely worth deriving the ML model. One-tenth your training set, and you would be better off just calculating the DFT energies directly. The second factor is the applicability of your model. If I were interested in the atomization energy of (CH_3S)_4Fe_2S_2^{2-}, would your training set be appropriate? That ties into the conceptual basis being used to represent the property of interest. Electrons and nuclear charges are an appropriate basis set for most phenomena of chemical interest. The components of a machine learning model, while almost certainly less expensive to calculate, have a narrower scope of applicability.
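To make the first factor concrete, here is a rough break-even sketch. It assumes the only costs are the DFT calculations needed to label the training set, a one-off fitting cost, and a small per-molecule prediction cost. The function name and all cost figures are hypothetical, in arbitrary units where one DFT calculation costs 1.

```python
def break_even_predictions(n_train, cost_dft, cost_fit, cost_ml_predict):
    """Number of predictions beyond which the ML route is cheaper than DFT.

    All costs are in the same arbitrary unit (hypothetical values).
    """
    # Upfront investment: DFT labels for the training set plus model fitting.
    upfront = n_train * cost_dft + cost_fit
    # What each ML prediction saves compared with a direct DFT calculation.
    saving_per_prediction = cost_dft - cost_ml_predict
    if saving_per_prediction <= 0:
        return float("inf")  # the model never pays for itself
    return upfront / saving_per_prediction


n_train = 1000  # hypothetical training-set size
n_star = break_even_predictions(
    n_train, cost_dft=1.0, cost_fit=10.0, cost_ml_predict=0.001
)

print(f"break-even at about {n_star:.0f} predictions")
print("10x training set worth it:", 10 * n_train > n_star)
print("0.1x training set worth it:", n_train / 10 > n_star)
```

When prediction is nearly free compared with DFT, the break-even point sits close to the training-set size itself, which is why ten times the training set is a clear win and one-tenth is not.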
Perhaps a third reason: if a company (where much of the drug design-related QSAR/QSPR is done) were to develop a fantastically successful model for a property of great interest to them, they would probably not publish it, retaining it instead as a competitive advantage. If the property were not of sufficient interest, they wouldn't develop the model in the first place. Academia and NatLabs might have a bias against multivariate statistical models in fundamental research, seeing them as not providing knowledge (or, if preferred, as being too pragmatic and targeted). Another bias might be the misperception that such models are suited only for drug design. However, I think QSPR/QSAR is a great approach to get at subtle but fundamental correlations that fundamental research programs may not have taken sufficient advantage of. I see numerous talks on "catalyst design" that take no advantage of quantitative analysis across compound sets. It seems like the catalyst design field is ripe for ML models to map quantum chemical descriptor calculations to catalytic activity...
I think that there is a lack of understanding in this field. Part of the missing information is that some of the results fall outside the realm of what is thought to be possible. This means that papers in the field are not published, because they seem to suggest that the old theories are not correct.
The only hope is that some of the papers make it to the light of day and we get to work on areas that were not thought to be possible.
My work points to some of these types of quantum understanding, but I have not been able to publish any results, as I think people will label me as crazy. Not that they would not think that anyway, but you get the idea. Anything outside the theories we accept now is going to be shunned until it is apparent that the old theory is invalid.
If I understood the question, the idea of using QSPR for Quantum Mechanics has been with us for a long time. The use of 'descriptors' correlated with linear (or non-linear) equations to produce, for example, the first electronic energy of a molecular system is embedded into Density Functional Theory. The Functional is the QSPR of Quantum Mechanics. Although some effort has been put into giving DFT a surer theoretical footing, the truth is that the most widely used successful functionals remain an empirical correlation against high-level Coupled Cluster CI calculations.
@Orland: QSPR stands for quantitative structure property relationships. The preprint of our paper is here: http://arxiv.org/abs/1109.2618
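For a concrete picture of what "structure in, property out" can look like: one representation used in the ML-for-atomization-energies literature is the Coulomb matrix, built only from nuclear charges and atomic positions. The sketch below is illustrative and is not the code behind the preprint. The function name is hypothetical, and atomic units (bohr) are assumed for the coordinates.

```python
import math


def coulomb_matrix(charges, coords):
    """Build a Coulomb-matrix descriptor from nuclear charges and positions.

    Diagonal:      0.5 * Z_i ** 2.4
    Off-diagonal:  Z_i * Z_j / |R_i - R_j|
    Coordinates are assumed to be in bohr (an assumption of this sketch).
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Toy input: H2 with an assumed bond length of 1.4 bohr.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
for row in h2:
    print(row)
```

A regression model (for example, kernel ridge regression) can then be trained on such descriptors against reference energies.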
@George: Why don't you post your work on arxiv.org?
@Mario: Indeed, I agree with your reasoning. However, the conventional way of training has not been very sound from the statistics point of view. This is a (very) recent example of properly applying machine learning to learn (orbital free) DFT: http://arxiv.org/abs/1112.5441
Note that the reference can also be experimental data or quantum Monte Carlo, which scales more favorably than the post-Hartree-Fock methods ...
Actually, QSPR-like ideas are frequently used in quantum chemistry, though typically there is no mention of the phrase "QSPR" or "QSAR". Such ideas are implicit in Pople's ideas of isodesmicity & homodesmicity. Other examples involve extrapolations of a quantum-mechanical property as a function of some sort of extent of approximation -- see, e.g., L. L. Griffin, Jian Wu, D. J. Klein, T. G. Schmalz, & L. Bytautas, “Scaling Behavior of Ground-State Energy Cluster Expansion for Linear Polyenes”, Intl. J. Quantum Chem. 102 (2005) 387-397.
Hi, there are a number of papers in this regard. Here are a few to start you off:
Amari, S.; Aizawa, M.; Zhang, J.; Fukuzawa, K.; Mochizuki, Y.; Iwasawa, Y.; Nakata, K.; Chuman, H.; Nakano, T. VISCANA: visualized cluster analysis of protein-ligand interaction based on the ab initio fragment molecular orbital method for virtual ligand screening. J. Chem. Inf. Model. 2006, 46 (1), 221-230.
Yoshida, T.; Fujita, T.; Chuman, H. Novel quantitative structure-activity studies of HIV-1 protease inhibitors of the cyclic urea type using descriptors derived from molecular dynamics and molecular orbital calculations. Curr. Comput.-Aided Drug Des. 2009, 5 (1), 38-55.
Hitaoka, S.; Matoba, H.; Harada, M.; Yoshida, T.; Tsuji, D.; Hirokawa, T.; Itoh, K.; Chuman, H. Correlation analyses on binding affinity of sialic acid analogues and antiinfluenza drugs with human neuraminidase using ab Initio MO calculations on their complex structures-LERE-QSAR analysis (IV). J. Chem. Inf. Model. 2011, 51, 2706-2716.
Zhang, Q.; Yang, J.; Liang, K.; Feng, L.; Li, S.; Wan, J.; Xu, X.; Yang, G.; Liu, D.; Yang, S. Binding interaction analysis of the active site and its inhibitors for neuraminidase (N1 subtype) of human influenza virus by the integration of molecular docking, FMO calculation and 3D-QSAR CoMFA modeling. J. Chem. Inf. Model. 2008, 48 (9), 1802-1812.
Zhang, Q.; Yu, C.; Min, J.; Wang, Y.; He, J.; Yu, Z. Rational questing for potential novel inhibitors of FabK from Streptococcus pneumoniae by combining FMO calculation, CoMFA 3D-QSAR modeling and virtual screening. J. Mol. Model. 2011, 17 (6), 1483-1492.
Mazanetz, M. P.; Ichihara, O.; Law, R. J.; Whittaker, M. Prediction of cyclin-dependent kinase 2 inhibitor potency using the fragment molecular orbital method. J. Cheminform. 2011, 3 (1), 2.
The only thing I can add is that it does not matter what we think or feel. Current observations (measurements) of phenomena are not necessarily intuitive. Today we need to make those measurements and analyse them. We are so deep into the existence of matter that our feelings and beliefs do not matter. We need real data to determine reality; it is not necessarily what we expect.
Also, as a physicist I would refer this question to chemists. Chemists know much more about chemical bonding and quantum chemistry than anyone from other fields.
In my experience, there are two reasons that quantum chemistry is not generally used in the drug design process. First and foremost is time. If your modelling cannot keep pace with the speed at which experimentalists can test their hunches, then you will become irrelevant very quickly, whether or not you have the right answer. Secondly, the biological side of things tends to be interpreted in terms of pairwise interactions, i.e., identifying which residues are contributing to binding or non-binding. (This is one place where Lance Westerhoff's efforts to decompose semiempirical energies from DivCon are particularly interesting.) It often becomes the case that you are forced into approximate methods due to limited computational resources, which in turn lead to approximate results that are viewed with skepticism by all involved. This being the case, most are content to use traditional docking methods to throw out the things that simply won't work and then sift through the rest experimentally. It is far from ideal, and this approach is replete with known flaws, but it is how things get done nonetheless.
There is also a question about expertise. I have seen design efforts that will throw weeks of MD simulation at a problem that a good QM method could get right with far less CPU time, but the expertise to run the QM calculations is simply not there. Furthermore, many don't know how to interpret the results once they have them. Making this happen will require a shift in thinking in the drug discovery community and hardware that can get answers in short enough periods of time to be relevant to the process. When Pfizer spins up 10,000 cloud instances to do docking, you know that they aren't doing so to run quantum calculations.
At its most basic, this could be an excellent area for quantum mechanics since it's probability based. For instance, the venerable quantum well ... we have our input (i.e. the potential barrier and energy); now what is the likelihood of our output (i.e. tunneling or output energy)?
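As a minimal illustration of that input-to-probability mapping, the sketch below evaluates the textbook transmission probability for a particle with energy E below a one-dimensional rectangular barrier of height V0 and width a: T = 1 / (1 + V0^2 sinh^2(kappa a) / (4 E (V0 - E))), with kappa = sqrt(2m(V0 - E)) / hbar. The particle is assumed to be an electron, the units are eV and nm, and the function name and example numbers are hypothetical.

```python
import math

# hbar^2 / (2 * m_e) for an electron, in eV * nm^2 (approximate value).
HBAR2_OVER_2M = 0.0380998


def transmission(energy_ev, barrier_ev, width_nm):
    """Tunneling probability through a rectangular barrier, for 0 < E < V0."""
    if not 0.0 < energy_ev < barrier_ev:
        raise ValueError("this sketch only covers 0 < E < V0")
    # Decay constant of the wavefunction inside the barrier, in 1/nm.
    kappa = math.sqrt((barrier_ev - energy_ev) / HBAR2_OVER_2M)
    s = math.sinh(kappa * width_nm)
    denominator = 4.0 * energy_ev * (barrier_ev - energy_ev)
    return 1.0 / (1.0 + (barrier_ev ** 2) * (s ** 2) / denominator)


# Inputs: a 1 eV electron meeting a 2 eV barrier of two different widths.
thin = transmission(1.0, 2.0, 0.1)
thick = transmission(1.0, 2.0, 1.0)
print(f"0.1 nm barrier: T = {thin:.3g}")
print(f"1.0 nm barrier: T = {thick:.3g}")
```

Here the mechanism is known exactly. A QSAR/QSPR-style model would instead be fitted to pairs of such inputs and outputs when no closed-form expression is available.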
Why are few people doing this? Maybe because it's a creative, potentially fruitful area that needs some elbow grease and imagination and few have come around to it yet.
Think of the possibilities ... maybe I don't necessarily know the precise contributions of the interactions of some dipoles or spin coupling, quadrupoles, etc. Perhaps a well-designed QSAR/PR model might allow me to just characterize my inputs and my outputs, and then modify the model.
What if I wanted to predict gas-phase aggregation around 10 nanometers, but the mechanism eludes me? Some kind of AR/PR model could potentially be used to connect the characterization of the pre- and post-aggregation states.
The difference between QM and QSPR is not merely technical. In a QM calculation the property is computed; in QSPR the properties are compared across a set of similar molecules. The first approach (QM) is strictly physical, the other (QSPR) chemical, because the chemist draws his conclusions by analogy. If one wants to study the mechanism, let him use QM, but if he wants something to improve, QSPR is a better choice. So, it does not matter if somebody uses QM or any other descriptor.