I am training machine learning models to predict the binding affinity of small molecules against a protein receptor. During training set preparation I remove very small and very big molecules, as they are more likely to bind non-specifically to the receptor and abolish its function. Currently, I use z-score thresholds on molecular weight (MW): -2.5 for the very small and 2.5 for the very big molecules. However, the z-score depends on the distribution of MW in whatever data are available each time, so the same z-score corresponds to a different MW in every data set. I therefore believe that fixed lower and upper MW thresholds would be more accurate. In your experience, what should these MW thresholds be?

PS: please don't point me to Lipinski's rule of 5 (180 to 500) or some related rule. When training a machine learning model on known data you have to be more "generous"; otherwise you will discard lots of precious experimental data.
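
To make the concern concrete, here is a minimal sketch in Python with NumPy that compares the two filtering schemes. Everything in it is illustrative: the function names are hypothetical, the two synthetic MW distributions are made up, and the fixed cutoffs `MW_MIN` and `MW_MAX` are placeholders rather than recommended values.

```python
import numpy as np

# Placeholder fixed cutoffs in Da (assumptions for illustration only).
MW_MIN = 150.0
MW_MAX = 900.0


def zscore_mask(mw, z_low=-2.5, z_high=2.5):
    """Keep molecules whose MW z-score lies within [z_low, z_high]."""
    mw = np.asarray(mw, dtype=float)
    z = (mw - mw.mean()) / mw.std()
    return (z >= z_low) & (z <= z_high)


def fixed_mask(mw, mw_min, mw_max):
    """Keep molecules whose MW lies within [mw_min, mw_max]."""
    mw = np.asarray(mw, dtype=float)
    return (mw >= mw_min) & (mw <= mw_max)


def implied_mw_bounds(mw, z_low=-2.5, z_high=2.5):
    """Convert z-score thresholds into the MW cutoffs they imply for this data set."""
    mw = np.asarray(mw, dtype=float)
    return mw.mean() + z_low * mw.std(), mw.mean() + z_high * mw.std()


# Two synthetic data sets with different MW distributions (made-up numbers).
rng = np.random.default_rng(0)
set_a = rng.normal(350, 80, size=1000)
set_b = rng.normal(450, 150, size=1000)

for name, mw in (("A", set_a), ("B", set_b)):
    lo, hi = implied_mw_bounds(mw)
    kept_z = int(zscore_mask(mw).sum())
    kept_fixed = int(fixed_mask(mw, MW_MIN, MW_MAX).sum())
    print(f"set {name}: z-score cutoffs = {lo:.0f} to {hi:.0f} Da, "
          f"kept {kept_z} (z-score) vs {kept_fixed} (fixed)")
```

The same z-score thresholds translate into different MW cutoffs for each data set, whereas the fixed thresholds apply identically to both.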
