Molecular Weight thresholds for training set preparation?

More Thomas Evangelidis's questions See All

Which QSAR descriptors are related to ligand entropy?

I am trying to create a ML model which includes various energy terms accounting mainly for the enthalpic contribution to ligand-binding. However, entropic contribution is neglected, although it...

09 October 2019 8,712 5 View

Any good tool for chemical structure extraction from documents?

Could anyone recommend me a good software to extract chemical structures of inhibitors along with the experimental binding affinities from documents (publications)? I don't mind paying for a...

09 October 2019 6,211 1 View

Best protein-ligand sets for docking from DUD-E database?

I am testing a new scoring function for docking and I am looking for simple protein-ligand data sets from DUD-E (only!) in which the binding cavity is well defined and not very flexible. As such,...

07 August 2018 7,730 2 View

Minimum protein-ligand binding affinity difference to claim that a ligand is stronger binder ?

I have a ligand with binding affinity (Kd, Ki, IC50, EC50) measured by a protein-ligand binding in vitro assay. We don't know the experimental error of the measurements. In you experience, how...

05 June 2018 6,879 15 View

How to create fragments from a ligand?

Is anyone aware of any program that can create automatically fragments (chemical probes) from a large ligand that can be used later for fragment based drug design?

08 September 2016 8,713 0 View

Feedback defines the constitution of an organism?

“Here is a thought experiment. Let's place Rodolpho Llinas's jarred-brain on top of a body (Fig. 1). I bet Llinas would argue that his jarred-brain retains its own consciousness, and the android...

11 August 2024 2,483 1 View

Self-Organizing Superorganisms—as envisaged by Nenad Sestan (2018)?

The rate of glucose consumption by the neocortex is reduced by over 80% during anesthesia (Sibson et al. 1998), which disables the synapses (Richards 2002) that are inundated by glial tissue (Engl...

08 August 2024 3,118 0 View

Are there instances where molecules with larger molecular weights exhibit greater mobility than those with smaller molecular weights?

Hi, I know that low molecular weight (MW) molecules generally tend to have higher mobility, while high molecular weight molecules tend to have lower mobility. However, in my experimental...

06 August 2024 1,495 2 View

Research Methodology - Impact of Corporate Reputation on Stakeholders Behaviors?

Please can anyone support with the survey questions based on RQ measures and propose how to do it in FMCG industry and include as well the role of brand equity Thanks

06 August 2024 949 0 View

Measuring the Intelligence of a Species?

Larger brains, which typically contain more neurons, store and transfer more information (Tehovnik and Chen 2015), but the precise relationship between number of neurons and information has yet to...

05 August 2024 1,238 2 View

Anyone having idea about VN primer for miRNA primer design ?

How to design VN primer to attach with universal reverse primer

05 August 2024 2,116 3 View

How can i do multivariate Time Series forecast using MLP, ANFIS and LSTM?

I need the python code to forecast what crop production will be in the next decade considering climate and crop production variables as seen in the attached.csv file.

05 August 2024 2,977 3 View

The Curse of Evolution and Complexity?

Brain and body mass together are positively correlated with lifespan (Hofman 1993). The duration of neural development is one of the best predictors of brain size, and conception is the best...

05 August 2024 6,247 3 View

For an in-vitro drug release study, what molecular weight cut-off (MWCO) dialysis bag is required for a 117 kDa protein?

kindly reply me. Thanking you in advance.

05 August 2024 7,727 4 View

Need help with my research project on open source SIEM and machine learning?

Hello everyone, I am currently working on a research project that aims to integrate machine learning techniques into an open source SIEM tool to automate the creation of security use cases from...

04 August 2024 3,196 2 View

Luay Abdulwahid Shihab

More possible to clarify the question

William P Katt

If you don't like lipinski's rules, or the various derivatives thereof, you're sort of out of luck. The molecular weight range that's likely to be viable is going to vary by target: ATP competitive inhibitors will, by and large, be smaller than Protein-protein-interface inhibitors. And if you're looking at inhibitors of, say, PD-1, you're talking antibodies, and should multiply your molecular weight by a few thousand. If you're looking at metallo-drugs, a different choice of metal could change the weight by 10-100, while doing little to the size and properties.

Personally, I wouldn't discard data at all unless I KNEW it was an off-target interaction. As you say: you risk your model missing some key information that way. But if you want a narrower model? I'd take an inhibitor you like, get chemical fingerprints (or whatever descriptor you prefer), and compare to the fingerprints for the rest of the dataset, then get rid of the 4-sigma outliers when comparing the distance between fingerprints. That should account for weight AND a variety of other properties that might fall outside the model.

Thomas Evangelidis

Hi William,

You made some good points! I don't work with antibodies, just with small molecule inhibitors. But in the case of metallo-drugs, the actual MW threshold would fail as you correctly pointed out. I was thinking instead to use the SMILES of INChI string and set a minimum and maximum lengths to filter out putative non-binders. But, again what would be good values for these limits?

PS: the proceedure that you described about fingerprint similarity upraises the risk of low training data diversity that will result to a poor machine learning model.

True enough, but you're reducing your training diversity by limiting molecular weight as well. Besides, at four-sigma, assuming well distributed data, you're getting rid of less than 1% of the samples. I'd expect more like 5% for real data, but still - if your dataset diversity hinges on that small number of samples, something is probably very wrong somewhere.

Still - you appear to want somebody to give you arbitrary molecular weight values, but aren't happy with the Lipinski set. Literally nobody can do better than that, ESPECIALLY without knowing your binding site. If you don't like just eliminating the molecules most unlike the others, either for all properties or for weight alone, the best I can suggest is to do an alpha-sphere search in the binding site, figure the amount of, say, diamond that would fit into that volume, and use that as, say, 80% of your maximum mass. Then figure that anything that doesn't weigh at least, oh, 20% of that mass, won't make enough contacts to matter. You'll probably have to fiddle with that more than a little bit, but it would at least give you an algorithmic approach to your cutoffs, rather than arbitrary numbers.

As for smiles string lengths - I'd just do the mW calculation properly. It's a simple calculation, quick, and avoids the problem of strings with very little markup 'weighing' the same as strings with lots of rings, branch points, etc.