I'm new to QSAR, so I have a simple question: what is the minimum number of molecules I need to build a model?

Rafik Karaman

Dear Paula,

The number of molecules depends on the number of descriptors included in the QSAR calculations.

For example, if you have 10 descriptors, the minimum number of molecules should be 50. In other words, you need at least 5 molecules per descriptor.
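The arithmetic of this rule of thumb can be sketched in a few lines of Python. The function names and the `per_descriptor` parameter are illustrative only; they do not come from any QSAR package.

```python
def min_molecules(n_descriptors: int, per_descriptor: int = 5) -> int:
    """Smallest data set size allowed by the 'N molecules per descriptor' rule."""
    return n_descriptors * per_descriptor


def max_descriptors(n_molecules: int, per_descriptor: int = 5) -> int:
    """Largest number of descriptors the same rule allows for a given data set."""
    return n_molecules // per_descriptor


print(min_molecules(10))    # 50, the example given above
print(max_descriptors(80))  # 16
```

Read in reverse, the same ratio gives an upper bound on how many descriptors a data set of a given size can support under this rule.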

Rafik

Paula Andrea Giraldo Hincapie

Thank you! :)

Ankit Ab

Dear Paula,

I agree with Dr. Rafik's answer, and I would like to add to it.

That is the minimum requirement for proper QSAR model development. Try to include as many molecules as possible, with different substructures, so that you get a more appropriate model for prediction.

Best of luck.

You can also check this article: doi:10.1016/j.ejps.2015.08.017.

Thank you, Ankit.

Franklin Bauer

Well, it also depends on the result you want to achieve. You can draw a line through 2 points, but it doesn't really make sense to use that line to predict the endpoint for another molecule. I agree with Ankit Ab: the more diverse the substructures and the wider the range of values in your training set, the larger your applicability domain will be.

I would add another piece of advice: prefer the quality of your dataset over its size. You will end up with more accurate, more reliable predictions from a model built on 20 good-quality points than from one built on 100 poor-quality points.
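To make the link between training-set range and applicability domain concrete, here is a minimal sketch of a range-based ("bounding box") domain check. This is only one of the simplest possible definitions and is my assumption, not a method named in this thread; the descriptor values are made up.

```python
from typing import Sequence


def descriptor_ranges(training: Sequence[Sequence[float]]) -> list[tuple[float, float]]:
    """Return (min, max) of every descriptor column over the training set."""
    return [(min(col), max(col)) for col in zip(*training)]


def in_domain(molecule: Sequence[float], ranges: Sequence[tuple[float, float]]) -> bool:
    """True if every descriptor value lies inside the training range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(molecule, ranges))


# Hypothetical descriptor values (say, logP and molecular weight) for three molecules.
training_set = [
    [1.2, 250.0],
    [2.8, 310.0],
    [3.5, 410.0],
]
ranges = descriptor_ranges(training_set)

print(in_domain([2.0, 300.0], ranges))  # True: inside both ranges
print(in_domain([5.1, 300.0], ranges))  # False: first descriptor is out of range
```

A training set that spans wider descriptor ranges accepts more query molecules under this check, which is the point made above.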

Michael Hutter

The probability of obtaining large correlation coefficients purely by chance is very high for small data sets, particularly those with fewer than 10 compounds. This paper contains a formula to estimate the degree of randomness:

Determining the Randomness of Descriptors in Linear Regression Equations with Respect to the Data Size, J. Chem. Inf. Model. 51 (2011) 3099-3104
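The effect can also be seen with a brute-force simulation. The sketch below is not the formula from the cited paper; it simply correlates random "descriptors" with random "activities" and reports the average best r2 found by chance. All names, the number of descriptors (10) and the number of trials are arbitrary choices for illustration.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def mean_best_chance_r2(n_compounds, n_descriptors, trials=200, seed=0):
    """Average, over many trials, of the best r2 among purely random descriptors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Random activities: there is no real structure-activity relationship here.
        y = [rng.gauss(0, 1) for _ in range(n_compounds)]
        best = max(
            r_squared([rng.gauss(0, 1) for _ in range(n_compounds)], y)
            for _ in range(n_descriptors)
        )
        total += best
    return total / trials


for n in (5, 10, 50):
    print(n, "compounds:", round(mean_best_chance_r2(n, n_descriptors=10), 2))
```

The smaller the data set, the higher the best chance r2 tends to be, even though the data contain no signal at all.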

Dmitri Kireev

The "5 molecules/descriptor" rule has become something of an urban legend. Yet its original formulation by Topliss and Costello was much more sophisticated than the commonly adopted version (see the quote from the 1972 paper below and the corresponding citation [1]). Moreover, since 1972 there has been progress both in the field of QSAR and in the available hardware and software resources. Hence, to avoid chance correlations, I would highly recommend using computational model validation techniques, such as Y-randomization, test sets and cross-validation (see ref. [2] for details), rather than relying on oversimplified rules of thumb.

"Thus, for a given number of variables to be tested, the required number of observations to avoid undue risk of chance correlations can be estimated. For example, if r² = 0.40 is regarded as the maximum acceptable level of chance correlation then the minimum number of observations required to test five variables is about 30, for 10 variables 50 observations, for 20 variables 65 observations, and for 30 variables 85 observations."
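A quick calculation on the figures in this quote shows how they differ from a flat 5:1 ratio. The numbers are taken from the quote only; nothing else from the paper is reproduced here.

```python
# (variables tested, minimum observations) as given in the quote above,
# for a maximum acceptable chance correlation of r2 = 0.40.
quoted = [(5, 30), (10, 50), (20, 65), (30, 85)]

# Observations per variable implied by each quoted pair.
ratios = {n_vars: n_obs / n_vars for n_vars, n_obs in quoted}

for n_vars, n_obs in quoted:
    print(f"{n_vars:>2} variables: {n_obs} observations "
          f"({ratios[n_vars]:.2f} per variable; a flat 5:1 rule would give {5 * n_vars})")
```

The implied ratio falls from 6 per variable at 5 variables to about 2.83 at 30 variables, and it equals 5 only in the 10-variable case.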

[1] Topliss, J. G., & Costello, R. J. (1972). Chance correlations in structure-activity studies using multiple regression analysis. Journal of Medicinal Chemistry, 15(10), 1066-1068.

[2] Tropsha, A. (2010). Best practices for QSAR model development, validation, and exploitation. Molecular Informatics, 29(6‐7), 476-488.
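As a rough illustration of Y-randomization, the sketch below scrambles the activity values many times and checks whether a scrambled "model" ever scores as well as the real one. The "model" here is deliberately trivial (pick the single descriptor with the highest r2), and the data are synthetic; both are assumptions made to keep the example self-contained. In practice the whole modelling procedure, including descriptor selection, would be repeated on each scrambled set.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def best_r2(descriptors, y):
    """Stand-in for model building: the best single-descriptor r2."""
    return max(r_squared(column, y) for column in descriptors)


def y_randomization(descriptors, y, n_shuffles=500, seed=0):
    """Return (observed r2, best scrambled r2, empirical p-value)."""
    rng = random.Random(seed)
    observed = best_r2(descriptors, y)
    scrambled_scores = []
    for _ in range(n_shuffles):
        y_scrambled = list(y)
        rng.shuffle(y_scrambled)  # break any real descriptor-activity link
        scrambled_scores.append(best_r2(descriptors, y_scrambled))
    # Fraction of scrambled runs that score at least as well as the real data.
    n_as_good = sum(score >= observed for score in scrambled_scores)
    p_value = (1 + n_as_good) / (1 + n_shuffles)
    return observed, max(scrambled_scores), p_value


# Synthetic data: one informative descriptor plus four pure-noise descriptors.
rng = random.Random(42)
n_molecules = 30
informative = [rng.gauss(0, 1) for _ in range(n_molecules)]
noise = [[rng.gauss(0, 1) for _ in range(n_molecules)] for _ in range(4)]
activity = [2.0 * x + rng.gauss(0, 0.5) for x in informative]

observed, best_scrambled, p_value = y_randomization([informative] + noise, activity)
print(f"observed r2 = {observed:.2f}")
print(f"best scrambled r2 = {best_scrambled:.2f}")
print(f"empirical p-value = {p_value:.3f}")
```

If the scrambled runs regularly reach an r2 close to the observed one, the original correlation is likely to be a chance result.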

Thank you, everyone. I'm going to review the papers that were suggested.

J. Ignacio Bueso-Bordils

Is this "rule of thumb" also valid for discriminant analysis? Let's say we have 80 compounds (40 actives, 40 inactives): what would be a reasonable number of descriptors for a discriminant equation?

I guess this rule is only applicable to linear models, isn't it?

Mangesh Damre

Hello everyone, I would like to add a further comment. There is no perfect answer to how many molecules are required to build a QSAR model, but as everyone has said, it's wise to have many. Apart from the number of molecules, the quality of a model strongly depends on how the molecules are distributed between the training set and the test set.

For molecule distribution, have a look at the following discussion thread:

https://www.researchgate.net/post/How_can_we_classify_a_set_of_compounds_into_a_test_and_training_set_in_QSAR
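One simple way to spread molecules between the two sets is to rank them by activity and send every k-th molecule to the test set. The sketch below assumes that approach; it is not necessarily what the linked thread recommends, and the activity values are invented.

```python
def split_by_activity(activities, test_every=4):
    """Return (training_indices, test_indices) spread over the activity range.

    Molecules are ranked by activity and one molecule per block of
    `test_every` goes to the test set. The least and most active molecules
    always stay in the training set, so the test set lies within its range.
    """
    ranked = sorted(range(len(activities)), key=lambda i: activities[i])
    last = len(ranked) - 1
    test = [
        idx
        for pos, idx in enumerate(ranked)
        if pos % test_every == test_every // 2 and 0 < pos < last
    ]
    test_lookup = set(test)
    train = [idx for idx in ranked if idx not in test_lookup]
    return train, test


# Hypothetical activity values (for example pIC50) for eight molecules.
activities = [5.1, 7.3, 6.0, 8.2, 4.9, 6.6, 7.9, 5.5]
train, test = split_by_activity(activities)
print("training indices:", train)
print("test indices:", test)
```

Other schemes (random selection, clustering on descriptors, and so on) are equally possible; the point is only that both sets should cover the same activity and structure space.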

Anand Gaurav

I agree with Mangesh Damre: the more the merrier, and the more diverse the better.