The number of molecules depends on the number of descriptors included in the QSAR calculations.
For example, if you have 10 descriptors, the minimum number of molecules should be 50. In other words, you need at least 5 molecules per descriptor.
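The arithmetic behind this rule of thumb can be written down in a few lines. This is only a minimal sketch: the function names and the `ratio` parameter are made up for illustration, and the default of 5 is simply the ratio quoted above, not a validated threshold.

```python
import math


def min_molecules(n_descriptors: int, ratio: float = 5.0) -> int:
    """Smallest data set size that satisfies the molecules-per-descriptor rule."""
    if n_descriptors < 1:
        raise ValueError("n_descriptors must be at least 1")
    return math.ceil(n_descriptors * ratio)


def max_descriptors(n_molecules: int, ratio: float = 5.0) -> int:
    """Largest number of descriptors the rule allows for a given data set size."""
    if n_molecules < 0:
        raise ValueError("n_molecules must not be negative")
    return math.floor(n_molecules / ratio)
```

For instance, `min_molecules(10)` gives 50, matching the example above, and `max_descriptors(50)` gives 10.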
I agree with Dr. Rafik's answer, and I want to add to it.
This is the minimum requirement for proper QSAR model development. Try to add as many molecules as possible, with different substructures, so that you get a more appropriate model for prediction.
Best of luck.
You can also check this article: doi:10.1016/j.ejps.2015.08.017.
Well, it also depends on the result you want to achieve. You can draw a line through 2 points, but then it doesn't really make sense to use it to predict the endpoint for another molecule. I agree with Ankit Ab: the more different substructures you have, and the wider the range of values in your training set, the bigger your applicability domain will be.
But I would also add another piece of advice: prefer the quality of your dataset over its size. You will end up with more accurate, more reliable predictions from a model built on 20 good-quality points than from one built on 100 bad-quality points.
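To make the link between the range of the training set and the applicability domain concrete, here is a rough sketch of the simplest possible check, a per-descriptor min/max "bounding box". It is only one of several ways to define an applicability domain, and the function names and the numbers in the usage note are invented for illustration.

```python
def descriptor_ranges(training_set):
    """Return (min, max) for each descriptor column of the training set.

    training_set is a list of molecules, each a list of descriptor values.
    """
    columns = list(zip(*training_set))
    return [(min(col), max(col)) for col in columns]


def in_domain(molecule, ranges):
    """True if every descriptor of the molecule lies inside the training range."""
    return all(low <= value <= high
               for value, (low, high) in zip(molecule, ranges))
```

With a hypothetical training set `[[1.0, 200.0], [3.5, 350.0], [2.0, 280.0]]`, a query molecule `[2.5, 300.0]` falls inside the box, while `[5.0, 300.0]` falls outside it because its first descriptor exceeds the training maximum. A wider, more diverse training set enlarges the box.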
The probability of obtaining large correlation coefficients by chance is very high for small data sets, particularly those with fewer than 10 compounds. This paper contains a formula to estimate the degree of randomness:
Determining the Randomness of Descriptors in Linear Regression Equations with Respect to the Data Size, J. Chem. Inf. Model. 51 (2011) 3099-3104
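The effect is easy to see in a small simulation. Note that this is not the formula from the paper cited above; it is just a sketch that generates purely random "descriptors" and random "activities", keeps the best single-descriptor r2, and averages that over many trials. All names, the Gaussian random data and the trial count are assumptions made for illustration.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation
    return sxy * sxy / (sxx * syy)


def mean_best_chance_r2(n_compounds, n_descriptors, trials=200, seed=0):
    """Average of the best r2 found among random descriptors vs. random activity."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y = [rng.gauss(0, 1) for _ in range(n_compounds)]
        best = 0.0
        for _ in range(n_descriptors):
            x = [rng.gauss(0, 1) for _ in range(n_compounds)]
            best = max(best, r_squared(x, y))
        total += best
    return total / trials
```

Comparing, say, `mean_best_chance_r2(8, 10)` with `mean_best_chance_r2(50, 10)` shows that the best chance correlation among the same number of random descriptors is much larger for the small data set.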
The "5 molecules/descriptor" rule has become a kind of urban legend. Yet its original formulation by Topliss and Costello was much more sophisticated than the commonly adopted version (see the quote from the 1972 paper below and the respective citation [1]). Since 1972, however, there has been progress both in the field of QSAR and in the available hardware and software resources. Hence, to avoid chance correlations, I would highly recommend using computational model validation techniques, such as Y-randomization, test sets and cross-validation (see ref. [2] for details), rather than relying on oversimplified rules of thumb.
"Thus, for a given number of variables to be tested, the required number of observations
to avoid undue risk of chance correlations can be estimated. For example, if r2 = 0.40 is regarded as the maximum acceptable level of chance correlation then the minimum
number of observations required to test five variables is about 30, for 10 variables 50 observations, for 20 variables 65 observations, and for 30 variables 85 observations."
[1] Topliss, J. G., & Costello, R. J. (1972). Chance correlations in structure-activity studies using multiple regression analysis. Journal of Medicinal Chemistry, 15(10), 1066-1068.
[2] Tropsha, A. (2010). Best practices for QSAR model development, validation, and exploitation. Molecular Informatics, 29(6‐7), 476-488.
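As an illustration of Y-randomization, here is a deliberately simplified sketch. The "model" is just picking the single descriptor with the highest r2 (a stand-in for a real QSAR workflow, not the procedure of ref. [2]); the important point is that the whole model-building step, including descriptor selection, is repeated on shuffled activities. All function names and defaults are assumptions.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def best_descriptor_r2(descriptors, y):
    """Toy model building: select the descriptor column with the highest r2."""
    return max(r_squared(column, y) for column in descriptors)


def y_randomization(descriptors, y, n_shuffles=500, seed=0):
    """Return the observed r2 and the fraction of shuffles that match or beat it.

    descriptors is a list of columns (one list of values per descriptor).
    """
    rng = random.Random(seed)
    observed = best_descriptor_r2(descriptors, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real descriptor/activity link
        if best_descriptor_r2(descriptors, shuffled) >= observed:
            hits += 1
    # add-one correction so the estimated fraction is never exactly zero
    return observed, (hits + 1) / (n_shuffles + 1)
```

If the returned fraction is large, models of similar apparent quality arise from scrambled activities, which suggests the original correlation may be due to chance.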
Is this "rule of thumb" also valid for discriminant analysis? Let's say we have 80 compounds (40 actives, 40 inactives): what would be a reasonable number of descriptors for a discriminant equation?
Hello everyone, I would like to add a few more comments. There is no perfect answer to how many molecules are required to build a QSAR model, but, as everyone said earlier, it is wise to have many molecules. Apart from the number of molecules, the quality of a model strongly depends on how the molecules are distributed between the training set and the test set.
For molecule distribution, you can have a look at the following discussion thread...