I want to perform a QSAR study of some organic derivatives, but I don't know the criteria for classifying the given compounds into a training and a test set. Could anyone please help me understand this?
Dear Renjith, there are numerous approaches to building training and test datasets (https://en.wikipedia.org/wiki/Sampling_%28statistics%29). I suggest reading about sampling techniques (bootstrapping, stratified random sampling, self-organizing maps, or simple random selection).
For QSAR studies I suggest reading the paper "Does rational selection of training and test sets improve the outcome of QSAR modeling?" (http://www.ncbi.nlm.nih.gov/pubmed/23030316)
Alternatively, you can follow a simple procedure in which the test-set compounds are selected manually, considering structural diversity and a wide range of activity in the dataset.
Consider a set of 30 compounds.
First, sort the given compounds by their activity values (EC50, IC50, etc.) in ascending order. Generally we use a training set : test set ratio of 4:1.
Second, after sorting, select every 5th compound starting from the 1st compound of the dataset, so that you have a total of 6 compounds in the test set (1, 6, 11, 16, 21, 26); the remaining compounds are considered the training set.
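The activity-sorting procedure above can be sketched in a few lines of Python (the function name and example data are illustrative, not from any published tool):

```python
def activity_sorted_split(compounds, activities, step=5):
    """Sort compounds by activity and take every `step`-th one as the test set."""
    order = sorted(range(len(compounds)), key=lambda i: activities[i])
    test_pos = set(order[::step])  # 1st, 6th, 11th, ... of the sorted list
    train = [compounds[i] for i in order if i not in test_pos]
    test = [compounds[i] for i in order if i in test_pos]
    return train, test

# Example with 30 hypothetical compounds and made-up activity values:
compounds = [f"cmpd_{i}" for i in range(30)]
activities = [float(i) for i in range(30)]  # e.g. pIC50 values
train, test = activity_sorted_split(compounds, activities)
print(len(train), len(test))  # 24 6
```

With 30 compounds and step 5 this yields the 4:1 split described above (24 training, 6 test), and the test set spans the full activity range by construction.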
You can refer to the following article: http://www.sciencedirect.com/science/article/pii/S1876107013001375
You should try many different techniques, as Reisel suggests, and run many iterations of each to produce many training sets. Each model produced will then give you information on the robustness of your dataset and its capability to fill missing data in the chemical region of your test set.
The guiding principle here is that the test set should be representative of the chemical space covered by the training set. It would not make sense to build a model on, e.g., pyridine derivatives and test it with aliphatic hydrocarbons.
We employ the following methods to ensure similar coverage:
- Self-organizing Kohonen mapping
- K-means clustering
- Random selection
- Every n-th selection (as described above by Mangesh)
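As one example of the clustering-based options listed above, here is a minimal K-means sketch (assuming scikit-learn and NumPy are available; the descriptor matrix is a synthetic placeholder, not real data). From each cluster the compound nearest the centroid goes into the test set, so the test set covers the same descriptor space as the training set:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))          # 30 compounds x 5 descriptors (made up)

n_test = 6                            # 4:1 train:test ratio
km = KMeans(n_clusters=n_test, n_init=10, random_state=0).fit(X)

# From each cluster, send the member closest to the centroid to the test set.
test_idx = []
for c in range(n_test):
    members = np.where(km.labels_ == c)[0]
    d = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    test_idx.append(members[np.argmin(d)])

train_idx = [i for i in range(len(X)) if i not in test_idx]
print(len(train_idx), len(test_idx))  # 24 6
```

In a real study, X would be your computed descriptor or fingerprint matrix, typically standardized before clustering.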
But we also trust our customers to come up with a suitable manual selection, thus such an option is also offered.
See http://www.simulations-plus.com/Products.aspx?pID=13&mlID=14&lmID=28
Compounds of the test set should represent both the distribution of features and the distribution of activities. Features can be chemical fingerprints or the usual descriptors. For small datasets, hierarchical clustering is useful for selecting the cluster centroids as the test set: these are representative, and the remaining training set accounts for the diversity. Recommended ratio: 2/3 training set, 1/3 test set. Basically, any deliberate approach is better than a completely random selection. N-fold cross-validation can only be recommended for large datasets (>100 compounds or so).
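A hedged sketch of this centroid-based selection, assuming SciPy and NumPy (the descriptor matrix is synthetic; in practice each cluster's "centroid compound" is the member closest to the cluster mean, since the mean itself is not a real compound):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))                # 30 compounds x 4 descriptors

n_test = 10                                 # ~1/3 of the data as test set
labels = fcluster(linkage(X, method="ward"), t=n_test, criterion="maxclust")

# For each cluster, pick the member closest to the cluster mean as the
# test-set representative of that region of descriptor space.
test_idx = []
for c in np.unique(labels):
    members = np.where(labels == c)[0]
    center = X[members].mean(axis=0)
    test_idx.append(members[np.argmin(np.linalg.norm(X[members] - center, axis=1))])

train_idx = [i for i in range(len(X)) if i not in test_idx]
print(len(train_idx), len(test_idx))
```

Ward linkage on standardized descriptors is a common default; other linkage methods (complete, average) are equally valid choices here.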
There are several ways to perform dataset division, as already suggested in the comments above. I would like to recommend a freely available tool, "Dataset Division", available at the attached link. The tool provides three methods for dataset division:
1. Kennard Stone Algorithm - For more info: http://flo.nigsch.com/?p=6
2. Euclidean distance based
3. Activity sorting method - also suggested by Mangesh Damre (commented above).
Along with this, you may also use a clustering tool to divide the data; a modified k-medoids clustering tool is also available on the same website. If you still have any questions, feel free to ask.
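For reference, the Kennard-Stone algorithm mentioned above can be implemented in plain NumPy. This is a self-contained sketch, not the code of the linked tool: it greedily picks points that are maximally spread out (starting from the two most distant compounds), which is why it is usually applied to select the training set, with the leftovers forming the test set. The descriptor matrix here is synthetic.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select maximally spread rows of X (Kennard-Stone)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Start with the two most distant compounds.
    selected = list(np.unravel_index(np.argmax(dist), dist.shape))
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_select:
        # Pick the compound whose nearest already-selected neighbour
        # is farthest away (the max-min criterion).
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(d_min))))
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))              # 30 compounds x 3 descriptors
train_idx = kennard_stone(X, 24)          # 24 training compounds (4:1 ratio)
test_idx = [i for i in range(30) if i not in train_idx]
print(len(train_idx), len(test_idx))  # 24 6
```

Note that Kennard-Stone is deterministic for a given descriptor matrix, which makes the split reproducible but also sensitive to descriptor scaling.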