Dear all,

In machine learning, the most common way of splitting a dataset into training and test sets is to randomly allocate the cases to one or the other, according to a predefined percentage p% of cases to be included in the test set. In some cases, particularly with small datasets, this approach will produce unbalanced training and test sets: the training set may not contain the extreme values of the predicted variable and/or certain combinations of the input variables. After training, whichever learning algorithm is used will then most likely have poor predictive ability and, as a consequence, a smaller applicability domain.
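For concreteness, here is a minimal sketch of the plain random split described above, using scikit-learn; the dataset, variable names, and the 25% test fraction are illustrative choices, not taken from any specific study:

```python
# Minimal sketch of a plain random split (illustrative data and names).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))             # small dataset: 50 cases, 3 inputs
y = rng.exponential(scale=2.0, size=50)  # skewed target with a long tail

test_frac = 0.25  # the predefined p% of cases reserved for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_frac, random_state=0
)

# With few cases, the extreme target values can all land in one subset,
# leaving the training set without them:
print("max y overall :", y.max())
print("max y in train:", y_train.max())
```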

Under what circumstances do you decide not to assign the cases randomly, but instead to force some of them to be present in the training set?

Do you, for instance, stratify your dataset into classes/bins reflecting the value of the variable(s) to be predicted, and then allocate p% of the cases of each class to the test set and the remaining (1-p)% to the training set, instead of randomly selecting p% of the cases from the whole dataset? Do you stratify according to the predicted values or according to the values of the input variables?
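A minimal sketch of that stratified alternative, assuming a continuous target that is binned into quantiles before splitting (the bin count and test fraction are illustrative choices):

```python
# Minimal sketch of a stratified split on a binned continuous target.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = rng.exponential(scale=2.0, size=50)

n_bins = 5
bins = pd.qcut(y, q=n_bins, labels=False)  # equal-frequency bins on y

# stratify=bins forces each bin to contribute ~p% of its cases to the
# test set and the remaining (1-p)% to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=bins, random_state=0
)

# Each bin is now represented in both subsets, so the training set
# spans the full range of the predicted variable.
```

The same idea works for stratifying on the input variables, e.g. by clustering the inputs and passing the cluster labels to the stratify argument.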

Regards, Luis
