When designing classifiers (using ANNs, SVMs, etc.), models are fitted on a training set. But how should a dataset be divided into training and test sets? With too few training data, our parameter estimates will have greater variance, whereas with too few test data, our performance statistic will have greater variance. What is the right compromise? In practice, depending on the application or the total number of exemplars in the dataset, we usually split the dataset into training (60 to 80%) and testing (20 to 40%) without any principled reason. What is the best way to divide our dataset into training and test sets?
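To make the trade-off concrete, here is a minimal stdlib-only sketch of the two standard options: a single random hold-out split (the 80/20 convention mentioned above) and k-fold cross-validation, which sidesteps the compromise by letting every exemplar serve as test data exactly once. The function names and the 0.2 test fraction are illustrative choices, not established recommendations.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle and split a dataset into train and test portions.
    test_fraction=0.2 is a common convention, not a principled rule."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [data[i] for i in train_idx], [data[i] for i in test_idx]

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation,
    so every exemplar appears in exactly one test fold."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    fold_size = n // k
    for f in range(k):
        start = f * fold_size
        stop = start + fold_size if f < k - 1 else n  # last fold absorbs remainder
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

data = list(range(100))
train, test = train_test_split(data)          # 80 train / 20 test exemplars
folds = list(k_fold_indices(len(data), k=5))  # 5 disjoint test folds covering all data
```

With k-fold cross-validation the variance question shifts from "how big should the test set be?" to "how many folds?", since all exemplars contribute to both training and evaluation.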
