I have a huge set of training data divided into two classes. Class 1 contains over 300,000,000 cases, while class 2 has only about 2000 cases. I want to use machine learning techniques to build a classification model from these data (I have started working with neural networks). The question is: should I select approximately the same number of cases from class 1 as there are in class 2, so that the training set would have 2000 + 2000 = 4000 cases? What happens to the rest of the data, which would go unused? I assume the answer depends on the data, but I am asking about the general convention. Would some kind of anomaly analysis be appropriate for this problem?
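
To make the question concrete, the balanced subsampling I have in mind could be sketched roughly as below. This is only an illustration under assumptions: the function name `balanced_undersample`, the case identifiers, and the toy sizes (30,000 vs 200 standing in for 300,000,000 vs 2000) are all hypothetical, and plain random sampling without replacement is just one possible way to pick the class 1 cases.

```python
import random

def balanced_undersample(majority_ids, minority_ids, seed=0):
    """Randomly pick as many majority cases as there are minority cases.

    Returns the shuffled training ids and the set of majority ids left unused.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    # Sample without replacement from the large class (hypothetical ids).
    sampled_majority = rng.sample(majority_ids, k=len(minority_ids))
    # Everything not drawn from the large class is left over.
    unused_majority = set(majority_ids) - set(sampled_majority)
    training_ids = sampled_majority + list(minority_ids)
    rng.shuffle(training_ids)  # mix the two classes together
    return training_ids, unused_majority

# Toy sizes standing in for the real 300,000,000 vs 2000 cases (assumption).
majority_ids = [("class1", i) for i in range(30000)]
minority_ids = [("class2", i) for i in range(200)]

training_ids, unused = balanced_undersample(majority_ids, minority_ids)
print(len(training_ids), len(unused))
```

In this sketch the leftover class 1 cases are simply returned rather than discarded, for example so that they could be used for evaluation or for drawing further subsamples; whether that is the usual practice is part of what I am asking.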

Sincerely,

Mateusz Soliński
