Machine learning: does the proportion of cases in each class in the training set matter?

R. Raghunatha Sarma

Yes! It does matter. See if the following technique could help:

https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/chawla2002.html
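As far as I can tell, the linked page is the SMOTE paper (Chawla et al., 2002), which creates synthetic minority cases by interpolating between a minority sample and one of its nearest minority neighbours. Below is a minimal sketch of that idea in plain Python, not the authors' implementation. The function name `smote_like`, the parameter `k=3`, and the toy data are all assumptions made up for illustration.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy 2-D minority class (hypothetical values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
```

Every synthetic point lies on a line segment between two real minority cases, so the new cases stay inside the region the minority class already occupies.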

Graham W Pulford

Hello. The number of samples matters very much in any statistical estimation technique, and machine learning (aka classification and regression theory) is no different. It's useful to start with simple least-squares regression to get a feel for how much data you need to reach a given accuracy. In 1-D, the variance of the estimation error typically goes down inversely with the number of samples. In any case, a good rule of thumb is to have a roughly equal number of cases for each class in supervised learning. Since you have a lot more cases in one class, you could forgo the resampling for that class when you do cross-validation and use independent samples for your case 1.
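To see where the inverse dependence comes from, take the simplest 1-D case: estimating a mean from n independent samples that share a variance σ². This setting is an assumption chosen for illustration, not the regression problem itself.

```latex
\operatorname{Var}(\bar{x})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i)
  = \frac{\sigma^2}{n}
```

So halving the standard error takes roughly four times as many samples.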

Harlan Nelson

I assume you are trying to classify your observations as coming from one of your classes or the other. Depending on your method, it will be hard to get your model to predict that an observation comes from the smaller class. If one group has three times as many observations as the other, then you can lean toward the small group whenever its predicted probability is greater than 25%. Your case is far more extreme. Everything gets easier if your classes are balanced.
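A rough sketch of that thresholding rule, assuming the 25% figure is simply the small group's share of the training data (1 part in 4 for a 3:1 split). The function names and the counts are made up for illustration.

```python
def minority_threshold(n_minority, n_majority):
    """Decision threshold equal to the minority class's share of the data."""
    return n_minority / (n_minority + n_majority)

def predict_minority(prob_minority, n_minority, n_majority):
    """Label an observation as minority when its predicted probability
    exceeds the minority share, instead of the usual 0.5."""
    return prob_minority > minority_threshold(n_minority, n_majority)

# Hypothetical 3:1 imbalance: 3000 majority cases, 1000 minority cases.
threshold = minority_threshold(1000, 3000)     # 1000 / 4000 = 0.25
flagged = predict_minority(0.30, 1000, 3000)   # 0.30 > 0.25
```

With a more extreme imbalance the same rule gives a much smaller threshold, which is one reason balanced classes are easier to work with.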

Richa Batra

Yes, it does matter. Weighted SVMs can be used to account for the unequal class sizes. However, yours is an extreme case. I would go with 2000 samples from each class, then use the rest of the cases from class one as an independent test set and compute how well the model performs.
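The split could look something like the sketch below. The class sizes are hypothetical (only the 2000 per class comes from the suggestion above), and `balanced_split` is a name made up for illustration.

```python
import random

def balanced_split(class_one, class_two, n_per_class, seed=0):
    """Draw n_per_class cases from each class for training and return the
    unused class-one cases as a held-out set."""
    rng = random.Random(seed)
    shuffled_one = class_one[:]
    rng.shuffle(shuffled_one)
    train_one = shuffled_one[:n_per_class]
    held_out_one = shuffled_one[n_per_class:]
    train_two = rng.sample(class_two, n_per_class)
    # Attach class labels (1 or 2) and mix the two classes together.
    train = [(x, 1) for x in train_one] + [(x, 2) for x in train_two]
    rng.shuffle(train)
    return train, held_out_one

# Hypothetical sizes: 50,000 cases in class one, 2,000 in class two.
class_one = list(range(50_000))
class_two = list(range(50_000, 52_000))
train, held_out_one = balanced_split(class_one, class_two, n_per_class=2000)
```

Note that a held-out set built this way contains only class-one cases, so it only tells you the error rate on that class.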

Alex Pollara

Yes, an imbalance of the sort you are describing will lead to something called classifier bias.

Whether this is a good or bad thing depends on the nature of the data you are working with. Is the division of samples due to the rarity of one class in the population, or due to sampling bias? If your data set does not reflect the makeup of the population, then you have a case of sampling bias and should take steps to correct it. If your data reflects the actual population and the minority class is just very rare, then changing the proportion of samples from each class may lead to more false positives.

The attached paper details some methods for up-sampling minority classes or down-sampling majorities that you might find useful.

http://www.eecs.wsu.edu/~cook/pubs/icdm13.2.pdf

Conference Paper Handling Class Overlap and Imbalance to Detect Prompt Situat...
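For reference, the simplest form of up-sampling is random duplication of minority cases. The sketch below shows only that baseline, not the methods from the paper; `upsample_minority` and the toy labels are made up for illustration. Down-sampling the majority is the mirror image: draw without replacement until the class shrinks to the target size.

```python
import random

def upsample_minority(minority, target_size, seed=0):
    """Randomly duplicate minority cases (sampling with replacement)
    until the class reaches target_size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

# Toy data: 3 minority cases versus 12 majority cases.
minority = ["m1", "m2", "m3"]
majority = [f"M{i}" for i in range(12)]
upsampled = upsample_minority(minority, target_size=len(majority))
```

If you resample, do it inside each cross-validation training fold only, so that duplicated cases never appear in both the training and the test portion.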

Muhammad Ali

Yes, the best way is to follow the equally likely principle from statistics.

Mokhaled N. A. Al-Hamadani

Yes, it affects the model that you're using, since there should be an equal number of samples in each class, or something close to it.