Machine learning: does the proportion of cases in each class in the training set matter?

R. Raghunatha Sarma

Yes! It does matter. See if the following technique could help:

https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/chawla2002.html
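The linked article appears to be the SMOTE paper (synthetic minority over-sampling) by Chawla et al. A minimal sketch of the core idea, assuming numeric features and Euclidean distance: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_like` and the toy data are illustrative assumptions, not taken from the paper.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority samples by squared distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class with two features (hypothetical values)
minority = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]]
new_points = smote_like(minority, n_new=6)
```

The synthetic points are added to the training set only; the test set should keep its original samples.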

Graham W Pulford

Hello. The number of samples matters very much in any statistical estimation technique, and machine learning (aka classification and regression theory) is no different. It's useful to start with simple least-squares regression to get a feel for how much data you need to reach a given accuracy. In 1-D, the variance of the estimation error typically goes down inversely with the number of samples. In any case, a good rule of thumb in supervised learning is to have a roughly equal number of cases for each class. Since you have a lot more cases in one class, you could forgo the resampling for that class when you do cross-validation and use independent samples for your case 1.
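For the simplest 1-D case, estimating a mean from n independent samples that each have variance σ², the inverse relationship can be written as follows (a standard result, stated here as an illustration rather than as part of the original answer):

```latex
\operatorname{Var}(\bar{x}_n) = \frac{\sigma^2}{n},
\qquad
\operatorname{SE}(\bar{x}_n) = \frac{\sigma}{\sqrt{n}}
```

So halving the standard error takes roughly four times as many samples.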

Harlan Nelson

I assume you are trying to classify your observations as coming from one of your classes or the other. Depending on your method, it will be hard to get your model to predict that an observation comes from the smaller class. If one group has three times as many observations as the other, the small group makes up 25% of the data, so you can lean toward the small group whenever its predicted probability is greater than 25%. Your case is far more extreme. Everything gets easier if your classes are balanced.
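One way to read the 25% rule is as a decision threshold set to the minority class's share of the training data. A minimal sketch under that assumption; the function names and the class counts are hypothetical:

```python
def minority_threshold(n_minority, n_majority):
    """Share of the training set that belongs to the minority class."""
    return n_minority / (n_minority + n_majority)

def classify(prob_minority, threshold):
    """Predict the minority class when its probability exceeds the threshold."""
    return "minority" if prob_minority > threshold else "majority"

# Majority class has three times as many cases as the minority class
threshold = minority_threshold(1000, 3000)
labels = [classify(p, threshold) for p in (0.10, 0.30, 0.60)]
```

With the usual 50% cut-off, the 0.30 case would have been assigned to the majority class.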

Richa Batra

Yes, it does matter. Weighted SVMs can be used to account for the unequal class sizes. However, yours is an extreme case. I would go with 2000 samples from each class, then use the rest of the cases from class one as an independent test set to measure how well the model performs.
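The split described here could be sketched as below. The class sizes (10000 and 2000) and the integer stand-ins for samples are assumptions for illustration only; `balanced_split` is a hypothetical helper.

```python
import random

def balanced_split(class_one, class_two, n_per_class, seed=0):
    """Draw n_per_class samples from each class for training and
    return the unused class-one samples as a held-out set."""
    rng = random.Random(seed)
    shuffled = class_one[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    train_one = shuffled[:n_per_class]
    held_out_one = shuffled[n_per_class:]
    train_two = rng.sample(class_two, n_per_class)
    train = [(x, 1) for x in train_one] + [(x, 2) for x in train_two]
    rng.shuffle(train)               # mix the two classes
    return train, held_out_one

class_one = list(range(10000))           # large class (placeholder samples)
class_two = list(range(10000, 12000))    # small class (placeholder samples)
train, held_out = balanced_split(class_one, class_two, n_per_class=2000)
```

Note that a held-out set built this way contains only class-one cases, so it measures the error rate on that class alone; performance on the smaller class still needs cross-validation or separate test samples.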

Alex Pollara

Yes, an imbalance of the sort you are describing will lead to something called classifier bias.

Whether this is a good or bad thing depends on the nature of the data you are working with. Is the division of samples due to the rarity of one class in the population, or due to sampling bias? If your data set does not reflect the makeup of the population, then you have a case of sampling bias and should take steps to correct it. If your data reflects the actual population and the minority class is just very rare, then changing the proportion of samples from each class may lead to more false positives.

The attached paper details some methods for up-sampling minority classes or down-sampling majorities that you might find useful.

http://www.eecs.wsu.edu/~cook/pubs/icdm13.2.pdf

Conference Paper Handling Class Overlap and Imbalance to Detect Prompt Situat...