Does anyone know how to add noise to my classification datasets? I am confused about whether to add the noise to the training set, the testing set, or the whole dataset. What's more, I am not clear on how to do it.
Oluwarotimi seems to have the entire learning process covered. I'll just expand a bit on the part about adding noise.
The usual type of noise added to a classification dataset is Gaussian noise. Provided your dataset's features/attributes consist of real numbers, it is actually a simple process:
Fix a scale factor w
Find the standard deviation s_f of each feature f
for each instance,
    for each feature value of feature f,
        choose a random number x taken from the interval (-s_f, s_f)
        add x / w to that feature value
Note that the scale factor w determines the degree of noise added to your data. If w is too low, your dataset becomes too noisy and your machine learning algorithm may not converge. If w is too high, the noise itself is negligible.
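The steps above can be sketched in Python with NumPy. This is only a rough illustration: the function name `add_feature_noise` and the `kind` parameter are my own assumptions, not part of the original recipe. Note that drawing from the interval (-s_f, s_f) as described is, strictly speaking, uniform noise, so the sketch also offers a Gaussian variant that uses s_f as the standard deviation.

```python
import numpy as np

def add_feature_noise(X, w, kind="uniform", rng=None):
    """Return a noisy copy of X (rows = instances, columns = features).

    Hypothetical helper: `kind` selects how x is drawn before dividing by w.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    s = X.std(axis=0)  # standard deviation s_f of each feature (column)

    if kind == "uniform":
        # x drawn from the interval (-s_f, s_f), as in the steps above
        x = rng.uniform(-1.0, 1.0, size=X.shape) * s
    elif kind == "gaussian":
        # Assumed variant: x drawn from a normal distribution with std s_f
        x = rng.normal(0.0, 1.0, size=X.shape) * s
    else:
        raise ValueError("kind must be 'uniform' or 'gaussian'")

    # A larger w shrinks the noise; a smaller w makes the data noisier
    return X + x / w

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
noisy = add_feature_noise(X, w=10, rng=np.random.default_rng(0))
```

With the uniform variant, no value moves by more than s_f / w, which makes the effect of w easy to reason about.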
When less training data is available, one can add a small amount of noise to create a larger dataset. Each time a training sample is exposed to the model, random noise is added to the input variables, making them different every time. In this way, adding noise to input samples is a simple form of data augmentation. A good approach is to normalize the values first and then add noise drawn from a Gaussian distribution.
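As a minimal sketch of that idea, assuming NumPy and a training matrix with one row per sample: the helper names `standardize` and `noisy_epochs`, and the value `sigma=0.1`, are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def standardize(X):
    """Normalize each feature to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (X - mean) / std

def noisy_epochs(X_norm, n_epochs, sigma=0.1, seed=0):
    """Yield a freshly perturbed copy of the training inputs for each epoch."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        # New Gaussian noise on every pass, so the model never sees
        # exactly the same inputs twice
        yield X_norm + rng.normal(0.0, sigma, size=X_norm.shape)

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_norm = standardize(X_train)
batches = list(noisy_epochs(X_norm, n_epochs=3))
```

Because the features are normalized first, a single `sigma` has the same relative meaning for every feature.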
What if I want to avoid negative values in the random number generation between (-stddev, +stddev), i.e. generate random numbers only between (0, +stddev)? Would that still be valid Gaussian noise? The reason I am asking is that the features I already have data for (i.e. the ones I need to add noise to) have mostly positive values, with only a few features (maybe 10 out of 700) having some small negative values. This depends purely on the range of these feature values (i.e. the majority of the features fall in the range (0, +real_numbers)). So I am afraid that if I start adding negative noise to these values (which I did earlier), many feature values will go outside their normal operating range, i.e. I will end up generating simulated data that my model may never encounter in the future.
Hence, I am thinking of restricting the noise to random numbers in the range (0, +STDDEV). Of course, I would divide them by the scale factor 'w' and then add them to the original feature values to get the simulated values.
Note: When I said that "adding negative noise to feature values results in negative feature values", that negative noise already took the scale factor 'w' into account.
Please let me know whether using (0, +STDDEV) along with the scale factor 'w' would still preserve the Gaussian property of the noise.
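One way to explore this question is to compare the two ranges empirically. The sketch below assumes NumPy, and the values of `s`, `w`, and `n` are purely illustrative. Draws from a bounded interval are uniform rather than Gaussian in either case. The one-sided range also stops the noise from being zero-mean, shifting values upward by roughly s / (2w) on average.

```python
import numpy as np

rng = np.random.default_rng(0)
s, w, n = 2.0, 10.0, 100_000  # illustrative stddev, scale factor, sample count

# Noise drawn from (-s, s) versus (0, s), both divided by the scale factor w
two_sided = rng.uniform(-s, s, size=n) / w
one_sided = rng.uniform(0.0, s, size=n) / w

mean_two = two_sided.mean()  # close to 0: values are perturbed in both directions
mean_one = one_sided.mean()  # close to s / (2 * w): values are pushed upward

print(f"two-sided mean: {mean_two:.4f}")
print(f"one-sided mean: {mean_one:.4f} (expected about {s / (2 * w):.4f})")
```

Whether that upward shift is acceptable depends on the features involved, but it is worth knowing that it is there before using the simulated data.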