Hi. I have a small tabular dataset of 130 data points, but these are really just biological replicates of 20 samples. Each sample's replicates share the same input feature values and diverge only in the target variable, so I have 130 different target values but only 20 distinct sets of input features.
If I treat the replicates as grouped (which seems to be standard procedure), I really only have 20 samples, which is very few. I can compute confidence intervals for predictions from the known variability of the target variable across replicates, and use a Leave-One-Out strategy to report the accuracy of my model, but I think that still falls short with only 20 samples.
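For context, this is roughly the grouped Leave-One-Out setup I have in mind (a minimal sketch assuming scikit-learn; the RandomForestRegressor and the synthetic arrays are just placeholders for my actual model and data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# X: one row per replicate (feature values repeated within a sample),
# y: one target value per replicate,
# groups: the sample ID of each replicate, so replicates of the same
# sample never end up split across train and test.
X = np.repeat(rng.normal(size=(20, 5)), repeats=7, axis=0)[:130]
y = rng.normal(size=130)
groups = np.repeat(np.arange(20), 7)[:130]

logo = LeaveOneGroupOut()
errors = []
for train_idx, test_idx in logo.split(X, y, groups):
    model = RandomForestRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    errors.append(np.mean((pred - y[test_idx]) ** 2))

print(f"Grouped LOO mean squared error: {np.mean(errors):.3f}")
```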
The particularity of my case is that the input features cannot be measured for each biological replicate, because most of the measurement techniques are destructive and time-consuming. So there is a good reason why the dataset looks the way it does, but I still need to increase my usable dataset size somehow.
The input features for a given sample were mostly computed from experimental data, so I know the mean and standard deviation of each feature for each sample; currently I just assign the sample mean to all of its replicates. What I thought of doing is replacing that fixed mean, for each replicate, with a random value drawn from the Gaussian distribution defined by that feature's mean and standard deviation for that sample.
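In code, the idea would look something like this (again a minimal sketch: feature_means, feature_stds, and sample_of_replicate are hypothetical arrays standing in for my per-sample experimental statistics and the replicate-to-sample mapping):

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples, n_features, n_replicates = 20, 5, 130

# Hypothetical per-sample, per-feature statistics from the experiments.
feature_means = rng.normal(size=(n_samples, n_features))
feature_stds = np.abs(rng.normal(scale=0.1, size=(n_samples, n_features)))

# sample_of_replicate[i] = index of the sample replicate i belongs to.
sample_of_replicate = np.repeat(np.arange(n_samples), 7)[:n_replicates]

# Instead of assigning every replicate its sample's mean feature vector,
# draw a fresh value per replicate from N(mean, std) of its sample.
X_augmented = rng.normal(
    loc=feature_means[sample_of_replicate],
    scale=feature_stds[sample_of_replicate],
)
print(X_augmented.shape)  # (130, 5): one distinct feature row per replicate
```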
Essentially, I am trying to augment my data by adding noise, but this noise comes from a known distribution, and it is a tactic to turn my biological replicates into independent samples. The uncertainty in the input features for each sample is real and varies by feature and sample, so the added noise should also help keep the model from overtraining on the more uncertain features. What I don't like is that the input values are randomly drawn rather than given a more empirical value, but I see no other way. Using the target variable information to engineer the input features would surely be data leakage that many would not approve of.
Sorry for the ramble. My question is: what do you think of this approach, and has anyone seen this kind of strategy used in a publication they can reference?
Thank you kindly