I have a dataset of good and malicious files, labelled 0 and 1 respectively. The dataset contains more malicious files than good ones: roughly 3000 malicious vs. 800 good. Nevertheless, I am getting good results with logistic regression, random forest and SVM. I have tuned the parameters to account for the imbalanced dataset. Recall is always above 0.90, accuracy is 0.86-0.90, and the AUC score is around 0.87 to 0.90 depending on the algorithm. Now my question: in real life, the files that run through the algorithm will mostly be good (millions), and only a few thousand will be malicious, which is the opposite of the class balance in my dataset. Will that be a problem for my testing?
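
To make the concern concrete, here is a minimal sketch of how precision and accuracy depend on the share of malicious files when recall and specificity stay fixed. The function name `metrics_at_prevalence`, the specificity of 0.80, and the deployment prevalence of 0.1% are assumptions chosen for illustration; none of them is measured on the dataset described above.

```python
def metrics_at_prevalence(recall, specificity, prevalence):
    """Precision and accuracy of a classifier with fixed recall and
    specificity, on a population where `prevalence` is the fraction
    of malicious (positive, label 1) files."""
    tp = recall * prevalence                    # malicious, flagged
    fp = (1 - specificity) * (1 - prevalence)   # good, wrongly flagged
    tn = specificity * (1 - prevalence)         # good, passed
    precision = tp / (tp + fp)
    accuracy = tp + tn
    return precision, accuracy


# Assumed operating point: recall 0.90, specificity 0.80 (hypothetical).
RECALL, SPECIFICITY = 0.90, 0.80

# Class balance of the dataset in the question: 3000 malicious, 800 good.
dataset_prevalence = 3000 / (3000 + 800)
# Hypothetical deployment balance: about 1 malicious file per 1000.
deployment_prevalence = 0.001

for name, p in [("dataset", dataset_prevalence),
                ("deployment", deployment_prevalence)]:
    precision, accuracy = metrics_at_prevalence(RECALL, SPECIFICITY, p)
    print(f"{name}: prevalence={p:.4f} "
          f"precision={precision:.4f} accuracy={accuracy:.4f}")
```

Under these assumptions recall is the same in both settings, but precision falls sharply once good files dominate, because even a modest false-positive rate applied to millions of good files produces far more false alarms than true detections. This suggests reporting precision (or a precision-recall curve) at the expected real-world class ratio in addition to recall and AUC.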
