I have a dataset of network connections (Kyoto 2006+) in which each connection is labeled as either attack or normal. (The label does not say which kind of attack it is.)
According to a paper whose results I want to reproduce, I need to RANDOMLY and FAIRLY choose a set of data that contains 1 percent attack and 99 percent normal connections, but the paper does not explain what FAIRLY means.
I split the data into attack and normal records and randomly chose 1% from the attack records and 99% from the normal records, but my results were very different from those of the base paper, and I think the problem is in the data selection.
My question: is data selection important when training an IDS (intrusion detection system), and how should I choose the training data so that it covers all kinds of attack and normal data?
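
To make the selection step concrete, here is a minimal Python sketch of one possible reading of the requirement: the final subset itself is 1% attack and 99% normal, drawn uniformly at random without replacement inside each class. This is only my interpretation, not necessarily what the paper did. The function name `sample_with_ratio`, its parameters, and the toy records are hypothetical placeholders rather than real Kyoto 2006+ data.

```python
import random

def sample_with_ratio(attack, normal, total, attack_fraction=0.01, seed=0):
    """Draw `total` records so that `attack_fraction` of them are attacks.

    Assumption: "fairly" is read as uniform sampling without replacement
    inside each class. The paper may mean something else.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    n_attack = round(total * attack_fraction)
    n_normal = total - n_attack
    if n_attack > len(attack) or n_normal > len(normal):
        raise ValueError("not enough records in one class for the requested size")
    # Sample each class separately, then shuffle so the classes are interleaved.
    subset = rng.sample(attack, n_attack) + rng.sample(normal, n_normal)
    rng.shuffle(subset)
    return subset

# Toy stand-in records (hypothetical): (label, record id).
attack = [("attack", i) for i in range(500)]
normal = [("normal", i) for i in range(20000)]

subset = sample_with_ratio(attack, normal, total=10000)
n_attack = sum(1 for label, _ in subset if label == "attack")
print(n_attack, len(subset) - n_attack)
```

Note that this is not the same as taking 1% of all attack records and 99% of all normal records: that procedure gives a subset whose attack share depends on the original class sizes, so it may be worth checking which of the two the paper intends.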