When you use the KDDCUP'99 dataset, do bear in mind that it was generated as part of a simulated cyber attack exercise, so it will not reflect what might happen in reality. If you are happy with that limitation, then go ahead and use the data. However, it illustrates one of the weaknesses in intrusion detection systems research: getting accurate intrusion data is a big challenge. It also exposes a weakness in the testing approach adopted by many researchers who propose new technical solutions to this problem.
Often, known intrusion techniques are just that - known. If you test a proposed solution using known techniques, all you achieve is to confirm that defences for known vulnerabilities work, which you already know. Likewise, you can prove that your new system can protect against these known vulnerabilities. The problem is, what about new exploits? How do you train your new system to deal with those? As far as I am aware, there is no known solution to this problem.
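One partial way to probe the "new exploits" question is to hide an entire attack class from training and test on it afterwards. This does not solve the problem, since the held-out attack is still a known one, but it at least separates "recognises what it was trained on" from "generalises a little". The sketch below is a minimal illustration under stated assumptions: the `(features, label)` record layout, the function names, and the use of `"normal"` as the benign label are all hypothetical choices, not part of any dataset's tooling.

```python
from collections import defaultdict


def leave_one_attack_out(records, normal_label="normal"):
    """Yield (held_out, train, test) splits, one per attack label.

    records: list of (features, label) pairs (assumed layout).
    All records of the held-out attack go to the test side, so the
    detector never sees that attack during training. Normal records are
    split alternately between train and test so that false positives
    can still be measured.
    """
    by_label = defaultdict(list)
    for features, label in records:
        by_label[label].append((features, label))

    normal = by_label.pop(normal_label, [])
    normal_train, normal_test = normal[0::2], normal[1::2]

    for held_out in sorted(by_label):
        train = list(normal_train)
        for label, rows in by_label.items():
            if label != held_out:
                train.extend(rows)
        test = normal_test + by_label[held_out]
        yield held_out, train, test


def detection_rate(detector, test, normal_label="normal"):
    """Fraction of attack records in `test` that `detector` flags as attacks."""
    attacks = [features for features, label in test if label != normal_label]
    if not attacks:
        return 0.0
    return sum(1 for features in attacks if detector(features)) / len(attacks)
```

Here `detector` is any callable that takes a feature tuple and returns True for "attack". You would train it on `train` for each split and report `detection_rate` on `test` per held-out class; a large drop compared with ordinary cross-validation suggests the system is mostly memorising known attacks.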
If, on the other hand, you want to consider a more realistic solution, there are a few viable approaches that might help. One would be to set up a honeypot server where you attempt to record live intrusions to see how they are perpetrated. You can then analyse these and add them to your new IDS. Another option would be to link up with some seriously good penetration testers, and have them attack your system. Remember, there are a huge number of known exploits, and who knows how many yet to be discovered, so the task you face will be vast. A third option would be to set up a honeypot server with your new IDS installed, which you re-build nightly from source, adding each new exploit as you discover it. It would be a slow, laborious process, but would be extremely effective in the long run.
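To give a feel for the honeypot idea, here is a very small sketch of a listener that accepts TCP connections, records the first bytes each client sends, and appends one JSON line per attempt to a log file. It uses only the Python standard library. The record fields, the function names `make_event` and `make_server`, and the log format are assumptions made for illustration; real honeypots emulate services in far more depth and should run on an isolated machine.

```python
import json
import socketserver
import time


def make_event(peer_ip, peer_port, payload):
    """Build one log record for a connection attempt (field names are assumed)."""
    return {
        "ts": time.time(),
        "src_ip": peer_ip,
        "src_port": peer_port,
        "payload_hex": payload.hex(),
    }


class HoneypotHandler(socketserver.BaseRequestHandler):
    """Record whatever the client sends first, then let the connection close."""

    def handle(self):
        self.request.settimeout(5.0)
        try:
            payload = self.request.recv(4096)
        except OSError:
            # Timeout or reset: still log the attempt, with an empty payload.
            payload = b""
        event = make_event(self.client_address[0], self.client_address[1], payload)
        with open(self.server.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(event) + "\n")


def make_server(host, port, log_path):
    """Create a threaded TCP server that logs every connection to log_path."""
    server = socketserver.ThreadingTCPServer((host, port), HoneypotHandler)
    server.log_path = log_path
    return server
```

Calling `make_server("0.0.0.0", 2323, "honeypot.jsonl").serve_forever()` would start listening on a port of your choosing (2323 is an arbitrary example). The resulting log lines can then be analysed offline and turned into labelled attack records.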
You might also want to check out the default settings for all the programs you are trying to protect using your new IDS. Most default settings are designed to make setup as user-friendly as possible. Unfortunately, this usually means bad news for security and privacy. Often laziness, lack of knowledge, or lack of motivation ends up making the attackers' job much easier.
You might find it useful to check the security breach reports produced by the big companies such as Verizon, Trustwave, PwC, Microsoft and so on. You should also check out the OWASP Top Ten vulnerabilities list (link below). They update the list roughly every three to four years to show what the most common attacks are, and have a lot more useful information on their site.
There is no doubt that much more research is needed in this area.
You may introduce learning, either supervised or unsupervised, into your proposed IDS algorithm. If you are testing your algorithm in dynamic networks such as MANETs, then you can always justify the learning there.
Based on what I've seen, I think Deep Learning has good potential for Intrusion Detection research. One question the research might answer is whether Deep Learning currently has real-time IDS capabilities, or whether it is better suited to forensic "after the fact" Intrusion Detection. Obviously, forensic capabilities are very important to the domain in general, even though some prefer to focus on real-time IDSs.
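One rough way to frame the real-time versus forensic question is to measure how long a detector takes per record and compare that with the rate at which records arrive. The sketch below does this for any Python callable; the function names and the single-threaded "one record must finish before the next arrives" budget are simplifying assumptions, and a real deployment would also have to account for batching, queueing and feature extraction time.

```python
import time


def measure_latency(detector, records):
    """Return the per-record scoring latency, in seconds, for each record."""
    latencies = []
    for record in records:
        start = time.perf_counter()
        detector(record)
        latencies.append(time.perf_counter() - start)
    return latencies


def fits_realtime_budget(latencies, records_per_second, percentile=0.99):
    """True if the chosen latency percentile is within the per-record budget.

    The budget is 1 / records_per_second: a single-threaded detector has
    to finish one record before the next one arrives.
    """
    if not latencies:
        raise ValueError("no latencies measured")
    budget = 1.0 / records_per_second
    ordered = sorted(latencies)
    index = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[index] <= budget
```

If a deep model fails this check at the traffic rate you care about, that does not rule it out; it simply points towards the forensic, after-the-fact role rather than inline detection.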
As for data sets, I think research in our domain has struggled due to the lack of recent, real-world, high-quality data sets. Unfortunately, I've seen all the different KDD data sets so heavily criticized, both in publications and in conversation, that I'd be hesitant to use them in experiments even though some say "they're the best we got".
Deep learning might be useful for more detailed attacker behaviour monitoring and analysis, as part of a larger security information and event management (SIEM) or data forensics solution that correlates several events across an organisation.
Here is a relevant paper on a combined artificial neural network/SVM/Fuzzy C-means clustering:
Bob Duncan's answer is excellent. I've encountered many of the challenges he's commenting on.
We've started building our own data sets and, unfortunately, Bob is correct that you need very specialized expertise (wow, it's not easy!).
And yes, honeypots are an excellent way to generate cybersecurity data sets. Our research group is active in generating data sets, but we haven't yet implemented honeypots, which are very interesting. The piece I'm still pondering is how to properly include honeypots in an experiment in a manner where the normal traffic is "realistic". The main issue seems to be how to design your experiment so that it also includes good-quality *normal traffic* alongside the excellent attack data you are getting. The Kyoto data set is a good starting point for honeypots.
Getting traffic that represents normal traffic is obviously a major challenge; however, it is not clear to me that this can be done at all - how do you define normality? The traffic patterns will vary highly across different use cases and industries.
Yes Nils, this is certainly a critical issue overall. I very much agree with you in 2 regards:
1) As you mention, the background "normal traffic" in any cybersecurity data set is very difficult to obtain, especially due to privacy and legal issues. As researchers overall, we lack a good *ground truth* in our data sets.
2) Yes, the point that *traffic patterns will vary highly across different use cases and industries* is extremely valid. I struggle with this a great deal, and there is no easy solution. Instead, maybe we should just try "baby steps" in this research and hope to achieve small wins in very limited cases and industries.
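As a toy illustration of why "normal" has to be defined per use case, the sketch below fits a separate mean and standard deviation for a single traffic feature in each environment and flags values that fall far from that environment's own baseline. Everything here is an assumption made for the example: the environment names, the feature (say, connections per minute), and the three-standard-deviation threshold. It is a baby step in the sense above, not a proposal for how a real IDS should model normality.

```python
import statistics


def fit_baselines(samples):
    """Fit one (mean, standard deviation) baseline per environment.

    samples: dict mapping an environment name to a list of observed values
    of a single traffic feature, e.g. connections per minute.
    """
    return {
        env: (statistics.mean(values), statistics.pstdev(values))
        for env, values in samples.items()
    }


def is_anomalous(baselines, env, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations
    from the mean observed for that environment."""
    mean, stdev = baselines[env]
    if stdev == 0:
        # No variation seen during fitting: anything different is unusual.
        return value != mean
    return abs(value - mean) / stdev > threshold
```

For example, with baselines fitted on `{"office": [10, 12, 11, 9, 13], "web_shop": [900, 1100, 1000, 950, 1050]}`, a value of 1000 is flagged for the office network but is perfectly ordinary for the web shop, which is exactly the difficulty with a single global notion of normal traffic.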
Online detection is very important, and any method, including deep learning, should be effective at it. Can anyone tell me which off-the-shelf or currently fashionable techniques are actually effective?