Logging is vital to the success of any IT project. With solid logging practice, you can troubleshoot errors, find patterns, calculate statistics, and communicate information easily. Given the size and complexity of modern systems, however, performing these actions requires a variety of analysis activities.
One of these important analysis activities is anomaly detection. What is anomaly detection, and where does it fit in all of this? That’s what this post is about. I’ll first present a succinct definition of what anomaly detection in log file analysis is. I’ll then explain the definition in detail, before discussing why it’s important for your business and introducing how it works.
Anomaly detection plays an important role in the management of modern large-scale distributed systems. Logs, which record system runtime information and errors, are widely used for anomaly detection.
Traditionally, operators go through logs manually, using keyword searches and rule matching. The increasing scale and complexity of modern systems, however, have caused log volumes to explode, making manual inspection infeasible. To reduce manual effort, we need anomaly detection methods based on automated log analysis.
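To make the traditional approach concrete, here's a minimal sketch of keyword and rule matching in Python. The rules and the system.log filename are illustrative assumptions, not part of any standard tool:

```python
import re

# Illustrative hand-written rules -- real operators maintain much larger,
# system-specific rule sets.
RULES = [
    re.compile(r"\bERROR\b"),
    re.compile(r"\bFATAL\b"),
    re.compile(r"out of memory", re.IGNORECASE),
]

def scan(log_path):
    """Yield (line_number, line) for every log line matching a rule."""
    with open(log_path) as f:
        for lineno, line in enumerate(f, start=1):
            if any(rule.search(line) for rule in RULES):
                yield lineno, line.rstrip()

# "system.log" is a placeholder filename for this sketch.
for lineno, line in scan("system.log"):
    print(f"{lineno}: {line}")
```

This works as long as someone keeps the rules current, which is exactly what stops scaling as systems grow.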
Raw log messages are usually unstructured text. To enable automated mining of unstructured logs, the first step is log parsing, which transforms raw, unstructured log messages into a sequence of structured events. Anomaly detection can then operate on these event sequences.
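As a simple illustration, here's a minimal parsing sketch that masks variable fields with regular expressions to recover an event template. Real log parsers such as Drain use more robust, tree-based matching; the masking rules below are assumptions tailored to the example HDFS message:

```python
import re

# Masking rules: replace variable fields with a <*> placeholder so that
# messages produced by the same print statement collapse to one template.
MASKS = [
    (re.compile(r"blk_-?\d+"), "blk_<*>"),                         # HDFS block IDs
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<*>"),  # IP[:port]
    (re.compile(r"\b\d+\b"), "<*>"),                               # other numbers
]

def to_template(message):
    """Reduce a raw log message to its structured event template."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message

raw = "Received block blk_3587508140051953248 of size 67108864 from /10.251.42.84"
print(to_template(raw))
# Received block blk_<*> of size <*> from /<*>
```

Once every message is mapped to a template like this, a log file becomes a sequence of event IDs that the later steps can work with.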
The process of log analysis for anomaly detection involves four main steps:
Log collection
Log parsing
Feature extraction
Anomaly detection
Important: The Python code to run the last three steps of the anomaly detection pipeline, as well as the log file used for the experiment, can be found on GitHub.
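Independently of that code, here's a minimal sketch of how the last two steps can fit together: each session of parsed events becomes an event count vector, and scikit-learn's IsolationForest flags sessions with unusual vectors. The session data and the contamination value are toy assumptions:

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import IsolationForest

# Toy sessions of parsed event IDs; s4 is designed with an unusual event mix.
sessions = {
    "s1": ["E1", "E2", "E3", "E2"],
    "s2": ["E1", "E2", "E3"],
    "s3": ["E1", "E2", "E3", "E2"],
    "s4": ["E1", "E5", "E5", "E5"],
}

# Feature extraction: build an event count matrix (sessions x event types).
event_types = sorted({e for seq in sessions.values() for e in seq})
X = np.array([
    [Counter(seq)[e] for e in event_types]
    for seq in sessions.values()
])

# Anomaly detection: fit an unsupervised model; -1 marks flagged sessions.
model = IsolationForest(contamination=0.25, random_state=0)
labels = model.fit_predict(X)
for session_id, label in zip(sessions, labels):
    print(session_id, "anomaly" if label == -1 else "normal")
```

Count vectors are only one possible feature representation; sequence-based models instead feed the ordered event IDs directly into the detector.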
Log-based Anomaly Detection with Deep Learning: How Far Are We?
Software-intensive systems produce logs for troubleshooting purposes. Recently, many deep learning models have been proposed to detect system anomalies automatically from log data, and they typically claim very high detection accuracy; for example, most models report an F-measure greater than 0.9 on the commonly used HDFS dataset. To achieve a profound understanding of how far we are from solving the problem of log-based anomaly detection, in this paper, we conduct an in-depth analysis of five state-of-the-art deep learning-based models for detecting system anomalies on four public log datasets. Our experiments focus on several aspects of model evaluation, including training data selection, data grouping, class distribution, data noise, and early detection ability. Our results show that all of these aspects have a significant impact on evaluation results, and that none of the studied models works well in every setting. The problem of log-based anomaly detection has not been solved yet. Based on our findings, we also suggest possible directions for future work.
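As a back-of-the-envelope illustration of the class distribution point, the snippet below uses toy numbers (not taken from the paper) to show how a detector with fixed recall and false positive rate scores a much lower F-measure once anomalies become rare:

```python
def f_measure(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The same hypothetical detector (90% recall, 1% false positive rate)
# evaluated on 100,000 sessions with anomalies at 10% vs. 0.1%.
for anomalies in (10_000, 100):
    normals = 100_000 - anomalies
    tp = int(0.9 * anomalies)   # anomalies caught
    fn = anomalies - tp         # anomalies missed
    fp = int(0.01 * normals)    # normal sessions misflagged
    p, r, f1 = f_measure(tp, fp, fn)
    print(f"{anomalies} anomalies: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

With 10% anomalies the F1 is around 0.90; with 0.1% anomalies it drops to roughly 0.15, even though the detector itself has not changed. This is one reason a high F-measure on a favorably balanced dataset can overstate real-world performance.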