Data scientists have access to a range of supervised techniques, which can be grouped by the type of problem they solve: classification and regression. Either can be used to analyse data and decide whether a transaction was genuine or fraudulent: a classifier predicts the label directly, while a regression model produces a score that is compared against a threshold. The supervised machine learning algorithms typically used to solve these problems are logistic regression, decision trees, random forests, and neural networks.
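Logistic regression is a popular method that models the probability of a binary outcome as a function of the input variables. Its coefficients indicate how strongly each variable is associated with that outcome, although they do not by themselves establish cause and effect. It can be used to build a model which predicts whether a transaction is ‘good’ or not.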
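To make the idea concrete, here is a minimal sketch of logistic regression trained by gradient descent in plain Python. The toy transactions, the two features (amount and a foreign-merchant flag), the learning rate and the function names are all assumptions made for illustration; a production system would normally rely on an established library instead.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function: maps any real number into (0, 1)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi                      # gradient of log-loss w.r.t. the logit
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def fraud_probability(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical toy data: [amount in thousands, 1 if foreign merchant else 0]
X = [[0.1, 0], [0.3, 0], [0.2, 1], [0.5, 0],
     [2.5, 1], [3.0, 1], [2.8, 0], [3.5, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]                  # 1 = fraudulent (assumed labels)

w, b = train_logistic(X, y)
p_small = fraud_probability([0.2, 0], w, b)   # small domestic payment
p_large = fraud_probability([3.2, 1], w, b)   # large foreign payment
print(f"small: {p_small:.3f}  large: {p_large:.3f}")
```

The output is a probability rather than a hard label, so the cut-off at which a transaction is flagged can be tuned to the cost of false alarms versus missed fraud.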
Decision trees can be used to create a set of rules that model customers’ normal behavior and can be trained, using examples of fraud, to detect anomalies.
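Random forests (a bagging technique) combine many weak classifiers into one strong classifier. They are built as an ensemble of decision trees, each trained on a random sample of the data and features, and the final prediction is the majority vote of the individual trees.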
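The ensemble idea can be sketched in a few lines of plain Python. This is a deliberately simplified, random-forest-style example: each "tree" is only a one-level decision stump, trained on a bootstrap sample and a randomly chosen feature, and the stumps vote. The toy data, feature meanings and function names are assumptions for illustration, and real random forests grow much deeper trees.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Best single-threshold rule on one feature.
    Returns (feature, threshold, label predicted when value > threshold)."""
    values = sorted({row[feature] for row in X})
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best = None
    for t in candidates:
        for high_label in (0, 1):
            errors = sum((high_label if row[feature] > t else 1 - high_label) != label
                         for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, t, high_label)
    return feature, best[1], best[2]

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample (with replacement)
        feature = rng.randrange(len(X[0]))         # random feature for this stump
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict(forest, row):
    # Majority vote across all stumps
    votes = Counter(high if row[f] > t else 1 - high for f, t, high in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: [amount in thousands, transactions in the last hour]
X = [[0.1, 1], [0.3, 2], [0.2, 1], [0.5, 2],
     [2.5, 6], [3.0, 8], [2.8, 7], [3.5, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]                       # 1 = fraudulent (assumed labels)

forest = fit_forest(X, y)
print(predict(forest, [0.2, 1]), predict(forest, [3.0, 8]))
```

Because each stump sees a different sample and feature, individual mistakes tend to be outvoted, which is the reason for combining many weak models.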
Neural networks are a powerful technique inspired by the workings of the human brain. Able to learn and adapt to patterns of normal behavior, neural networks can identify fraud in real time.
Unsupervised techniques do not need labelled examples of fraud. They group similar data points together, or model what normal data looks like, and are used for anomaly detection. Algorithms used in the unsupervised approach include K-means clustering, Local Outlier Factor and One-Class SVM.
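K-means clustering divides a dataset into a predefined number of clusters (k). The algorithm works iteratively: it assigns each data point to the nearest cluster centre, based on the features in the dataset, then recomputes each centre as the mean of the points assigned to it, and repeats until the assignments stop changing. Data points are thus clustered based on feature similarity.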
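The iteration can be sketched as follows in plain Python. The farthest-first initialisation, the toy points and the helper names are assumptions chosen to keep the example deterministic; random or k-means++ seeding is more common in practice. The last helper shows one possible way to turn clusters into an anomaly score, namely the distance to the nearest cluster centre.

```python
import math

def kmeans(points, k, max_iter=100):
    # Farthest-first initialisation (an assumption made for determinism)
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points,
                             key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:          # converged: assignments unchanged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

def distance_to_nearest_centroid(p, centroids):
    # Larger distance = less similar to any known cluster
    return min(math.dist(p, c) for c in centroids)

# Hypothetical, already-scaled transaction features
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(distance_to_nearest_centroid((4.5, 4.5), centroids))
```

A transaction that sits far from every centre, such as the point between the two groups above, is a candidate for closer inspection.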
Local Outlier Factor is an algorithm that calculates the local density of each data point and compares it with the density around its nearest neighbours. Using this locality concept, one can identify points with a much lower density than their neighbours. These points are outliers (potentially fraudulent transactions).
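A compact sketch of the Local Outlier Factor calculation in plain Python is given below. It follows the usual definition (k-distance, reachability distance, local reachability density), but simplifies tie handling and assumes there are no duplicate points; the toy data and the choice of k = 2 are illustrative assumptions.

```python
import math

def lof_scores(points, k=2):
    """Local Outlier Factor for each point; values well above 1 suggest an outlier.
    Simplified sketch: ties are broken by index and duplicate points are not handled."""
    n = len(points)
    neighbours, k_dist = [], []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        neighbours.append(dists[:k])          # k nearest neighbours as (distance, index)
        k_dist.append(dists[k - 1][0])        # distance to the k-th nearest neighbour
    # Local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average density of the neighbours relative to the point's own density
    return [sum(lrd[j] for _, j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
            for i in range(n)]

# Hypothetical data: five points close together and one far away
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = lof_scores(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

Points inside the dense group score close to 1, while the isolated point scores several times higher because its neighbours are far denser than it is.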
One-Class SVM learns a function used for novelty detection. The idea of novelty detection is to detect rare events, of which you therefore have very few samples. The usual way of training a classifier, which needs enough examples of every class, will then not work, so the model is instead trained on normal data only and flags anything that falls outside the region it has learned.
Although machine learning represents a huge leap forward compared to traditional methods of fraud detection, it is not without its limitations.
Machine learning models are only as good as the data they are provided with. While financial services have access to massive data sets, there are relatively few fraudulent transactions within these, which can reduce a system’s predictive capability. There are several approaches to dealing with this problem.
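One widely used approach is random oversampling, in which fraudulent examples are duplicated until the classes are balanced. The sketch below is only an illustration: the function name and toy data are assumptions, and other options such as undersampling the majority class or weighting classes during training may suit a given system better.

```python
import random

def oversample_minority(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every class
    has as many rows as the largest one. Apply to training data only."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_bal.append(row)
            y_bal.append(label)
    return X_bal, y_bal

# Hypothetical imbalanced data: six genuine transactions, two fraudulent
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [3.0], [3.4]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = oversample_minority(X, y)
print(y_bal.count(0), y_bal.count(1))
```

Resampling should be applied to the training split only, so that evaluation still reflects the true rarity of fraud.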