The use of the margin of SVMs (or a non-linear transformation of the margin, for example a sigmoid as in Platt's method) as a measure of confidence, while widely used, has no support at all from the theory. In other words, the margin can be an arbitrarily poor proxy for the confidence of a prediction.
The problem is the hinge loss. It is possible to show that when you use the hinge loss, even with an infinite amount of data, the margin will not converge to any measure of confidence or to the conditional probabilities.
On the other hand, it is enough to use the log-loss, the squared hinge loss, or even the modified Huber loss to have the guarantee that the margin will carry some confidence information. The exact mathematical details are in
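To make this concrete, here is a sketch of the standard population-level argument (my own notation, not taken from the reference): write $\eta(x) = P(y = +1 \mid x)$ for binary labels $y \in \{-1, +1\}$, and let $f^*$ denote the minimizer of the expected loss, i.e. the function the trained margin converges to with infinite data. Then

$$
\begin{aligned}
f^*_{\text{hinge}}(x) &= \operatorname{sign}\bigl(2\eta(x) - 1\bigr), \\
f^*_{\text{log}}(x) &= \log\frac{\eta(x)}{1 - \eta(x)} \quad\Rightarrow\quad \eta(x) = \frac{1}{1 + e^{-f^*(x)}}, \\
f^*_{\text{squared hinge}}(x) &= 2\eta(x) - 1 \quad\Rightarrow\quad \eta(x) = \frac{f^*(x) + 1}{2}.
\end{aligned}
$$

The hinge minimizer keeps only the sign of $2\eta(x) - 1$: how far $\eta(x)$ is from $1/2$ is lost in the limit, so no transformation of the margin can recover it. The log-loss and squared-hinge minimizers, by contrast, are invertible functions of $\eta(x)$.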
Several answers here have already discussed the theoretical aspects; this answer sheds light on the practical implementation in Python. Here, we need to use the "predict_proba" method, which computes the probability that a given datapoint belongs to a particular class using Platt scaling. You can check out the original paper by Platt (http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=92D78A0432AC435DC3DADF0A86A70E1D?doi=10.1.1.41.1639&rep=rep1&type=pdf). Basically, Platt scaling computes the probabilities with the following formula:
P(class | input) = 1 / (1 + exp(A * f(input) + B))
Here, P(class | input) is the probability that "input" belongs to "class", and f(input) is the signed distance of the input datapoint from the decision boundary, which is basically the output of "decision_function". We train the SVM as usual and then optimize the parameters A and B; the resulting P(class | input) always lies between 0 and 1. Bear in mind that the training procedure is slightly different when we want Platt scaling: a probability model has to be trained on top of the SVM, and to avoid overfitting this is done with n-fold cross-validation. So it is a lot more expensive than training a non-probabilistic SVM (like we did earlier). Let's see how to do it:
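The code from the original answer is not reproduced above, so what follows is a minimal sketch with scikit-learn's SVC; the toy dataset and hyperparameters are placeholders, not the ones used earlier in the thread.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for whatever dataset was used earlier.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True makes scikit-learn fit Platt scaling (the A and B above)
# on top of the SVM, using internal cross-validation -- this is the extra cost.
clf = SVC(kernel="rbf", C=1.0, probability=True, random_state=0)
clf.fit(X_train, y_train)

# f(input): signed distance of each test point from the decision boundary.
scores = clf.decision_function(X_test)

# Platt-scaled probabilities; columns are ordered as in clf.classes_.
probs = clf.predict_proba(X_test)

print(scores[:3])
print(probs[:3])
```

Because the Platt model is fitted with internal cross-validation on the training data, predict_proba can occasionally disagree with the sign of decision_function, especially on small datasets; scikit-learn's documentation warns about this.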