Sorry to shatter your illusion, but SVM is not the "best" classifier, as no such thing exists a priori.
However, SVMs do have some nice properties that make them highly suitable for general classification problems.
First, SVMs are known to have good generalization abilities. This is most easily visualized for the vanilla linear SVM case, where, loosely speaking, the separating hyperplane is placed between the two classes such that its distance to the closest training points of either class (the geometric margin) is maximized. This maximum-margin hyperplane has provably good generalization performance, as there is a priori no reason to move the hyperplane toward one class (in some rare circumstances, however, doing so might result in a better decision rule!).
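To make the maximum-margin picture concrete, here is a minimal sketch using scikit-learn (just one freely available toolbox; the toy data and the large C value are my own illustrative choices, not part of the argument above):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds (purely illustrative toy data)
rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2) - [2, 2], rng.randn(20, 2) + [2, 2]]
y = np.r_[np.zeros(20), np.ones(20)]

# A very large C approximates the hard-margin (vanilla) linear SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("separating hyperplane: w =", w, ", b =", b)
print("geometric margin width:", 2.0 / np.linalg.norm(w))  # width = 2 / ||w||
```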
Second, the use of slack variables makes it possible to handle real-world cases where the data cannot be perfectly separated by a hyperplane. In the nu-SVM formulation, the parameter controlling the trade-off between data fit and deviation from perfect linear separability has a particularly clear meaning: nu is an upper bound on the fraction of margin errors (and a lower bound on the fraction of support vectors) on the training set.
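As a hedged illustration of what nu means in practice (scikit-learn's NuSVC is used here as an example implementation; the data set and the value nu=0.2 are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

# Toy data with some label noise, so perfect separation is impossible
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.1, random_state=0)

clf = NuSVC(nu=0.2, kernel="rbf", gamma="scale").fit(X, y)

# Misclassified training points are a subset of the margin errors, whose
# fraction nu upper-bounds; the support-vector fraction is lower-bounded by nu.
print("training error:             ", 1.0 - clf.score(X, y))
print("fraction of support vectors:", clf.n_support_.sum() / len(X))
```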
Third, instead of only searching for a linear separating hyperplane, SVMs can make use of kernels, which makes it possible to move from linear hyperplanes to more general decision boundaries. This works via the so-called "kernel trick": the standard inner product between any two vectors is replaced by some other (conditionally) positive (semi-)definite bilinear form. This implicitly maps the original data into some (possibly infinite-dimensional) feature space, in which a linear separating hyperplane is computed. This flexibility may come at the cost of reduced generalization ability, as more complex decision boundaries are more prone to fitting, e.g., noisy or mislabeled data points.
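A small sketch of what the kernel buys you, again with scikit-learn as an example toolbox (the concentric-circles data and the gamma value are arbitrary illustrative choices): the identical SVM machinery separates data that no linear hyperplane in the original space could.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel, training accuracy:", linear_svm.score(X, y))  # typically ~0.5
print("RBF kernel,    training accuracy:", rbf_svm.score(X, y))     # typically ~1.0
```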
Fourth, in contrast to some other kernel-based methods, SVMs lead to a sparse representation of the decision boundary (our model): the boundary depends only on a subset of the data points, the so-called support vectors. Sparseness in terms of features can also be achieved by using an l1-norm regularizer.
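A quick sketch of both kinds of sparseness (scikit-learn again; the data dimensions and the C values are made-up examples): the kernel SVM stores only its support vectors, and an l1-penalised linear SVM additionally drives many feature weights to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

# Sparseness in terms of data points: only the support vectors define the boundary
svc = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("support vectors:", svc.n_support_.sum(), "of", len(X), "training points")

# Sparseness in terms of features: l1-regularised linear SVM
l1_svm = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=10000).fit(X, y)
print("non-zero feature weights:", int(np.sum(l1_svm.coef_ != 0)), "of", X.shape[1])
```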
Fifth, the SVM training problem is convex, i.e., there is a unique (global) solution for fixed hyperparameters (such as the kernel parameters or the trade-off parameter C/nu), in contrast to, say, back-propagation networks.
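One way to see the practical consequence of convexity (a rough sketch, not a proof; scikit-learn and the parameter values are illustrative assumptions): refitting on a permuted copy of the same data with the same hyperparameters yields essentially the same decision function, which one cannot expect from, e.g., a randomly initialized neural network.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
perm = np.random.RandomState(1).permutation(len(X))

f1 = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X, y).decision_function(X)
f2 = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X[perm], y[perm]).decision_function(X)

# The two decision functions should agree up to the solver's numerical tolerance
print("max |f1 - f2| =", np.max(np.abs(f1 - f2)))
```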
Sixth, due to the nice properties of the optimization objective, fast algorithms can be applied (e.g., SMO). Moreover, very good SVM toolboxes (LIBSVM, SVMlight) are freely available on the net.
But of course, there are also disadvantages to SVMs.
For example, hyperparameter tuning is cumbersome and time-consuming, especially if many parameters are involved. Often one has to resort to cross-validation on some hyperparameter grid, which involves repeatedly training and evaluating SVMs.
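The usual remedy looks roughly like the following (scikit-learn's GridSearchCV is used as an example; the grid values are arbitrary): an exhaustive search over C and gamma with k-fold cross-validation, i.e., many SVM trainings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 4 x 4 grid with 5-fold CV -> 80 SVM trainings before the final refit
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best hyperparameters:    ", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```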
Further, SVMs lack a probabilistic interpretation, which might be desired for some purposes. However, although SVMs cannot be directly cast in a Bayesian formalism, there are post-hoc approaches (e.g., Platt's method) that estimate a normalized classification score p(y=+1|x) = 1 - p(y=-1|x).
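In practice this can look as follows (a sketch; scikit-learn is assumed, where SVC(probability=True) applies libsvm's built-in Platt scaling, while CalibratedClassifierCV fits the sigmoid explicitly as a post-hoc step):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Option 1: libsvm's internal Platt scaling (uses an internal cross-validation)
prob_svm = SVC(kernel="rbf", probability=True).fit(X, y)
print(prob_svm.predict_proba(X[:3]))   # columns: p(y=0|x), p(y=1|x)

# Option 2: explicit post-hoc sigmoid (Platt) calibration of a plain SVC
calibrated = CalibratedClassifierCV(SVC(kernel="rbf"), method="sigmoid", cv=5).fit(X, y)
print(calibrated.predict_proba(X[:3]))
```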
All in all, SVMs are clearly worth a try if the (asymptotically) cubic training time can be afforded and the number of hyperparameters is manageable. Note that the runtime grows with the number of support vectors, a number that depends strongly on the chosen hyperparameters and the problem complexity and is upper-bounded by the total number of training examples.
I personally feel there is nothing like the best classifier or best feature in the pattern recognition and machine learning field. Some classifiers work well for some problems/datasets and fail in other cases.
I have to agree, there is no such thing as the best classifier, though SVM has arguably been one of the most widely used approaches recently, due to some of its nice properties.
It all depends on the data and the pre-processing / feature representation.
On the other hand, sometimes you don't even know the feature space and all you're given is the distance matrix.
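For that situation, one possible sketch (assuming scikit-learn; the exp(-gamma * d^2) conversion used here is just one common heuristic and is only guaranteed to yield a proper kernel for Euclidean distances) is to turn the distance matrix into a similarity matrix and train an SVM with a precomputed kernel:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
D = pairwise_distances(X)            # stand-in for the distance matrix we are "given"

gamma = 0.1                          # arbitrary choice for this illustration
K = np.exp(-gamma * D ** 2)          # distance matrix -> kernel (similarity) matrix

clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))   # score also takes the Gram matrix
```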
Some easy-to-spot data properties that might greatly affect the choice of the classifier are class imbalance and presence of noise or mislabeling.
In any case, I just wanted to say that the overall accuracy is not the only thing that matters; one should also care about the interpretability of the models, since it is often important to explain the classification decisions on concrete examples.
I know the RVM very well. However, I would see this method as a Bayesian counterpart of the SVM rather than an SVM with Bayesian features. Otherwise, one could easily expand the list to also include the Informative vector machine, etc.
But these techniques have other properties: e.g., they are usually not framed as a convex problem, so convergence to a global optimum is not guaranteed, and they are often slower; on the other hand, they usually lead to even sparser models than those obtained from vanilla SVMs.
@Arturo: I actually do think that Niklas agrees with your point of view... he supports your view that there is no best classifier and that finding the right learning machine is problem-dependent. I also fully agree that pre-processing such as feature extraction and reduction is very important, since it influences the "well-posedness" of the classification problem. After all, classification only works if the (e.g.) vector space is expressive enough to realize the given mapping to class labels.
Firstly, I believe that the choice and the performance of a classifier depend on your problem. Usually, some classifiers require certain conditions that are not met in your particular problem, and as a result those classifiers do not perform well in your case.
SVM, however, has some advantages, as follows:
(1) There are no problems with local minima, because training reduces to a (convex) quadratic programming (QP) problem.
(2) There are few model parameters to select.
(3) The final results are stable and repeatable.
(4) SVM represents a general methodology for many pattern recognition (PR) problems: classification, regression, feature extraction, clustering, ... (see the sketch after this list).
(5) SVM requires comparatively little memory, since the trained model is determined by the support vectors alone.
(6) SVM provides a method to control complexity independently of dimensionality.
(7) SVMs have been shown (theoretically and empirically) to have excellent generalization capability.
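Regarding point (4), here is a hedged sketch of how the same margin-based machinery covers several of those problem types within one toolbox (scikit-learn is assumed; it exposes classification, regression and one-class/novelty estimation, while feature-extraction and clustering variants such as support vector clustering live in other packages):

```python
import numpy as np
from sklearn.svm import SVC, SVR, OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(100, 3)                                             # toy inputs

clf = SVC(kernel="rbf").fit(X, (X[:, 0] > 0).astype(int))        # classification
reg = SVR(kernel="rbf").fit(X, X[:, 0] + 0.1 * rng.randn(100))   # regression
nov = OneClassSVM(nu=0.1).fit(X)                                 # one-class estimation

print(clf.predict(X[:2]), reg.predict(X[:2]), nov.predict(X[:2]))
```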
I agree with the previous posts. SVM is not necessarily the best classifier. It also depends on the data you are working with: the number of samples, the number of features, the imbalance between positive and negative classes, and whether the problem is two-class or multi-class. I also agree that SVM is the most widely used classifier, and that is the reason I think most people consider it the best.
In my experience working with data where labels are scarce, unsupervised ML techniques like clustering have been far more effective than SVMs and NNs. That said, I agree with Clerot on the "No Free Lunch" concept; it is entirely data dependent.
A similar question came up in another part of the forum. I consider it an interesting future research topic to actually determine suitable ML approaches for a given novel type of data and learning problem.