The parameter nu is an upper bound on the fraction of training points that fall outside the estimated region. So I think decreasing nu can cause overfitting and increasing nu can cause underfitting. Is that correct?
I think it is not that simple; you also have to account for the kernel parameters. Take a look at Schölkopf's book "Learning with Kernels", chapter 8. There is a nice explanation with examples. By that definition, nu is an upper bound on the fraction of outliers and a lower bound on the fraction of SVs.
For example, if I use an RBF kernel I have to set gamma, and increasing gamma may cause overfitting. I'm wondering how nu behaves in this respect.
I think that if nu is an upper bound on the fraction of outliers, then increasing nu means we permit more samples to be outliers, which should lead to a simpler, more biased system and hence underfitting. Am I right?
Both of you are right. Frank's note not to forget the kernel hyperparameters is important, as they usually have a tremendous impact on the result, often much more than the parameter nu has. However, nu is there to tune the trade-off between overfitting and generalization, just as Sajjad said. As Frank already pointed out, nu is upper bounded by the fraction of outliers and lower bounded by the fraction of SVs and, as far as I recall, equals both in the limit. Sajjad's connection between the number of outliers and the generalization capability is also valid. The same holds for the number of SVs, as a small number of SVs points toward simpler decision boundaries and vice versa.
The role of nu can also be interpreted easily by looking at the optimization problem directly. Disregarding the inequality constraints, we have something like this:
min_w ||w||^2 + c*loss(X, y), with c = 1/(n*nu),
a usual generalization versus data-fit trade-off.
Taking small values of nu leads to a large c, meaning that misclassifications (accounted for in the loss function) have a large impact on the objective. In other words, fitting the data is more important than heading for simpler solutions (favoured by the regularizer ||w||^2). For large values of nu, the quadratic regularizer is more important and label deviations with respect to the ground-truth data are more likely to occur, hence simpler solutions with usually better generalization ability are preferred.
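A minimal sketch of this trade-off, assuming scikit-learn's NuSVC on synthetic binary data (the dataset and the nu grid are arbitrary choices of mine, not from the thread): a smaller nu corresponds to a larger c and typically a tighter fit on the training set.

from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

# synthetic two-class problem; any labelled dataset would do
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

for nu in [0.05, 0.2, 0.5]:
    clf = NuSVC(nu=nu, kernel="rbf", gamma="scale").fit(X, y)
    frac_sv = clf.n_support_.sum() / len(X)       # nu lower-bounds this fraction
    print(f"nu={nu:.2f}  train accuracy={clf.score(X, y):.3f}  SV fraction={frac_sv:.3f}")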
Thanks a million for your helpful comment and explanation; it is very clear.
The only thing I'm still thinking about is that if nu is a lower bound on the fraction of SVs, then increasing nu should increase the number of SVs, which may cause overfitting! But we have concluded that increasing nu should bring about underfitting.
Hi Sajjad. I just realized that I mixed up nu-SVM and one-class SVM in the above explanation. Hence the loss(X, y) is of course independent of y, since there is no y in the one-class context. The main objective in one-class SVM is to separate the data from the origin with some margin rho. The loss function here can be interpreted via the distances of outliers from the separating hyperplane (with margin rho), where outliers are only those points that fall on the wrong side of the hyperplane (i.e. that have a margin less than rho).
The optimization problem looks like:
min_{w,xi,rho} 0.5*||w||^2 + c*sum_i(xi_i) - rho
where the xi_i are the outlier distances from the hyperplane and c=1/(n*nu).
I.e. we want to maximize the margin rho (last term) while keeping solutions simple (first term) and the number of outliers relatively small (second term). The strength of the second term is inversely affected by the parameter nu (as described above), i.e. a small nu puts a large emphasis on keeping the outlier distances small, and so on.
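To make these bounds concrete, here is a small sketch assuming scikit-learn's OneClassSVM on purely synthetic data (the nu values are picked arbitrarily): the fraction of training points predicted as outliers should stay roughly below nu, while the fraction of support vectors should stay above it.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(500, 2)                 # "normal" training data only, no labels y

for nu in [0.05, 0.1, 0.3]:
    oc = OneClassSVM(nu=nu, kernel="rbf", gamma="scale").fit(X)
    frac_outliers = np.mean(oc.predict(X) == -1)   # points on the wrong side of the hyperplane
    frac_sv = len(oc.support_) / len(X)            # margin points plus outliers
    print(f"nu={nu:.2f}  outlier fraction={frac_outliers:.3f}  SV fraction={frac_sv:.3f}")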
Thanks for your question on whether "increasing nu brings overfitting". Well, again, it is my fault, since my explanation was a bit flawed. A large number of SVs only points toward overfitting in the hard-margin binary SVM case, as SVs are there the points that uniquely define the hyperplane (i.e. those lying directly on the margin). In the (popular) soft-margin case, however, SVs are those margin points plus all outliers.
The latter carries over to one-class SVM and also suggests where the above-mentioned lower bound might come from.
One last comment: I am not sure if the terms under- and overfitting should be used in this context, as we effectively have no response variables y to fit.
I'm trying to support our discussion by implementing a practical example on some data sets. Regarding your last comment, I assume that I have access to a validation set that contains samples from two classes, normal and abnormal.
In my experiments, decreasing the parameter nu pushes the system toward overfitting: as nu decreases, the training error decreases and the validation error increases.
Well, a decrease in training error is to be expected when decreasing nu, since more and more training points fall on the correct side of the hypersphere (fewer outliers due to the above-mentioned upper bound). But assuming too small an outlier ratio can easily result in many negatives also falling on this side of the hypersphere, causing a higher test error.
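For what it's worth, here is a hedged sketch of that kind of experiment, assuming scikit-learn's OneClassSVM and synthetic data (the "abnormal" validation samples are simply drawn from a wider uniform distribution; that choice, the nu grid and gamma are stand-ins of mine, not your setup).

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(500, 2)                             # normal samples only
X_val = np.vstack([rng.randn(200, 2),                   # held-out normal samples
                   rng.uniform(-4, 4, size=(200, 2))])  # crude stand-in for abnormal samples
y_val = np.hstack([np.ones(200), -np.ones(200)])        # +1 = normal, -1 = abnormal

for nu in [0.01, 0.05, 0.1, 0.3]:
    oc = OneClassSVM(nu=nu, kernel="rbf", gamma="scale").fit(X_train)
    train_err = np.mean(oc.predict(X_train) == -1)      # rejected training points
    val_err = np.mean(oc.predict(X_val) != y_val)       # misclassified validation points
    print(f"nu={nu:.2f}  train error={train_err:.3f}  validation error={val_err:.3f}")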
Jaakkola's heuristic is a nice approach for estimating gamma, and it is more principled than using 1/n as many packages do when providing an automatic estimate for this parameter.
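For reference, a quick sketch (my own, not from the thread) of a median-distance variant of that heuristic: Jaakkola's original version takes the median distance from each point to its nearest neighbour in the opposite class; with only one class available, the median over all pairwise distances is a common stand-in.

import numpy as np
from scipy.spatial.distance import pdist

def jaakkola_gamma(X):
    # sigma = median pairwise Euclidean distance, gamma = 1 / (2 * sigma^2)
    sigma = np.median(pdist(X))
    return 1.0 / (2.0 * sigma ** 2)

rng = np.random.RandomState(0)
X = rng.randn(500, 2)
print("heuristic gamma:", jaakkola_gamma(X))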