In a multilayer perceptron we generally use a sigmoid activation function in the hidden layer and a linear activation function in the output layer. What happens when a sigmoid function is used in the output layer, too?
If you use a sigmoid function in the output layer, you can train and use your multilayer perceptron to perform regression instead of just classification: the output layer produces continuous values (within the sigmoid's range) instead of binary ones. In the context of classification this can also be useful if you want a measure of the confidence of your classification.
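As a minimal sketch of the regression use (the scaling scheme and all names here are illustrative assumptions, not part of the answer above): since a logistic sigmoid output lives in (0,1), regression targets have to be rescaled into that range before training, and predictions rescaled back afterwards:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical regression targets in their natural units.
y = np.array([12.0, 35.0, 7.5, 28.0])

# Map targets into [0.1, 0.9] -- a margin inside (0, 1), since a
# sigmoid can only approach 0 and 1 asymptotically.
lo, hi = 0.1, 0.9
y_scaled = lo + (hi - lo) * (y - y.min()) / (y.max() - y.min())

# At prediction time, invert the scaling to recover the original units.
net_out = sigmoid(np.array([0.5, -1.0]))  # example sigmoid outputs
y_pred = y.min() + (net_out - lo) / (hi - lo) * (y.max() - y.min())
print(y_pred)
```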
A sigmoid activation function is bounded below and above. For instance, the logistic sigmoid function has range (0,1) and the hyperbolic tangent has range (-1,1). We often use a sigmoid activation function in the output layer when we are dealing with a classification problem rather than a regression problem, that is, when the output target is categorical.
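Concretely, the two functions are

$$\sigma(x) = \frac{1}{1+e^{-x}} \in (0,1), \qquad \tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \in (-1,1).$$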
It is common practice to use one output unit per class, and if we use a logistic sigmoid activation function for the output layer, the result is often interpreted as the posterior probability of the class given the input, p(c|x). Thus you can see the multilayer perceptron as a discriminative model, like logistic regression. My experience is that this approach is powerful for classification purposes, but it sometimes results in an ill-calibrated model. To avoid this problem you can also look at the softmax activation function.
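The softmax generalizes the logistic sigmoid to $K$ mutually exclusive classes and forces the outputs to sum to one, so they can be read jointly as a posterior distribution. With $a_k$ the pre-activation of the $k$-th output unit,

$$p(c_k \mid x) = \frac{e^{a_k}}{\sum_{j=1}^{K} e^{a_j}},$$

which reduces to the logistic sigmoid for $K = 2$.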
I'm a bit confused. The output contains the estimated values, which are then compared to the targets; during training, the weights of the MLP are adjusted in order to minimize the difference between output and target. If the output is transformed by some function, then so should the targets be; in the end we carry out some nonlinear weighting of the goodness, or cost function. This seems to me a matter of convenience that depends on the problem.
The statement isn't correct: in multilayer perceptrons the units are sigmoid functions, e.g. in the two-layer perceptron that can represent the XOR function. So it's rather the other way around: since the sigmoid units realize a classification task, one should instead ask what a linear output function is actually useful for, since such cases seem much more specialized. And, of course, through a classification scheme it is possible to generate probability distributions on target spaces, though the efficiency is another issue.
As far as I remember, the sigmoid, for solving the XOR problem, is applied in the hidden layer, whereas the output should be treated in the same way as the target. Did I get something wrong?
Well, for solving the XOR problem, you need a hidden layer of two sigmoid units whose results are fed into another sigmoid unit, the output unit, which gives the answer. So all units are sigmoid. You could, of course, use any activation function, but for XOR, at least, you need to state the mapping, which amounts to giving a threshold, i.e. a sigmoid function.
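Here is a minimal sketch of such a network with hand-picked weights (the particular values are illustrative assumptions; training would find others), showing that all three units are sigmoids:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs: the four XOR cases, one per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: two sigmoid units acting as soft OR and soft AND gates.
# Steep weights make the sigmoids behave almost like hard thresholds.
W_hidden = np.array([[20.0, 20.0],    # weights into the "OR" unit
                     [20.0, 20.0]])   # weights into the "AND" unit
b_hidden = np.array([-10.0, -30.0])   # OR fires for sum >= 1, AND for sum >= 2

# Output layer: one sigmoid unit computing "OR and not AND", i.e. XOR.
w_out = np.array([20.0, -20.0])
b_out = -10.0

H = sigmoid(X @ W_hidden.T + b_hidden)  # hidden activations
y = sigmoid(H @ w_out + b_out)          # output activations

print(np.round(y, 3))  # approximately [0, 1, 1, 0]
```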
Both the logistic sigmoid function and the hyperbolic tangent represent a balance between linear and non-linear behavior. However, the logistic sigmoid takes only positive values, and that is a disadvantage for the network: its outputs are not zero-centered, which requires shifting the thresholds of the activation functions to mitigate it.
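In fact the two functions are rescaled versions of one another; the hyperbolic tangent is just the logistic sigmoid stretched and shifted so that it becomes zero-centered:

$$\tanh(x) = 2\,\sigma(2x) - 1.$$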