Choosing activation functions for an MLP depends on a couple of things, as discussed below.
1. The activation function for the output layer depends on whether you are performing classification or regression. For binary classification (i.e. two-class problems), the logistic-sigmoid function is used with binomial cross-entropy as the cost function. For multiclass classification (i.e. problems with more than two classes), the softmax function is used with multinomial cross-entropy as the cost function. For regression problems (i.e. real-valued outputs), the linear/identity function is used (see the first sketch after this list).
2. For hidden-layer units, the options depend on the depth of your model:
(a) For shallow models (1 or 2 hidden layers), the logistic-sigmoid, tangent-sigmoid or rectified linear function can be used; choosing the most appropriate one among these is usually a matter of experimentation.
(b) For deep models (i.e. more than 2 hidden layers), the rectified linear (ReLU) function is more appropriate, since it alleviates the vanishing-gradient problem (see the second sketch after this list).
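As a rough sketch of the output-layer/loss pairings from point 1, here in PyTorch (the feature and class counts are illustrative placeholders, not recommendations):

    import torch.nn as nn

    n_features = 16  # illustrative input dimension

    # Binary classification: one logit per example. BCEWithLogitsLoss fuses
    # the logistic-sigmoid with binary cross-entropy for numerical stability.
    binary_head = nn.Linear(n_features, 1)
    binary_loss = nn.BCEWithLogitsLoss()

    # Multiclass classification: one logit per class. CrossEntropyLoss applies
    # log-softmax internally, i.e. softmax + multinomial cross-entropy.
    n_classes = 5
    multiclass_head = nn.Linear(n_features, n_classes)
    multiclass_loss = nn.CrossEntropyLoss()

    # Regression: linear/identity output with a squared-error cost.
    regression_head = nn.Linear(n_features, 1)
    regression_loss = nn.MSELoss()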
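And a sketch of point 2(b): a deeper MLP with ReLU after every hidden layer (the layer widths are arbitrary, chosen only for illustration):

    import torch.nn as nn

    n_features = 16  # illustrative input width

    # Deep MLP per 2(b): ReLU in every hidden layer to limit vanishing
    # gradients; the final layer is a linear/identity head (e.g. regression).
    deep_mlp = nn.Sequential(
        nn.Linear(n_features, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 1),
    )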
I've used an MLP (ANN) for prediction. I gave up on two hidden layers, as I couldn't reach better results than with one hidden layer. I was told that in most cases a linear activation in the hidden layer works best. I checked many combinations of different parameters, including different activation functions: saturating linear (satlins in MATLAB), logistic, and hyperbolic tangent (tansig). In the end, the best prediction accuracy was achieved with a linear activation function in the hidden layer and tansig in the output layer. So it was many, many experiments in my case.
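For what it's worth, a minimal sketch of that kind of activation sweep, written in PyTorch rather than MATLAB; the data, layer widths, and training settings below are placeholders, not the poster's setup:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Placeholder regression data; substitute your own dataset. Targets are
    # scaled into (-1, 1) so a tanh output layer can actually reach them.
    X = torch.randn(200, 8)
    y = torch.tanh(torch.randn(200, 1))

    # Candidate activations, loosely mirroring MATLAB's purelin/satlins,
    # logsig, and tansig transfer functions.
    hidden_acts = {"linear": nn.Identity, "logistic": nn.Sigmoid, "tanh": nn.Tanh}
    output_acts = {"linear": nn.Identity, "tanh": nn.Tanh}

    for h_name, h_cls in hidden_acts.items():
        for o_name, o_cls in output_acts.items():
            model = nn.Sequential(nn.Linear(8, 10), h_cls(),
                                  nn.Linear(10, 1), o_cls())
            opt = torch.optim.Adam(model.parameters(), lr=1e-2)
            for _ in range(200):  # short, illustrative training loop
                opt.zero_grad()
                loss = F.mse_loss(model(X), y)
                loss.backward()
                opt.step()
            print(f"hidden={h_name:8s} output={o_name:8s} final MSE={loss.item():.4f}")

Note that a tanh output can only produce values in (-1, 1), so it only makes sense when the targets have been scaled into that range; presumably that was the case here, since MATLAB's network training typically applies mapminmax scaling to inputs and targets by default.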