The choice of activation function when training an artificial neural network is critical. Different activation functions can give you much better results on different problems.
It depends on the CNN model you are using. The best way is to check the state-of-the-art relevant to your problem.
SELU (scaled exponential linear unit) is considered a good choice in most cases. Its self-normalizing property keeps the inputs to the next layer close to zero mean and unit variance.
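As a rough, minimal sketch (assuming only NumPy; in practice you would use your framework's built-in SELU activation), the function and its self-normalizing tendency look like this:

```python
import numpy as np

# SELU constants from Klambauer et al. (2017)
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    """Scaled exponential linear unit:
    scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise."""
    return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1))

# For roughly standard-normal pre-activations, the outputs stay
# close to zero mean and unit variance (the self-normalizing fixed point).
x = np.random.randn(100000)
y = selu(x)
print(round(y.mean(), 3), round(y.std(), 3))
```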
It depends on the network you are using. You can try several activation functions and compare the errors on the training, validation, and test subsets, while also monitoring overall network performance.
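As an illustration of that comparison, here is a minimal sketch assuming TensorFlow/Keras, with a small MNIST subset standing in only as a placeholder for your own data; the architecture, activations tried, and epoch count are arbitrary choices for the example:

```python
import numpy as np
from tensorflow import keras

# Small MNIST subset purely to illustrate comparing activations on a validation set
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_tr, y_tr = x_train[:10000], y_train[:10000]
x_val, y_val = x_train[10000:12000], y_train[10000:12000]

def build_model(activation):
    return keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(128, activation=activation),
        keras.layers.Dense(10, activation="softmax"),
    ])

results = {}
for act in ["relu", "tanh", "selu", "elu"]:
    model = build_model(act)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    hist = model.fit(x_tr, y_tr, validation_data=(x_val, y_val),
                     epochs=3, batch_size=128, verbose=0)
    results[act] = hist.history["val_accuracy"][-1]

print(results)  # keep the activation with the best validation accuracy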
My suggestions for this question are as follows:
If possible, choose the hyperbolic tangent (tanh) as the activation function, for the following reasons (see the short sketch below): (i) it and its derivative are smooth nonlinearities, which makes the signal processing better behaved; (ii) since it is bipolar, the amplitude of the processed signal can be kept smaller, which reduces uncertainty for the same problem and therefore improves robustness; (iii) convergence of the weight learning (training) can be faster, thanks to the smaller signal amplitude, the bipolar output, and the reduced uncertainty.
Nevertheless, its effective output region should be kept to about 60% of its maximum, and hardware implementation (e.g., on an FPGA) is more difficult.
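For reference, a tiny NumPy sketch of tanh and its derivative (assuming nothing beyond NumPy) illustrates the smooth, bipolar behaviour described above:

```python
import numpy as np

def tanh_and_derivative(x):
    """tanh is smooth and bipolar (output in (-1, 1));
    its derivative 1 - tanh(x)**2 is also smooth and largest at 0."""
    t = np.tanh(x)
    return t, 1.0 - t**2

x = np.linspace(-3, 3, 7)
t, dt = tanh_and_derivative(x)
print(np.round(t, 3))   # bipolar outputs, saturating toward -1 and +1
print(np.round(dt, 3))  # smooth derivative, maximal near the origin
```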
I would initially suggest the sigmoid function, after reading this: https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0. Regards
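For reference only (not taken from the linked article), a minimal NumPy sketch of the logistic sigmoid:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # approximately [0.0067 0.5 0.9933]
```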
Waqed Al-Mussawi Activation functions are a type of hyperparameter, and you will need to experiment with them to determine which works best for your problem (a kind of trial and error). Besides this, you can narrow your search by referring to earlier work in the field of your particular problem. For example, it has already been shown that tanh activations provide better results for image classification, while leaky ReLUs offer satisfactory performance for temporal sequences such as video.
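Purely as an illustration of the leaky ReLU mentioned above (the negative slope of 0.01 is a common default, not a recommendation from this thread), a minimal NumPy sketch:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: identity for positive inputs, a small linear slope
    for negative inputs so the gradient never vanishes entirely."""
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))  # [-0.02 -0.005 0. 0.5 2.]
```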