In a feed-forward neural network (NN), which type of neurons (linear or log-sigmoidal) must be placed in the hidden layers for pattern recognition and for regression ?

02 February 2013 13 2K Report

Pattern recognition: hidden layer of linear followed by a log-sigmoidal layer

Regression: hidden layer of log-sigmoidal followed by a linear layer

Is this scheme correct for basic examples?

The type of neurons used in NN depends on your data. If NN can approximate the dependency in the data using linear neuron, then it is better to use linear neurons. Log-sigmoidal neuron can also approximate the linear dependency in the data but you may overlearn, which is not advisable. So it is always recommended to first test with linear neuron in the hidden layer. Also, one hidden layer is generally sufficient to learn the dependency in the data. This makes the training process easy, fast, and prevents computational over burden.

Michael de Paly

I'm assuming that you are talking about typical feed forward networks, where all neurons in a layer have a connection to all neurons of the next layer. In this case linear neurons in a hidden layer of do not add anything to the capabilties of the neural network. Since they only do a linear projection they can be easily folded into the input weights of the neurons in the next layer. Linear neurons in a hidden layer may actually have a negative impact on the networks performance because they add additional weights that have to be trained .

Serban C. Musca

Hi all,

I am sorry to say but I disagree with both previous answers.

@ Vivek:

Most of the time, with linear activation functions you'll get into trouble for all those values that are very high or very low. The interest of using a sigmoid function (namely a logistic function) to compute the output of an unit from all the inputs it receives consists precisely in the fact that for very high and positive input values the output will approach the value of 1 but only asymptotically (and, likewise, for very high and negative input values the output will approach the value of 0 but only asymptotically). The beauty of the use of a logistic activation is that very big differences in the input near the upper (and likewise, lower) bound will make practically no difference in the output, while, using the derivative to compute during the backpropagation step, a very little difference near the upper (or lower) bound will be associated to a big difference in the backpropagated error. To convince yourself of this, just plot a logistic function with x on ]-infinity, +infinity[ and look at f(x), which can be made to be bound on ]0;1[.

So, I'll just always avoid linear activation functions for the hidden layer neurons.

@Michael:

"In this case linear neurons in a hidden layer of do not add anything to the capabilties of the neural network. Since they only do a linear projection they can be easily folded into the input weights of the neurons in the next layer." You are wrong on this. The capacity for a nonlinear separation depends on the NUMBER of the hidden layer neurons, not on their activation function. You can just as well get a neural net to do a nonlinear separation if you endow it with plenty of linear function-operating hidden layer neurons. The difference with the same neural net architecture that would have sigmoid function-operating hidden layer neurons is that the one with the linear function-operating hidden layer neurons will much more often fail to learn because its cost function would lead much more often to local minima where the things would get stuck.

[EDIT (see Michael's answer and mine below)]: I had it all wrong. Of course, a linear combination of linear relationships won't get you nonlinear separations ... so one does need a nonlinear activation function for the neurons in the hidden layer to enable the ANN to solve a problem that involves a nonlinear separation of classes.

@Josep:

For pattern recognition: use a 3-layer artificial neural network (ANN), with linear activation function for the input layer units and logistic activation function for the hidden layer units (as for the output layer units, you could use any of the two).

For regression: I'm no expert on this, and I thus even wonder why you'd use an ANN to make a regression, so the only interest I can see here is for carrying a nonlinear regression. In this case, you must use a linear activation function for both the input layer and the output layer units and a logistic activation function for the hidden lay...

Rao V. Vemuri

There is no point in using linear units in the hidden layer. What type of non-linearity do we use in the hidden layer? A sigmoid that varies between 0 and 1 or a hyperbolic tangent that varies between -1 and +1 are the popular choices. Make sure all the hidden units are of the same type. A standard Back Propagation algorithm tends to converge faster with -1, +1 units compared to 0, 1 units. This is a first cut.

Josep L Rossello

OK, if I don't understand badly, a layer of linear neurons previous to a sigmoidal one, does not provide additional processing capabilities since each neuron at the next layer is already doing a linear transformation of incoming information. Therefore, the only sense would be as an output layer in a network addapted for regression purposes (as an example) but for patter recognition they are unuseful.

Serban C. Musca

Quite so, indeed, Josep. If you think of a 3-layer (input, hidden, output) net, you can have a linear activation function (for instance, the identity function) on all the input layer neurons, since they just pass the information onward. But for the units in the other two layers (and necessarily for the hidden layer units) you need sigmoidal activation functions (I say functions in plural because you could use a logistic activation function for all neurons but with a parameter that would vary between the neurons -- or, just as well a unique sigmoidal activation function for all your neurons that are not in the input layer).

For regression purposes, you'd rather use sigmoidal activation function(s) only for the hidden layer neurons, because otherwise the sigmoidal activation function(s) would transform your input-output data, which is not what you'd want (otherwise, you would have to to a complementary step and express the whole thing into your original data space by using the derivative of the activation function -- which is something like scratching your left ear with your right hand). Not to mention that this would be quite challenging if you opted for a different parameter for each sigmoidal activation function (something that I do, because it generates more randomness and somehow "helps" -- though not necessarily in a significant amount, so take it as a superstition of mine).

HTH,

SCM

Michael de Paly

@Serban I have to disagree with your response.

"The capacity for a non-linear separation depends on the NUMBER of the hidden layer neurons, not on their activation function. " is not correct in the case of linear activation functions as one can easily prove by "folding" the linear neuron into the weights of the overlying layer. That means, it does not matter for the capabilities of the network if the linear neuron in the hidden layer is there or not. Just consider $k = 1...n$ linear neurons $h_k$ in the hidden layer with their outputs $h_k = \sum_i(w_{ki}a_i)$ where $w_{ki}$ are the weights of their $i = 1...m$ input neurons with their activation $a_i$. The activation of a neuron $o$ in the output layer (or in an additional overlaying hidden layer) can now be expressed as $o = f_{activation}(\sum_k( w_{ok}h_k)) = f_{act}(\sum_k( w_{ok}\sum_i (w_{ki}a_i)))$ where $f_{activation}$ is the activation function of $o$. I can now easily remove the hidden linear neurons $h_k$ and connect $o$ to the neurons $a_i$ with the new weights $w_{oi} = w_{ok}\sum_k(w_{ki})$ resulting in the activation of $o = f_{activation}(\sum_i(w_{oi}a_i))$ which is mathematically equivalent to the original network that contained the hidden linear neurons $h_k$.

Since you can do the same thing with all hidden linear neurons, a network of only linear neurons in the hidden layers collapses into a network that only consists of its output and input neurons. If the input layer also uses linear activation functions (which is more or less the standard case) the resulting network is actually only capable of linear separation contradicting your statement: "You can just as well get a neural net to do a non-linear separation if you endow it with plenty of linear function-operating hidden layer neurons."

Serban C. Musca

@Michael: You were right, all apologies! Looks like I've said something awfully stupid :s Of course you are right: a linear combination of linear relationships won't get you nonlinear separations ...

I haven't understood any of your formulas though, is that latex code?

Anyway, thank you for having corrected my mistake! (It's been time since I last thought about linear activation hidden layer neurons ...)

Cheers,

SCM

Michael de Paly

No harm done :)

Yes thats latex code. Unfortunately I don't know a good way to put formulas into postings, so I used the way I usually use to put formulas into papers. If there is a better way to do it on Researchgate I would be happy to know about it.

Muhammad Arif

Selection of type of activation function depends on data.

Achraf El Allali

SVM should perform better than NN in a classification problem. For regression you can use the SVR approach.

Ramesh Babu.N

Choose log-sigmoidal function in hidden units to address the nonlinearity in your input data & use linear function in the output layer. If your input is linear then no point in choosing a neural network-it will still worse your classification.

Moreover, agreeing with Allali-SVM will give better classification results than that of a feed forward network.

Amit Kumar

Log-sigmoidal or tan-sigmoidal neurans are useful.

Will deep learning put an end to art?

In pattern recognition applications, which function for the Neural Network output layer is the best, linear or log-sigmoidal?

Which software tools are best for enhancing diagnostic accuracy in chest X-ray imaging using image reconstruction and neural networks?

What are the current challenges and future prospects of integrating artificial intelligence into recognition systems for autonomous vehicles?

Help me download paper?

How can I extract the mathematical equation from existing Neural Network Model?

What is the current status of augmented learning in robotic surgery?

How can I improve the purity of NPC cultures derived from human iPSCs during neural rosette selection?

What is the difference between opportunity recognition in entrepreneurship literature and sensing in dynamic capabilities theory?

Is it possible to use neural network models for prediction if the sample size for the time series is very small??

I am working on a network for facial expretion recognition and I have problem with the loss function can anyone help?

What is information diffusion in the social network?How a message got viral in social network?