The methodology used to determine the "optimal" number of hidden neurons is based on Structural Risk Minimization (see The Nature of Statistical Learning Theory, V. Vapnik).
You test a sequence of architectures such that each one is nested in the next. For instance, 10 hidden neurons \subset 20 hidden neurons \subset ... \subset 100 hidden neurons.
You evaluate the performance of each architecture using a held-out validation set or k-fold cross-validation. Once the validation error starts increasing, you can stop and choose the architecture with the lowest validation error.
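To make the procedure concrete, here is a minimal sketch, assuming scikit-learn's MLPClassifier and a data set X, y; the candidate sizes and the 5-fold setting are placeholders:

# Minimal sketch of the nested-architecture search described above,
# assuming scikit-learn is available and X, y hold the training data.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def select_hidden_size(X, y, candidates=range(10, 101, 10), k=5):
    best_size, best_error = None, np.inf
    for size in candidates:                      # each architecture is nested in the next
        model = MLPClassifier(hidden_layer_sizes=(size,), max_iter=2000, random_state=0)
        error = 1.0 - cross_val_score(model, X, y, cv=k).mean()   # k-fold validation error
        if error < best_error:
            best_size, best_error = size, error
        elif error > best_error:                 # validation error has started to rise: stop
            break
    return best_size, best_error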
@Raphael: connecting model selection in neural networks and the Structural Risk Minimization approach is a very intriguing observation.
However, k-fold cross-validation can be considered only a heuristic approximation to SRM. To perform SRM correctly, you need a way of computing the VC dimension of a neural network, but in practice you can only compute some bounds. Moreover, the training process of a neural network involves some regularizing effects which implicitly change the solution with respect to the theoretical one (e.g. choosing small initial weights). Hence, I don't think SRM can be considered a realistic criterion for model selection in this context.
One rule of thumb to be observed is that the number of weights to be optimized should be less than the number of training examples.
Usually one hidden layer (possibly with many hidden nodes) is enough, occasionally two is useful.
Practical rule of thumb: if n is the number of input nodes and m is the number of hidden nodes, then for binary/bipolar data m = 2n, and for real-valued data m >> 2n.
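A quick sketch of the weight-counting rule above, assuming a single hidden layer with biases; the problem sizes used here are placeholders:

# Rough check of the rule of thumb above: the number of free weights (including
# biases) in a single-hidden-layer network should stay below the number of
# training examples. n_in, n_out, n_examples are placeholders to fill in.
def n_weights(n_in, m, n_out):
    return (n_in + 1) * m + (m + 1) * n_out      # input->hidden plus hidden->output

n_in, n_out, n_examples = 20, 1, 2000            # hypothetical problem size
m = 2 * n_in                                     # the m = 2n suggestion for binary/bipolar data
print(n_weights(n_in, m, n_out), "weights vs", n_examples, "examples")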
SRM gives us the shape of the upper bound on the risk.
Even if you have the VC dimension of your preferred model, SRM provides just an upper bound. It says nothing about the realisation of the risk on your set of examples, or about its expected value. We have to understand statistical learning theory as a guideline. You can use SRM only if you believe that the behavior of the upper bound of the risk mirrors that of the risk itself. This holds for MLPs as well as SVMs.
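For reference, the classical VC bound that SRM builds on states (roughly, for 0/1 loss) that with probability at least 1 - \eta, for every function f in a class of VC dimension h trained on N examples,

R(f) \le R_{emp}(f) + \sqrt{ \frac{h\left(\ln(2N/h) + 1\right) - \ln(\eta/4)}{N} }

It controls only the worst case over the class; it says nothing about the risk you actually realise on a particular sample, which is exactly the point above.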
Anyway, without additional knowledge it seems a reasonable assumption. Right?
You are right, SRM can be used (in principle) for any class of learning models, and it works using an upper bound on the expected risk. However, to apply SRM you need to be able to decompose your class of hypotheses into nested subsets (easy in the case of neural networks), but also to be able to compute the VC dimension of each subset. This is rather simple for a linear model (e.g. a separating hyperplane), hence the basic theory of SVMs can be derived from SRM (as Vapnik showed). In the case of a non-linear model, instead, the VC dimension is rather difficult to compute, and it can only be approximated under a set of assumptions. Additionally, if your optimization problem is non-convex, the algorithm you use to solve it can introduce some "biases" such that the resulting effective VC dimension differs from the theoretical one (early stopping and small weight initialization are both examples of this). So in practice applying SRM is very difficult for neural networks, and you have to resort to heuristic approaches that can only be considered approximations to the "true" SRM, such as k-fold cross-validation. This is only a subtle point, but it seemed interesting to me in relation to your answer.
It is true that the main problem with MLPs is that the optimization problem is non-convex: we are not able to find the minimum, and therefore we are not able to bound it. It is still an open problem, maybe for a long time.
Which in turn brings up the question of the number of neurons. While you are updating your knowledge from the various comments, I think it is worth doing a trial run with 81 neurons in hidden layer 1 and 62 neurons ((input + target)/2) in the 2nd hidden layer. By the time you have an idea of where to start with technical proofs (for your new thoughts), you can already see the results of this trial run.
I agree with Simone with respect to cross-validation. With neural networks, overfitting is always an issue, and thus, the more parameters (e.g., synaptic weights), the greater the chance that over-fitting occurs. With clients, I often refer to polynomial overfitting -- i.e., if there are more coefficients than data points, then the polynomial fit will be exact on the training data and correspondingly of little value on test data. Even though the theory suggests that a hidden layer of arbitrarily large size implies a universal classifier, a neural net model with only 1 neuron in the hidden layer is (very roughly) approximately logistic regression, which is a linear classifier. Often it is surprising even to myself -- and I know what is going on -- how small the hidden layers can actually be in a given application.
But I agree that averages are good -- especially the geometric mean for the last hidden layer. But given the overfitting issues with nonlinear classifiers/regression, cross-validation is still a very good practice even after you've fixed the structure of the neural network.
I've found that the answer to this question depends heavily on the training algorithm you have available -- if an EBP variant, you may be able to handle bridged and deep architectures, however, time to convergence may be prohibitive for some applications. Thus, using a k-fold approach (or any approach of your choosing) to try a network at a certain size and configuration, evaluate, then grow or shrink, may simply take too long. Using traditional LM variants, you will be able to train and converge comparatively rapidly from test to test, however, standard LM does not handle bridged architectures. It largely becomes a matter of the tool set you are able to deploy on your experiments. I can suggest a look at this paper for a comprehensive overview:
Then, you are free to try the best NN training algorithm I have personally tried. It comes from our group ( .. :) .. ) but honestly, it is remarkable in its ability to efficiently handle arbitrarily-connected networks of significant depth and width. This version is compiled for WinXP, so keep that in mind and set run-time modes accordingly within other operating systems:
If the number of inputs is ni and the number of outputs is no, you can use the formula 2(ni + no), and the maximum number of hidden-layer neurons can be set to (k*(ni+no) - no)/(ni+no+1), where k is the number of your observations.
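Taken literally, with placeholder numbers for ni, no and k, that rule reads as:

# Quick sketch of the rule above, taken literally: ni inputs, no outputs,
# k observations. The numbers below are placeholders.
ni, no, k = 10, 2, 300
suggested_hidden = 2 * (ni + no)                       # the 2(ni + no) suggestion
max_hidden = (k * (ni + no) - no) / (ni + no + 1)      # stated upper limit on hidden neurons
print(suggested_hidden, int(max_hidden))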
The selection of hidden layers for the network is not straightforward. When the number of hidden-layer units is too small or too large, errors increase. Many methods have been developed to identify the number of hidden-layer units, but there is no ideal solution to this problem [see Kermanshahi, B. & Iwamiya, H. (2002). Up to year 2020 load forecasting using neural net. Electrical Power & Energy Systems, 24, 789-797]. I would suggest that you start with one hidden layer and gradually increase the number of layers, then attempt to find the network with the least RMSE for the residuals.
I almost agree with Mahmoud Okasha from Al-Azhar University about the lack of a definite strategy for optimizing the number of hidden layers. Experience, intuition, and luck all help!
One of the most important characteristics of a perceptron network is the number of neurons in the hidden layer(s). If an inadequate number of neurons is used, the network will be unable to model complex data, and the resulting fit will be poor. If too many neurons are used, the training time may become excessively long, and, worse, the network may overfit the data. When overfitting occurs, the network begins to model random noise in the data. The result is that the model fits the training data extremely well but generalizes poorly to new, unseen data. Validation must be used to test for this.
There are many rule-of-thumb methods for determining the correct number of neurons to use in the hidden layers, such as the following (a quick sketch computing them appears after the list):
• The number of hidden neurons should be between the size of the input layer and the size of the output layer.
• The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
• The number of hidden neurons should be less than twice the size of the input layer.
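As a small sketch of these three rules; n_in and n_out below are placeholders for your own input/output sizes:

# Sketch evaluating the three rules of thumb above for a hypothetical problem;
# n_in and n_out are placeholders for your own input/output sizes.
n_in, n_out = 30, 3

rule_between = (min(n_in, n_out), max(n_in, n_out))    # somewhere between input and output size
rule_two_thirds = round(2 * n_in / 3) + n_out          # 2/3 of input size plus output size
rule_upper = 2 * n_in                                  # hidden size should stay below this

print("between:", rule_between)
print("2/3 input + output:", rule_two_thirds)
print("upper limit (< 2 * inputs):", rule_upper)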
This question assumes just one hidden layer. Sometimes more than one can be efficient. Intuitions and background science can help - how many features or dimensions do people think are important, or are normally present in a theory or model. This sets a useful first approximation of the size of the hidden layer H, or the choke point if more layers are needed.
Typically we implicitly regard concepts and features as near convex and near linear by default.
Near-linear sigmoids then create a polygonal region for the concept at the outputs, and for each feature at the hidden layer, each expressed as a conjunction of half-spaces. If several regions need to be combined, or some features are disjunctive, then more layers can be profitable. Often these are designed to funnel into the choke point and then spread out again from there.
Some heuristics are based on the size of the output layer K, e.g. some small multiple of lg(K). Some are based on the arithmetic mean of the input and output layers, F and K, or the geometric mean, or lg(FK) = lg(F) + lg(K). The logarithmic number of hidden features assumes classes that get divided by near-orthogonal hyperplanes, but we can also get classes that tend to split by near-parallel hyperplanes, in which case the number needed can be linear in the number of classes (or they can be sorted and separated in the output layer).
Unsupervised clustering or PCA/ICA can also give you an idea of H in the uncorrelated/independent cases, and may also provide useful features or useful compression. Visualization of the SVD-rotated (paired) singular vector spaces can also help recognize the structure.
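As one possible reading of the PCA suggestion, here is a sketch that uses the number of components needed to retain most of the variance as a first guess for H (scikit-learn assumed; the 95% threshold is an arbitrary choice):

# One possible way to get a first guess at H from PCA, as suggested above.
# Assumes scikit-learn; the 95% explained-variance threshold is arbitrary.
import numpy as np
from sklearn.decomposition import PCA

def guess_hidden_size(X, variance_to_keep=0.95):
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, variance_to_keep) + 1)  # components needed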
When you do start trying different H, or different numbers of layers, be careful about cross-validating on, or reusing, all the test/training data each time, for each variant of H1..Hh. This will tend to overtrain the structure. Nested cross-validation should be used, particularly if the search is automatic, e.g. employing evolutionary, genetic, swarm or colony-based meta-algorithms.
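A minimal sketch of nested cross-validation with scikit-learn, where the inner loop selects the hidden size and the outer loop estimates generalization; X, y and the candidate sizes are placeholders:

# Minimal sketch of nested cross-validation: the inner loop selects the hidden
# size, the outer loop estimates generalization error. Assumes scikit-learn.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

def nested_cv_scores(X, y):
    param_grid = {"hidden_layer_sizes": [(h,) for h in (5, 10, 20, 40)]}
    inner = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0),
                         param_grid, cv=3)             # inner CV: model selection
    return cross_val_score(inner, X, y, cv=5)          # outer CV: performance estimate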
The nodes of a decision tree may be used to determine the number of hidden neurons for a classification problem. This is, again, a rule of thumb...
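One possible reading of this, with scikit-learn (using the tree's leaf count as the starting hidden size is my interpretation; X, y and max_depth are placeholders):

# One way to read the decision-tree suggestion above (my interpretation):
# fit a tree and use its leaf count as a starting guess for the hidden layer size.
from sklearn.tree import DecisionTreeClassifier

def tree_based_guess(X, y, max_depth=6):
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, y)
    return tree.get_n_leaves()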
Iterative or simulation methods may be used to determine an appropriate number of hidden neurons, but it depends on the sizes of the training and testing data sets.
I think there is no hard and fast rule for determining the exact number of neurons in the hidden layer. It depends on many factors, such as the size of the inputs/outputs, the type and size of the data, the output function, and the learning algorithm.
The best approach in practice is an intuitive one combined with trial and error.
Find it by trial and error. Start with the number given by Winter. Then increase by 1 neuron, see the impact on the cost, and decide whether to keep increasing or to back-track. Similarly, decrease by one and experiment as above.
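A rough sketch of that local search; validation_cost is a hypothetical helper that trains a network with a given number of hidden neurons and returns its validation cost, and start is the initial size (e.g. from one of the rules above):

# Rough sketch of the +/-1 local search described above. validation_cost(h) is a
# hypothetical helper that trains a net with h hidden neurons and returns its
# validation cost; start is the initial size.
def tune_hidden_size(start, validation_cost, max_steps=20):
    h, cost = start, validation_cost(start)
    for step in (+1, -1):                          # first try growing, then shrinking
        while max_steps > 0:
            candidate = h + step
            if candidate < 1:
                break
            new_cost = validation_cost(candidate)
            max_steps -= 1
            if new_cost < cost:                    # keep moving while the cost improves
                h, cost = candidate, new_cost
            else:
                break                              # back-track: stop in this direction
    return h, cost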