The methodology used to determine the "optimal" number of hidden neurons is based on Structural Risk Minimization (see The Nature of Statistical Learning Theory, V. Vapnik).
You test a sequence of architectures such that each one is nested in the next. For instance, 10 hidden neurons \subset 20 hidden neurons \subset ... \subset 100 hidden neurons.
You evaluate the performance of each architecture using a held-out validation set or k-fold cross-validation. Once the validation error starts increasing, you can stop and choose the architecture with the lowest validation error.
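To make the procedure concrete, here is a minimal sketch, assuming scikit-learn's MLPClassifier and a data set X, y; the candidate sizes and the 5-fold setting are placeholders:

# Minimal sketch of the nested-architecture search described above,
# assuming scikit-learn is available and X, y hold the training data.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def select_hidden_size(X, y, candidates=range(10, 101, 10), k=5):
    best_size, best_error = None, np.inf
    for size in candidates:                      # each architecture is nested in the next
        model = MLPClassifier(hidden_layer_sizes=(size,), max_iter=2000, random_state=0)
        error = 1.0 - cross_val_score(model, X, y, cv=k).mean()   # k-fold validation error
        if error < best_error:
            best_size, best_error = size, error
        elif error > best_error:                 # validation error has started to rise: stop
            break
    return best_size, best_error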
@Raphael: connecting model selection in neural networks and the Structural Risk Minimization approach is a very intriguing observation.
However, k-fold cross-validation can be considered only a heuristic approximation to SRM. To perform SRM correctly, you need a way of computing the VC dimension of a neural network, but in practice you can only compute some bounds. Moreover, the training process of a neural network involves some regularizing effects which implicitly change the solution with respect to the theoretical one (e.g. choosing small initial weights). Hence, I don't think SRM can be considered a realistic criterion for model selection in this context.
One rule of thumb to be observed is that the number of weights to be optimized should be less than the number of training examples.
Usually one hidden layer (possibly with many hidden nodes) is enough, occasionally two is useful.
Practical rule of thumb: if n is the number of input nodes and m is the number of hidden nodes, then for binary/bipolar data m = 2n, and for real-valued data m >> 2n.
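A quick sketch of the weight-counting rule above, assuming a single hidden layer with biases; the problem sizes used here are placeholders:

# Rough check of the rule of thumb above: the number of free weights (including
# biases) in a single-hidden-layer network should stay below the number of
# training examples. n_in, n_out, n_examples are placeholders to fill in.
def n_weights(n_in, m, n_out):
    return (n_in + 1) * m + (m + 1) * n_out      # input->hidden plus hidden->output

n_in, n_out, n_examples = 20, 1, 2000            # hypothetical problem size
m = 2 * n_in                                     # the m = 2n suggestion for binary/bipolar data
print(n_weights(n_in, m, n_out), "weights vs", n_examples, "examples")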
SRM gives us the shape of the upper bound on the risk.
Even if you have the VC dimension of your preferred model, SRM provides just an upper bound. It says nothing about the realisation of the risk on your set of examples, or about its expected value. We have to understand statistical learning theory as a guideline. You can use SRM only if you believe that the behavior of the upper bound of the risk mirrors that of the risk itself. This holds for MLPs as well as SVMs.
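For reference, the classical VC bound that SRM builds on states (roughly, for 0/1 loss) that with probability at least 1 - \eta, for every function f in a class of VC dimension h trained on N examples,

R(f) \le R_{emp}(f) + \sqrt{ \frac{h\left(\ln(2N/h) + 1\right) - \ln(\eta/4)}{N} }

It controls only the worst case over the class; it says nothing about the risk you actually realise on a particular sample, which is exactly the point above.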
Anyway, without additional knowledge it seems a reasonable assumption. Right?
You are right, SRM can be used (in principle) for any class of learning models, and it works using an upper bound on the expected risk. However, to apply SRM you need to be able to decompose your class of hypotheses into nested subsets (easy in the case of neural networks), but also to be able to compute the VC dimension of each subset. This is rather simple for a linear model (e.g. a separating hyperplane), hence the basic theory of SVMs can be derived from SRM (as Vapnik showed). In the case of a non-linear model, instead, the VC dimension is rather difficult to compute, and it can only be approximated under a set of assumptions. Additionally, if your optimization problem is non-convex, the algorithm you use to solve it can introduce some "biases" such that the resulting effective VC dimension differs from the theoretical one (early stopping and small weight initialization are both examples of this). So in practice applying SRM is very difficult for neural networks, and you have to resort to heuristic approaches that can only be considered approximations to the "true" SRM, such as k-fold cross-validation. This is only a subtle point, but it seemed interesting to me in relation to your answer.
It is true that the main problem with MLPs is that the optimization problem is non-convex: we are not able to find the minimum, and therefore we are not able to bound it. It is still an open problem, maybe for a long time.
Which in turn brings up the question of the number of neurons. While you are updating your knowledge from the various comments, I think it is worth doing a trial run with 81 neurons in hidden layer 1 and 62 neurons ((input + target)/2) in the 2nd hidden layer. By the time you have an idea of where to start with technical proofs (for your new thoughts), you can already see the results of this trial run.
I agree with Simone with respect to cross-validation. With neural networks, overfitting is always an issue, and thus, the more parameters (e.g., synaptic weights), the greater the chance that over-fitting occurs. With clients, I often refer to polynomial overfitting -- i.e., if there are more coefficients than data points, then the polynomial fit will be exact on the training data and correspondingly of little value on test data. Even though the theory suggests that a hidden layer of arbitrarily large size implies a universal classifier, a neural net model with only 1 neuron in the hidden layer is (very roughly) approximately logistic regression, which is a linear classifier. Often it is surprising even to myself -- and I know what is going on -- how small the hidden layers can actually be in a given application.
But I agree that averages are good -- especially the geometric mean for the last hidden layer. But given the overfitting issues with nonlinear classifiers/regression, cross-validation is still a very good practice even after you've fixed the structure of the neural network.
I've found that the answer to this question depends heavily on the training algorithm you have available -- if an EBP variant, you may be able to handle bridged and deep architectures, however, time to convergence may be prohibitive for some applications. Thus, using a k-fold approach (or any approach of your choosing) to try a network at a certain size and configuration, evaluate, then grow or shrink, may simply take too long. Using traditional LM variants, you will be able to train and converge comparatively rapidly from test to test, however, standard LM does not handle bridged architectures. It largely becomes a matter of the tool set you are able to deploy on your experiments. I can suggest a look at this paper for a comprehensive overview:
Then, you are free to try the best NN training algorithm I have personally tried. It comes from our group ( .. :) .. ) but honestly, it is remarkable in its ability to efficiently handle arbitrarily-connected networks of significant depth and width. This version is compiled for WinXP, so keep that in mind and set run-time modes accordingly within other operating systems:
If the number of inputs is ni and the number of outputs is no, you can use the formula 2(ni + no), and the maximum number of hidden-layer neurons can be set to (k*(ni+no) - no)/(ni+no+1), where k is the number of your observations.
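Taken literally, with placeholder numbers for ni, no and k, that rule reads as:

# Quick sketch of the rule above, taken literally: ni inputs, no outputs,
# k observations. The numbers below are placeholders.
ni, no, k = 10, 2, 300
suggested_hidden = 2 * (ni + no)                       # the 2(ni + no) suggestion
max_hidden = (k * (ni + no) - no) / (ni + no + 1)      # stated upper limit on hidden neurons
print(suggested_hidden, int(max_hidden))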
The selection of hidden layers for the network is not straightforward. When the number of hidden-layer units is too small or too large, errors increase. Many methods have been developed to identify the number of hidden-layer units, but there is no ideal solution to this problem [see Kermanshahi, B. & Iwamiya, H. (2002). Up to year 2020 load forecasting using neural net. Electrical Power & Energy Systems, 24, 789-797]. I would suggest that you start with one hidden layer and gradually increase the number of layers, then attempt to find the network with the least RMSE for the residuals.
I almost agree with Mahmoud Okasha from Al-Azhar University about the lack of a definite strategy for optimizing the number of hidden layers. Experience, intuition, and luck all help!
One of the most important characteristics of a perceptron network is the number of neurons in the hidden layer(s). If an inadequate number of neurons is used, the network will be unable to model complex data, and the resulting fit will be poor. If too many neurons are used, the training time may become excessively long, and, worse, the network may overfit the data. When overfitting occurs, the network begins to model random noise in the data. The result is that the model fits the training data extremely well but generalizes poorly to new, unseen data. Validation must be used to test for this.
There are many rule-of-thumb methods for determining the correct number of neurons to use in the hidden layers, such as the following (a quick sketch computing them appears after the list):
• The number of hidden neurons should be between the size of the input layer and the size of the output layer.
• The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
• The number of hidden neurons should be less than twice the size of the input layer.
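As a small sketch of these three rules; n_in and n_out below are placeholders for your own input/output sizes:

# Sketch evaluating the three rules of thumb above for a hypothetical problem;
# n_in and n_out are placeholders for your own input/output sizes.
n_in, n_out = 30, 3

rule_between = (min(n_in, n_out), max(n_in, n_out))    # somewhere between input and output size
rule_two_thirds = round(2 * n_in / 3) + n_out          # 2/3 of input size plus output size
rule_upper = 2 * n_in                                  # hidden size should stay below this

print("between:", rule_between)
print("2/3 input + output:", rule_two_thirds)
print("upper limit (< 2 * inputs):", rule_upper)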
This question assumes just one hidden layer. Sometimes more than one can be efficient. Intuitions and background science can help - how many features or dimensions do people think are important, or are normally present in a theory or model. This sets a useful first approximation of the size of the hidden layer H, or the choke point if more layers are needed.
Typically we implicitly regard concepts and features as near convex and near linear by default.
Near-linear sigmoids then create a polygonal region for the concept at the outputs, and for each feature at the hidden layer, each expressed as a conjunction of half-spaces. If several regions need to be combined, or some features are disjunctive, then more layers can be profitable. Often these are designed to funnel into the choke point and then spread out again from there.
Some heuristics are based on the size of the output layer K, e.g. some small multiple of lg(K). Some are based on the arithmetic mean of the input and output layers, F and K, or the geometric mean, or lg(FK) = lg(F) + lg(K). The logarithmic number of hidden features assumes classes that get divided by near-orthogonal hyperplanes, but we can also get classes that tend to split by near-parallel hyperplanes, in which case the number needed can be linear in the number of classes (or they can be sorted and separated in the output layer).
Unsupervised clustering or PCA/ICA can also give you an idea of H in the uncorrelated/independent cases, and may also provide useful features or useful compression. Visualization of the SVD-rotated (paired) singular vector spaces can also help recognize the structure.
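As one possible reading of the PCA suggestion, here is a sketch that uses the number of components needed to retain most of the variance as a first guess for H (scikit-learn assumed; the 95% threshold is an arbitrary choice):

# One possible way to get a first guess at H from PCA, as suggested above.
# Assumes scikit-learn; the 95% explained-variance threshold is arbitrary.
import numpy as np
from sklearn.decomposition import PCA

def guess_hidden_size(X, variance_to_keep=0.95):
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, variance_to_keep) + 1)  # components needed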
When you do start trying different H, or different numbers of layers, be careful about cross-validating on, or reusing, all the test/training data each time, for each variant of H1..Hh. This will tend to overtrain the structure. Nested cross-validation should be used, particularly if the search is automatic, e.g. employing evolutionary, genetic, swarm or colony-based meta-algorithms.
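A minimal sketch of nested cross-validation with scikit-learn, where the inner loop selects the hidden size and the outer loop estimates generalization; X, y and the candidate sizes are placeholders:

# Minimal sketch of nested cross-validation: the inner loop selects the hidden
# size, the outer loop estimates generalization error. Assumes scikit-learn.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

def nested_cv_scores(X, y):
    param_grid = {"hidden_layer_sizes": [(h,) for h in (5, 10, 20, 40)]}
    inner = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0),
                         param_grid, cv=3)             # inner CV: model selection
    return cross_val_score(inner, X, y, cv=5)          # outer CV: performance estimate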
The nodes of a decision tree may be used to determine the number of hidden neurons for a classification problem. This is, again, a rule of thumb...
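One possible reading of this, with scikit-learn (using the tree's leaf count as the starting hidden size is my interpretation; X, y and max_depth are placeholders):

# One way to read the decision-tree suggestion above (my interpretation):
# fit a tree and use its leaf count as a starting guess for the hidden layer size.
from sklearn.tree import DecisionTreeClassifier

def tree_based_guess(X, y, max_depth=6):
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, y)
    return tree.get_n_leaves()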
Iterative or simulation methods may be used to determine an appropriate number of hidden neurons, but it depends on the sizes of the training and testing data sets.
I think there is no hard and fast rule for determining the exact number of neurons in the hidden layer. It depends on many factors, such as the size of the inputs/outputs, the type and size of the data, the output function, and the learning algorithm.
The best approach in practice is an intuitive one combined with trial and error.
Find it by trial and error. Start with the number given by Winter. Then increase by 1 neuron, see the impact on the cost, and decide whether to keep increasing or to back-track. Similarly, decrease by one and experiment as above.
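A rough sketch of that local search; validation_cost is a hypothetical helper that trains a network with a given number of hidden neurons and returns its validation cost, and start is the initial size (e.g. from one of the rules above):

# Rough sketch of the +/-1 local search described above. validation_cost(h) is a
# hypothetical helper that trains a net with h hidden neurons and returns its
# validation cost; start is the initial size.
def tune_hidden_size(start, validation_cost, max_steps=20):
    h, cost = start, validation_cost(start)
    for step in (+1, -1):                          # first try growing, then shrinking
        while max_steps > 0:
            candidate = h + step
            if candidate < 1:
                break
            new_cost = validation_cost(candidate)
            max_steps -= 1
            if new_cost < cost:                    # keep moving while the cost improves
                h, cost = candidate, new_cost
            else:
                break                              # back-track: stop in this direction
    return h, cost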