I think you should try various configurations manually. I don't think a NN with more than one hidden layer is needed (a single hidden layer has proved sufficient in many applications), but you still have to try different numbers of neurons in the hidden layer by:
1- for different values of n (number of neurons in the hidden layer) do:
2- train the net several times and average the validation errors
3- finally, pick the n that gives the best validation score (see the sketch below).
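As a minimal sketch of that loop (assuming scikit-learn's MLPClassifier, made-up candidate sizes, and your own X_train/y_train and X_val/y_val splits):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def pick_hidden_size(X_train, y_train, X_val, y_val,
                     candidates=(2, 4, 8, 16, 32), repeats=5):
    best_n, best_score = None, -np.inf
    for n in candidates:
        scores = []
        for seed in range(repeats):
            # retrain several times with different random initializations
            net = MLPClassifier(hidden_layer_sizes=(n,),
                                max_iter=1000, random_state=seed)
            net.fit(X_train, y_train)
            scores.append(net.score(X_val, y_val))
        avg = np.mean(scores)  # average validation accuracy for this n
        if avg > best_score:
            best_n, best_score = n, avg
    return best_n, best_score
```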
1- Trial and error (with the risk of over-fitting):
Determine an initial value using rule-of-thumb methods:
Rules of thumb for a starting number of hidden neurons:
The number of hidden neurons should be between the size of the input layer and the size of the output layer.
The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
The number of hidden neurons should be less than twice the size of the input layer.
A rule of thumb that estimates the number of hidden neurons from the size of the training set:
Nh = Ns / (alpha * (Ni + No))
Ni = number of input neurons.
No = number of output neurons.
Ns = number of samples in training data set.
alpha = an arbitrary scaling factor, usually 2-10.
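For concreteness, a small sketch that evaluates the three rules above plus the Nh formula; the example values of Ni, No, Ns and alpha below are made up:

```python
def hidden_neuron_heuristics(Ni, No, Ns, alpha=2):
    return {
        "between_input_and_output": sorted((No, Ni)),             # rule 1: a range
        "two_thirds_input_plus_output": round(2 / 3 * Ni + No),   # rule 2
        "less_than_twice_input": 2 * Ni,                          # rule 3: stay below this
        "Nh_formula": Ns / (alpha * (Ni + No)),                   # Nh = Ns / (alpha * (Ni + No))
    }

print(hidden_neuron_heuristics(Ni=10, No=1, Ns=1000, alpha=2))
# Nh_formula gives about 45 hidden neurons for 1000 samples with alpha = 2
```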
2- N-Fold Cross-Validation
To avoid over-fitting, use cross-validation: either explicitly penalize overly complex models, or test the model's ability to generalize by evaluating its performance on data not used for training, which is assumed to approximate the typical unseen data the model will encounter.
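A minimal sketch of this approach, assuming scikit-learn and a hypothetical data set X, y, comparing candidate hidden-layer sizes by their mean k-fold score:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def cv_pick_hidden_size(X, y, candidates=(2, 4, 8, 16, 32), folds=5):
    results = {}
    for n in candidates:
        net = MLPClassifier(hidden_layer_sizes=(n,), max_iter=1000, random_state=0)
        # mean accuracy over the held-out fold of each of the k splits
        results[n] = np.mean(cross_val_score(net, X, y, cv=folds))
    return max(results, key=results.get), results
```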
3- Using a hybrid meta-heuristic algorithm to train feed-forward neural networks
Our ADMET Modeler does it by semi-exhaustive search (www.simulations-plus.com). It uses ensembles of MLPs, each with a single hidden layer, so that part is fixed. What varies are:
1) The number of hidden neurons (N)
2) The number of inputs (I)
The user specifies the starting and ending values of both N and I, as well as the steps dN and dI for going from start to end. Hence, one gets:
N0, N0+dN, N0+2*dN, ..., N0+m*dN neurons
I0, I0+dI, I0+2*dI, ..., I0+k*dI inputs
These parameters form an (m+1) x (k+1) matrix of ANN architectures. Architectures with too many weights relative to the size of the training data are removed - these sit in the lower right corner of the matrix. All the remaining ANN ensembles are then trained one by one (we can afford it because training is very fast) and the best one is chosen.
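For illustration only (this is not the actual ADMET Modeler code), a rough sketch of such a semi-exhaustive search over the (N, I) grid; the weight budget of at least two training samples per weight and the first-I-columns input selection are assumptions:

```python
import numpy as np
from itertools import product
from sklearn.neural_network import MLPRegressor

def semi_exhaustive_search(X_train, y_train, X_val, y_val,
                           N0, dN, m, I0, dI, k, samples_per_weight=2):
    Ns = len(X_train)
    best_arch, best_score = None, -np.inf
    for N, I in product(range(N0, N0 + m * dN + 1, dN),
                        range(I0, I0 + k * dI + 1, dI)):
        n_weights = I * N + 2 * N + 1   # weights + biases of a 1-hidden-layer MLP
        if n_weights * samples_per_weight > Ns:
            continue                    # too many weights for the data (lower right corner): skip
        cols = list(range(I))           # placeholder input selection: first I columns
        net = MLPRegressor(hidden_layer_sizes=(N,), max_iter=2000, random_state=0)
        net.fit(X_train[:, cols], y_train)
        score = net.score(X_val[:, cols], y_val)
        if score > best_score:
            best_arch, best_score = (N, I), score
    return best_arch, best_score
```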