I am working on a classification problem using deep learning. I have a problem selecting the number of hidden layers and the number of neurons in each hidden layer. Is there any specific procedure for selecting these?
As also mentioned by Mohammad, there are some methods available in the literature. However, I don't believe there is a single solution apart from some generic, open-ended rules.
I always prefer an empirical method: various options are tried, the resulting errors are observed, and the best one is selected. This approach works for almost all kinds of hyperparameters.
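A minimal sketch of this empirical approach: train several candidate architectures and keep the one with the lowest validation error. scikit-learn's MLPClassifier, the candidate layouts, and the synthetic dataset are illustrative choices on my part, not part of the original answer.

```python
# Empirical hyperparameter selection: try a handful of hidden-layer
# layouts and keep the one with the best validation accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

candidates = [(10,), (50,), (50, 20), (100, 50)]  # hidden-layer layouts to try
results = {}
for layout in candidates:
    clf = MLPClassifier(hidden_layer_sizes=layout, max_iter=500, random_state=0)
    clf.fit(X_tr, y_tr)
    results[layout] = clf.score(X_val, y_val)  # validation accuracy

best = max(results, key=results.get)
print(best, results[best])
```

The same loop generalizes to any hyperparameter (learning rate, regularization strength, etc.) by swapping what varies between candidates.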
To date, no commonly accepted methodology exists to configure the structure of DNNs (number of nodes, number of layers). This issue often forces one to choose the structure of a DNN blindly. DNN model selection has recently attracted intensive research, where the goal is to determine an appropriate structure with the right complexity for a given problem. It is evident that a shallow NN tends to converge much faster than a DNN and handles the small-sample-size problem better. In other words, the size of a DNN strongly depends on the availability of samples; you can keep this in mind while selecting the number of hidden layers.
The next question is how to determine the number of nodes. One approach is to implement a flexible-structure DNN that starts the learning process from scratch and can automatically add or prune hidden nodes during training. The decision to grow or prune nodes can be made by examining possible underfitting and/or overfitting of the network: we can grow a node while underfitting, whereas several nodes can be pruned to resolve an overfitting situation. We suggest you take a look at our publications for the details.
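The grow/prune decision rule described above can be sketched as a small controller that watches training and validation error. The thresholds, starting width, and function name here are hypothetical placeholders, not taken from the cited publications.

```python
# Illustrative grow/prune controller: add a node when the network
# underfits (high training error), remove one when it overfits
# (validation error much worse than training error).
def adjust_width(n_nodes, train_err, val_err,
                 underfit_thr=0.2, overfit_gap=0.1):
    if train_err > underfit_thr:           # underfitting: add capacity
        return n_nodes + 1
    if val_err - train_err > overfit_gap:  # overfitting: remove capacity
        return max(1, n_nodes - 1)
    return n_nodes                         # structure is adequate

print(adjust_width(10, 0.35, 0.40))  # underfitting -> grows to 11
print(adjust_width(10, 0.05, 0.30))  # overfitting  -> prunes to 9
print(adjust_width(10, 0.05, 0.08))  # adequate     -> stays at 10
```

In a real training loop this rule would be evaluated periodically (e.g. every epoch), with the network's weights adapted after each structural change.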
Chapter Autonomous Deep Learning: Continual Learning Approach for Dy...
Article DEVDAN: Deep Evolving Denoising Autoencoder
Conference Paper Automatic Construction of Multi-layer Perceptron Network fro...
The approaches stated above by Mohammad Amin Motamedi are often based on rules of thumb, so they only show good performance on some specific problems. Pruning is also an interesting approach, but it is rather hard to implement, and one cannot ensure that the removed neurons would not have become active at a later point in training.
There are only two ways I know of to find the best model architecture with a very high probability. Note that both approaches are highly computationally demanding and time-consuming.
1. Consider the bias-variance tradeoff (https://en.wikipedia.org/wiki/Bias–variance_tradeoff): Start training the network with only one hidden layer (one layer is sufficient for the vast majority of problems) and very few neurons. After training has finished, add another neuron and train again. Repeat this procedure until the validation performance worsens. (At the beginning, the goodness of fit on both training and validation data will increase. At a certain point, only the training results will keep improving with more neurons, while the goodness of validation will decrease.) If you have increased the number of neurons in one hidden layer to a certain extent (say, 100 neurons) and the overall model performance still doesn't suffice, add another hidden layer and start over with very few neurons in both layers, which you again slowly increase. This doesn't mean you must always add neurons one by one: if the problem is highly complex, you can add 5 neurons per step and, once you notice a decrease in validation goodness, step back to adjusting a single neuron at a time. In summary, this method could theoretically find the absolute optimum of hidden layers and neurons if the training always found a globally optimal set of parameters. Since neural networks are usually trained with gradient-based algorithms, which can only find a locally optimal set of network parameters, the method cannot guarantee the absolutely optimal model structure. Nevertheless, it gets pretty close to the optimal structure (say, within 1-2 neurons), since the local training algorithms usually find an optimum that is close to the globally optimal network parameters. A variation of this method is to conduct it the other way round: start with a high number of neurons and slowly decrease it until the goodness of validation decreases. As you can imagine, this method is highly time-consuming, but it finds the optimal structure quite precisely.
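The forward variant of approach 1 can be sketched as follows. scikit-learn's MLPClassifier, the synthetic data, and the 0.02 degradation tolerance are illustrative assumptions; the original answer does not prescribe a specific library or threshold.

```python
# Approach 1 (sketch): one hidden layer, grown one neuron at a time
# until validation performance clearly stops improving.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best_n, best_score = 1, -1.0
for n in range(1, 30):                       # grow one neuron per step
    clf = MLPClassifier(hidden_layer_sizes=(n,), max_iter=500, random_state=0)
    clf.fit(X_tr, y_tr)
    score = clf.score(X_val, y_val)
    if score > best_score:
        best_n, best_score = n, score
    elif score < best_score - 0.02:          # clear degradation: stop
        break

print(best_n, best_score)
```

The reverse variant simply starts `n` large and decrements; the stopping condition stays the same.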
2. Use a global optimization algorithm to find the optimal network structure. This approach can be a lot faster than the first one if the problem is highly complex and a large number of hidden layers and neurons is mandatory. It works as follows: use a global optimization algorithm (e.g. particle swarm optimization or an evolutionary algorithm) to vary the number of neurons and hidden layers. Similar to the above, each variation of the network structure requires a full training run, and the found solution again depends on the locally optimal training algorithm, which means it can deviate by a few neurons from the globally optimal solution. The optimization goal is to improve both the training and the validation statistics.
Please note that this approach can find a solution more quickly for highly complex problems, which also require a complex network structure, but for most problems the first approach is the one that should be chosen.
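The structure of approach 2 can be sketched with a simple random search standing in for PSO or an evolutionary algorithm (those would propose candidates more cleverly, but the evaluation loop is the same). The library, search ranges, and evaluation budget are my own illustrative assumptions.

```python
# Approach 2 (sketch): a global search proposes whole architectures;
# each candidate costs one full training run, and the best validated
# structure is kept.
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rng = random.Random(0)

def random_structure():
    """Propose a candidate: 1-3 hidden layers of 5-100 neurons each."""
    n_layers = rng.randint(1, 3)
    return tuple(rng.randint(5, 100) for _ in range(n_layers))

best_struct, best_score = None, -1.0
for _ in range(10):                    # evaluation budget
    struct = random_structure()
    clf = MLPClassifier(hidden_layer_sizes=struct, max_iter=300, random_state=0)
    clf.fit(X_tr, y_tr)
    score = clf.score(X_val, y_val)
    if score > best_score:
        best_struct, best_score = struct, score

print(best_struct, best_score)
```

A real PSO or evolutionary implementation would replace `random_structure()` with proposals derived from the best candidates seen so far.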
In conclusion, finding the optimal network structure is a demanding and time-consuming task that should not be underestimated. Therefore, using rules of thumb can be a reasonable thing to do.
Another easy method, used especially in classification, is early stopping: an unnecessarily complex network structure is chosen in advance, and training is aborted prematurely as soon as the goodness of validation decreases during training. This is also a good method if the final model does not have to meet any computational-time requirements.
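Early stopping as described here is available out of the box in scikit-learn's MLPClassifier via its `early_stopping` flag (an illustrative library choice; any framework with a validation-based stopping callback works the same way). The oversized layout and patience value are arbitrary for the sketch.

```python
# Early stopping: deliberately over-sized network, training halted
# once the held-out validation score stops improving.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(200, 200),  # intentionally too large
                    early_stopping=True,            # hold out validation data
                    validation_fraction=0.2,
                    n_iter_no_change=10,            # patience before aborting
                    max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.n_iter_, clf.best_validation_score_)  # epochs run, best val. score
```

Note the tradeoff mentioned above: the resulting model is larger (and slower at inference) than one whose structure was tuned, which is why this only suits settings without tight computational-time requirements.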
I hope I could clear up some things, but sadly the method to choose depends, as so often, on the problem and the application itself.