In any deep learning procedure, we effectively find the weights by some kind of minimization process. This process can get stuck in a local minimum, so the weights it returns may not be the best we could find. How is that avoided?
Hello. This is a good question. My understanding of deep learning is that it is a very large non-linear parameter estimation problem with certain structural constraints in terms of the connectivity of the neurons from one layer to the next, as well as non-negativity constraints on the activations enforced via rectified linear units (ReLUs). Gradient descent is used to optimise the network. As far as I know, there are no convergence guarantees and no convexity results for this type of optimisation problem. Therefore only local convergence is to be expected.
How can it be avoided? This would be a good research topic. I would start with a simple problem that you can analyse fully, train a small deep network on it, and see which data and which initialisation points in the parameter space lead to which convergence behaviours.
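As a starting point for that kind of experiment, here is a minimal sketch (plain Python/NumPy, no deep learning framework) of the simplest possible case: gradient descent on a one-dimensional non-convex function with two local minima. The function, step size, and starting points are arbitrary illustrative choices; the point is only that different initialisation points converge to different minima with different objective values.

```python
# Gradient descent on f(w) = w**4 - 3*w**2 + w, a toy non-convex function
# with two local minima. Different starting points converge to different
# minima, which is exactly the "only local convergence" behaviour above.
import numpy as np

def f(w):
    return w**4 - 3 * w**2 + w

def grad_f(w):
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=1000):
    w = w0
    for _ in range(steps):
        w -= lr * grad_f(w)   # plain gradient descent update
    return w

for w0 in (-2.0, 0.0, 2.0):   # three different initialisation points
    w_star = gradient_descent(w0)
    print(f"start {w0:+.1f} -> converged to w = {w_star:+.4f}, f(w) = {f(w_star):+.4f}")
```

Running this, the starts at -2.0 and 0.0 end up in the deeper minimum near w ≈ -1.30, while the start at +2.0 gets trapped in the shallower one near w ≈ 1.13.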
A good technique is to force an arbitrary, major change to some of the coefficients and then let the optimisation continue to a new convergence. The most effective changes are the ones applied to the coefficients of bias nodes that should be close to zero.
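A rough, self-contained sketch of that perturb-and-continue idea on a toy objective might look like the following. Here a single coefficient stands in for the network weights, and the function, perturbation scale, and acceptance rule are illustrative choices rather than a standard recipe.

```python
# Perturb-and-continue on a toy non-convex objective: after the first
# convergence, force a large random change to the coefficient, let gradient
# descent re-converge, and keep the result only if the objective improved.
import numpy as np

def f(w):                                  # simple non-convex objective, two local minima
    return w**4 - 3 * w**2 + w

def grad_f(w):
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=1000):
    w = w0
    for _ in range(steps):
        w -= lr * grad_f(w)
    return w

rng = np.random.default_rng(0)
best_w = gradient_descent(2.0)             # first convergence (a local minimum)
best_f = f(best_w)

for _ in range(5):                         # force a few arbitrary, major changes to the coefficient
    w_new = gradient_descent(best_w + rng.normal(scale=2.0))   # re-converge from the perturbed point
    if f(w_new) < best_f:                  # keep the change only if the objective improved
        best_w, best_f = w_new, f(w_new)

print(f"best coefficient found: w = {best_w:+.4f}, f(w) = {best_f:+.4f}")
```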
One thing about deep learning is that there is a very large number of parameters to tune, which means the loss function lives in a very high-dimensional parameter space. In Andrew Ng's Coursera Deep Learning Specialization, he mentions that in such high-dimensional spaces, precisely because there are so many parameters, we encounter saddle points far more often than local minima, and the optimizers we use today are quite capable of navigating through saddle points. So, in practical cases, you will most likely not get stuck at a local minimum.
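The saddle-point part of that claim can at least be illustrated on a toy example. The sketch below (plain NumPy; the function f(x, y) = x^2 - y^2, the step size, and the momentum coefficient are arbitrary choices) starts almost exactly on a saddle point and counts how many steps plain gradient descent versus gradient descent with heavy-ball momentum need to escape.

```python
# Escaping a saddle point of f(x, y) = x**2 - y**2 (saddle at the origin):
# plain gradient descent started almost exactly on the saddle leaves it very
# slowly, while the same descent with heavy-ball momentum escapes much faster.
import numpy as np

def grad(w):                       # gradient of f(x, y) = x**2 - y**2
    x, y = w
    return np.array([2 * x, -2 * y])

def steps_to_escape(use_momentum, lr=0.01, beta=0.9, max_steps=100_000):
    w = np.array([1.0, 1e-6])      # start a tiny distance off the saddle direction
    v = np.zeros(2)
    for step in range(1, max_steps + 1):
        g = grad(w)
        if use_momentum:
            v = beta * v + g       # heavy-ball momentum
            w = w - lr * v
        else:
            w = w - lr * g         # plain gradient descent
        if abs(w[1]) > 1.0:        # |y| > 1 means we have left the saddle region
            return step
    return max_steps

print("plain GD escapes after", steps_to_escape(False), "steps")
print("momentum escapes after", steps_to_escape(True), "steps")
```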
However, for real high-dimensional networks this is largely intuition and is hard to verify rigorously. It is a nice theoretical question to ask and, I think, a good research topic. One way to guard against poor local minima is to run your algorithm from multiple starting points, or to slightly perturb some of the coefficients after the algorithm has converged and check whether it converges to the same point again.
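As a concrete illustration of the multiple-starting-points idea, here is a small sketch that trains the same tiny one-hidden-layer network from several random initialisations and keeps the run with the lowest final loss. Everything here (the architecture, the sin-curve toy data, and the hyperparameters) is an arbitrary choice made only to keep the example short; it is a sketch of the idea, not a recipe.

```python
# Multi-start training: fit a tiny one-hidden-layer tanh network to toy 1-D
# regression data from several random initialisations and keep the best run.
import numpy as np

X = np.linspace(-3, 3, 64).reshape(-1, 1)      # toy 1-D regression data
Y = np.sin(X)

def train(seed, hidden=8, lr=0.05, steps=2000):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=1.0, size=(1, hidden))   # random starting point
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=1.0, size=(hidden, 1))
    b2 = np.zeros(1)
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)                   # forward pass
        P = H @ W2 + b2
        E = P - Y
        loss = np.mean(E**2)
        # backward pass: gradients of the mean-squared error
        dP = 2 * E / len(X)
        dW2, db2 = H.T @ dP, dP.sum(axis=0)
        dH = dP @ W2.T * (1 - H**2)                # tanh'(z) = 1 - tanh(z)**2
        dW1, db1 = X.T @ dH, dH.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1             # full-batch gradient descent
        W2 -= lr * dW2; b2 -= lr * db2
    return loss

losses = {seed: train(seed) for seed in range(5)}  # five different starting points
best = min(losses, key=losses.get)
print("final loss per seed:", {s: round(float(l), 4) for s, l in losses.items()})
print("keep the run from seed", best)
```

The same loop also covers the perturb-after-convergence variant: instead of drawing a fresh random initialisation, add noise to the best weights found so far and continue training from there.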