You don't want all the initial weights to be zero, because then you are not breaking any symmetries in the network structure: every hidden unit computes the same function and receives the same gradient. (Initial bias weights of zero are fine, though.)
You also don't want all the initial weights to be positive, since (on average) half the weights after training will be negative.
Assuming a logistic activation function and normalised input data, we also don't want the initial weights to be too large, since large weighted sums push the logistic function into its flat, saturated regions, where the derivatives are small and learning takes longer.
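To make the symmetry point concrete before looking at the usual fix, here is a minimal NumPy sketch (my own toy example, not from any particular package): a one-hidden-layer logistic network with two hidden units and all-zero weights. However long you train it, the two hidden units receive identical gradients and never differentiate.

```python
# Toy demonstration: with all-zero weights, every hidden unit sees the same
# gradient, so the two hidden units remain identical copies of each other.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                         # toy normalised inputs
y = rng.integers(0, 2, size=(8, 1)).astype(float)   # toy binary targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.zeros((3, 2))   # input -> hidden weights, all zero
W2 = np.zeros((2, 1))   # hidden -> output weights, all zero

for _ in range(100):
    h = sigmoid(X @ W1)                    # hidden activations
    out = sigmoid(h @ W2)                  # output activation
    d_out = (out - y) * out * (1 - out)    # output delta
    d_h = (d_out @ W2.T) * h * (1 - h)     # hidden deltas
    W2 -= 0.5 * h.T @ d_out
    W1 -= 0.5 * X.T @ d_h

print(np.allclose(W1[:, 0], W1[:, 1]))     # True: the symmetry is never broken
```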
So the standard approach is to choose random uniform initial weights between -1 and 1. Some of the negative initial weights will become positive, and some of the positive ones will become negative, but we don't know which ones (unless we do some unsupervised learning as a pre-processing step), so we just guess.
Note that some packages allow more sophisticated heuristics, such as initial weights between -k and k, where k = sqrt(6 / (number of node inputs + number of node outputs)), for tanh activation nodes.
But the basic idea is the same: trained neural nets have negative weights, so we initialise with some negative weights.
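For concreteness, here is a short sketch of both schemes mentioned above. The layer sizes (fan_in, fan_out) and the random seed are hypothetical choices for illustration, not values from the original answer.

```python
# Two common uniform initialisation schemes for one weight matrix.
import numpy as np

rng = np.random.default_rng(42)
fan_in, fan_out = 64, 32   # hypothetical number of node inputs / outputs

# Simple approach: uniform weights in [-1, 1]
W_simple = rng.uniform(-1.0, 1.0, size=(fan_in, fan_out))

# Glorot/Xavier-style heuristic for tanh nodes: uniform in [-k, k]
k = np.sqrt(6.0 / (fan_in + fan_out))
W_glorot = rng.uniform(-k, k, size=(fan_in, fan_out))

print(W_simple.min(), W_simple.max())     # roughly -1 .. 1
print(k, W_glorot.min(), W_glorot.max())  # k = 0.25, so roughly -0.25 .. 0.25
```

Either way, roughly half the initial weights are negative, which is the point of the heuristic.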
I am quoting @Inanc Gumus because he explained this issue very nicely:
"
Imagine that someone has dropped you from a helicopter onto an unknown mountain top and you're trapped there. Everything is covered in fog. The only thing you know is that you should somehow get down to sea level. Which direction should you take to get down to the lowest possible point?
If you couldn't find a way down to sea level, the helicopter would pick you up again and drop you at the same mountain-top position. You would have to take the same directions again, because you're "initializing" yourself to the same starting position.
However, if the helicopter dropped you somewhere random on the mountain each time, you would take different directions and steps. So, there would be a better chance of reaching the lowest possible point.
This is what is meant by breaking the symmetry. The initialization is asymmetric (each run is different), so you can find different solutions to the same problem.
In this analogy, where you land corresponds to the weights. So, with different weights, there's a better chance of reaching the lowest (or a lower) point.
Also, it increases the entropy in the system, so the system can create more information to help you find lower points (local or global minima)."