Dear RGs,
During model training, we can dump the loss, the range of the weights, and other statistics. One indicator of interest is the L2 norm of the gradients: their maximal/minimal values, their distribution, and so forth.
My question is: suppose I find that 0.5% of the variables have gradient L2 norms below 1e-3 during the iterations of an epoch; can I claim the model is not learning efficiently? Another phenomenon I observe is that the accuracy oscillates instead of ascending across iterations; my batch size is about 1/5 of the training set.
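For reference, here is a minimal sketch of the kind of check I am describing, assuming the gradients have already been pulled out of the framework as NumPy arrays (the `grads` dict and its keys are hypothetical placeholders, and the random values just stand in for real gradients):

```python
import numpy as np

# Hypothetical stand-in for "parameter name -> gradient array" as exposed
# by a training framework; the values are random placeholders.
rng = np.random.default_rng(0)
grads = {
    "layer1/w": rng.normal(scale=1e-2, size=(64, 32)),
    "layer1/b": rng.normal(scale=1e-5, size=(32,)),   # nearly-vanishing gradient
    "layer2/w": rng.normal(scale=1e-1, size=(32, 10)),
}

# Per-variable L2 norm of the gradient.
norms = {name: np.linalg.norm(g) for name, g in grads.items()}

# Fraction of variables whose gradient norm falls below a threshold.
threshold = 1e-3
small = [name for name, n in norms.items() if n < threshold]
frac_small = len(small) / len(norms)

for name, n in sorted(norms.items()):
    print(f"{name}: L2 norm = {n:.3e}")
print(f"fraction of variables below {threshold:g}: {frac_small:.1%}")
```

Logging `frac_small` per epoch is one way to watch whether the share of near-zero gradients grows over training.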
Any hints are appreciated. Thanks in advance,
Lu