I am currently working with gradient descent with momentum (GDM) in a neural network that predicts the color properties of a polymer. The results I obtain are non-monotonic. What are the reasons for non-monotonicity in GDM?
I suppose your question is about non-monotonic behaviour of the error as training proceeds.
Stochastic gradient descent is intrinsically non-monotonic, but only on small scales.
If you see strong non-monotonicity on long time scales (error going down, then strongly going up, then down again, etc.), it might just be that your momentum term is too high.
Check your implementation with a zero momentum term, then add a small momentum term to see whether it speeds up training and/or improves performance.
Choose the value of this momentum term based on the best performance on a validation set, as in the sketch below.
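A minimal sketch of that procedure, assuming a synthetic least-squares problem, a plain linear model, and a hand-written SGD loop stand in for your actual polymer-color network:

```
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 5 features, split into train / validation sets.
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=200)
X_tr, y_tr = X[:150], y[:150]
X_va, y_va = X[150:], y[150:]

def train(momentum, lr=0.05, epochs=50, batch=16):
    """SGD with heavy-ball momentum on a linear least-squares model."""
    w = np.zeros(5)
    v = np.zeros(5)                        # velocity: running average of gradients
    for _ in range(epochs):
        order = rng.permutation(len(X_tr))
        for start in range(0, len(X_tr), batch):
            b = order[start:start + batch]
            grad = 2.0 * X_tr[b].T @ (X_tr[b] @ w - y_tr[b]) / len(b)
            v = momentum * v + grad        # accumulate past gradients
            w -= lr * v
    return np.mean((X_va @ w - y_va) ** 2)  # validation MSE

# Step 1: sanity-check the implementation with zero momentum.
# Step 2: add small momentum values and keep whichever does best on validation.
for m in (0.0, 0.3, 0.6, 0.9):
    print(f"momentum={m:.1f}  validation MSE={train(m):.4f}")
```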
Momentum in SGD smooths the updates by combining the gradients from the last several mini-batches (a running average). The update direction becomes something like g_t + 0.8*g_{t-1} + (0.8)^2*g_{t-2} + ..., where g_t is the gradient from the current mini-batch and 0.8 is the momentum coefficient. Plain SGD would use just the last gradient.
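A worked toy example (the gradient vectors and the 0.8 coefficient below are made up for illustration) showing that the recursive velocity update used in implementations unrolls into exactly that weighted sum:

```
import numpy as np

mu = 0.8                                   # momentum coefficient
grads = [np.array([1.0, -2.0]),            # toy mini-batch gradients g_1, g_2, g_3
         np.array([0.5, 0.5]),
         np.array([-1.0, 1.0])]

# Recursive form, as implemented in SGD with momentum: v_t = mu * v_{t-1} + g_t.
v = np.zeros(2)
for g in grads:
    v = mu * v + g

# Unrolled form: exponentially weighted sum g_t + mu*g_{t-1} + mu^2*g_{t-2} + ...
unrolled = sum(mu ** k * g for k, g in enumerate(reversed(grads)))

print(v, unrolled)   # both print the same vector
```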
A reasonable momentum value will improve convergence speed. It may also lead to better results, since momentum can help escape shallow local minima.
In general, you have to be careful with SGD. A high learning rate, especially combined with high momentum, will result in unstable convergence and may even cause your solution to diverge completely. What counts as "high" depends on your network. All of this assumes that your implementation of SGD is correct.
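As a toy illustration, assuming a one-dimensional quadratic loss f(w) = 0.5 * w**2 so the instability is easy to see:

```
def run(lr, momentum, steps=100):
    """Heavy-ball SGD on the toy loss f(w) = 0.5 * w**2 (its gradient is w)."""
    w, v = 1.0, 0.0
    for _ in range(steps):
        grad = w
        v = momentum * v + grad
        w -= lr * v
    return abs(w)

print("moderate lr:", run(lr=0.1, momentum=0.9))   # shrinks toward 0
print("high lr:    ", run(lr=4.0, momentum=0.9))   # oscillates and blows up
```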