Simple and pragmatic: draw up a matrix of candidate values for the learning rate and the momentum (e.g. learning rate from 0.005 to 0.040 in steps of 0.005, and momentum from 0.30 to 0.80 in steps of 0.05, just as examples) and test them all. If you have access to a cluster or a decent GPU, it is easy to parallelize the search and speed things up considerably; a sketch of such a grid search follows below. Warning: I have already run into situations where the best learning rate was around 1e-7.
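Here is a minimal sketch of that grid search in Python, parallelized across CPU cores with `concurrent.futures`. The `train_and_score` function is a stand-in for your real training routine; I use a toy logistic-regression problem with SGD + momentum so the example actually runs as-is:

```python
import itertools
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def make_data(seed=0):
    # Toy binary classification data standing in for your real dataset.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(500, 10))
    w_true = rng.normal(size=10)
    y = (X @ w_true + rng.normal(scale=0.5, size=500) > 0).astype(float)
    return X, y

def train_and_score(params):
    # Train logistic regression with SGD + momentum; return final loss.
    lr, momentum = params
    X, y = make_data()
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    for _ in range(200):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))    # sigmoid predictions
        grad = X.T @ (p - y) / len(y)         # logistic-loss gradient
        v = momentum * v - lr * grad          # momentum update
        w += v
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return lr, momentum, loss

if __name__ == "__main__":
    lrs = np.arange(0.005, 0.045, 0.005)      # 0.005 .. 0.040
    momenta = np.arange(0.30, 0.85, 0.05)     # 0.30 .. 0.80
    grid = list(itertools.product(lrs, momenta))
    with ProcessPoolExecutor() as pool:       # one worker per core
        results = list(pool.map(train_and_score, grid))
    best = min(results, key=lambda r: r[2])
    print(f"best lr={best[0]:.3f}, momentum={best[1]:.2f}, loss={best[2]:.4f}")
```

On a cluster you would replace the `ProcessPoolExecutor` with your scheduler of choice (one job per grid cell); the structure stays the same. And per the warning above, if the best score lands on the edge of your grid, extend the grid (possibly by orders of magnitude) rather than trusting the boundary value.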
Try making the learning rate adaptive instead of keeping it constant. Various Lyapunov-stability-based adaptive learning rate (ALR) methods exist that adjust the learning rate at each iteration so that the learning speed is optimized. If you want to use a fixed learning rate, it should be around 0.0025, combined with a momentum term. The closer the learning rate is to 1, the faster learning proceeds, but the system may become unstable.
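To make the idea concrete, here is a sketch of one very simple ALR heuristic, the "bold driver" rule (not the Lyapunov-based methods cited above, which are more involved): grow the learning rate slightly after every step that reduces the loss, and cut it sharply when the loss increases. The names `bold_driver_sgd`, `grad_fn`, and `loss_fn` are illustrative, and the quadratic objective at the bottom is just a toy to demonstrate usage:

```python
import numpy as np

def bold_driver_sgd(grad_fn, loss_fn, w0, lr=0.0025, momentum=0.9,
                    up=1.05, down=0.5, steps=500):
    """Gradient descent with momentum and a 'bold driver' adaptive
    learning rate: increase lr after a successful step, shrink it
    (and reject the step) when the loss goes up."""
    w, v = w0.copy(), np.zeros_like(w0)
    prev_loss = loss_fn(w)
    for _ in range(steps):
        v = momentum * v - lr * grad_fn(w)
        w_new = w + v
        new_loss = loss_fn(w_new)
        if new_loss <= prev_loss:
            lr *= up                  # improving: be slightly bolder
            w, prev_loss = w_new, new_loss
        else:
            lr *= down                # overshot: back off, reset momentum
            v[:] = 0.0
    return w, lr

# Toy ill-conditioned quadratic objective to demonstrate usage.
A = np.diag([1.0, 10.0])
loss = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w
w_opt, final_lr = bold_driver_sgd(grad, loss, np.array([5.0, 5.0]))
print(w_opt, final_lr)
```

Note how this captures the trade-off stated above: the rule keeps pushing the learning rate up for speed until instability (a loss increase) is detected, then backs off, so it hovers near the largest stable value instead of you having to guess it.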