Simple and pragmatical: draw a matrix with several values for the learning rate and for the moment (e.g. learning rate in the range 0.005 to 0.040 in 0.005 steps, and moment in the range 0.30 to 0.80 in 0.05 steps, just examples) and test them all. If you have access to a cluster or a nice GPU, it is easy to implement parallelization to speed up the process. Warning: I already had situations where the best learning rate was around 1e-7.