Sometimes the activation functions matter. Vanishing or exploding gradients used to be a common problem before the widespread use of ReLU. The sigmoid-like functions that were popular before ReLU are prone to vanishing gradients because their derivatives approach zero as the inputs saturate, which stalls the optimization of the network parameters. The same kind of problem can also appear when the final activation function is not appropriate for the task and for the loss function being optimized. E.g., if you want your output to be non-negative, you should make sure that the activation function in the final layer cannot produce negative values.
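As a minimal PyTorch sketch of that last point (the layer sizes here are arbitrary, chosen only for illustration), you can enforce non-negative outputs simply by the choice of the final activation:

```python
import torch
import torch.nn as nn

# Hypothetical regression model whose targets are known to be non-negative
# (e.g., counts or prices). Layer sizes are made up for this example.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Softplus(),  # final activation is always positive, unlike a plain
                    # linear output, which can go negative
)

x = torch.randn(4, 16)
print(model(x))  # every value is >= 0 by construction
```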
Also, weird gradient behavior can sometimes be caused by NaNs or zeros in the input data, or by a possible division by zero somewhere in the pipeline - so checking your data is always a good idea.
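A quick sketch of the kind of sanity check I mean (the tensor names and the eps value are placeholders, not part of any particular library):

```python
import torch

def check_batch(x, y, eps=1e-8):
    """Sanity-check a data batch before it reaches the model."""
    assert not torch.isnan(x).any(), "NaNs in the inputs"
    assert not torch.isinf(x).any(), "Infs in the inputs"
    assert not torch.isnan(y).any(), "NaNs in the targets"
    # If some feature is later used as a denominator, keep it bounded
    # away from zero, e.g. ratio = a / (b + eps), instead of a / b.
    return x, y
```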
If everything seems OK, gradient clipping, as suggested above, may sometimes help, as can using a different learning rate per layer.
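In PyTorch, for example, both can be expressed in a few lines; this is only a sketch, with arbitrary layer sizes, learning rates, and clipping threshold:

```python
import torch
from torch import nn

# Toy two-layer model, just to show the mechanics.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

# Per-layer learning rates via optimizer parameter groups
# (the values 1e-3 and 1e-4 are arbitrary examples).
optimizer = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": 1e-3},
    {"params": model[2].parameters(), "lr": 1e-4},
])

loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()

# Gradient clipping: rescale gradients so their global norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```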