I am working with LSTM networks (a type of RNN) and have read a lot about LSTMs and their different variants.

My question is about the forget gate and its dynamics. I understand WHY we use the gates (forget/input/output), but not HOW they prevent the LSTM from suffering the common RNN problem of vanishing gradients.
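For concreteness, the cell-state update I have in mind is the standard formulation (common notation, where $f_t$ and $i_t$ are the forget and input gate activations, $\tilde{c}_t$ is the candidate cell state, and $\odot$ is elementwise multiplication):

$$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
$$

My rough understanding is that, because $c_{t-1}$ enters additively and is only scaled elementwise by $f_t$ (ignoring the gates' indirect dependence on $c_{t-1}$ through $h_{t-1}$), the local gradient $\partial c_t / \partial c_{t-1} \approx \mathrm{diag}(f_t)$ rather than a repeated product of weight matrices and activation derivatives as in a vanilla RNN. But I would like to see this dynamic explained properly.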

Can anyone describe the dynamics of the forget gate and the way it helps LSTM networks learn long-term dependencies?
