Hi. I am working on a prediction task. I have sequences of images with bounding boxes, and I want to predict the future bounding boxes. I am using a bi-directional encoder-decoder RNN with an attention mechanism. The hidden size is 512 and the number of layers is 3.
The input to the RNN encoder is a tensor of size (seq_len, batch_size, input_size). For the moment, both batch_size and seq_len are set to 5. The input_size is 74: I extract a feature vector from each image in a sequence and concatenate it with a bounding box vector.
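To make the shapes concrete, here is a minimal sketch in plain Python of the tensor sizes this setup implies. It assumes the shape conventions of PyTorch's `nn.GRU`/`nn.LSTM` with the default `batch_first=False`; the post does not name a framework or cell type, so that is an assumption, and the helper name `encoder_shapes` is hypothetical.

```python
def encoder_shapes(seq_len, batch_size, input_size, hidden_size,
                   num_layers, bidirectional=True):
    """Return the tensor shapes a stacked RNN encoder would see and produce.

    Assumes PyTorch-style conventions (batch_first=False):
      input  : (seq_len, batch, input_size)
      output : (seq_len, batch, num_directions * hidden_size)
      hidden : (num_layers * num_directions, batch, hidden_size)
    """
    num_directions = 2 if bidirectional else 1
    return {
        "input": (seq_len, batch_size, input_size),
        "output": (seq_len, batch_size, num_directions * hidden_size),
        "hidden": (num_layers * num_directions, batch_size, hidden_size),
    }


# Values taken from the description above.
shapes = encoder_shapes(seq_len=5, batch_size=5, input_size=74,
                        hidden_size=512, num_layers=3, bidirectional=True)
print(shapes["input"])   # (5, 5, 74)
print(shapes["output"])  # (5, 5, 1024)
print(shapes["hidden"])  # (6, 5, 512)
```

Under these assumptions, the attention mechanism and decoder would have to handle 1024-dimensional encoder outputs (forward and backward states concatenated), not 512.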
So, my question is: does my approach look reasonable so far? While training the network, the loss starts at an extremely high value and decreases very slowly (and sometimes does not decrease at all). Is my architecture flawed, or am I making mistakes in my design?