I have a dataset of 12 videos, each consisting of 179 frames. I applied ResNet-50 to these frames to extract features and obtained a tensor of shape (179, 7, 7, 2048) per video. As far as I know,
179 = total number of frames
7×7 = spatial size (height × width) of the final feature map, not the kernel size
2048 = number of feature channels per spatial location
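For context, the extraction step looks roughly like this (a minimal sketch assuming Keras's `ResNet50` with `include_top=False`; `weights=None` and a tiny frame count are stand-ins here just to illustrate the shapes, the real run would use `weights="imagenet"` and all 179 frames):

```python
import numpy as np
from tensorflow.keras.applications import ResNet50

# ResNet-50 without its classification head: the output is the
# final convolutional feature map of shape (7, 7, 2048) per frame.
extractor = ResNet50(weights=None, include_top=False,
                     input_shape=(224, 224, 3))

# Stand-in for the 179 frames of one video
frames = np.random.rand(4, 224, 224, 3).astype("float32")
features = extractor.predict(frames)
print(features.shape)  # (4, 7, 7, 2048)
```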
Now I have to train a ConvLSTM model on the features extracted by ResNet-50, and I know that the input shape for ConvLSTM is either
(samples, time, channels, rows, cols)   # data_format='channels_first'
OR
(samples, time, rows, cols, channels)   # data_format='channels_last'
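In Keras terms, a minimal sketch of what I mean (assuming `ConvLSTM2D` with the default `channels_last` layout; tiny stand-in batch and time sizes are used here instead of the real 12 videos × 179 frames):

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import ConvLSTM2D

# Stand-in data: 2 videos x 5 frames of (7, 7, 2048) features;
# the real tensor would be (12, 179, 7, 7, 2048).
features = np.random.rand(2, 5, 7, 7, 2048).astype("float32")

model = Sequential([
    # time axis left as None so any number of frames (e.g. 179) fits
    ConvLSTM2D(filters=64, kernel_size=(3, 3), padding="same",
               input_shape=(None, 7, 7, 2048)),
])
out = model.predict(features)
print(out.shape)  # (2, 7, 7, 64)
```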
So what exactly should the input shape for the ConvLSTM be, and how can I feed the ResNet-50 output into it?
Regards