I need to combine audio and video data into a single variable. I am using an LSTM for this, but my data is 5-D and an LSTM accepts only 3-D input. I am following this paper:
https://wlv.openrepository.com/bitstream/handle/2436/622981/IF2019.pdf?sequence=2
I have also tried ConvLSTM instead of LSTM, but Google Colab runs out of memory and crashes every time.
Audio feature shape: 508, 10, 300, 353, 1
Video feature shape: 508, 10, 300, 353, 1
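
To make the dimension problem concrete, here is a minimal sketch of the shape arithmetic, not a working model. It assumes the axes mean (samples, timesteps, height, width, channels), which the shapes above do not confirm. The helper names `lstm_shape`, `fused_shape` and `size_gib` are made up for illustration. Collapsing the last three axes into one feature axis gives the 3-D layout an LSTM expects, and the size estimate gives a rough idea of why memory may be running out.

```python
from math import prod

# Shapes from the post; the axis meaning is an assumption:
# (samples, timesteps, height, width, channels).
AUDIO_SHAPE = (508, 10, 300, 353, 1)
VIDEO_SHAPE = (508, 10, 300, 353, 1)


def lstm_shape(shape_5d):
    """Collapse the trailing axes into one feature axis: 5-D -> 3-D."""
    samples, timesteps, *rest = shape_5d
    return (samples, timesteps, prod(rest))


def fused_shape(shape_a, shape_b):
    """Shape after concatenating two 3-D arrays along the feature axis."""
    if shape_a[:2] != shape_b[:2]:
        raise ValueError("samples and timesteps must match to concatenate")
    return (shape_a[0], shape_a[1], shape_a[2] + shape_b[2])


def size_gib(shape, bytes_per_value=4):
    """Memory needed to hold one array of this shape (float32 by default)."""
    return prod(shape) * bytes_per_value / 2**30


audio_3d = lstm_shape(AUDIO_SHAPE)
video_3d = lstm_shape(VIDEO_SHAPE)
fused_3d = fused_shape(audio_3d, video_3d)

print(audio_3d)  # (508, 10, 105900)
print(fused_3d)  # (508, 10, 211800)
print(f"{size_gib(AUDIO_SHAPE):.2f} GiB per modality as float32")
```

On real NumPy arrays the same collapse would be `x.reshape(508, 10, -1)`. Each modality holds about 538 million values, so keeping both in memory at once, plus any copies made during training, could plausibly exceed what a free Colab session provides. Feeding the data in smaller batches or shrinking the 300 x 353 frames first may help, though that is a guess without seeing the model.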