I am trying to build a joint classifier for bimodal sentiment analysis that takes two modalities (audio and video files) as inputs. How can I concatenate the audio and video features below to train a CNN-based deep learning model?
Audio features:
import numpy as np
X_aud = np.asarray(aud_data)
y_aud = np.asarray(aud_labels)
X_aud.shape, y_aud.shape
((1440, 40), (1440,))
Video features:
X_img = np.asarray(image_data)
y_img = np.asarray(img_labels)
X_img.shape, y_img.shape
((11275, 256, 512, 3), (11275,))
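One thing worth noting from the shapes above: there are 1440 audio samples but 11275 video frames, so the two arrays cannot be concatenated row by row as they stand. Below is a minimal sketch of one possible way to pair them, assuming each video frame can be traced back to the audio clip it came from. The `frame_to_clip` array and the `pair_audio_with_frames` helper are hypothetical names introduced only for illustration (nothing like them exists in the data shown), and the demo arrays are small random stand-ins, not the real features.

```python
import numpy as np

def pair_audio_with_frames(X_aud, frame_to_clip):
    """Repeat each clip's audio vector once per video frame of that clip.

    frame_to_clip is a HYPOTHETICAL index array (an assumption, not part of
    the original data): entry i is the row of X_aud that frame i belongs to.
    """
    frame_to_clip = np.asarray(frame_to_clip)
    if frame_to_clip.min() < 0 or frame_to_clip.max() >= len(X_aud):
        raise ValueError("frame_to_clip contains an index outside X_aud")
    # Fancy indexing picks one audio row per frame, duplicating rows as needed
    return X_aud[frame_to_clip]

# Small random stand-ins with the same layout as the arrays in the question
rng = np.random.default_rng(0)
X_aud_demo = rng.normal(size=(3, 40))        # 3 audio clips, 40 features each
X_img_demo = rng.normal(size=(7, 8, 16, 3))  # 7 frames of tiny RGB images
frame_to_clip_demo = np.array([0, 0, 1, 1, 1, 2, 2])

X_aud_per_frame = pair_audio_with_frames(X_aud_demo, frame_to_clip_demo)
print(X_aud_per_frame.shape, X_img_demo.shape)  # (7, 40) (7, 8, 16, 3)
```

Once both arrays share the same first dimension, they could in principle be fed to a two-input model: a CNN branch for the frames and a small dense branch for the audio vector, joined by a concatenation layer (for example Keras `Concatenate`) before the classifier head. Whether per-frame pairing like this or pooling the frames per clip is the better choice depends on how the labels are defined.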
Any help would be highly appreciated. Thanks in advance!