I am trying to build a joint classifier for bimodal sentiment analysis that takes two modalities (audio and video files) as input. How can I concatenate the audio and video features below to train a CNN-based deep learning model?

Audio features:

X_aud = np.asarray(aud_data)

y_aud = np.asarray(aud_labels)

X_aud.shape, y_aud.shape

((1440, 40), (1440,))

Video features:

X_img = np.asarray(image_data)

y_img = np.asarray(img_labels)

X_img.shape, y_img.shape

((11275, 256, 512, 3), (11275,))
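
Note that the two arrays have different numbers of samples (1440 audio vectors vs. 11275 images), so they cannot be concatenated row by row as they are. The following is only a minimal NumPy sketch of feature-level fusion, under the assumption that each image is a frame belonging to one audio clip. The `clip_ids` mapping, the `fuse_features` helper, and the small random demo arrays are hypothetical and not part of my data. Global average pooling stands in here for whatever a CNN branch would extract from each frame.

```python
import numpy as np

def fuse_features(X_aud, X_img, clip_ids):
    """Feature-level fusion producing one vector per audio clip.

    X_aud:    (n_clips, n_audio_features)
    X_img:    (n_frames, height, width, channels)
    clip_ids: (n_frames,) hypothetical index of the clip each frame belongs to
    """
    n_clips = X_aud.shape[0]
    # Global average pooling over height and width: one vector per frame.
    frame_feats = X_img.mean(axis=(1, 2))
    # Average the frame vectors that belong to each clip.
    sums = np.zeros((n_clips, frame_feats.shape[1]))
    np.add.at(sums, clip_ids, frame_feats)
    counts = np.bincount(clip_ids, minlength=n_clips).reshape(-1, 1)
    clip_feats = sums / np.maximum(counts, 1)
    # Concatenate along the feature axis; rows now line up clip by clip.
    return np.concatenate([X_aud, clip_feats], axis=1)

# Small random stand-ins with the same layout as X_aud and X_img.
rng = np.random.default_rng(0)
X_aud_demo = rng.normal(size=(4, 40))
X_img_demo = rng.random(size=(10, 8, 16, 3))
clip_ids_demo = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])  # assumed frame-to-clip mapping

X_fused = fuse_features(X_aud_demo, X_img_demo, clip_ids_demo)
print(X_fused.shape)  # (4, 43)
```

In an actual model, the pooled image vector would presumably be replaced by the output of a CNN branch, with the concatenation done inside the network before the final classification layers. Either way, the rows (and labels) of the two modalities have to be aligned first.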

Any help would be highly appreciated. Thanks in advance!
