How can I implement late fusion using audio and video features?

07 August 2020

I am trying to build a joint classifier for bimodal sentiment analysis that takes two modalities (audio and video files) as inputs. Any suggestions on how to concatenate the audio and video features below to train a CNN-based deep learning model?

Audio features:

X_aud = np.asarray(aud_data)

y_aud = np.asarray(aud_labels)

X_aud.shape, y_aud.shape

((1440, 40), (1440,))

Video features:

X_img = np.asarray(image_data)

y_img = np.asarray(img_labels)

X_img.shape, y_img.shape

((11275, 256, 512, 3), (11275,))

Any help would be highly appreciated. Thanks in advance!
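One possible starting point, offered as a rough sketch rather than a definitive answer. Concatenating features before a single classifier is usually called feature-level (early) fusion, while late fusion trains one model per modality and combines their predicted class probabilities. The shapes above also differ in their first dimension (1440 audio rows versus 11275 image rows), so the frames would probably need to be grouped per clip before the two streams can be combined. The sketch below assumes you already have softmax outputs from two separately trained models; `frame_clip_ids` (a mapping from each frame to its clip index), `audio_weight`, and both function names are hypothetical and not taken from the question, and random toy data stands in for real predictions.

```python
import numpy as np


def aggregate_frames_to_clips(frame_probs, frame_clip_ids, n_clips):
    """Average frame-level class probabilities into one row per clip."""
    n_classes = frame_probs.shape[1]
    sums = np.zeros((n_clips, n_classes))
    counts = np.zeros(n_clips)
    # Unbuffered accumulation: repeated clip ids are summed correctly.
    np.add.at(sums, frame_clip_ids, frame_probs)
    np.add.at(counts, frame_clip_ids, 1)
    if np.any(counts == 0):
        raise ValueError("every clip needs at least one frame")
    return sums / counts[:, None]


def late_fusion(audio_probs, video_probs, audio_weight=0.5):
    """Weighted average of per-clip class probabilities from two models."""
    if audio_probs.shape != video_probs.shape:
        raise ValueError("both models must score the same clips and classes")
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * video_probs
    return fused, fused.argmax(axis=1)


# Toy stand-ins for model outputs (assumption: 4 clips, 10 frames, 3 classes).
rng = np.random.default_rng(0)
n_clips, n_frames, n_classes = 4, 10, 3
audio_probs = rng.dirichlet(np.ones(n_classes), size=n_clips)   # audio model, per clip
frame_probs = rng.dirichlet(np.ones(n_classes), size=n_frames)  # image model, per frame
frame_clip_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])       # hypothetical frame-to-clip map

video_probs = aggregate_frames_to_clips(frame_probs, frame_clip_ids, n_clips)
fused_probs, fused_labels = late_fusion(audio_probs, video_probs, audio_weight=0.5)
```

With real data, `audio_probs` would come from a model trained on `X_aud` and `frame_probs` from a CNN trained on `X_img`. The fusion weight could be tuned on a validation split, or the averaging step replaced by a small classifier trained on the stacked probabilities.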
