Thank you so much for your answer and recommendation. I worked on this research project back in February. The dataset you mentioned doesn't have label information for text and images. Below you can find the work.
Conference Paper: Multimodal Classification Fusion in Real-World Scenarios