Depending on your exact data types and shapes, you may be able to develop a custom architecture using a library such as TensorFlow. Please see the work below, specifically section 2.2.3, which describes a custom architecture combining a CNN for a matrix-encoded molecule SMILES representation (you could use a similar CNN for images, albeit with a different configuration) merged with a multilayer perceptron for numerical data.
You could take a similar approach, if you think it fits your problem, combining two different network types suited to your respective data modalities.
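If you want to try this, a two-branch model of the kind described in section 2.2.3 can be sketched with the Keras functional API. This is only a minimal sketch: the input shapes (a 100x40 one-hot matrix, 8 numerical descriptors), layer sizes, and the single regression output are all assumptions, not the architecture from the article.

```python
# Sketch of a two-branch model: a small CNN for a matrix-encoded input
# merged with an MLP for numerical features. Shapes are placeholders.
from tensorflow.keras import layers, Model

# CNN branch: e.g. a 100x40 one-hot SMILES matrix (shape is an assumption)
cnn_in = layers.Input(shape=(100, 40, 1), name="matrix_input")
x = layers.Conv2D(32, (3, 3), activation="relu")(cnn_in)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)

# MLP branch: e.g. 8 numerical descriptors (dimension is an assumption)
mlp_in = layers.Input(shape=(8,), name="numeric_input")
y = layers.Dense(32, activation="relu")(mlp_in)

# Merge the two branches and regress a single value (e.g. pIC50)
merged = layers.concatenate([x, y])
z = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, name="prediction")(z)

model = Model(inputs=[cnn_in, mlp_in], outputs=out)
model.compile(optimizer="adam", loss="mse")
```

You would then train with `model.fit([matrix_data, numeric_data], targets)`, feeding one array per input branch.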
Hope this helps!
Article On Approximating the pIC50 Value of COVID-19 Medicines In Si...
When conducting research with a mixed dataset comprising both image and text data, the key is to harmonize the two modalities into one coherent pipeline. After defining a clear research objective, the collected data undergo separate preprocessing: resizing and normalization for images, cleaning and tokenization for text. Integrating the two streams requires careful thought, often via multi-modal learning techniques or the construction of a unified dataset. Feature extraction is pivotal, distilling pertinent information from each modality. Model selection also matters: choose an architecture that can handle both data types effectively. Training involves balancing the contribution of each modality, followed by evaluation with metrics tailored to each one. Interpreting the results is a nuanced task, since you need to understand how the model combines image and text evidence to reach its predictions. Finally, documentation and reporting capture the entire process, giving a comprehensive account of the methodology and findings.
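The separate preprocessing step described above can be sketched in a few lines. This is a toy illustration only: the 64x64 target size, the [0, 1] normalization range, and the crop/pad "resize" and regex-based cleaning are simplifying assumptions (in practice you would use a proper image library and tokenizer).

```python
# Minimal sketch of per-modality preprocessing:
# NumPy for the image side, plain Python for the text side.
import re
import numpy as np

def preprocess_image(img: np.ndarray) -> np.ndarray:
    """Resize (here by naive cropping/zero-padding) and normalize to [0, 1]."""
    target_h, target_w = 64, 64  # assumed target size
    resized = np.zeros((target_h, target_w), dtype=np.float32)
    h, w = min(img.shape[0], target_h), min(img.shape[1], target_w)
    resized[:h, :w] = img[:h, :w]
    return resized / 255.0  # scale 8-bit pixel values into [0, 1]

def preprocess_text(text: str) -> list:
    """Lowercase, strip punctuation, and tokenize on whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return cleaned.split()
```

For example, `preprocess_text("Multi-modal learning!")` yields `["multimodal", "learning"]`, and any 8-bit grayscale array comes out as a 64x64 float array in [0, 1].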
Conducting research with a mixed dataset containing both image and text data involves defining clear research objectives, collecting and preprocessing diverse annotated data, integrating image and text features through a chosen model architecture (such as a multi-modal neural network or Transformer), training the model on the mixed dataset, and evaluating its performance with appropriate metrics. Fine-tuning and optimization may follow, with attention to ethics and transparency in decision-making. The final steps are interpreting and visualizing the learned representations and communicating the findings through publications or presentations; a solid grasp of both computer vision and natural language processing is needed throughout the process.
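The image-and-text fusion described above can be sketched as a simple late-fusion network in Keras. Again, this is only a sketch under stated assumptions: the 64x64 RGB image size, 20-token sequences, 5000-word vocabulary, and 3 output classes are placeholders, and concatenation is just one of several possible fusion strategies.

```python
# Sketch of late fusion for image + text classification.
# All shapes, vocabulary size, and class count are assumptions.
from tensorflow.keras import layers, Model

# Image branch: small CNN over assumed 64x64 RGB inputs
img_in = layers.Input(shape=(64, 64, 3), name="image")
a = layers.Conv2D(16, 3, activation="relu")(img_in)
a = layers.GlobalAveragePooling2D()(a)

# Text branch: embedding over assumed 20-token sequences, 5000-word vocabulary
txt_in = layers.Input(shape=(20,), name="tokens")
b = layers.Embedding(input_dim=5000, output_dim=32)(txt_in)
b = layers.GlobalAveragePooling1D()(b)

# Late fusion: concatenate the two feature vectors, then classify
fused = layers.concatenate([a, b])
out = layers.Dense(3, activation="softmax", name="label")(fused)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Evaluation with `model.evaluate` then reports loss and accuracy on the joint inputs; per-modality diagnostics (e.g. ablating one branch) can be added on top of this.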