Is there a well-established approach or model architecture for multimodal fusion that takes both text and tabular data as input? The raw text should be fed to the model directly, rather than being converted into structured features first and then merged with the tabular data.
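
To make the setup concrete, here is a minimal sketch of one possible shape for such a model: concatenation-based fusion, where a text branch inside the model turns the raw string into a vector that is joined with the tabular features before a shared prediction head. Everything in it is an assumption made for illustration. The class name `TextTabularFusion` is hypothetical, the hashed-token embedding is only a stand-in for a real text encoder, the weights are random and untrained, and the example sentence and feature values are made up.

```python
import math
import random
import zlib


class TextTabularFusion:
    """Toy joint model: a raw string and tabular features go in, one score comes out."""

    def __init__(self, n_tabular, embed_dim=8, vocab_buckets=1000, seed=0):
        rng = random.Random(seed)
        self.n_tabular = n_tabular
        self.embed_dim = embed_dim
        self.vocab_buckets = vocab_buckets
        # Text branch: one embedding vector per hashed-token bucket
        # (a stand-in for a learned text encoder).
        self.embeddings = [
            [rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
            for _ in range(vocab_buckets)
        ]
        # Fusion head: a single linear layer over [text_vector ; tabular_vector].
        self.weights = [rng.gauss(0.0, 0.1) for _ in range(embed_dim + n_tabular)]
        self.bias = 0.0

    def encode_text(self, text):
        """Mean-pool token embeddings; the raw string itself is the input."""
        tokens = text.lower().split()
        pooled = [0.0] * self.embed_dim
        if not tokens:
            return pooled
        for token in tokens:
            # Deterministic hash so the same token always maps to the same bucket.
            bucket = zlib.crc32(token.encode("utf-8")) % self.vocab_buckets
            for i, value in enumerate(self.embeddings[bucket]):
                pooled[i] += value
        return [value / len(tokens) for value in pooled]

    def forward(self, text, tabular):
        """Concatenate both modalities and map them to a score in (0, 1)."""
        if len(tabular) != self.n_tabular:
            raise ValueError("unexpected number of tabular features")
        fused = self.encode_text(text) + list(tabular)  # feature-level fusion
        logit = sum(w * x for w, x in zip(self.weights, fused)) + self.bias
        return 1.0 / (1.0 + math.exp(-logit))


# Made-up example: one free-text field plus three numeric columns.
model = TextTabularFusion(n_tabular=3)
score = model.forward("the delivery arrived two days late", [0.4, 1.0, -0.2])
print(score)
```

The point of the sketch is only the data flow: the text is never turned into hand-built columns outside the model, and both branches would be trained together through the shared head.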