For image+text (no video), how is pre-training of a Multimodal Large Language Model (MLLM) generally done?
Choice-1: Transform the image into text (e.g., a caption), then feed all of the text into the LLM?
Choice-2: Transform the image into discrete tokens and feed those tokens into the LLM together with the text tokens? (A rough sketch of what I mean by this is below.)
Or are there other choices?
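To make Choice-2 concrete, here is a minimal sketch of the "discrete image tokens" idea, assuming a VQ-style image tokenizer whose codebook indices are folded into an extended LM vocabulary. All class and function names below are hypothetical, and the tokenizer is faked, purely to show the data flow:

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 32000            # size of the text tokenizer's vocabulary (assumed)
IMAGE_CODEBOOK = 8192         # size of the image tokenizer's codebook (assumed)
VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK  # one shared, extended vocabulary

class ToyMLLM(nn.Module):
    """Toy decoder-only LM over a vocabulary covering text AND image codes."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB)

    def forward(self, token_ids):
        # Causal mask so it behaves like an autoregressive LM.
        mask = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        h = self.backbone(self.embed(token_ids), mask=mask)
        return self.lm_head(h)  # next-token logits over text + image codes

def quantize_image(images):
    """Stand-in for a frozen VQ tokenizer: image -> discrete codebook ids.
    A real tokenizer would encode patches and do nearest-codebook lookup;
    here we fabricate 16 indices per image just to show the plumbing."""
    return torch.randint(0, IMAGE_CODEBOOK, (images.shape[0], 16))

text_ids = torch.randint(0, TEXT_VOCAB, (1, 10))       # tokenized caption
image_ids = quantize_image(torch.randn(1, 3, 64, 64))  # discrete image tokens
# Offset the image ids so they occupy the extended part of the vocabulary,
# then concatenate with the text ids: one sequence, one next-token objective.
sequence = torch.cat([image_ids + TEXT_VOCAB, text_ids], dim=1)
logits = ToyMLLM()(sequence)
print(logits.shape)  # (1, 26, VOCAB)
```

So under Choice-2, image and text end up in the same token stream and are trained with the same language-modeling loss; the open question for me is whether this, Choice-1, or something else is what MLLMs typically do in pre-training.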