20 August 2024

For image+text data (no video), how is pre-training of a Multimodal Large Language Model (MLLM) generally done?

Choice-1: Transform the image into text (e.g., a generated caption) and then feed all of the text to the LLM?
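To make Choice-1 concrete, here is a minimal sketch of how one pre-training example might be assembled under that scheme. The names `caption_model` and `tokenizer` are hypothetical stand-ins for any off-the-shelf image captioner and text tokenizer, not a specific library API:

```python
def build_choice1_example(image, text, caption_model, tokenizer):
    """Choice-1: everything becomes text before it reaches the LLM."""
    caption = caption_model(image)       # e.g. "a dog running on a beach"
    full_text = f"Image description: {caption}\n{text}"
    return tokenizer(full_text)          # ordinary text-only token ids
```

The LLM itself stays unchanged here; the cost is that any visual detail the captioner drops is lost to the model.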

Choice-2: Transform the image into discrete tokens and feed those tokens to the LLM together with the text?
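And a corresponding sketch of Choice-2, assuming a VQ-VAE/VQGAN-style `image_tokenizer` that maps an image to a sequence of discrete codebook indices; `image_tokenizer`, `text_tokenizer`, and `text_vocab_size` are illustrative assumptions, not part of the original question:

```python
import torch

def build_choice2_example(image, text, image_tokenizer, text_tokenizer,
                          text_vocab_size):
    """Choice-2: the image becomes discrete ids in a shared vocabulary."""
    # Offset image codebook ids past the text vocabulary so the two token
    # spaces do not collide in the LLM's enlarged embedding table.
    image_ids = image_tokenizer(image) + text_vocab_size
    text_ids = torch.tensor(text_tokenizer(text), dtype=torch.long)
    # One flat sequence; pre-training is plain next-token prediction over both.
    return torch.cat([image_ids, text_ids])
```

Under this scheme the LLM's vocabulary and embedding table are extended to cover the image codebook, and training remains standard autoregressive language modeling.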

Or are there other choices?
