I understand that Large Language Models are usually biased towards English, reflecting mainly English-language contexts and corpora. On the other hand, they are claimed to have developed "emergent" multilingual capabilities, which were found to be useful and have therefore been further fostered by the companies developing them. This raises a number of questions:

  • How does multilingualism in interactive generative AI usually work? Is there an extra layer that translates user prompts and model outputs before and after the actual LLM does its work?
  • There are LLMs trained explicitly on multilingual data (e.g. Teuken-7B). Do these use the same mechanisms as conventional LLMs?
  • Can one tell whether adding, say, Hebrew training data changes the way the model behaves when used in English, even a little?
  • How come LLMs don't confuse tokens that occur in several languages? (See the tokenizer sketch after this list.)
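To make the last question concrete, here is a minimal sketch using the tokenizer of XLM-RoBERTa via the Hugging Face transformers library (the model is chosen purely for illustration; other multilingual tokenizers behave similarly). The word "die" exists both in German ("the") and in English (the verb). The tokenizer is language-agnostic: the same surface string maps to the same token ID(s) regardless of which language the sentence is in, so any disambiguation has to happen inside the model, from the surrounding context.

```python
# Minimal sketch; assumes the `transformers` package is installed and the
# tokenizer can be downloaded. Model choice is illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# "die" appears in both a German and an English sentence. The tokenizer
# does not know or care which language it is processing; identical
# character sequences yield identical token IDs.
for sentence in ["Ich sehe die Katze.", "Old habits die hard."]:
    tokens = tokenizer.tokenize(sentence)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(sentence)
    print(list(zip(tokens, ids)))
```

Running this shows that the token for "die" carries no language label at all; the model's contextual representations are what separate the German article from the English verb.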