What factors affect machine translation (MT) quality? I’m looking for specific, state-of-the-art reflections written by humans and grounded in published research, rather than AI-generated, impressionistic, dated, or general discussions.

I often hear that the quantity of resources is the crux of the issue. However, my hunch is that the language pair, and more precisely the language combination (directionality), is also an influencing factor. Say you're translating from Japanese (a high-context language) into French (a low-context language). In Japanese, you don't need to specify gender, number, etc. In French, you need that information, which means you'll have to make a guess (and take a chance), do external research, ask the client, etc.; either way, you probably won't find the answer within the source text (ST). Arguably, an MT system cannot make good decisions in that sort of context. By contrast, if you translate from Spanish into French, most of the information you need for the French target text (TT) can be retrieved directly from the Spanish ST.

When I researched the question in 2017-2018, it was clear from the literature that linguistic distance was a relevant factor in MT quality. For example: "Machine translation (MT) between (closely) related languages is a specific field in the domain of MT which has attracted the attention of several research teams. Nevertheless, it has not attracted as much attention as MT between distant languages. This is, on the one side, due to the fact that speakers of these languages often easily understand each other without switching to the foreign language. […] Another fact is that MT between related languages is less problematic than between distant languages…" (Popović, Arčan & Klubička, 2016, p. 43).

But what about now, in 2023 (soon 2024), with LLMs and recent improvements in NMT? Thank you!
