I have a set of paired texts in English and Spanish. I tokenized them with the pre-trained Punkt models that ship with the NLTK package for Python. This works well, but I want to control for the natural variance between languages (some languages need more sentences, or more words, to express the same content).

For this, I took a chapter of Harry Potter in both its English and Spanish translations and counted the words (e.g., English - 1000, Spanish - 1200) and sentences in each version. Since the text is assumed to be a faithful translation, whatever difference remains should, in theory, reflect the natural variation between the two languages.
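For reference, the counts come from something along these lines (a minimal sketch; the file names and the count_units helper are just placeholders, not my actual setup):

import nltk

nltk.download("punkt")  # pre-trained Punkt sentence models

def count_units(path, language):
    """Return (sentence_count, word_count) for a plain-text file."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    sentences = nltk.sent_tokenize(text, language=language)
    words = nltk.word_tokenize(text, language=language)
    return len(sentences), len(words)

# Hypothetical control-chapter files (English and Spanish translations)
en_sents, en_words = count_units("hp_chapter_en.txt", "english")
es_sents, es_words = count_units("hp_chapter_es.txt", "spanish")
print(en_sents, en_words, es_sents, es_words)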

My dataset looks something like this:

lang  sentences
en    100
es    200
en    300
es    400
en    500
es    600

The control Harry Potter chapter yields 55 sentences for English and 58 for Spanish.

Is there a metric, or some other way, to take this natural variance between the two languages into account and adjust the "sentences" column accordingly?
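The simplest idea I can think of is to treat the control ratio as a per-language inflation factor and divide it out of the observed counts. A rough sketch of that (the column name "sentences_adj" and the choice of English as the reference are just illustrative assumptions):

import pandas as pd

df = pd.DataFrame({"lang": ["en", "es", "en", "es", "en", "es"],
                   "sentences": [100, 200, 300, 400, 500, 600]})

# Control chapter: 55 English sentences vs. 58 Spanish sentences
control = {"en": 55, "es": 58}
baseline = control["en"]  # take English as the reference language

# Expected inflation of each language relative to the reference
factor = {lang: n / baseline for lang, n in control.items()}  # en: 1.0, es: ~1.055

# Adjusted counts: what each text "would" have if it behaved like the reference
df["sentences_adj"] = df.apply(lambda r: r["sentences"] / factor[r["lang"]], axis=1)
print(df)

But I am not sure whether a plain ratio like this is statistically sound, or whether there is an established metric for it.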
