I have a set of paired texts in English and Spanish. I use the Punkt tokenizer with the pre-trained models that ship with the NLTK package for Python. It works well, but I want to control for the natural variance that exists between languages (some languages need more sentences, or more words, to express the same content).
To do this, I took a chapter of Harry Potter in its English and Spanish translations and counted the words (e.g., English - 1000, Spanish - 1200) and sentences in each version. Since the text should be a faithful translation, in theory whatever difference remains is the natural variation between the languages.
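For reference, the counting step is roughly this (a minimal sketch; the file paths are placeholders for wherever the chapter texts are stored):

```python
import nltk
nltk.download("punkt", quiet=True)  # pre-trained Punkt models
from nltk.tokenize import sent_tokenize, word_tokenize

# Placeholder paths for the control chapter in each language.
with open("hp_chapter_en.txt", encoding="utf-8") as f:
    hp_en = f.read()
with open("hp_chapter_es.txt", encoding="utf-8") as f:
    hp_es = f.read()

# Punkt sentence splitting and word tokenization, per language.
en_sents = sent_tokenize(hp_en, language="english")
es_sents = sent_tokenize(hp_es, language="spanish")
en_words = word_tokenize(hp_en, language="english")
es_words = word_tokenize(hp_es, language="spanish")

print(len(en_words), len(es_words))   # word counts (e.g., ~1000 vs. ~1200)
print(len(en_sents), len(es_sents))   # sentence counts (55 vs. 58 in my case)
```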
My dataset looks something like this:
lang sentences
en 100
es 200
en 300
es 400
en 500
es 600
The sentence counts obtained from the Harry Potter control chapter are 55 for English and 58 for Spanish.
Is there a metric, or some established way, to take this natural variance between the two languages into account and adjust the "sentences" column accordingly?
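To make the question concrete, the naive adjustment I have in mind is something like the sketch below, which simply scales each language by the ratio of the control-chapter counts, taking English as the reference (the pandas layout mirrors the dataset above). I'm not sure this is statistically sound, which is why I'm asking:

```python
import pandas as pd

# Counts from my dataset (the "sentences" column above).
df = pd.DataFrame({
    "lang": ["en", "es", "en", "es", "en", "es"],
    "sentences": [100, 200, 300, 400, 500, 600],
})

# Sentence counts from the Harry Potter control chapter.
control = {"en": 55, "es": 58}

# Naive idea: scale each language so the control chapter would have
# the same number of sentences in both (English as the reference).
ratio = {lang: control["en"] / n for lang, n in control.items()}
df["sentences_adj"] = df["sentences"] * df["lang"].map(ratio)

print(df)
```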