For instance, the layout of the sentences (e.g. knowing that a specific sentence is a bullet item and that it relates to another sentence stating the scope of the bullets). Moreover, many NLP parsers break when the extraction mechanism delivers broken sentences, i.e. sentences split in such a way that a regex cannot tell whether a line break really exists in the source document and therefore cannot clean the text reliably. Broken sentences of this kind can also hurt the quality of embeddings, since an embedding takes neighboring words into account.
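To make the line-break ambiguity concrete, here is a minimal sketch in Python. The function `naive_unwrap` and both sample strings are hypothetical, made up for illustration; they are not taken from any particular extraction tool. The same regex rule that repairs a hard-wrapped sentence also flattens a genuine bullet list, because the extracted text alone does not say which newlines existed in the source document.

```python
import re


def naive_unwrap(text: str) -> str:
    """Hypothetical cleanup rule: replace every line break with one space."""
    return re.sub(r"\s*\n\s*", " ", text).strip()


# Case 1: the newline is an extraction artifact inside one sentence.
wrapped = "The parser fails on broken\nsentences like this one."

# Case 2: the newlines are real structure (a scope sentence plus bullets).
bullets = "The scope covers:\n- parsing\n- embeddings"

print(naive_unwrap(wrapped))
# The parser fails on broken sentences like this one.

print(naive_unwrap(bullets))
# The scope covers: - parsing - embeddings
```

The first result is what we want, while the second loses the link between the scope sentence and its bullets. A smarter pattern (for example, keeping newlines that precede `-`) helps only when the bullet marker survives extraction, which is not guaranteed.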