For instance, the layout of the sentences (e.g. knowing that a specific sentence is a bullet item and that it relates to another sentence stating the scope of the bullets). Moreover, many NLP parsers break when the extraction mechanism delivers broken sentences (e.g. when a regex cannot tell whether a line break actually exists in the source document, and therefore cannot clean the text). Such broken sentences may also distort the embeddings obtained from the text, since embeddings take neighboring words into account.
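To make the ambiguity concrete, here is a minimal sketch of a naive cleanup pass. It rests on one assumption: a line that does not end in sentence-final punctuation continues on the next line. The function name `join_broken_lines` and the sample text are hypothetical and used only for illustration; this is not a recommended solution, just a demonstration of where a regex-only approach guesses wrong.

```python
import re

# Assumption: a line ending in one of these characters is a complete unit.
SENTENCE_END = re.compile(r"[.!?:;]$")


def join_broken_lines(text: str) -> str:
    """Merge lines that look like mid-sentence breaks (a guess, not ground truth)."""
    merged = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip empty lines
        if merged and not SENTENCE_END.search(merged[-1]):
            if merged[-1].endswith("-"):
                # Treat a trailing hyphen as a word split across the break.
                merged[-1] = merged[-1][:-1] + line
            else:
                merged[-1] = merged[-1] + " " + line
        else:
            merged.append(line)
    return "\n".join(merged)


# Hypothetical extractor output: one broken sentence, then a two-item bullet list.
extracted = (
    "The parser fails when sen-\n"
    "tences are split\n"
    "across lines.\n"
    "Scope of the bullets:\n"
    "first item\n"
    "second item"
)

print(join_broken_lines(extracted))
```

In this sample the broken sentence is repaired, but the two bullet items are merged into a single line, because without layout information the heuristic cannot tell a bullet boundary from a mid-sentence break.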
