01 December 2020

I am currently working on an NLP experiment. The dataset I am using is 'Essays', which consists of essays written by participants. These essays contain plenty of misspelled words and sentences without punctuation.

Since I want to use this dataset to implement a deep learning NLP model that matches tokens against a vocabulary, I need clean data with at least full stops and correctly spelled words.

Right now I am manually adding full stops, but I would really like to know the best practices around this.
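For the spelling part, something along the lines of the sketch below is what I had in mind: a minimal, unvalidated example using the pyspellchecker package, where the correct_spelling helper and the sample sentence are just placeholders of my own. I am not sure whether a simple dictionary-based corrector like this is considered good practice, and it does nothing about the missing full stops:

```python
# Rough sketch of word-level spelling correction with pyspellchecker
# (pip install pyspellchecker). Not validated on the Essays dataset;
# tokenisation and case handling are deliberately naive.
from spellchecker import SpellChecker

spell = SpellChecker()  # English frequency dictionary by default


def correct_spelling(text: str) -> str:
    """Replace words the dictionary does not recognise with the most likely correction."""
    tokens = text.split()  # naive whitespace split; punctuation handling omitted
    out = []
    for tok in tokens:
        if spell.known([tok]):
            out.append(tok)  # word is already in the dictionary
        else:
            # correction() may return None when no candidate is found,
            # in which case the original token is kept
            out.append(spell.correction(tok) or tok)
    return " ".join(out)


print(correct_spelling("the essay was realy interesting and well writen"))
# e.g. -> "the essay was really interesting and well written"
# (exact output depends on the dictionary and edit-distance settings)
```

This only handles isolated misspellings, so I assume a more robust pipeline (and some way of restoring sentence boundaries) is still needed, which is exactly what I am asking about.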

What are the commonly followed data preprocessing approaches in this case to improve the quality of the dataset?
