I want to know about emerging topics in the field of "Document analysis and recognition" and its related areas, such as pre-processing, document layout analysis, OCR technologies, etc.
Text classification is one of the most active areas in text analysis. To apply machine learning algorithms to text documents, the documents must first be converted into feature vectors. Because each word represents a feature, these vectors have a very large number of dimensions, which degrades the performance of classic machine learning methods. This problem is known as the curse of dimensionality. Feature selection methods can be applied as a pre-processing step to keep only the most prominent features. To this end, for example, you can extend the following recent methods to text data analysis:
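To make the general pipeline in this answer concrete (word-count vectors followed by feature selection), here is a minimal sketch in plain Python. It is not one of the recent methods referred to above: it uses the classic chi-squared score as a stand-in feature selection criterion, and the toy documents, the binary labels, and the helper names (`tokenize`, `chi2_scores`, `select_top_k`, `vectorize`) are all assumptions made up for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


def chi2_scores(docs, labels):
    """Chi-squared score of each term's presence against binary labels (0/1)."""
    n = len(docs)
    doc_terms = [set(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*doc_terms))
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    scores = {}
    for term in vocab:
        # 2x2 contingency table: term present/absent vs. class 1/0
        a = sum(1 for t, y in zip(doc_terms, labels) if term in t and y == 1)
        b = sum(1 for t, y in zip(doc_terms, labels) if term in t and y == 0)
        c = n_pos - a  # term absent, class 1
        d = n_neg - b  # term absent, class 0
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def vectorize(doc, features):
    """Count vector of a document restricted to the selected features."""
    counts = Counter(tokenize(doc))
    return [counts[f] for f in features]


# Hypothetical toy corpus: 1 = spam, 0 = not spam.
docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "monday project meeting notes",
]
labels = [1, 1, 0, 0]

scores = chi2_scores(docs, labels)
features = select_top_k(scores, 3)
reduced = [vectorize(d, features) for d in docs]
```

The reduced vectors have three dimensions instead of one per vocabulary word, and they can be passed to any standard classifier. On a real corpus you would likely use a library implementation and tune the number of selected features on held-out data.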
Some recent and interesting topics and trends in natural language processing are summarized in the presentation available at http://ttic.uchicago.edu/~mbansal/papers/nlp_shortcourse_bansal_ut2015.pdf
You can go deeper into each topic through the references.
One of the main emerging topics in the field of document analysis and recognition is word spotting. Word spotting is an alternative to OCR, because OCR does not always produce accurate results on ancient manuscript documents.
I'm doing some thinking about the value documents add to qualitative studies (interviews and ethnography). I'm also interested in the discourses implied by text and images.
I'm writing a book on this for Routledge, which will be out next year. I'm also setting up an online network of people who do research with documents, which will go live in June, if you fancy joining. My email is [email protected]