10 October 2012

On the face of it, topic modelling, whether it is achieved using LDA, HDP, NNMF, or any other method, is very appealing. Documents are partitioned into topics, which in turn have terms associated with them to varying degrees. However, in practice there are some clear issues: the models are very sensitive to the input data, so small changes to the stemming/tokenisation algorithms can result in completely different topics; topics need to be manually categorised in order to be useful (often arbitrarily, as the topics often contain mixed content); and topics are "unstable" in the sense that adding new documents can cause significant changes to the topic distribution (less of an issue with large corpora).
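To make the first point concrete, here is a minimal sketch of how a small preprocessing change alters the term counts that any topic model would be fitted on. The two documents, the `crude_stem` helper and its suffix list are made up for illustration (it is not the Porter stemmer or any library's stemmer), and no actual topic model is fitted here; the sketch only shows that the input vocabulary itself shifts.

```python
import re
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "Topic models partition documents into topics.",
    "Modelling topics needs careful tokenisation.",
]

def tokenise_plain(text):
    # Lowercase and keep runs of letters; no stemming.
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    # Hypothetical toy stemmer: strip one common suffix if enough remains.
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenise_stemmed(text):
    return [crude_stem(t) for t in tokenise_plain(text)]

def vocabulary(tokeniser):
    # Corpus-wide term counts under the given tokeniser.
    counts = Counter()
    for doc in docs:
        counts.update(tokeniser(doc))
    return counts

plain = vocabulary(tokenise_plain)
stemmed = vocabulary(tokenise_stemmed)

# Terms that exist under one preprocessing choice but not the other.
only_plain = sorted(set(plain) - set(stemmed))
only_stemmed = sorted(set(stemmed) - set(plain))

print("topic count, plain vs stemmed:", plain["topic"], stemmed["topic"])
print("only in plain vocabulary:", only_plain)
print("only in stemmed vocabulary:", only_stemmed)
```

Even on two sentences, the stemmed version merges "topic" and "topics" into one term while leaving "model" and "modell" as separate terms, so the term-document matrix handed to LDA or NNMF differs in both size and weights.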

In the light of these issues, are topic models simply a toy for NLP/ML researchers to show what can be done with data, or are they really useful in practice? Do any websites/products make use of them?
