Hi everyone,

I need to run a topic analysis on several corpora of documents, and I need one standard procedure that can be applied to each corpus independently.

These are the characteristics of the corpora:

  • the number of documents in each corpus will rarely exceed 500 and is usually around 50;
  • documents are generally very short (usually 20 to 200 words);
  • each corpus is independent: corpora will never be merged, and every analysis is performed within a single corpus;
  • the language of documents will be homogeneous within each corpus, but it may vary between corpora;
  • the number of topics is unknown a priori, and topics will be different in every corpus.

 Specifically, I’m looking for a procedure that:

  • automatically detects the best number of recurrent topics in each corpus, while also taking into account that some documents may have “peculiar” topics that are not represented in any other document. These are not of interest and may be seen as a kind of “residual”. If the model identifies these peculiar, single-document topics as additional topics, that is fine too;
  • gives, for every document, a % for each identified recurrent topic, plus a % that is “residual” with respect to them. Alternatively, the single-document topics must also be identified and scored in each document.
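To make the desired output concrete, here is a minimal sketch of the second option: take per-document topic shares from any model and fold the topics that are supported by fewer than two documents into a single “residual” score. The function name, the `min_docs` and `presence` thresholds, and the toy numbers are all made up for illustration; they are not from any library.

```python
def split_recurrent_and_residual(doc_topic, min_docs=2, presence=0.2):
    """Split topic shares into recurrent-topic scores plus one residual score.

    doc_topic: list of rows, one per document; each row holds topic shares
    that sum to 1. The thresholds are hypothetical defaults.
    """
    n_topics = len(doc_topic[0])
    # Count, for each topic, how many documents give it a noticeable share.
    support = [sum(1 for row in doc_topic if row[k] >= presence)
               for k in range(n_topics)]
    # A topic is "recurrent" if enough documents support it.
    recurrent = [k for k in range(n_topics) if support[k] >= min_docs]
    result = []
    for row in doc_topic:
        scores = {k: row[k] for k in recurrent}
        # Everything not explained by recurrent topics becomes the residual.
        residual = max(1.0 - sum(scores.values()), 0.0)
        result.append((scores, residual))
    return recurrent, result


# Toy shares (invented): topic 2 only matters in the last document.
doc_topic = [
    [0.70, 0.30, 0.00],
    [0.60, 0.40, 0.00],
    [0.10, 0.80, 0.10],
    [0.05, 0.05, 0.90],
]
recurrent, result = split_recurrent_and_residual(doc_topic)
for scores, residual in result:
    print(scores, round(residual, 2))
```

With these toy numbers, topics 0 and 1 are kept as recurrent and the last document ends up with most of its weight in the residual, which is the kind of output I am after.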

If I understand LDA models correctly, they do not allow for this “residual” part: the topic scores of each document always sum to 1. Moreover, they are not good at identifying single-document topics, and the result for these “outcast” documents tends to be a roughly uniform score across all topics, even though none of them is truly present in the document.
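To show what I mean by the sum-to-one constraint: as far as I understand, a common smoothed estimate of a document's topic shares in LDA is (n_k + alpha) / (N + K * alpha), where n_k is the number of words in the document assigned to topic k, N is the document length, K is the number of topics and alpha is the symmetric Dirichlet prior. The sketch below only evaluates that formula; the word counts and the value of alpha are invented, and no model is fitted.

```python
def doc_topic_shares(topic_counts, alpha=0.1):
    """Smoothed topic shares for one document from per-topic word counts.

    alpha is the symmetric Dirichlet prior; 0.1 is an arbitrary choice.
    """
    k = len(topic_counts)
    total = sum(topic_counts) + k * alpha
    return [(n + alpha) / total for n in topic_counts]


# Invented counts for two 20-word documents and K = 3 topics.
focused = doc_topic_shares([18, 1, 1])  # words mostly fit one topic
outcast = doc_topic_shares([7, 7, 6])   # words fit no topic well, so they spread out

print(focused, sum(focused))
print(outcast, sum(outcast))
```

In both cases the shares add up to 1 by construction, so the “outcast” document has no way to say “none of the above”; its weight is just spread almost evenly over the existing topics.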

Are there other topic models that fit my task better, or have I misunderstood LDA?

Thank you very much!

Massimiliano
