I am having a hard time finding references on best practices in topic modeling. I have a corpus of 200 medium-length documents (2000 words each). Is that enough to perform a topic model analysis with Latent Dirichlet Allocation?
That's a good question. I have seen various reports on this, but I don't think there is a magic number. The general rule of thumb is the more the better (my background is LSA, not LDA). Your documents sound like they are a decent size, and you may also want to consider segmenting them into smaller documents. It would be an interesting experiment; I have seen research that has taken approaches like this and come up with interesting results.
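To make the segmenting idea concrete, here is a minimal sketch in plain Python. The function name `segment_document` and the chunk size of 500 words are my own assumptions for illustration, not something taken from the research mentioned above.

```python
def segment_document(text, chunk_size=500):
    """Split a document into consecutive chunks of at most `chunk_size` words.

    `chunk_size` is an assumed value; tune it for your corpus.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Hypothetical 2000-word document standing in for one item of the corpus.
doc = " ".join(f"w{i}" for i in range(2000))
chunks = segment_document(doc, chunk_size=500)
print(len(chunks))
```

With 200 documents of 2000 words each, this chunk size would give 800 pseudo-documents. Whether that actually helps the model is exactly the experiment worth running; splitting on whitespace ignores paragraph and sentence boundaries, which may matter.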
I am currently evaluating different extensions of LDA (the knowledge-based ones in particular). I have two datasets: one with hundreds of documents and the other with thousands.
The one with thousands of documents produces more reliable values according to human evaluations. When comparing different models (through topic coherence), it tells them apart to a good degree, whereas the models tend to converge when the documents number in the hundreds.
It doesn't take long with baseline LDA either. With the knowledge-based extensions, though, performance is a concern.
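For readers who want to try a coherence comparison like the one described above, here is a rough sketch. The post does not say which coherence measure was used, so I am assuming the UMass variant (document co-occurrence counts with +1 smoothing); the function name and the toy corpus are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic (an assumed choice of measure).

    top_words: the topic's top words, most probable first.
    documents: list of tokenized documents (lists of strings).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word absent from the corpus; skip the pair
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / denom)
    return score


# Toy corpus, purely illustrative.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "trade"],
]
coherent = umass_coherence(["cat", "dog"], docs)
mixed = umass_coherence(["cat", "stock"], docs)
print(coherent > mixed)
```

Higher (less negative) scores mean the top words co-occur in more documents. With only a few hundred documents the co-occurrence counts are small, which may be one reason scores for different models end up close together.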
The number of documents required for training is also related to the number of topics your LDA model is going to learn. Intuitively, if you have 200 topics and only 1 or 2 samples representing each topic, it would be hard for LDA to learn the distribution across topics. If you look at research that builds simulated data to test Bayesian models, it clearly shows that the number of documents per topic is important.
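A small simulation can illustrate the documents-per-topic point. This is only a sketch under my own assumptions: document-topic proportions are drawn from a symmetric Dirichlet prior (as in the LDA generative model) with an assumed `alpha` of 0.1, and a topic "owns" a document when it has the largest proportion in it. The function names are hypothetical and this is not the setup of any particular paper.

```python
import random
from collections import Counter


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension `size`."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_dominant_topics(n_docs, n_topics, alpha=0.1, seed=0):
    """Count, for each topic, how many simulated documents it dominates."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_docs):
        theta = sample_dirichlet(alpha, n_topics, rng)
        dominant = max(range(n_topics), key=theta.__getitem__)
        counts[dominant] += 1
    return counts


def topics_with_few_docs(counts, n_topics, threshold=2):
    """Number of topics that dominate at most `threshold` documents."""
    return sum(1 for k in range(n_topics) if counts.get(k, 0) <= threshold)


for n_topics in (10, 200):
    counts = simulate_dominant_topics(n_docs=200, n_topics=n_topics)
    sparse = topics_with_few_docs(counts, n_topics)
    print(n_topics, 200 / n_topics, sparse)
```

With 200 documents and 10 topics there are on average 20 documents per topic; with 200 topics the average drops to 1, so most topics are represented by two documents or fewer, which is the situation described above.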
I am also working with an LDA model on 3400 quotations, treating each quotation as a document. I find it very difficult to differentiate topics with a small corpus and short documents. I am looking forward to a better solution.
I read this literature because I was also curious about the best number of topics K to use in an LDA model. I have not applied their method, but it is close to what I have been trying.
Article A heuristic approach to determine an appropriate number of t...
Sample Size for Latent Dirichlet Allocation of Constructed-Response Items (page 263 in Quantitative Psychology -- Marie Wiberg, Dylan Molenaar, Jorge González, Ulf Böckenholt, Jee-Seon Kim) -- has a good table with cutoff values.