Dear all,

As part of my research, I am currently working on a topic modelling approach to detect the different topics that occur in various public spheres with regard to certain traded commodities.

I have collected a corpus consisting of a relatively large newspaper archive (~25,000 documents), but I have also scraped content from other sources (academia, press releases, etc.). The smallest sub-corpus consists of just over 500 documents (political speeches). All documents are already pre-filtered by search criteria, so there will not be a large set of divergent topics.

My problem is this: if I calculate a single LDA model for the entire corpus, certain niche topics that are only present in the smaller sub-corpora might not get "detected".
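
For reference, this is roughly how I set up the full-corpus model at the moment (a minimal sketch with gensim; the tokenised documents and the parameter values are placeholders, not my actual pipeline):

```python
from gensim import corpora
from gensim.models import LdaModel

# `tokenised_docs` stands in for my pre-processed corpus: one list of tokens per document
tokenised_docs = [
    ["soy", "export", "tariff", "harvest"],
    ["palm", "oil", "plantation", "policy"],
    # ... ~25,000 more documents
]

# build the vocabulary and bag-of-words representation
dictionary = corpora.Dictionary(tokenised_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]

# fit a single LDA model on the whole corpus
lda = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=30,     # just a guess, still needs tuning
    passes=10,
    random_state=42,
)

# inspect a few topics
for topic_id, words in lda.show_topics(num_topics=5, num_words=8, formatted=False):
    print(topic_id, [word for word, _ in words])
```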

I have thought of a number of different approaches to solve this, but I am still not sure how to go about it.

The options I have considered so far:

1. Use an STM (structural topic model) instead and include the source type as explanatory metadata. Would this actually help? I would also have to run the model in R, which is slower than Python's topic modelling libraries.

2. Calculate separate models on each sub-corpus and merge them according to distance/similarity measures (see the sketch below). This seems difficult with a sub-corpus of only ~500 documents.

3. Calculate separate models, each time combining the smaller sub-corpora with different samples of the larger ones, and then merge these. This seems like it would give too much weight to the smaller corpora.
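
For option 2, this is roughly what I have in mind for the similarity-based matching (just a sketch; it assumes both gensim models were trained with the same dictionary, so the topic-word vectors share a vocabulary, and the threshold value is arbitrary):

```python
import numpy as np
from gensim.models import LdaModel

def match_topics(lda_a: LdaModel, lda_b: LdaModel, threshold: float = 0.3):
    """Pair each topic in model A with its most similar topic in model B,
    using cosine similarity of the topic-word distributions."""
    topics_a = lda_a.get_topics()  # shape: (num_topics_a, vocab_size)
    topics_b = lda_b.get_topics()  # shape: (num_topics_b, vocab_size)

    # normalise rows to unit length so the dot product is cosine similarity
    a = topics_a / np.linalg.norm(topics_a, axis=1, keepdims=True)
    b = topics_b / np.linalg.norm(topics_b, axis=1, keepdims=True)
    sim = a @ b.T  # (num_topics_a, num_topics_b) similarity matrix

    pairs = []
    for i in range(sim.shape[0]):
        j = int(sim[i].argmax())
        if sim[i, j] >= threshold:
            pairs.append((i, j, float(sim[i, j])))       # matched topic pair
        else:
            pairs.append((i, None, float(sim[i, j])))     # no counterpart found
    return pairs
```

Topics in one model with no counterpart above the threshold would then be my candidates for the "niche" topics specific to that source.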

Thanks for your help! I am not a computer science person, so please take it easy ;)

Best,

Finn
