Dear all,

As part of my research, I am currently working on a topic modelling approach to detect the different topics that occur in various public spheres with regard to certain traded commodities.

I have collected a corpus consisting of a relatively large newspaper archive (~25,000 documents), but I have also scraped content from other sources (academia, press releases, etc.). The smallest sub-corpus consists of just over 500 documents (political speeches). All documents are already pre-filtered by search criteria, so there will not be a large set of divergent topics.

My problem is this: if I calculate a single LDA model for the entire corpus, certain niche topics that are only present in the smaller sub-corpora might not get "detected".
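
For reference, this is roughly how I set up the full-corpus model at the moment (a minimal sketch with gensim; the tokenised documents and the parameter values are placeholders, not my actual pipeline):

```python
from gensim import corpora
from gensim.models import LdaModel

# `tokenised_docs` stands in for my pre-processed corpus: one list of tokens per document
tokenised_docs = [
    ["soy", "export", "tariff", "harvest"],
    ["palm", "oil", "plantation", "policy"],
    # ... ~25,000 more documents
]

# build the vocabulary and bag-of-words representation
dictionary = corpora.Dictionary(tokenised_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]

# fit a single LDA model on the whole corpus
lda = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=30,     # just a guess, still needs tuning
    passes=10,
    random_state=42,
)

# inspect a few topics
for topic_id, words in lda.show_topics(num_topics=5, num_words=8, formatted=False):
    print(topic_id, [word for word, _ in words])
```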

I have thought of a number of different approaches to solve this, but I am still not sure how to go about it.

The options I have considered so far:

1. Use an STM (structural topic model) instead and include the source type as explanatory metadata. Would this actually help? I would also have to run the model in R, which is slower than Python's topic modelling libraries.

2. Calculate separate models on each sub-corpus and merge them according to distance/similarity measures (see the sketch below). This seems difficult with a sub-corpus of only ~500 documents.

3. Calculate separate models, each time combining the smaller sub-corpora with different samples of the larger ones, and then merge these. This seems like it would give too much weight to the smaller corpora.
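
For option 2, this is roughly what I have in mind for the similarity-based matching (just a sketch; it assumes both gensim models were trained with the same dictionary, so the topic-word vectors share a vocabulary, and the threshold value is arbitrary):

```python
import numpy as np
from gensim.models import LdaModel

def match_topics(lda_a: LdaModel, lda_b: LdaModel, threshold: float = 0.3):
    """Pair each topic in model A with its most similar topic in model B,
    using cosine similarity of the topic-word distributions."""
    topics_a = lda_a.get_topics()  # shape: (num_topics_a, vocab_size)
    topics_b = lda_b.get_topics()  # shape: (num_topics_b, vocab_size)

    # normalise rows to unit length so the dot product is cosine similarity
    a = topics_a / np.linalg.norm(topics_a, axis=1, keepdims=True)
    b = topics_b / np.linalg.norm(topics_b, axis=1, keepdims=True)
    sim = a @ b.T  # (num_topics_a, num_topics_b) similarity matrix

    pairs = []
    for i in range(sim.shape[0]):
        j = int(sim[i].argmax())
        if sim[i, j] >= threshold:
            pairs.append((i, j, float(sim[i, j])))       # matched topic pair
        else:
            pairs.append((i, None, float(sim[i, j])))     # no counterpart found
    return pairs
```

Topics in one model with no counterpart above the threshold would then be my candidates for the "niche" topics specific to that source.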

Thanks for your help! I am not a computer science person, so please take it easy ;)

Best,

Finn
