If I have a set of documents from a certain domain (like sports) and I would like to mine words representing this domain, what techniques can I use for this purpose? Thanks.
Calculate the log-likelihood ratio (G², see Dunning, T. 1993. "Accurate methods for the statistics of surprise and coincidence." Computational Linguistics 19(1): 61–74) for every word (or term candidate) of your domain, compared to a set of texts that is domain-independent, e.g. news texts. The candidates with the highest values are most likely to be terms of the domain.
Cf. Pedersen et al. 1996. "Significant Lexical Relationships." Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96).
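For illustration, here is a minimal Python sketch of this kind of G² ranking, assuming the two corpora are available as plain token lists (the helper names and the over-representation filter are my own additions, not from the paper):

```python
import math
from collections import Counter

def g2(a, b, c, d):
    """Dunning's log-likelihood ratio (G2) for a 2x2 contingency table:
    a = occurrences of the word in the domain corpus,
    b = occurrences in the reference corpus,
    c = all other tokens in the domain corpus,
    d = all other tokens in the reference corpus."""
    def ll(k, n, p):
        # binomial log-likelihood; 0 * log(0) is treated as 0
        s = 0.0
        if k > 0:
            s += k * math.log(p)
        if n - k > 0:
            s += (n - k) * math.log(1 - p)
        return s
    p = (a + b) / (a + b + c + d)   # pooled probability of the word
    p1 = a / (a + c)                # probability in the domain corpus
    p2 = b / (b + d)                # probability in the reference corpus
    return 2 * (ll(a, a + c, p1) + ll(b, b + d, p2)
                - ll(a, a + c, p) - ll(b, b + d, p))

def rank_domain_terms(domain_tokens, reference_tokens, top_n=20):
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    n_dom, n_ref = sum(dom.values()), sum(ref.values())
    scored = []
    for word, a in dom.items():
        b = ref.get(word, 0)
        # keep only words over-represented in the domain corpus
        if a / n_dom > b / n_ref:
            scored.append((word, g2(a, b, n_dom - a, n_ref - b)))
    return sorted(scored, key=lambda t: -t[1])[:top_n]
```

Under the one-degree-of-freedom χ² approximation, a G² score above about 10.83 corresponds to p < 0.001, though see the discussion of rare events below for when that approximation becomes unreliable.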
I definitely agree with Bastian Entrup: you will need a large (or at least comparably large) set of domain-independent documents.
Generally, know your data - what exactly your domain is, what kind of language the authors use, how many sources your data comes from and how distinct they are from each other, etc. - and then build your domain-independent dataset so that it is as close as possible to your in-domain data in all aspects *except* for the domain you're interested in. So, if you have sports tweets, your domain-independent documents should also be tweets (but not about sports); if you have news articles, your domain-independent documents should also be news articles (ideally from the same set of sources), etc.
Then, you can use almost any kind of significance testing; it should probably be one that can deal with rare events, since some words that are good indicators of a domain may nevertheless be very infrequent. The article Bastian referred to (accessible at CiteSeer: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.2725) makes a good point. Another interesting article that deals with this problem is by Robert Moore: On Log-Likelihood-Ratios and the Significance of Rare Events (http://research.microsoft.com/pubs/68957/rare-events-final-rev.pdf).
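If rare events are a concern, Fisher's exact test is one significance test that remains valid for very small counts; here is a quick sketch with SciPy (the counts and corpus sizes below are made up for illustration):

```python
from scipy.stats import fisher_exact

def domain_association(a, b, n_domain, n_reference):
    """One-sided Fisher's exact test on a 2x2 table:
    is the word over-represented in the domain corpus?
    a = count in domain, b = count in reference."""
    table = [[a, b], [n_domain - a, n_reference - b]]
    _, p_value = fisher_exact(table, alternative="greater")
    return p_value  # smaller p = stronger evidence of domain association

# e.g. a word seen 5 times in a 10,000-token domain corpus
# and never in a 100,000-token reference corpus
print(domain_association(5, 0, 10_000, 100_000))
```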
The articles mentioned so far have focused on collocation extraction, but for your purposes you could also look into keyword extraction methods (a rough sketch follows below). There is also a lot of work on domain adaptation in the speech recognition community; what they call instance-based domain adaptation may help you.
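As a simple starting point for keyword extraction, here is a TF-IDF-based sketch with scikit-learn; the document lists are toy placeholders, and ranking terms by their mean TF-IDF weight over the domain documents is just one crude heuristic among many:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy placeholder documents; replace with your real corpora
domain_docs = [
    "the striker scored a late goal in the match",
    "the team won the league after a penalty shootout",
]
background_docs = [
    "parliament voted on the new budget",
    "the central bank raised interest rates again",
]

# fit the IDF on the combined collection so that terms which are common
# in the domain but rare in the background get high weights
vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True)
tfidf = vectorizer.fit_transform(domain_docs + background_docs)

# rank terms by their mean TF-IDF weight over the domain documents only
domain_mean = np.asarray(tfidf[: len(domain_docs)].mean(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
print(sorted(zip(terms, domain_mean), key=lambda t: -t[1])[:10])
```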
You could build a simple pivot table in MS Excel to count the occurrences of each term. You could also split the period you are studying into several parts, build the same pivot table for each part, and compare them. After that, you could apply K-means clustering.
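The same workflow can be reproduced outside Excel; here is a small sketch with pandas and scikit-learn, where the terms and periods are invented for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans

# toy data: one row per term occurrence, tagged with its time period
df = pd.DataFrame({
    "term":   ["goal", "goal", "match", "budget", "goal", "budget"],
    "period": ["Q1",   "Q2",   "Q1",    "Q1",     "Q2",   "Q2"],
})

# the pandas equivalent of the Excel pivot table: term-by-period counts
pivot = df.pivot_table(index="term", columns="period",
                       aggfunc="size", fill_value=0)
print(pivot)

# cluster terms by their frequency profile across periods
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pivot.values)
print(dict(zip(pivot.index, km.labels_)))
```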
Good question. If you only have sports documents, you can find such words easily by counting frequencies and removing stop-words. However, I think this will give many false positives, so it is better to work with a massive collection of documents.
Maybe you can use some latent topic models if you have documents on diverse topics; you can also consider LSA and PLSA. Give it a try!
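As one concrete option, here is a short scikit-learn sketch; note that scikit-learn does not ship PLSA, so this uses LDA, a closely related latent topic model, and the documents are toy examples:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # toy documents on two rough topics
    "the striker scored a late goal in the match",
    "the central bank raised interest rates again",
    "the team won the league after a penalty shootout",
    "parliament debated the proposed tax reform",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# for an LSA baseline, sklearn.decomposition.TruncatedSVD works similarly
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
```

Topics whose top words are dominated by your domain's vocabulary then point you to candidate domain terms.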