If I have a set of documents from a certain domain (like sports) and I would like to mine words representing this domain, what techniques can I use for this purpose? Thanks.
Calculate the log-likelihood ratio (G², see Dunning, T. 1993. "Accurate methods for the statistics of surprise and coincidence." Computational Linguistics 19(1): 61–74) for every word (or term candidate) of your domain, compared to a set of texts that is domain-independent, e.g. news texts. The candidates with the highest values are most likely to be terms of the domain.
Cf. Pedersen et al. 1996. "Significant Lexical Relationships." Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96).
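For illustration, here is a minimal Python sketch of this kind of G² ranking, assuming the two corpora are available as plain token lists (the helper names and the over-representation filter are my own additions, not from the paper):

```python
import math
from collections import Counter

def g2(a, b, c, d):
    """Dunning's log-likelihood ratio (G2) for a 2x2 contingency table:
    a = occurrences of the word in the domain corpus,
    b = occurrences in the reference corpus,
    c = all other tokens in the domain corpus,
    d = all other tokens in the reference corpus."""
    def ll(k, n, p):
        # binomial log-likelihood; 0 * log(0) is treated as 0
        s = 0.0
        if k > 0:
            s += k * math.log(p)
        if n - k > 0:
            s += (n - k) * math.log(1 - p)
        return s
    p = (a + b) / (a + b + c + d)   # pooled probability of the word
    p1 = a / (a + c)                # probability in the domain corpus
    p2 = b / (b + d)                # probability in the reference corpus
    return 2 * (ll(a, a + c, p1) + ll(b, b + d, p2)
                - ll(a, a + c, p) - ll(b, b + d, p))

def rank_domain_terms(domain_tokens, reference_tokens, top_n=20):
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    n_dom, n_ref = sum(dom.values()), sum(ref.values())
    scored = []
    for word, a in dom.items():
        b = ref.get(word, 0)
        # keep only words over-represented in the domain corpus
        if a / n_dom > b / n_ref:
            scored.append((word, g2(a, b, n_dom - a, n_ref - b)))
    return sorted(scored, key=lambda t: -t[1])[:top_n]
```

Under the one-degree-of-freedom χ² approximation, a G² score above about 10.83 corresponds to p < 0.001, though see the discussion of rare events below for when that approximation becomes unreliable.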
I definitely agree with Bastian Entrup: you will need a large (or at least comparably large) set of domain-independent documents.
Generally, know your data - what exactly your domain is, what kind of language the authors use, how many sources your data comes from and how distinct they are from each other, etc. - and then build your domain-independent dataset so that it is as close as possible to your in-domain data in all aspects *except* for the domain you're interested in. So, if you have sports tweets, your domain-independent documents should also be tweets (but not about sports); if you have news articles, your domain-independent documents should also be news articles (ideally from the same set of sources), etc.
Then, you can use almost any kind of significance testing; it should probably be one that can deal with rare events, since some words that are good indicators of a domain may nevertheless be very infrequent. The article Bastian referred to (accessible at CiteSeer: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.105.2725) makes a good point. Another interesting article that deals with this problem is by Robert Moore: On Log-Likelihood-Ratios and the Significance of Rare Events (http://research.microsoft.com/pubs/68957/rare-events-final-rev.pdf).
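If rare events are a concern, Fisher's exact test is one significance test that remains valid for very small counts; here is a quick sketch with SciPy (the counts and corpus sizes below are made up for illustration):

```python
from scipy.stats import fisher_exact

def domain_association(a, b, n_domain, n_reference):
    """One-sided Fisher's exact test on a 2x2 table:
    is the word over-represented in the domain corpus?
    a = count in domain, b = count in reference."""
    table = [[a, b], [n_domain - a, n_reference - b]]
    _, p_value = fisher_exact(table, alternative="greater")
    return p_value  # smaller p = stronger evidence of domain association

# e.g. a word seen 5 times in a 10,000-token domain corpus
# and never in a 100,000-token reference corpus
print(domain_association(5, 0, 10_000, 100_000))
```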
The articles mentioned so far have focused on collocation extraction, but for your purposes you could also look into keyword extraction methods (a rough sketch follows below). There is also a lot of work on domain adaptation in the speech recognition community; what they call instance-based domain adaptation may help you.
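As a simple starting point for keyword extraction, here is a TF-IDF-based sketch with scikit-learn; the document lists are toy placeholders, and ranking terms by their mean TF-IDF weight over the domain documents is just one crude heuristic among many:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy placeholder documents; replace with your real corpora
domain_docs = [
    "the striker scored a late goal in the match",
    "the team won the league after a penalty shootout",
]
background_docs = [
    "parliament voted on the new budget",
    "the central bank raised interest rates again",
]

# fit the IDF on the combined collection so that terms which are common
# in the domain but rare in the background get high weights
vectorizer = TfidfVectorizer(stop_words="english", sublinear_tf=True)
tfidf = vectorizer.fit_transform(domain_docs + background_docs)

# rank terms by their mean TF-IDF weight over the domain documents only
domain_mean = np.asarray(tfidf[: len(domain_docs)].mean(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
print(sorted(zip(terms, domain_mean), key=lambda t: -t[1])[:10])
```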
You could build a simple pivot table in MS Excel to count the occurrences of each term. You could also split the period you are studying into several parts, build the same pivot table for each part, and compare them. After that, you could apply K-means clustering.
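The same workflow can be reproduced outside Excel; here is a small sketch with pandas and scikit-learn, where the terms and periods are invented for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans

# toy data: one row per term occurrence, tagged with its time period
df = pd.DataFrame({
    "term":   ["goal", "goal", "match", "budget", "goal", "budget"],
    "period": ["Q1",   "Q2",   "Q1",    "Q1",     "Q2",   "Q2"],
})

# the pandas equivalent of the Excel pivot table: term-by-period counts
pivot = df.pivot_table(index="term", columns="period",
                       aggfunc="size", fill_value=0)
print(pivot)

# cluster terms by their frequency profile across periods
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pivot.values)
print(dict(zip(pivot.index, km.labels_)))
```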
Good question. If you only have sports documents, you can find such words easily by counting frequencies and removing stop-words. However, I think this will give many false positives, so it is better to work with a massive collection of documents.
Maybe you can use some latent topic models if you have documents on diverse topics; you can also consider LSA and PLSA. Give it a try!
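As one concrete option, here is a short scikit-learn sketch; note that scikit-learn does not ship PLSA, so this uses LDA, a closely related latent topic model, and the documents are toy examples:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # toy documents on two rough topics
    "the striker scored a late goal in the match",
    "the central bank raised interest rates again",
    "the team won the league after a penalty shootout",
    "parliament debated the proposed tax reform",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# for an LSA baseline, sklearn.decomposition.TruncatedSVD works similarly
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")
```

Topics whose top words are dominated by your domain's vocabulary then point you to candidate domain terms.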