What is the best method for classifying short texts and microblogs into categories like sport, politics, ...?

I think you first have to distinguish between "genre type" and "topic type".

For instance, your examples "sport" and "politics" are "topic types": a list of such types lets you classify short or long texts, micro-blogs, etc. according to the content (the topics) of your corpus.

In contrast, "narrative", "description", "comment", "answer", "question", etc. are all "genre types". A list of those genre types lets you classify the texts in your corpus with respect to a specific genre.

By crossing both approaches (topic type and genre type) you will probably obtain a good classification of your corpus of texts. There are, however, several other criteria that can improve a basic topic/genre classification.

The W3C offers, in its Recommendation for an ontology of media resources (February 2012), a lot of "strategies" for how to deal with your problem (http://www.w3.org/TR/mediaont-10/).

Mustapha Bouakkaz

You can use k-means, or you can wait a few months until I publish my new and fast approach.

Suzannah Hastings

This guide claims to be a 'basic tool', so maybe it is more what you are looking for? http://miriamposner.com/blog/very-basic-strategies-for-interpreting-results-from-the-topic-modeling-tool/

Chris Biemann

For classifying politics vs. sports etc., use stopwords (the, a, an, in, of ...) and stopword sequences.

I know this sounds extremely simplistic, but we found these to be the strongest predictors for those newspaper categories, and you should definitely try this as a baseline.

For more sophisticated approaches, word embeddings (LSA, LDA, Deep Learning) might actually give you a good signal, but those should not be applied to stopwords, so you'd have something complementary.
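
As a rough illustration of the stopword baseline described above, here is a minimal sketch in plain Python. Several things in it are assumptions and not taken from the post: the tiny stopword list, the reading of "stopword sequences" as pairs of adjacent stopwords once content words are dropped, and the nearest-centroid classifier with cosine similarity. Any standard classifier could be used on the same features.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"the", "a", "an", "in", "of", "to", "and", "is", "was", "by",
             "on", "for", "he", "she", "they", "his", "we", "our"}

def stopword_features(text):
    """Count stopwords and pairs of adjacent stopwords left after dropping content words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    stops = [t for t in tokens if t in STOPWORDS]
    feats = Counter(stops)
    feats.update(f"{a}_{b}" for a, b in zip(stops, stops[1:]))
    return feats

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def train(labelled_texts):
    """Sum the feature counts of all training texts per category (one centroid each)."""
    centroids = {}
    for text, label in labelled_texts:
        centroids.setdefault(label, Counter()).update(stopword_features(text))
    return centroids

def classify(text, centroids):
    """Return the category whose centroid is most similar to the text's features."""
    feats = stopword_features(text)
    return max(centroids, key=lambda label: cosine(feats, centroids[label]))
```

To try it, call `train` on a list of `(text, category)` pairs and then `classify` on a new text. With very short texts the feature vectors are sparse, so the result depends heavily on the amount of training data.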

Mustapha Bouakkaz

How can we calculate the semantic similarity between terms?

Chris Biemann

Hi Mustapha, for semantic similarity from large text collections, try the JoBimText project at http://sourceforge.net/projects/jobimtext/
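
To give an idea of what similarity computed from a text collection can look like, here is a minimal sketch of a plain distributional approach: each term is described by the words that occur near it, and two terms are compared by the cosine of those context counts. This is not JoBimText's API or algorithm; the function names, the whitespace tokenisation, and the window size of 2 are all assumptions made for the illustration.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each term to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def similarity(term_a, term_b, vectors):
    """Similarity of two terms; unseen terms get an empty context and score 0."""
    return cosine(vectors.get(term_a, Counter()), vectors.get(term_b, Counter()))
```

On a realistic corpus the raw counts are usually reweighted (for example with PMI) before comparison; the sketch skips that step to stay short.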

Gopalakrishna Palem

"Topic modeling" methodologies such as LDA (latent dirichlet allocation) are good for this purpose.

A topic in that sense is considered as a probability distribution over a collection of words, and a topic model is a formal statistical relationship between a group of observed and latent (unknown) random variables that specifies a probabilistic procedure to generate the topics.
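
The "topic as a distribution over words" idea can be made concrete with a small sketch of the generative procedure that LDA assumes for a single document. The two toy topics, the `alpha` values, and the function names below are assumptions chosen for illustration. The code only generates text from given topics; fitting a model means running inference in the other direction, which a library would normally handle.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document: pick topic proportions, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # latent per-document topic proportions
    words = []
    for _ in range(length):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # latent topic of this word
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])  # observed word
    return theta, words

# Two hand-written toy topics, each a probability distribution over words (assumed values).
TOPICS = [
    {"goal": 0.5, "match": 0.3, "team": 0.2},
    {"vote": 0.5, "party": 0.3, "law": 0.2},
]
```

Calling `generate_document(TOPICS, [0.5, 0.5], 20, random.Random(0))` returns the sampled topic proportions and a list of 20 words drawn from the two topics in those proportions.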

Check this example: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

Thank you,

http://gopalakrishna.palem.in/

Mustapha Bouakkaz

I propose a new approach to aggregate keywords extracted from a corpus, but the problem is that I cannot find a large corpus to test my approach. What I need exactly is a document x term matrix: the rows are the documents, the columns are the terms (keywords), and each cell holds the frequency of that term in that document.
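
For reference, a matrix of this shape can be built from any collection of raw texts with a few lines of plain Python. This is only a sketch: the function name and the simple `\w+` tokenisation are assumptions, and no particular corpus is implied.

```python
import re
from collections import Counter

def document_term_matrix(documents):
    """Return (terms, matrix): matrix[i][j] is the frequency of terms[j] in documents[i]."""
    counts = [Counter(re.findall(r"\w+", doc.lower())) for doc in documents]
    terms = sorted(set().union(*counts))  # one column per distinct term
    matrix = [[c[t] for t in terms] for c in counts]  # one row per document
    return terms, matrix
```

For a large corpus a sparse representation is preferable, since most cells of a dense document x term matrix are zero.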

Hassan Monfared

Hi +Chris, as I read in most papers, stop words must be removed in the preprocessing steps (except some words like 'ok', 'not', ... in opinion mining).

How can we use them for classifying short texts (tweets, comments, short reviews, ...)?
