I think you first have to distinguish between "genre type" and "topic type".
For instance, your examples "sport" and "politics" are "topic types": a list of such types enables you to classify short or long texts, micro-blogs, etc. according to the content (the topics) in your corpus.
In contrast, "narrative", "description", "comment", "answer", "question", etc. are all "genre types". A list of those genre types enables you to classify the texts in your corpus with respect to a specific genre.
By crossing both approaches (topic type and genre type) you will probably obtain a good classification of your corpus of texts. However, there are several other criteria that can improve a basic topic/genre classification.
The W3C offers, in its recommendation for an ontology of media resources (February 2012), a lot of "strategies" for how to deal with your problem (http://www.w3.org/TR/mediaont-10/).
This guide claims to be a 'basic tool', so maybe it is closer to what you are looking for? http://miriamposner.com/blog/very-basic-strategies-for-interpreting-results-from-the-topic-modeling-tool/
For classifying politics vs. sports etc., use stopwords (the, a, an, in, of ...) and stopword sequences.
I know this sounds extremely simplistic, but we found these to be the strongest predictors for those newspaper categories, and you should definitely try this as a baseline.
For more sophisticated approaches, learned representations (LSA, LDA, deep-learning word embeddings) might actually give you a good signal, but those should not be applied to stopwords, so you'd have something complementary.
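To make the stopword baseline a bit more concrete, here is a minimal sketch in plain Python. It is only my reading of the idea, not the exact setup used above: the stopword list, the use of adjacent stopword pairs as "sequences", and the nearest-centroid classifier with cosine similarity are all assumptions chosen to keep the example short. The training texts are made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; use a fuller list in practice).
STOPWORDS = {"the", "a", "an", "in", "of", "to", "and", "is", "was",
             "for", "on", "by", "with", "he", "she", "it", "that", "at"}

def stopword_features(text):
    """Relative frequencies of stopwords and of adjacent stopword pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Replace every content word by a placeholder so only stopwords remain.
    seq = [t if t in STOPWORDS else "_" for t in tokens]
    counts = Counter(t for t in seq if t != "_")
    # "Stopword sequences" are approximated here by stopword bigrams.
    counts.update(f"{a} {b}" for a, b in zip(seq, seq[1:])
                  if a != "_" and b != "_")
    total = max(len(tokens), 1)
    return {k: v / total for k, v in counts.items()}

def centroid(vectors):
    """Average a list of sparse feature dictionaries."""
    acc = {}
    for v in vectors:
        for k, x in v.items():
            acc[k] = acc.get(k, 0.0) + x
    return {k: x / len(vectors) for k, x in acc.items()}

def cosine(u, v):
    """Cosine similarity between two sparse feature dictionaries."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train(labeled_texts):
    """Build one centroid of stopword features per category."""
    by_label = {}
    for text, label in labeled_texts:
        by_label.setdefault(label, []).append(stopword_features(text))
    return {label: centroid(vs) for label, vs in by_label.items()}

def predict(model, text):
    """Return the category whose centroid is most similar to the text."""
    feats = stopword_features(text)
    return max(model, key=lambda label: cosine(feats, model[label]))

# Hypothetical toy data, only to show the call pattern.
model = train([
    ("He scored in the last minute of the match.", "sports"),
    ("The bill was passed by the senate on Monday.", "politics"),
])
print(predict(model, "She was at the top of the table."))
```

With real data you would train on many labelled articles per category and compare this against a content-word baseline; the point of the sketch is only that all content words are thrown away before classification.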
I am proposing a new approach to aggregate keywords extracted from a corpus, but the problem is that I cannot find a large corpus to test my approach on. What I need exactly is a document x term matrix, with the documents in the rows, the terms (keywords) in the columns, and in each cell the frequency of that term in that document.
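For reference, the matrix described above can be built from any collection of raw texts. The following is a minimal sketch in plain Python; the tokenization (lowercased alphanumeric tokens) is an assumption standing in for whatever keyword extraction is actually used, and the two example documents are invented.

```python
import re
from collections import Counter

def document_term_matrix(documents):
    """Return (vocab, matrix): matrix[i][j] is the count of vocab[j] in documents[i]."""
    # Assumed tokenization: lowercase alphanumeric tokens.
    tokenized = [re.findall(r"[a-z0-9']+", d.lower()) for d in documents]
    # Columns are the sorted set of all terms seen in the corpus.
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocab, matrix

docs = ["the match ended in a draw", "the vote ended the debate"]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

For a large corpus a dense list of lists becomes wasteful, so a sparse representation (for example one dictionary of term counts per document) would be the more practical choice.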
Hi +Chris, from what I have read in most papers, stop words must be removed during preprocessing (except for some words like 'ok', 'not', ... in opinion mining).
How can we use them for classifying short texts (tweets, comments, short reviews, ...)?