There are many ways to implement text categorization. The simplest machine-learning approach is to build a bag-of-words representation of the documents and then apply k-Nearest Neighbors classification when class labels are available. Otherwise, one can perform unsupervised categorization using clustering techniques.
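As a minimal sketch of this pipeline using scikit-learn (the toy corpus, labels, and query sentence below are invented purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Tiny invented labeled corpus, just to show the mechanics
docs = ["the match ended in a draw",
        "the team won the final game",
        "stocks fell sharply on monday",
        "the market rallied after the report"]
labels = ["sports", "sports", "finance", "finance"]

# Bag-of-words representation: each document becomes a term-count vector
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# k-NN classification over the bag-of-words vectors (k=1 for this tiny corpus)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, labels)

pred = knn.predict(vectorizer.transform(["the game ended with a win"]))
print(pred[0])  # nearest training document decides the label
```

On a realistic corpus one would normally use a larger k, TF-IDF weighting instead of raw counts, and cosine distance, since Euclidean distance on raw counts is sensitive to document length.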
I think SVM is the best choice for this purpose. SVM is designed to handle high-dimensional data, and if you represent the text as a bag of words, the dimensionality of your data will be high.
I have had good experience with "Bipropagation" and the "Border pairs method". These are two learning methods for the MLP that perform much better than backpropagation. A description of both methods is on my RG page.
Our paper "WordICA - Emergence of linguistic representations for words by independent component analysis" in Natural Language Engineering (2010) describes an unsupervised method in considerable detail, including the preprocessing steps. Independent Component Analysis has proved to be an efficient method for extracting meaningful features automatically and appears to be superior to the widely used LSA method. For supervised learning, there are many good options, including SVM.
I agree that SVM is the best choice for text categorization. The algorithm has a bound on the generalization error (the error on test examples) that does not depend on the dimensionality of the input space, which is high for typical representations of textual documents. It also handles irrelevant terms and stop words well. The linear version of the algorithm is usually good enough, because documents tend to be linearly separable in such a high-dimensional representation.
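A minimal linear-SVM text classifier along these lines can be sketched with scikit-learn as follows (the toy corpus, labels, and query are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented labeled corpus, just to demonstrate the pipeline
docs = ["the striker scored a late goal",
        "the coach praised the goalkeeper after the match",
        "the referee booked two players",
        "the central bank raised interest rates",
        "quarterly profits beat analyst forecasts",
        "the stock index closed lower on friday"]
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]

# TF-IDF features plus a linear SVM: the standard strong baseline
# for high-dimensional, mostly linearly separable text data
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)

pred = clf.predict(["the goalkeeper saved a late penalty"])
print(pred[0])
```

TF-IDF weighting also downweights very frequent stop words automatically, which complements the SVM's robustness to irrelevant features.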