Please guide me about designing training and test sets for a text classification algorithm. I am a beginner with classification algorithms; please guide me in this regard.
First, you have to construct the dictionary, i.e. the set of all words in your corpus without repetition. Second, you have to construct the feature vectors (for both training and testing). As a first step, I recommend using a bag-of-words representation (binary: 1 if the word occurs in the document, 0 otherwise). Then you construct the classifier, for example using libsvm (available in many languages), and save the .model file. Finally, you import this file for testing. If you want to use Naive Bayes, I recommend WEKA.
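The steps above (dictionary, binary bag-of-words vectors, linear SVM) can be sketched in Python with scikit-learn; the toy corpus and labels here are made up for illustration, and any linear SVM implementation (libsvm, LIBLINEAR, etc.) would work the same way:

```python
# Binary bag-of-words + linear SVM, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stocks fell sharply",
    "markets rallied today",
]
train_labels = ["pets", "pets", "finance", "finance"]

# binary=True gives 1 if the word exists in the document, 0 otherwise;
# fit_transform also builds the dictionary (vocabulary) from the corpus.
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_docs)

clf = LinearSVC()
clf.fit(X_train, train_labels)

# Test documents must be vectorized with the SAME dictionary (transform,
# not fit_transform), otherwise the feature columns will not line up.
X_test = vectorizer.transform(["the dog sat down"])
print(clf.predict(X_test))
```

The key point for a beginner is the `fit_transform` / `transform` split: the dictionary is learned once from the training corpus and then reused as-is on the test documents.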
You can try an unsupervised approach for building the dictionary: start with known words and keep updating the dictionary's columns as new words appear. I advise you to give a clustering experiment a trial before classification; otherwise, you can classify by the angular proximity between documents, such as cosine similarity.
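Classifying by angular proximity can be sketched as a nearest-neighbour rule under cosine similarity; the term-count vectors and class names below are invented for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angular proximity between two term-count vectors:
    # the cosine of the angle between them, in [0, 1] for non-negative counts.
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

# Toy term-count vectors over a shared 3-word dictionary (made-up data).
labeled = {
    "sports":   np.array([3.0, 1.0, 0.0]),
    "politics": np.array([0.0, 1.0, 4.0]),
}
new_doc = np.array([2.0, 0.0, 1.0])

# Assign the class of the most similar labeled document.
predicted = max(labeled, key=lambda c: cosine_similarity(labeled[c], new_doc))
print(predicted)
```

Because cosine similarity normalises by vector length, a long and a short document about the same topic still end up close together, which is why it is a common choice for text.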
I assume you have the document repository, with a class assigned to each document.
You need to decide on the features of the documents that you think can discriminate between classes, and extract those features for all documents. So if you have N documents and you decide to extract f features, then your data set is an N × f matrix.
From this set, you can use 2/3 of the documents (rows) as the training set and 1/3 as the test set. This method is called hold-out testing; the 2/3 of documents are chosen randomly.
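A minimal sketch of that random 2/3 / 1/3 hold-out split (the `holdout_split` helper and the `docs` list are just illustrative; libraries like scikit-learn provide ready-made equivalents such as `train_test_split`):

```python
import random

def holdout_split(rows, train_fraction=2 / 3, seed=0):
    # Shuffle the rows with a fixed seed for reproducibility, then take
    # the first train_fraction of them as training and the rest as test.
    rows = list(rows)
    rng = random.Random(seed)
    rng.shuffle(rows)
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

docs = [f"doc{i}" for i in range(30)]
train, test = holdout_split(docs)
print(len(train), len(test))  # 20 10
```

Shuffling before splitting matters: if the repository is stored class by class, taking the first 2/3 of the rows without shuffling could leave an entire class out of the training set.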
I suggest you use Weka (free data-mining software), which saves you the bother of constructing explicit train/test sets. The options in Weka allow you to easily classify your data with the selected algorithm. Documentation is available.
Thanks, all of you, for helping me. I am new to this field; please guide me step by step so that I can understand how to select features if I take books as the entire domain.
The categorisation is a problem in its own right, I mean deciding how to build the initial categories; for books, librarians already have a good system.
Now, to go from fixed categories to classifying text with algorithms, the following paper gives a good overview of the full process, from preprocessing to classification. As mentioned, you can use Weka to do all of it (it is not the only possibility; if you work in Python, for example, that language has packages for NLP).