I am classifying large PDF documents on the basis of the different terms (features) they contain, and I want to apply a supervised learning algorithm for this purpose. Please guide me in this regard with your comments and feedback.
No classification algorithm has any inherent privilege over the others. You should try them all and, after comparing the results, pick the one that performs best.
Thanks to all, these answers helped me a lot. I have one more question: can anyone guide me on how to design the training and test sets? Should I build them manually, or will the system create them from the corpus? Please also guide me on the first steps I need to take to design my own classifier. Moreover, are there any built-in classifiers available in PHP?
I would look into the k-fold cross validation technique (Wikipedia has a good starting point to learn about the topic: https://en.wikipedia.org/wiki/Cross-validation_(statistics) ) to generate training and test sets from a corpus. This is a more robust way to evaluate your algorithm independently of training and test set selection.
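To make the idea concrete, here is a minimal sketch of how k-fold splits could be generated from a labelled corpus in plain Python. The function name `k_fold_splits`, the fixed seed, and the sample corpus of (filename, label) pairs are assumptions made for illustration only.

```python
import random

def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists, using each of the k folds once as the test set."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th item starting at offset i,
    # so fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield train, test

# Hypothetical corpus of (filename, label) pairs.
corpus = [(f"doc{n}.pdf", "science" if n % 2 else "history") for n in range(10)]
for train, test in k_fold_splits(corpus, k=5):
    print(len(train), "training documents,", len(test), "test documents")
```

You would train on each `train` list, score on the matching `test` list, and average the k scores, so that no single lucky or unlucky split decides the result.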
As far as I know, there are no good classification libraries for PHP. From a systems design standpoint, PHP is more of a web application language, and these types of algorithms usually run on the backend, with the results feeding back to the web framework. If you want to get started quickly, I'd look into the NLTK and PyML libraries for Python.
And in terms of your original question, Sayed is right that there is no inherent advantage in any particular classification algorithm, as they all have implicit assumptions that make use of properties of your vector model. The key is getting a vector model with the appropriate feature combinations that work well with a particular classifier. You might also want to consider ensemble learning methods (for example, boosting) that can combine the results of many simple classifiers into a single result.
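As a rough illustration of what a "vector model" means here, the sketch below turns each document into a vector of term counts over a shared vocabulary. The helper names (`tokenize`, `build_vocabulary`, `to_vector`) and the two sample sentences are made up for this example; a real pipeline would first extract the text from the PDFs.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(texts):
    """Return a sorted list of every distinct term in the corpus."""
    return sorted({term for text in texts for term in tokenize(text)})

def to_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

# Hypothetical sample documents.
texts = ["The cat sat on the mat", "The dog chased the cat"]
vocabulary = build_vocabulary(texts)
vectors = [to_vector(t, vocabulary) for t in texts]
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one term, so any classifier that accepts numeric feature vectors can be trained on the output.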
Jonathan Fishbein, can I implement these types of algorithms in PHP? I am designing a book recommender system, and there are many users involved in that system. Can I use Weka and other tools from PHP?
It is of course possible to write your algorithms in PHP, since it's ultimately a scripting language, but I would advise against it. PHP was designed for, and is best at, creating dynamic content for web applications. You're not going to find many libraries helping you along the way (Weka, for example, is Java based), and any that you do find will be loosely supported, incomplete and not very good.
From a system architecture perspective, most machine learning applications do the actual machine learning independently on the backend and then save the results for quick access by the users. For example, a Python background task can classify a large corpus of books and then save each book and its resulting category somewhere (file, database, etc.), and a PHP front-end system can then pick up those results and display them to the user.
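A minimal sketch of that split, assuming SQLite as the shared store: the `classify` function is only a placeholder for a trained model, and the table name `book_category` and the sample books are hypothetical.

```python
import sqlite3

def classify(text):
    """Hypothetical stand-in for a trained classifier."""
    return "cooking" if "recipe" in text.lower() else "other"

def store_categories(conn, books):
    """Classify each (title, text) pair and save the result for later lookup."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS book_category "
        "(title TEXT PRIMARY KEY, category TEXT)"
    )
    rows = [(title, classify(text)) for title, text in books]
    conn.executemany(
        "INSERT OR REPLACE INTO book_category (title, category) VALUES (?, ?)",
        rows,
    )
    conn.commit()

# An in-memory database keeps the sketch self-contained;
# a real background job would write to a file or a database server.
conn = sqlite3.connect(":memory:")
store_categories(conn, [
    ("Quick Meals", "A recipe for every day"),
    ("Star Maps", "Charts of the night sky"),
])
print(conn.execute("SELECT title, category FROM book_category").fetchall())
```

The PHP side then only needs to run a simple query against that table, so no machine learning code has to live in the web application.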
Naive Bayes (NB), SVM and kNN all do well. As in most ML applications, feature selection is key. In this case the bag-of-words approach is the most common, and then you have to make choices about how the words are cleaned of nonsense words, stemmed, counted (raw frequencies, TF-IDF), and possibly combined (n-grams). Please see the following articles for good reviews and ideas: