I am building a system that automatically determines the category of any website, given only its URL as input. This is a classic classification problem. I was wondering:

1. What input data should I use for classification? Which part of the website is most informative of its category: the home page content, the meta keywords, the domain name, or a mixture of these?

2. Which classification algorithm should I use so that the processing time for a new website is minimal? I am considering Bayesian filters (one for each category), but that is not the most computationally efficient approach, since it would evaluate each website once per category. Other options I am considering are neural networks, or maybe even an SVM.
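
To make the cost concern in point 2 concrete, here is a minimal sketch of one alternative to a separate filter per category: a single multinomial naive Bayes model that tokenizes the input once and then scores every category against the same token list. The class name `MultinomialNB`, the `tokenize` helper, the smoothing value, and the toy training strings (a mixture of domain name and keywords, as in point 1) are all assumptions made up for illustration, not real data or a recommended design.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class MultinomialNB:
    """Hypothetical single multiclass naive Bayes model (illustration only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.word_counts = defaultdict(Counter)  # category -> token counts
        self.doc_counts = Counter()              # category -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Tokenize once; tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        # Score every category against the same token list.
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch.
train_docs = [
    "espn.com sports scores football basketball league",
    "bbc.com news world politics headlines",
    "allrecipes.com recipes cooking dinner ingredients",
]
train_labels = ["sports", "news", "food"]

clf = MultinomialNB().fit(train_docs, train_labels)
print(clf.predict("football league scores today"))
print(clf.predict("easy dinner recipes"))
```

Prediction still costs one pass over the tokens per category, but fetching and tokenizing the page (likely the expensive part) happens only once.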

Any suggestions are welcome!
