I want to cluster webpages based on industry sectors. I have a dataset of 400 websites (electronics, chemical, hydraulics, and aerospace), 100 for each category. I am focused mainly on products.
Sonfack Serge, thank you so much for the deep insight. I want to apply content-based web page clustering, regardless of the semantics or meaning of the whole text. For example, company A and company C both produce hydraulics products, and company B produces electronics products, so my goal is to cluster the websites based on product or technology. As a result, A and C will share the same cluster. This is the idea.
1 - Cluster based on the frequency of words from each category's vocabulary (hydraulic product vocabulary, technology product vocabulary).
For example, sites A and C draw about 70% of their vocabulary from the hydraulic product vocabulary (you can use a threshold of 65%) and less from the technology product vocabulary.
You can also make use of the sites' keywords.
2 - If you have a database of labeled sites for each category, then you can use a bag-of-words (BOW) or term-frequency (TF) transformation and apply supervised learning.
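The vocabulary-frequency idea (the first suggestion above) could be sketched roughly as follows. This is only an illustration under assumptions: the word lists in `CATEGORY_VOCAB`, the helper names, and the reading of "70%" as "share of vocabulary hits that fall in one category" are all hypothetical, not something fixed by the answer.

```python
import re
from collections import Counter

# Hypothetical, hand-picked vocabularies; real ones would be far larger.
CATEGORY_VOCAB = {
    "hydraulics": {"hydraulic", "pump", "valve", "cylinder", "piston", "hose"},
    "electronics": {"circuit", "sensor", "semiconductor", "pcb", "capacitor", "resistor"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vocab_shares(text, vocabularies=CATEGORY_VOCAB):
    """Return, per category, its share of all category-vocabulary hits."""
    counts = Counter(tokenize(text))
    hits = {cat: sum(counts[w] for w in vocab) for cat, vocab in vocabularies.items()}
    total = sum(hits.values())
    if total == 0:
        return {cat: 0.0 for cat in vocabularies}
    return {cat: n / total for cat, n in hits.items()}


def assign_category(text, threshold=0.65, vocabularies=CATEGORY_VOCAB):
    """Pick the dominant category, or None if no share reaches the threshold."""
    shares = vocab_shares(text, vocabularies)
    best = max(shares, key=shares.get)
    return best if shares[best] >= threshold else None


# Made-up page text standing in for scraped site content.
site_a = "We build hydraulic pump and valve systems, plus a pressure sensor for each cylinder."
site_b = "Our PCB assembly line mounts every capacitor, resistor and sensor on the circuit."

print(assign_category(site_a), vocab_shares(site_a))
print(assign_category(site_b), vocab_shares(site_b))
```

Sites that land in the same category end up grouped together, and sites that return `None` are the ambiguous ones worth inspecting by hand or scoring again with their keywords.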
Yes, I have already prepared a labeled dataset as well, but I wanted to try unsupervised learning instead. However, I will try this approach as well. Thank you for your kind response.
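For anyone trying the supervised BOW route on a labeled set, here is a minimal sketch. The answer does not name a classifier, so multinomial Naive Bayes is swapped in here purely as an assumption; the class name `BowNaiveBayes` and the four training snippets are hypothetical stand-ins for the real labeled pages.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class BowNaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / n_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / total)     # log likelihood
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training data; the real input would be the labeled websites.
train_texts = [
    "hydraulic pumps valves and cylinders for heavy machinery",
    "high pressure hydraulic hoses and piston pumps",
    "printed circuit boards sensors and semiconductors",
    "capacitors resistors and circuit design services",
]
train_labels = ["hydraulics", "hydraulics", "electronics", "electronics"]

model = BowNaiveBayes().fit(train_texts, train_labels)
print(model.predict("we supply hydraulic cylinders and valves"))
print(model.predict("sensors and circuit boards"))
```

If scikit-learn is available, `CountVectorizer` or `TfidfVectorizer` combined with `MultinomialNB` would replace the hand-rolled parts, and holding out some of the labeled sites gives a quick accuracy check.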