7 Questions 11 Answers 0 Followers
Questions from Panei San
Hello everyone, I want to know which thresholding methods can be used in web page cleaning, by which I mean removing boilerplate and extracting the main content from a web page. Can you...
11 November 2014 2,201 1 View
Hello, could you please share how to count the stop words and tokens in a text? I would appreciate an explanation with examples. Thanks.
11 November 2014 7,304 4 View
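Since the question above asks for an example, here is a minimal sketch of counting tokens and stop words in plain Python. The stop-word set is a tiny illustrative list chosen for this example (real lists, such as the one shipped with NLTK, are much longer), and the function names `tokenize` and `count_tokens_and_stop_words` are hypothetical. The tokenizer is a simple regular expression, which is only one of several reasonable ways to split text.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "for", "with", "on"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_tokens_and_stop_words(text):
    """Return (total token count, stop-word count, per-stop-word Counter)."""
    tokens = tokenize(text)
    stop_counts = Counter(t for t in tokens if t in STOP_WORDS)
    return len(tokens), sum(stop_counts.values()), stop_counts


total, stops, by_word = count_tokens_and_stop_words(
    "The cat is on the mat and the dog is in the yard."
)
print(total, stops, by_word.most_common(3))
```

Whether punctuation, numbers, or contractions count as tokens depends on the tokenizer, so the totals can differ between tools.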
Hello everyone, I am looking for keywords for web page classification into categories such as news, sport, etc. I want these keywords for matching and for training the classifier. Could you help me with how to get...
11 November 2014 2,744 0 View
I would like to parse a webpage and extract meaningful content from it. By meaningful, I mean the content (text only) that the user wants to see in that particular page (data excluding ads,...
10 October 2014 8,890 13 View
I would like to know how to find a data set of HTML web pages. Can you help me?
09 September 2014 8,614 3 View
Hello everyone, I am interested in content extraction from HTML web pages. Currently I use HTML tags to divide a web page into blocks, and I use the tag-to-text ratio and anchor-text-to-text ratio...
09 September 2014 5,351 7 View
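The two ratios mentioned in the question above can be sketched without a DOM parser, using regular expressions on each block. This is only an illustration under stated assumptions: the question is truncated, so the exact definitions are not known. Here the first ratio is computed as text characters per tag (the inverse of a tag-to-text ratio, which avoids dividing by zero for blocks with no text), the second as the share of text inside `<a>` elements, and the name `block_ratios` is hypothetical.

```python
import re

# Matches any HTML tag, e.g. <p>, </a>, <a href="/x">.
TAG_RE = re.compile(r"<[^>]+>")
# Captures the content of anchor elements (non-nested anchors assumed).
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def block_ratios(block):
    """Return (text_chars_per_tag, anchor_text_share) for one HTML block."""
    tag_count = len(TAG_RE.findall(block))
    text_len = len(TAG_RE.sub("", block).strip())
    anchor_len = sum(
        len(TAG_RE.sub("", inner).strip()) for inner in ANCHOR_RE.findall(block)
    )
    text_per_tag = text_len / max(tag_count, 1)  # treat tag-free blocks as having one tag
    anchor_share = anchor_len / text_len if text_len else 0.0
    return text_per_tag, anchor_share


print(block_ratios("<p>Hello world</p>"))
print(block_ratios('<li><a href="/x">Home</a></li>'))
```

Under these definitions, blocks with much text per tag and a low anchor share look like main content, while link-heavy blocks such as navigation menus score the other way. Any cut-off between the two would still have to be chosen separately.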
I currently use the CETR dataset, but most of its web pages are not well-formed HTML. I do not want to use JTidy, because my proposed research does not use the DOM. Therefore, I can't use JTidy...
09 September 2014 951 0 View