I would like to parse a web page and extract the meaningful content from it. By meaningful, I mean the content (text only) that the user actually wants to read on that particular page, i.e. everything except ads, banners, comments, etc. I want to ensure that when a user saves a page, the data he wanted to read is saved and nothing else.
In short, I need to build an application that works just like Readability ( http://www.readability.com ): take the useful content of a web page and store it in a separate file. I don't really know how to go about it.
I don't want to use APIs that require me to connect to the internet and fetch data from their servers, as the extraction needs to work offline.
There are two approaches I could think of:
1. Use a machine-learning-based algorithm
2. Develop a web scraper that can satisfactorily remove all the clutter from a page
Is there an existing tool that does this? I came across the boilerpipe library ( http://code.google.com/p/boilerpipe/ ) but haven't used it yet. Has anybody used it? Does it give satisfactory results? Are there any other tools, particularly written in Java, that do this kind of content extraction?
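From skimming the boilerpipe docs, I believe usage would be roughly along these lines (written from the docs, not tested by me, and the file names are just placeholders):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import de.l3s.boilerpipe.extractors.ArticleExtractor;

    public class BoilerpipeDemo {
        public static void main(String[] args) throws Exception {
            // Read a locally saved page, so no network access is needed
            String html = new String(Files.readAllBytes(Paths.get("page.html")),
                    StandardCharsets.UTF_8);

            // ArticleExtractor is tuned for article-style pages; it strips
            // navigation, ads and other boilerplate and returns plain text
            String mainText = ArticleExtractor.INSTANCE.getText(html);

            // Store the extracted text in a separate file, as described above
            Files.write(Paths.get("page.txt"), mainText.getBytes(StandardCharsets.UTF_8));
        }
    }

If it really is that simple and the results are good, that would obviously be the easiest route.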
If I need to build my own tool instead, how would you suggest I go about it?
Since I'd need to clean up messy or incomplete HTML before parsing it, I'd use a tool like Tidy ( http://www.w3.org/People/Raggett/tidy/ ) or Beautiful Soup ( http://www.crummy.com/software/BeautifulSoup/bs4/doc/ ) for that step.
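For the cleanup step in Java, I assume I'd use JTidy (the Java port of Tidy) to turn messy markup into a well-formed DOM. Something like the sketch below is what I have in mind, though I haven't tried it and the flags are just my guess at sensible defaults:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.w3c.dom.Document;
    import org.w3c.tidy.Tidy;

    public class TidyCleanup {
        public static Document cleanHtml(String path) throws Exception {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);          // suppress progress output
            tidy.setShowWarnings(false);  // don't flood the console with markup warnings
            tidy.setXHTML(true);          // repair the page into well-formed XHTML

            try (InputStream in = new FileInputStream(path)) {
                // parseDOM repairs the markup and returns a standard org.w3c.dom.Document
                return tidy.parseDOM(in, null);
            }
        }
    }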
But after that step, I don't know how to extract the content itself.
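The furthest I've gotten is a crude idea: walk the cleaned-up DOM, keep blocks that contain a fair amount of text, and drop blocks whose text sits mostly inside links (those are usually menus or ads). A rough sketch of what I mean, looking only at <p> elements for simplicity, with thresholds that are pure guesswork on my part:

    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class ContentHeuristic {

        // Collect the text of a node and all its descendants (DOM Level 1 only,
        // so it should work with whatever parser produced the Document).
        private static String textOf(Node node) {
            if (node.getNodeType() == Node.TEXT_NODE) {
                return node.getNodeValue();
            }
            StringBuilder sb = new StringBuilder();
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                sb.append(textOf(children.item(i)));
            }
            return sb.toString();
        }

        // Keep paragraphs that carry a fair amount of text and whose text is
        // mostly NOT inside links; link-heavy blocks are usually navigation or ads.
        public static String extract(Document doc) {
            StringBuilder content = new StringBuilder();
            NodeList paragraphs = doc.getElementsByTagName("p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element p = (Element) paragraphs.item(i);
                String text = textOf(p).trim();
                if (text.length() < 80) {
                    continue; // too short to be body text (threshold is a guess)
                }
                int linkTextLength = 0;
                NodeList links = p.getElementsByTagName("a");
                for (int j = 0; j < links.getLength(); j++) {
                    linkTextLength += textOf(links.item(j)).length();
                }
                double linkDensity = (double) linkTextLength / text.length();
                if (linkDensity < 0.3) { // mostly plain text => probably real content
                    content.append(text).append("\n\n");
                }
            }
            return content.toString();
        }
    }

I have no idea whether something this naive holds up on real pages, which is why I'm asking.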
Thanks a lot!