Is there a specific use case you want to solve? Knowing that would make it easier to recommend something suitable.
For example, I know of YaCy, which includes a crawler and has a very open API for retrieving a lot of information about the crawled sites. It is actually a distributed search engine, though, so it may be overkill and may not even give you the data you are really interested in.
If you are after something more specific and have some programming experience, you can use a library such as the PHP Goutte project (a web scraper) or the more basic Guzzle (an HTTP client). Similar libraries should be available in other programming languages.
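To give a rough idea of what the do-it-yourself route looks like in another language, here is a minimal sketch in Python using only the standard library. It is not a port of Goutte or Guzzle. The names `LinkExtractor`, `extract_links` and `fetch_links` are made up for this illustration, and I am assuming that the data you want is the set of links on a page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href or with an empty one.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string, as absolute URLs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def fetch_links(url):
    """Download a page and return its links (hypothetical helper, needs network)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html, url)
```

Calling `fetch_links` on a start URL and then repeating it for each returned link is the core of a very simple crawler. A real one would also need to respect robots.txt, limit its request rate and remember which pages it has already visited.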