Hi,
does anyone know where I can find a dataset of web pages for evaluating information extraction? I need a set structured like this:
- domain_1 = { {web_page_1, {relevant entities}}, ..., {web_page_n, {relevant entities}} }
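To make the format concrete, here is a minimal Python sketch of the kind of structure I mean, together with a simple exact-match precision/recall/F1 score over it. The names (`reference`, `score`) and the movie values are made up for illustration only. A real dataset might store entities differently or need fuzzy matching.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout: domain -> page id -> set of (field, value) pairs.
Entity = Tuple[str, str]
Dataset = Dict[str, Dict[str, Set[Entity]]]

# Illustrative placeholder data, not taken from any real dataset.
reference: Dataset = {
    "movies": {
        "web_page_1": {("title", "Example Film"), ("actor", "Actor A"), ("actor", "Actor B")},
        "web_page_2": {("title", "Another Film"), ("actor", "Actor C")},
    }
}


def score(reference: Dataset, extracted: Dataset) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 using exact (field, value) matches."""
    tp = fp = fn = 0
    for domain, pages in reference.items():
        for page, gold in pages.items():
            # Pages the extractor skipped count as empty extractions.
            found = extracted.get(domain, {}).get(page, set())
            tp += len(gold & found)   # correctly extracted entities
            fp += len(found - gold)   # extracted but not in the reference
            fn += len(gold - found)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Calling `score(reference, extracted)` with the extractor's output in the same layout gives one number per metric, which would make results comparable across tools.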
I created a wrapper induction algorithm that works on a domain's web pages. It extracts the important entities from these pages (for example, for a movie domain it extracts information such as the film title, actor names, etc. from each page). I built my own reference dataset (I labeled 3 domains and 200 documents), but maybe there is a better reference dataset available?
Also, does anyone know of software I could compare my solution against (semi-supervised information extraction from web pages based on HTML structure)?