I developed an information retrieval system for my master's thesis, and I need to evaluate my term weighting function. The function takes the paragraphs of each document into account when performing its calculations. To evaluate it, I need a publicly available dataset with relevance judgments (query-document pairs) in which each document consists of multiple paragraphs, not a single paragraph as in the "Reuters" dataset, for example. The dataset should also provide ground truth for the document retrieval task, so that I can compare my system's results against it.
Can you suggest such a dataset?
The image shows a sample document whose structure meets my needs.