Your question is a bit unclear - do you mean datasets that you can use to benchmark your methods? I'll assume yes in this answer.
The simplest way to go about this is to use common datasets such as the Iris classification dataset or the Boston housing dataset. Still, if these seem "boring" to you, there are other ways to go about it.
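As a quick illustration, scikit-learn ships several of these classics, so loading one takes only a couple of lines. A minimal sketch, assuming scikit-learn is installed (note that the Boston housing dataset was removed from recent scikit-learn versions, so Iris is shown here):

```python
# Minimal sketch: load a classic benchmark dataset with scikit-learn.
# Assumes scikit-learn is installed; the Boston housing dataset was
# removed from recent scikit-learn versions, so Iris is used instead.
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```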
By browsing public dataset repositories such as Kaggle or my favourite, the UCI ML Repository (http://archive.ics.uci.edu/ml/index.php), you can find a large number of datasets - many of them list a paper that describes the data collection process.
If you take that paper and plug it into Google Scholar or ResearchGate, you will find the papers that cited it - some of which will apply ML methods and report the results achieved by the authors. Those are the results you can compare your benchmarked methods against.
I like this approach for benchmarking novel methods for a few reasons. First, it's a direct application to a realistic dataset, as opposed to data you may generate yourself. Second, researching the papers that cited the dataset lets you find out the state of the art for the problem you're observing. Last, but not least - if you achieve state-of-the-art results on the dataset, you may have a potential publication on your hands.
---
In case your question meant how to benchmark a dataset itself, there are a few things I like doing. First is checking the distribution of the variables in it: are they normally or uniformly distributed, or do most values lie in a certain part of the range? Second is checking for outliers. For non-complex datasets both are easy to do - just plot a histogram of each variable! Third is checking the correlation between the inputs and outputs. You may want to use all the data when training your models, but a large number of input variables can significantly slow down the process. This is why eliminating inputs with poor correlation to the output may help, especially if you are working on something time-sensitive.
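Here is a minimal sketch of these checks with pandas and matplotlib; the file name and the `target` column name are placeholders for your own data:

```python
# Minimal sketch: inspect variable distributions and input-output
# correlation. Assumes pandas and matplotlib are installed; the file
# name and the 'target' column name are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("your_dataset.csv")  # placeholder path

# Histograms of every numeric variable: a quick look at the
# distributions and any obvious outliers.
df.hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# Pearson correlation of each input with the output, sorted by
# absolute strength; weakly correlated inputs are candidates to drop.
corr = df.corr(numeric_only=True)["target"].drop("target")
print(corr.abs().sort_values(ascending=False))
```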
Higher-complexity datasets may require more advanced statistical analysis - but even this much will help you judge the quality.