Your question is a bit unclear - do you mean datasets that you can use to benchmark your methods? I'll assume yes in this answer.
The simplest way to go about this is to use common datasets such as the Iris classification dataset or the Boston housing dataset. Still, if these seem "boring" to you, there are other ways to go about it.
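As a quick illustration, scikit-learn ships several of these classics, so loading one takes only a couple of lines. A minimal sketch, assuming scikit-learn is installed (note that the Boston housing dataset was removed from recent scikit-learn versions, so Iris is shown here):

```python
# Minimal sketch: load a classic benchmark dataset with scikit-learn.
# Assumes scikit-learn is installed; the Boston housing dataset was
# removed from recent scikit-learn versions, so Iris is used instead.
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```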
By browsing public dataset repositories such as Kaggle or my favourite, the UCI ML Repository (http://archive.ics.uci.edu/ml/index.php), you can find a large number of datasets - many of them list a paper that describes the data collection process.
If you take that paper and plug it into Google Scholar or ResearchGate, you will find the papers that cited it - some of which will apply ML methods and report the results achieved by the authors. Those are the results you can compare your benchmarked methods against.
I like this approach for benchmarking novel methods for a few reasons. First, it's a direct application to a realistic dataset, as opposed to data you may generate yourself. Second, researching the papers that cited the dataset lets you find out the state of the art for the problem you're observing. Last, but not least - if you achieve state-of-the-art results on the dataset, you may have a potential publication on your hands.
---
In case your question meant how to benchmark a dataset itself, there are a few things I like doing. First is checking the distribution of the variables in it: are they normally or uniformly distributed, or do most values lie in a certain part of the range? Second is checking for outliers. For non-complex datasets both are easy to do - just plot a histogram of each variable! Third is checking the correlation between the inputs and outputs. You may want to use all the data when training your models, but a large number of input variables can significantly slow down the process. This is why eliminating inputs with poor correlation to the output may help, especially if you are working on something time-sensitive.
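Here is a minimal sketch of these checks with pandas and matplotlib; the file name and the `target` column name are placeholders for your own data:

```python
# Minimal sketch: inspect variable distributions and input-output
# correlation. Assumes pandas and matplotlib are installed; the file
# name and the 'target' column name are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("your_dataset.csv")  # placeholder path

# Histograms of every numeric variable: a quick look at the
# distributions and any obvious outliers.
df.hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# Pearson correlation of each input with the output, sorted by
# absolute strength; weakly correlated inputs are candidates to drop.
corr = df.corr(numeric_only=True)["target"].drop("target")
print(corr.abs().sort_values(ascending=False))
```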
Higher-complexity datasets may require more advanced statistical analysis - but even this much will help you judge the quality.