I want to use data analysis to identify and predict the factors that affect the incidence of brain and bone cancer, but my concern is the reliability of the data available on kaggle.com. Can this data be trusted?
1) Is the description section enough to work with the data?
2) How many people have participated in the contest?
3) Is the dataset correctly formatted?
4) Who is the provider?
5) How old is the dataset?
6) Is the amount of the data enough for your work?
Try to judge the dataset based on these questions. However, if you find a published paper based on the dataset in a good journal/conference, I think you can rely on it.
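For questions 3 and 6, a quick script can give a first impression before you invest more time. The sketch below assumes the dataset comes as a CSV file with a header row. The function name summarize_csv, the list of tokens treated as missing, and the tiny sample are my own inventions for illustration and have nothing to do with any real Kaggle dataset.

```python
import csv
import io

def summarize_csv(handle, missing_tokens=("", "NA", "N/A", "null", "?")):
    """Return basic format and completeness statistics for a CSV file object."""
    reader = csv.reader(handle)
    header = next(reader)
    n_rows = 0
    ragged_rows = 0  # rows whose field count differs from the header
    missing = {name: 0 for name in header}
    for row in reader:
        n_rows += 1
        if len(row) != len(header):
            ragged_rows += 1
        # Count cells that hold one of the assumed missing-value tokens.
        for name, value in zip(header, row):
            if value.strip() in missing_tokens:
                missing[name] += 1
        # Fields absent from a short row also count as missing.
        for name in header[len(row):]:
            missing[name] += 1
    return {
        "columns": header,
        "n_rows": n_rows,
        "ragged_rows": ragged_rows,
        "missing_fraction": {
            name: (count / n_rows if n_rows else 0.0)
            for name, count in missing.items()
        },
    }

# Tiny made-up sample, only to show the call; not real patient data.
sample = io.StringIO("age,sex,diagnosis\n54,F,glioma\n61,,osteosarcoma\n47,M\n")
report = summarize_csv(sample)
print(report)
```

For a real file you would pass open(path, newline="") instead of the in-memory sample. Many ragged rows or a high missing fraction in key columns would be a reason to go back to questions 1 and 4 before trusting any result.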
Kaggle ensures that the competition setter is free to share the data, but data quality is not Kaggle's responsibility. So you need to assess the competence and intentions of the data provider.
With epidemiological data there are other questions to examine. Is the dataset relevant to the population you will apply the results to? The leading risk factors in the US and Iran may be quite different. One real and extreme version of this is a dataset in which patients with asthma had a lower incidence of death from pneumonia than those without. (The reason was that those with asthma were receiving active health care.) So even if the data have been carefully prepared, they might not be relevant outside the original context.
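To make the asthma example concrete, here is a small sketch of how such a pattern can arise. All counts are invented purely for illustration and are not taken from the real study. The two care levels ("intensive" and "standard") are my own simplifying assumption about what "active health care" means.

```python
# Invented counts, for illustration only: (deaths, patients) per group.
counts = {
    ("asthma", "intensive"): (9, 90),
    ("asthma", "standard"): (3, 10),
    ("no asthma", "intensive"): (1, 20),
    ("no asthma", "standard"): (16, 80),
}

def death_rate(groups):
    """Pooled death rate over the given (condition, care) groups."""
    deaths = sum(counts[g][0] for g in groups)
    patients = sum(counts[g][1] for g in groups)
    return deaths / patients

def overall_rate(condition):
    """Death rate for one condition, ignoring the level of care."""
    return death_rate([g for g in counts if g[0] == condition])

# Comparison that ignores care level.
for condition in ("asthma", "no asthma"):
    print(condition, "overall:", overall_rate(condition))

# Comparison within each care level.
for care in ("intensive", "standard"):
    for condition in ("asthma", "no asthma"):
        print(condition, care, ":", death_rate([(condition, care)]))
```

With these made-up numbers the asthma group looks safer overall, yet within each care level it has the higher death rate, because most asthma patients received the more intensive care. A model trained only on the overall figures would learn the wrong lesson for a setting where care is allocated differently.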